INDEX

College of Santa Fe Auditory Theory

Lecture 017 Instrument IV

INSTRUCTOR CHARLES FEILDING

  1. The speaking and singing voice
  2. Sound source in singing
  3. Sound modifiers in singing
  4. Brain Bullets

4.5 The speaking and singing voice

The singing voice is probably the most versatile of all musical instruments. Anyone who can speak is capable of singing, but we are not all destined to be opera or pop stars. Whilst considerable mystique surrounds the work of some singing teachers and how they achieve their results, the acoustics of the singing voice is now established as a research topic in its own right. Issues such as the following are being considered:

Knowledge of the acoustics of the singing and speaking voice can help music technologists to develop synthesis strategies for synthetic sounds, since humans are remarkably good at vocalising the sound they desire. This section discusses the human singing voice in terms of the input/system/output model and points to some of the key differences between the speaking and singing voice. The discussion presented in this section is necessarily brief. A number of texts are available which consider the acoustics of the speaking voice (e.g. Fant, 1960; Fry, 1979; Borden and Harris, 1980; Baken, 1987; Baken and Danilof, 1991; Kent and Read, 1992; Howard, 1998; Howard and Angus, 1998), and the acoustics of the singing voice (e.g. Benade, 1976; Sundberg, 1987; Bunch, 1993; Dejonckere et al., 1995; Howard, 1999).

4.5.1 Sound source in singing

The sound source in singing is the acoustic result of the vocal folds vibrating in the larynx which is sustained by air flowing from the lungs. The sound modifiers in singing are the spaces between the larynx and the lips and nostrils, known as the 'vocal tract', which can be changed in shape and size by moving the 'articulators', for example the jaw, tongue and lips (see Figure 4.31).

Fig 4.31 A cross section of the vocal tract


As we sing or speak, the shape of the vocal tract is continually changing to produce different sounds. The soft palate acts as a valve to shut off and open the nasal cavity (nose) from the airstream.

Vocal fold vibration in a healthy larynx is a cyclic sequence in which the vocal folds close and open regularly when a note is being sung. Thus the vocal folds of a soprano singing A4 (f0 = 440.0 Hz) will complete this closing and opening sequence 440 times a second. Singers have two methods by which they can change the f0 of vocal fold vibration: altering the stiffness of the folds themselves by changing the tension of the fold muscle tissue, or altering the vibrating mass by supporting an equal portion of each fold in an immobile position. Such adjustments of the physical properties of the folds allow many trained singers to sing over a pitch range of well over two octaves.

The vocal folds vibrate as a result of the Bernoulli effect in much the same way as the lips of a brass player. A consequence of this is that the folds close more rapidly than they open. An acoustic pressure pulse is generated at each instant when the vocal folds snap together, rather like a hand clap. As these closures occur regularly during singing, the acoustic input to the vocal tract consists of a regular series of pressure pulses (see Figure 4.32), the number per second depending on the note being sung. The pressure pulses are shown as negative going in the figure since the rapid closure of the vocal folds suddenly causes the air flow from the lungs to stop, resulting in a pressure drop immediately above the vocal folds. The time between each pulse is the fundamental period. Benade (1976) notes, though, that the analogy between the lip vibration of brass players and the vocal fold vibration of speakers and singers should not be taken too far, because the vocal folds can vibrate with little influence being exerted by the presence of the vocal tract, whereas the brass player's lip vibration is very strongly influenced by the presence of the instrument's pipe.

Fig 4.32 Idealized waveform (left) and spectrum (right) of acoustic excitation due to normal vocal fold vibration.

Fig 4.33 Schematic sequence for two vocal fold vibration cycles to illustrate vocal fold vibration sequence as if viewed from the front and idealized glottal airflow waveform. Vocal fold opening, closing, open and closed phases are indicated.


Figure 4.33 shows a schematic vocal fold vibration sequence as if viewed from the front associated with an idealised airflow waveform between the vibrating vocal folds. This is referred to as 'glottal' airflow since the space between the vocal folds is known as the 'glottis'. Three key phases of the vibration cycle are usefully identified: closed phase (vocal folds together), opening phase (vocal folds parting), closing phase (vocal folds coming together). The opening and closing phases are often referred to as the 'open phase' as shown in the Figure, because this is the time during which air flows. It should also be noted that airflow is not necessarily zero during the closed phase since there are vocal fold vibration configurations for which the vocal folds do not come together over their whole length (e.g. Sundberg, 1987; Howard, 1998, 1999).

The nature of vocal fold vibration changes with voice training, whether for oratory, acting or singing. The time for which the vocal folds are in contact in each cycle, known as 'larynx closed quotient' or 'CQ', has been investigated as a possible means by which trained adult male (Howard et al., 1990) and female (Howard, 1995) singers are helped in producing a more efficient acoustic output. Experimental measurements on trained and untrained singers suggest that CQ is higher at all pitches for trained adult males, and that it tends to increase with pitch for trained adult females in a patterned manner. Howard et al. suggest that the higher CQ provides the potential for a more efficient voice output by three means: (i) the time in each cycle during which there is an acoustic path via the open vocal folds to the lungs where sound is essentially completely absorbed is reduced, (ii) longer notes can be sustained since less air is lost via the open vocal folds in each cycle, and (iii) the voice quality is less breathy since less air flows via the open vocal folds in each cycle.

The frequency spectrum of the regular pressure pulses generated by the vibrating vocal folds during speech and singing consists of all harmonics, with an average amplitude change of -12 dB per octave rise in frequency (see the illustration on the right in Figure 4.32). Thus for every doubling in frequency, equivalent to an increase of one octave, the amplitude reduces by 12 dB. The amplitudes of the first, second, fourth and eighth harmonics (which are separated by octaves) in the figure illustrate this effect.
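The -12 dB per octave slope can be checked numerically. The following is a minimal sketch, assuming the slope applies exactly (the text states it only as an average) and using the standard 20 log10 relation between amplitude ratio and decibels; the function names are illustrative and do not come from the source.

```python
import math

def harmonic_level_db(n, slope_db_per_octave=-12.0):
    """Level of the n-th harmonic relative to the first, in dB.

    Harmonic n lies log2(n) octaves above the fundamental, so an
    idealised constant slope gives slope * log2(n).
    """
    return slope_db_per_octave * math.log2(n)

def db_to_amplitude_ratio(level_db):
    """Convert a level in dB to a linear amplitude ratio (20 log10 rule)."""
    return 10.0 ** (level_db / 20.0)

# The first, second, fourth and eighth harmonics are one octave apart,
# so each should sit 12 dB below the previous one.
for n in (1, 2, 4, 8):
    level = harmonic_level_db(n)
    print(n, round(abs(level), 1), round(db_to_amplitude_ratio(level), 4))
```

Under these assumptions each octave step corresponds to roughly a quarter of the previous amplitude, which is the pattern the first, second, fourth and eighth harmonics illustrate in the figure.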

The shape of the acoustic excitation spectrum remains essentially constant while singing, although the amplitude change of -12 dB per octave is varied for artistic effect, singing style and to aid voice projection by professional singers (e.g. Sundberg, 1987). The spacing between the harmonics will change as different notes are sung, and Figure 4.37 shows three input spectra for sung notes an octave apart. Trained singers, particularly those with Western operatic voices, exhibit an effect known as 'vibrato' in which their f0 is varied at a rate of approximately 5.5-7.5 Hz with a range of between +-0.5 and +-semitones (Dejonckere et al., 1995).

Fig 4.34 Idealized vocal tract response plots for the vowels "fast" (left), "feed" (center) and "food" (right)

4.5.2 Sound modifiers in singing

The regular series of pulses from the vibrating vocal folds is modified by the acoustic properties of the vocal tract (see Figure 4.31). In acoustic terms, the vocal tract can be considered as a stopped tube (closed at the larynx, which operates as a flow-controlled reed, and open at the lips) which is approximately 17.5 cm in length for an adult male. When the vowel at the end of "announcer" is produced, the vocal tract is set to what is referred to as a neutral position, in which the articulators are relaxed and the soft palate (see Figure 4.31) is raised to cut off the nose; the vowel is termed 'non-nasalised'. The neutral vocal tract approximates quite closely to a tube of constant diameter throughout its length, and therefore the equation governing modal frequencies in a cylindrical stopped pipe can be used to find the vocal tract standing wave mode frequencies for this vowel.


Example 4.3 Calculate the first three mode frequencies of the neutral adult male vocal tract. (Take the velocity of sound in air as 344 m s^-1.)

The vocal tract length is 17.5 cm, or 0.175 m.

The fundamental or first mode is: $f_{\mathrm{stopped}(1)} = \dfrac{c}{4 L_s} = \dfrac{344}{4 \times 0.175} = 491.4\ \mathrm{Hz}$

The higher mode frequencies are: $f_{\mathrm{stopped}(n)} = (2n - 1)\, f_{\mathrm{stopped}(1)}$

where n = 1, 2, 3, 4, ...

Thus the second mode frequency (n = 2) is: $3 \times 491.4 = 1474\ \mathrm{Hz}$

and the third mode frequency (n = 3) is: $5 \times 491.4 = 2457\ \mathrm{Hz}$
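The arithmetic of Example 4.3 can be reproduced with a short script. This is a minimal sketch of the stopped-pipe formula as used in the example; the function name and its default arguments are illustrative choices, not something defined in the lecture.

```python
def stopped_pipe_modes(length_m, c=344.0, count=3):
    """Return the first `count` mode frequencies (Hz) of a stopped pipe.

    Uses f(1) = c / (4 L) and f(n) = (2n - 1) f(1), i.e. only odd
    multiples of the first mode are present.
    """
    f1 = c / (4.0 * length_m)
    return [(2 * n - 1) * f1 for n in range(1, count + 1)]

# Neutral adult male vocal tract, 17.5 cm long (value from Example 4.3).
modes = stopped_pipe_modes(0.175)
print([round(f, 1) for f in modes])  # approximately 491.4, 1474.3 and 2457.1 Hz
```

Changing `length_m` shows how the mode frequencies scale inversely with tract length; a shorter tube, for instance, raises all three values in proportion.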


Example 4.3 gives the frequencies for the neutral vowel, and these are often rounded to 500 Hz, 1500 Hz and 2500 Hz for convenience. When considering the acoustics of speech and singing, the standing wave modes are generally referred to as 'formants'. Idealised frequency response curves for a vocal tract set to produce the vowels in the words fast, feed and food are shown in Figure 4.34, and the centre frequency of each formant is labelled starting with 'F1' or 'first formant' for the peak that is lowest in frequency, continuing with 'F2' (second formant) and 'F3' (third formant) as shown in the figure. The formants are acoustic resonances of the vocal tract itself resulting from the various dimensions of the vocal tract spaces. These are modified during speech and singing by movements of the articulators. When considering the different sounds produced during speech, usually just the first, second and third formants are considered since these are the only formants whose frequencies tend to vary. Six or seven formants can often be identified in the laboratory, and the higher formants are thought to contribute to the individual identity of a speaking or singing voice. However, in singing, important contributions to the overall projection of sound are believed to be made by formants higher than the third. In order to produce different sounds, the shape of the vocal tract is altered by means of the articulators to change its acoustic properties. The perturbation theory principles explored in the context of woodwind reed instruments (see Figure 4.23) can be employed here also (Kent and Read, 1992). Figure 4.35 shows the displacement nodes and antinode positions for the first three formants of the vocal tract during a neutral non-nasalised vowel,

Fig 4.35 Displacement nodes and antinode positions for the first three modes (or formants F1, F2, F3) of the vocal tract during a neutral non nasalized vowel.


which can be confirmed with reference to the upper-right-hand part of Figure 4.23. Following the same line of reasoning as that presented in the context of Figure 4.23, the effect of constrictions (and therefore enlargements) on the first three formants of the vocal tract can be predicted as shown in Figure 4.36. For example, all formants have a volume velocity antinode at the lips, and a lip constriction therefore lowers the frequencies of all formants. (It should be noted that there are two other means of lowering all formant frequencies, both of which lengthen the vocal tract: protruding the lips or lowering the larynx.)

A commonly referenced set of average formant frequency values for men, women and children for a number of vowels, taken from Peterson and Barney (1952), is shown in Table 4.3. Formant frequency values for these vowels can be predicted with reference to their articulation. For example, the vowel in beat has a constriction towards the front of the tongue in the region of both N2 and N3 (see Figure 4.35), and reference to Figure 4.36 suggests that F1 is lowered in frequency and F2 and F3 are raised from the values one would expect for the neutral vowel. The vowel in part, on the other hand, has a significant constriction in the region of both A2 and A3 (see Figure 4.35), resulting in a raising of F1 and a lowering of both F2 and F3 from their neutral vowel values. The vowel in boot has a constriction at the lips, which are also rounded so as to extend the length of the vocal tract, and thus all formant frequencies are lowered from their neutral vowel values. These changes can be confirmed from Table 4.3.

Fig 4.36 Formant frequency modification with position of vocal tract constriction.

Table 4.3 Average formant frequencies in Hz for men, women and children for a selection of vowels


The input/system/output model for singing consists of the acoustic excitation due to vocal fold vibration (input), which is acted on by the vocal tract response (system) to give the output. These are usually considered in terms of their spectra, and both the input and system change with time during singing. Figure 4.37 shows the model for the vowel in fast sung on three different notes. This is to allow one of the main effects of singing at different pitches to be illustrated.

The input in each case is the acoustic spectrum resulting from vocal fold vibration (see Figure 4.32). The output is the result of the response of the vocal tract for the vowel in fast acting on the input vocal fold vibration. The effect of this is to multiply the amplitude of each harmonic of the input by the response of the vocal tract at that frequency. This effectively imparts the formant peaks of the vocal tract response curve onto the harmonics of the input spectrum. In this example, there are three formant peaks shown, and it can be seen that in the cases of the lower two notes, the formant structure can be readily seen in the output, but that in the case of the highest note the formant peaks cannot be identified in the output spectrum because the harmonics of the input are too far apart to represent clearly the formant structure.
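The multiplication of input harmonics by the vocal tract response can be sketched in a few lines. This is an illustration only, under several assumptions not taken from the lecture: the formant centre frequencies are the rounded neutral-vowel values (500, 1500 and 2500 Hz) rather than those for the vowel in fast, the resonance shape and 100 Hz bandwidth in `tract_gain` are hypothetical, and the three f0 values are simply convenient notes an octave apart.

```python
import math

# Assumed values for illustration: rounded neutral-vowel formants from
# Example 4.3 and a hypothetical bandwidth for each resonance peak.
FORMANTS_HZ = (500.0, 1500.0, 2500.0)
BANDWIDTH_HZ = 100.0

def source_amplitude(n):
    """Linear amplitude of harmonic n for an idealised -12 dB/octave input."""
    return 10.0 ** (-12.0 * math.log2(n) / 20.0)

def tract_gain(freq_hz):
    """Toy vocal tract response: a sum of simple resonance peaks (linear gain)."""
    half_bw = BANDWIDTH_HZ / 2.0
    return sum(
        1.0 / math.sqrt(1.0 + ((freq_hz - f) / half_bw) ** 2)
        for f in FORMANTS_HZ
    )

def output_spectrum(f0_hz, max_hz=3000.0):
    """Return (frequency, amplitude) pairs: each input harmonic times the response."""
    n_max = int(max_hz // f0_hz)
    return [
        (n * f0_hz, source_amplitude(n) * tract_gain(n * f0_hz))
        for n in range(1, n_max + 1)
    ]

# Three notes an octave apart: the higher the note, the fewer harmonics
# are available below 3 kHz to trace out the formant peaks.
for f0 in (125.0, 250.0, 500.0):
    print(f0, len(output_spectrum(f0)))
```

The response curve is sampled only at multiples of f0, so doubling f0 halves the number of points available to outline the formant peaks in the output spectrum.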

Fig 4.37 Singing voice input/system/output model idealised for the vowel in fast sung on three notes an octave apart


The representation of the formant structure in the output spectrum is important if the listener is to identify different vowels. Figure 4.37 suggests that somewhere between the G above middle C and the G an octave above, vowel identification will become increasingly difficult. This is readily tested by asking a soprano to sing different vowels on mid and top G as shown in the figure and listening to the result. In fact, when singing these higher notes, professional sopranos adopt vocal tract shapes which place the lower formants over individual harmonics of the excitation so that they are transmitted via the vocal tract with the greatest amplitude. In this way, sopranos can produce sounds of high intensity which will project well. This effect is used from approximately the C above middle C where the vocal tract is, in effect, being 'tuned-in' to each individual note sung, but at the expense of vowel clarity.
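The idea of placing a formant over an individual harmonic can be illustrated by finding the harmonic nearest to a given formant frequency. This is a rough sketch under stated assumptions: the first formant is taken as the rounded neutral-vowel value of 500 Hz, and 196 Hz and 784 Hz are approximate equal-tempered frequencies for the G below middle C and the G an octave above the G above middle C; none of these numbers, nor the function name, comes from the figure itself.

```python
def nearest_harmonic(f0_hz, formant_hz):
    """Return (harmonic number, harmonic frequency) closest to formant_hz."""
    n = max(1, round(formant_hz / f0_hz))
    return n, n * f0_hz

F1_HZ = 500.0  # assumed: rounded neutral-vowel first formant

for f0 in (196.0, 784.0):
    n, freq = nearest_harmonic(f0, F1_HZ)
    # The offset is how far the formant would have to move to sit on a harmonic.
    print(f0, n, freq, abs(freq - F1_HZ))
```

With the lower note a harmonic already lies within about 90 Hz of the assumed formant, whereas for the higher note the nearest harmonic is the fundamental itself, nearly 300 Hz away, which is consistent with the need to retune the vocal tract at high pitches.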

This tuning-in effect is not something that tenors need to do since the ratio between the formant frequencies and the f0 of the tenor's range is higher than that for sopranos. However, all singers who do not use amplification need to project above accompaniment, particularly when this is a full orchestra and the performance is in a large auditorium. The way in which professional opera singers achieve this can be seen with reference to Figure 4.38

Fig 4.38 Idealised spectra for a singer speaking the text of an opera aria (a), the orchestra playing the accompaniment to the aria (b), and the aria being sung with orchestral accompaniment (c)


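which shows idealised spectra for the following: (a) a singer speaking the text of an opera aria, (b) the orchestra playing the accompaniment to the aria, and (c) the aria being sung with orchestral accompaniment.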

It should be noted that the amplitude levels cannot be directly compared between (A) and (B) in the Figure (i.e. the singer does not speak as loudly as the orchestral accompaniment!) since they have been normalised for comparison.

The idealised spectrum for the text read alone has the same general shape as that for the orchestra playing alone. When the professional singer sings the aria with orchestra accompaniment, it can be seen that this combined response curve has a shape similar to both the speech and orchestral accompaniment at low frequencies, but with an additional broad peak between approximately 2.5 kHz and 4 kHz and centred at about 3 kHz. This peak relates to the acoustic output from the singer when singing but not when speaking, since it is absent for the read text and also in the orchestral accompaniment alone. This peak has similar characteristics to the formants in the vocal tract response, and for this reason it is known as the 'singer's formant'. The presence of energy in this peak enables the singer to be heard above an accompanying orchestra because it is a section of the frequency spectrum in which the singer's output prevails. This is what gives the professional singing voice its characteristic 'ring', and it is believed to be the result of lowering the larynx and widening the pharynx (see Figure 4.31) which is adopted by trained Western operatic singers. (The lower plot in Figure 5.5 is an analysis of a CD recording of a professional tenor whose singer's formant is very much in evidence.)

Singing teachers set out to achieve these effects from pupils by suggesting that pupils 'sing on the point of the yawn', or 'sing as if they have swallowed an apple which has stuck in their throat'. Sundberg (1987) discusses the articulatory origin of the singer's formant as follows: '... it shows a strong dependence on the larynx tube ...', concluding that: '... it is necessary, however, that the pharynx tube be lengthened and that the cross-sectional area in the pharynx at the level of the larynx tube opening be more than six times the area of that opening'.

Professional singing is a complex task which extends the action of the instrument used for speech. It is salutary to note that the prime function of the vocal folds is to act as a valve to protect the lungs, and not to provide the sound source basic to human communication by means of speech and song.

 

You Need to Know

Sound source in singing

The sound source in singing is the acoustic result of the vocal folds vibrating in the larynx which is sustained by air flowing from the lungs. The sound modifiers in singing are the spaces between the larynx and the lips and nostrils, known as the 'vocal tract', which can be changed in shape and size by moving the 'articulators', for example the jaw, tongue and lips.

As we sing or speak, the shape of the vocal tract is continually changing to produce different sounds. The soft palate acts as a valve to shut off and open the nasal cavity (nose) from the airstream.

Vocal fold vibration in a healthy larynx is a cyclic sequence in which the vocal folds close and open regularly when a note is being sung. Thus the vocal folds of a soprano singing A4 (f0 = 440.0 Hz) will complete this closing and opening sequence 440 times a second. Singers have two methods by which they can change the f0 of vocal fold vibration: altering the stiffness of the folds themselves by changing the tension of the fold muscle tissue, or altering the vibrating mass by supporting an equal portion of each fold in an immobile position. Such adjustments of the physical properties of the folds allow many trained singers to sing over a pitch range of well over two octaves.

 

The frequency spectrum of the regular pressure pulses generated by the vibrating vocal folds during speech and singing consists of all harmonics, with an average amplitude change of -12 dB per octave rise in frequency (see the illustration on the right in Figure 4.32). Thus for every doubling in frequency, equivalent to an increase of one octave, the amplitude reduces by 12 dB. The amplitudes of the first, second, fourth and eighth harmonics (which are separated by octaves) in the figure illustrate this effect.

 

The shape of the acoustic excitation spectrum remains essentially constant while singing, although the amplitude change of -12 dB per octave is varied for artistic effect, singing style and to aid voice projection by professional singers (e.g. Sundberg, 1987). The spacing between the harmonics will change as different notes are sung, and Figure 4.37 shows three input spectra for sung notes an octave apart. Trained singers, particularly those with Western operatic voices, exhibit an effect known as 'vibrato' in which their f0 is varied at a rate of approximately 5.5-7.5 Hz with a range of between +-0.5 and +-semitones

 

Sound modifiers in singing

The regular series of pulses from the vibrating vocal folds is modified by the acoustic properties of the vocal tract (see Figure 4.31). In acoustic terms, the vocal tract can be considered as a stopped tube (closed at the larynx, which operates as a flow-controlled reed, and open at the lips) which is approximately 17.5 cm in length for an adult male. When the vowel at the end of "announcer" is produced, the vocal tract is set to what is referred to as a neutral position, in which the articulators are relaxed and the soft palate (see Figure 4.31) is raised to cut off the nose; the vowel is termed 'non-nasalised'. The neutral vocal tract approximates quite closely to a tube of constant diameter throughout its length, and therefore the equation governing modal frequencies in a cylindrical stopped pipe can be used to find the vocal tract standing wave mode frequencies for this vowel.

The first three mode frequencies for the neutral vowel are often rounded to 500 Hz, 1500 Hz and 2500 Hz for convenience. When considering the acoustics of speech and singing, the standing wave modes are generally referred to as 'formants'. Idealised frequency response curves for a vocal tract set to produce the vowels in the words fast, feed and food are shown in Figure 4.34, and the centre frequency of each formant is labelled starting with 'F1' or 'first formant' for the peak that is lowest in frequency, continuing with 'F2' (second formant) and 'F3' (third formant) as shown in the figure. The formants are acoustic resonances of the vocal tract itself resulting from the various dimensions of the vocal tract spaces. These are modified during speech and singing by movements of the articulators. When considering the different sounds produced during speech, usually just the first, second and third formants are considered since these are the only formants whose frequencies tend to vary. Six or seven formants can often be identified in the laboratory, and the higher formants are thought to contribute to the individual identity of a speaking or singing voice. However, in singing, important contributions to the overall projection of sound are believed to be made by formants higher than the third. In order to produce different sounds, the shape of the vocal tract is altered by means of the articulators to change its acoustic properties. The perturbation theory principles explored in the context of woodwind reed instruments (see Figure 4.23) can be employed here also (Kent and Read, 1992). Figure 4.35 shows the displacement nodes and antinode positions for the first three formants of the vocal tract during a neutral non-nasalised vowel.

 

The representation of the formant structure in the output spectrum is important if the listener is to identify different vowels. Figure 4.37 suggests that somewhere between the G above middle C and the G an octave above, vowel identification will become increasingly difficult. This is readily tested by asking a soprano to sing different vowels on mid and top G as shown in the figure and listening to the result. In fact, when singing these higher notes, professional sopranos adopt vocal tract shapes which place the lower formants over individual harmonics of the excitation so that they are transmitted via the vocal tract with the greatest amplitude. In this way, sopranos can produce sounds of high intensity which will project well. This effect is used from approximately the C above middle C where the vocal tract is, in effect, being 'tuned-in' to each individual note sung, but at the expense of vowel clarity.

 

This tuning-in effect is not something that tenors need to do since the ratio between the formant frequencies and the f0 of the tenor's range is higher than that for sopranos. However, all singers who do not use amplification need to project above accompaniment, particularly when this is a full orchestra and the performance is in a large auditorium.