INDEX

College of Santa Fe Auditory Theory

Lecture 019 Timbre II

INSTRUCTOR CHARLES FEILDING

  1. Deceiving the ear
  2. Perception of pure tones
  3. Masking of one sound by another
  4. Note grouping illusions
  5. Pitch illusions
  6. Brain Bullets

5.5 Deceiving the ear

This section concerns sounds which could, in some sense, be said to 'deceive' the ear. Such sounds have a psychoacoustic realisation which is not what might be expected from knowledge of their acoustic components. In other words, the subjective and objective realisations of sounds cannot always be matched up directly. Whilst some of the examples given may be of no obvious musical use to the performer or composer, they may in the future find musical application in electronically produced music, for particular musical effects where control over the acoustic components of the output sound is exact.

5.5.1 Perception of pure tones

When two pure tones are played simultaneously, they are not always perceived as two separate pure tones. The discussion relating to Figure 2.7, introducing critical bandwidth in Chapter 2, provides a first example of sounds which in some sense deceive the ear. Two pure tones are only perceived as separate pure tones when their frequency difference is greater than the critical bandwidth. Otherwise they are perceived as a single fused tone which is 'rough', or as beats, depending on the frequency difference between the two pure tones. When two pure tones are heard together, other tones with frequencies lower than the frequencies of either of the two pure tones themselves may also be heard. These lower tones are not acoustically present in the stimulating signal. They occur as a result of the stimulus consisting of a 'combination' of at least two pure tones, and they are known as 'combination tones'. The frequency of one such combination tone, which is usually quite easily perceived, is the difference (higher minus lower) between the frequencies of the two tones, and this is known as the 'difference tone':

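f_d = f_h - f_l    (5.2)

where
f_d = frequency of the difference tone
f_h = frequency of the higher pure tone
f_l = frequency of the lower pure tone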

Notice that this is the beat frequency when the frequency difference is less than approximately 12.5 Hz (see Chapter 2). The frequencies of other possible combination tones that can result from two pure tones sounding simultaneously can be calculated as follows:


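f(n) = f_l - [n(f_h - f_l)] = f_l - [n x f_d]    (5.3)

where
f(n) = frequency of the nth combination tone
f_l, f_h = frequencies of the lower and higher pure tones
f_d = frequency of the difference tone
n = 1, 2, 3, 4, ...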

These tones are always below the frequency of the lower pure tone, and occur at integer multiples of the difference tone frequency below the lower tone. No listener hears all of these combination tones, and some hear none. The difference tone and the combination tones for n = 1 and n = 2, known as the 'second-order difference tone' and the 'third-order difference tone', are those that are perceived most readily (e.g. Rasch and Plomp, 1982).


Example 5.2 Calculate the difference tone and first four combination tones which occur when pure tones of 1200 Hz and 1100 Hz sound simultaneously.

Equation 5.2 gives the difference tone frequency =fh - fl = 1200 - 1100 = 100 Hz

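Equation 5.3 gives the other combination tone frequencies, and the first four are for n = 1, 2, 3 and 4:

for n = 1: 1100 - (1 x 100) = 1000 Hz

for n = 2: 1100 - (2 x 100) = 900 Hz

for n = 3: 1100 - (3 x 100) = 800 Hz

for n = 4: 1100 - (4 x 100) = 700 Hz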
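
The arithmetic of Example 5.2 can be sketched in a few lines of Python. This is only an illustration of Equations 5.2 and 5.3; the function names are hypothetical, and the sketch says nothing about which of the tones a given listener will actually perceive.

```python
def difference_tone(f_high, f_low):
    """Equation 5.2: difference tone frequency in Hz."""
    return f_high - f_low


def combination_tones(f_high, f_low, count):
    """Equation 5.3: the first `count` combination tones below the lower tone."""
    f_d = difference_tone(f_high, f_low)
    return [f_low - n * f_d for n in range(1, count + 1)]


# Example 5.2: pure tones of 1200 Hz and 1100 Hz
print(difference_tone(1200, 1100))       # 100
print(combination_tones(1200, 1100, 4))  # [1000, 900, 800, 700]
```

Because the two tones here are the 11th and 12th harmonics of 100 Hz, every value returned is itself a whole multiple of 100 Hz.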


When the two pure tone frequencies are themselves adjacent harmonics of some f0 (in Example 5.2 the tones are the 11th and 12th harmonics of 100 Hz), then the difference tone is equal to f0 and the other combination tones form 'missing' members of the harmonic series. When the two tones are not members of a harmonic series, the combination tones have no equivalent f0, but they will be equally spaced in frequency. Combination tones are perceived quite easily when two musical instruments which produce fairly pure tone outputs, such as the descant recorder, baroque flute or piccolo, play notes whose f0 values are high and close in frequency. When the two notes played are themselves both exact and adjacent members of the harmonic series formed on their difference tone, the combination tones will be consecutive members of the harmonic series adjacent to and below the lower played note (i.e. the f0 values of both notes and their combination tones would be exact integer multiples of the difference frequency between the notes themselves). The musical relationship of combination tones to notes played therefore depends on the tuning system in use. Two notes played using a tuning system in which the interval between the notes is never pure, such as the equal-tempered system, will produce combination tones which are close to, but not exact, harmonics of the series formed on the difference tone.


Example 5.3

If two descant recorders are playing the notes A5 and B5 simultaneously in equal-tempered tuning, which notes on the equal-tempered scale are closest to the most readily perceived combination tones?

The most readily perceived combination tones are the difference tone and the combination tones for n = 1 and n = 2 in Equation 5.3. Equal-tempered f0 values for notes are given in Figure 3.21. Thus A5 has an f0 of 880.0 Hz and B5 has an f0 of 987.8 Hz.

The difference tone frequency = 987.8 - 880.0 = 107.8 Hz; closest note is A2 (f0=110.0 Hz).

The combination tones are:

for n = 1: 880.0 - 107.8 = 772.2 Hz; closest note is G5 (f0=784.0 Hz)

for n = 2: 880.0 - 215.6 = 664.4 Hz; closest note is E5 (f0=659.3 Hz)
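
The 'closest note' lookup in Example 5.3 can be sketched as follows. This is a minimal illustration and not the method of Figure 3.21: it assumes A4 = 440 Hz (consistent with A5 = 880.0 Hz above), uses MIDI note numbering as a convenient index, and measures closeness on a logarithmic (semitone) scale. The function name is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_equal_tempered_note(frequency_hz, a4_hz=440.0):
    """Return (name, f0) of the equal-tempered note nearest to frequency_hz."""
    # Distance from A4 in semitones, rounded to the nearest whole semitone.
    midi = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    note_hz = a4_hz * 2 ** ((midi - 69) / 12)
    return name, note_hz


# Difference tone and the n = 1 and n = 2 combination tones from Example 5.3
for f in (107.8, 772.2, 664.4):
    name, note_hz = nearest_equal_tempered_note(f)
    print(f"{f} Hz -> {name} ({note_hz:.1f} Hz)")
```

For the three frequencies in the example this gives A2, G5 and E5, in agreement with the worked answer.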


These combination tones would beat with the f0 component of any other instruments in an ensemble playing a note close to a combination tone. This will not be as marked as it might appear at first, due to an effect known as 'masking', which is described in the next section.

5.5.2 Masking of one sound by another

Asymmetry of masking by pulsed tones (Audio Demos)

Backward forward masking (Audio Demos)

Pulsation Threshold (Audio Demos)

When we listen to music, it is very rare that it consists of just a single pure tone. Whilst it is possible and relatively simple to arrange to listen to a pure tone of a particular frequency in a laboratory or by means of an electronic synthesiser (a useful, important and valuable experience) such a sound would not sustain any prolonged musical interest. Almost every sound we hear in music consists of at least two frequency components. When two or more pure tones are heard together an effect known as 'masking' can occur, where each individual tone can become more difficult or impossible to perceive, or it is partially or completely 'masked', due to the presence of another tone. In such a case the tone which causes the masking is known as the 'masker' and the tone which is masked is known as the 'maskee'. These tones could be individual pure tones, but given the rarity of such sounds in music, they are more likely to be individual frequency components of a note played on one instrument which either mask other components in that note, or frequency components of another note. The extent to which masking occurs depends on the frequencies of the masker and maskee and their amplitudes.

As is the case with most psychoacoustic investigations, masking is usually discussed in terms of the masking effect one pure tone can have on another, and the result is extended to complex sounds by considering the masking effect in relation to individual components. (This is similar, for example, to the approach adopted in the section on consonance and dissonance in Chapter 3, Section 3.3.2.) In psychoacoustic terms, the threshold of hearing of the maskee is shifted when in the presence of the masker, which gives the basis on which masking can be measured as the shift of a listener's threshold hearing curve caused by the presence of the masker.

The dependence of masking on the frequencies of masker and maskee can be illustrated by reference to Figure 2.9, in which an idealised frequency response curve for an auditory filter is plotted. The filter will respond to components in the input acoustic signal which fall within its response curve, whose bandwidth is given by the critical bandwidth for the filter's centre frequency. Due to the asymmetry of the response curve, the filter will respond to components in the input whose frequencies are lower than its centre frequency to a greater degree than to components which are higher in frequency than the centre frequency. Masking can be thought of as the filter's effectiveness in analysing a component at its centre frequency (maskee) being reduced to some degree by the presence of another component (masker) whose frequency falls within the filter's response curve. The degree to which the filter's effectiveness is reduced is usually measured as a shift in hearing threshold, or 'masking level', as illustrated in Figure 5.7. The figure shows that the asymmetry of the response curve results in the masking effect being considerably greater for maskees which are above the frequency of the masker than for those below it. This effect is often referred to as 'low masks high' or the 'upward spread of masking'.

Fig 5.7 Idealised masking level to illustrate the 'low masks high' or 'upward spread of masking' effect for a masker of frequency fmasker Hz

The dependence of masking on the amplitudes of masker and maskee is illustrated in Figure 5.8, in which idealised masking level curves are plotted for different amplitude levels of a masker of frequency fmasker. At low amplitude levels, the masking effect tends to be similar for frequencies above and below fmasker. As the amplitude of the masker is raised, the 'low masks high' effect increases and the resulting masking level curve becomes increasingly asymmetric. Thus the masking effect is highly dependent on the amplitude of the masker. This effect is illustrated in Figure 5.9, which is taken from Sundberg (1991). The frequency scale in this figure is plotted such that each critical bandwidth occupies the same distance. Sundberg summarises this figure in terms of a three straight-line approximation to the threshold of hearing in the presence of the masker, or 'masked threshold', as follows:

Fig 5.8 Idealised change in masking level with different levels of masker of frequency fmasker Hz

Fig 5.9 Idealised masked thresholds for masker pure tones at 300 Hz, 350 Hz and 400 Hz at 50 dB SPL, 70 dB SPL and 90 dB SPL respectively, plotted on a critical band spaced frequency scale


The masking effect of individual components in musical sounds which are complex with many spectral components can be determined in terms of the masking effect of individual components on other components in the sound. If a component is completely masked by another component in the sound, the masked component makes no contribution to the perceived nature of the sound itself and is therefore effectively ignored. If the masker is broadband noise, or 'white noise', then components at all frequencies are masked in an essentially linear fashion (i.e. a 10 dB increase in the level of the noise increases the masking effect by 10 dB at all frequencies). This can be the case, for example, with background noise or a brushed snare drum (see Figure 3.6) which have spectral energy generally spread over a wide frequency range and this can mask components of other sounds that fall within that frequency range.

The masking effects considered so far are known as 'simultaneous masking' because the masking effect on the maskee by the masker occurs when both sound together (or simultaneously). Two further masking effects are important for the perception of music where the masker and maskee are not sounding together, and these are referred to as 'non-simultaneous masking'. These are 'forward masking' or 'post-masking', and 'backward masking' or 'pre-masking'. In forward masking, a pure tone masker can mask another tone (maskee) which starts after the masker itself has ceased to sound. In other words the masking effect is 'forward' in time from the masker to the maskee. Forward masking can occur for time intervals between the end of the masker and the start of the maskee of up to approximately 30 ms. In backward masking a maskee can be masked by a masker which follows it in time, starting up to approximately 10 ms after the maskee itself has ended. It should be noted, however, that considerable variation exists between listeners in terms of the time intervals over which forward and backward masking take place.

Fig 5.10 Idealised illustration of simultaneous and non-simultaneous masking
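
The timing relationships described above can be sketched as a small classifier. This is a rough illustration only: the 30 ms and 10 ms windows are the approximate figures quoted in the text and vary considerably between listeners, the function name and return labels are hypothetical, and the result indicates only whether masking is possible on timing grounds (the levels and frequencies of the two sounds are ignored).

```python
def masking_type(masker_start, masker_end, maskee_start, maskee_end,
                 forward_window=0.030, backward_window=0.010):
    """Classify the possible masking relationship; all times in seconds."""
    # Any overlap in time means the two sounds are heard together.
    if maskee_start < masker_end and maskee_end > masker_start:
        return "simultaneous"
    if maskee_start >= masker_end:
        # Maskee begins after the masker has stopped (post-masking).
        gap = maskee_start - masker_end
        return "forward" if gap <= forward_window else "none"
    # Otherwise the maskee ended before the masker began (pre-masking).
    gap = masker_start - maskee_end
    return "backward" if gap <= backward_window else "none"


print(masking_type(0.0, 1.0, 0.5, 0.6))    # simultaneous
print(masking_type(0.0, 1.0, 1.02, 1.10))  # forward
print(masking_type(0.5, 1.0, 0.30, 0.495)) # backward
print(masking_type(0.0, 1.0, 1.05, 1.10))  # none
```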

Simultaneous and non-simultaneous masking are summarised in an idealised graphical format in Figure 5.10, which gives an indication of the masking effect in the time domain. The instant at which the masker starts and stops is indicated at the bottom of the figure, and it is assumed that the simultaneous masking effect is such that the threshold is raised by 50 dB. The potential spreading in time of masking as non-simultaneous pre- and post-masking effects is also shown. Moore (1996) makes the following observations about non-simultaneous masking:

Masking is exploited practically in digital systems that store and transmit digital audio in order to reduce the amount of information that has to be handled, and therefore reduce the transmission resource, or bandwidth, and the memory, disk or other storage medium required. Such systems are generally referred to as perceptual coders because they exploit knowledge of human perception. For example, perceptual coding is the operating basis of the MP3 system that is used to transmit music over the Internet, MP3 players that store many hours of such music in a pocket-sized device, multi-channel sound in digital audio broadcasting and satellite television systems, MiniDisc recorders (Maes, 1996), and the now obsolete digital compact cassette (DCC).

There are international standards that define perceptual coding schemes for the encoding (recording) and decoding (playback) parts of these systems, which enable different manufacturers to produce equipment. The Moving Picture Experts Group (MPEG) was set up in 1988; its task was then, and still is, to develop international standards for the coding of moving pictures and associated audio. Its work has resulted in standards such as MPEG-1, MPEG-2 and MPEG-4, each of which includes three layers: 1, 2 and 3. MP3 itself is based on MPEG-1, layer III (not MPEG-3 as this does not exist!). The basic principles of perceptual coding schemes for audio are outlined below and more details can be found in Watkinson (1994, 1999), Gilchrist and Grewin (1996) and Rumsey (1996).

Figure 5.11 shows the basic block structure of all audio perceptual coders. The input signal is first split into a number of frequency bands (box 1), generally by means of a bank of bandpass filters. These bands are sometimes referred to as sub-bands, giving some coders the often used name 'sub-band coders'. The extent to which this process matches the critical band analysis of the human peripheral hearing system (see Chapter 2) depends on the complexity of the particular coding scheme. The energy in each of these sub-bands is used, with reference to the original signal, to calculate the simultaneous (and in some cases also the non-simultaneous) masking effects for that instant of input signal (box 2). Those elements of the sub-bands that the system decides would not be masked are then digitally coded (box 3) for transmission and/or storage. At the receiving end there is a decoder which reverses this process to reproduce the audio material. The result is not, of course, an exact copy of the original input, since masking predictions have been employed to remove material that the listener would not have heard in that context.

Fig 5.11 Basic schematic for a perceptual coding system
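
The decision made in boxes 2 and 3 can be illustrated with a deliberately simplified sketch. It is not taken from any MPEG standard: the band levels, the 10 dB offset and the two slopes are invented values chosen only to show the 'low masks high' asymmetry, and the function names are hypothetical. Each sub-band is treated as a potential masker of every other sub-band, and only bands whose level exceeds the resulting masked threshold are kept for coding.

```python
def masked_thresholds(band_levels_db, offset_db=10.0,
                      upward_slope_db=10.0, downward_slope_db=25.0):
    """Toy masked threshold (dB) in each band due to all the other bands."""
    thresholds = []
    for i in range(len(band_levels_db)):
        threshold = float("-inf")
        for j, masker_db in enumerate(band_levels_db):
            if i == j:
                continue
            distance = abs(i - j)
            # Masking falls away more slowly above the masker than below it.
            slope = upward_slope_db if i > j else downward_slope_db
            threshold = max(threshold, masker_db - offset_db - slope * distance)
        thresholds.append(threshold)
    return thresholds


def bands_to_code(band_levels_db):
    """Indices of sub-bands predicted to be audible, and therefore coded."""
    thresholds = masked_thresholds(band_levels_db)
    return [i for i, (level, threshold)
            in enumerate(zip(band_levels_db, thresholds))
            if level > threshold]


# Assumed sub-band levels in dB, ordered from low to high frequency
levels = [20.0, 80.0, 55.0, 30.0, 75.0]
print(bands_to_code(levels))  # [1, 4]
```

In this made-up case the strong band 1 masks the weaker bands on either side of it, so only bands 1 and 4 would be passed on for coding. Real coders use far more detailed psychoacoustic models.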

 

Demonstrations of masking effects are available on the CD recording of Houtsma et al. (1987).

5.5.3 Note grouping illusions

There are some situations when the perceived sound is unexpected, as a result of either what amounts to an acoustic illusion or the way in which the human hearing system analyses sounds. Whilst some of these sounds will not be found in traditional musical performances using acoustic instruments since they can only be generated electronically, some of the effects have a bearing on how music is performed. The nature of the illusion and its relationship with the acoustic input which produced it can give rise to new theories of how sound is perceived, and in some cases, the effect might have already or could in the future be used in the performance of music.

Diana Deutsch describes a number of note grouping acoustic illusions, some of which are summarised below with an indication of their manifestation in music perception and/or performance. Deutsch (1974) describes an 'octave illusion' in which a sequence of two tones an octave apart, with high (800 Hz) and low (400 Hz) f0 values, is alternated between the ears as illustrated in the upper part of Figure 5.12. Most listeners report hearing a high tone in the right ear alternating with a low tone in the left ear as illustrated in the figure, no matter which way round the headphones are placed. She further notes that right-handed listeners tend to report hearing the high tone in the right ear alternating with a low tone in the left ear, whilst left-handed listeners tend to hear a high tone alternating with a low tone but are equally likely to hear the high tone in the left or the right ear. This illusion persists when the stimuli are played over loudspeakers. This stimulus is available on the CD recording of Houtsma et al. (1987).

In a further experiment, Deutsch (1975) played an ascending and a descending C major scale simultaneously, with alternate notes being switched between the two ears as shown in the lower part of Figure 5.12. The most commonly perceived response is also shown in the figure. Once again the high notes tend to be heard in the right ear and the low notes in the left ear, resulting in a snippet of a C major scale being heard in each ear. Such effects are known as 'grouping' or 'streaming', and by way of explanation, Deutsch invokes some of the grouping principles of the 'Gestalt school' of psychology known as 'good continuation', 'proximity' and 'similarity'. She describes these (Deutsch, 1982) as follows:

Fig 5.12 A schematic representation of the stimulus for and most common response to the 'octave illusion' (upper) and the 'scale illusion' (lower) described by Diana Deutsch


In each case the 'elements' referred to are the individual notes in these stimuli. Applying these principles to the stimuli shown in the figure, Deutsch suggests that the principle of proximity is important, grouping the higher tones (and lower tones) together, rather than good continuation, which would suggest that complete ascending and/or descending scales of C major would be perceived. Deutsch (1982) describes other experiments which support this view. Music in which grouping of notes by frequency proximity produces the sensation of a number of parts being played, even though only a single line of music is being performed, includes works for solo instruments such as the Partitas and Sonatas for solo violin by J.S. Bach. An example of this effect is shown in Figure 5.13 from the Preludio from Partita number III in E major for solo violin by J.S. Bach. The score (upper stave) and the three parts usually perceived (lower staves) are shown, where the perceived parts are grouped by frequency proximity.
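
Grouping by frequency proximity can be sketched as a simple greedy procedure. This is a toy model rather than a model of hearing: notes are given as MIDI note numbers, each note joins the existing stream whose most recent note is nearest in pitch, and a new stream is started when no stream lies within `max_jump` semitones. The function name, the four-semitone limit and the example melody are all assumptions made for illustration; the melody is not the Bach passage.

```python
def group_by_pitch_proximity(midi_notes, max_jump=4):
    """Split a single line of notes into streams of neighbouring pitches."""
    streams = []
    for note in midi_notes:
        # Streams whose last note is close enough in pitch to accept this one.
        candidates = [s for s in streams if abs(s[-1] - note) <= max_jump]
        if candidates:
            nearest = min(candidates, key=lambda s: abs(s[-1] - note))
            nearest.append(note)
        else:
            streams.append([note])
    return streams


# Hypothetical single line alternating a falling melody with a repeated low note
melody = [76, 64, 75, 64, 73, 64, 71, 64]
print(group_by_pitch_proximity(melody))
# [[76, 75, 73, 71], [64, 64, 64, 64]]
```

The single line separates into a falling upper part and a repeated lower note, which is the kind of result described for the solo violin writing.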

Fig 5.13 Bars 45 to 50 of the Preludio from Partita III in E major for solo violin by J.S. Bach showing the notes scored for violin (upper stave) and the three parts normally perceived by streaming (lower three staves)

The rather extraordinary string part writing in the final movement of Tchaikovsky's 6th symphony in the passage shown in Figure 5.14 is also often noted in this context because it is generally perceived as the four-part passage shown. This can again be explained by the principle of grouping by frequency proximity. The effect can be considered in terms of stereo listening if the strings are heard in a contemporaneous orchestral positioning in the following order (as viewed from left to right): first violins, double basses, cellos, violas, second violins. This is as opposed to the more common arrangement today (as viewed from left to right): first violins, second violins, violas, double basses and cellos.

Fig 5.14 Snippet of the final movement of Tchaikovsky's sixth symphony showing the notes scored for the strings and the four parts normally perceived.

Other illusions can be produced which are based on timbral proximity streaming. Pierce (1992) describes an experiment 'described in 1978 by David L. Wessel' and illustrated in Figure 5.15. In this experiment the rising arpeggio shown as response (A) is perceived as expected for stimulus (A) when all the note timbres are the same. However, the response changes to two separate falling arpeggios, shown as response (B), if note timbres are alternated between two timbres represented by the different notehead shapes and 'the difference in timbres is increased' as shown for stimulus (B). This is described as timbral streaming (e.g. Bregman, 1990).

Fig 5.15 Stimulus and usually perceived responses for timbral streaming experiment of Wessel. Different timbres in (B) are represented by open square and filled diamond

A variation on this effect is shown in Figure 5.16, in which the pattern of notes shown is produced with four different timbres represented by the different notehead shapes. (This forms the basis of one of our laboratory exercises for music technology students.) The score is repeated indefinitely and the speed can be varied. Ascending or descending scales are perceived depending on the speed at which this sequence is played. For slow speeds (less than one note per second) an ascending sequence of scales is perceived (stave B in the figure); the streaming is based on 'note order'. When the speed is increased, for example to greater than ten notes per second, a descending sequence of scales of different timbres is perceived (staves C-F in the figure); the streaming is based on timbre. The ear can switch from one descending stream to another between those shown in staves C-F in the figure by concentrating on another timbre in the stimulus.

Fig 5.16 Stimulus (stave A) used in timbre and note order streaming experiment in which note heads represent different timbres. At slow speeds note order streaming is perceived (stave B) and at higher speeds timbre streaming is perceived.

Fig 5.17 Traditional performer and audience layout in a concert situation showing treble/bass bias in the ears of performers and listeners

The finding that the majority of listeners to the stimuli shown in Figure 5.12 hear the high notes in the right ear and the low notes in the left ear may have some bearing on the natural layout of groups of performing musicians. For example, a string quartet will usually play with the cellist sitting on the left of the viola player who is sitting on the left of the second violinist who in turn is sitting on the left of the first violinist as illustrated in Figure 5.17. This means that each player has the instruments playing parts lower than their own on their left-hand side, and those instruments playing higher parts on their right-hand side. Vocal groups tend to organise themselves such that the sopranos are on the right of the altos, and the tenors are on the right of the basses if they are in two or more rows. Small vocal groups such as a quartet consisting of a soprano, alto, tenor and bass will tend to be in a line with the bass on the left and the soprano on the right. In orchestras, the treble instruments tend to be placed with the highest pitched instruments within their section (e.g. first violin, piccolo, trumpet etc.) on the left and bass instruments on the right. Such layouts have become traditional and moving players or singers around such that they are not in this physical position with respect to other instruments or singers is not welcomed. This tradition of musical performance layout may well be in part due to a right-ear preference for the higher notes.

However, whilst this may work well for the performers, it is back-to-front for the audience. When an audience faces a stage to watch a live performance (see Figure 5.17), the instruments or singers producing the treble notes are on the left and the bass instruments or singers are on the right. This is the wrong way round in terms of the right-ear treble preference, but the correct way round for observing the performers themselves. It is interesting to compare the normal concert hall layout as a listener with the experience of sitting in the audience area behind the orchestra which is possible in halls such as the Royal Festival Hall in London. Unfortunately this is not a test that can be carried out very satisfactorily since it is not usually possible to sit far enough behind the players to gain as good an overall balance as can be obtained from the auditorium in front of the orchestra. It is, however, possible to experience this effect by turning round when listening to a good stereo recording over loudspeakers.

5.5.4 Pitch Illusions

A pitch illusion, which has been compared with the continuous staircase pictures of Maurits Escher, has been demonstrated by Shepard (1964) and is often referred to as a 'Shepard tone'. This illusion produces the sensation of an endless scale which ascends in semitone steps. After 12 semitone steps, when the pitch has risen by one octave, the spectrum is identical to the starting spectrum, so the scale ascends but never climbs acoustically more than one octave. This stimulus is available on the CD recording of Houtsma et al. (1987). Figure 5.18 illustrates the spectral nature of the Shepard tone stimuli. Only the fundamental and harmonics which are a whole number of octaves above the fundamental are employed in the stimuli. The component frequencies of the Shepard tone can be represented as:

f_Shepard = 2^n x f0    (5.4)

where
f_Shepard = frequencies of the Shepard tone components
f0 = frequency of the fundamental
n = 0, 1, 2, 3, ...
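
Equation 5.4 and the octave wrap-around can be sketched as follows. This is an illustration under stated assumptions rather than Shepard's original stimulus: the lowest frequency (27.5 Hz), the number of octaves (8) and the raised-cosine amplitude envelope on a logarithmic frequency axis are all assumed here, the envelope standing in for the curved envelope of Figure 5.18. The function name is hypothetical.

```python
import math


def shepard_components(step, f_min=27.5, octaves=8):
    """Return (frequency_hz, amplitude) pairs for one step of the scale."""
    # The fundamental rises by one semitone per step and wraps every octave.
    f0 = f_min * 2 ** ((step % 12) / 12)
    components = []
    for n in range(octaves):
        frequency = f0 * 2 ** n  # Equation 5.4: octave-spaced components
        # Position of this component across the envelope, from 0 to 1.
        position = math.log2(frequency / f_min) / octaves
        amplitude = 0.5 * (1 - math.cos(2 * math.pi * position))
        components.append((frequency, amplitude))
    return components


start = shepard_components(0)
after_octave = shepard_components(12)
print(start == after_octave)          # True: spectrum repeats after 12 steps
print([f for f, _ in start][:4])      # [27.5, 55.0, 110.0, 220.0]
```

Because the envelope falls to zero at both ends of the frequency range, components can fade out at the top and fade in at the bottom without an audible jump.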

The amplitudes of the components are constrained within the curved envelope shown. Each time the tone moves up one semitone, the partials all move up by a twelfth of an octave, or one semitone. The upper harmonics become weaker and eventually disappear, and new lower harmonics appear and become stronger.

Fig 5.18 Illustration of the spectra of stimuli which would produce the Shepard tone continuous ascending scale illusion.

A musical example relating to the Shepard tone effect, in which some pitch ambiguity is perceived by some listeners, can be found in the pedal line starting at bar 31 of the Fantasia in G minor (BWV 542) for organ by J.S. Bach. These bars are reproduced in Figure 5.19 as an organ score in which the lower of the three staves is played on the pedals while the upper two staves are played with the left and right hands. The pedal line consists of a sequence of five descending scales with eight notes in each except the last.

Fig 5.19 An extract from bar 31 of the Fantasia in G minor (BWV 542) for organ by J.S. Bach

Each scale ends with an upward leap of a minor seventh, and the exact moment where the upward leap occurs is often perceived with some ambiguity, even when listeners have the score in front of them. The strength of this effect depends on the particular stops used. This ambiguity is particularly common amongst listeners in the third bar of the extract, where the upward leap is often very strongly perceived as occurring one or even two notes later. This could be due to the entry of a new part in the left hand playing F3, which starts as the pedal part jumps up to written Bb3. (Reference to 'written' Bb3 is made since the 16' rank provides the fundamental on the pedals, which sounds an octave lower than written pitch as discussed in Section 5.4.) The f0 components of these two notes, i.e. Bb2 and F3, form the second and third harmonics of Bb1, which would have been the next sounding note of the pedal descending scale had it not jumped up an octave. At all the upward leaps in the descending pedal scales the chord in the manual part changes from minor to major. In bar 32 the left hand change from Eb to E natural adds a member of the harmonic series (see Figure 3.3) of what would have been the next note (written C3) in the pedal scale had it not risen up the octave. Eb is not a member of that harmonic series. The same applies to the D natural in the right hand in the third bar of the extract, with the entry of the left-hand F3, and to the left-hand C natural in the fourth bar. These entries of notes which are members of the harmonic series of what would have been the next note in the descending pedal scale, had it not jumped up the octave, serve to provide the perceived ambiguity as to the exact instant at which the upward leap occurs.

The illusion produced by the combination of organ pipes to produce a sensation of pitch lower than any note actually sounding is also used in organ pedal resultant bass stops. These sound at 32' (and very occasionally 64'), and their f0 values for bottom C are 16.35 Hz and 8.175 Hz respectively. A resultant bass at 32' pitch is formed by sounding together stops of 16' and 10 2/3', which form the second and third harmonics of the 32' harmonic series (see Section 5.4). A 32' stop, perhaps labelled 'acoustic bass', is a mutation stop of 10 2/3' which when sounded with a 16' rank produces a perceived pitch at 32' (place theory of pitch perception from the second and third harmonics; see Chapter 3). Similarly, a 64' stop, perhaps labelled 'resultant bass', sounds a 21 1/3' rank with a 32' rank. The f0 value of the middle C of a 32' stop (C2) is 65.4 Hz and thus its bottom note (C0) is two octaves below this, with an f0 of 65.4/4 or 16.35 Hz. The f0 for the bottom note of a 64' stop (C-1) is 8.175 Hz, which is below the human hearing range but within the frequency range of difference frequencies that are perceived as beats (see Figure 2.6). Harmonics that are within the human hearing range will contribute to a perception of pitch at these f0 values, which are themselves below the frequency range of the hearing system. Organists will sometimes play fifths in the pedals to imitate this effect, particularly on the last note of a piece. However, the effect is not as satisfactory as that obtained with a properly voiced resultant bass stop because the third harmonic (e.g. 10 2/3') should be softer than the second harmonic (16') for best effect.
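
The arithmetic behind a resultant bass can be checked with a short sketch. It assumes that the f0 of a rank is inversely proportional to its nominal footage, which is the usual organ stop convention, and takes the 16.35 Hz bottom C of a 32' stop from the text as its reference. The function name is hypothetical.

```python
from fractions import Fraction

BOTTOM_C_32FT_HZ = 16.35  # f0 of bottom C on a 32' stop, as given in the text


def bottom_c_frequency(stop_length_ft):
    """f0 in Hz of bottom C for a rank of the given nominal length in feet."""
    return BOTTOM_C_32FT_HZ * 32 / float(stop_length_ft)


f_16 = bottom_c_frequency(16)                   # second harmonic of 32'
f_quint = bottom_c_frequency(Fraction(32, 3))   # 10 2/3', third harmonic of 32'
print(round(f_16, 2), round(f_quint, 2), round(f_quint - f_16, 2))
# 32.7 49.05 16.35
```

The difference between the two sounding ranks equals the f0 of the absent 32' rank. The same calculation with 32' and 21 1/3' (`Fraction(64, 3)`) gives 8.175 Hz for a 64' resultant.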

Roederer (1975) describes an organ-based example to illustrate residue pitch which constitutes a pitch illusion. The solo line of a chorale prelude (he suggests chorale number 40 from the Orgelbüchlein by J.S. Bach) is played using a number of mutation stops (see Section 5.5) if available (e.g. 8', 4', 2 2/3', 2', 1 3/5', 1 1/3', 1'), accompanied by 8', 4' in the left hand and 16', 8' in the pedal. A musically trained audience should be asked to track the pitch of the melody and warned that timbre changes will occur. After playing a short snippet, play some more without the 8', then without the 4', then without the 2' and finally without the 1'. What remains in the solo part is only mutation stops (i.e. those with a non-unison or non-octave pitch relationship to the fundamental). Roederer suggests making: 'the audience aware of what was left in the upper voice and point out that the pitch of the written note was absent altogether (in any of its octaves) - they will find it hard to believe! A repetition of the experiment is likely to fail - because the audience will redirect their pitch processing strategies!' Experience shows that such an experiment relies on pitch context being established when using such stimuli, usually through the use of a known or continuing musical melody. A musical illusion only works by virtue of establishing a strong expectation in the mind's ear of the listener.

 

You Should Know

Sum and Difference Tones

When two pure tones are played simultaneously, they are not always perceived as two separate pure tones.

When two pure tones are heard together, other tones with frequencies lower than the frequencies of either of the two pure tones themselves may be heard also. These lower tones are not acoustically present in the stimulating signal; they occur as a result of the stimulus consisting of a 'combination' of at least two pure tones, and they are known as 'combination tones'. The frequency of one such combination tone which is usually quite easily perceived is the difference (higher minus the lower) between the frequencies of the two tones, and this is known as the 'difference tone':

fd = fh - fl

where fd is the frequency of the difference tone, and fh and fl are the frequencies of the higher and lower tones. Other combination tones may be heard at:

f(n) = fl - n(fh - fl), for n = 1, 2, 3, ...

These tones are always below the frequency of the lower pure tone, and occur at integer multiples of the difference tone frequency below the lower tone. No listener hears all of these combination tones, and some hear none. The difference tone and the combination tones for n = 1 and n = 2, known as the 'second-order difference tone' and the 'third-order difference tone', are those that are perceived most readily.

When the two tones are not members of a harmonic series, the combination tones have no equivalent f0, but they will be equally spaced in frequency. Combination tones are perceived quite easily when two musical instruments which produce fairly pure tone outputs, such as the descant recorder, baroque flute or piccolo, play together with f0 values that are high and close in frequency. When the two notes played are themselves both exact and adjacent members of the harmonic series formed on their difference tone, the combination tones will be consecutive members of the harmonic series adjacent and below the lower played note (i.e. the f0 values of both notes and their combination tones would be exact integer multiples of the difference frequency between the notes themselves). The musical relationship of combination tones to notes played therefore depends on the tuning system in use. Two notes played using a tuning system which results in the interval between the notes never being pure, such as the equal-tempered system, will produce combination tones which are close to, but not exact, harmonics of the series formed on the difference tone.
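As a worked illustration of the relationships described above, the sketch below computes the difference tone and the combination tones lying at integer multiples of the difference frequency below the lower tone. The 1000 Hz starting pitch and the choice of a major third are assumptions made only for the example.

```python
def combination_tones(f_low, f_high, count=3):
    """Return the difference tone and up to `count` combination tones.

    Combination tone n lies n difference-tone steps below the lower tone:
    f_low - n * (f_high - f_low). Tones at or below 0 Hz are omitted.
    """
    difference = f_high - f_low
    tones = []
    for n in range(1, count + 1):
        f = f_low - n * difference
        if f <= 0:
            break
        tones.append(f)
    return difference, tones


# Pure (just) major third, ratio 5:4, on an assumed 1000 Hz lower note
pure_diff, pure_tones = combination_tones(1000, 1250)
print(pure_diff, pure_tones)
print([f / pure_diff for f in pure_tones])  # whole-number harmonic positions

# Equal-tempered major third: four semitones above the same lower note
et_high = 1000 * 2 ** (4 / 12)
et_diff, et_tones = combination_tones(1000, et_high)
print([round(f / et_diff, 3) for f in et_tones])  # no longer whole numbers
```

With the pure interval, both notes and all the combination tones are integer multiples of the difference frequency; with the tempered interval the combination tones remain equally spaced but no longer fall on harmonics of the difference tone.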

 

Masking of one sound by another

Almost every sound we hear in music consists of at least two frequency components. When two or more pure tones are heard together an effect known as 'masking' can occur, where an individual tone becomes more difficult or impossible to perceive (that is, it is partially or completely 'masked') due to the presence of another tone. In such a case the tone which causes the masking is known as the 'masker' and the tone which is masked is known as the 'maskee'. These tones could be individual pure tones, but given the rarity of such sounds in music, they are more likely to be individual frequency components of a note played on one instrument which mask either other components in that note or frequency components of another note. The extent to which masking occurs depends on the frequencies of the masker and maskee and their amplitudes.

 

The dependence of masking on the frequencies of masker and maskee can be illustrated by reference to Figure 2.9 in which an idealised frequency response curve for an auditory filter is plotted. The filter will respond to components in the input acoustic signal which fall within its response curve whose bandwidth is given by the critical bandwidth for the filter's centre frequency. The filter will respond to components in the input whose frequencies are lower than its centre frequency to a greater degree than components which are higher in frequency than the centre frequency due to the asymmetry of the response curve. Masking can be thought of as the filter's effectiveness in analysing a component at its centre frequency (maskee) being reduced to some degree by the presence of another component (masker) whose frequency falls within the filter's response curve. The degree to which the filter's effectiveness is reduced is usually measured as a shift in hearing threshold, or 'masking level', as illustrated in Figure 5.7. The figure shows that the asymmetry of the response curve results in the masking effect being considerably greater for maskees which are above rather than those below the frequency of the masker.

 

At low amplitude levels, the masking effect tends to be similar for frequencies above and below fmasker. As the amplitude of the masker is raised, the 'low masks high' effect increases and the resulting masking level curve becomes increasingly asymmetric. Thus the masking effect is highly dependent on the amplitude of the masker.


The masking effect in musical sounds which are complex, with many spectral components, can be determined in terms of the masking effect of individual components on other components in the sound. If a component is completely masked by another component in the sound, the masked component makes no contribution to the perceived nature of the sound itself and is therefore effectively ignored. If the masker is broadband noise, or 'white noise', then components at all frequencies are masked in an essentially linear fashion (i.e. a 10 dB increase in the level of the noise increases the masking effect by 10 dB at all frequencies). This can be the case, for example, with background noise or a brushed snare drum (see Figure 3.6), which have spectral energy generally spread over a wide frequency range, and this can mask components of other sounds that fall within that frequency range.

The masking effects considered so far are known as 'simultaneous masking' because the masking effect on the maskee by the masker occurs when both sound together (or simultaneously). Two further masking effects are important for the perception of music where the masker and maskee are not sounding together, and these are referred to as 'non-simultaneous masking'. These are 'forward masking' or 'post-masking', and 'backward masking' or 'pre-masking'. In forward masking, a pure tone masker can mask another tone (maskee) which starts after the masker itself has ceased to sound. In other words the masking effect is 'forward' in time from the masker to the maskee. Forward masking can occur for time intervals between the end of the masker and the start of the maskee of up to approximately 30 ms. In backward masking a maskee can be masked by a masker which follows it in time, starting up to approximately 10 ms after the maskee itself has ended.
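The timing conditions for the three kinds of masking can be summarised in a small sketch. It only reports which kind of masking the timing would permit, using the approximate 30 ms and 10 ms limits quoted above. Whether masking actually occurs also depends on the frequencies and amplitudes involved, which are not modelled here. The function name and return values are illustrative.

```python
FORWARD_LIMIT_MS = 30.0   # approximate limit quoted in the text
BACKWARD_LIMIT_MS = 10.0  # approximate limit quoted in the text


def masking_type_permitted(masker_start, masker_end, maskee_start, maskee_end):
    """Classify masker/maskee timing (all times in ms).

    Returns 'simultaneous', 'forward', 'backward', or None when the two
    sounds are too far apart in time for any of these effects.
    """
    overlap = maskee_start < masker_end and masker_start < maskee_end
    if overlap:
        return "simultaneous"
    if maskee_start >= masker_end:
        # Maskee begins after the masker has ceased: forward masking window
        gap = maskee_start - masker_end
        return "forward" if gap <= FORWARD_LIMIT_MS else None
    # Otherwise the masker begins after the maskee has ended: backward window
    gap = masker_start - maskee_end
    return "backward" if gap <= BACKWARD_LIMIT_MS else None


print(masking_type_permitted(0, 100, 50, 150))   # overlapping in time
print(masking_type_permitted(0, 100, 120, 170))  # maskee starts 20 ms later
print(masking_type_permitted(100, 200, 0, 95))   # masker starts 5 ms later
print(masking_type_permitted(0, 100, 150, 200))  # 50 ms gap
```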

Moore (1996) makes the following observations about non-simultaneous masking:

MP3

Masking is exploited practically in digital systems that store and transmit digital audio in order to reduce the amount of information that has to be handled, and therefore reduce the transmission resource, or bandwidth, and memory, disk or other storage medium required. Such systems are generally referred to as perceptual coders because they exploit knowledge of human perception. For example, perceptual coding is the operating basis of the MP3 system that is used to transmit music over the Internet, MP3 players that store many hours of such music in a pocket-sized device, multi-channel sound in digital audio broadcasting and satellite television systems, and MiniDisc recorders.

International standards define perceptual coding schemes for the encoding (recording) and decoding (playback) parts of these systems, which enable different manufacturers to produce compatible equipment. The Moving Picture Experts Group (MPEG) was set up in 1988. Their task was then, and still is now, to develop international standards for the coding of moving pictures and associated audio, and their work has resulted in standards such as MPEG-1, MPEG-2 and MPEG-4, each of which includes three layers: I, II and III. MP3 itself is based on MPEG-1, layer III (not MPEG-3 as this does not exist!).

The input signal is first split into a number of frequency bands, generally by means of a bank of bandpass filters, and these are sometimes referred to as sub-bands, giving some coders the often used name sub-band coders. The extent to which this process matches the critical band analysis of the human peripheral hearing system depends on the complexity of the particular coding scheme itself. The energy in each of these sub-bands is used with reference to the original signal to calculate the simultaneous (and in some cases also the non-simultaneous) masking effects for that instant of input signal. Those elements of the sub-bands that the system decides would not be masked are then digitally coded for transmission and/or storage. At the receiving end there is a decoder which reverses this process to reproduce the audio material, which is not of course an exact copy of the original input since masking predictions have been employed to remove material that the listener would not have heard in that context.
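The principle of discarding masked material can be sketched in a few lines. This is a toy illustration only and not the psychoacoustic model of MP3 or of any MPEG standard. The triangular masking curve, its slopes (in dB per octave) and its offset are hypothetical values chosen solely to reproduce the 'low masks high' asymmetry described earlier.

```python
import math


def masking_threshold_db(masker_hz, masker_db, probe_hz,
                         slope_below=40.0, slope_above=10.0, offset_db=10.0):
    """Hypothetical masked threshold at probe_hz due to a single masker.

    The curve falls steeply below the masker frequency and gently above it,
    so a masker affects higher-frequency components more than lower ones.
    """
    octaves = math.log2(probe_hz / masker_hz)
    if octaves >= 0:
        return masker_db - offset_db - slope_above * octaves
    return masker_db - offset_db + slope_below * octaves  # octaves is negative


def keep_unmasked(components):
    """Keep only (frequency_hz, level_db) components no other component masks."""
    kept = []
    for i, (freq, level) in enumerate(components):
        masked = any(
            level < masking_threshold_db(m_freq, m_level, freq)
            for j, (m_freq, m_level) in enumerate(components)
            if j != i
        )
        if not masked:
            kept.append((freq, level))
    return kept


# Assumed spectrum for one analysis instant: (frequency in Hz, level in dB)
spectrum = [(500, 40), (1000, 80), (2000, 40), (4000, 60)]
coded = keep_unmasked(spectrum)
print(coded)
```

In this example the 40 dB component an octave above the strong 1000 Hz masker is dropped, whereas the equally quiet component an octave below it survives. Practical coders work on sub-band energies and typically vary the coding precision rather than simply deleting components.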

Note grouping illusions

There are some situations when the perceived sound is unexpected, as a result of either what amounts to an acoustic illusion or the way in which the human hearing system analyses sounds. Whilst some of these sounds will not be found in traditional musical performances using acoustic instruments since they can only be generated electronically, some of the effects have a bearing on how music is performed. The nature of the illusion and its relationship with the acoustic input which produced it can give rise to new theories of how sound is perceived, and in some cases the effect might already have been used, or could in the future be used, in the performance of music.

Diana Deutsch describes a number of note grouping acoustic illusions, some of which are summarised below with an indication of their manifestation in music perception and/or performance. Deutsch describes an 'octave illusion' in which a sequence of two tones an octave apart, with high (800 Hz) and low (400 Hz) f0 values, is alternated between the ears. Most listeners report hearing a high tone in the right ear alternating with a low tone in the left ear as illustrated in the figure, no matter which way round the headphones are placed. She further notes that right-handed listeners tend to report hearing the high tone in the right ear alternating with a low tone in the left ear, whilst left-handed listeners tend to hear a high tone alternating with a low tone but it is equally likely that the high tone is heard in the left or right ear. This illusion persists when the stimuli are played over loudspeakers.
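The stimulus pattern for the octave illusion can be written out as a simple sequence. The sketch below assumes that at every step one ear receives the 800 Hz tone while the other receives the 400 Hz tone, with the two swapping at each step. It describes only the stimulus, not the percept, and tone durations are deliberately left unspecified.

```python
HIGH_HZ = 800
LOW_HZ = 400


def octave_illusion_sequence(n_steps, start_high_in_right=True):
    """Return a list of (left_ear_hz, right_ear_hz) pairs, one per tone step.

    The high and low tones swap ears on every step, so each ear on its own
    receives a simple high/low alternation.
    """
    sequence = []
    for step in range(n_steps):
        high_in_right = (step % 2 == 0) == start_high_in_right
        if high_in_right:
            sequence.append((LOW_HZ, HIGH_HZ))
        else:
            sequence.append((HIGH_HZ, LOW_HZ))
    return sequence


for left, right in octave_illusion_sequence(4):
    print(f"left: {left} Hz   right: {right} Hz")
```

Reversing the headphones corresponds to calling the function with `start_high_in_right=False`, which, according to the text, leaves the commonly reported percept unchanged.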

In a further experiment, Deutsch (1975) played an ascending and a descending C major scale simultaneously, with alternate notes being switched between the two ears as shown in the lower part of Figure 5.12. The most commonly perceived response is also shown in the figure. Once again the high notes tend to be heard in the right ear and the low notes in the left ear, resulting in a snippet of a C major scale being heard in each ear. Such effects are known as 'grouping' or 'streaming', and by way of explanation, Deutsch invokes some of the grouping principles of the 'Gestalt school' of psychology known as 'good continuation', 'proximity' and 'similarity'. She describes these (Deutsch, 1982) as follows:

The finding that the majority of listeners hear the high notes in the right ear and the low notes in the left ear may have some bearing on the natural layout of groups of performing musicians. For example, a string quartet will usually play with the cellist sitting on the left of the viola player who is sitting on the left of the second violinist who in turn is sitting on the left of the first violinist. This means that each player has the instruments playing parts lower than their own on their left-hand side, and those instruments playing higher parts on their right-hand side. Vocal groups tend to organise themselves such that the sopranos are on the right of the altos, and the tenors are on the right of the basses if they are in two or more rows. Small vocal groups such as a quartet consisting of a soprano, alto, tenor and bass will tend to be in a line with the bass on the left and the soprano on the right. In orchestras, the treble instruments tend to be placed with the highest pitched instruments within their section (e.g. first violin, piccolo, trumpet etc.) on the left and bass instruments on the right. Such layouts have become traditional and moving players or singers around such that they are not in this physical position with respect to other instruments or singers is not welcomed. This tradition of musical performance layout may well be in part due to a right-ear preference for the higher notes.

However, whilst this may work well for the performers, it is back-to-front for the audience. When an audience faces a stage to watch a live performance, the instruments or singers producing the treble notes are on the left and the bass instruments or singers are on the right. This is the wrong way round in terms of the right-ear treble preference, but the correct way round for observing the performers themselves. It is interesting to compare the normal concert hall layout as a listener with the experience of sitting in the audience area behind the orchestra which is possible in halls such as the Royal Festival Hall in London. Unfortunately this is not a test that can be carried out very satisfactorily since it is not usually possible to sit far enough behind the players to gain as good an overall balance as can be obtained from the auditorium in front of the orchestra. It is, however, possible to experience this effect by turning round when listening to a good stereo recording over loudspeakers.