Latest revision as of 15:26, 10 April 2013

Post Lecture Speech notes. ECE438, Fall 2009, Prof. Boutin

Basic Idea

Speech is an acoustic signal, which we approximate as an analog signal. Our goal is to convert this analog signal into a digital one so that we can perform various forms of processing on it.

Parts of Speech

Before we jump into the mathematical "deep end", we first need to know the basic building blocks of speech.

A sentence that we hear is made up of syllables (sound) and separations (no sound). Simply put, a syllable is a single, uninterrupted sound that forms the rhythmical foundation of a language. For example, the word 'water' has two syllables, 'wa' and 'ter' separated by a tiny break in speech.

If we go down further, each syllable is formed of phonemes. A phoneme is the smallest segmental unit of sound; it is what distinguishes one utterance from another. Two groups that speak the same language with different accents still share its phonemes, because the phonemes serve the same function for both.

Since phonemes are the smallest block of a speech signal, it is no surprise that they form the basis for speech analysis.

Phonemes

There are two types of phonemes: voiced and unvoiced.

Voiced phonemes consist of all vowels and some consonants whereas unvoiced phonemes are just consonants.

There are two methods to differentiate between the two when observing the time-domain plot of the audio signal:

1) Average power: Compare the average power $P = \frac{1}{L} \sum_{n=1}^L x^2(n)$ of each section, computed over its $L$ samples.

$P_{avg,voiced} > P_{avg,unvoiced}$

2) Zero-crossings: Alternatively, you can compare how many times the signal crosses the zero-axis.

Zero crossing (Unvoiced) > Zero crossing (Voiced)
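
The two tests above can be sketched in a few lines of Python. This is only a rough illustration, not something from the lecture: the function names, the 8 kHz sampling rate, and the two test frames (a 100 Hz sinusoid standing in for a voiced section, low-amplitude random noise standing in for an unvoiced one) are all assumptions made for the example.

```python
import math
import random


def average_power(frame):
    """Average power P = (1/L) * sum of x(n)^2 over the L samples of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossings(frame):
    """Count sign changes between consecutive samples of the frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


# Hypothetical test frames (assumed values, not real speech data).
fs = 8000                 # assumed sampling rate in Hz
L = 400                   # frame length in samples
rng = random.Random(0)    # seeded so the example is repeatable

voiced = [math.sin(2 * math.pi * 100 * n / fs) for n in range(L)]
unvoiced = [0.1 * rng.uniform(-1, 1) for _ in range(L)]

print("power:", average_power(voiced), average_power(unvoiced))
print("zero crossings:", zero_crossings(voiced), zero_crossings(unvoiced))
```

With these stand-in frames, the voiced frame should show the larger average power and the unvoiced frame the larger zero-crossing count, matching the two inequalities above. Real speech would need a threshold chosen from data.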

Since the voiced section is approximately periodic, it can be modeled as a pulse train. Similarly, the unvoiced section can be modeled as white noise.

Modeling of Speech Production

Utilizing this valuable information we can now construct a model for speech production as shown below.

where $s(t) = x(t) \ast v(t)$, x(t) is the excitation (either a pulse train or white noise), and v(t) is the response of the vocal tract.
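
To make the convolution concrete, here is a minimal discrete-time sketch of the voiced case. The notes work with continuous-time s(t); the sampled version, the helper names, the pulse period of 80 samples, and the decaying-cosine impulse response used for v are all assumptions chosen just for illustration.

```python
import math


def convolve(x, v):
    """Discrete convolution s(n) = sum over k of x(k) * v(n - k)."""
    s = [0.0] * (len(x) + len(v) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # skip the zeros between pulses
        for j, vj in enumerate(v):
            s[i + j] += xi * vj
    return s


def pulse_train(length, period):
    """Unit pulses every `period` samples, zeros elsewhere."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


# Hypothetical vocal tract impulse response: a single decaying resonance.
v = [0.9 ** n * math.cos(0.3 * math.pi * n) for n in range(50)]

x = pulse_train(240, 80)   # voiced excitation: one pulse every 80 samples
s = convolve(x, v)         # modeled speech signal

print(len(s), s[0], s[80])
```

Because the assumed impulse response is shorter than the pulse period, each pulse simply produces its own copy of v in the output. For the unvoiced case, x would be replaced by a white-noise sequence.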

Modeling of the vocal tract

I'm not going to go into extensive detail here (see class lecture notes), but I will go over the basics and the end result.

Basically, the vocal tract is modeled as a finite series of connected tubes.

(u(t) is the signal going in the front of the tube, v(t) is the signal coming out of the front of the tube, w(t) is the signal coming out of the end of the tube, x(t) is the signal going into the end of the tube)

Using this model, we get two equations.

1) Going through a tube: $\begin{pmatrix} U_d(z) \\ V_d(z) \end{pmatrix} = z \begin{pmatrix} 1 & 0 \\ 0 & z^{-2} \end{pmatrix} \begin{pmatrix} W_d(z) \\ X_d(z) \end{pmatrix}$

This shows there is a time delay when traveling through a tube (the $z^{-2}$ term).

2) Between two tubes: $\begin{pmatrix} U_d(z) \\ V_d(z) \end{pmatrix} = \frac{1}{1+r} \begin{pmatrix} A+B & A-B \\ A-B & A+B \end{pmatrix} \begin{pmatrix} W_d(z) \\ X_d(z) \end{pmatrix}$

where r is a reflection coefficient, $-1 \le r \le 1$, such that $r = \frac{B-A}{B+A}$.

The end result after modeling is a transfer function that is an all-pole filter with a gain and a time delay.

Since the vocal tract is a cavity that resonates, it makes sense that a pulse train (voiced excitation) convolved with the vocal tract's response results in a series of varying-amplitude pulses in which certain frequencies are amplified.
These frequencies are known as formants.

Transfer function

As noted above, the transfer function is usually an all-pole filter. We can identify the resonances, or formants, just by looking at the poles of the function:

A real pole -> 1 formant
A complex pole pair -> 1 formant

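If you are instead given a z-domain model, a pole at angle $\theta$ (in radians) corresponds to

$F = \frac{\theta}{2 \pi T}$ , where F is the formant frequency and T is the sampling period (this is the same thing as $\omega T = \theta$).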
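
As a quick numerical check of this formula, the sketch below maps a set of z-plane poles to formant frequencies. The function name, the 8 kHz sampling rate, and the pole locations are made up for the example; only the formula itself comes from the notes.

```python
import cmath
import math


def formant_frequencies(poles, T):
    """Map z-plane poles to formant frequencies using F = theta / (2*pi*T).

    Only poles with a non-negative angle are kept, so each complex
    conjugate pair contributes a single formant.
    """
    freqs = []
    for p in poles:
        theta = cmath.phase(p)  # pole angle in radians, in (-pi, pi]
        if theta >= 0:
            freqs.append(theta / (2 * math.pi * T))
    return sorted(freqs)


# Hypothetical all-pole model: two complex-conjugate pole pairs.
T = 1 / 8000  # assumed sampling period (8 kHz sampling rate)
poles = [
    cmath.rect(0.95, math.pi / 8), cmath.rect(0.95, -math.pi / 8),
    cmath.rect(0.90, math.pi / 2), cmath.rect(0.90, -math.pi / 2),
]

print(formant_frequencies(poles, T))
```

For these assumed poles, the angles $\pi/8$ and $\pi/2$ work out to formants near 500 Hz and 2000 Hz, since $F = \theta / (2 \pi T)$ with $1/T = 8000$.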

Also, it should be noted that the transfer function is not always an all-pole filter.

Zeros, or anti-resonances, occur when there is no measurable output (e.g., nasals and fricatives):
- Nasals => The output from the mouth is approximately 0
- Fricatives/stop consonants => blockage behind source is infinite (forcing air through a constriction)

Spectrograms

Since this was pretty well covered by both the lecture and lab, I'm just going to post a link to a site that I found helpful for further understanding how spectrograms work and how they tell us about different parts of speech.

-> How to read a spectrogram by Rob Hagiwara

http://home.cc.umanitoba.ca/~robh/howto.html

Pre-lecture notes are here: SupplementarySpeech_prelecture

Back to ECE438, Fall 2009

@@ Line 1: / Line 1: @@
-Due to a kiwi server fail, my pre-lecture notes are not as substantial as I would have liked
+[[Category:lecture notes]]
-See my post-lecture notes for a more detailed description.
+[[Category:ECE438]]
-*The server failed??? When?? Zach do you know anything about this? --[[User:Mboutin|Mboutin]] 19:45, 3 November 2009 (UTC)
+[[Category:speech]]
-*It was from around 2pm till about 5:30pm Tuesday.  When I tried to preview my page that I had started writing, it said something like "Server not available."  --[[User:Pclay|Pclay]]
-*We will look into this. Thanks for the detailed info Peter! --[[User:Mboutin|Mboutin]] 13:33, 4 November 2009 (UTC)
+==Post Lecture Speech notes. [[ECE438]], Fall 2009, [[user:mboutin|Prof. Boutin]]==
+(Click [[ECE438_(BoutinFall2009)|here]] to go to course wiki.)
+= Basic Idea =
+* Speech is an acoustic signal, which we approximate as an analog signal.  It is our goal to change this analog signal into a digital so that we can perform various forms of processing on it.
-Notes for speech lecture:
+= Parts of Speech =
+* Before we jump into the mathematical "deep end" we first need to know the basic building blocks of speech
-Structure:
+* A sentence that we hear is made up of syllables (sound) and separations (no sound).  Simply put, a '''syllable''' is a single, uninterrupted sound that forms the rhythmical foundation of a language.  For example, the word 'water' has two syllables, 'wa' and 'ter' separated by a tiny break in speech.
--> Basic speech stuff (pipes, fricatives)
--> Voiced vs. Unvoiced
-) avg power
-) zero crossing
--> x(t) -> v(t) => s(t) = conv( x(t), v(t) )
-  periodic filter  phoneme
-  pulse
-  train
--> Model vocal tract as a series of tubes
-  - Going through tube delays the signal (show function)
- - between tubes (show function)
-+ This model leads to a transfer function
--> Transfer function V(d)
-  Since the vocal tract is a cavity that resonates, it amplifies certain frequencies
+* If we go down further, each syllable is formed of phonemes.  A '''phoneme''' is the smallest, segmental unit of sound.  It is what forms the difference between utterances.  Even though two different groups use the same language and have different accents and because the phonemes have the same function.
-  X(f) is sum(a_k * delta(f-kf_a))
- This frequencies, which are the local maxes of |S(f)| are called formants
+* Since phonemes are the smallest block of a speech signal, it is no surprise that they form the basis for speech analysis.
-  - Generally, the vocal tract transfer function is an all-pole filter
+= Phonemes =
-    where a real pole or a complex pole pair correspond to a resonance.
+There are two types of phones, voiced and unvoiced.
-  - Also, if you are given a z-model, F = theta / (2*pi*T) where T is
+* Voiced phonemes consist of all vowels and some consonants whereas unvoiced phonemes are just consonants.
-    the sampling period. (same thing as wT = theta
-  - zeros, anti-resonances, of the transfer function will occur when there is no
+There are two methods to differentiate between the two when observing the time domain plot of the audio signal
-    measurable output (i.e. Nasals and Fricatives)
-  - Nasal => output from the mouth is zero
-    Fricatives/stop consonants => blockage behind source is infinite (forcing air
-    through constriction)
--> Spectrograms
-   - Models frequency vs. time
-   - Use a short-time DTFT to obtain useful info about an utterance
-     X_m(e^jw) = sum( x(n)w(n-m)e^(-jwn))
-   - wideband uses window length = one period
-     - high time resolution, low freq
-     - striations due to energy variation
-   - narrowband captures several periods
-     - high freq, low time
-     - striations correspond to peaks in frequency spectrum.
-   The formants correspond to the dark bands.
+[[Image:VoicedvsUnvoiced.jpg|500px|Voiced and Unvoiced components of a voice signal]]
+1) Average Power:  By comparing the average power <math>P = \frac{1}{L} \sum_{n=1}^L x^2(n)</math>.
+* <math>P_{avg,voiced} > P_{avg,unvoiced}</math>
+2) Zero-crossings:  Alternatively, you can compare how many times the signal crosses the zero-axis.
+* Zero crossing (Unvoiced) > Zero crossing (Voiced)
+Since the voiced section is approximately period, it can be modeled as a pulse train.  Similarly, the unvoiced section can be modeled as white noise.
+= Modeling of Speech Production =
+Utilizing this valuable information we can now construct a model for speech production as shown below.
+[[Image:Vocaltractmodel.jpg|500px|Model of Vocal tract]]
+where <math>s(t) = x(t) \ast v(t)</math> where x(t) is the excitation (either pulse train or white noise)
+= Modeling of the vocal tract =
+I'm not going to go into extensive detail here (see class lecture notes) but I will go over the basics and the end result..
+Basically, the vocal tract is modeled as a finite series of connected tubes.
+(u(t) is the signal going in the front of the tube, v(t) is the signal coming out of the front of the tube, w(t) is the signal coming out of the end of the tube, x(t) is the signal going into the end of the tube)
+Using this model, we get two equations.
+1) Going through a tube: <math>
+\begin{pmatrix}
+   U_d(z) \\
+  V_d(z)
+\end{pmatrix}
+= z
+\begin{pmatrix}
+  1 & 0 \\
+  0 & z^{-2}
+\end{pmatrix}
+\begin{pmatrix}
+  W_d(z) \\
+  X_d(z)
+\end{pmatrix}
+</math>
+* This shows there is a time delay when traveling through a tube (the <math>z^(-2)</math> term)
+2) Between two tubes: <math>
+\begin{pmatrix}
+  U_d(z) \\
+  V_d(z)
+\end{pmatrix}
+= \frac{1}{1+r}
+\begin{pmatrix}
+  A+B & A-B \\
+  A-B & A+B
+\end{pmatrix}
+\begin{pmatrix}
+  W_d(z) \\
+  X_d(z)
+\end{pmatrix}
+</math>
+where r is a reflection coefficient <math>-1 <= r <= 1</math> such that <math>r = \frac{B-A}{B+A}</math>
+The end result after modeling is a transfer function that is an all-pole filter with a gain and a time delay.
+* Since the vocal tract is a cavity that resonates it makes sense that our output of a pulse train (voiced) convolved with a transfer function results in a series of varying-amplitude pulses with certain frequencies that are amplified.
+* These frequencies are known as '''formants'''
+= Transfer function =
+As noted above, the transfer function is usually an all-pole filter.  We can observe what the resonances, or formants, are just by looking at the function
+* A real pole -> 1 formant
+* A complex pole pair -> 1 formant
+If you are instead given a z-model...
+* <math>F = \frac{\theta}{2 \pi T} </math> , where F is the formant frequency and T is the sampling period.
+Also, it should be noted that the transfer function is not always an all-pole filter.
+* Zeros, or anti-resonances, will occur when there is no measurable output (i.e. Nasals and Fricatives)
+** Nasals => The output from the mouth is approximately 0
+** Fricatives/stop consonants => blockage behind source is infinite (forcing air through a constriction)
+= Spectrograms =
+Since this was pretty well covered by both the lecture and lab, I'm just going to post a link to a site that I found helpful for further understanding how spectrograms work and how they tell us about different parts of speech.
 -> How to read a spectrogram by Rob Hagiwara
-   http://home.cc.umanitoba.ca/~robh/howto.html
+http://home.cc.umanitoba.ca/~robh/howto.html
+prelecture notes here
+[[SupplementarySpeech_prelecture]]
+----
+[[[[ECE438_(BoutinFall2009)|Back to ECE438, Fall 2009]]

Difference between revisions of "Student summary speech" - Rhea