This book, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, explores natural language processing (NLP) using Stanford University material.
16.2.2 Windowing

From the digitized, quantized representation of the waveform, we need to extract spectral features from a small window of speech that characterizes part of a particular phoneme. Inside this small window, we can roughly think of the signal as stationary (that is, its statistical properties are constant within this region). (By contrast, in general, speech is a non-stationary signal, meaning that its statistical properties are not constant over time.) We extract this roughly stationary portion of speech by using a window which is non-zero inside a region and zero elsewhere, running this window across the speech signal and multiplying it by the input waveform to produce a windowed waveform.
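The sliding-window procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the function name `frame_signal`, the 16 kHz sample rate, and the default 25 ms width and 10 ms shift are assumptions chosen for the example, and the window used is the simplest one (1 inside the region, 0 elsewhere).

```python
import math


def frame_signal(signal, sample_rate, window_ms=25.0, stride_ms=10.0):
    """Slide a rectangular window across `signal` and return the windowed pieces.

    `signal` is a list of samples, `sample_rate` is in Hz. The names and
    default values here are hypothetical, chosen only for illustration.
    """
    window_len = int(round(sample_rate * window_ms / 1000.0))  # samples per window
    stride_len = int(round(sample_rate * stride_ms / 1000.0))  # samples per shift
    frames = []
    start = 0
    # Keep only windows that fit entirely inside the signal.
    while start + window_len <= len(signal):
        # A rectangular window is 1 inside the region and 0 elsewhere, so
        # multiplying by it simply keeps the samples inside the region.
        frames.append([1.0 * s for s in signal[start:start + window_len]])
        start += stride_len
    return frames


# Usage: one second of a 440 Hz sine wave sampled at an assumed 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(signal, sample_rate)
print(len(frames), len(frames[0]))
```

With these assumed settings each window covers 400 samples and successive windows start 160 samples apart, so neighbouring windows overlap. Dropping the final partial window is one design choice among several; padding the signal with zeros would be another.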
id: 0bc93bf40565fbfc61cf6bbbf95c4357 - page: 341
The speech extracted from each window is called a frame. The windowing is characterized by three parameters: the window size or frame size of the window (its width in milliseconds), the frame stride (also called shift or offset) between successive windows, and the shape of the window. To extract the signal we multiply the value of the signal at time n, s[n], by the value of the window at time n, w[n]:

$$y[n] = w[n]\,s[n] \tag{16.1}$$

The window shape sketched in Fig. 16.2 is rectangular; you can see the extracted windowed signal looks just like the original signal. The rectangular window, however, abruptly cuts off the signal at its boundaries, which creates problems when we do Fourier analysis. For this reason, for acoustic feature creation we more commonly use the Hamming window, which shrinks the values of the signal toward
id: ed9a7469922da537b8574b74058adb2b - page: 341
zero at the window boundaries, avoiding discontinuities. Figure 16.3 shows both; the equations are as follows (assuming a window that is L frames long):

rectangular:
$$w[n] = \begin{cases} 1 & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$$

Hamming:
$$w[n] = \begin{cases} 0.54 - 0.46\cos\left(\frac{2\pi n}{L}\right) & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$$

Figure 16.2 Windowing, showing a 25 ms rectangular window with a 10 ms stride.

Figure 16.3 Windowing a sine wave with the rectangular or Hamming windows. (Panels: rectangular window, Hamming window; horizontal axis: time in seconds.)

16.2.3 Discrete Fourier Transform
id: 30d9c99808e72f02c5b191e72a142eff - page: 342
The next step is to extract spectral information for our windowed signal; we need to know how much energy the signal contains at different frequency bands. The tool for extracting spectral information for discrete frequency bands for a discrete-time (sampled) signal is the discrete Fourier transform or DFT.

The input to the DFT is a windowed signal x[n]...x[m], and the output, for each of N discrete frequency bands, is a complex number X[k] representing the magnitude and phase of that frequency component in the original signal. If we plot the magnitude against the frequency, we can visualize the spectrum that we introduced in Chapter 28. For example, Fig. 16.4 shows a 25 ms Hamming-windowed portion of a signal and its spectrum as computed by a DFT (with some additional smoothing).
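To make the input and output of the DFT concrete, here is a minimal Python sketch that applies a Hamming window to one 25 ms frame and computes the magnitude of each X[k]. Several things here are assumptions rather than content from this passage: the function names are hypothetical, the 16 kHz sample rate and 1000 Hz test tone are illustrative, the window length L is taken to be the number of samples in the frame, and the DFT is written directly from its standard textbook definition, X[k] = sum over n of x[n] exp(-j 2 pi k n / N), which is far slower than an FFT.

```python
import cmath
import math


def hamming(L):
    """Hamming window, w[n] = 0.54 - 0.46 cos(2 pi n / L) for 0 <= n <= L-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / L) for n in range(L)]


def dft(x):
    """Naive O(N^2) DFT from the standard definition; returns complex X[k]."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


def magnitude_spectrum(frame):
    """Window one frame with a Hamming window, then return |X[k]| for each band k."""
    w = hamming(len(frame))
    windowed = [wn * sn for wn, sn in zip(w, frame)]  # y[n] = w[n] s[n]
    return [abs(X) for X in dft(windowed)]


# Usage: a 25 ms frame (400 samples at an assumed 16 kHz) of a 1000 Hz sine.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(400)]
mags = magnitude_spectrum(frame)

# Each band k covers sample_rate / N = 40 Hz, so the tone falls in band 25.
half = mags[: len(mags) // 2]
peak_band = max(range(len(half)), key=lambda k: half[k])
print(peak_band, peak_band * sample_rate / len(frame))
```

Only the first half of the bands is searched for the peak because, for a real-valued input, the magnitudes in the second half mirror those in the first. In practice one would use an FFT routine rather than this direct sum.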
id: 690cfd69b6d19bd5fe121ba2c89a9c75 - page: 342