Recognizing voice passwords using Fixed point FFT - Rhea

Recognizing Voice passwords using FFT

by Anuraag Rajasekhar

Background

The need for reliable voice recognition has become a universal goal ever since apple unveiled Siri 5 years ago. Even though initial version of voice assistant was sloppy people quickly realized the potential for a pocket personal assistant controlled mainly by your voice. With the improvement of the voice recognition technology, it's applications expanded from personal assistant to everything from voice dialing, appliance control, speech to text processing, data entry and security application to mention a few. I felt that an embedded application for voice password recognition would be a great addition to any security technology with the rise of IoT applications.

Theory

The most common algorithm used by voice recognition systems is the Mel-Frequency Cepstral (MFC), but implementing this on a micro-controller would be very hard due to the complexity of the algorithm. I used the Cepstrum algorithm to process a recorded voice signal. It was much simpler than the MFC algorithm and the essence of the Cepstrum algorithm is given by this formula:

Power Cepstrum of signal $=\left|{\mathcal {F}}^{-1}\left\{{\mbox{log}}(\left|{\mathcal {F}}\left\{f(t)\right\}\right|^{2})\right\}\right|^{2}$

The power cepstrum is a unique signature of the signal and the similarity between two signals can be determined by comparing their power cepstrums. For a security application users will be able to program voice passwords into a system, which will be used as reference signals to compare the voice password used when the user wants to unlock the system. This comparison can be done using correlation coefficient and say that if there is an 70% correlation with the reference signal, then the system will be unlocked.

The code

The recursive function FFT uses the radix-2 algorithm to compute the FFT. Since C doesn't have the capability to deal with imaginary numbers, the real and imaginary coefficients of the twiddle factor are calculated using the polar form and are used to calculate the magnitude which is always a real number. The user would have a 2.5 second window to say a password and will get a result based on the match with the reference signal. The microphone samples audio at 20 kHz, but this would generate a lot of data points so only 400 Hz is used as the sampling frequency. This would call for 1024 point FFT if the audio is recorded for 2.56 seconds.

int* FFT(int record[1024], int N, int s) 
{
  //DECLARATIONS
  int A[N/2];
  int B[N/2];
  int *a;
  int *b; 
  int out[N];
  int wn;
 
  //STATEMENTS
  if (N != 2 && N != 0) 
  {
     for(int k = 0; k < N; k= k+2)
    {
      A[k] = record[k];
    }
 
    for(int k = 1; k < N; k= k+2)
    {
      B[k] = record[k];
    }
 
    a = FFT(A,N/2,s);
 
    b = FFT(A,N/2,s);
 
 
    for(int k = 0; k < N/2; k= k+2)
    {
      wn = twiddle(N,k,s);
      out[k] = a[k]+wn*b[k+N/2];
    }
 
    for(int k = N/2; k < N; k= k+2)
    {
      wn = twiddle(N,k,s);
      out[k] = a[k]-wn*b[k+N/2];
    }
 
    return out; 
 
  }
  else if(N ==2)
  {
 
    out[0] = record[0]+record[1];
    out[1] = record[0]-record[1];
 
    return out;
  }
 
}
 
float twiddle(int N, int k, int s)
{
  float out;
  if ( s==1)
  {
    out = cos(12.5836*k/N);
  }
  else if( s==2)
  {
    out = -sin(12.5836*k/N);
  }
 
  else if( s==3)
  {
    out = sin(12.5836*k/N);
  }
 
  return out;
}
 
int* butterfly(int A[1024],int B[1024], int N, int s)
{
  int out[N];
  float wn; 
  for(int x = 0; x < N/2; x++)
  {
    A[x] = A[x];
  }
  for(int x = 0; x < N/2; x++)
  {
    B[x] = B[x];
  }
  for(int r = 0; r < N/2; r++)
  {
    wn = twiddle(N,r,s);
    out[r] = A[r]+wn*B[r+(N/2)];
  }
 
  for(int y = N/2; y < N; y++)
  {
    wn = twiddle(N,y,s);
    out[y] = A[y]-wn*B[y+(N/2)];
  }
 
  return out;
}
 
int* decimation(int A[1024], int N, int l)
{
  int out[N];
  for( int r = l; r < N; r = r+2)
  {
    out[r/2] = A[r];
  }
  return out; 
}

Conclusion

The approximations used and the use of integer type instead of a double type to represent the audio reduced the maximum match obtained to about 40%. Increasing the sampling frequency and reducing the window of data collection might result in better accuracy of this implementation. The algorithm needs more testing and adjustments before it could be used more reliably

Sources

http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html