Building an Autotuner

Garrett McMindes

Theory
In a simplistic approach, speech or song are created using two components. The first is the output from the lungs, and the second is the vocal tract, which acts as a filter. It is also useful to divide the sounds created into two categories, voiced and unvoiced sound. Unvoiced sound is modeled by having noise pass through the vocal cords, creating a sound without much ability to vary in pitch. An example is the sound an "f" makes. For the process of pitch correction, voiced sounds are much more of a concern. This case is modeled as an impulse train passing through the vocal tract. While in both cases vocal tract selects the noise heard, in voiced sound the impulse train determines the pitch of the sound. Linear predictive coding takes advantage of this model to separate the output from the lungs from the vocal tract. The vocal tract filter is then modeled by a series of values. This allows the sound and the frequency to be separated. By reconstructing the signal with a different impulse train, the note can be modified to match the target note. The wav file below is an example of this implementation used on an octave.
Autotuned Octave

In addition, while not required for pitch correction by LPC coefficients, determining the frequency of the singer could potentially prove useful for any further applications. Most ways of accurately finding the core frequency of the sound are based on the repetition of the wave in the frequency domain. When a person sings, they generate a core frequency, which determines the note, as well as harmonics, which are additional local maximums in the frequency domain. These harmonics appear at multiples of the core frequency. For example, a singer singing at 105hz would generate harmonics at 210hz, 315hz...etc. The code used in on this page was based on a theory by Gareth Middleton. The method uses down sampling to isolate the core frequency. First the DFT of the signal is taken, than that DFT is down sampled. The results are then multiplied. Typically, as harmonics get farther away from the core frequency, their amplitudes decrease. This is not always the case, which makes a simple search for the max of the DFT inaccurate. However by multiplying the DFTs together two goals are accomplished. First, noise is typically eliminated almost completely. Since the noise is not periodic in most cases, any substantial noise is effectively multiplied by a zero. The probability of this occurring increases as more levels of down sampling are used. Secondly, while the second peak could potentially have a greater magnitude than the first peak, it is very unlikely that the second harmonic is greater than the fourth harmonic, due to increased separation.

Code

The first step to coding an auto tuner is to create a set of target notes in order of their occurrence. How this set is formatted is based on the windowing size and tempo. These concepts will be explored more later in the code.

%frequencies in Hz
A1=110;
Bf=117;
B=123;
C=131;
Df=139;
D=147;
EF=156;
E=165;
F=175;
Gf=185;
G=196;
Af=208;
A2=220;
notes=[E E E E E E E E E E D D D D D D D D D D C C C C C C C C C C ];

The next step is to record the audio. This example uses Matlab's audiorecorder function. Ideally, the length of audio recorded should allow for a completed window for every target note.

n1=length(notes);
fs=8000;
recObj =audiorecorder(8000, 16, 1);
recordblocking(recObj, n1*.1);
audiototal = getaudiodata(recObj);

Once both the target notes and the audio have been inputted, the audio should be analyzed. The audio is examined in a windowed format. This means the audio is broken up into smaller segments in which the pitch of the singer is unlikely to change. This implementation used .1 second window.

S(:,q)=x(1+800*(q-1):800*q);
%windows the total audio into segments of 800

The time length of the window is found by dividing the window length by the sampling frequency. This means for a 1 second long note, ten of that note are needed in the target note set. These windows are examined to determine the energy of the window, the number of zero crossings, and the linear predictive coefficients.

energy(q)=var(S(:,q));
znum(q)=zero_cross(S(:,q));
lpc=mylpc(S(:,q),15);

The final step is to reconstruct the windows at a the target frequency. The first step in reconstruction is to decide if the window represents voiced or unvoiced sound. In this implementation, the sound was considered voiced if the number of zeros crossings it contained was less than a quarter of the window length. The values in the array NP were derived from the target frequency.

    if VU(q)==1
        exc=exciteV(800,NP(w));
    elseif VU(q)==0;
        exc=exciteUV(800);
    end

The impulse train was then combined with the LPC coefficients using Matlab's filter function.

s = filter(1,[1 -(A(:,w)')],exc);

Finally the sound is rescaled to match the energy of the original and then each window is concatenated to the output array.

     e=var(s);
     s=s*(energy(w)/e);
    output=[output s];

This output can then be played using Matlab's soundsc function. The full code for this implementation can be found after the References section.

Further Improvements

This implementation of an auto tuner has two major areas that need improvement to increase usability. The first of these problems is the modeling of the output of the lungs as an impulse train. This simplification allows for easier implementation, however it also causes the output to sound very sporadic and robot like. By using a more complex excitation, the output can be made to sound more like the input sound. Secondly, the auto tuner can be changed to act in real time. To implement this Matlab would need to use several audiorecorder functions. Each function takes approximately .06 seconds to run outside of the window recording sound. This means there will be a .06 second window between audio inputs. When those audio inputs need to be between .1 and .3 seconds long to capture the note, the .06 pause is substantial enough to be detrimental to the output. Therefore it is likely that a real time version would need to use a program other than Matlab. An example of these two improvements is the following.
https://www.youtube.com/watch?v=JB7jSFeVz1U



References

Michael Peimani, Pitch Correction for the Human Voice, June 10, 2009, http://dave.ucsc.edu/physics195/thesis_2009/m_peimani.pdf.
Purdue ECE 438, ECE438 - Laboratory 9:Speech Processing (Week 2), October 6, 2010, https://engineering.purdue.edu/VISE/ee438L/lab9/pdf/lab9b.pdf.

Complete Code
Code is currently programmed for Mary Had a Little Lamb at 60 beats per second.

A1=110;
Bf=117;
B=123;
C=131;
Df=139;
D=147;
EF=156;
E=165;
F=175;
Gf=185;
G=196;
Af=208;
A2=220;
notes=[E E E E E E E E E E D D D D D D D D D D C C C C C C C C C C D D D D D D D D D D E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D E E E E E E E E E E G G G G G G G G G G G G G G G G G G G G E E E E E E E E E E D D D D D D D D D D C C C C C C C C C D D D D D D D D D D E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E D D D D D D D D D D D D D D D D D D D D D E E E E E E E E E E D D D D D D D D D D C C C C C C C C C C];
n1=length(notes);
fs=8000;
recObj =audiorecorder(8000, 16, 1);
recordblocking(recObj, n1*.1);
audiototal = getaudiodata(recObj);
next=1;
for q=1:length(notes)
    Np(q)=floor(fs/notes(next));
    next=next+1;
end
[VU,energy,A,n]=compression(audiototal);
output=synthesis(VU,energy,A,Np,length(audiototal));
soundsc(output,fs)
 
function [VU,energy,A,n]=compression(x)
n=length(x);
for q=1:floor(n/800)
    S(:,q)=x(1+800*(q-1):800*q);
    energy(q)=var(S(:,q));
    znum(q)=zero_cross(S(:,q));
    n=length(S(:,q));
    if znum(q)<n/4
        VU(q)=1;
    else
        VU(q)=0;
    end
    b=mylpc(S(:,q),15);
    b=b';
    A(:,q)=b;
end
 
function output=synthesis(VU,energy,A,NP,n)
output=[];
q=floor(n/800);
for w=1:q
    if VU(q)==1
        exc=exciteV(800,NP(w));
    elseif VU(q)==0;
        exc=exciteUV(800);
    end
    s = filter(1,[1 -(A(:,w)')],exc);
     e=var(s);
     s=s*(energy(w)/e);
    output=[output s];
end
 
function coef=mylpc(x,P)
N=length(x);
for m=1:P+1
    for n=1:N-m-2
        component(n)=x(n)*(x(n+(m-1)));
    end
    rss(m)=1/N*sum(component);
    clearvars component
end
l=length(rss);
for q=1:P
    rs(q)=rss(q+1);
end
for e=1:P
    for w=1:P
        Rs(w,e)=rss(abs(w-e)+1);
    end
end
coef=rs/Rs;
 
function y=zero_cross(x)
n=length(x);
y=0;
for q=1:n-1
    if (x(q))>0
        if x(q+1)<0
            y=y+1;
        end
    end
    if (x(q))<0
        if x(q+1)>0
            y=y+1;
        end
    end 
end
 
function x=exciteV(N,Np)
x=zeros(1,N);
for q=0:floor((N-1)/Np)
    x((q)*Np+1)=1;
end
 
function x=exciteUV(N)
x=zeros(1,N);
if mod(N,2)
    x(2:2:N)=1;
else
    x(2:2:N+1)=1;
end
 
function f=frequencyfinder(x)
x=abs(x);
n=length(x);
x1=x(1:2:n);
x2=x(1:4:n);
x3=x(1:8:n);
n3=length(x3);
x1=[x1',zeros(1,n3*4)];
x2=[x2',zeros(1,n3*6)];
x3=[x3',zeros(1,n3*7)];
xt=x'.*x1.*x2.*x3;
c=max(xt);
i=find(xt==c);
w=0:24000/n:24000-24000/n;
f=w(i);

Alumni Liaison

Questions/answers with a recent ECE grad

Ryne Rayburn