Fast Fourier Transform (FFT) and C Implementation Using the Octave GNU Tool
1. Abbreviation
DSP – Digital Signal Processing
DFT – Discrete Fourier Transform
IDFT – Inverse Discrete Fourier Transform
FFT – Fast Fourier Transform
FIR – Finite Impulse Response
IIR – Infinite Impulse Response
2. Introduction
Often in Digital Signal Processing applications it is necessary to estimate the signal spectrum, or vice versa, knowing the signal spectrum calculate the signal itself. For this purpose digital technology uses the Discrete Fourier Transform (DFT) and Inverse Discrete Fourier Transform( IDFT), which are described by formulas (1) and (2):
where n = 0…N1 signal index and k = 0…N1 spectrum index
These transformations are quite timeconsuming and to understand this, we will describe (1) in more detail:
From formulas (3) it follows:

Multiplications are made in (N1) equations and each equation contains (N1) product, i.e. DFT (IDFT) requires
complex multiplications. Now consider what complex multiplication means:
Complex multiplication is 4 real multiplications and 2 real additions!
Note: Equality (4) is true for N >>1 
The addition is performed in N equations and each equation contains (N1) additions, i.e. DFT (IDFT) requires:
complex additions. Complex addition is 2 real additions. See the (7):
Note: Equality (6) is true for N >>1
There are algorithms that simplify the calculation of DFT (IDFT), such algorithms are called Fast Fourier Transform (FFT). The idea is that the existing input array is split into shorter subsequences, for which the Fourier Transform is calculated. For example, we divide the original sequence of length N into two subsequences with length N/2 each. Using (4) and (6) we obtain for each subsequence:
Obviously, by halving the size of the array, the computational load of the algorithm is reduced fourfold! This is the principle that FFT uses.
In the future array length will be used:
This restriction (9) does not lead to complications, if our array has a different length, then we can expand the array according to (9) and fill the new elements of the sequence with zeros.
There are two most commonly used FFT algorithms:

Fast Fourier Transform with Decimation In Time. This algorithm was developed by Cooley and Tukey

Fast Fourier Transform with Decimation In Frequency. This algorithm was developed by Sande and Tukey
Next, consider these two methods.
Notes:

Usually for complex exponents the following notation is used, which simplifies the writing of formulas:

More detailed description of FFT see in [1], [2]
3. Decimation In Time FFT. Algorithm from CooleyTukey
We divide the input sequence x(n) of length
into two subsequences consisting of even and odd elements:
where n = 0, 1, …, (N/2)1
Then (1) can be rewritten using (10) and (11):
Using the periodicity of the spectrum:
where k = 0…(N/2) 1
As well as the formula:
where k = 0…(N/2) 1
Using (13) and (14), the formula (12) can be rewritten:
Formula (15) relates the spectra of the original sequence x(n) and the subsequences
with even and odd indices.
Often this basic operation, described by formula (15), is represented as a graphical butterfly (flow graph), shown in Figure 31.
Figure 31: Butterfly of the Decimation In Time FFT
We divide the already obtained subsequences
into new ones consisting of even and odd elements. Continuing this process, we will come to a twopoints FFT. See the Figure 32.
Figure 32: Division the Input Sequence x(n) for the Decimation In Time FFT
Now using (15) and starting with the twopoints Fourier Transform, we can build a diagram Decimation In Time FFT. See the Figure 33.
Figure 33: 16 points of the Decimation In Time FFT
Now we will write this FFT algorithm in steps:

Bit Reversal Permutation. The Figure 32 and Figure 33 are shown that the input array must be reordered: x(0), x(8), x(4), x(12),…x(7), x(15). This corresponds to the bit reversal of the sequence indices. Below in Table 31 is an example for a 4bit index for a 16point FFT.
Table 31: Bit Reversal Example for the 4 Bit Number (16 Point FFT)The binary number is written in reverse order (vice versa). Obviously, the bit reversal of the number 0 remains 0, as well as the bit reversal of (N1)= 11…1(binary) remains N1 (in Table 31: N1 = 15 = 1111binary). The bit reversal operation can be performed using the Rader algorithm (in the literature it is often also called the GoldRader algorithm). The next bit reversal value is calculated from the current bit reversal value. For a fourbit number:
Rader_Algorithm(0) = 8
Rader_Algorithm(8) = 4
Rader_Algorithm(4) = 12
Rader_Algorithm(12) = 2
…
Rader_Algorithm(7) = 15Here, one is added to the highest bit and an addition operation is performed to the number written in reverse. This algorithm is implemented in fft_integer.c/h and fft_float.c/h, in function: rader_bit_reversal_permutation(…)

To implement FFT, we need to use complex exponents, so here we should talk about the properties of complex exponents. One of these properties has already been mentioned in formula (14). So the important properties of the W function are:
To implement FFT, we can calculate and store N/2 exponents values in constant memory:
The FFT is implemented in fft_integer.c/h with N/2 exponents.Another method uses (20) to calculate the next value of the complex exponent using a rotation factor. Consider this method using the example of N=16 point FFT, see the Figure 33. In the first step, when calculating the twopoint FFT, only one value is required
In the second step, when calculating fourpoint FFTs, two exponents values are needed
The second value can be obtained by multiplying the previous value by rotation factor
Then:
In the third step, when calculating eightpoint FFTs, four values are needed
Each next value can be obtained by multiplying the previous one by rotation factor
Then:
In the fourth step, when calculating the sixteen point FFTs, eight values are needed
Each next value can be obtained by multiplying the previous one by rotation factor
Then:
Thus, for this method, we need to save
rotation factors in the constant memory
The FFT is implemented in fft_float.c/h with the
exponents. 
The implementation of the FFT algorithm consists the three nested loops:


the main loop with the length
for all 2, 4, …, N points FFTs: 
the second loop with the length current_points/2 (for example: 2 points FFT => 1; 4 points FFT => 2; … ; N points FFT => N/2) for all current_points FFTs

the third loop with the length (N/current_points) for all current_points FFTs only for one complex exponent

4. Decimation In Frequency FFT. Algorithm from SandeTukey
This method is based on the other way of splitting the original sequence into subsequences.
where n = 0, 1, …, (N/2)1
Then (1) can be rewritten using (10) and (23):
We can write for X(2k) and X(2k+1) where k=0…(N/2)1:
Formulas (25) and (26) define two new subsequences
with length N/2, which are calculated using formulas:
where n=0…(N/2)1
In this case, the spectrum
consists of even values of the spectrum of the original sequence x(n), and the spectrum
consists of odd values of the spectrum of the original sequence x (n).
The basic operation described by formulas (27) and (28) can be represented as a graphical butterfly (flow graph), shown in Figure 41.
Figure 41: Butterfly of the Decimation In Frequency FFT
Using the same formulas (27) and (28) to subsequences
we will get four subsequences
with length N/4 each. Continuing this process, we will come to subsequences with length 1. These obtained values are the spectrum of the original sequence x(n), mixed up in accordance with the bit reversal. See the Figure 42.
Figure 42: Division the Input Sequence x(n) for the Decimation In Frequency FFT. The spectrum of the obtained subsequences is shown on the figure
Now we can build a diagram of the Decimation In Frequency FFT. See the Figure 43.
Figure 43: 16 points of the Decimation In Frequency FFT
The implementation of this algorithm is similar to the implementation described in the previous section. Only here other butterfly is used and the array is rearranged according to the bit reversal at the end of the algorithm. The Decimation In Frequency FFT is done in the fft_float.c/h and fft_integer.c/h
5. Realization of the Inverse Fourier Transform Using the FFT Algorithm
The FFT implementation corresponds to the Discrete Fourier Transform (DFT). It is often necessary to perform an Inverse Discrete Fourier Transform (DFT). Consider how to perform the inverse Fourier transform using the FFT algorithm.
Let’s introduce notation for complexconjugate operation with *:
Then using (29), we can rewrite IDFT (2) as:
If x(n) is real signal then (30) can be written in more simple form:
Now we will describe the IDFT algorithm in steps:

Calculate complex conjugate values for the input array X(k)

Start the FFT algorithm

Divide the resulting output array by N

If x(n) is a complex signal, then we calculate the complex conjugate values for the output array. Otherwise, the algorithm ends at the third step
Note: IDFT is used in the fir_filter_using_fft ( … ) function, and in main(…) as a test in the fft_float.c/h and fft_integer.c/h
6. FFT Algorithm Complexity Estimation
The computational complexity of both FFT algorithms is the same. Using the Figure 33 and 43 we can write the number:

complex multiplications
It should be noted here that the number of multiplications is actually less than that given by formula (32), since the values of:
and in this case, no multiplications are made 
complex additions and subtractions
Note:
Two methods have been described of the using complex exponents for FFT. The second method uses rotation factors, that require less exponential values to be stored in memory . But in this case, we have extra complex multiplications:
7. FIR Filter. Fast Convolution using FFT
In the article “FIR Filter Implementation Using Octave GNU Tool and C Language“ is described the Fast Convolution algorithm, which relies on the DFT and IDFT method. In the fft_float.c/h is realized the function fir_filter_using_fft (…) of the FIR Fast Convolution.
Consider the fast convolution method using FFT. We introduce the following notation:
K – the FIR order of the filter or the pulse response length
M – length of one block at the filter input
Description of the FIR Fast Convolution from the article “FIR Filter Implementation Using Octave GNU Tool and C Language“:

Add zeros to the impulse response: h (0), h(1),…, h(K1), 0,…, 0 so that the array size is N=K + M – 1, where M is the size of the input signal block. Then using FFT algorithm, the DFT is calculated: H(k)

Take a block, for example iblock of input data of length M. We add zeros to this data:
to get an array of length K+M1, where K is the size of the impulse response of the FIR filter. Using FFT algotithm, we calculate the DFT: Xi(k) 
Multiply the Yi(k)=Xi(k)H(k). Calculate the IDFT using FFT from the Yi(k) and get the yi(n)
The traditional method of calculating the FIR filter is performed using the formula (35).
Now let’s estimate the computational complexity of both methods: (35) and fast convolution. Let’s assume that we have a real (not complex) signal at the FIR filter input. We will compare the number of multiplications used by these algorithms.
To calculate the FIR filter using the formula (35) for an input block of size M:
Fast Convolution использует FFT and the N value is selected as a power of two:
This algorithm uses the DFT and IDFT, which include
complex multiplications. Multiplication of the spectra is also used, this is N more complex multiplications. Each complex multiplication corresponds to four multiplications of real numbers. Then:
It makes sense to apply FIR Fast Convolution when the multiplications number (36) are greater than (38).
Example:
FIR filter K=60
Choose the block size M=197, then:
In this example the Fast Convolution has a gain in multiplications number: 9216 < 11820
Let’s build two graphs for this example using (36) and (38). See the Figure 71
Figure 71: Number of multiplications for fast convolution (red) and FIR filter (blue) using (35)
Notes:

If the input sequence x(n) is complex, then (36) will correspond to complex multiplication and in this case the efficiency of fast convolution increases

FIR filter is usually used with a linear phase. This is achieved by a symmetric or antisymmetric impulse response. Using this property, the number of multiplications can be reduced in (35) by half, i.e. the formula (36) can be rewritten:
In this case, (35) may be more efficient than fast convolution. Therefore, it is important to always make a comparative analysis and choose the best method
8. CSource Code fft_float.c/h
The file fft_float.c/h is used the float arithmetic.

Defines and constants:
#define FFT_LOG2_POINTS_NMB 5u – FFT size definition. Hier is 2^5=32 FFT points
flt_complex_nmb_t flt_complex_exp_table[FFT_LOG2_POINTS_NMB] – FFT rotation exponent factors. The table is prepared from the Octave GNU Tool, see the FloatExp2file( )

Private Function
/*
Rader’s Bit Reversal Permutation
Example: 001 => 100
Input
uint16 previous_index – previous bit reversal value
uint16 number – FFT number = 2^M
Return
uint16 – next bit reversal value
*/
uint16 rader_bit_reversal_permutation(uint16 previous_index, uint16 number) 
Export Functions
/*
Complex Multiplication
(a1+jb1)(a2+jb2)= a1 x a2 – b1 x b2 + j(a1 x b2 + a2 x b1)
Input
flt_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand_1
flt_complex_nmb_t* ptr_complex_nmb2 – pointer on the operand_2
Return
flt_complex_nmb_t – (operand_1 x operand_2)
*/
flt_complex_nmb_t flt_complex_mult(flt_complex_nmb_t* complex_nmb1, flt_complex_nmb_t* complex_nmb2)/*
Complex Addition
a1+jb1 + a2+jb2 = a1+a2 + j(b1+b2)
Input
flt_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand_1
flt_complex_nmb_t* ptr_complex_nmb2 – pointer on the operand_2
Return
flt_complex_nmb_t – (operand_1 + operand_2)
*/
flt_complex_nmb_t flt_complex_add(flt_complex_nmb_t* complex_nmb1, flt_complex_nmb_t* complex_nmb2)/*
Complex Subtraction
(a1+jb1) – (a2+jb2) = a1a2 + j(b1b2)
Input
flt_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand_1
flt_complex_nmb_t* ptr_complex_nmb2 – pointer on the operand_2
Return
flt_complex_nmb_t – (operand_1 – operand_2)
*/
flt_complex_nmb_t flt_complex_sub(flt_complex_nmb_t* complex_nmb1, flt_complex_nmb_t* complex_nmb2)/*
Complex Conjugate
conjugate(a + jb) = a – jb
Input
flt_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand
Return
flt_complex_nmb_t – conjugate(operand)
*/
flt_complex_nmb_t flt_complex_conjugate(flt_complex_nmb_t* complex_nmb)/*
Fast Fourier Transform (FFT), Decimation In Time, CooleyTukey
Input
flt_complex_nmb_t* ptr_input_array – input array
flt_complex_nmb_t* ptr_exp_array – pointer on the complex exponential function
uint16 log2_size – FFT size: N = 2^log2_size
Return
void
FFT result in the ptr_input_array
*/
void flt_fft_decimation_in_time(flt_complex_nmb_t* ptr_input_array, flt_complex_nmb_t* ptr_exp_array, uint16 log2_size)/*
Fast Fourier Transform (FFT), Decimation In Frequency, Sande–Tukey
Input
flt_complex_nmb_t* ptr_input_array – input array
flt_complex_nmb_t* ptr_exp_array – pointer on the complex exponential function
uint16 log2_size – FFT size: N = 2^log2_size
Return
void
FFT result in the ptr_input_array
*/
void flt_fft_decimation_in_frequency(flt_complex_nmb_t* ptr_input_array, flt_complex_nmb_t* ptr_exp_array, uint16 log2_size)/*
FIR Filter using FFT: Fast Convolution
Input
float* fir_coeff_ptr – pointer on the filter coefficients
uint16 fir_order – FIR order or coefficients number
sint16* input_ptr – pointer on the input signal array
uint16 input_length – input array length
sint16 output_ptr – pointer on the output array. Note output array shall be cleared: output = {0}
uint16 log2_size – FFT size: N = 2^log2_size
Return
uint8 – TRUE is OK: FIR filter is done / FALSE is NOT OK: FFT size is not suitable for the fir order FIR output in the output_ptr array
*/
uint8 fir_filter_using_fft(float* fir_coeff_ptr, uint16 fir_order, sint16* input_ptr, uint16 input_length, sint16* output_ptr)int main(void) – only test
9. CSource Code fft_integer.c/h
The file fft_integer.c/h is used the fixed point arithmetic.

Defines and constants
#define FFT_LOG2_POINTS_NMB 5u – FFT size: 2^5=32 FFT points
#define EXP_LOG2_FACTOR 15u – Factor: 2^15=32768 => 32767*exp(j*2*pi*n/N)
int_complex_nmb_t int_complex_exp_table[FFT_POINTS_NMB_DIV2] – FFT: N/2 exponents 
Private Functions:
/*
Rader’s Bit Reversal Permutation
The Algorithm has two names: Rader or GoldRader
Example: 001 => 100
Input
uint16 previous_index – previous bit reversal value
uint16 number – FFT number = 2^FFT_LOG2_POINTS_NMB
Return
uint16 – next bit reversal value
*/
uint16 rader_bit_reversal_permutation(uint16 previous_index, uint16 number)/*
Number Limit: sint32 to sint16
Input
sint32 number – input value
Return
sint16: SINT16_MIN_NEGATIVE_NMB <= output_number <= SINT16_MAX_POSITIVE_NMB
*/
sint16 number_limit(sint32 number) 
Export Functions
/*
Complex Multiplication
(a1+jb1)(a2+jb2)= a1 x a2 – b1 x b2 + j(a1 x b2 + a2 x b1)
Multiplication shall be done with the A*exp(j*2*pi*n/N)
and is used the round and 2^EXP_LOG2_FACTOR normalization
Input
int_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand_1
int_complex_nmb_t* ptr_complex_nmb2 – pointer on the operand_2
Return
int_complex_nmb_t – (operand_1 x operand_2)
*/
int_complex_nmb_t int_complex_mult(int_complex_nmb_t* complex_nmb1, int_complex_nmb_t* complex_nmb2)/*
Complex Addition
a1+jb1 + a2+jb2 = a1+a2 + j(b1+b2)
Input
int_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand_1
int_complex_nmb_t* ptr_complex_nmb2 – pointer on the operand_2
Return
int_complex_nmb_t – (operand_1 + operand_2)
*/
int_complex_nmb_t int_complex_add(int_complex_nmb_t* complex_nmb1, int_complex_nmb_t* complex_nmb2)/*
Complex Subtraction
(a1+jb1) – (a2+jb2) = a1a2 + j(b1b2)
Input
int_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand_1
int_complex_nmb_t* ptr_complex_nmb2 – pointer on the operand_2
Return
int_complex_nmb_t – (operand_1 – operand_2)
*/
int_complex_nmb_t int_complex_sub(int_complex_nmb_t* complex_nmb1, int_complex_nmb_t* complex_nmb2)/*
Complex Conjugate
conjugate(a + jb) = a – jb
The function is used for the Inverse FFT
Input
int_complex_nmb_t* ptr_complex_nmb1 – pointer on the operand
Return
int_complex_nmb_t – conjugate(operand)
*/
int_complex_nmb_t int_complex_conjugate(int_complex_nmb_t* complex_nmb)/*
Fast Fourier Transform (FFT), Decimation In Time, CooleyTukey
Input
int_complex_nmb_t* ptr_input_array – input array
int_complex_nmb_t* ptr_exp_array – pointer on the complex exponential function
uint16 log2_size – FFT size: N = 2^log2_size
Return
void
FFT result in the ptr_input_array
*/
void int_fft_decimation_in_time(int_complex_nmb_t* ptr_input_array, int_complex_nmb_t* ptr_exp_array, uint16 log2_size)/*
Fast Fourier Transform (FFT), Decimation In Frequency, Sande–Tukey
Input
int_complex_nmb_t* ptr_input_array – input array
int_complex_nmb_t* ptr_exp_array – pointer on the complex exponential function
uint16 log2_size – FFT size: N = 2^log2_size
Return
void
FFT result in the ptr_input_array
*/
void int_fft_decimation_in_frequency(int_complex_nmb_t* ptr_input_array, int_complex_nmb_t* ptr_exp_array, uint16 log2_size)int main(void) – only test
10. Octave GNU Tool Script FFT_Support.m
The FFT_Support.m is supported the Csource codes.

Constants:
fft_power = 5; % 2^5=32 FFT Points
fir_coeff_original = […]; % Fast Convolution support
F_sample_in_Hz = 12e3; % Fast Convolution support 
Functions:
% FFT Float Complex Exponent Rotation Table to File
% Support CCode
% See the fft_float.c
% Array flt_complex_exp_table
% Input parameters:
% fftPower – FFT power: N=2^fftPower
% FileNameString – output file name
function FloatExp2file(fftPower, FileNameString)% FFT Integer Complex Exponent Table to File
% Support CCode
% See the fft_integer.c
% Array int_complex_exp_table
% Input parameters:
% fftPower – FFT power: N=2^fftPower
% FileNameString – output file name
function IntegerExp2file(fftPower, IntegerFactor, FileNameString)% FFT Float Test Signal Table to File
% Support CCode
% See the fft_float.c
% Array flt_test_signal1/flt_test_signal2
% Input parameters:
% fftPower – FFT power: N=2^fftPower
% FileNameString – output file name
% Output:
% txt file with the signal data
% signal_sum = 10sin(2*pi*n/N) + 10sin(2*pi*n/N)
function signal_sum = FloatTestSignal2file(fftPower, FileNameString)% FFT Integer Test Signal Table to File
% Support CCode
% See the fft_integer.c
% Array int_test_signal1/int_test_signal2
% Input parameters:
% fftPower – FFT power: N=2^fftPower
% FileNameString – output file name
% Output:
% txt file with the signal data
% signal_sum = 512*sin(2*pi*n/N) + 512*sin(2*pi*n/N)
function signal_sum = IntegerTestSignal2file(fftPower, FileNameString)%————————–
% Fast Convolution Support
% FIR Low Pass Filter
%————————–% Integer Test Signal to File
% Support CCode
% Input parameters:
% SignalAmpl – amplitude of the test signal
% SignalFreqInHz – frequency of the test signal in Hz
% SignalLength – table length
% FsampleInHz – sampling frequency in Hz
% FileNameString – output file name
% Output parameters:
% TestSignal – test signal vector
% txt file
function TestSignal = IntegerTestSignals2file(SignalAmpl, SignalFreqInHz, SignalLength, FsampleInHz, FileNameString)% Integer Table to File
% Support CCode
% Input parameters:
% ParameterVector – data array
% FileNameString – output file name
function IntegerParamVector2file(ParameterVector, FileNameString)% Float Table to File
% Support CCode
% Input parameters:
% ParameterVector – data array
% FileNameString – output file name
function FloatParamVector2file(ParameterVector, FileNameString)
11. Download the FFT_article.zip
You can download the files:
fft_float.c/h
fft_integer.c/h
FFT_Support.m
with the button:
12. Conclusions and Notes

FFT using fixed point arithmetic requires additional control over overflows. The most dangerous situation occurs if the signal contains sinusoidal signals. In this case, the output values can be amplified by a factor of N/2

To improve the accuracy of calculations, use the rounding rule. This is especially important to provide for the fixed point arithmetic case

The Float arithmetic for FFT algorithm is effective for microprocessors that have a float unit. In this case, commands on floating numbers are supported at the assembly language (machine code) level and do not require the use of heavy library functions

Effective use of FFT often requires the implementation of the algorithm in Assembly language, which significantly speeds up the execution of the algorithm. This is because most DSP microprocessors support special commands and special addressing that increases the speed of FFT and other DSP algorithms. For example, reverse bit addressing is usually supported

To calculate the FFT, it is sometimes necessary to calculate the complex exponents in real time. After executing the FFT algorithm, we need to calculate the logarithmic amplitude characteristic, as well as the phase characteristic. To do this, we need to be able to quickly calculate functions:

sine, cosine

logarithm

square root

arctangent

See the article: “Functions Implementation: Square Root, Logarithm, Sine, Cosine, Arctangent. C Code and Octave Script.“

For shortterm Fourier analysis, the signal is divided into finite sections, and then the spectrum is calculated for each fragment of the signal. In this case, there is a spectrum distortion caused by the time limit of the signal, called Gibbs phenomenon. To combat these distortions, special windows have been developed: Hamming, Hanning, Kaiser and so on. See the article “Goertzel Algorithm and C Implementation Using the Octave GNU Tool”

In my article the only case was considered:
In this case the maximum efficiency of the FFT is obtained.
However, there is DFT optimization algorithm for the case:
See [1] for a description of this algorithm.
13. Literature / References
[1] L.Rabiner, B.Gold “Theory and application of digital signal processing“, PrenticeHall, Ing Englewood Cliffs, New Jercy 1975
[2] A.Oppenheim, R.Schafer “DiscreteTime Signal Processing“, published by Pearson Education, Inc, publishing as Prentice Hall, 1999