In this paper we investigate quantitatively the effect of data gaps for four
methods of estimating the amplitude spectrum of a time series: fast Fourier
transform (FFT), discrete Fourier transform (DFT),

Spectral analysis is a widely used tool in data analysis and processing in most fields of science. The technique became very popular with the introduction of the fast Fourier transform algorithm which allowed for an extremely rapid computation of the Fourier Transform. In the absence of modern supercomputers, this was not just useful, but also the only realistic solution for such calculations. This limitation is no longer relevant except for data sets of extremely large size. Still, FFT has remained the most popular tool for spectral analysis, because it is both easy to use and very fast. This makes it an extremely powerful tool and generally it is the first choice from the “toolbox” of spectral analysis methods. It is readily available in all programming languages of notice and it is accurate under perfect conditions. However, real data is rarely perfect. We address here the very common problem of data gaps.

Due to the wide usage of FFT, the literature on this subject is spread over a
wide array of scientific disciplines: astronomy

The problem of data gaps, i.e., occasionally missing observations from an
otherwise continuous set of measurements, can be viewed as a special case of
the more general problem of non-uniformly sampled data, i.e., measurements
made at random time intervals throughout the data set. Although our study is
mostly concerned with data gaps, all the methods used in our paper can also
be straight-forwardly applied to problems related to non-uniform sampling. A
recent review of the methods for spectral analysis of non-uniformly sampled
data is presented in

Data gaps can occur in a variety of ways, from temporary malfunctions of the
measuring instrument, to the inherent non-continuous nature of the physical
phenomenon being studied.

Many papers dealing with spectral analysis of real data often avoid the
problems related to data gaps by subdividing the time series into smaller
samples with continuous data coverage. One such paper is the study by

In the case of multiple data gaps distributed throughout the data set and/or
irregular time sampling, where avoiding data gaps is not an option, one can
either use data reconstruction (interpolation) and FFT, or more advanced
spectral analysis methods designed to handle non-uniform sampling.
Reconstruction can be achieved with a simple linear interpolation across the
gaps (which is one of the methods used in our study), or with more advanced
reconstruction techniques, like the method based on singular-spectrum
analysis (SSA)

We intend to quantify the distortions introduced by data gaps using four
popular methods of estimating the frequency spectrum: fast Fourier transform,
discrete Fourier transform,

We use a synthetic time series consisting of several periodic signals and
added noise. Different, frequently encountered data gap configurations are
applied to it and the amplitude spectra are calculated and compared to the
known spectrum of the unaltered signal. Two gap configurations are used to
highlight the effects: (a) a single large gap, where the original data set is
altered by removing an increasing number of points from the central part, and
(b) multiple small gaps, where we remove short series of consequent points
whose length and precise location are randomly selected. The same methodology
is also used to test the effect of data gaps on the amplitude spectra of
magnetic field measurements made by the

The paper is structured as follows: a description of the general methodology
and methods is presented in Sect. 2; Sect. 3 shows the results for a
synthetic data set comprising of four sinusoidal signals, and Sect. 4
applies the same methodology on a real data set of

The fast Fourier transform is extremely fast to calculate, but requires strictly uniform sampling. It is by far the most popular method for computing the frequency spectrum. It is sometimes used on non-uniformly sampled data, first using linear interpolation to fill in the data gaps. Linear interpolation alters the signals, but the FFT is still able to capture an acceptable level of spectral details, depending on the size and number of gaps. This is demonstrated quantitatively and qualitatively in Sects. 3 and 4.

The effect of linear interpolation can be derived analytically from first
principles. For a given signal

The discrete Fourier transform is a discretization of the Fourier integral of
Eq. (1), which we chose to do using the trapezoidal method:

The

The Lomb-Scargle method performs a least squares fit of the data using a
superposition of sinusoidal modes

By applying the four methods above we obtain a complex amplitude spectrum

By squaring the one-sided amplitude spectrum one can obtain the periodogram, which is a non-parametric estimate of the power spectral density (PSD). Our methodology of systematically comparing the spectra for the signals with gaps with the spectrum for the original, uniformly sampled signal, allows us to use only the simple one-sided spectrum to study the effect of data gaps. While more comprehensive methodologies of estimating the PSD do exist, they are beyond the scope of this study. In the following, we will use the term amplitude spectrum to denote the one-sided amplitude spectrum.

While DFT, ZTR and LST are equivalent to FFT for uniform sampling, they can provide, as we will show later, very different results when analyzing time series with data gaps.

Note that the issues due to data gaps are not limited to a single method, but
are a fundamental property of the resulting amplitude spectrum. For a
uniformly sampled data set, there is an orthogonal set of frequencies for
which the values of the Fourier coefficients are independent. In the case of
non-uniformly sampled data, such an orthogonal set generally does not exist

In order to determine the effect introduced by data gaps on the Fourier transform, we test the four above-mentioned spectral analysis methods on a synthetic signal with artificial gaps. The FFT method is applied to a signal where the gaps are linearly interpolated, while DFT, ZTR and LST are applied to a signal with no data interpolation. We test two configurations of gaps: (a) a single large gap (SLG), based on the alteration of the original data set by removing an increasing number of points from the central part; the procedure is repeated until the central gap reaches 99.8 % of the total length of the signal, and (b) multiple small gaps (MSG), based on the alteration of the signal by removal of short series of consequent points whose length and precise location are randomly selected; the procedure is applied repeatedly for various distributions of random gaps. The two cases were studied first on a synthetic signal sampled uniformly.

The synthetic data set is a superposition of four sinusoidal modes with unit
amplitude and the following frequencies:

In this case the data gap is created by removing a number of points from the
central part of the synthetic signal. We generate 100 signals from
Eq. (9) on which we generate a
single gap with size varying from 1 to 99.8 % of the length of the
original time series. As an example, we illustrate in Fig. 2 the performance
of the four methods on a signal whose gap is 50 % of the length of

Case study for the SLG test applied to synthetic data. Panel

The accuracy of the amplitude spectrum computed with DFT is sensitive not only to the size of the central gap but also to the phase at the two end points of the gap. If at least one end point has a value different from the mean value of the signal (which is zero in the case of our synthetic signals) then the results are distorted. This distortion is also seen in Fig. 2, where the DFT amplitudes depict a very large background level (even above one), and the four signals barely rise above the background. This distortion is larger for large deviations of the two end points from the mean.

Since we are interested mainly in the effect of the gap size, we apply a
Tukey (tapered cosine) window

Case study for the SLG test applied to the synthetic data. Panel

Figure 2 shows that the two other spectral analysis methods, ZTR and LST,
provide very accurately the same amplitude level as the original, full
data set, even when the gap is quite wide. We also see that the spectral
background level

Figures 4 and 5 show how the amplitudes change when the length of the central
gap is increased. Figure 4 shows the amplitude spectra obtained by the four
methods in color coding, with the y-axis giving the total gap percentage
(TGP) of the signal and the

The spectra obtained with ZTR and LST are quite different, but bear some similar features. As the size of the central gap increases, the signal amplitude remains almost constant for all the four spectral peaks. Moreover, Fig. 4 shows that the side lobes (the palm-tree) and the background noise have considerably larger amplitudes for ZTR and LST than for FFT and DFT.

In order to illustrate even more quantitatively the response of the four
methods to the increasing size of the central gap we have studied the change
for the first frequency,

SLG test applied to the synthetic data: amplitude in color code as a
function of TGP and frequency. For the SLG test, we generate 100 signals,
indexed from 1 to 100, with signal 1 corresponding to the signal with the
smallest TGP and signal 100 to the one with the highest TGP. For SLG, the TGP
is a linear function of signal index. Panels

SLG test applied to the synthetic data: amplitude of the 10 Hz
sinusoidal mode

SLG test applied to the synthetic data: integral amplitudes as a
function of TGP for the signal with noise (panel

Figure 6 shows the integral of the amplitude spectrum, i.e., the sum of all amplitudes, as a function of the TGP. In order to study the effect of spectral noise we calculated the integral not only for the synthetic signal given by Eq. (9), but also for a “clean” signal (the sum of sinusoids without noise) as well as for pure noise. We found that FFT and DFT behave in a similar way within the limit of small TGP, up to about 60 % (see earlier discussion). The integral corresponding to the clean signal has an almost constant value as we increase the gap size. For pure noise, the integral decreases with increasing gap size. For the signal contaminated with noise, the original signal given by Eq. (9), the result is a combination of these two cases, i.e., the integral slowly decreases with increasing TGP.

For the ZTR and LST methods, the integral increases as we increase the gap size not only for pure noise and the noisy signal, but also for the clean sum of sinusoids. We already observed in Fig. 3 that, for the 50 % TGP, the spectral background was larger than the original background. We see here that the background level increases systematically as we increase the gap size, eventually dominating the integral for large TGP and explaining the overall increase for these two methods. This result will have an important impact on the analysis of solar wind data, where dominant harmonics are less frequent and the spectral noise dominates the integral.

The second test performed on the synthetic signal of Eq. (8) consists of removing a number of randomly distributed points from the original signal. This test mimics the situation often encountered in the experimental investigation of various geophysical or space systems (e.g., ground-based measurement of the geomagnetic field or satellite measurement of the plasma and field parameters of the solar wind), where randomly distributed data gaps are an inherent problem.

We choose the size distribution of the gaps using the gamma function,
which can be described by two parameters: the shape parameter

As in the case of SLG, we calculate the Fourier spectra and index them
according to the selection of

Figure 7 shows an example of a distribution of gaps for

Case study for the MSG test applied to the synthetic data. The
format is similar to Fig. 2 and 3, except for panel

MSG test applied to the synthetic data:

The amplitude spectrum obtained with FFT shows that the amplitude of the sinusoidal modes decreases systematically with increasing frequency. The other three methods (DFT, ZTR and LST) are very robust for this configuration of gaps and show no major modification in the amplitude spectra compared with the original results (see Fig. 1).

The methodology outlined above was applied to an ensemble of 100 synthetic signals obtained by degrading the original signal by increasing the number of missing points according to the gamma distribution. The mean and variance of the distribution of gaps increase with signal index, and thereby, the total number of removed points increases, although not strictly linearly. Figure 8a shows the distribution of gap percentage as a function of signal index and gap size, and Fig. 8b shows for each of the 100 degraded signals the corresponding TGP. We note that this statistical ensemble of signals covers indeed a broad range of different possible configurations, relevant to investigate the response of the four spectral analysis methods.

MSG test applied to the synthetic data: amplitude spectra in color
code as a function of TGP and frequency for: FFT (panel

Figure 9 depicts the amplitude spectra of the 100 signals described above and in Fig. 8 as a function of TGP and frequency, similar to the SLG test of Fig. 4. One can see that the FFT amplitude decreases with frequency and with increasing TGP. The results for DFT, ZTR and LST show little decrease in amplitude and no frequency dependence when TGP increases.

Figure 10 shows the FFT amplitudes as a function of TGP separately for the
four frequencies

MSG test applied to the synthetic data: FFT amplitude as a function
of TGP for the four frequencies of the synthetic signal:

Figure 11 shows the analogue of Fig. 6 for the MSG test, i.e., the amplitude integral as a function of the TGP. As in Fig. 6, the amplitude integral for the original signal is removed in each case. There are interesting differences between Figs. 11 and 6 that are related to the different gap structure between the two cases, even for the same total TGP. In particular, there is a much larger integral amplitude in FFT, even for clean sinusoids, which is due to the increased background power. Figure 9 shows that the FFT amplitude at low frequencies increases with TGP. However, there is an increased background level even at lower TGP of about 40 % (see Fig. 9), where the integral amplitude attains its maximum, as seen in Fig. 11.

MSG test applied to the synthetic data: integral amplitudes as a function of TGP. The thick lines are the average values of the individual integral amplitudes. The format is identical to Fig. 6.

The solar wind is a supersonic, turbulent plasma stream released from the
upper atmosphere of the Sun. It is often considered to be the ideal turbulence
laboratory, due to its very large scale, compared to the usual Earth-based
laboratories, and to the large fleet of spacecraft actively investigating its
properties. Starting with the work by

We now apply the above methods to the magnetic field data from the

This testing method is identical to the one described above in Sect. 3.1. A
case study obtained by removing 50 % of the central part of the original
signal (TGP

A

Case study for the SLG test applied to the VEX data set. The format
is similar to Fig. 3. In addition, panels

SLG test applied to the VEX data set. Represented are the difference
of amplitude spectra (method – original) as a function of TGP and frequency
for:

The results for the ensemble of 100 signals with the TGP increasing linearly from 1 to 99.8 % (the procedure described in Sect. 3.1) are shown in Figs. 14 and 15 (analogues of Figs. 4 and 5). These figures illustrate the difference between the estimated spectra of the signal with gaps and the original spectrum. FFT and DFT show in Fig. 14 an overall decrease in amplitude with increasing gap size. However, due to the more complicated spectral content of the real signal, the results are not as clear as for the synthetic signal (Fig. 4). Also, due to the more complex nature of the signal, the windowing procedure is not very effective at large TGP values, and the edge effect will result in a large increase of DFT amplitudes for TGP above 80 %. ZTR and LST show an increase in amplitude with increasing gap size. Figure 14 shows the integral of the amplitude spectra as a function of TGP. We see here a pattern very similar to the one obtained for pure noise analysis (see Fig. 6b): a decrease in FFT and DFT (until about 60 % TGP) and an increase in ZTR and LST as we increase the gap size.

SLG test applied to the VEX data set: integral amplitudes as a function of TGP. Shown are the results for: FFT (red), DFT (green), ZTR (blue) and LST (magenta).

A case study for the VEX signal from which we have removed 49.4 % of the
points by introducing small gaps according to the gamma distribution is shown
in Fig. 16. We see that this gap configuration produces excessive power
over most of the frequency range for DFT, ZTR and LST. Only FFT produces an
amplitude spectrum that is close to the original one. One can notice a
threshold frequency at about

Case study for the MSG test applied to the VEX data set. The format
is similar to Fig. 7. Panels

MSG test applied to the VEX data set. The format is similar to Fig. 14.

Figures 17 and 18 show the results of the test for the ensemble of 100 signals with variable gap size distribution generated by the gamma function. Figure 17 shows the difference between the actual amplitude spectrum and the original spectrum. It gives us a synoptic view of the behavior of the four methods when degradation is increasing. Figure 17 shows that the FFT amplitude spectrum underestimates the original spectrum for frequencies higher than about 0.1 Hz.

MSG test applied to the VEX data set: integral amplitudes as a function of TGP. The format is similar to Fig. 15.

On the other hand, DFT, ZTR and LST overestimate the spectrum over most of the frequency range. A significant increase is detected for frequencies larger than 0.1 Hz, in agreement with the case study depicted in Fig. 16. Interestingly the increase seen in these three methods is frequency dependent, and the affected range of frequencies seems to depend on the actual distribution of the gaps and the power of the signal. These results indicate that, at least when the data includes gaps, FFT is the best method to approximate reliably the spectral slope of a signal recorded in a turbulent environment. Figure 18 illustrates the integral spectral amplitude (as difference to the original) as a function of the TGP. The integral gives a global measure of the spectrally differentiated behavior seen in Fig. 17. Figure 18 shows quantitatively the better agreement of the FFT amplitude with the original spectrum than the three other methods, even for large TGP.

We have analyzed the effect of data gaps on four commonly used spectral
analysis methods: the fast Fourier transform, the discrete Fourier transform,
the

In order to study the effect of data gaps and to mimic frequently encountered
gap configurations, we devised two tests: the single-large-gap (SLG) test,
which removes a number of consecutive points from the signal, and the
multiple-small-gaps (MSG) test, which removes a number of randomly
distributed gaps whose size was given by the gamma distribution. Both tests
include an ensemble of 100 signals with gap percentage increasing from 1 %
(4.3 %) to 99.8 % (83.6 %) in the case of the SLG test (MSG test,
respectively). The tests are applied to two data sets: a simple noisy
superposition of four sinusoidal modes and magnetic field measurements made
by the

Summary of the results of the comparative numerical studies with
various types of gaps applied to synthetic and real data from the

The DFT method is very sensitive to the phase at the two end points of the gap, giving very distorted results for large deviations of the end points from the mean value of the signal. To remove this distortion, the SLG test uses a windowing procedure, where we apply a window function on the two parts of the signal around the central gap, thus canceling the offset on either side of the gap (as well as at the start and end of the signal). The signal was windowed for all four methods, although the effect of windowing was minimal for the other three methods.

For FFT and DFT, the SLG test shows monotonically decreasing amplitudes of the sinusoidal modes, with increasing total gap percentage (TGP). For DFT this decrease continues only to the limit of TGP of about 80 %, where after the background level increases considerably because the windowing procedure becomes less effective (due to the small number of data points and related numerical noise). For the VEX data, used here as an example of a realistic data set with a more complex spectrum, the results are influenced by the non-stationarity of the time series, but, overall, we see a similar decrease in FFT and DFT amplitude when increasing the TGP, as for the synthetic data. For the synthetic data set, ZTR and LST show neither decrease in the amplitude of the sinusoidal modes nor frequency dependence, when the TGP is increased in the SLG test. However, for very large gaps, the spectral background becomes increasingly noisy, leading to an overall increase of the average level of the spectrum. For the VEX data, since we do not have clear spectral peaks, we only see the increase of the average level of the spectrum as TGP increases.

The MSG test probes the effect of small, randomly distributed, gaps within the time series. For the synthetic data set, FFT is the only method which is severely affected by such small gaps, showing decreased amplitude and systematic frequency dependence in amplitude reduction, with high frequencies being most affected. On the other hand, DFT, ZTR and LST are able to recover the amplitudes of the sinusoidal modes, but the spectral background becomes increasingly noisy when increasing the TGP. For the VEX data, the FFT shows a similar decrease in amplitude and frequency dependence as for the synthetic case. On the other hand, DFT, ZTR and LST seriously overestimate the high frequency part of the amplitude spectrum above a certain threshold frequency. Moreover, we found that this threshold is dependent on the distribution of the small data gaps, and is moving to a lower frequency as the gap size increases. Beyond this threshold the spectral amplitude is roughly constant since the data gaps cover a large range of gap sizes corresponding to the frequency range above the threshold.

Table 1 summarizes the main conclusions of our study. The two columns show
the results of the two tests (SLG – left and MSG – right) for the synthetic
data set (row 1) and for the

Concluding, the FFT method can be used even for relatively large single data gaps, although the absolute value of the amplitude spectrum is systematically reduced with gap size. On the other hand, the ZTR and LST methods preserve the absolute level of the amplitude spectrum, but are more vulnerable to increasing spectral background arising from increasing the TGP. They are recommended for the analysis of signals with strong sinusoidal modes, giving robust results for the amplitude of sinusoidal modes. For more turbulent spectra, the appearance of side lobs and spectral noise makes the effect of data gaps more pronounced for these methods than for FFT. Thus, our results indicate that, at least when the data includes gaps, FFT is the best among the four tested methods to approximate reliably the spectral slope of a signal recorded in a turbulent environment.

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement 313038 (STORM), and a grant of the Romanian Ministry of National Education, CNCSUEFISCDI, project No. PN-II-ID-PCE-2012-4-0418. We also acknowledge the financial support by the Academy of Finland to the ReSoLVE Center of Excellence (project 272157) and to project 264994. The topical editor, G. Balasis, thanks Z. Voros and R. V. Donner for help in evaluating this paper.