One-step ahead prediction of foF2 using time series forecasting techniques

In this paper the problem of one-step ahead prediction of the critical frequency (foF2) of the middle-latitude ionosphere, using time series forecasting methods, is considered. The whole study is based on a sample of about 58 000 observations of foF2 with 15-min time resolution, derived from the Athens digisonde ionograms taken from the Digisonde Portable Sounder (DPS4) located at Palaia Penteli (38° N, 23.5° E), for the period from October 2002 to May 2004. First, the embedding dimension of the dynamical system that generates the above sample is estimated using the false nearest neighbor method. This information is then utilized for the training of the predictors employed in this study, which are the linear predictor, the neural network predictor, the persistence predictor and the k-nearest neighbor predictor. The results obtained by the above predictors suggest that, as far as the mean square error is considered as performance criterion, the first two predictors are significantly better than the latter two predictors. In addition, the results obtained by the linear and the neural network predictors are not significantly different from each other. This may be taken as an indication that a linear model suffices for one-step ahead prediction of foF2.


Introduction
The accurate prediction of ionospheric conditions is critical for several applications affected by space weather, including HF communications, satellite positioning and navigation applications. Ionospheric storms can cause large-scale, drastic changes to the usable range of HF frequencies. Large solar flares cause short-wave fadeouts, resulting in blackouts of HF signals. Also, protons emitted from the Sun result in polar cap absorption events and consequently in blackouts of HF signals propagating through the Earth's polar regions. Ionospheric effects can also generate time-varying ionospheric currents, especially in the northern latitudes, causing problems in ground systems, such as systems for power generation and supply; oil and gas pipeline distribution; aerial surveying for minerals, oil and gas; drilling for oil and gas; railways.

Correspondence to: K. Koutroumbas (koutroum@space.noa.gr)
To handle the complexity of the problem, the development of accurate models to forecast the ionospheric conditions is of crucial importance, especially during disturbed conditions. Considerable effort has been devoted to the development of physical models (Anderson et al., 1998) based on the coupling between the thermosphere and the ionosphere. However, the use of such models is not suitable for real-time applications, due to both the input data requirements for the simulation of the thermosphere (Fuller-Rowell and Rees, 1980) and the computational load.
The empirical models driven by magnetic indices are a second group of ionospheric models. Fuller-Rowell et al. (2001) developed the empirical storm-time ionospheric correction model, which is driven by the previous time-history of the geomagnetic index, ap, and is designed to scale the quiet-time F-layer critical frequency (foF2) to account for storm-time changes in the ionosphere. The model provides a useful, yet simple, tool for estimating the changes to the ionosphere in response to geomagnetic activity.
Another well-known statistical model for the prediction of ionospheric parameters was introduced by Muhtarov and Kutiev (1999), which makes use of the autocorrelation function of the parameter under consideration without using any geomagnetic index. Muhtarov et al. (2002) further improved the prediction capability of the autocorrelation model by adding a geomagnetic index and its statistical characteristics.
An alternative approach for predicting ionospheric conditions is based on time series forecasting techniques. Data-driven modelling techniques, such as neural networks, are used to predict the behaviour of the ionosphere under the assumption that nonlinear processes are the dominant mechanisms that generate foF2 variability (Tulunay et al., 2004a, b; Wintoft and Cander, 2000; McKinnell and Poole, 2000). Wintoft and Cander (2000) used time-delay, feed-forward neural networks to predict the hourly values of the ionospheric F2 layer critical frequency, foF2, 24 h ahead. The 24 measurements of foF2 per day are reduced to five coefficients with principal component analysis. A time delay line of these coefficients is then used as input to a feed-forward neural network. Also included in the input are the 10.7-cm solar flux and the geomagnetic index Ap. The network is trained using foF2 data from 1965 to 1985 gathered at the Slough ionospheric station and validated on an independent validation set from the same station for the periods 1987-1990 and 1992-1994.
In a recent study, Tulunay et al. (2004) presented the application of the Middle East Technical University Neural Network (METUNN) to forecast the foF2 values one hour in advance, based on hourly resolution data. The input parameters are year, month, coded season, day, hour, coded hour, the foF2 value observed one hour ago, the first and second relative differences, and the station code. The method was applied to data from Poitiers, Slough and Uppsala, and the mean square errors were within reasonable limits (0.11353 to 0.21145 MHz).

The problem considered in this study is the estimation of the next value of foF2 using time series forecasting methods. The available data sample X={x_1, x_2, ..., x_p} consists of about p=58 000 observations of foF2 derived from the Athens digisonde ionograms taken from the ionospheric station located at Palaia Penteli, for the period from October 2002 to May 2004. The sampling rate is 15 min. There is a small fraction of missing observations, which have been excluded from the subsequent prediction stages.
If we denote the current value of foF2 by x(n), then the estimation of the next value x(n+1) is based on the vector

y(n) = [x(n), x(n−1), ..., x(n−(d−1))]^T. (1)

The first problem to be faced is the estimation of d, the so-called embedding dimension. This is estimated using the false nearest neighbor method. After the determination of d, two sets of pairs of the form (y(n), x(n+1)) are created.
The first one, denoted by S_1 and called the training set, will be used for the training of the predictors, while the second one, denoted by S_2 and called the test set, will be used for the evaluation of the performance of the predictors. The evaluation criterion for the above predictors is the mean square error (MSE), that is, the mean value of the squared difference between the actual and the predicted values of foF2.
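As a concrete illustration, the construction of the (y(n), x(n+1)) pairs and the MSE criterion can be sketched as follows. This is a minimal Python sketch; the function names are ours, not from the paper.

```python
def make_pairs(x, d):
    """Pairs (y(n), x(n+1)) with y(n) = [x(n), x(n-1), ..., x(n-(d-1))]."""
    return [([x[n - j] for j in range(d)], x[n + 1])
            for n in range(d - 1, len(x) - 1)]

def mse(actual, predicted):
    """Mean value of the squared prediction error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy stand-in for the foF2 series
pairs = make_pairs(x, d=2)      # first pair: ([2.0, 1.0], 3.0)
```

In the paper the pairs built from the first half of the series form S_1 and those from the second half form S_2.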
In this study, both parametric and non-parametric predictors are used. Specifically, from the first category the linear predictor, as well as neural network predictors, are considered, while from the second category the persistence predictor, as well as the k-nearest neighbor predictor, are considered. The experimental results show that all predictors exhibit a less than 13% error on the test set, in terms of the MSE criterion. This issue will be discussed further in the simulation results section.
The rest of the paper is organized as follows. In Sect. 2, the definition of the embedding dimension, d, is given, together with a short description of the false nearest neighbor method that estimates d. In Sect. 3 a short description of the predictors considered in this study is given. In Sect. 4 the procedure that generates the training and the test sets, S_1 and S_2, is described. In addition, the results of the predictors, followed by a short discussion, are provided. Finally, concluding remarks, as well as future research directions, are included in Sect. 5.

The embedding dimension
Let A denote the d-dimensional dynamical system that produces the available time series of observations and let s(n) denote its state vector at time n. Assuming that A is a discrete dynamical system, it is described by the following equation (also called map)

s(n+1) = h(s(n)). (2)

Clearly, this system is unknown, that is, we know neither the dimension d nor the function h. The only available information about it is through the sequence of observations {x(n)}, which are related to the state vector s(n) through an (equally unknown) scalar measurement function. Since, in general, the available sequence of observations does not represent properly the multi-dimensional phase space of the dynamical system, one has to employ some technique to unfold the multi-dimensional structure using the available data series (Hegger et al., 1999).
The most important technique for phase space reconstruction is the method of delays (see, e.g. Tsonis, 1992; Hegger et al., 1999). According to this method, the vectors in the new space (the embedding space) are formed from time-delayed values of the scalar measurements, i.e.

y(n) = [x(n), x(n−T), ..., x(n−(d−1)T)]^T,

where T is the time delay and d is the dimension of the embedding space, called the embedding dimension. Knowledge of d is of crucial importance, but, of course, it is unavailable in real-world situations and has to be estimated from the available data series. Specifically, d should be chosen large enough to allow for the unfolding of the multi-dimensional structure of the system, but not too large, in order to avoid the undesirable effects of the noise encountered in the measurements, as well as an unnecessary increase in computational complexity (Kennel et al., 1992).
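The method of delays is easily expressed in code. The sketch below is our own illustration of the embedding-vector construction, with the delay T kept as a parameter (T=1 is used throughout the paper except in the multi-step discussion of Sect. 5).

```python
def delay_embed(x, d, T=1):
    """Delay vectors y(n) = [x(n), x(n-T), ..., x(n-(d-1)T)]
    built from a scalar series x (method of delays)."""
    return [[x[n - j * T] for j in range(d)]
            for n in range((d - 1) * T, len(x))]

series = [0.1, 0.4, 0.2, 0.8, 0.5, 0.9]
Y = delay_embed(series, d=3, T=1)   # Y[0] = [x(2), x(1), x(0)] = [0.2, 0.4, 0.1]
```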
A method that has been extensively used for the estimation of d is the so-called method of false nearest neighbors (Kennel et al., 1992), which is described below. Let X = {x(1), x(2), ..., x(q)} be the set of observations on which the estimation of d will be based.
The false nearest neighbor method

• Set R_tot=15 and A_tot=2 (as suggested in Kennel et al., 1992).

• Choose a high enough value of d, say d_max, and use X to construct the set Z of the d_max-dimensional delay vectors y(n)=[x(n), x(n−1), ..., x(n−(d_max−1))]^T.

• For d=1, 2, ..., d_max:

- For each y(n) in Z, determine its nearest neighbor y'(n) based on the last d coordinates of the y values, i.e. choose y'(n) such that d(y(n), y'(n)) = min over y in Z−{y(n)} of d(y(n), y), where the distance d(u, v) between two d_max-dimensional vectors u and v is the Euclidean distance computed over their last d coordinates, u_i and v_i being the i-th coordinates of u and v, respectively.

- Let R_d(n) be the distance between y(n) and y'(n) when only the last d coordinates are taken into account, and R_{d+1}(n) the corresponding distance when the last d+1 coordinates are taken into account. Declare y'(n) a false nearest neighbor of y(n) if (R_{d+1}^2(n) − R_d^2(n)) / R_d^2(n) > R_tot^2, or if R_{d+1}(n)/R_A > A_tot, where R_A is the standard deviation of the observations in X.

- If the number of false nearest neighbors is smaller than 1% of q (as suggested in Kennel et al., 1992), choose d as the embedding dimension and terminate the procedure.

In words, the above method tests whether d is an appropriate estimate for the embedding dimension, by utilizing the neighborhood information of the d_max-dimensional vectors y(n) in Z. Specifically, for each vector y(n) in Z, its nearest neighbor y'(n) is determined based on the last d coordinates of the vectors. Let R_d^2(n) be the squared distance between y(n) and y'(n) when only the last d coordinates are taken into account. Then R_{d+1}^2(n) is computed and, if R_d^2(n) and R_{d+1}^2(n) differ significantly, we say that y'(n) is a false nearest neighbor of y(n). If this happens for a significant number of points y(n) in Z, it is an indication that the multi-dimensional structure of the system does not "unfold" well in the d-dimensional space, i.e. a larger value of d must be considered.
Finally, it is worth noting that the above algorithm may also be used by considering not only the nearest neighbor of each vector y ∈ Z but also its k-nearest neighbors.
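To make the procedure concrete, a stripped-down sketch of the false nearest neighbor test is given below. This is our own illustrative code: it implements only the first (distance-stretching) criterion with R_tol, omits the second A_tot criterion, and uses a brute-force neighbor search rather than the tree structures a practical implementation would employ.

```python
import math

def fnn_fraction(x, d, r_tol=15.0):
    """Fraction of false nearest neighbours at embedding dimension d.
    A neighbour found in d dimensions is declared false when the extra
    (d+1)-th coordinate stretches the neighbour distance by more than r_tol."""
    n_vec = len(x) - d
    # (d+1)-dimensional delay vectors; their last d coords give the d-dim vectors
    vecs = [[x[n + j] for j in range(d + 1)] for n in range(n_vec)]
    false = 0
    for i, v in enumerate(vecs):
        # nearest neighbour using only the last d coordinates
        r_d, nearest = float("inf"), None
        for k, w in enumerate(vecs):
            if k != i:
                dist = math.dist(v[1:], w[1:])
                if dist < r_d:
                    r_d, nearest = dist, w
        # Kennel's first criterion: relative growth of the distance
        if r_d > 0 and abs(v[0] - nearest[0]) / r_d > r_tol:
            false += 1
    return false / n_vec
```

In the paper's setting, d would be increased until this fraction drops below the 1% threshold.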

The neural network predictor
In this study we consider only two-layer, feedforward neural networks (2LFNN), with m nodes in their hidden layer and a single output node. These networks are modelled by the following equation

x = g( v_0 + sum_{j=1}^{m} v_j f(w_j^T y') ), (8)

where y'=[1, y^T]^T is the input vector y augmented with a constant bias input and x is the output of the network. f is typically chosen to be equal to logi(x)=1/(1+e^(−ax)) or tanh(x)=(1−e^(−ax))/(1+e^(−ax)), while g may be chosen to be equal to x, logi(x) or tanh(x). The m (d+1)-dimensional vectors w_j, as well as the values v_j, j=0, ..., m, are the parameters of the network. Let W denote a vector that contains all these parameters. W is usually estimated by optimizing an appropriately defined cost function, using tools from nonlinear optimization theory. Given a data set Y={x(1), ..., x(q)}, a typical cost function that is frequently employed is the sum of squared errors, defined as

J(W) = sum_n (x(n+1) − x̂(n+1))^2,

where x̂(n+1) is the network output for input y(n). The advantage of the above types of models is that they can describe more reliably phenomena that exhibit significant nonlinearities. However, their major disadvantage follows from the fact that the cost function to be optimized is nonconvex, due to the nonlinear nature of f and (possibly) g in Eq. (8). As a consequence, it is difficult to obtain the global optimum W* of J(W) that best represents the data at hand. Thus, instead of trying to determine the global optimum of J(W), we seek local optima of J(W), which are (hopefully) suitable for the problem at hand. Their suitability is assessed through the test set.
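A minimal sketch of the forward pass of such a network is given below. This is our own reconstruction: the exact placement of the bias terms in the paper's Eq. (8) is an assumption, here taken as a constant input prepended to y and weighted by the first component of each w_j.

```python
import math

def two_layer_ff(y, W_hidden, v, f=math.tanh, g=lambda u: u):
    """Two-layer feed-forward network with m hidden nodes:
    x = g(v0 + sum_j v_j * f(w_j . [1, y])).
    Each row of W_hidden is a (d+1)-dimensional weight vector w_j whose
    leading component multiplies the bias input; v = [v0, v1, ..., vm]."""
    aug = [1.0] + list(y)  # bias input prepended
    hidden = [f(sum(w_i * a for w_i, a in zip(w, aug))) for w in W_hidden]
    return g(v[0] + sum(v_j * h for v_j, h in zip(v[1:], hidden)))
```

With f and g both set to the identity, the network collapses to the linear predictor, which is one way to see why the two model classes are directly comparable.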

The persistence predictor
In this case, the estimator of x(n+1), denoted x̂(n+1), is x(n), that is, x̂(n+1)=x(n). This simple predictor is expected to give satisfactory results in cases where the sample-to-sample variation is small, as is the case for periods where no significant disturbances occur in the ionosphere.
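In code, the persistence baseline is essentially a one-line shift of the series (our own illustrative sketch):

```python
def persistence_predict(x):
    """Persistence predictor: x_hat(n+1) = x(n). The predictions for the
    targets x[1:] are simply the series with its last sample dropped."""
    return x[:-1]

x = [4.0, 4.1, 3.9, 4.0]
preds, targets = persistence_predict(x), x[1:]
```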

The k-nearest neighbor predictor
In this case, for a given vector y(n), the predictor computes the estimate of x(n+1) as follows. First, the k nearest neighbors of y(n) in S_1, denoted by y(n_1), y(n_2), ..., y(n_k), are identified. Then, the estimate of x(n+1) is taken to be equal to the mean of x(n_1+1), x(n_2+1), ..., x(n_k+1). This method appears under the name "first order local approximation" in Tsonis (1992).
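A minimal sketch of this first-order local approximation follows (the names are ours; a practical implementation would use a spatial index instead of sorting all training pairs for every query):

```python
import math

def knn_predict(y_query, train_pairs, k):
    """Average the successors x(n_i + 1) of the k training delay
    vectors closest to y_query (first-order local approximation)."""
    ranked = sorted(train_pairs, key=lambda p: math.dist(p[0], y_query))
    return sum(target for _, target in ranked[:k]) / k

train = [([1.0, 0.0], 2.0), ([1.1, 0.1], 4.0), ([9.0, 9.0], 100.0)]
# the two vectors nearest [1, 0] dominate: prediction (2 + 4) / 2 = 3
```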

Experimental results
Before we proceed with the generation of the training and the test sets, we need to estimate the embedding dimension d of the space where the dynamical system that produces the data sample X at hand "unfolds" its multi-dimensional structure in a satisfactory fashion. Applying the false nearest neighbor method described in Sect. 2 to the first half of the data sample X (that is, q=p/2), we find that a good choice for d is 6. Specifically, for dimensions greater than or equal to 6, the percentage of false nearest neighbors falls well below 1% (see also Fig. 1).
Having estimated d, we now describe the way the training and the test sets are generated. Specifically, the data sample X is split into two halves, X_1 (first half) and X_2 (second half). From each X_i, i=1, 2, a corresponding set S_i, i=1, 2, of pairs (y(n), x(n+1)) is generated, where d is chosen to be equal to 6, y(n) is defined as in Eq. (1) and the vectors y(n) with missing values are omitted. All the predictors have been trained using S_1, and their performance has been measured on the test set S_2. The results are summarized in Table 1. Also, in Figs. 2, 3, 4 and 5 the histogram of the absolute differences between the actual and the estimated values on the test set, as well as the plot of the actual and the predicted values for a short time interval, are given for each of the four predictors. It is noted that various 2LFNN architectures with different numbers of hidden layer nodes have been examined. Specifically, 2LFNNs with up to m=50 nodes in the hidden layer have been considered. However, the best performance was obtained for m=4 nodes. We also note that the mean square error (MSE) value for the 2LFNNs, shown in Table 1, is the average of the MSEs of 10 networks, with m=4 hidden nodes, that have been trained with the Levenberg-Marquardt algorithm starting from different initial values of the parameters. Also, for the k-nearest neighbor predictor, the results for k=1, 2, ..., 30 have been considered. Here, only the best results are provided.
As can be seen from the results shown in Table 1 (and supported by Figs. 2, 3, 4 and 5), all predictors seem to exhibit more or less similar performance. However, in order to quantify the significance of the differences among the mean square errors (MSE) produced by any pair of the above predictors when they are applied on the test set, we use the t-test statistic (see, e.g. Mendenhall et al., 1995). The choice of this test is justified by the fact that the errors produced by the predictors are independent, since each one of the predictors follows a different prediction strategy from all the others. More specifically, for any two of the above predictors, P_1 and P_2, we test the hypothesis

H_0: MSE_1 = MSE_2,

where MSE_i is the mean square error (MSE) for P_i, i=1, 2. Denoting the sample MSE for P_i (as it is given in Table 1) by MSE_i and assuming that MSE_1 > MSE_2, we compute the quantity

z = (MSE_1 − MSE_2) / sqrt(s_1^2/n + s_2^2/n),

where s_i is the sample standard deviation of the squared errors produced by P_i, and n is the number of samples (in our case n=25 605). The values of z for all pairs of predictors are shown in Table 2. Setting the level of significance a equal to 0.01 and taking into account the above value of n, the value z_a for the t-test is 2.326. Recalling that the H_0 hypothesis is rejected when z > z_a, the values in the table lead to the following conclusions:

- At significance level 0.01, there is not sufficient evidence to reject the hypothesis that the MSEs for the linear predictor and the neural network predictor are equal.
- At significance level 0.01, there is not sufficient evidence to reject the hypothesis that the MSEs for the persistence predictor and the k-nearest neighbor predictor are equal.
-At significance level 0.01, the MSEs for the nonparametric predictors differ significantly from the MSEs for the parametric predictors.
(At significance level 0.05, however, the hypothesis that the MSEs for the linear predictor and the neural network predictor are equal is rejected, since z_0.05=1.645.)

Adopting Occam's razor principle, that is, seeking the simplest model that best describes the observed data, and taking into account the above analysis, a linear model seems to be sufficient for one-step ahead predictions.
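The test statistic described above can be computed as follows. This is a sketch under the assumption of the standard large-sample two-mean form; the MSE and s_i values shown are placeholders, not the values of Table 1.

```python
import math

def z_statistic(mse1, s1, mse2, s2, n):
    """Large-sample statistic for H0: MSE1 = MSE2 (with MSE1 > MSE2),
    where s_i is the sample standard deviation of the squared errors of
    predictor P_i and n is the number of test samples."""
    return (mse1 - mse2) / math.sqrt(s1 ** 2 / n + s2 ** 2 / n)

# with n = 25 605 samples, even small MSE gaps become significant;
# reject H0 at a = 0.01 when z > 2.326
z = z_statistic(0.20, 0.5, 0.15, 0.4, 25605)
```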
Focusing on the performance of the various predictors on the training set, we notice that the k-nearest neighbor predictor exhibits the best performance. The fact that this predictor exhibits the worst performance on the test set may be taken as an indication that the k-nearest neighbor predictor exhibits some degree of overfitting on the training set. On the contrary, no such conclusion is supported by the above results for the other three predictors.

Concluding remarks and future directions
In this paper we considered the problem of performing one-step ahead predictions of the foF2 parameter, using time series forecasting methods. Specifically, assuming that n denotes the current time slot, the purpose is to estimate x(n+1) based on x(n), ..., x(n−(d−1)), where {x(n)} denotes the foF2 time series. Our first concern was to estimate the value of d, the dimension of the space in which the dynamical system that generates the observed foF2 measurements is embedded. This was carried out by applying the false nearest neighbor method. Then, based on the estimated value of d, we generated the appropriate training and test sets for the training and the evaluation of the performance of four well-known predictors: the linear predictor, the two-layer feed-forward neural network predictor, the persistence predictor and the k-nearest neighbor predictor.
The results show that the parametric predictors work significantly better than the non-parametric ones. In addition, the performance of the linear predictor does not differ significantly from that of the neural network predictor. The latter fact may be taken as an indication that a linear model suffices for one-step ahead prediction of foF2. In addition, it seems that the k-nearest neighbor predictor exhibits some degree of overfitting on the training set.
Below, we briefly outline some directions for further investigation. First, we intend to apply nonlinearity tests to the observed data series, in order to gain further insight into the nature of the process that produces the observed foF2 measurements (see, e.g. Schreiber and Schmitz, 2000).
Furthermore, an interesting variation during the training of the predictor would be to supply additional information related to the presence or the absence of a disturbance.
In addition, it seems interesting to see how the above predictors can be adapted to a time-varying environment. In such an environment the time series at hand exhibits significant variations in time and, thus, the predictor has to adapt its parameters in order to be able to follow these changes. In this case we say that we deal with adaptive predictors. In the prediction of foF2, the above idea may be utilized as follows: first, we use the data of a short time period to train a specific predictor. Then, this predictor is used with the current parameter values for prediction over a short time period in the future. Then, its parameters are re-evaluated in light of the new observations and the procedure is repeated.
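The adaptive scheme just described can be illustrated as follows. This is our own toy sketch, with a first-order least-squares fit x̂(n+1) = a·x(n) + b standing in for an arbitrary predictor; window lengths and refit periods are hypothetical.

```python
def fit_linear(window):
    """Least-squares fit of x(n+1) ~ a * x(n) + b on a short window."""
    xs, ys = window[:-1], window[1:]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return a, my - a * mx

def adaptive_predict(series, window_len, refit_every):
    """Sliding-window scheme: refit the predictor on the most recent
    window, predict for a short horizon, then repeat."""
    a, b = fit_linear(series[:window_len])
    preds = []
    for n in range(window_len, len(series) - 1):
        if (n - window_len) % refit_every == 0:   # re-evaluate the parameters
            a, b = fit_linear(series[n - window_len:n])
        preds.append(a * series[n] + b)
    return preds
```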
Finally, an obvious extension of the above work is multi-step ahead prediction, where, of course, the error estimate is expected to increase compared to that of the one-step ahead prediction. It should be noted, however, that by allowing the time delay T to take values other than 1, interesting results may be obtained in this direction. For example, if we set T=26 (which corresponds to 6.5 h, since the sampling rate for the data set at hand is 15 min), the prediction exhibits a mean square error slightly greater than 1 MHz.

Fig. 1. Plot of the percentage of the false nearest neighbors versus the space dimension. For dimensions greater than or equal to 6 the percentage falls well below 1%.

Fig. 2. One-step ahead (15 min) prediction with the linear predictor. (a) The histogram of the absolute differences of the computed and the actual outputs, for the test set. (b) The actual (solid line) and the computed (dotted line) outputs for a short time interval of the test set.

Fig. 3. One-step ahead (15 min) prediction with the 2LFNN model with 4 nodes. (a) The histogram of the absolute differences of the computed and the actual outputs, for the test set. (b) The actual (solid line) and the computed (dotted line) outputs for a short time interval of the test set.

Fig. 4. One-step ahead (15 min) prediction with the persistence predictor. (a) The histogram of the absolute differences of the computed and the actual outputs, for the test set. (b) The actual (solid line) and the computed (dotted line) outputs for a short time interval of the test set.

Fig. 5. One-step ahead (15 min) prediction with the 9-nearest neighbor predictor. (a) The histogram of the absolute differences of the computed and the actual outputs, for the test set. (b) The actual (solid line) and the computed (dotted line) outputs for a short time interval of the test set.

Table 1. Fifteen-minute ahead prediction. The table shows the mean square error on the training set and on the test set for the linear predictor, the 2LFNN predictor with 4 nodes in the hidden layer, the persistence predictor and the k-nearest neighbor predictor, for k=12. In parentheses, the standard deviation of the squared errors for each predictor on the test set is shown.

Table 2. Fifteen-minute ahead prediction. The table shows the values of the t-test statistic that quantify the significance of the differences in the mean square error produced by pairs of predictors when the test set is considered.