Automatic detection of the Earth Bow Shock and Magnetopause from in-situ data with machine learning

Gautier Nguyen1, Nicolas Aunai1, Bayane Michotte de Welle1, Alexis Jeandet1, and Dominique Fontaine1
1CNRS, Ecole polytechnique, Sorbonne Université, Univ Paris Sud, Observatoire de Paris, Institut Polytechnique de Paris, Université Paris-Saclay, PSL Research University, Laboratoire de Physique des Plasmas, Palaiseau, France
Correspondence: Gautier Nguyen (gautier.nguyen@lpp.polytechnique.fr)

We selected data within the dawn, dayside and dusk operation phases of THEMIS and thus expect a good coverage of both the magnetopause and shock surfaces. This is confirmed by the actual spatial coverage of our labelled dataset shown in Figure 2.
We therefore expect the method to be robust to the variability found in the data across the three THEMIS operation phases.

In the following, we designate by training set the subset of data used to fit our algorithm, and by test set the remaining subset used to evaluate the performance of our model. For each of the configurations with which we tested our algorithm, the training set represents 70% of the dataset and the test set the remaining 30%. Unless otherwise noted, all the performances reached by the models presented below have been obtained with a random split of the dataset between training and test set, for which we ensure the class distribution is still respected.

Algorithm
Gradient boosting is a class of machine learning algorithms that can be used for either classification or regression purposes. The principle of gradient boosting is as follows: we first train a decision tree, another class of machine learning algorithm based on successive decisions over the values taken by the features of the dataset, by minimizing the false predictions made over the training set. Additional decision trees are then iteratively fitted to the residuals of the predictions of the previous trees, until the maximum number of trees or sufficiently small residuals are reached. The final ensemble of trees is used to model the regions and make predictions on yet unseen data points, giving as an output the probability of belonging to each class. The final prediction then corresponds to the class with the highest probability. A complete description of the principle and the fitting process of decision trees and gradient boosting methods can be found in Geron (2017).
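The classification step described above can be sketched in a few lines with scikit-learn. The feature columns (a density and a magnetic field magnitude) and the class means below are synthetic stand-ins for illustration, not the authors' actual dataset or feature set:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 900
# Crude synthetic proxies for the three regions (magnetosphere=0,
# magnetosheath=1, solar wind=2): density and |B| roughly separate them.
labels = rng.integers(0, 3, n)
density = np.choose(labels, [0.3, 8.0, 4.0]) + rng.normal(0, 0.5, n)
b_mag = np.choose(labels, [40.0, 15.0, 5.0]) + rng.normal(0, 2.0, n)
X = np.column_stack([density, b_mag])

# 70% / 30% split, preserving the class distribution (stratify)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100)
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)   # probability of belonging to each class
pred = proba.argmax(axis=1)       # final prediction = most probable class
accuracy = (pred == y_te).mean()
```

On such well-separated synthetic classes the classifier is near-perfect; the point of the sketch is only the probability-then-argmax prediction scheme.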

With the current size of our training set, our Gradient Boosting Classifier takes about two minutes to train.

Model performance
After fitting our Gradient Boosting model to our training set, we evaluate its performance by comparing its predictions on the test set with the corresponding labels. The value of the prediction for a given time interval is shown in the last subplot of Figure 1. From there, a prediction made by our model falls into one of four categories for each class:

- A true positive (TP) is a point of a class that has been predicted correctly.
- A true negative (TN) is a point not belonging to the concerned class that has been predicted correctly (e.g. a magnetosheath point that has not been predicted as a magnetosphere point, in the case of the magnetosphere class).
- A false negative (FN) is a point of a class that has not been correctly predicted (e.g. a magnetosphere point that has been predicted as a magnetosheath point).
- A false positive (FP) is a point not belonging to the concerned class that has been predicted as belonging to it (e.g. a magnetosheath point that has been predicted as a magnetosphere point, in the magnetosphere case).

With these four categories, we can define the true positive rate (TPR) as the ratio of the number of TPs to the total number of expected positive points:

TPR = TP / (TP + FN)

The false positive rate (FPR) is defined as the ratio of the number of FPs to the total number of expected negative points:

FPR = FP / (FP + TN)

An ideal model would be a model without any FN or FP; we would then expect the TPR to be equal to 1 and the FPR to be equal to 0 for the three classes. These two values are obtained for a given decision threshold on the predicted probability, as explained in the previous subsection. Logically, low decision thresholds imply more points predicted as belonging to a given class and thus raise both the FPR and the TPR. Conversely, higher decision thresholds decrease the number of positive points and thus decrease both the FPR and the TPR. The evolution of the TPR as a function of the FPR for a continuously varying decision threshold is represented by the Receiver Operating Characteristic (ROC) curve, shown in Figure 3 for the three classes. As expected, we notice an increasing TPR with an increasing FPR. The main interest of this curve lies in the inflexion point that corresponds to the best compromise between low FPR and high TPR. We want this point to be as close to the top left of each curve as possible, as this implies a FPR close to 0 and a TPR close to 1. A random classifier would, for each decision threshold, increase the TPR and FPR by the same amount and thus have a ROC curve that is a straight line of slope 1. The quality of the ROC can be quantified by computing the area under the curve (AUC) of each ROC, and we expect this AUC to be as close to 1 as possible.
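The TPR/FPR definitions and the threshold sweep behind the ROC curve can be sketched as follows with scikit-learn; the labels and scores are synthetic placeholders standing in for one class of the real prediction:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)   # 1 = point belongs to the class
# Predicted probability of belonging to the class, correlated with y_true
y_score = np.clip(0.7 * y_true + rng.normal(0.15, 0.2, 1000), 0, 1)

# At one fixed decision threshold:
thr = 0.5
pred = y_score >= thr
tp = np.sum(pred & (y_true == 1))
fn = np.sum(~pred & (y_true == 1))
fp = np.sum(pred & (y_true == 0))
tn = np.sum(~pred & (y_true == 0))
tpr = tp / (tp + fn)   # TPR = TP / (TP + FN)
fpr = fp / (fp + tn)   # FPR = FP / (FP + TN)

# Sweeping the threshold yields the full ROC curve and its AUC
fprs, tprs, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```

Lowering `thr` moves both `tpr` and `fpr` up along the curve, as described in the text.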
To ensure the independence of our method from the split made between training and test set, we trained and evaluated our model 10 times for 10 different splits and computed the AUC each time. The average AUC obtained for each class is shown on the first row of Table 1. First, the high AUC obtained for the three classes indicates how well our model performs in classifying the three regions. Moreover, the standard deviation we obtain is lower than 1e-3, which shows that our method is independent of the split made between the two sets.
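This repeated stratified random-split protocol can be sketched with scikit-learn's `StratifiedShuffleSplit`; the data below are synthetic placeholders for the real labelled dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(5)
y = rng.integers(0, 2, 600)
X = y[:, None] + rng.normal(0, 0.5, (600, 2))   # separable stand-in features

# 10 different 70/30 splits that preserve the class distribution
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
aucs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = GradientBoostingClassifier(n_estimators=50)
    clf.fit(X[train_idx], y[train_idx])
    aucs.append(roc_auc_score(y[test_idx],
                              clf.predict_proba(X[test_idx])[:, 1]))
mean_auc, std_auc = np.mean(aucs), np.std(aucs)
```

A small `std_auc` across the 10 splits is what supports the split-independence claim.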

By performing this random split, we allow temporally close points, and thus almost identical points in feature space, to be present in both the training and test sets. Consequently, this split only partially demonstrates the reproducibility of our method and how well it would perform on truly unknown data. To ensure there was no temporal overfit due to this split and that our method is truly reproducible, we trained and evaluated our model 3 times by splitting our dataset temporally instead of randomly. For this, we took as training set a time interval representing 2/3 of our dataset and left the remainder as the test set. The average AUC obtained in this case, also shown in Table 1, shows very little variation compared to the random split, which ensures the temporal independence of our method as well as its reproducibility.
Last but not least, the greatest part of our labels was obtained after successive training and prediction of our model on the parts of the dataset that had not been labelled yet. The labels may thus contain errors that could affect the quality of our prediction, and a high AUC would then not indicate the classification ability of our model but its ability to learn from erroneous labels. To figure this out, we trained and evaluated the algorithm after voluntarily mislabelling 25% of our dataset. If our model completely followed the labels indicated in the training set, we would once again expect a high AUC. This is not the case, as shown in Table 1, which demonstrates the real capacity of our algorithm to classify the three near-Earth regions as well as the reliability of our labels.

Adaptability of the model: from one mission to another

Having developed an automatic detection method of the three near-Earth regions with high reliability, one can think of adapting it to the data provided by other spacecraft that went through these regions. In this section, we adapt our method to the data provided by Cluster, Double Star and ARTEMIS.

Cluster
In the case of Cluster, we used the available data from Cluster 1 between the 1st of January 2001 and the 1st of January 2013.

The magnetic field data were provided by the Fluxgate Magnetometer with a temporal resolution of 4s (Balogh et al., 2001).
The plasma moments were provided by the Hot Ion Analyzer instrument (Rème et al., 2001) when the instrument was operating in magnetosphere or magnetosheath mode. Following what we did for THEMIS, the data were resampled to a 1 minute resolution.
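The resampling step is a standard time-series operation; a minimal sketch with pandas, on a synthetic 4 s magnetic field series (the column name is illustrative):

```python
import numpy as np
import pandas as pd

# One hour of synthetic data at the FGM's 4 s cadence
idx = pd.date_range("2001-01-01", periods=900, freq="4s")
raw = pd.DataFrame(
    {"Bx": np.random.default_rng(3).normal(size=900)}, index=idx)

# Resample to the common 1 minute resolution by averaging each bin
resampled = raw.resample("1min").mean()
```

Averaging within each 1 minute bin (15 samples here) is one natural choice; the paper does not specify the aggregation used.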
A typical representation of such data is shown in Figure 4. Intuitively, one could think that the efficient model trained on THEMIS in the previous section would also perform well on Cluster data. Nevertheless, the lower AUC obtained without retraining, reported in Table 1, indicates that the adaptation is not this obvious. This is mostly due to the spatial coverage of THEMIS, which has a much more equatorial orbit than Cluster. The data provided by the two missions can therefore be substantially different, and an algorithm that has only seen equatorial data cannot be expected to perform as well on unseen polar data. To cope with this, we refitted the model trained on THEMIS with a training set picked in our labelled Cluster data. The resulting model was then evaluated on the remaining test set, and the operation was repeated for 10 different train-test split configurations in a similar way to what was done for THEMIS. The increased AUC we obtained, shown in Table 1, proves both the necessity of adapting our algorithm to the specificities of the Cluster data and our capacity to adapt the algorithm to the data of another mission after a small labelling phase taking the mission's specificities into account.

Double Star
In the case of Double Star, we used the data of the TC1 spacecraft over the whole mission period (between the 1st of January 2004 and the 1st of January 2008). The magnetic field data were provided by the Fluxgate Magnetometer (Carr et al., 2005) with a temporal resolution of 2s. The plasma moments were provided by the CIS-HIA instrument (Fazakerley et al., 2005) with a temporal resolution of 4s. Once again, the data were resampled to a 1 minute resolution.
A typical representation of the data is shown in Figure 5.
Here the part of the data we labelled consists of 20671 magnetosphere points, 23091 magnetosheath points and 4944 solar wind points taken at the beginning of the year 2005. The main reason for this imbalance in the data stands in the orbit of TC1 itself, which is not supposed to cross the bow shock. The spatial distribution of our labelled data is also shown in Appendix A.
As TC1 also has an equatorial orbit, we expect the model trained on THEMIS to perform well even without retraining; this is also the main reason why we do not have full coverage of the (Y-Z) plane in the specific case of Double Star. This is confirmed by the value of the AUC without retraining in Table 1. Refitting the model nevertheless allows a finer detection, specific to the quality of the data provided by TC1 in comparison to the THEMIS data, as indicated by the increased AUC obtained after retraining the model with our labelled data.

ARTEMIS
The ARTEMIS mission actually corresponds to the THEMIS B and C spacecraft after they moved from a terrestrial to a lunar orbit. For this mission, it is necessary to take into account the moments in the data for which the spacecraft is in the Moon's wake. This new region is then considered as an additional possible region (and thus another label, with the value 3) for the data. A typical representation of the data that includes a passage through the Moon's wake is shown in Figure 6. As the spacecraft spends the greatest part of its orbit in the solar wind in this case, and because our dataset covers a quasi-complete solar cycle, we expect much more variability in the data specific to this class than for the three previous missions.
To cope with this, we labelled a month per year of our dataset, for a total of 26560 magnetosphere points, 131656 magnetosheath points, 429283 solar wind points and 15070 Moon's wake points, whose spatial distribution is shown in Appendix A.
Additionally, the greatest part of the data here corresponds to the distant night side, for which the transition from the magnetosheath to the solar wind is much less obvious than on the dayside and where small velocity, density and temperature variations could be wrongly interpreted as a boundary crossing. To cope with this, we added the spacecraft GSM coordinates as features of the dataset, which then consists of 11 input variables. Having a different dataset and a different number of classes, there is no use here for the model trained in the previous section, and we thus focus on the specific model we trained for this mission. The resulting high AUC shown in Table 1 proves the adaptability of this method to another kind of orbit as well as its flexibility and robustness regarding the addition of another region.

Comparison with threshold based method
The automatic identification of the three near-Earth regions had already been attempted by Jelínek et al. (2012) with the use of thresholds on the magnetic field magnitude and the density. Nevertheless, the quality of this classification had never been evaluated, nor its adaptation to non-equatorial data. Figure 7 represents the 2D histogram of B and Np for THEMIS B, TC1 and Cluster 1 over the periods on which we labelled our different datasets. We divided these parameters by the corresponding solar wind density and IMF magnitude obtained from the shifted OMNI data. At first sight, one can easily distinguish three main regions, separated by the solid red lines for the three missions. Nevertheless, these linear boundaries have been set manually and we cannot ensure they are the best choice for the three missions. To evaluate the quality of the classification, we computed the TPR and the FPR for the three missions and for varying boundary lines. We then used these values to compute the AUC shown in Table 2. Once again, we notice a lower AUC in the case of Cluster, which is consistent with the difference between equatorial and polar orbits explained in the previous section. Additionally, even if the boundaries plotted in Figure 7 seem to provide a decent separation between the three regions, the AUC is lower than the one obtained with gradient boosting. This indicates that our model performs better in classifying the three regions by setting more flexible boundaries, while requiring less time than needed to set the thresholds used in Figure 7. The same kind of histogram gets messier, with a much less obvious transition from the magnetosheath to the solar wind and the addition of the Moon's wake, as shown with the ARTEMIS data in Appendix C. This shows the difficulty a threshold-based method would have for a night-side oriented mission and the interest of using machine learning in this case.
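A threshold on a single normalized variable can be scored with the same ROC/AUC machinery, since the variable itself acts as the decision score and sweeping the threshold traces the ROC curve. A sketch in the spirit of the comparison above, with synthetic stand-ins for the normalized density (the region means are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 1000
in_sheath = rng.integers(0, 2, n)   # 1 = magnetosheath, 0 = solar wind
# Density normalized by the upstream solar wind density: ~1 in the solar
# wind, ~4 (compressed) in the magnetosheath; lognormal scatter assumed.
n_over_nsw = np.where(in_sheath == 1,
                      rng.lognormal(np.log(4.0), 0.4, n),
                      rng.lognormal(0.0, 0.4, n))

# The normalized density is the score; sweeping a threshold over it
# yields the AUC of the threshold-based separation.
auc_threshold = roc_auc_score(in_sheath, n_over_nsw)
```

This is the single-variable analogue of the varying boundary lines in the text; the actual comparison uses 2D boundaries in the (B, Np) plane.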

Automatic elaboration of boundary crossings catalogs
Our method proved its reliability and efficiency in classifying the three near-Earth regions. From now on, we can use it to elaborate our own magnetopause and bow shock crossing catalogs by classifying the streaming in-situ data provided by any near-Earth spacecraft and by selecting time intervals enclosing two predicted regions.

Magnetopause catalog
In the case of the magnetopause, we define a crossing as a 1 hour interval that contains as many magnetosheath points as magnetosphere points. We then elaborate a complete magnetopause crossing catalog by running the model trained in Section 2 on the data provided by the THEMIS A, B, C, D and E spacecraft. To save time in the construction of the crossings, and because we do not expect any magnetopause crossing in these regions, we restricted ourselves to the dayside, dawn and dusk operation phases and removed the parts of the orbit in which the spacecraft was far away in the solar wind (X GSM > 15 Re). The total number of obtained crossings is summarized in Table 3, and all of the magnetopause lists can be found here: https://github.com/gautiernguyen/in-situ_Events_lists.
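One illustrative reading of this crossing definition, on 1-minute predictions, is to slide a 1 hour (60-point) window over the predicted labels and flag the windows with as many magnetosphere (0) as magnetosheath (1) points. This is a sketch of the idea, not the authors' exact implementation:

```python
import numpy as np

def crossing_windows(labels, width=60):
    """Return start indices of windows with equal class-0/class-1 counts."""
    starts = []
    for i in range(len(labels) - width + 1):
        win = labels[i:i + width]
        n_sphere = np.sum(win == 0)
        n_sheath = np.sum(win == 1)
        # equal split with both regions present -> candidate crossing
        if n_sphere == n_sheath and n_sphere > 0:
            starts.append(i)
    return starts

# 60 magnetosphere points followed by 60 magnetosheath points: the only
# balanced window is the one centred on the transition.
labels = np.array([0] * 60 + [1] * 60)
print(crossing_windows(labels))   # -> [30]
```

In practice, overlapping qualifying windows around one transition would still need to be merged into a single event.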
Given that our detection method has been evaluated on large parts of orbits, the high quality of the classification is mostly obtained in regions where the spacecraft is not expected to cross a boundary. In these regions, the algorithm is less likely to hesitate on its prediction. On the other hand, it is more likely to hesitate on the predictions made close to the boundaries. Consequently, we have to ensure the classification is still of decent quality there. Figure 8 represents the ROC curve for the classification between magnetosphere and magnetosheath points for THEMIS B, Cluster 1 and Double Star, for the subset of our test set that lies in the proximity of a magnetopause or shock crossing. These predictions have been obtained with a model trained on the complementary part of the dataset, i.e. the subset that excludes the proximity of the crossings. Even if the AUC is lower than the one obtained in the previous section, its still high value indicates the good quality of the classification when a spacecraft comes close to the magnetopause, and thus our capacity to build crossings from the predictions made by our model.
Another way to ensure the consistency of the obtained crossings stands in the certainty of the predictions made by the algorithm and in their position in comparison to a standard magnetopause position. To do so, we computed the mean probability of each crossing by averaging, over the points of the crossing, the probability of belonging to the predicted class. Events with a high probability then correspond to undoubtful crossings, while the events with the lowest probability are the least likely to be actual crossings. The probability distribution of our 14186 events is shown in Figure 9.
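The per-crossing mean probability described above can be sketched directly from a `predict_proba`-style output; the six points and three classes below are illustrative values:

```python
import numpy as np

# predict_proba-style output for 6 points of one crossing (3 classes)
proba = np.array([[0.90, 0.08, 0.02],
                  [0.70, 0.25, 0.05],
                  [0.40, 0.55, 0.05],
                  [0.20, 0.75, 0.05],
                  [0.10, 0.85, 0.05],
                  [0.05, 0.90, 0.05]])
predicted_class = proba.argmax(axis=1)
# probability of the class each point was actually assigned to
p_of_predicted = proba[np.arange(len(proba)), predicted_class]
mean_probability = p_of_predicted.mean()
```

A crossing whose points are all predicted with high confidence gets a mean probability close to 1; hesitant predictions near the boundary pull it down.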
Having a high probability for the greatest part of our events then ensures the consistency of our magnetopause lists. Finally, the spatial distribution of the crossings that have a probability higher than 75% in the GSM XZ, XY and YZ planes is shown in Figure 10. These crossings represent 98.5% of the crossings built with our models and are expected to be the most likely to be actual magnetopause crossings. The solid black lines represent the stand-off position of the magnetopause model established by Lin et al. (2010), computed for a dynamic pressure of 2 nPa, a null Bz and assuming no dipole tilt. The proximity between this distance and our actual crossings proves the capacity of our method to elaborate a magnetopause crossing catalog with a decent coverage of the magnetopause at all latitudes and longitudes.

Bow shock catalog
We define a bow shock crossing event as a 10 minute interval that contains as many magnetosheath points as solar wind points.
We then ran the models trained for the different missions detailed in Section 3 on the same dataset used for the elaboration of the magnetopause crossing catalog. The total number of obtained crossings is once again summarized in Table 3, and the bow shock lists can also be found at https://github.com/gautiernguyen/in-situ_Events_lists.

In a similar way to what was done for the magnetopause, we ensure the consistency of the massive bow shock crossing detection by evaluating the algorithm on the part of the crossings for which we had labelled data. The associated ROC curve can be found in Appendix B, and the high value of the AUC, coupled with the probability distribution shown in Figure 11, convinces us that the greatest part of the detected events are actual bow shock crossings.
The spatial distribution of the crossings with a probability higher than 75% in the GSM XZ, XY and YZ planes is shown in

Discussion and conclusion
Using a Gradient Boosting Classifier, we established an automatic detection method of the different near-Earth environment regions as they are traversed by the THEMIS spacecraft during the dawn, dusk and dayside mission phases.

After a small retraining phase, necessary to account for the differences between the data of different missions, this method was adapted to other equatorial (Double Star) and non-equatorial (Cluster) missions. The adaptability of the method has even been tested on night-side oriented missions, for which we obtained the same quality of prediction after the addition of the Moon's wake as an additional class and of the spacecraft position as an additional feature. Having proved this adaptability, one could also think of using the method on the data of additional near-Earth missions such as MMS, IMP8 or ISEE. The classification could even be enhanced by the consideration of additional regions such as the ion foreshock. Such lightweight algorithms could eventually be taken onboard upcoming missions to automatically select the data of interest and thus automatically decide which data should be stored and kept for further analysis. Moreover, the method does not rely on the specificities of the near-Earth environment and could thus also be adapted to other planetary missions in the solar system.
We used this method to elaborate one of the largest existing magnetopause and bow shock crossing catalogs, which will be automatically enlarged with the increasing quantity of data. Having a large list of events also gives the opportunity to study these two near-Earth boundaries, and the physical processes occurring in their vicinity, from a statistical point of view. One could think, for instance, of the identification of magnetic reconnection jets, which will be the topic of a forthcoming study.
While this work has been done for the magnetopause and the bow shock, the same process presented in Section 5 could be used to build a Moon's wake catalog, which would pave the way for its study from a statistical point of view.

Last but not least, the high number of events we found is expected to be linked with a great variety of associated solar wind conditions, and it could then be interesting to link the position of the crossings with these upstream conditions through the construction of data-driven magnetopause and bow shock models.