This article presents the results of automatic detection of dust impact signals observed by the Solar Orbiter – Radio and Plasma Waves instrument.
A sharp and characteristic electric field signal is observed by the Radio and Plasma Waves instrument when a dust particle impacts the spacecraft at high velocity. In this way,
It is, however, challenging to automatically detect and separate dust signals from the many other signal shapes, for two main reasons: firstly, because spacecraft charging causes variable shapes of the impact signals, and secondly, because electromagnetic waves (such as solitary waves) may induce similar electric field signals.
In this article, we propose a novel machine learning-based framework for detection of dust impacts. We consider two different supervised machine learning approaches: the support vector machine classifier and the convolutional neural network classifier. Furthermore, we compare the performance of the machine learning classifiers to the currently used on-board classification algorithm and analyze 2 years of Radio and Plasma Waves instrument data.
Overall, we conclude that detection of dust impact signals is a suitable task for supervised machine learning techniques. The convolutional neural network achieves the highest performance with 96 %
The proposed convolutional neural network classifier (or similar tools) should therefore be considered for post-processing of the electric field signals observed by the Solar Orbiter.
The interplanetary dust population in the inner solar system (
Model calculations show that the number density of dust within 1 AU is diminished by collisional destruction
At present, the inner solar system is explored by the Parker Solar Probe
In this work, we analyze data acquired by the Solar Orbiter. The spacecraft orbits the Sun in an elliptic orbit with a period of approximately 6 months. At perihelion, the Solar Orbiter reaches a minimum solar distance of 0.28 AU, just within the perihelion of the Mercury orbit. The expected mission duration is 7 years, with a possible 3-year extension. The Solar Orbiter will thus provide long-term, in situ observations of the environment in the inner solar system with multiple instruments. One of these instruments is the Radio and Plasma Waves instrument, allowing observations of the cosmic dust flux with typical diameters ranging from
Radio and plasma waves instruments (i.e., antennas) have been used for studying dust in the solar system since the Voyager mission
The impact ionization process occurs when dust particles hit a target in space with impact speeds on the order of
Radio and plasma waves instruments allow for the entire spacecraft body to serve as a dust detector, providing a large collection area in comparison to dedicated dust detection instruments. Thus, radio and plasma waves instruments can provide dust distribution estimates based on thousands of dust impacts each year, statistical products that are difficult to acquire by dedicated dust instruments. Still, radio and plasma waves instruments have lower sensitivities than dedicated dust detectors
In this article, we present a machine learning-based framework as a novel method for detecting dust impact signals in radio and plasma waves instrument data. Machine learning methods, and neural networks in particular, have been used extensively over the past decade for challenging time series classification problems, such as speech recognition
A neural network has previously been used for selecting the signals of interest observed by the WAVES instrument on board the Wind spacecraft
The main motivation for this work was to develop a dedicated dust detection tool that can be used to automatically process the large amount of data acquired by the Radio and Plasma Waves instrument on board the Solar Orbiter. The aim was to develop a classifier with a high overall classification accuracy on a balanced data set that can make statistical studies more reliable and easier to conduct. For this project, we defined high accuracy to be (
The remainder of this article is structured as follows. Section
Waveforms recorded by the TDS receiver and measured by one of the RPW antennas. The signal label, classified by the TDS classification algorithm, is included for each snapshot in the subplot titles. The top row presents dust waveforms:
This work focuses on electric field signals (i.e., waveforms) observed by the Radio and Plasma Waves (RPW) instrument on board the Solar Orbiter
The TDS receiver is designed to capture plasma waves (such as ion acoustic and Langmuir waves) in the frequency range 200 Hz–100 kHz, in addition to the dust impact signals
The XLD1 mode is the most commonly used observational mode of the RPW–TDS system
For this project, we use the Triggered Snapshot WaveForms (TSWF) data product, processed with software version 2.1.1 and acquired over a 25-month period, from 15 June 2020 to 14 July 2022. The TSWF data product consists of signal packets (63 ms snapshots) that are down-linked only if the classification algorithm on board the Solar Orbiter is triggered. The accuracy of the on-board classification algorithm is therefore important in order to optimize the data transfer and provide reliable data products for statistical analysis.
The input to the on-board classification algorithm, hereafter named the TDS classifier or the TDS classification algorithm, is the 63 ms signal packet, while the output is categorized into one out of three labels: The snapshot peak amplitude ( The ratio of the peak amplitude to the median absolute value of the signal ( The full width half maximum (
The signal label is then determined by comparing the extracted feature values against configurable thresholds. The threshold criterion reflects that observations of waves are typically narrow band (low
A dust waveform observed by antenna 2 on 8 September 2021. The panels illustrate the different stages of the pre-processing procedure.
Data flow, from the TDS data sets to the machine learning performance metrics. The diagram illustrates the data flow by the black arrows and the applied process by the arrow label. The cylinders indicate the signal waveforms and the cylinder color indicates the associated label. The gray circles mark data transformation processes. The random draw of the TDS data and the pre-processing is explained in Sect.
The goal of the machine learning classifier is to take a monopole RPW snapshot as input and automatically output whether or not the signal contains a dust impact. For this purpose, we use a supervised classifier. A supervised classifier relies on manually labeled data to learn (i.e., train) the function that maps the input observation (the electric field signal) to the output label. In this work, we focus exclusively on detecting dust impact signals and therefore use the binary labels:
In order to construct a balanced data set, we selected
Manually labeled data are used both to train the machine learning classifiers and to test the performance of the trained models. Thus, great care is needed to construct a high-quality labeled data set without significant contamination from corrupted data files, biases, or mislabeled signals.
We manually labeled the data into either
It should be noted that 134 signals (i.e., 4.5 %), out of 3000 manually labeled waveforms, were marked as ambiguous and did not clearly fit into either the
The manually labeled data were split into a training set (containing 80 % of the data) and a testing set (with the remaining 20 %). The training data are used to optimize the free parameters of the machine learning classifiers with respect to the assigned labels, while the testing data are used as an independent set to evaluate the performance of the trained classifiers. The performance of a machine learning classifier is quantified by comparing the outputs of the trained model to the labels of the testing data. Figure
There are numerous machine learning techniques that are suitable for time series classification. In this work, we focus on two well-known techniques: the support vector machine (SVM), described in Sect.
The support vector machine
In order to develop a baseline machine learning classifier, comparable to the on-board TDS classification algorithm, a simple 2-dimensional SVM classifier was considered. Thus, every observation with dimension (
The two features (standard deviation and convolution ratio) were extracted from all observations in the training data. The decision hyperplane, in this 2-dimensional case a decision line, is defined by a polynomial of degree 2 that is optimized by minimizing the non-separable SVM cost function, see e.g.,
The signal examples are presented in
The performance of the trained SVM classifier is evaluated using the independent testing data, i.e., the remaining manually labeled data (20 %) that were not used for training the classifier. Figure
Overall, the SVM classifier achieves a classification accuracy of 94 % on the testing data using the 2-dimensional feature vectors. Note that the inclusion of additional extracted features can possibly enhance the SVM performance. Several additional features can be considered, such as the mean amplitude of the signal, the range between the signal maximum and minimum values and the cross-correlation length (the time lag to the first zero crossing).
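The additional features listed above can be illustrated with a minimal sketch. It is not the feature extraction used in this work; it assumes that the "mean amplitude" is the mean absolute value of the signal and that the "cross-correlation length" is the first lag at which the autocorrelation of the mean-subtracted signal reaches zero or becomes negative. The function name `additional_features` and the sinusoidal test signal are hypothetical.

```python
import math


def additional_features(signal):
    """Return three candidate features for one waveform (illustrative only)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]

    # Assumption: "mean amplitude" is taken as the mean absolute value.
    mean_amplitude = sum(abs(x) for x in signal) / n

    # Range between the signal maximum and minimum values.
    value_range = max(signal) - min(signal)

    # Assumption: correlation length = first lag where the (unnormalized)
    # autocorrelation of the centered signal is zero or negative.
    correlation_length = None
    for lag in range(1, n):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if acf <= 0:
            correlation_length = lag
            break

    return {
        "mean_amplitude": mean_amplitude,
        "range": value_range,
        "correlation_length": correlation_length,
    }


# Hypothetical test signal: four periods of a sine wave, 16 samples per period.
test_signal = [math.sin(2 * math.pi * i / 16) for i in range(64)]
features = additional_features(test_signal)
print(features)
```

For a sinusoid, the correlation length obtained this way is close to a quarter of the period, which gives a quick sanity check before applying such a routine to real waveforms.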
Ideally, we want to develop a machine learning classifier that not only has a high accuracy, but also makes decisions that are understandable for human experts
The three-layer fully convolutional network used for the dust impact classification task. The input to the network is the (
It should be noted that the signal examples in Fig.
Convolutional neural networks are algorithms designed for processing grid-like data and have achieved state-of-the-art performance on a number of different tasks over the past decade, such as image
Unlike the SVM, the CNN does not require pre-defined feature extraction routines. Instead, the CNN extracts the features based on a chain of convolution operations and automatically optimizes the convolution filters based on the training data and the associated labels.
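To make the notion of a convolution-based feature concrete, the following minimal sketch applies a single one-dimensional filter to a short signal, followed by a rectified linear unit and a global average. The filter here is chosen by hand for illustration, whereas in a CNN the filter weights are learned from the training data. The signal values and all function names are hypothetical and are not taken from the classifier presented in this work.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal without padding (cross-correlation,
    which is the operation CNN layers conventionally call a convolution)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive activations and set negative ones to zero."""
    return [max(0.0, v) for v in values]


def global_average(values):
    """Collapse a feature map to a single number."""
    return sum(values) / len(values)


# Hypothetical signal: flat baseline, sharp rise, slow decay.
signal = [0.0, 0.0, 0.0, 1.0, 0.6, 0.3, 0.1, 0.0]

# Hand-picked difference filter that responds to sharp rises.
rise_kernel = [-1.0, 1.0]

feature_map = relu(conv1d_valid(signal, rise_kernel))
pooled = global_average(feature_map)
print(feature_map, pooled)
```

The feature map is non-zero only at the position of the sharp rise, which is the kind of localized response that a learned filter can provide to the subsequent layers.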
For this work, we employed the three-layer fully convolutional network architecture presented in
The signal examples and the CAM analysis are presented in
The three-layer fully convolutional network consists of 267 010 free parameters (weights and biases) that need to be optimized to solve the dust impact classification task. The free parameters are randomly initialized and thereafter optimized using the ADAM gradient descent optimizer
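The quoted parameter count can be checked with simple bookkeeping. The sketch below relies on several assumptions that are not stated in the surviving text: three convolutional layers with 128, 256 and 128 filters and kernel lengths of 8, 5 and 3, each followed by batch normalization with a trainable scale and shift, then global average pooling and a two-class dense output layer, with three input channels. Under these assumptions the count of trainable parameters matches 267 010, which is suggestive but does not by itself confirm the architecture.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output filter."""
    return kernel_size * in_channels * out_channels + out_channels


def batchnorm_trainable_params(channels):
    """Trainable scale and shift per channel (running statistics excluded)."""
    return 2 * channels


def fcn_trainable_params(in_channels, n_classes=2,
                         filters=(128, 256, 128), kernels=(8, 5, 3)):
    """Count trainable parameters of the assumed fully convolutional network."""
    total = 0
    previous = in_channels
    for n_filters, kernel_size in zip(filters, kernels):
        total += conv1d_params(previous, n_filters, kernel_size)
        total += batchnorm_trainable_params(n_filters)
        previous = n_filters
    # Global average pooling has no parameters; the dense layer follows it.
    total += previous * n_classes + n_classes
    return total


# Assumption: three input channels.
n_params = fcn_trainable_params(in_channels=3)
print(n_params)  # 267010 under the stated assumptions
```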
In order to visualize the features extracted by the CNN, we employ the t-distributed Stochastic Neighbor Embedding (
Overall, the CNN obtains a high (
Neural networks have traditionally been regarded as black boxes
Class activation maps (CAMs)
The TDS, SVM and CNN classification performance metrics: accuracy, precision, recall and F1-score. The SVM and CNN scores and error values are the mean and the standard deviation across 10 training runs. The bold numbers indicate statistically enhanced performance with a significance level of 0.01, computed using a
The CAM analysis in Fig.
In general, however, the CNN achieves a high accuracy (
The average classification performance is obtained by training and testing the machine learning classifiers over 10 runs, each run with different training and testing sets. The classifiers are initialized from scratch and the training and testing sets are selected independently 10 times by randomization and splitting of the manually labeled data, as indicated by the gray circles in Fig.
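The repeated-run evaluation described in this paragraph can be sketched as follows. This is an illustration only: the data are synthetic one-dimensional features, the classifier is a toy threshold rule standing in for the SVM and CNN, and the 80 %/20 % split fraction is taken from the description of the training and testing sets. All names are hypothetical.

```python
import random
import statistics


def accuracy_of_run(samples, seed, train_fraction=0.8):
    """Shuffle, split, train a toy classifier from scratch and return its
    accuracy on the held-out testing set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    train, test = shuffled[:n_train], shuffled[n_train:]

    # Toy stand-in for a classifier: threshold midway between class means.
    mean_0 = statistics.mean(x for x, y in train if y == 0)
    mean_1 = statistics.mean(x for x, y in train if y == 1)
    threshold = (mean_0 + mean_1) / 2

    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)


# Synthetic, balanced data set: (feature, label) pairs, not real waveforms.
data_rng = random.Random(42)
samples = [(data_rng.gauss(0.0, 1.0), 0) for _ in range(500)]
samples += [(data_rng.gauss(3.0, 1.0), 1) for _ in range(500)]

# Ten independent runs, each with its own random split.
accuracies = [accuracy_of_run(samples, seed) for seed in range(10)]
mean_accuracy = statistics.mean(accuracies)
std_accuracy = statistics.stdev(accuracies)
print(f"accuracy = {mean_accuracy:.3f} +/- {std_accuracy:.3f}")
```

Reporting the mean together with the standard deviation across runs shows how sensitive a score is to the particular random split, which a single train/test partition cannot reveal.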
The classification performance is further evaluated by the accuracy, precision, recall and F1 score. The definitions for the performance metrics are included in Appendix
The results from both the confusion matrices and the performance metrics strongly suggest that the SVM and CNN classifiers provide binary classification results with higher reliability than the TDS classifier and further that the CNN is the most reliable classifier overall. We therefore propose that the CNN classifier (or similar tools) should be considered for post-processing of the TDS data product in statistical studies of dust impacts observed by the Solar Orbiter RPW instrument.
In addition, it should be noted that 134 signals (i.e., 4.5 %), out of 3000 manually labeled waveforms, were marked as ambiguous, illustrated by the yellow cylinder in Fig.
Both the trained SVM and CNN classifiers are computationally inexpensive to run. One thousand observations are classified in 5 s using the SVM model, while the CNN classifier requires 50 s on a modern laptop, including the time needed for pre-processing and feature extraction. The proposed machine learning classifiers are therefore suitable for processing large data sets with thousands of new observations acquired every month as the Solar Orbiter continues its operation.
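As a rough worked example, these timings can be extrapolated to the 104 032 observations analyzed in this article. The sketch assumes that the run time scales linearly with the number of observations, which is not stated in the text; under that assumption, the full data set takes roughly 9 min with the SVM and roughly 87 min with the CNN.

```python
# Timings quoted in the text, in seconds per 1000 observations.
SECONDS_PER_1000 = {"SVM": 5.0, "CNN": 50.0}

# Size of the full data set analyzed in this article.
N_OBSERVATIONS = 104_032


def processing_minutes(n_observations, seconds_per_1000):
    """Estimated run time in minutes, assuming linear scaling (an assumption)."""
    return n_observations / 1000 * seconds_per_1000 / 60


estimates = {
    name: processing_minutes(N_OBSERVATIONS, rate)
    for name, rate in SECONDS_PER_1000.items()
}
for name, minutes in estimates.items():
    print(f"{name}: about {minutes:.1f} min")
```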
In this section, we use the trained classifiers to automatically process a large data set consisting of 104 032 observations. This data set contains all TSWF observations acquired over a 25-month period, from 15 June 2020 to 14 July 2022, that satisfy the criteria in Sect.
Furthermore, Fig.
Still, the shape of the dust impact signal is dependent on the local plasma environment, where influential parameters are as follows: the electron plasma density, the mean electron velocity and the electron temperature
Finally, we note that a dip in the SVM and CNN dust impact rates can be observed in Fig.
We have presented a machine learning-based framework for fully automated detection of dust impacts observed by the Solar Orbiter – Radio and Plasma Waves (RPW) instrument. Two different supervised machine learning approaches were considered: the support vector machine (SVM) and the convolutional neural network (CNN). The CNN classifier obtained the highest performance across all evaluation metrics and achieved 96 %
The SVM and CNN classifiers were used to analyze 104 032 observations acquired over a 25-month period, from 15 June 2020 to 14 July 2022. On average, the machine learning classifiers detected more dust particles than the currently used TDS algorithm; the SVM had a 16 %
The labeled data and the trained SVM and CNN classifiers are available online with included user instructions. The proposed method and the presented classifiers can thus provide the interplanetary dust community with thoroughly tested and more reliable data products than those currently in use. The daily dust count numbers from the CNN classification were employed by
The presented machine learning classifiers may be considered for on-board processing of the observed electric field signals. However, the trained SVM and CNN classifiers presented in this article are trained on Triggered Snapshot WaveForms (TSWF) data and should not be used for processing “untriggered” signals without additional training and testing on “untriggered” data. Additional training can also be used to further enhance the performance of the machine learning classifiers. In particular, adding labeled data acquired near the Sun (
It should also be noted that the classifiers presented in this work are trained and tested on data labeled by one scientist, albeit in consultation with other experts. Labeled data from several experts can provide machine learning classifiers that are more in line with the labeling consensus in the interplanetary dust community. Additional labeling can also be used to extend the machine learning classifiers to include automatic detection of other characteristic signatures, such as ion acoustic, Langmuir and solitary waves
Finally, we would like to highlight that a machine learning-based framework can be developed for automatic post-processing of data acquired by radio and plasma waves instruments on board other spacecraft, such as the Solar Terrestrial Relations Observatory (STEREO)
The manual labeling user interface showing a signal observed 28 July 2021.
The classification performance metrics are calculated using the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values, defined by comparing the predicted classes and the manually labeled classes, illustrated in Fig.
The overall accuracy of the classifier is the proportion of observations that were correctly predicted by the classifier. The accuracy is mathematically defined as
Precision (in this case) is defined as the proportion of data points predicted by the classifier as
Recall (in this case) is the proportion of observations manually labeled as
The F1 score is the harmonic mean of precision and recall and is calculated as
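The four metrics described in this appendix follow the standard definitions in terms of the TP, TN, FP and FN counts, and can be computed as in the minimal sketch below. The function name and the example counts are hypothetical and are not results from this work; the sketch does not guard against empty classes (division by zero).

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of labeled positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a testing set of 200 observations.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the F1 score is a harmonic mean, it is pulled toward the lower of precision and recall, so a classifier cannot obtain a high F1 score by excelling at only one of the two.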
The code used for this work, the trained classifiers, and the training and testing data sets can be accessed via the following link:
AK wrote the article text, trained and tested the machine learning classifiers and manually labeled the waveforms. KW aided the development of the machine learning classifiers, analyzed the machine learning performance metrics and commented/edited the article. SK performed the dust impact rate analysis, aided with the theoretical background and commented/edited the article. JV and LN contributed to the analysis of the TDS waveforms and the theoretical background. AZ and KRB contributed with the theoretical background and helpful discussions. AG contributed with knowledge of the Solar Orbiter data availability and discussions on the dust waveform shapes. DP and JS provided the data used for this work and explained the data content. IM is the main contributor to the theoretical background, aided the article with numerous comments/suggestions/discussions and shared knowledge that was crucial for this work.
The contact author has declared that none of the authors has any competing interests.
Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Andreas Kvammen thanks Audun Theodorsen for aiding the motivation and objective of the article. In addition, Andreas Kvammen thanks Juha Vierienen, Björn Gustavsson and Patrick Guio for helpful discussions. The authors would like to thank the reviewers for their valuable suggestions and appropriate comments.
Andreas Kvammen received support from the Research Council of Norway (grant nos. 262941 and 326039). Samuel Kociscak is supported by the Tromsø Research Foundation (grant no. 19_SG_AT). Jakub Vaverka, David Pisa and Jan Soucek received support from the Czech Science Foundation (grant no. 22-10775S). Kristina Rackovic Babic received support from the Ministry of Education, Science and Technological Development of the Republic of Serbia through contract no. 451-03-9/2021-14/200104.
This paper was edited by Gunter Stober and reviewed by two anonymous referees.