A note on the statistical evidence for an influence of geomagnetic activity on Northern Hemisphere seasonal-mean stratospheric temperatures using the Japanese 55-year Reanalysis

We employ JRA-55 (Japanese 55-year Reanalysis), a recent second-generation global reanalysis providing data of high quality in the stratosphere, to examine whether a distinguishable effect of geomagnetic activity on Northern Hemisphere stratospheric temperatures can be detected. We focus on how the statistical significance of stratospheric temperature differences may be robustly assessed during years with high and low geomagnetic activity. Two problems must be overcome. The first is the temporal autocorrelation of the data, which is addressed with a correction of the t statistics by means of the estimate of the number of independent values in the series of correlated values. The second is the problem of multiplicity due to strong spatial autocorrelations, which is addressed by means of a false discovery rate (FDR) procedure. We find that the statistical tests fail to formally reject the null hypothesis, i.e. no significant response to geomagnetic activity can be found in the seasonal-mean Northern Hemisphere stratospheric temperature record.


Introduction
There is a large interest in the potential climate impact of geomagnetic activity. One of the main mechanisms by which geomagnetic activity is thought to affect the middle atmosphere is through the production of nitrogen oxides (NO x ), either by the continuous precipitation of auroral electrons penetrating into the lower thermosphere (Sinnhuber et al., 2012) or by the more episodic precipitation of higher energy electrons into the mesosphere (Andersson et al., 2014;Päivärinta et al., 2016). Downward transport from the mesosphere to the stratosphere in winter results in the increased availability of NO x in the dark polar stratosphere, where it is long lived. NO x can catalytically reduce ozone concentrations as the Sun returns (Brasseur and Solomon, 1986;Callis et al., 2005), and thus alter radiative heating rates, with potential observable impacts on stratospheric temperatures and possible implications also for surface air temperature (SAT). The amount of NO x in the middle atmosphere during late winter and spring depends on the cumulative effect of geomagnetic activity over the preceding months on the NO x reservoir (Jacob, 1999). Stratospheric NO x concentrations however also depend on the magnitude of the downward transport from this reservoir and are thereby affected by internal variability of the atmospheric circulation from year to year, especially in the Northern Hemisphere (NH; Funke et al., 2005;Randall et al., 2006;Päivärinta et al., 2016).
The impact of energetic electron precipitation (EEP) driven by geomagnetic activity on NO x and ozone concentrations has been well documented after detailed satellite studies were carried out in the early 2000 (Funke et al., 2005;Randall et al., 2005). Several recent studies (Baumgaertner et al., 2010;Bucha, 2014;Lu et al., 2008;Seppälä et al., 2009Seppälä et al., , 2013 suggest a significant signal associated with geomagnetic activity in the observed climate. However, there remains considerable uncertainty regarding the precise attri-bution of such a signal, and the existence of a direct link between EEP and stratospheric and tropospheric temperatures has remained controversial. Among other things, the study of Seppälä et al. (2009), henceforth S09, in particular, claims to find a significant, direct relationship between the SAT and geomagnetic activity based on reanalysis data from the European Centre for Medium Range Weather Forecasts (ECMWF). In essence, S09 finds that the hypothesis that geomagnetic activity influences the SAT is supported by reanalysis data, whereas the null hypothesis that the SAT is not influenced by the geomagnetic activity at all is rejected. S09 compares the seasonal SAT in years with high and low geomagnetic activity and also considered the separate effect of the variation in solar irradiance associated with the 11-year heliomagnetic cycle.
The selection of years in S09 was based on two indices, Ap and f10.7. Ap (Rostoker, 1972) provides a measure for the daily average level of geomagnetic activity. To account for the cumulative effect of NO x production, transport and diffusion processes, Ap was commonly averaged over 4 months from late autumn to winter (Seppälä et al., 2009;Funke et al., 2014;Tomikawa, 2017). In particular, S09 used Ap averaged between October and January to define winters of high and low geomagnetic activity in the Northern Hemisphere. The second index, f10.7 (https://www.swpc. noaa.gov/phenomena/f107-cm-radio-emissions, last access: 13 March 2020), is an indicator of the phase and intensity of the solar cycle. By compositing separately on the basis of Ap and f10.7, S09 obtained different samples of seasonal-mean data for years with high geomagnetic activity and for years with low geomagnetic activity. They then computed the SAT differences of the seasonal means (December, January and February -DJF; March, April and May -MAM; June, July and August -JJA; and September, October and November -SON) between the two samples and employed a t test based on the set of daily means (Annika Seppälä, personal communication, 2018) used to compute the seasonal averages to discriminate against a null hypothesis of no effect. As a consequence of such a procedure, S09's claim of significance is marred by the presence of very strong temporal and spatial autocorrelation within the samples.
In this paper, we revisit the S09 hypothesis by adopting a rigorous methodology for significance testing on strongly autocorrelated data. We focus on wintertime stratospheric temperatures between 200 and 1 hPa, a prerequisite for possible surface impacts associated with EEP-related changes in ozone concentrations. Although our analysis is focused, in this paper on stratospheric temperature, we look at all the levels present in the dataset. We show that statistical testing appropriate to the data at hand is a crucial step in any analysis purporting to demonstrate an observed climate signal of geomagnetic activity. Data and methods are described in Sect. 2, including a discussion on the problem of autocorrelation in time and space. In Sect. 3, the results obtained by applying the t test to the stratospheric temperatures are shown. The analysis is applied to four different cases: with no correction at all, with the temporal and the spatial autocorrelation correction applied separately, and with both the corrections applied. In Sect. 4 conclusions are drawn.
2 Data and methods

Data
To analyse the possible impact of geomagnetic activity in the stratosphere, we use the Japanese 55-year Reanalysis (JRA-55) covering more than 55 years, extending from 1958 to the present (Kobayashi et al., 2015). Due to the selection of cases of high and low geomagnetic activity as in S09, only data up to 2006 are used here. In the JRA-55 reanalysis, ozone is used interactively in the radiation code, although it is treated differently in the pre-and post-1979 satellite era. This is an important asset for JRA-55, since the EEP will primarily affect NO x and ozone, and this feature is not commonly found in other reanalysis systems, such as the ECMWF reanalyses. Older-generation reanalyses tend to suffer from temporal inhomogeneities because of the sequential introduction of new satellite data during the assimilation period, especially in the Southern Hemisphere (SH) as shown recently by Long et al. (2017). For these various reasons, we restricted our analysis to the recent JRA-55 reanalysis. Tomikawa (2017) also used the JRA-55 reanalyses to investigate the signature of geomagnetic activity but focused exclusively on the SH. He found a temperature signal in the upper stratosphere, but only in July. The S09 selection shown in Table 1 is used to compute the significance of the seasonal differences. The criteria used to select the different years are based on the Ap and f10.7 values and are the same as used by S09. The definition of high and low geomagnetic activity is the same as S09. We hence investigate the potential signatures on stratospheric temperature during the same winters and in the following seasons of the same calendar year as S09 did for the SAT. The set of data is denominated N1 as in S09 (Table 1).

Data autocorrelation and statistical significance
S09 computed the SAT differences of the seasonal means (DJF, MAM, JJA and SON) between those selected high-Ap and low-Ap years and employed Welch's t test (hereafter only t test) to assess the likelihood of the differences given a null hypothesis of no effect. Such a test assumes a statistical model in which observations are normally distributed and statistically independent. In particular, the t test is sensitive to the temporal autocorrelation or serial correlation within the samples. When serial correlation is not taken into account in the data, statistically significant differences in two means, which may not be different at all, are found more frequently than expected (Zwiers and von Storch, 1995). S09's analysis is affected by this problem, because the authors used daily- Table 1. Years used to define the N1 set, following S09.

Case Hemisphere High-Ap years
Low-Ap years N1 NH 1962N1 NH , 1965N1 NH , 1966N1 NH , 1967N1 NH , 1958N1 NH , 1960N1 NH , 1961N1 NH , 1975N1 NH , 1968N1 NH , 1969N1 NH , 1970N1 NH , 1971N1 NH , 1982N1 NH , 1984N1 NH , 1985N1 NH , 1989N1 NH , 1972N1 NH , 1977N1 NH , 1978N1 NH , 1980N1 NH , 1990N1 NH , 1993N1 NH , 1994N1 NH , 1995N1 NH , 1981N1 NH , 1987N1 NH , 1988N1 NH , 1991N1 NH , 2003N1 NH , 2004N1 NH , 2005N1 NH 1996N1 NH , 1997N1 NH , 1998N1 NH , 1999N1 NH , 2001N1 NH , 2002N1 NH , 2006 mean data in their t test, which are highly autocorrelated in time. As seasonal averages can still suffer from temporal autocorrelation, the serial dependence is checked by means of the Durbin-Watson test (Durbin and Watson, 1950). While the serial correlation, in general, is reduced from seasonal averaging, it can still persist, especially in summer. To deal with such serial correlation, a correction is applied as suggested by Zwiers and von Storch (1995). The temporal autocorrelation is not the only potential caveat that needs to be considered when testing a hypothesis. When performing a significance test simultaneously on many samples, one will at some point find statistically significant temperature differences simply by accident. Unfortunately, the dominant approach to the multiplicity problem is generally to test the single grid points and then to report them as "significant" when the null hypothesis is locally rejected (Wilks, 2016). Sometimes temporal and spatial autocorrelation is not addressed at all, but, there are some exceptions. Maliniemi et al. (2014), for instance, while trying to find a relationship between solar activity and surface air temperature, dealt with temporal and spatial autocorrelation using a Monte Carlo approach. To overcome this multiplicity problem in our analysis, we apply the false discovery rate controlling introduced by Benjamini and Hochberg (1995) and proposed in the atmospheric sciences by Wilks (2006Wilks ( , 2016.

Accounting for temporal autocorrelation
The t test is a widely used method for hypothesis testing within the climate community. It is however well known that the t test assumes a statistical model where observations are statistically independent, and it is widely, but incorrectly, believed that the t test is valid only for normally distributed outcomes. Several authors (Efron, 1969;de Winter, 2013;Poncet et al., 2016) have shown that the t test is suitable under symmetric, not necessarily normal and asymmetric distributions. The t test is sensitive to time autocorrelation or serial correlation within the samples. The effect of serial correlation is, usually, to make comparisons of means which are too liberal. The null hypothesis, assuming equal means, is hence rejected more frequently than expected. Two separate reasons favour the use of seasonal-mean data instead of daily-mean data. The first reason is that any influence of EEP on temperature is expected to accumulate over seasonal timescales. The second reason is that daily temperatures are strongly serially correlated, whereas seasonal data have less correlation between 2 consecutive years for instance. In fact, one of the causes of the serial correlation is that the variable of interest varies seasonally. Nevertheless, even for seasonal means, it is important to account for serial correlations, as there may be other causes leading temporal autocorrelation, including persistence. Figure 1a shows the results of the Durbin-Watson test (Durbin and Watson, 1950) applied at the seasonal temperatures at 5 hPa. Similar pictures can be obtained by plotting the lag-1 autocorrelation (Fig. 1b), but the Durbin-Watson test, which is a classical test to check whether data are serially correlated, is better, compared to including the lagged response, as it tests for autocorrelation in the residuals, and it is suitable when in time series there are trends or seasonal patterns. When data are serially correlated, the test gives values close to zero, whereas when data are not correlated, the test statistic values, as a rule of thumb, are in the range of 1.5 to 2.5. There is also the possibility of serial anticorrelation: in such a case, the value would be above 2.5, but this situation was not found in our study. During the winter and spring seasons, the data generally do not have a very strong temporal autocorrelation, and the t test can be applied with a lower risk of obtaining falsepositive outcomes. There are some regions where the temporal autocorrelation still persists, such as over North America. Local higher autocorrelation values during other seasons can also be a result due also to low-frequency variance caused by large-scale teleconnections (Madden, 1977). During the summer season -and to a large extent also in autumn -data are very autocorrelated, but they will be analysed in any case, as it is worthwhile as well to show how the procedure used to assess the possible impact of the geomagnetic activity responds to serially correlated data. In general, autocorrelation is mainly due to the persistence of temperature patterns year by year. For instance, this is the case for example of the large value of temperature autocorrelation found during the summer season. However, we cannot exclude other causes, including a possible impact of the solar activity.
Serial correlation can be corrected for by adopting, for example, the strategy suggested by Zwiers and von Storch (1995). This procedure is valid under the assumption that the time series, from which the data are sampled, can be modelled as an autoregressive process of order 1 or AR(1). Vyushin et al. (2012) have shown that the AR(1) representation fits modelled stratospheric temperature data very well according to standard goodness-of-fit tests. Seidel and Lanzante (2004) found a similar result with temperature observed by radiosondes and satellites.
If EEP has a cumulative impact during the different seasons, it has to be shown that the means of two subsets with high-Ap (H) and low-Ap (L) values from the set N1 must be different.
To test the null hypothesis H 0 : µ H = µ L with the t statistics at the 5 % significance level one, let us apply the t test under the condition that the standard deviation is scaled by the equivalent sample sizes m e and n e that can be computed by where n is the original size of one out of two samples and ρ 1 is the parameter of the AR(1) process representing the autocorrelation at lag 1; this is similar for m e . The t test is then corrected in the following way: where H and L are the sample averages and s 2 is the pooled variance, (3)

Accounting for spatial autocorrelation
Spatial autocorrelation produces the so-called multiplicity problem, which arises when testing a statistical hypothesis on many samples (the domain's grid points, in our case) simultaneously. A single hypothesis test allows for a null hypothesis and an alternative hypothesis. The alternative hypothesis will be favoured when an extreme value, usually with a probability (called value) that is less than 5 %, is found (Wilks, 2016). Making a statistical test on multiple points, for example within a spatial domain, means that more realizations will be available and that there will be many grid points where one is more likely to reject the null hypothesis. In an ideal situation, where the value is set to 0.05 and each point is statistically independent of the others, it is expected to be found that 5 % of the points will be statistically significant by accident. The situation is worse when the grid points are correlated, as is often the case when analysing meteorological and climate data. This problem, known in the literature as the multiplicity problem, has been encountered in several studies, although most of the studies in the atmo- Figure 2. Northern Hemisphere seasonal differences in stratospheric temperature (high-Ap-low-Ap values) at 5 hPa without (a) and with (b) temporal (TCC) and spatial (FDR) autocorrelation correction. Grey areas represent statistically significant temperature differences at the 5 % confidence levels.
spheric sciences have not properly addressed the issue yet (Wilks, 2016). Some solutions have been proposed, each having their own advantages and disadvantages. Wilks (2016) gives a brief historical outline and shows different solutions to this problem. One technique to address this issue is by using the false discovery rate (Benjamini and Hochberg, 1995). According to Wilks (2006Wilks ( , 2016, the false discovery rate is the expectation of the fraction of true null hypothesis rejections among all the rejections, and it is the best available approach to analyse multiple hypothesis test results, even when those results are mutually correlated. As stated by Wilks (2016), the FDR (false discovery rate) procedure requires smaller values to reject the local null hypothesis, arising the standard of the test. For the sake of the reader, we will describe the FDR algorithm as described in Wilks (2016). The algorithm operates on the collection of H 0 : µ H = µ L values from m e (number of grid points) of local hypothesis tests p i , with i = 1, . . . , N, which are sorted in ascending order. Rejection of the test happens when the p i values are not larger than a threshold level p FDR that is a function of the distribution of the sorted p i values. More specifically, to define which values pass the test, the following formula is used: where α FDR is the chosen FDR control level that here is taken to equal 0.05. For a given value of α FDR , the largest value of i such that p i α FDR i N defines the threshold below which the local null hypotheses are rejected.

Stratospheric levels
We start with the application of the t test on 5 hPa temperature (Fig. 2), which represents the level where the statistically significant area is the largest among all the examined pressure levels. There are large areas with a statistically significant temperature difference at the 5 % level, especially during winter and summer.
At 5 hPa, the area with significant differences covers most of the Northern Hemisphere in JJA, but, as can be seen from the analysis of the Durbin-Watson test, the summer season exhibits a large temporal autocorrelation. Hence, the statistically significant areas observed in JJA should originate from this autocorrelation. In winter, the area with significant points cover North America, another region where the Durbin-Watson test suggests serial correlation. It is clear from Fig. 2 that a possible impact, of the geomagnetic activity, if it exists, would be limited at higher latitudes, from 40 to 90 • .  Because of the strong temporal autocorrelation, it is expected that at least in summer these significant differences should be false-positive outcomes, and they should be reduced or completely removed when applying the serial correlation correction. In fact, by applying the correction of serial dependence to the 5 hPa temperature differences, the t test results change dramatically, as Fig. 3 shows. The statistically significant differences are removed everywhere in JJA. However, in DJF small areas with significant differences are still present at that level. The Durbin-Watson test somehow predicted that there will be not significant points after applying the Zweirs and von Storch algorithm in the areas where the Durbin-Watson test value was close to zero.
On the other hand, the problem of multiplicity is solved here by means of the FDR procedure described in Sect. 2. When applying such a procedure without correcting the serial dependence, some significant temperature differences still persist at 5 hPa during summertime (Fig. 4). However, the only application of FDR is to remove all significant differences when it is applied to other pressure levels (e.g. 10 hPa). This result is important, as the FDR procedure is quite powerful in removing most of the false-positive differences, but, as Fig. 2 shows, it is not sufficient in presence of a strong temporal correlation that can still leave regions where the t test rejects the null hypothesis, when, in fact, it would be true. This result is particularly important, and it recommends the application of both the corrections strongly.
The application of such corrections dealing both with temporal and spatial autocorrelation removes all the statistically significant differences in the domain (Fig. 2b), and the t test with the combined correction fails to reject the null hypothesis.
A similar result is obtained for all the other levels in the dataset; temperature differences at 1 and 100 hPa temperatures are shown in Figs. 5 and 6 without and with both the corrections. The application of the false discovery rate on those fields eliminates all the significant temperature differences, showing that also at those levels, there is no detectable impact of geomagnetic activity on the atmospheric temperature.

Zonally averaged temperature and 2 m temperatures
Several studies have shown the possible impact of EEP or energetic particle precipitation on the observations using zonal mean temperatures (Tomikawa, 2017;Seppälä et al., 2013). Thus, we show how, without any correction, even the zonal mean temperature difference has areas that are statistically  www.ann-geophys.net/38/545/2020/ Ann. Geophys., 38, 545-555, 2020 Figure 7. Zonal mean temperature differences (high-Ap-low-Ap values) without (a) and with (b) temporal (TCC) and spatial (FDR) autocorrelation corrections. The grey areas indicate statistically significant temperature differences at the 5 % confidence level. significant at the 5 % level (Fig. 7a). In particular, there are statistically significant areas in all the seasons but spring between 10 and 1 hPa. There are no statically significant areas (Fig. 7b) after applying the two corrections that account for spatial and temporal autocorrelations. It is natural to think that EEP would influence upper-and mid-stratosphere temperatures through its impact on ozone. The results discussed in the previous sections suggest that the EEP influence on NH stratospheric temperatures is problematic to detect, as it is much weaker than other causes of variability, among which the internal dynamical variability is paramount. As this work is motivated by S09 that analysed the 2 m temperature, Fig. 8a shows the 2 m temperature difference (high-Ap-low-Ap values) without any correction. There are large areas where the seasonal temperature differences are statistically significant at the 5 % level.
The application of both the spatial and temporal autocorrelation corrections remove almost all these areas. However, some small areas of statistically significant temperature differences are still present. They are in the polar region and over Russia during the winter season and over the Scandinavia during the spring (Fig. 8b). As it is not easy to explain these statistically significant surface temperature differences with a causal relationship with EEP, given the lack of signal aloft, there may be some other reasons that can justify this significance with other causes, among which a positive outcome is obtained by chance.

Conclusions
Climate data often exhibit temporal and spatial autocorrelations which should be taken into account when testing a hypothesis, a task that is often neglected (Wilks, 2016). The Figure 8. Temperature differences at 2 m (high-Ap-low-Ap values) without (a) and with (b) temporal (TCC) and spatial (FDR) autocorrelation corrections. The grey areas indicate statistically significant temperature differences at the 5 % level. effect of temporal autocorrelation was addressed with an appropriate procedure described in Zwiers and von Storch (1995). The problem of evaluating the results of multiple hypothesis tests in a spatial domain was further addressed by means of the false discovery rate procedure. In this paper, the possible impact of geomagnetic activity on the seasonalmean stratospheric temperature in the JRA-55 reanalysis was evaluated by means of Welch's t test under four different cases: (1) with no correction of temporal and spatial autocorrelation, (2) with correction on temporal autocorrelation only, (3) with correction on spatial autocorrelation only, and finally (4) with both the corrections. Most of the cases examined show significant points when temporal and spatial autocorrelations are not corrected, while they do not show any significant point when including just one out of the two corrections. In other words, in most cases, there is not even a need to apply both corrections to infer that there is no impact of geomagnetic activity. However, the statistically significant temperature differences at 5 hPa show that it strongly recommended the application of both the corrections for the spatial and temporal autocorrelation. In some cases, like for the JJA temperature difference at 5 hPa, there are a few significant areas remaining when applying one out of the two corrections (Figs. 3 and 4), but those significant areas disappeared when both corrections were applied. Finally, the procedures to take into account these autocorrelations, the significance test typically fails to reject the null hypothesis. This result is found for all the pressure levels analysed and for zonally averaged temperature. The only temperature field that still has statistically significant differences after applying both the corrections is the 2 m temperature. There are two seasons, DJF and MAM, where small statistically significant areas are present in the polar region. In the absence of a signature aloft, we therefore conclude that, based on the JRA-55 reanalyses, not enough evidence is available at present to suggest that the null hypothesis of no impact of geomagnetic activity on NH stratospheric temperatures is false. A remaining caveat concerns the definition of seasons of high or low geomagnetic activity, which is here the same as in S09 and is based on a lagged 4-month-averaged Ap index (i.e. from October to January for wintertime geomagnetic activity). Some sensitivity studies to this definition, e.g. to treat more intense shorter episodes of EEP or to treat differently the seasonal lag or accumulation of EEP, are certainly warranted for future studies. It is clear that the absence or the presence of significance does not put an end to the research of a possible relationship between EEP and stratospheric temperature, which we suppose to be weak and consequently difficult to detect.
Data availability. Data can be downloaded from the Meteorological Research Institute, Japan Meteorological Agency, Japan, or from the Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory.