Significance testing is important for assessing whether a given initialized prediction system is skillful. Some questions that significance testing can answer are:

- Is the correlation coefficient of a lead time series significantly different from zero?
- What is the probability that the retrospective forecast is more valuable than a historical/uninitialized simulation?
- Are correlation coefficients statistically significant despite temporal and spatial autocorrelation?

All of these questions deal with statistical significance. See below on how to use climpred to address them. Please also have a look at the significance testing example.
p value for temporal correlations
For the correlation metrics, like _pearson_r(), climpred also hosts the associated p value, like _pearson_r_p_value(), which tests whether this correlation is significantly different from zero. _pearson_r_eff_p_value() also incorporates the reduced degrees of freedom due to temporal autocorrelation. See the significance testing example.
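As a minimal sketch, the snippet below verifies a HindcastEnsemble with the effective p value metric. It assumes climpred's tutorial datasets CESM-DP-SST and ERSST are available, and the keyword choices (comparison, dim, alignment) are just one reasonable configuration rather than a recommendation.

```python
import climpred

# Example hindcast and observations from climpred's tutorial data
# (any HindcastEnsemble with observations works the same way).
hind = climpred.tutorial.load_dataset("CESM-DP-SST")
obs = climpred.tutorial.load_dataset("ERSST")

hindcast = climpred.HindcastEnsemble(hind).add_observations(obs)

# p value of the lead-time correlation, accounting for the reduced
# effective degrees of freedom due to temporal autocorrelation.
eff_p = hindcast.verify(
    metric="pearson_r_eff_p_value",
    comparison="e2o",
    dim="init",
    alignment="same_verifs",
)

# Correlations are commonly called significant where eff_p < 0.05.
print(eff_p)
```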
Bootstrapping with replacement
Testing statistical significance through bootstrapping is commonly used in the field of climate prediction. Bootstrapping relies on resampling the underlying data with replacement for a large number of iterations, as proposed by the decadal prediction framework [Boer et al., 2016, Goddard et al., 2013]. This means that the initialized ensemble is resampled with replacement along a dimension (e.g., member) and then that resampled ensemble is verified against the observations. This leads to a distribution of initialized skill. Further, a reference forecast uses the resampled initialized ensemble, which creates a reference skill distribution. Lastly, an uninitialized skill distribution is created from the underlying historical members or the control simulation. The probability or p value is the fraction of these resampled initialized metrics beaten by the uninitialized or resampled reference metrics calculated from their respective distributions. Confidence intervals using these distributions are also computed.
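As a rough illustration, this resampling workflow can be driven through HindcastEnsemble.bootstrap(), reusing the hindcast object from the sketch above. The number of iterations, the resampling dimension, and the choice of reference forecasts below are illustrative assumptions, not recommendations.

```python
# Bootstrapped skill, p values, and confidence intervals
# (reusing the `hindcast` object constructed above).
boot = hindcast.bootstrap(
    metric="pearson_r",
    comparison="e2o",
    dim="init",
    alignment="same_verifs",
    iterations=500,            # number of resamples with replacement
    resample_dim="member",     # resample along the member dimension
    reference=["persistence"], # add "uninitialized" once an uninitialized
                               # ensemble has been attached via add_uninitialized()
)

# The result carries the verified skill, the p value against each reference,
# and the low/high confidence-interval bounds for every lead time.
print(boot)
```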
Sign test
Use DelSole's sign test, which relies on the statistics of a random walk, to decide whether one forecast is significantly better than another forecast [Benjamini and Hochberg, 1994, DelSole and Tippett, 2016]; see the sign test example.
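As a hedged sketch of the underlying idea (implemented here directly in numpy rather than through a library routine; the function name and synthetic data are illustrative, not part of climpred): under the null hypothesis that both forecasts are equally good, the running count of "wins" minus "losses" behaves like a random walk, so the difference is judged significant once it leaves the ±1.96·√N band at the 95% level.

```python
import numpy as np

def sign_test(errors_a, errors_b):
    """Random-walk sign test after DelSole & Tippett (2016), illustrative only.

    errors_a, errors_b: absolute errors of two forecasts at N common times.
    Returns the final walk value, the 95% threshold, and a significance flag.
    """
    errors_a = np.asarray(errors_a)
    errors_b = np.asarray(errors_b)
    # +1 when forecast A beats B, -1 when B beats A, 0 for ties.
    steps = np.sign(errors_b - errors_a)
    walk = np.cumsum(steps)
    n = len(steps)
    # Under H0 the walk at step n has variance ~ n; 1.96*sqrt(n) is the 95% band.
    threshold = 1.96 * np.sqrt(n)
    significant = abs(walk[-1]) > threshold
    return walk[-1], threshold, significant

# Tiny synthetic example: forecast A has slightly smaller errors on average.
rng = np.random.default_rng(0)
err_a = np.abs(rng.normal(scale=0.8, size=200))
err_b = np.abs(rng.normal(scale=1.0, size=200))
print(sign_test(err_a, err_b))
```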
References
Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):13, 1994. doi:10.1111/j.2517-6161.1995.tb02031.x.
G. J. Boer, D. M. Smith, C. Cassou, F. Doblas-Reyes, G. Danabasoglu, B. Kirtman, Y. Kushnir, M. Kimoto, G. A. Meehl, R. Msadek, W. A. Mueller, K. E. Taylor, F. Zwiers, M. Rixen, Y. Ruprich-Robert, and R. Eade. The Decadal Climate Prediction Project (DCPP) contribution to CMIP6. Geosci. Model Dev., 9(10):3751–3777, October 2016. doi:10/f89qdf.
Timothy DelSole and Michael K. Tippett. Forecast Comparison Based on Random Walks. Monthly Weather Review, 144(2):615–626, February 2016. doi:10/f782pf.
L. Goddard, A. Kumar, A. Solomon, D. Smith, G. Boer, P. Gonzalez, V. Kharin, W. Merryfield, C. Deser, S. J. Mason, B. P. Kirtman, R. Msadek, R. Sutton, E. Hawkins, T. Fricker, G. Hegerl, C. a. T. Ferro, D. B. Stephenson, G. A. Meehl, T. Stockdale, R. Burgman, A. M. Greene, Y. Kushnir, M. Newman, J. Carton, I. Fukumori, and T. Delworth. A verification framework for interannual-to-decadal predictions experiments. Climate Dynamics, 40(1-2):245–272, January 2013. doi:10/f4jjvf.
D. S. Wilks. “The Stippling Shows Statistically Significant Grid Points”: How Research Results are Routinely Overstated and Overinterpreted, and What to Do about It. Bulletin of the American Meteorological Society, 97(12):2263–2273, March 2016. doi:10/f9mvth.