####################
Significance Testing
####################
Significance testing is important for assessing whether a given initialized prediction
system is skillful. Some questions that significance testing can answer are:

- Is the correlation coefficient of a lead time series significantly different from
  zero?
- What is the probability that the retrospective forecast is more valuable than a
  historical/uninitialized simulation?
- Are correlation coefficients statistically significant despite temporal and
  spatial autocorrelation?

All of these questions deal with statistical significance. See below for how to use
``climpred`` to address them.

Please also have a look at the
`significance testing example `__.

p value for temporal correlations
#################################

For the correlation `metrics `__, such as
:py:func:`~climpred.metrics._pearson_r` and :py:func:`~climpred.metrics._spearman_r`,
``climpred`` also hosts the associated p-value, e.g.
:py:func:`~climpred.metrics._pearson_r_p_value`, which tests whether the correlation
is significantly different from zero.
:py:func:`~climpred.metrics._pearson_r_eff_p_value` additionally accounts for the
reduced degrees of freedom due to temporal autocorrelation. See the
`example `__.
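
To illustrate the idea outside of ``climpred``, here is a standalone sketch using
:py:func:`scipy.stats.pearsonr` and one commonly used effective-sample-size
formulation based on lag-1 autocorrelation. The data and helper names are
hypothetical, and the adjustment shown is an illustration of the concept, not
``climpred``'s exact implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30

# Hypothetical autocorrelated observations (AR(1) process) and a
# forecast that tracks them; not climpred data.
obs = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    obs[t] = 0.6 * obs[t - 1] + eps[t]
forecast = obs + rng.normal(scale=0.5, size=n)

# Standard p-value: is the correlation significantly different from zero?
r, p = stats.pearsonr(forecast, obs)

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-d series."""
    x = x - x.mean()
    return (x[:-1] * x[1:]).sum() / (x ** 2).sum()

# Effective sample size reduced by temporal autocorrelation:
# n_eff = n * (1 - r1f * r1o) / (1 + r1f * r1o)
r1f, r1o = lag1_autocorr(forecast), lag1_autocorr(obs)
n_eff = n * (1 - r1f * r1o) / (1 + r1f * r1o)

# t-test for the correlation with the reduced degrees of freedom;
# p_eff >= p because autocorrelation makes significance harder to reach.
t = r * np.sqrt((n_eff - 2) / (1 - r ** 2))
p_eff = 2 * stats.t.sf(np.abs(t), df=n_eff - 2)
```

With positively autocorrelated series, ``n_eff`` is smaller than ``n``, so the
effective p-value is larger than the naive one.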

Bootstrapping with replacement
##############################

Testing statistical significance through bootstrapping is commonly used in the field
of climate prediction. Bootstrapping relies on resampling the underlying data with
replacement for a large number of ``iterations``, as proposed by the decadal
prediction framework :cite:p:`Goddard2013,Boer2016`.

The ``initialized`` ensemble is resampled with replacement along a dimension
(``init`` or ``member``), and each resampled ensemble is verified against the
observations. This yields a distribution of ``initialized`` skill. Further, a
``reference`` forecast uses the resampled ``initialized`` ensemble, which creates a
``reference`` skill distribution, and an ``uninitialized`` skill distribution is
created from the underlying historical members or the control simulation.

The p value is the fraction of the resampled ``initialized`` metrics that are beaten
by the ``uninitialized`` or resampled ``reference`` metrics calculated from their
respective distributions. Confidence intervals are also derived from these
distributions.

This behavior is implemented in :py:meth:`.HindcastEnsemble.bootstrap` and
:py:meth:`.PerfectModelEnsemble.bootstrap`, see the
`example `__.
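
The resampling logic described above can be sketched with plain NumPy. This is a
deliberately simplified, hypothetical setup (a single ``init`` dimension, RMSE as
the metric, both forecasts resampled the same way), not ``climpred``'s actual
bootstrap routine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 initializations, a skillful initialized forecast
# and a no-skill uninitialized one.
n_init, iterations = 20, 1000
obs = rng.normal(size=n_init)
initialized = obs + rng.normal(scale=0.5, size=n_init)   # tracks obs
uninitialized = rng.normal(size=n_init)                  # unrelated to obs

def rmse(forecast, verif):
    return np.sqrt(np.mean((forecast - verif) ** 2))

# Resample along the init dimension with replacement and verify each
# resampled ensemble, building up skill distributions.
init_skill, uninit_skill = [], []
for _ in range(iterations):
    idx = rng.integers(0, n_init, size=n_init)  # sample with replacement
    init_skill.append(rmse(initialized[idx], obs[idx]))
    uninit_skill.append(rmse(uninitialized[idx], obs[idx]))
init_skill = np.array(init_skill)
uninit_skill = np.array(uninit_skill)

# p value: fraction of iterations in which the uninitialized forecast
# beats (here: has lower RMSE than) the initialized forecast.
p = (uninit_skill < init_skill).mean()

# 95% confidence interval of initialized skill from its distribution.
ci_low, ci_high = np.quantile(init_skill, [0.025, 0.975])
```

A small ``p`` indicates that initialization adds value beyond the uninitialized
baseline for this metric.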

Field significance
##################

Please use :py:func:`esmtools.testing.multipletests` to control the false discovery
rate (FDR) in geospatial data from the p-values obtained above :cite:p:`Wilks2016`.
See the `FDR example `__.
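
To show the idea behind FDR control, here is a minimal NumPy implementation of the
Benjamini-Hochberg step-up procedure applied to a hypothetical grid of p-values.
The function name and data are illustrative only; in practice, prefer the tested
:py:func:`esmtools.testing.multipletests`:

```python
import numpy as np

def fdr_significant(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of p-values
    deemed significant while controlling the FDR at level alpha."""
    p = np.asarray(p_values)
    n = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest rank k (1-indexed) with p_(k) <= alpha * k / n,
    # then reject all hypotheses up to and including rank k.
    below = ranked <= alpha * np.arange(1, n + 1) / n
    mask = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        mask[order[: k + 1]] = True
    return mask

# Hypothetical "grid" of 100 p-values: five strong signals among noise.
rng = np.random.default_rng(1)
p_grid = rng.uniform(size=100)
p_grid[:5] = 1e-4
significant = fdr_significant(p_grid, alpha=0.1)
```

Compared to a naive pointwise test at ``alpha``, this limits the expected fraction
of falsely flagged grid points among all flagged points.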

Sign test
#########

Use DelSole's sign test, which relies on the statistics of a random walk, to decide
whether one forecast is significantly better than another
:cite:p:`Benjamini1994,DelSole2016`; see :py:func:`xskillscore.sign_test` and the
`sign test example `__.
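
The random-walk idea can be sketched as follows with hypothetical paired forecast
errors; :py:func:`xskillscore.sign_test` is the actual implementation to use, and
the 95% bound below follows the walk-versus-confidence-envelope picture of
:cite:t:`DelSole2016`:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Hypothetical absolute errors of two forecasts at 50 verification times;
# forecast A is constructed to have smaller errors than forecast B.
n = 50
err_a = np.abs(rng.normal(scale=0.5, size=n))
err_b = np.abs(rng.normal(scale=1.5, size=n))

# Random walk: step +1 where A beats B, -1 otherwise, accumulated in time.
steps = np.where(err_a < err_b, 1, -1)
walk = steps.cumsum()

# Under the null hypothesis of equally good forecasts, the walk at step k
# stays within +/- z * sqrt(k) at the chosen confidence level (95% here).
k = np.arange(1, n + 1)
bound = norm.ppf(0.975) * np.sqrt(k)

# If the walk ends above the envelope, A is significantly better than B.
significantly_better = walk[-1] > bound[-1]
```

The envelope grows like ``sqrt(k)``, so a genuinely better forecast, whose walk
drifts linearly, will eventually escape it.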

References
##########

.. bibliography::
   :filter: docname in docnames