# Significance Testing

Significance testing is important for assessing whether a given initialized prediction system is skillful. Some questions that significance testing can answer are:

• Is the correlation coefficient of a lead time series significantly different from zero?

• What is the probability that the retrospective forecast is more valuable than a historical/uninitialized simulation?

• Are correlation coefficients statistically significant despite temporal and spatial autocorrelation?

All of these questions deal with statistical significance. The sections below describe how to use `climpred` to address them. Please also have a look at the significance testing example.

## p value for temporal correlations

For the correlation metrics, such as `_pearson_r()` and `_spearman_r()`, `climpred` also provides the associated p-value metrics, such as `_pearson_r_p_value()`, which test whether the correlation is significantly different from zero. `_pearson_r_eff_p_value()` additionally accounts for the reduced degrees of freedom caused by temporal autocorrelation. See example.
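To illustrate what these two kinds of p value measure, here is a minimal standalone sketch using NumPy and SciPy rather than `climpred` itself. The function names `lag1_autocorr` and `pearson_r_p_values` are hypothetical. The effective sample size uses a common lag-1 autocorrelation adjustment, `n_eff = n * (1 - r1) / (1 + r1)` with `r1` the product of the two lag-1 autocorrelations, which is an assumption of this sketch and may differ in detail from what `climpred` computes.

```python
import numpy as np
from scipy import stats


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a 1-D series."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]


def pearson_r_p_values(forecast, obs):
    """Return (r, p, p_eff) for two 1-D series of equal length.

    p     : two-sided p value with n - 2 degrees of freedom
    p_eff : same test, but with an effective sample size that is
            reduced by lag-1 autocorrelation (assumed formula)
    """
    forecast = np.asarray(forecast, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = forecast.size
    r = np.corrcoef(forecast, obs)[0, 1]

    def two_sided_p(dof):
        # Student's t statistic for a correlation coefficient.
        t = r * np.sqrt(dof / (1.0 - r**2))
        return 2.0 * stats.t.sf(abs(t), dof)

    p = two_sided_p(n - 2)

    r1 = lag1_autocorr(forecast) * lag1_autocorr(obs)
    n_eff = n * (1.0 - r1) / (1.0 + r1)
    # Sketch-level choice: keep the degrees of freedom positive
    # and never larger than the nominal sample size.
    n_eff = min(max(n_eff, 3.0), n)
    p_eff = two_sided_p(n_eff - 2)
    return r, p, p_eff


# Synthetic example: two noisy copies of the same red-noise (AR(1)) signal.
rng = np.random.default_rng(0)
n = 50
signal = np.zeros(n)
for t in range(1, n):
    signal[t] = 0.8 * signal[t - 1] + rng.normal()
forecast = signal + rng.normal(size=n)
obs = signal + rng.normal(size=n)

r, p, p_eff = pearson_r_p_values(forecast, obs)
print(f"r = {r:.2f}, p = {p:.2e}, p_eff = {p_eff:.2e}")
```

Because autocorrelated series carry fewer independent samples than their length suggests, `p_eff` in this sketch can never be smaller than `p`.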

## Bootstrapping with replacement

Testing statistical significance through bootstrapping is common in the field of climate prediction. Bootstrapping relies on resampling the underlying data with replacement for a large number of `iterations`, as proposed by the decadal prediction framework of Goddard et al. 2013 [Goddard2013]. The `initialized` ensemble is resampled with replacement along a dimension (`init` or `member`), and the resampled ensemble is then verified against the observations. Repeating this yields a distribution of `initialized` skill. Further, a `reference` forecast uses the resampled `initialized` ensemble, which creates a `reference` skill distribution. Lastly, an `uninitialized` skill distribution is created from the underlying historical members or the control simulation.

The p value is the fraction of these resampled `initialized` metrics that are beaten by the `uninitialized` or resampled `reference` metrics drawn from their respective distributions. Confidence intervals are also calculated from these distributions.

This behavior is implemented in `bootstrap()`; see example.
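The resampling idea can be sketched in plain NumPy for a single lead time. This is an illustration under simplifying assumptions, not the `climpred` implementation: the helper names `acc` and `bootstrap_p_value` are hypothetical, skill is the correlation of the ensemble mean with the observations, only the `member` dimension is resampled, and the p value is taken as the fraction of iterations in which the resampled uninitialized skill matches or beats the resampled initialized skill.

```python
import numpy as np


def acc(forecast, obs):
    """Correlation between two 1-D series, used here as the skill metric."""
    return np.corrcoef(forecast, obs)[0, 1]


def bootstrap_p_value(initialized, uninitialized, obs, iterations=500, seed=0):
    """Bootstrap skill over members.

    initialized, uninitialized : arrays shaped (member, time)
    obs                        : array shaped (time,)
    Returns (p, ci), where ci is the 95% interval of initialized skill.
    """
    rng = np.random.default_rng(seed)
    n_init = initialized.shape[0]
    n_uninit = uninitialized.shape[0]
    init_skill = np.empty(iterations)
    uninit_skill = np.empty(iterations)

    for i in range(iterations):
        # Draw member indices with replacement, then verify the ensemble mean.
        idx = rng.integers(0, n_init, size=n_init)
        init_skill[i] = acc(initialized[idx].mean(axis=0), obs)
        idx = rng.integers(0, n_uninit, size=n_uninit)
        uninit_skill[i] = acc(uninitialized[idx].mean(axis=0), obs)

    # Fraction of iterations in which initialized skill is beaten (or tied).
    p = float(np.mean(uninit_skill >= init_skill))
    ci = np.quantile(init_skill, [0.025, 0.975])
    return p, ci


# Synthetic example: initialized members track the signal, uninitialized do not.
rng = np.random.default_rng(42)
n_time, n_member = 40, 10
signal = rng.normal(size=n_time)
obs = signal + 0.5 * rng.normal(size=n_time)
initialized = signal + rng.normal(size=(n_member, n_time))
uninitialized = rng.normal(size=(n_member, n_time))

p, ci = bootstrap_p_value(initialized, uninitialized, obs)
print(f"p = {p:.3f}, 95% CI of initialized skill = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

A small p value here means that the uninitialized ensemble rarely matches the initialized ensemble's skill across resamples.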

## Field significance

Please use `esmtools.testing.multipletests()` to control the false discovery rate (FDR) in geospatial data from the p-values obtained above [Wilks2016]. See the FDR example.
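To show what controlling the FDR does to a map of p-values, the following sketch implements the Benjamini-Hochberg procedure by hand in NumPy. It is an illustration only and does not reproduce the `esmtools` function or its signature; the name `fdr_benjamini_hochberg` and the synthetic p-value map are hypothetical.

```python
import numpy as np


def fdr_benjamini_hochberg(p_values, alpha=0.1):
    """Benjamini-Hochberg FDR control.

    Returns a boolean mask with the shape of p_values; True marks grid
    cells that remain significant at false discovery rate alpha.
    """
    p = np.asarray(p_values, dtype=float)
    flat = p.ravel()
    m = flat.size
    order = np.argsort(flat)
    # Rank-dependent thresholds alpha * i / m for i = 1..m.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = flat[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every test up to the largest rank that meets its threshold.
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject.reshape(p.shape)


# Synthetic p-value map: mostly noise, plus one patch of real signal.
rng = np.random.default_rng(1)
p_map = rng.uniform(size=(20, 30))
p_map[5:10, 5:10] = rng.uniform(0, 1e-4, size=(5, 5))

mask = fdr_benjamini_hochberg(p_map, alpha=0.1)
naive = p_map <= 0.1
print("cells significant without correction:", int(naive.sum()))
print("cells significant with FDR control:  ", int(mask.sum()))
```

Thresholding each grid cell separately flags many noise cells by chance, whereas the FDR-controlled mask is always a subset of the uncorrected one at the same `alpha`.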

## Sign test

Use DelSole's sign test, which relies on the statistics of a random walk, to decide whether one forecast is significantly better than another [DelSole2016]. See the sign test example.
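The random-walk idea behind the sign test can be sketched as follows. This is a simplified illustration, not the implementation referenced above: the function name `sign_test` is hypothetical, absolute error is assumed as the measure of closeness, and the significance bound uses the normal approximation `z * sqrt(n)` for an unbiased random walk after `n` steps.

```python
import numpy as np


def sign_test(forecast_a, forecast_b, obs, z=1.96):
    """Random-walk sign test comparing two forecasts.

    Each verification time contributes +1 if forecast A is closer to the
    observation, -1 if forecast B is closer, and 0 for a tie.
    Returns (walk, bound): the cumulative walk and the approximate
    significance bound z * sqrt(n) under the null of equal skill.
    """
    obs = np.asarray(obs)
    err_a = np.abs(np.asarray(forecast_a) - obs)
    err_b = np.abs(np.asarray(forecast_b) - obs)
    steps = np.sign(err_b - err_a)
    walk = np.cumsum(steps)
    n = np.arange(1, walk.size + 1)
    bound = z * np.sqrt(n)
    return walk, bound


# Synthetic example: forecast A has smaller errors than forecast B.
rng = np.random.default_rng(7)
n_time = 100
obs = rng.normal(size=n_time)
forecast_a = obs + 0.5 * rng.normal(size=n_time)
forecast_b = obs + 1.5 * rng.normal(size=n_time)

walk, bound = sign_test(forecast_a, forecast_b, obs)
if walk[-1] > bound[-1]:
    print("Forecast A is significantly better than forecast B.")
elif walk[-1] < -bound[-1]:
    print("Forecast B is significantly better than forecast A.")
else:
    print("No significant difference between the forecasts.")
```

The walk only counts which forecast was closer at each time, not by how much, so the test makes no assumption about the distribution of the errors.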