Setting up your own output

Contents

Setting up your own output#

This demo demonstrates how you can setup your raw model output with climpred.preprocessing to match climpred’s expectations.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

import climpred

/home/docs/checkouts/readthedocs.org/user_builds/climpred/conda/stable/lib/python3.10/site-packages/xclim/sdba.py:12: UserWarning: The `xclim.sdba` module has been split into its own package `xsdba`. Users are encouraged to use `xsdba` directly. For the time being, `xclim.sdba` will import `xsdba` to allow for API compatibility. This behaviour may change in the future. For more information, see: https://xsdba.readthedocs.io/en/stable/xclim_migration_guide.html
  warnings.warn(
/home/docs/checkouts/readthedocs.org/user_builds/climpred/conda/stable/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

from climpred.preprocessing.shared import load_hindcast, set_integer_time_axis
from climpred.preprocessing.mpi import get_path

Assuming your raw model output is stored in multiple files per member and initialization, climpred.preprocessing.shared.load_hindcast() is a nice wrapper function based on climpred.preprocessing.mpi.get_path() designed for the output format of MPI-ESM to aggregated all hindcast output into one file as expected by climpred.

The basic idea is to look over the output of all members and concatinate, then loop over all initializations and concatinate. Before concatination, it is important to make the time dimension identical in all input datasets for concatination.

To reduce the data size, use the preprocess function provided to xr.open_mfdataset wisely in combination with climpred.preprocessing.shared.set_integer_axis(), e.g. additionally extracting only a certain region, time-step, time-aggregation or only few variables for a multi-variable input file as in MPI-ESM standard output.

v = "global_primary_production"

def preprocess_1var(ds, v=v):
    """Only leave one variable `v` in dataset """
    return ds[v].to_dataset(name=v).squeeze()

# lead_offset because yearmean output
%time ds = load_hindcast(inits=range(1961, 1965), members=range(1, 3), preprocess=preprocess_1var, get_path=get_path)

Processing init 1961 ...

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
File <timed exec>:1

File ~/checkouts/readthedocs.org/user_builds/climpred/conda/stable/lib/python3.10/site-packages/climpred/preprocessing/shared.py:49, in load_hindcast(inits, members, preprocess, lead_offset, parallel, engine, get_path, **get_path_kwargs)
     46 member_list = []
     47 for member in members:
     48     # get path p
---> 49     p = get_path(member=member, init=init, **get_path_kwargs)
     50     # open all leads for specified member and init
     51     member_ds = xr.open_mfdataset(
     52         p,
     53         combine="nested",
   (...)
     60         compat="override",  # speed up
     61     ).squeeze()

File ~/checkouts/readthedocs.org/user_builds/climpred/conda/stable/lib/python3.10/site-packages/climpred/preprocessing/mpi.py:34, in get_path(dir_base_experiment, member, init, model, output_stream, timestr, ending)
     14 """Get the path of a file for MPI-ESM standard output file names and directory.
     15 
     16 Args:
   (...)
     31 
     32 """
     33 # get experiment_id
---> 34 dirs = _os.listdir(dir_base_experiment)
     35 experiment_id = [x for x in dirs if (f"{init}" in x and f"r{member}i" in x)]
     36 if len(experiment_id) != 1:

FileNotFoundError: [Errno 2] No such file or directory: '/work/bm1124/m300086/CMIP6/experiments'

# what we need for climpred
ds.coords

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[5], line 2
      1 # what we need for climpred
----> 2 ds.coords

NameError: name 'ds' is not defined

ds[v].data

	Array	Chunk
Bytes	320 B	4 B
Shape	(4, 2, 10)	(1, 1, 1)
Count	720 Tasks	80 Chunks
Type	float32	numpy.ndarray

# loading the data into memory
# if not rechunk
%time ds = ds.load()

CPU times: user 216 ms, sys: 44 ms, total: 260 ms
Wall time: 220 ms

# go on with creation of PredictionEnsemble
# climpred.HindcastEnsemble(ds).add_observations(obs).verify(metric='acc', comparison='e2o', dim='init', alignment='maximize')
# climpred.PerfectModelEnsemble(ds).add_control(control).verify(metric='acc', comparison='m2e', dim=['init','member'])

`intake-esm` for cmorized output#

In case you have access to cmorized output of CMIP experiments, consider using intake-esm <https://intake-esm.readthedocs.io/en/stable/>_. With climpred.preprocessing.shared.set_integer_time_axis() you can align the time dimension of all input files. Finally, climpred.preprocessing.shared.rename_to_climpred_dims() only renames.

from climpred.preprocessing.shared import rename_to_climpred_dims, set_integer_time_axis

# make to have to install intake-esm installed, which is not included in climpred-dev
import intake # this is enough for intake-esm to work

col_url = "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json"
col_url = "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
col = intake.open_esm_datastore(col_url)

col.df.columns

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'dcpp_init_year',
       'version', 'time_range', 'path'],
      dtype='object')

# load 2 members for 2 inits for one variable from one model
query = dict(experiment_id=[
    'dcppA-hindcast'], table_id='Amon', member_id=['r1i1p1f1', 'r2i1p1f1'], dcpp_init_year=[1970, 1971],
    variable_id='tas', source_id='MPI-ESM1-2-HR')
cat = col.search(**query)
cdf_kwargs = {'chunks': {'time': 12}, 'decode_times': False}

cat.df.head()

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	dcpp_init_year	version	time_range	path
0	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r1i1p1f1	Amon	tas	gn	1971.0	v20190906	197111-198112	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...
1	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r1i1p1f1	Amon	tas	gn	1970.0	v20190906	197011-198012	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...
2	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r2i1p1f1	Amon	tas	gn	1971.0	v20190906	197111-198112	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...
3	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r2i1p1f1	Amon	tas	gn	1970.0	v20190906	197011-198012	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...

def preprocess(ds):
    # extract tiny spatial and temporal subset to make this fast
    ds = ds.isel(lon=[50, 51, 52], lat=[50, 51, 52],
                 time=np.arange(12 * 2))
    # make time dim identical
    ds = set_integer_time_axis(ds,time_dim='time')
    return ds

dset_dict = cat.to_dataset_dict(
    cdf_kwargs=cdf_kwargs, preprocess=preprocess)

Progress: |███████████████████████████████████████████████████████████████████████████████| 100.0% 

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
             
--> There are 1 group(s)

# get first dict value
_, ds = dset_dict.popitem()
ds.coords

Coordinates:
    height          float64 ...
  * dcpp_init_year  (dcpp_init_year) float64 1.97e+03 1.971e+03
  * lat             (lat) float64 -42.55 -41.61 -40.68
  * time            (time) int64 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24
  * lon             (lon) float64 46.88 47.81 48.75
  * member_id       (member_id) <U8 'r1i1p1f1' 'r2i1p1f1'

# rename to comply with climpred's required dimension names
ds = rename_to_climpred_dims(ds)

# what we need for climpred
ds.coords

Coordinates:
    height   float64 ...
  * init     (init) float64 1.97e+03 1.971e+03
  * lat      (lat) float64 -42.55 -41.61 -40.68
  * lead     (lead) int64 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23 24
  * lon      (lon) float64 46.88 47.81 48.75
  * member   (member) <U8 'r1i1p1f1' 'r2i1p1f1'

ds['tas'].data

	Array	Chunk
Bytes	3.46 kB	432 B
Shape	(2, 2, 24, 3, 3)	(1, 1, 12, 3, 3)
Count	176 Tasks	8 Chunks
Type	float32	numpy.ndarray

# loading the data into memory
# if not rechunk
# this is here quite fast before we only select 9 grid cells
%time ds = ds.load()

CPU times: user 218 ms, sys: 74 ms, total: 292 ms
Wall time: 237 ms

# go on with creation of PredictionEnsemble
# climred.HindcastEnsemble(ds).add_observations(obs).verify(metric='acc', comparison='e2o', dim='init', alignment='maximize')
# climred.PerfectModelEnsemble(ds).add_control(control).verify(metric='acc', comparison='m2e', dim=['init','member'])