Setting up your own output¶

This demo demonstrates how you can setup your raw model output with climpred.preprocessing to match climpred’s expectations.

[1]:

from dask.distributed import Client
import multiprocessing
ncpu = multiprocessing.cpu_count()
threads = 6
nworker = ncpu//threads
print(f'Number of CPUs: {ncpu}, number of threads: {threads}, number of workers: {nworker}')

Number of CPUs: 48, number of threads: 6, number of workers: 8

[2]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

import climpred

[3]:

from climpred.preprocessing.shared import load_hindcast, set_integer_time_axis
from climpred.preprocessing.mpi import get_path

Assuming your raw model output is stored in multiple files per member and initialization, load_hindcast is a nice wrapper function based on get_path designed for the output format of MPI-ESM to aggregated all hindcast output into one file as expected by climpred.

The basic idea is to look over the output of all members and concatinate, then loop over all initializations and concatinate. Before concatination, it is important to make the time dimension identical in all input datasets for concatination.

To reduce the data size, use the preprocess function provided to xr.open_mfdataset wisely in combination with set_integer_axis, e.g. additionally extracting only a certain region, time-step, time-aggregation or only few variables for a multi-variable input file as in MPI-ESM standard output.

[4]:

# check the code of load_hindcast
# load_hindcast??

[5]:

v = "global_primary_production"

def preprocess_1var(ds, v=v):
    """Only leave one variable `v` in dataset """
    return ds[v].to_dataset(name=v).squeeze()

[6]:

# lead_offset because yearmean output
%time ds = load_hindcast(inits=range(1961, 1965), members=range(1, 3), preprocess=preprocess_1var, get_path=get_path)

Processing init 1961 ...
Processing init 1962 ...
Processing init 1963 ...
Processing init 1964 ...
CPU times: user 5.07 s, sys: 2.06 s, total: 7.13 s
Wall time: 5.19 s

[7]:

# what we need for climpred
ds.coords

[7]:

Coordinates:
    depth    float64 0.0
    lat      float64 0.0
    lon      float64 0.0
  * lead     (lead) int64 1 2 3 4 5 6 7 8 9 10
  * member   (member) int64 1 2
  * init     (init) int64 1961 1962 1963 1964

[8]:

ds[v].data

[8]:

	Array	Chunk
Bytes	320 B	4 B
Shape	(4, 2, 10)	(1, 1, 1)
Count	720 Tasks	80 Chunks
Type	float32	numpy.ndarray

[9]:

# loading the data into memory
# if not rechunk
%time ds = ds.load()

CPU times: user 216 ms, sys: 44 ms, total: 260 ms
Wall time: 220 ms

[10]:

# you may actually want to use `compute_hindcast` to calculate skill from hindcast.
# for this you also need an `observation` to compare to
# here `compute_perfect_model` compares one member to the ensemble mean of the remain members in turn
climpred.prediction.compute_perfect_model(ds, ds.rename({'lead':'time'}))

[10]:

<xarray.Dataset>
Dimensions:                    (lead: 10)
Coordinates:
    depth                      float64 0.0
    lon                        float64 0.0
    lat                        float64 0.0
  * lead                       (lead) int64 1 2 3 4 5 6 7 8 9 10
Data variables:
    global_primary_production  (lead) float64 -0.1276 0.593 ... 0.09196 0.5038
Attributes:
    prediction_skill:              calculated by climpred https://climpred.re...
    skill_calculated_by_function:  compute_perfect_model
    number_of_initializations:     4
    number_of_members:             2
    metric:                        pearson_r
    comparison:                    m2e
    dim:                           ['init', 'member']
    units:                         None
    created:                       2020-02-07 18:27:28

`intake-esm` for cmorized output¶

In case you have access to cmorized output of CMIP experiments, consider using intake-esm. With the preprocess function you can align the time dimension of all input files. Finally, rename_to_climpred_dims only renames.

[11]:

from climpred.preprocessing.shared import rename_to_climpred_dims, set_integer_time_axis

[12]:

# make to have intake-esm installed
import intake # this is enough for intake-esm to work

[13]:

# https://github.com/NCAR/intake-esm-datastore/
col_url = "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json"
col = intake.open_esm_datastore(col_url)

[14]:

col.df.columns

[14]:

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'dcpp_init_year',
       'version', 'time_range', 'path'],
      dtype='object')

[15]:

# load 2 members for 2 inits for one variable from one model
query = dict(experiment_id=[
    'dcppA-hindcast'], table_id='Amon', member_id=['r1i1p1f1', 'r2i1p1f1'], dcpp_init_year=[1970, 1971],
    variable_id='tas', source_id='MPI-ESM1-2-HR')
cat = col.search(**query)
cdf_kwargs = {'chunks': {'time': 12}, 'decode_times': False}

[16]:

cat.df.head()

[16]:

	activity_id	institution_id	source_id	experiment_id	member_id	table_id	variable_id	grid_label	dcpp_init_year	version	time_range	path
0	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r1i1p1f1	Amon	tas	gn	1971.0	v20190906	197111-198112	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...
1	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r1i1p1f1	Amon	tas	gn	1970.0	v20190906	197011-198012	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...
2	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r2i1p1f1	Amon	tas	gn	1971.0	v20190906	197111-198112	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...
3	DCPP	MPI-M	MPI-ESM1-2-HR	dcppA-hindcast	r2i1p1f1	Amon	tas	gn	1970.0	v20190906	197011-198012	/work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E...

[17]:

def preprocess(ds):
    # extract tiny spatial and temporal subset to make this fast
    ds = ds.isel(lon=[50, 51, 52], lat=[50, 51, 52],
                 time=np.arange(12 * 2))
    # make time dim identical
    ds = set_integer_time_axis(ds,time_dim='time')
    return ds

[18]:

dset_dict = cat.to_dataset_dict(
    cdf_kwargs=cdf_kwargs, preprocess=preprocess)

Progress: |███████████████████████████████████████████████████████████████████████████████| 100.0%

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There are 1 group(s)

[19]:

# get first dict value
_, ds = dset_dict.popitem()
ds.coords

[19]:

Coordinates:
    height          float64 ...
  * dcpp_init_year  (dcpp_init_year) float64 1.97e+03 1.971e+03
  * lat             (lat) float64 -42.55 -41.61 -40.68
  * time            (time) int64 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24
  * lon             (lon) float64 46.88 47.81 48.75
  * member_id       (member_id) <U8 'r1i1p1f1' 'r2i1p1f1'

[20]:

# rename to comply with climpred's required dimension names
ds = rename_to_climpred_dims(ds)

[21]:

# what we need for climpred
ds.coords

[21]:

Coordinates:
    height   float64 ...
  * init     (init) float64 1.97e+03 1.971e+03
  * lat      (lat) float64 -42.55 -41.61 -40.68
  * lead     (lead) int64 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23 24
  * lon      (lon) float64 46.88 47.81 48.75
  * member   (member) <U8 'r1i1p1f1' 'r2i1p1f1'

[22]:

ds['tas'].data

[22]:

	Array	Chunk
Bytes	3.46 kB	432 B
Shape	(2, 2, 24, 3, 3)	(1, 1, 12, 3, 3)
Count	176 Tasks	8 Chunks
Type	float32	numpy.ndarray

[23]:

# loading the data into memory
# if not rechunk
# this is here quite fast before we only select 9 grid cells
%time ds = ds.load()

CPU times: user 218 ms, sys: 74 ms, total: 292 ms
Wall time: 237 ms

[24]:

# you may actually want to use `compute_hindcast` to calculate skill from hindcast.
# for this you also need an `observation` to compare to
# here `compute_perfect_model` compares one member to the ensemble mean of the remain members in turn
climpred.prediction.compute_perfect_model(ds, ds.rename({'lead':'time'}))

/work/mh0727/m300524/miniconda3/envs/climpred-dev/lib/python3.6/site-packages/xskillscore/core/np_deterministic.py:182: RuntimeWarning: invalid value encountered in true_divide
  r = r_num / r_den

[24]:

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 3, lead: 24, lon: 3)
Coordinates:
    height     float64 2.0
  * lat        (lat) float64 -42.55 -41.61 -40.68
  * lead       (lead) int64 1 2 3 4 5 6 7 8 9 10 ... 16 17 18 19 20 21 22 23 24
  * lon        (lon) float64 46.88 47.81 48.75
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lead, lat, bnds) float64 nan nan nan nan nan ... nan nan nan nan
    time_bnds  (lead, bnds) float64 nan nan nan nan nan ... nan nan nan nan nan
    lon_bnds   (lead, lon, bnds) float64 nan nan nan nan nan ... nan nan nan nan
    tas        (lead, lat, lon) float32 0.933643 0.88438606 ... -0.8581106
Attributes:
    nominal_resolution:            100 km
    institution:                   Max Planck Institute for Meteorology, Hamb...
    experiment_id:                 dcppA-hindcast
    variable_id:                   tas
    experiment:                    hindcast initialized based on observations...
    tracking_id:                   hdl:21.14100/f41d2fe5-bb1c-49d0-8afa-0cb9e...
    table_info:                    Creation Date:(09 May 2019) MD5:e6ef8ececc...
    references:                    MPI-ESM: Mauritsen, T. et al. (2019), Deve...
    frequency:                     mon
    source_id:                     MPI-ESM1-2-HR
    physics_index:                 1
    initialization_index:          1
    product:                       model-output
    source_type:                   AOGCM
    title:                         MPI-ESM1-2-HR output prepared for CMIP6
    institution_id:                MPI-M
    project_id:                    CMIP6
    Conventions:                   CF-1.7 CMIP-6.2
    intake_esm_varname:            tas
    activity_id:                   DCPP
    branch_method:                 standard lagged initialization
    table_id:                      Amon
    data_specs_version:            01.00.30
    grid_label:                    gn
    forcing_index:                 1
    external_variables:            areacella
    cmor_version:                  3.5.0
    realm:                         atmos
    history:                       2019-09-06T14:20:04Z ; CMOR rewrote data t...
    grid:                          spectral T127; 384 x 192 longitude/latitude
    license:                       CMIP6 model data produced by MPI-M is lice...
    source:                        MPI-ESM1.2-HR (2017): \naerosol: none, pre...
    mip_era:                       CMIP6
    contact:                       cmip6-mpi-esm@dkrz.de
    prediction_skill:              calculated by climpred https://climpred.re...
    skill_calculated_by_function:  compute_perfect_model
    number_of_initializations:     2
    number_of_members:             2
    metric:                        pearson_r
    comparison:                    m2e
    dim:                           ['init', 'member']
    units:                         None
    created:                       2020-02-07 18:27:51

Read the Docs v: v2.1.0

Versions: latest; stable; v2.1.0.post1; v2.1.0; v2.0.0; v1.2.1; v1.2.0; v1.1.0; v1.0.1; v1.0.0

Downloads

On Read the Docs: Project Home; Builds

Free document hosting provided by Read the Docs.