Skip to content Skip to sidebar Skip to footer

With Xarray, How To Parallelize 1d Operations On A Multidimensional Dataset?

I have a 4D xarray Dataset. I want to carry out a linear regression between two variables on a specific dimension (here time), and keep the regression parameters in a 3D array (the

Solution 1:

It is possible to apply scipy.stats.linregress (and other non-ufuncs) to the xarray Dataset using apply_ufunc() by passing vectorize=True like so:

# return a tuple of DataArrays
res = xr.apply_ufunc(scipy.stats.linregress, ds[x], ds[y],
        input_core_dims=[['year'], ['year']],
        output_core_dims=[[], [], [], [], []],
        vectorize=True)
# add the data to the existing dataset
for arr_name, arr in zip(array_names, res):
    ds[arr_name] = arr

Although still serial, apply_ufunc is around 36x faster than the loop implementation in this specific case.

However the parallelization with dask is still not implemented with multiple output like the one from scipy.stats.linregress:

NotImplementedError: multiple outputs from apply_ufunc not yet supported with dask='parallelized'

Solution 2:

The previous answer by LCT covers most of what should be said here, however I think it is possible to incorporate dask='parallelized' with multiple outputs like you get from scipy.stats.linregress.

The trick here is to stack the multiple outputs into one array and then output that, you'll also have to use the output_core_dims kwarg to specify that the DataArray output from the apply_ufunc() call will have an extra dimension now:

defnew_linregress(x, y):
    # Wrapper around scipy linregress to use in apply_ufunc
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    return np.array([slope, intercept, r_value, p_value, std_err])
# return a new DataArray
stats = xr.apply_ufunc(new_linregress, ds[x], ds[y],
                       input_core_dims=[['year'], ['year']],
                       output_core_dims=[["parameter"]],
                       vectorize=True,
                       dask="parallelized",
                       output_dtypes=['float64'],
                       output_sizes={"parameter": 5},
                      )

N.B. This approach currently only works with dask='parallelized' if you have dask<2.0, but it seems to work fine for multiple outputs if you have something else like dask='allowed'. Have a look at this Github issue for more detail.

Hope it helps!

Edit: I've been informed that the dask<2.0 issue has been rectified as long as you have xarray>=0.15.0 ! So can use dask='parallelized' now to speed things up. :)

Post a Comment for "With Xarray, How To Parallelize 1d Operations On A Multidimensional Dataset?"