arviz_base.dataset_to_dataframe

arviz_base.dataset_to_dataframe(ds, sample_dims=None, labeller=None, multiindex=False, new_dim='label')

Convert a Dataset to a DataFrame via a stacked DataArray, using a labeller.

Parameters:
ds : xarray.Dataset
sample_dims : sequence of hashable, optional
labeller : labeller, optional
multiindex : {“row”, “column”} or bool, default False
new_dim : hashable, default “label”

Returns:
pandas.DataFrame

Examples

The output will have whatever is used as sample_dims as the rows of the DataFrame and the labelled variables as the columns, so when there are many more variables than samples we might want to transpose the output:

from arviz_base import load_arviz_data, dataset_to_dataframe
idata = load_arviz_data("centered_eight")
dataset_to_dataframe(idata.posterior.dataset)
mu theta[Choate] theta[Deerfield] theta[Phillips Andover] theta[Phillips Exeter] theta[Hotchkiss] theta[Lawrenceville] theta[St. Paul's] theta[Mt. Hermon] tau
(0, 0) 7.871796 12.320686 9.905367 14.951615 11.011485 5.579602 16.901795 13.198059 15.061366 4.725740
(0, 1) 3.384554 11.285623 9.129324 3.139263 9.433211 7.811516 2.393088 10.055223 6.176724 3.908994
(0, 2) 9.100476 5.708506 5.757932 10.944585 5.895436 9.992984 8.143327 7.604753 8.767647 4.844025
(0, 3) 7.304293 10.037275 8.809068 9.900924 5.768832 9.062876 6.958424 10.298256 3.155304 1.856703
(0, 4) 9.879675 9.149146 5.764986 7.015397 15.688710 3.097395 12.025763 11.316745 17.046142 4.748409
... ... ... ... ... ... ... ... ... ... ...
(3, 495) 1.542688 3.737751 5.393632 0.487845 4.015486 0.717057 -2.675760 0.415968 -4.991247 2.786072
(3, 496) 1.858580 -0.291737 0.110315 1.468877 -3.653346 1.844292 6.055714 4.986218 9.290380 4.281961
(3, 497) 1.766733 3.532515 2.008901 0.510806 0.832185 2.647687 4.707249 3.073314 -2.623069 2.740607
(3, 498) 3.486112 4.182751 7.554251 4.456034 3.300833 1.563307 1.528958 1.096098 8.452282 2.932379
(3, 499) 3.404464 0.192956 6.498428 -0.894424 6.849020 1.859747 7.936460 6.762455 1.295051 4.461246

2000 rows × 10 columns
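For intuition, a similar table (up to the row index being a MultiIndex rather than tuples, and ignoring the labeller) can be built by stacking with plain xarray. This is only an illustrative sketch, not how dataset_to_dataframe is implemented:

post = idata.posterior.dataset
# Stack all non-sample dimensions of every variable into a new "label" dimension,
# stack the sample dimensions into a single "sample" dimension, and convert the
# resulting 2D DataArray into a DataFrame (2000 rows × 10 columns as above).
rough_df = (
    post.to_stacked_array("label", sample_dims=("chain", "draw"))
    .stack(sample=("chain", "draw"))
    .transpose("sample", "label")
    .to_pandas()
)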

The default is to return a single index only, with the labels or tuples of coordinate values identifying the stacked dimensions. To keep the data from all coordinates as a MultiIndex, use multiindex=True:

dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
label mu theta[Choate] theta[Deerfield] theta[Phillips Andover] theta[Phillips Exeter] theta[Hotchkiss] theta[Lawrenceville] theta[St. Paul's] theta[Mt. Hermon] tau
variable mu theta theta theta theta theta theta theta theta tau
school NaN Choate Deerfield Phillips Andover Phillips Exeter Hotchkiss Lawrenceville St. Paul's Mt. Hermon NaN
sample chain draw
(0, 0) 0 0 7.871796 12.320686 9.905367 14.951615 11.011485 5.579602 16.901795 13.198059 15.061366 4.725740
(0, 1) 0 1 3.384554 11.285623 9.129324 3.139263 9.433211 7.811516 2.393088 10.055223 6.176724 3.908994
(0, 2) 0 2 9.100476 5.708506 5.757932 10.944585 5.895436 9.992984 8.143327 7.604753 8.767647 4.844025
(0, 3) 0 3 7.304293 10.037275 8.809068 9.900924 5.768832 9.062876 6.958424 10.298256 3.155304 1.856703
(0, 4) 0 4 9.879675 9.149146 5.764986 7.015397 15.688710 3.097395 12.025763 11.316745 17.046142 4.748409
... ... ... ... ... ... ... ... ... ... ... ... ...
(3, 495) 3 495 1.542688 3.737751 5.393632 0.487845 4.015486 0.717057 -2.675760 0.415968 -4.991247 2.786072
(3, 496) 3 496 1.858580 -0.291737 0.110315 1.468877 -3.653346 1.844292 6.055714 4.986218 9.290380 4.281961
(3, 497) 3 497 1.766733 3.532515 2.008901 0.510806 0.832185 2.647687 4.707249 3.073314 -2.623069 2.740607
(3, 498) 3 498 3.486112 4.182751 7.554251 4.456034 3.300833 1.563307 1.528958 1.096098 8.452282 2.932379
(3, 499) 3 499 3.404464 0.192956 6.498428 -0.894424 6.849020 1.859747 7.936460 6.762455 1.295051 4.461246

2000 rows × 10 columns
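One advantage of keeping the MultiIndex is that the original coordinate values remain available for selection with pandas. A small usage sketch, assuming the column levels shown above ("variable" and "school"):

df = dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
# select every column belonging to the "theta" variable via the "variable" level
theta_draws = df.xs("theta", axis=1, level="variable")
# or select a single school via the "school" level
choate_draws = df.xs("Choate", axis=1, level="school")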

The only restriction on sample_dims is that the dimensions it contains are present in all variables of the dataset. Consequently, we can compute statistical summaries and concatenate the results into a single dataset, creating a new dimension:

import xarray as xr

dims = ["chain", "draw"]
post = idata.posterior.dataset
summaries = xr.concat(
    (
        post.mean(dims).expand_dims(summary=["mean"]),
        post.median(dims).expand_dims(summary=["median"]),
        post.quantile([.25, .75], dim=dims).rename(
            quantile="summary"
        ).assign_coords(summary=["1st quartile", "3rd quartile"])
    ),
    dim="summary"
)
summaries
<xarray.Dataset> Size: 864B
Dimensions:  (summary: 4, school: 8)
Coordinates:
  * summary  (summary) object 32B 'mean' 'median' '1st quartile' '3rd quartile'
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (summary) float64 32B 4.486 4.548 2.234 6.802
    theta    (summary, school) float64 256B 6.46 5.028 3.938 ... 9.598 8.293
    tau      (summary) float64 32B 4.124 3.269 1.868 5.367

Then convert the result into a DataFrame for ease of viewing.

dataset_to_dataframe(summaries, sample_dims=["summary"]).T
mean median 1st quartile 3rd quartile
mu 4.485933 4.547775 2.234131 6.802475
theta[Choate] 6.460064 6.081710 3.222971 9.435743
theta[Deerfield] 5.027555 5.010779 1.539086 8.235701
theta[Phillips Andover] 3.938031 4.226613 1.017415 7.317208
theta[Phillips Exeter] 4.871612 5.021936 1.591279 8.096595
theta[Hotchkiss] 3.666841 3.892372 0.753101 7.098060
theta[Lawrenceville] 3.974687 4.136356 0.936569 7.222736
theta[St. Paul's] 6.580924 6.065121 3.511060 9.598407
theta[Mt. Hermon] 4.772411 4.705673 1.590737 8.292752
tau 4.124223 3.269352 1.868277 5.366589

Note that if all summaries were scalar, it would not be necessary to use expand_dims or rename dimensions; using assign_coords on the concatenated result to label the newly created dimension would be enough. With the approach shown here, however, we generate a dimension with coordinate values from the start and can also combine non-scalar summaries such as the quantiles.
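For instance, a minimal sketch of that simpler path using only the scalar summaries (mean and median), reusing post and dims from above:

# concatenating scalar summaries creates the "summary" dimension directly,
# so labelling it afterwards with assign_coords is enough
scalar_summaries = xr.concat(
    (post.mean(dims), post.median(dims)),
    dim="summary",
).assign_coords(summary=["mean", "median"])
dataset_to_dataframe(scalar_summaries, sample_dims=["summary"]).T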