arviz_base.dataset_to_dataframe
- arviz_base.dataset_to_dataframe(ds, sample_dims=None, labeller=None, multiindex=False, new_dim='label')
Convert a Dataset to a DataFrame via a stacked DataArray, using a labeller.
- Parameters:
- Returns:
pandas.DataFrame
Examples
The stacked sample_dims become the row index of the DataFrame and each labelled variable becomes a column, so when there are many more columns than rows we might want to transpose the output:
from arviz_base import load_arviz_data, dataset_to_dataframe

idata = load_arviz_data("centered_eight")
dataset_to_dataframe(idata.posterior.dataset)
| | mu | theta[Choate] | theta[Deerfield] | theta[Phillips Andover] | theta[Phillips Exeter] | theta[Hotchkiss] | theta[Lawrenceville] | theta[St. Paul's] | theta[Mt. Hermon] | tau |
|---|---|---|---|---|---|---|---|---|---|---|
| (0, 0) | 7.871796 | 12.320686 | 9.905367 | 14.951615 | 11.011485 | 5.579602 | 16.901795 | 13.198059 | 15.061366 | 4.725740 |
| (0, 1) | 3.384554 | 11.285623 | 9.129324 | 3.139263 | 9.433211 | 7.811516 | 2.393088 | 10.055223 | 6.176724 | 3.908994 |
| (0, 2) | 9.100476 | 5.708506 | 5.757932 | 10.944585 | 5.895436 | 9.992984 | 8.143327 | 7.604753 | 8.767647 | 4.844025 |
| (0, 3) | 7.304293 | 10.037275 | 8.809068 | 9.900924 | 5.768832 | 9.062876 | 6.958424 | 10.298256 | 3.155304 | 1.856703 |
| (0, 4) | 9.879675 | 9.149146 | 5.764986 | 7.015397 | 15.688710 | 3.097395 | 12.025763 | 11.316745 | 17.046142 | 4.748409 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| (3, 495) | 1.542688 | 3.737751 | 5.393632 | 0.487845 | 4.015486 | 0.717057 | -2.675760 | 0.415968 | -4.991247 | 2.786072 |
| (3, 496) | 1.858580 | -0.291737 | 0.110315 | 1.468877 | -3.653346 | 1.844292 | 6.055714 | 4.986218 | 9.290380 | 4.281961 |
| (3, 497) | 1.766733 | 3.532515 | 2.008901 | 0.510806 | 0.832185 | 2.647687 | 4.707249 | 3.073314 | -2.623069 | 2.740607 |
| (3, 498) | 3.486112 | 4.182751 | 7.554251 | 4.456034 | 3.300833 | 1.563307 | 1.528958 | 1.096098 | 8.452282 | 2.932379 |
| (3, 499) | 3.404464 | 0.192956 | 6.498428 | -0.894424 | 6.849020 | 1.859747 | 7.936460 | 6.762455 | 1.295051 | 4.461246 |

2000 rows × 10 columns
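As a rough illustration of what "via a stacked DataArray, using a labeller" amounts to, the sketch below rebuilds a similar table by hand with xarray and pandas. It is not the library's implementation: the toy dataset, the `columns` dictionary and the `name[coord]` label format are assumptions chosen to mimic the output above.

```python
import itertools

import numpy as np
import pandas as pd
import xarray as xr

# Toy dataset (hypothetical): every variable shares the sample dims chain/draw.
ds = xr.Dataset(
    {
        "mu": (("chain", "draw"), np.arange(6.0).reshape(2, 3)),
        "theta": (("chain", "draw", "school"), np.arange(12.0).reshape(2, 3, 2)),
    },
    coords={"chain": [0, 1], "draw": [0, 1, 2], "school": ["A", "B"]},
)

# Step 1: collapse the sample dims into a single stacked "sample" dimension.
stacked = ds.stack(sample=("chain", "draw"))

# Step 2: emit one labelled column per variable and per remaining coordinate value.
columns = {}
for name, da in stacked.data_vars.items():
    extra_dims = [d for d in da.dims if d != "sample"]
    coord_values = [da[d].values for d in extra_dims]
    for combo in itertools.product(*coord_values):
        selection = dict(zip(extra_dims, combo))
        # Assumed label format, imitating the column names shown above.
        label = name if not selection else f"{name}[{', '.join(str(v) for v in combo)}]"
        columns[label] = da.sel(selection).values

# Step 3: the stacked dimension provides the row index.
df = pd.DataFrame(columns, index=stacked.indexes["sample"])
print(df)
```

Here the rows are the six (chain, draw) pairs and the columns are `mu`, `theta[A]` and `theta[B]`, which is the same layout as the real output, only smaller.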
The default is to return only a single index, with the labels or tuples of coordinate values in the stacked dimensions. To keep all data from all coordinates as a multiindex, use multiindex=True:
dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
| label | | | mu | theta[Choate] | theta[Deerfield] | theta[Phillips Andover] | theta[Phillips Exeter] | theta[Hotchkiss] | theta[Lawrenceville] | theta[St. Paul's] | theta[Mt. Hermon] | tau |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| variable | | | mu | theta | theta | theta | theta | theta | theta | theta | theta | tau |
| school | | | NaN | Choate | Deerfield | Phillips Andover | Phillips Exeter | Hotchkiss | Lawrenceville | St. Paul's | Mt. Hermon | NaN |
| sample | chain | draw | | | | | | | | | | |
| (0, 0) | 0 | 0 | 7.871796 | 12.320686 | 9.905367 | 14.951615 | 11.011485 | 5.579602 | 16.901795 | 13.198059 | 15.061366 | 4.725740 |
| (0, 1) | 0 | 1 | 3.384554 | 11.285623 | 9.129324 | 3.139263 | 9.433211 | 7.811516 | 2.393088 | 10.055223 | 6.176724 | 3.908994 |
| (0, 2) | 0 | 2 | 9.100476 | 5.708506 | 5.757932 | 10.944585 | 5.895436 | 9.992984 | 8.143327 | 7.604753 | 8.767647 | 4.844025 |
| (0, 3) | 0 | 3 | 7.304293 | 10.037275 | 8.809068 | 9.900924 | 5.768832 | 9.062876 | 6.958424 | 10.298256 | 3.155304 | 1.856703 |
| (0, 4) | 0 | 4 | 9.879675 | 9.149146 | 5.764986 | 7.015397 | 15.688710 | 3.097395 | 12.025763 | 11.316745 | 17.046142 | 4.748409 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| (3, 495) | 3 | 495 | 1.542688 | 3.737751 | 5.393632 | 0.487845 | 4.015486 | 0.717057 | -2.675760 | 0.415968 | -4.991247 | 2.786072 |
| (3, 496) | 3 | 496 | 1.858580 | -0.291737 | 0.110315 | 1.468877 | -3.653346 | 1.844292 | 6.055714 | 4.986218 | 9.290380 | 4.281961 |
| (3, 497) | 3 | 497 | 1.766733 | 3.532515 | 2.008901 | 0.510806 | 0.832185 | 2.647687 | 4.707249 | 3.073314 | -2.623069 | 2.740607 |
| (3, 498) | 3 | 498 | 3.486112 | 4.182751 | 7.554251 | 4.456034 | 3.300833 | 1.563307 | 1.528958 | 1.096098 | 8.452282 | 2.932379 |
| (3, 499) | 3 | 499 | 3.404464 | 0.192956 | 6.498428 | -0.894424 | 6.849020 | 1.859747 | 7.936460 | 6.762455 | 1.295051 | 4.461246 |

2000 rows × 10 columns
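With the coordinates kept as index levels, standard pandas tools can slice or aggregate by them. The sketch below uses a small synthetic frame as a simplified stand-in for the output above: only the chain and draw row levels are reproduced, the columns are flat, and the values are random. The names `df`, `first_chain` and `chain_means` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): 4 chains x 5 draws, two scalar variables.
rng = np.random.default_rng(0)
n_chains, n_draws = 4, 5
index = pd.MultiIndex.from_product(
    [range(n_chains), range(n_draws)], names=["chain", "draw"]
)
df = pd.DataFrame(
    rng.normal(size=(n_chains * n_draws, 2)), index=index, columns=["mu", "tau"]
)

# Select every draw of chain 0; the "chain" level is dropped from the result.
first_chain = df.xs(0, level="chain")

# Aggregate over draws within each chain.
chain_means = df.groupby(level="chain").mean()

print(first_chain)
print(chain_means)
```

The same calls should carry over to the real output as long as the level names match those shown in the table above.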
The only restriction on sample_dims is that every dimension in it must be present in all variables of the dataset. Consequently, we can compute statistical summaries and concatenate the results into a single dataset along a new dimension.
import xarray as xr

dims = ["chain", "draw"]
post = idata.posterior.dataset
summaries = xr.concat(
    (
        post.mean(dims).expand_dims(summary=["mean"]),
        post.median(dims).expand_dims(summary=["median"]),
        post.quantile([.25, .75], dim=dims).rename(
            quantile="summary"
        ).assign_coords(summary=["1st quartile", "3rd quartile"])
    ),
    dim="summary"
)
summaries
<xarray.Dataset> Size: 864B
Dimensions:  (summary: 4, school: 8)
Coordinates:
  * summary  (summary) object 32B 'mean' 'median' '1st quartile' '3rd quartile'
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (summary) float64 32B 4.486 4.548 2.234 6.802
    theta    (summary, school) float64 256B 6.46 5.028 3.938 ... 9.598 8.293
    tau      (summary) float64 32B 4.124 3.269 1.868 5.367
Then convert the result into a DataFrame for ease of viewing.
dataset_to_dataframe(summaries, sample_dims=["summary"]).T
| | mean | median | 1st quartile | 3rd quartile |
|---|---|---|---|---|
| mu | 4.485933 | 4.547775 | 2.234131 | 6.802475 |
| theta[Choate] | 6.460064 | 6.081710 | 3.222971 | 9.435743 |
| theta[Deerfield] | 5.027555 | 5.010779 | 1.539086 | 8.235701 |
| theta[Phillips Andover] | 3.938031 | 4.226613 | 1.017415 | 7.317208 |
| theta[Phillips Exeter] | 4.871612 | 5.021936 | 1.591279 | 8.096595 |
| theta[Hotchkiss] | 3.666841 | 3.892372 | 0.753101 | 7.098060 |
| theta[Lawrenceville] | 3.974687 | 4.136356 | 0.936569 | 7.222736 |
| theta[St. Paul's] | 6.580924 | 6.065121 | 3.511060 | 9.598407 |
| theta[Mt. Hermon] | 4.772411 | 4.705673 | 1.590737 | 8.292752 |
| tau | 4.124223 | 3.269352 | 1.868277 | 5.366589 |

Note that if all summaries were scalar, it would not be necessary to use expand_dims or to rename dimensions; using assign_coords on the result to label the newly created dimension would be enough. With the approach shown here, however, each piece already carries a dimension with coordinate values, so we can also combine non-scalar summaries such as the quantiles.
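The scalar-only shortcut mentioned in the note can be sketched as follows. The dataset here is synthetic (random values and three made-up school labels), so only the pattern is meant to carry over: concatenate along a new dimension first, then label it once with assign_coords.

```python
import numpy as np
import xarray as xr

# Synthetic posterior-like dataset (assumption, not the centered_eight data).
rng = np.random.default_rng(0)
post = xr.Dataset(
    {
        "mu": (("chain", "draw"), rng.normal(size=(2, 50))),
        "theta": (("chain", "draw", "school"), rng.normal(size=(2, 50, 3))),
    },
    coords={"school": ["A", "B", "C"]},
)

dims = ["chain", "draw"]

# Both reductions remove the sample dims entirely, so concat can create
# the new "summary" dimension and assign_coords can label it afterwards.
summaries = xr.concat(
    [post.mean(dims), post.median(dims)], dim="summary"
).assign_coords(summary=["mean", "median"])

print(summaries)
```

This would not work as-is for the quantiles, since each quantile call already returns several values along its own dimension, which is why the example above renames that dimension instead.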