arviz_base.dataset_to_dataframe

arviz_base.dataset_to_dataframe(ds, sample_dims=None, labeller=None, multiindex=False, new_dim='label')

Convert a Dataset to a DataFrame via a stacked DataArray, using a labeller.

Parameters:
ds : xarray.Dataset
sample_dims : sequence of hashable, optional
labeller : labeller, optional
multiindex : {“row”, “column”} or bool, default False
new_dim : hashable, default “label”

Returns:
pandas.DataFrame

Examples

The output will have whatever is used as sample_dims as the rows of the DataFrame and the labelled variables as the columns, so when there are many more variables than samples we might want to transpose the output:

from arviz_base import load_arviz_data, dataset_to_dataframe
idata = load_arviz_data("centered_eight")
dataset_to_dataframe(idata.posterior.dataset)
mu theta[Choate] theta[Deerfield] theta[Phillips Andover] theta[Phillips Exeter] theta[Hotchkiss] theta[Lawrenceville] theta[St. Paul's] theta[Mt. Hermon] tau
(0, 0) 7.871796 12.320686 9.905367 14.951615 11.011485 5.579602 16.901795 13.198059 15.061366 4.725740
(0, 1) 3.384554 11.285623 9.129324 3.139263 9.433211 7.811516 2.393088 10.055223 6.176724 3.908994
(0, 2) 9.100476 5.708506 5.757932 10.944585 5.895436 9.992984 8.143327 7.604753 8.767647 4.844025
(0, 3) 7.304293 10.037275 8.809068 9.900924 5.768832 9.062876 6.958424 10.298256 3.155304 1.856703
(0, 4) 9.879675 9.149146 5.764986 7.015397 15.688710 3.097395 12.025763 11.316745 17.046142 4.748409
... ... ... ... ... ... ... ... ... ... ...
(3, 495) 1.542688 3.737751 5.393632 0.487845 4.015486 0.717057 -2.675760 0.415968 -4.991247 2.786072
(3, 496) 1.858580 -0.291737 0.110315 1.468877 -3.653346 1.844292 6.055714 4.986218 9.290380 4.281961
(3, 497) 1.766733 3.532515 2.008901 0.510806 0.832185 2.647687 4.707249 3.073314 -2.623069 2.740607
(3, 498) 3.486112 4.182751 7.554251 4.456034 3.300833 1.563307 1.528958 1.096098 8.452282 2.932379
(3, 499) 3.404464 0.192956 6.498428 -0.894424 6.849020 1.859747 7.936460 6.762455 1.295051 4.461246

2000 rows × 10 columns
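For intuition, a similar table (up to the row index being a MultiIndex rather than tuples, and ignoring the labeller) can be built by stacking with plain xarray. This is only an illustrative sketch, not how dataset_to_dataframe is implemented:

post = idata.posterior.dataset
# Stack all non-sample dimensions of every variable into a new "label" dimension,
# stack the sample dimensions into a single "sample" dimension, and convert the
# resulting 2D DataArray into a DataFrame (2000 rows × 10 columns as above).
rough_df = (
    post.to_stacked_array("label", sample_dims=("chain", "draw"))
    .stack(sample=("chain", "draw"))
    .transpose("sample", "label")
    .to_pandas()
)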

The default is to return a single index only, with the labels or tuples of coordinate values identifying the stacked dimensions. To keep the data from all coordinates as a MultiIndex, use multiindex=True:

dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
label mu theta[Choate] theta[Deerfield] theta[Phillips Andover] theta[Phillips Exeter] theta[Hotchkiss] theta[Lawrenceville] theta[St. Paul's] theta[Mt. Hermon] tau
variable mu theta theta theta theta theta theta theta theta tau
school NaN Choate Deerfield Phillips Andover Phillips Exeter Hotchkiss Lawrenceville St. Paul's Mt. Hermon NaN
sample chain draw
(0, 0) 0 0 7.871796 12.320686 9.905367 14.951615 11.011485 5.579602 16.901795 13.198059 15.061366 4.725740
(0, 1) 0 1 3.384554 11.285623 9.129324 3.139263 9.433211 7.811516 2.393088 10.055223 6.176724 3.908994
(0, 2) 0 2 9.100476 5.708506 5.757932 10.944585 5.895436 9.992984 8.143327 7.604753 8.767647 4.844025
(0, 3) 0 3 7.304293 10.037275 8.809068 9.900924 5.768832 9.062876 6.958424 10.298256 3.155304 1.856703
(0, 4) 0 4 9.879675 9.149146 5.764986 7.015397 15.688710 3.097395 12.025763 11.316745 17.046142 4.748409
... ... ... ... ... ... ... ... ... ... ... ... ...
(3, 495) 3 495 1.542688 3.737751 5.393632 0.487845 4.015486 0.717057 -2.675760 0.415968 -4.991247 2.786072
(3, 496) 3 496 1.858580 -0.291737 0.110315 1.468877 -3.653346 1.844292 6.055714 4.986218 9.290380 4.281961
(3, 497) 3 497 1.766733 3.532515 2.008901 0.510806 0.832185 2.647687 4.707249 3.073314 -2.623069 2.740607
(3, 498) 3 498 3.486112 4.182751 7.554251 4.456034 3.300833 1.563307 1.528958 1.096098 8.452282 2.932379
(3, 499) 3 499 3.404464 0.192956 6.498428 -0.894424 6.849020 1.859747 7.936460 6.762455 1.295051 4.461246

2000 rows × 10 columns
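One advantage of keeping the MultiIndex is that the original coordinate values remain available for selection with pandas. A small usage sketch, assuming the column levels shown above ("variable" and "school"):

df = dataset_to_dataframe(idata.posterior.dataset, multiindex=True)
# select every column belonging to the "theta" variable via the "variable" level
theta_draws = df.xs("theta", axis=1, level="variable")
# or select a single school via the "school" level
choate_draws = df.xs("Choate", axis=1, level="school")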

The only restriction on sample_dims is that the dimensions it contains are present in all variables of the dataset. Consequently, we can compute statistical summaries and concatenate the results into a single dataset, creating a new dimension:

import xarray as xr

dims = ["chain", "draw"]
post = idata.posterior.dataset
summaries = xr.concat(
    (
        post.mean(dims).expand_dims(summary=["mean"]),
        post.median(dims).expand_dims(summary=["median"]),
        post.quantile([.25, .75], dim=dims).rename(
            quantile="summary"
        ).assign_coords(summary=["1st quartile", "3rd quartile"])
    ),
    dim="summary"
)
summaries
<xarray.Dataset> Size: 864B
Dimensions:  (summary: 4, school: 8)
Coordinates:
  * summary  (summary) object 32B 'mean' 'median' '1st quartile' '3rd quartile'
  * school   (school) <U16 512B 'Choate' 'Deerfield' ... 'Mt. Hermon'
Data variables:
    mu       (summary) float64 32B 4.486 4.548 2.234 6.802
    theta    (summary, school) float64 256B 6.46 5.028 3.938 ... 9.598 8.293
    tau      (summary) float64 32B 4.124 3.269 1.868 5.367

Then convert the result into a DataFrame for ease of viewing.

dataset_to_dataframe(summaries, sample_dims=["summary"]).T
mean median 1st quartile 3rd quartile
mu 4.485933 4.547775 2.234131 6.802475
theta[Choate] 6.460064 6.081710 3.222971 9.435743
theta[Deerfield] 5.027555 5.010779 1.539086 8.235701
theta[Phillips Andover] 3.938031 4.226613 1.017415 7.317208
theta[Phillips Exeter] 4.871612 5.021936 1.591279 8.096595
theta[Hotchkiss] 3.666841 3.892372 0.753101 7.098060
theta[Lawrenceville] 3.974687 4.136356 0.936569 7.222736
theta[St. Paul's] 6.580924 6.065121 3.511060 9.598407
theta[Mt. Hermon] 4.772411 4.705673 1.590737 8.292752
tau 4.124223 3.269352 1.868277 5.366589

Note that if all summaries were scalar, it would not be necessary to use expand_dims or rename dimensions; using assign_coords on the concatenated result to label the newly created dimension would be enough. With the approach shown here, however, we generate a dimension with coordinate values from the start and can also combine non-scalar summaries such as the quantiles.
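For instance, a minimal sketch of that simpler path using only the scalar summaries (mean and median), reusing post and dims from above:

# concatenating scalar summaries creates the "summary" dimension directly,
# so labelling it afterwards with assign_coords is enough
scalar_summaries = xr.concat(
    (post.mean(dims), post.median(dims)),
    dim="summary",
).assign_coords(summary=["mean", "median"])
dataset_to_dataframe(scalar_summaries, sample_dims=["summary"]).T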