Statistics example problems

1. ANOVA - geological example

2. Non-parametric tests

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import pandas as pd
import pingouin as pg

1. ANOVA - geological example

a. Implementation

Data come from Table 10.1 of McKillup and Dyar, Geostatistics Explained, Cambridge University Press, 2010 (excerpt available on class Google Drive). Values represent the weight percent of MgO present in tourmalines from three locations in Maine.

Use two different methods to test whether the mean MgO content differs significantly among the three sites.

Method 1: Scipy

df = pd.read_csv('data/MgO_example/MgO_Maine.csv')  # wide format: one column per location
df
   Mount Mica  Sebago Batholith  Black Mountain
0           7                 4               1
1           8                 5               2
2          10                 7               4
3          11                 8               5
stats.f_oneway(df['Mount Mica'],df['Sebago Batholith'],df['Black Mountain'])
F_onewayResult(statistic=10.799999999999999, pvalue=0.004058306777237465)
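As a cross-check on the SciPy result, the F statistic can be reproduced by hand from the between-group and within-group sums of squares. The following is a minimal sketch that types in the values from the table above instead of reading the CSV file; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# MgO (wt %) values copied from the table above, one array per location
groups = {
    'Mount Mica': np.array([7, 8, 10, 11]),
    'Sebago Batholith': np.array([4, 5, 7, 8]),
    'Black Mountain': np.array([1, 2, 4, 5]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_between = k - 1
df_within = n_total - k
F = (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability of the F distribution gives the p-value
p = stats.f.sf(F, df_between, df_within)
print(F, p)
```

The ratio `ss_between / (ss_between + ss_within)` is the eta-squared effect size, which is one way to judge how much of the total variance is associated with location.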

Method 2: Pingouin

df2 = pd.read_csv('data/MgO_example/MgO_Maine_list.csv')
df2
    MgO          Location
0     7        Mount Mica
1     8        Mount Mica
2    10        Mount Mica
3    11        Mount Mica
4     4  Sebago Batholith
5     5  Sebago Batholith
6     7  Sebago Batholith
7     8  Sebago Batholith
8     1    Black Mountain
9     2    Black Mountain
10    4    Black Mountain
11    5    Black Mountain
pg.anova(data=df2,dv='MgO',between='Location')
     Source  ddof1  ddof2     F     p-unc       np2
0  Location      2      9  10.8  0.004058  0.705882
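The two CSV files appear to hold the same measurements in wide and long layouts. Assuming that is the case, the long table used for the Pingouin call could also be derived from the wide one with the pandas `melt` method. The sketch below builds the wide table inline from the values shown earlier; the names `wide` and `long_df` are hypothetical.

```python
import pandas as pd

# Wide layout: one column per location (values copied from the table above)
wide = pd.DataFrame({
    'Mount Mica': [7, 8, 10, 11],
    'Sebago Batholith': [4, 5, 7, 8],
    'Black Mountain': [1, 2, 4, 5],
})

# Long layout: one row per measurement, with the location as a label column
long_df = wide.melt(var_name='Location', value_name='MgO')

# Group means are a quick sanity check that the reshape kept the grouping intact
group_means = long_df.groupby('Location')['MgO'].mean()
print(long_df)
print(group_means)
```

Keeping a single source file and reshaping it in code avoids the risk of the two files drifting apart.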

Post-hoc test

pg.pairwise_tukey(data=df2,dv='MgO',between='Location')
                A                 B  mean(A)  mean(B)  diff        se         T   p-tukey     hedges
0  Black Mountain        Mount Mica      3.0      9.0  -6.0  1.290994  -4.64758  0.003085  -2.857683
1  Black Mountain  Sebago Batholith      3.0      6.0  -3.0  1.290994  -2.32379  0.102975  -1.428841
2      Mount Mica  Sebago Batholith      9.0      6.0   3.0  1.290994   2.32379  0.102975   1.428841
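The `se`, `T`, and `p-tukey` columns in the first row can be traced back to the ANOVA quantities. The sketch below recomputes them for the Black Mountain vs. Mount Mica pair, assuming the pooled within-group variance is used for the standard error and that the p-value comes from the studentized range distribution (`scipy.stats.studentized_range`, available in recent SciPy releases). It is an illustration of the calculation, not a statement of how Pingouin is implemented internally.

```python
import numpy as np
from scipy import stats

# MgO (wt %) values copied from the table above
mica = np.array([7, 8, 10, 11])
sebago = np.array([4, 5, 7, 8])
black = np.array([1, 2, 4, 5])
groups = [mica, sebago, black]

k = len(groups)                             # number of groups
df_within = sum(g.size for g in groups) - k # within-group degrees of freedom

# Pooled within-group variance (mean square within) from the one-way ANOVA
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_within

# Standard error of the difference between two group means
se = np.sqrt(ms_within * (1 / black.size + 1 / mica.size))
t = (black.mean() - mica.mean()) / se

# Tukey's test refers |t| * sqrt(2) to the studentized range distribution
q = abs(t) * np.sqrt(2)
p_tukey = stats.studentized_range.sf(q, k, df_within)
print(se, t, p_tukey)
```

Because all three groups have four samples, the standard error is the same for every pair, which is why the `se` column is constant.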

b. ANOVA interpretation

Write a summary of your interpretation of the statistical results obtained above. Address the following questions.

  • What is the null hypothesis being tested?

  • Should the null hypothesis be rejected, or do the data fail to reject it?

  • What does the post-hoc test tell you?

2. Non-parametric tests

a. Wilcoxon signed-rank test: implementation

This example uses data from: http://www.biostathandbook.com/wilcoxonsignedrank.html

The data are observations of aluminum content in 13 different poplar clones in a polluted area. The scientific question is whether there is a significant change in the aluminum content between August and November.

df_al = pd.read_csv('data/wilcoxon_example/Al_content.csv',
                   delimiter='\t')
df_al
             Clone  August  November
0   Columbia River    18.3      12.7
1    Fritzi Pauley    13.3      11.1
2        Hazendans    16.5      15.3
3            Primo    12.6      12.7
4         Raspalje     9.5      10.5
5        Hoogvorst    13.6      15.6
6     Balsam Spire     8.1      11.2
7           Gibecq     8.9      14.2
8          Beaupre    10.0      16.3
9             Unal     8.3      15.5
10       Trichobel     7.9      19.9
11           Gaver     8.1      20.4
12       Wolterson    13.4      36.8
plt.figure()
plt.hist(df_al['November']);
[Figure: histogram of November aluminum content]
stats.skewtest(df_al['November'])
SkewtestResult(statistic=3.449022139607473, pvalue=0.0005626205706886182)
stats.normaltest(df_al['November'])
UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
NormaltestResult(statistic=21.55304457655946, pvalue=2.0884103462437462e-05)
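The checks above look at the November values alone, whereas a paired test works on the August minus November differences. As a complementary sketch, the code below computes the sample skewness of both and runs a Shapiro-Wilk test on the differences. Shapiro-Wilk is offered here only as one possible normality check for a small sample; it is an addition, not part of the original analysis, and the data are typed in from the table above.

```python
import numpy as np
from scipy import stats

# Aluminum content per clone, copied from the table above
august = np.array([18.3, 13.3, 16.5, 12.6, 9.5, 13.6, 8.1,
                   8.9, 10.0, 8.3, 7.9, 8.1, 13.4])
november = np.array([12.7, 11.1, 15.3, 12.7, 10.5, 15.6, 11.2,
                     14.2, 16.3, 15.5, 19.9, 20.4, 36.8])

# Paired differences: the quantity a paired test actually analyzes
diff = august - november

# Sample skewness: positive means a long right tail, negative a long left tail
nov_skew = stats.skew(november)
diff_skew = stats.skew(diff)

# Shapiro-Wilk test of normality applied to the paired differences
shapiro_stat, shapiro_p = stats.shapiro(diff)
print(nov_skew, diff_skew, shapiro_stat, shapiro_p)
```

With only 13 pairs, any normality test has limited power, so the histogram and the size of the outlying Wolterson value remain useful context when choosing between a paired t-test and a rank-based test.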
stats.wilcoxon(df_al['August'],df_al['November'])
WilcoxonResult(statistic=16.0, pvalue=0.039794921875)
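To see where the statistic of 16 comes from, the signed-rank calculation can be reproduced step by step: rank the absolute paired differences, then sum the ranks separately for positive and negative differences. The sketch below also computes an exact two-sided p-value by counting sign assignments, which is reasonable here because no difference is zero and no two absolute differences are tied. The data are typed in from the table above and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Aluminum content per clone, copied from the table above
august = np.array([18.3, 13.3, 16.5, 12.6, 9.5, 13.6, 8.1,
                   8.9, 10.0, 8.3, 7.9, 8.1, 13.4])
november = np.array([12.7, 11.1, 15.3, 12.7, 10.5, 15.6, 11.2,
                     14.2, 16.3, 15.5, 19.9, 20.4, 36.8])

diff = august - november
ranks = stats.rankdata(np.abs(diff))  # rank 1 = smallest absolute difference

w_plus = ranks[diff > 0].sum()   # rank sum where August > November
w_minus = ranks[diff < 0].sum()  # rank sum where August < November
w = min(w_plus, w_minus)         # two-sided test statistic

# Exact null distribution: under H0 each rank 1..n is equally likely to carry
# a plus or minus sign. counts[s] = number of sign assignments whose positive
# ranks sum to s.
n = diff.size
counts = np.zeros(n * (n + 1) // 2 + 1, dtype=np.int64)
counts[0] = 1
for r in range(1, n + 1):
    updated = counts.copy()
    updated[r:] += counts[:-r]  # either leave rank r out, or include it
    counts = updated

# Two-sided p-value: probability of a rank sum at least as extreme as w
p_exact = min(1.0, 2 * counts[: int(w) + 1].sum() / 2 ** n)
print(w_plus, w_minus, w, p_exact)
```

Only three clones (Columbia River, Fritzi Pauley, and Hazendans) had higher aluminum content in August, so the positive rank sum is small compared with the total of 91.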

b. Interpretation

In what situations are non-parametric statistics useful? What are the potential drawbacks of using non-parametric statistics when a parametric approach is justified?