Example: comparing temperature in three regions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats
import statsmodels.formula.api as smf
import pingouin as pg

import PyCO2SYS as pyco2
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/outdated/utils.py:14: OutdatedPackageWarning: The package pingouin is out of date. Your version is 0.5.0, the latest is 0.5.1.
Set the environment variable OUTDATED_IGNORE=1 to disable these warnings.
  return warn(

Load the data. Note: you may need to change the path to the data already downloaded last week.

filename07 = 'data/wcoa_cruise_2007/32WC20070511.exc.csv'
df07 = pd.read_csv(filename07,header=29,na_values=-999,parse_dates=[[6,7]])

Use the PyCO2SYS package to calculate seawater carbon chemistry parameters.

https://pyco2sys.readthedocs.io/en/latest/

c07 = pyco2.sys(df07['ALKALI'], df07['TCARBN'], 1, 2,
               salinity=df07['CTDSAL'], temperature=df07['CTDTMP'], 
                pressure=df07['CTDPRS'])

df07['OmegaA'] = c07['saturation_aragonite']
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/autograd/tracer.py:48: RuntimeWarning: invalid value encountered in sqrt
  return f_raw(*args, **kwargs)
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:99: RuntimeWarning: invalid value encountered in sqrt
  lnKF = 1590.2 / TempK - 12.641 + 1.525 * IonS ** 0.5
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:577: RuntimeWarning: overflow encountered in power
  K1 = 10.0 ** -(pK1)
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:583: RuntimeWarning: overflow encountered in power
  K2 = 10.0 ** -(pK2)
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:603: RuntimeWarning: overflow encountered in power
  K1 = 10.0 ** -pK1
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:611: RuntimeWarning: overflow encountered in power
  K2 = 10.0 ** -pK2
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:636: RuntimeWarning: overflow encountered in power
  K1 = 10.0 ** -pK1
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:641: RuntimeWarning: overflow encountered in power
  K2 = 10.0 ** -pK2
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:653: RuntimeWarning: overflow encountered in power
  K1 = 10.0 ** -pK1
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:658: RuntimeWarning: overflow encountered in power
  K2 = 10.0 ** -pK2
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/equilibria/p1atm.py:715: RuntimeWarning: overflow encountered in power
  K2 = 10.0 ** -pK2
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/solubility.py:41: RuntimeWarning: overflow encountered in power
  KAr = 10.0 ** logKAr  # this is in (mol/kg-SW)^2
/Users/tomconnolly/programs/miniconda3/envs/data-book/lib/python3.8/site-packages/PyCO2SYS/solubility.py:25: RuntimeWarning: overflow encountered in power
  KCa = 10.0 ** logKCa  # this is in (mol/kg-SW)^2 at zero pressure

Create a subset of good data in the upper 10m (near surface).

iisurf07 = ((df07['CTDPRS'] <= 10) &
      (df07['NITRAT_FLAG_W'] == 2) & (df07['PHSPHT_FLAG_W'] == 2)
      & (df07['CTDOXY_FLAG_W'] == 2) & (df07['CTDSAL_FLAG_W'] == 2) 
        & (df07['ALKALI_FLAG_W'] == 2) & (df07['TCARBN_FLAG_W'] == 2))

df07surf = df07[iisurf07]

Comparing three sample means

In this example, we are interested in comparing the observed surface temperature in three regions of the California Current System.

Create a categorical variable for the region

We will divide the data into three categories: north if the Columbia River, between the Columbia River and Golden Gate, and south of the Golden Gate.

# create a new variable called "region" with no values
df07surf = df07surf.assign(region = [None]*len(df07surf))
df07surf['region']
22      None
46      None
71      None
95      None
151     None
        ... 
2275    None
2287    None
2310    None
2334    None
2335    None
Name: region, Length: 138, dtype: object
# assign string values to region based on latitude
northern = (df07surf['LATITUDE'] > 46.2)
central = (df07surf['LATITUDE'] <= 46.2) & (df07surf['LATITUDE'] >= 37.8)
southern = (df07surf['LATITUDE'] < 37.8) 

df07surf.loc[northern,'region'] = 'north'
df07surf.loc[central,'region'] = 'central'
df07surf.loc[southern,'region'] = 'south'
df07surf['region']
22      north
46      north
71      north
95      north
151     north
        ...  
2275    south
2287    south
2310    south
2334    south
2335    south
Name: region, Length: 138, dtype: object

Box plot

plt.figure()
plt.boxplot([df07surf['CTDTMP'][df07surf['region']=='north'],
             df07surf['CTDTMP'][df07surf['region']=='central'],
             df07surf['CTDTMP'][df07surf['region']=='south']],
            labels=['north','central','south'],showmeans=True,notch=True);
plt.title('CTDTMP - upper 10m 2007')
plt.ylabel('[deg C]')
plt.xlabel('region')
Text(0.5, 0, 'region')
_images/week07e-example-comparing-wcoa-temperature_12_1.png

Analysis

  1. What test (or tests) would you use to determine whether there is a statistically significant difference in surface temperature between these regions?

  2. What possible alternative tests did you rule out?

  3. What is the null hypothesis being tested?

  4. What are the underlying assumptions?

  5. Try to implement the analysis with pingouin. For reference, see the pingouin documentation.