2. Descriptive Statistics#

Warning

Under construction

This notebook comprises various text and code snippets for generating plots and other content for the lectures corresponding to this topic. It is not a coherent set of lecture notes. Students should refer to the actual lecture slides available on Blackboard.

2.1. Intro#

2.1.1. Recap of last time#

In the introductory lecture, we loaded a dataset of daily observations at a weather station in Central Park spanning from 1869 to 2023. We printed out all the values of the precipitation and realized that this was not a useful way of analyzing a dataset of this size.

We then plotted this timeseries along with the corresponding one of daily average temperature. This enabled us to glean quite a few important things about each dataset! But it also made clear that either one is sufficiently large and complicated that we need to go further in order to really understand them. The first step in that process is what we’ll cover here: descriptive statistics.

2.1.2. Today: Descriptive Statistics#

Suppose you’ve just gotten hold of some data to analyze. In this case, the Central Park precipitation (or “precip” for short) timeseries. What ways can you condense it into digestible pieces?

The most compact way of representing any dataset is boiling it down to a single number (sometimes called a scalar). Now we’ll walk through some key scalar measures of any dataset, grouped by what aspect of the data they most directly capture:

  • Measures of central tendency: roughly, what do “typical” values of the dataset look like?

  • Measures of dispersion: roughly, are the values tightly clumped together or spread far apart?

  • Measures of shape: moving away from where the values most clump together, is there a long “tail” of values extending in one direction, or the other, or both? (I.e. how lopsided is it, and how “fat-tailed” or “skinny-tailed.”)

Before jumping in further, let’s re-load that same dataset, and while we’re at it import the packages we’ll need to make nice plots:

import xarray as xr

filepath_in = "../data/central-park-station-data_1869-01-01_2023-09-30.nc"
ds_central_park = xr.open_dataset(filepath_in)
precip_central_park = ds_central_park["precip"]
temp_central_park = ds_central_park["temp_avg"]
# First, import the matplotlib package that we'll use for plotting.
from matplotlib import pyplot as plt

# Then update the plotting aesthetics using my own custom package named "puffins"
# See: https://github.com/spencerahill/puffins
from puffins import plotting as pplt
plt.rcParams.update(pplt.plt_rc_params_custom)

2.2. Measures of Central Tendency#

2.2.1. Mean#

Probably the single most intuitive measure is the average, or mean. Sum up all the values, and divide by the number of values. Symbolically:

\[\overline{X}=\frac{1}{N}\sum_{i=1}^N X_i,\]

where

  • the overbar \(\overline{\phantom{X}}\) denotes the mean

  • \(X_i\) is our dataset

  • the subscript \(_i\) indexes the individual data points. So \(X_1\) is the first value, \(X_2\) is the second value, etc.

  • \(N\) is the total number of points

  • \(\sum_{i=1}^N\) is the standard notation for summation. It means: sum over all the values of \(X_i\) from \(i=1\) to \(i=N\)

(footnote: weighted averages)
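To see the formula in action, here’s a quick check on a small made-up sample (not the Central Park data) that applying the definition directly agrees with NumPy’s built-in mean:

```python
import numpy as np

# Small made-up sample, purely for illustration.
values = np.array([0.0, 0.3, 0.0, 1.2, 0.5])

# Apply the definition directly: sum all the values, divide by N.
mean_by_hand = values.sum() / len(values)

# The built-in routine agrees.
assert np.isclose(mean_by_hand, np.mean(values))
print(mean_by_hand)  # 0.4
```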

The mean of the Central Park rainfall dataset is:

precip_central_park.mean().values
array(0.12487456)

and of daily average temperature is:

temp_central_park.mean().values
array(54.07264685)

2.2.2. Median (and other quantiles)#

The median is the value such that exactly half of the data points lie below it, and half lie above it.

Compared to the mean, it is insensitive to “outliers”—that is, points that are way different than most of the other points.
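A sketch of both points on a small made-up sample: the median is the middle value after sorting, and unlike the mean it is unmoved by an extreme outlier.

```python
import numpy as np

# Made-up sample with an odd number of points, so the median is simply
# the middle value after sorting.
values = np.array([3.0, 0.0, 8.0, 1.0, 2.0])
median_by_hand = np.sort(values)[len(values) // 2]
assert median_by_hand == np.median(values)

# Replace the 8.0 with 800.0: the mean jumps drastically, but the
# median doesn't budge.
outlier_values = np.array([3.0, 0.0, 800.0, 1.0, 2.0])
assert np.median(outlier_values) == np.median(values)
assert np.mean(outlier_values) > np.mean(values)
```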

The median of the Central park rainfall dataset is:

precip_central_park.median().values
array(0.)

…zero? Can that be right? Yes: on more than half of days (in New York City, as in most places other than rainforests and other extremely wet locales), there is no rain at all. So the median, which separates the dataset into a lower half and an upper half, is zero.

This highlights that precipitation is not a truly continuous quantity the way say temperature is. On days that precipitation occurs, the amount is indeed continuous—there are no discrete amounts of rainfall that must occur. But on days with no precipitation at all, there is a single discrete value: zero.

Now let’s look at the temperature median:

temp_central_park.median().values
array(55.)

The median is also just a particular example of a more general quantity: quantiles (if expressed as fractions from 0 to 1) or percentiles (if expressed as percentages from 0% to 100%). These will come up again below when we discuss measures of variation.
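The fraction-vs.-percentage distinction is purely cosmetic, as a quick check on a made-up sample shows:

```python
import numpy as np

# Made-up sample: the integers 1 through 10.
values = np.arange(1.0, 11.0)

# Quantiles take fractions in [0, 1]; percentiles take percentages in
# [0, 100]. They are the same quantity up to that factor of 100.
q25 = np.quantile(values, 0.25)
p25 = np.percentile(values, 25)
assert q25 == p25

# And the 0.5 quantile (i.e. 50th percentile) is just the median.
assert np.quantile(values, 0.5) == np.median(values)
```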

Let’s look at a few select quantiles of the Central Park rainfall:

precip_central_park.quantile([0.25, 0.5, 0.75, 0.99])
<xarray.DataArray 'precip' (quantile: 4)>
array([0.  , 0.  , 0.05, 1.71])
Coordinates:
  * quantile  (quantile) float64 0.25 0.5 0.75 0.99

So in this case, the bottom 75% of points span only a 0.05 inch range, while the 75th to 99th percentile spans 1.66 inches!

2.2.3. Mode#

The mode is simply the value that occurs most frequently.

As with the median, there’s not a nice compact way of expressing the mode in an equation. You just count up how many times each value occurs and see which one occurs most often.
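That counting procedure can be sketched with Python’s standard-library `Counter` on a small made-up sample:

```python
from collections import Counter

# Small made-up sample of daily values.
values = [0.0, 0.1, 0.0, 0.3, 0.0, 0.1]

# Count occurrences of each value; the mode is the most common one.
counts = Counter(values)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 0.0 3
```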

So based on our discussion so far of the Central Park rainfall dataset, it should come as no surprise that the mode is zero:

import scipy
scipy.stats.mode(precip_central_park)
ModeResult(mode=array([0.]), count=array([37744]))

Note that this is a rare example where, for a continuous variable, the mode is well defined. We can contrast that with the Central Park temperature record, right? After all, temperature is a continuously defined quantity: the actual temperature outside right now could be (a very pleasant) 72.0 degrees Fahrenheit, or 72.04, or 72.040049, and so on: there’s no inherent minimum “gap” between possible values of temperature.

If that’s the case, then no value will ever repeat, because every value will always differ from all the others, even if by some very tiny amount. So let’s try it:

scipy.stats.mode(temp_central_park)
ModeResult(mode=array([72.5]), count=array([630]))

…what?! There are 630 instances of the daily average temperature being exactly 72.5 degrees Fahrenheit. How is that possible?

In practice, any instrument has finite precision: a limit on how finely it can distinguish different values. So if a dataset records temperature to the nearest tenth of a degree (e.g. a value of 84.2 degrees Fahrenheit, with the next highest possible recording being 84.3 rather than, say, 84.21), then a variable that is truly continuous in nature becomes, in practice, discrete. In that case there can be an actual mode value.
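A quick sketch of this effect with made-up data: draw continuous values from a normal distribution, where essentially no value ever repeats exactly, then round them to the nearest tenth as a finite-precision thermometer effectively would, and a mode appears.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

# Continuous made-up "temperatures": at full float precision,
# no value repeats exactly.
temps = rng.normal(loc=72.0, scale=10.0, size=10_000)
assert len(set(temps)) == len(temps)

# Round to the nearest tenth of a degree and repeats (and hence
# a well-defined mode) appear.
temps_recorded = np.round(temps, 1)
mode_value, mode_count = Counter(temps_recorded).most_common(1)[0]
assert mode_count > 1
```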

(As an aside, for those of us who have the pleasure of living in New York City, I find it comforting that the most frequently occurring daily average temperature is quite lovely at 72.5 degrees Fahrenheit!)

2.3. Measures of Dispersion#

While the above measures of central tendency tell you about “typical” values in a dataset, they tell you nothing about how the values vary, which can often be as important (or even more so) than what the “typical” value is. For example:

  • Are the values mostly clumped together very tightly around the average? Or do they span a very wide range?

  • Just how far apart are the single largest and smallest values?

  • Do values fall relatively evenly on both sides of the average? Or do most of them fall below it, or above it?

These and related questions can be usefully answered using measures of variation, which include two sub-categories: measures of dispersion and measures of shape. We’ll start with measures of dispersion: the range, interquartile range, variance, standard deviation, and coefficient of variation.

2.3.1. Range#

The range is simply the difference between the maximum and minimum values. In other words, it tells you the total span of the dataset: just how different can two values be?

For precipitation, we know already that the minimum possible value is zero, so the range is identical to the maximum:

import numpy as np
print(np.min(precip_central_park.values))
print(np.max(precip_central_park.values))
0.0
8.28
# "ptp" stands for "peak to peak"
np.ptp(precip_central_park.values)
8.28

In contrast, for temperature there is no comparable lower bound in practice (short of absolute zero, or -273.15\(^\circ\)C, which fortunately Earth’s atmosphere stays far above).

print(np.min(temp_central_park.values))
print(np.max(temp_central_park.values))
-5.5
94.0
np.ptp(temp_central_park.values)
99.5

2.3.2. Interquartile range#

The interquartile range (IQR) is probably more widely used than the full range. It is the difference between the 75th percentile value and the 25th percentile value. In other words, it is the range spanned by the “middle half” of the dataset (“half” because between the 75th and 25th percentiles lies 75-25=50% of the data).

For the Central Park precip, we know that the 25th percentile value must be zero (because the median, or 50th percentile, is 0, and so all lower percentiles must also be 0). But we don’t know yet what the 75th percentile value is. Let’s see:

print(np.percentile(precip_central_park, 75))
print(np.percentile(precip_central_park, 25))
0.05
0.0

So the IQR for precip is 0.05 inches:

scipy.stats.iqr(precip_central_park)
0.05

For temperature, the 25th and 75th percentiles are:

print(np.percentile(temp_central_park, 75))
print(np.percentile(temp_central_park, 25))
69.5
40.0

Yielding an IQR of 29.5 degrees F:

scipy.stats.iqr(temp_central_park)
29.5

2.3.3. Variance#

The variance measures the average squared deviation of the values from their mean:

\[\mathrm{Var}(X)=\frac{1}{N}\sum_{i=1}^N\left(X_i-\overline{X}\right)^2.\]

Because the deviations are squared, values far from the mean contribute disproportionately.

np.var(precip_central_park.values)
0.12510359360604995
np.var(temp_central_park.values)
316.2875568299918
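As a check on a small made-up sample, applying the average-squared-deviation definition by hand agrees with `np.var` (which, by default, divides by \(N\)):

```python
import numpy as np

# Small made-up sample.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Average squared deviation from the mean.
deviations = values - values.mean()
var_by_hand = (deviations ** 2).mean()

assert np.isclose(var_by_hand, np.var(values))
print(var_by_hand)  # 4.0
```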

2.3.4. Standard deviation#

The standard deviation is the square root of the variance. Unlike the variance, it has the same units as the data itself, which makes it much easier to interpret.

np.std(precip_central_park.values)
0.35369986373484785
np.std(temp_central_park.values)
17.78447516318634
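A quick sanity check on a made-up sample that the standard deviation is exactly the square root of the variance:

```python
import numpy as np

# Made-up sample, for illustration only.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standard deviation is the square root of the variance, and is
# expressed in the same units as the data.
assert np.isclose(np.std(values), np.sqrt(np.var(values)))
print(np.std(values))  # 2.0
```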

2.3.5. Coefficient of variation#

The coefficient of variation is the standard deviation divided by the mean. Because it is dimensionless, it is useful for comparing the spread of datasets with different units or very different typical magnitudes.

# The scipy package provides the coefficient of variation
# as a function called simply `variation`.
scipy.stats.variation(precip_central_park)
2.832441374046085
scipy.stats.variation(temp_central_park)
0.3288996599759609
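As a check on a made-up sample, dividing the standard deviation by the mean reproduces `scipy.stats.variation` (whose default, like `np.std`, divides by \(N\)):

```python
import numpy as np
from scipy import stats

# Made-up sample, for illustration only.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Coefficient of variation: standard deviation divided by the mean,
# a dimensionless measure of relative spread.
cov_by_hand = np.std(values) / np.mean(values)

assert np.isclose(cov_by_hand, stats.variation(values))
print(cov_by_hand)  # 0.4
```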

2.4. Measures of Shape#

2.4.1. Skewness#

The skewness measures how lopsided the dataset is about its mean: positive skew means a long tail of values stretching above the mean, negative skew a long tail below it, and zero skew a symmetric distribution.

scipy.stats.skew(precip_central_park)
5.559430548312563
scipy.stats.skew(temp_central_park)
-0.23580275858681166
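As a sketch of what’s being computed here (assuming SciPy’s default, the biased Fisher-Pearson estimator), the skewness of a small made-up sample can be reproduced by hand:

```python
import numpy as np
from scipy import stats

# Made-up, heavily right-skewed sample: mostly zeros plus one large
# value, loosely precip-like.
values = np.array([0.0, 0.0, 0.0, 0.1, 2.5])

# Standardize, then average the cubes: this is the Fisher-Pearson
# skewness that scipy.stats.skew computes by default (bias=True).
z = (values - values.mean()) / values.std()
skew_by_hand = (z ** 3).mean()

assert np.isclose(skew_by_hand, stats.skew(values))
assert skew_by_hand > 0  # long tail above the mean
```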

2.4.2. Kurtosis#

The kurtosis measures how “fat-tailed” or “skinny-tailed” a dataset is. SciPy reports the excess kurtosis, defined so that a normal distribution scores exactly zero: positive values indicate fatter-than-normal tails, negative values skinnier ones.

scipy.stats.kurtosis(precip_central_park)
50.90321795472724
scipy.stats.kurtosis(temp_central_park)
-0.8350567610482575
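Similarly, a by-hand sketch of the excess kurtosis, assuming SciPy’s defaults (Fisher definition, biased estimator): the mean fourth power of the standardized values, minus 3 so that a normal distribution scores zero.

```python
import numpy as np
from scipy import stats

# Made-up sample, for illustration only.
values = np.array([0.0, 0.0, 0.0, 0.1, 2.5])

# Standardize, average the fourth powers, subtract 3 ("excess"
# kurtosis, relative to a normal distribution).
z = (values - values.mean()) / values.std()
kurt_by_hand = (z ** 4).mean() - 3.0

assert np.isclose(kurt_by_hand, stats.kurtosis(values))
```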

2.5. Measures of Association#

  1. Covariance

  2. Correlation

  3. Example: Correlation between Temperature and Humidity

2.6. Data Visualization#

2.6.1. Importance of Visualization#

  1. Understanding Data

  2. Communicating Findings

2.6.2. Types of Plots and Graphs#

  1. Histograms

  2. Scatter Plots

  3. Line Charts

  4. Bar Charts

  5. Box Plots

2.6.3. Visualization Tools in Python#

  1. Matplotlib

  2. Seaborn

2.6.4. Best Practices in Visualization#

  1. Choosing the Right Chart Type

  2. Labeling and Annotations

  3. Aesthetics and Accessibility

2.7. Descriptive statistics and visualization as quality control#

The Central Park weather station has not always stayed in the exact same place or used the exact same instruments.

2.8. Conclusion#

  1. Summary of Key Concepts

  2. Relevance to Upcoming Topics

  3. Q&A and Feedback

2.9. Supplementary Materials#