HW04: Fitting normal distributions for each calendar month and fitting GEV to annual maxima#

DUE Wednesday, October 22nd

Introduction#

In this assignment, you’ll fit a normal distribution to the daily maximum temperature for each calendar month of the Central Park weather dataset, and then you’ll examine extremes of the diurnal temperature range, including fitting a generalized extreme value (GEV) distribution to its annual block maxima.

Load the Central Park data into this python session#


!pip install pooch

# The above command installs the needed `pooch` 3rd-party package if it's not already installed.


import hashlib  # for verifying that the Central Park file is not corrupted
import pathlib  # for constructing paths to the dataset's location on disk
import sys  # for checking if this is a Google Colab session or not
import pooch  # for downloading the dataset from the web, if needed


# Replace "../data" as needed to point to the correct directory for you.
# This can be an *absolute path* or a *relative path*.  One dot, `.`, means
# "this directory", while two dots, `..`, means "go up one directory."
LOCAL_DATA_DIR = "../data"  # If you're in Colab: just ignore this.

# The URL where the dataset can be downloaded from.
DATA_URL = (
    "https://spencerahill.github.io/25f-stat-methods-course/_downloads/"
    "91803b82950d49961a65355c075439b3/central-park-station-data_1869-01-01_2023-09-30.nc"
)

# This HASH_HEX stores a "hash" which we use to verify that the data you end up
# with has not been altered or corrupted compared to the one at the above URL.
HASH_HEX = "85237a4bae1202030a36f330764fd5bd0c2c4fa484b3ae34a05db49fe7721eee"


def create_data_path(
    colab_dir="/content/data", 
    local_dir=LOCAL_DATA_DIR,
    filename="central-park-station-data_1869-01-01_2023-09-30.nc",
):
    """Set the path for the data, whether on Colab or in a local Jupyter session."""
    is_this_a_colab = "google.colab" in sys.modules
    if is_this_a_colab:
        data_dir = colab_dir 
    else: 
        data_dir = local_dir

    DATA_DIR = pathlib.Path(data_dir)
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    return DATA_DIR / filename


def sha256sum(path: pathlib.Path) -> str:
    """Get the hash of the file at the specified path."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


DATA_PATH = create_data_path()
# Determine if we'll need to download the data, which we'll do if either (a) 
# the data can't be found, or (b) it appears corrupted/modified from the
# "master" file at the above URL.
need_fetch = (not DATA_PATH.exists()) or (sha256sum(DATA_PATH) != HASH_HEX)

# Download the data if needed.
if need_fetch:
    fetched_data = pooch.retrieve(
        url=DATA_URL, 
        known_hash=f"sha256:{HASH_HEX}",
        path=DATA_PATH.parents[0], 
        fname=DATA_PATH.name,
    )
    print(f"\nDownloaded and verified: {fetched_data}")
else:
    print(f"\nVerified existing file at {DATA_PATH}")

Verified existing file at ../data/central-park-station-data_1869-01-01_2023-09-30.nc
import xarray as xr

# The `DATA_PATH` variable was created by the hidden cell just above.
# Un-hide that cell if you want to see the details.
ds_cp = xr.open_dataset(DATA_PATH)
ds_cp
<xarray.Dataset> Size: 5MB
Dimensions:        (time: 56520)
Coordinates:
  * time           (time) datetime64[ns] 452kB 1869-01-01 ... 2023-09-30
Data variables:
    temp_max       (time) int64 452kB ...
    temp_min       (time) int64 452kB ...
    temp_avg       (time) float64 452kB ...
    temp_anom      (time) float64 452kB ...
    heat_deg_days  (time) int64 452kB ...
    cool_deg_days  (time) int64 452kB ...
    precip         (time) float64 452kB ...
    snow_fall      (time) float64 452kB ...
    snow_depth     (time) int64 452kB ...

Your specific tasks#

Normal distribution fits throughout the year#

For each of the 12 calendar months, January through December, do the following:

  • [ ] Fit a normal distribution to the daily maximum temperature from the Central Park weather station dataset, using all days in that month across all years.

  • [ ] Plot the histogram (with density=True) for that month, and overlay the curve of the fitted PDF. (One possible approach is sketched just after this list.)
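
If it’s helpful, here is a minimal sketch of one way to do the per-month fit and overlay, assuming the dataset has been loaded as ds_cp as above; it is not the only valid approach, and names like tmax are just illustrative.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Daily maximum temperature from the dataset loaded above.
tmax = ds_cp["temp_max"]

fig, axes = plt.subplots(3, 4, figsize=(14, 9), sharex=True)
# `groupby("time.month")` yields one group of days per calendar month, 1-12.
for (month, tmax_month), ax in zip(tmax.groupby("time.month"), axes.flat):
    vals = tmax_month.values
    # Fit a normal distribution; `fit` returns the estimated mean and std. dev.
    mu, sigma = stats.norm.fit(vals)
    # Normalized histogram with the fitted PDF overlaid.
    ax.hist(vals, bins=30, density=True, alpha=0.6)
    x = np.linspace(vals.min(), vals.max(), 200)
    ax.plot(x, stats.norm.pdf(x, loc=mu, scale=sigma), color="black")
    ax.set_title(f"Month {month}")
fig.tight_layout()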

After you’ve done that, make two more plots: one for the sample mean and one for the sample standard deviation. For the sample mean:

  • [ ] Plot the sample mean for each calendar month as a function of month. So the x-axis is month, numbered 1-12, and the y-axis is the mean.

  • [ ] On the same axes, plot the mean from the fitted normal distribution.

For the sample standard deviation, do the same:

  • [ ] Plot the sample standard deviation for each calendar month as a function of month. So the x-axis is month, numbered 1-12, and the y-axis is the sample standard deviation.

  • [ ] On the same axes, plot the standard deviation from the fitted normal distribution.

Put these two panels side by side in a single pyplot.Figure object; use plt.subplots for this.
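
A minimal sketch of this two-panel comparison figure follows, under the same ds_cp assumption as above. Note that scipy.stats.norm.fit returns maximum-likelihood estimates, so its standard deviation is the ddof=0 value and will differ only slightly from the ddof=1 sample standard deviation.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

tmax = ds_cp["temp_max"]
months = np.arange(1, 13)

sample_means, sample_stds, fit_means, fit_stds = [], [], [], []
for month, tmax_month in tmax.groupby("time.month"):
    vals = tmax_month.values
    sample_means.append(vals.mean())
    sample_stds.append(vals.std(ddof=1))  # sample standard deviation
    mu, sigma = stats.norm.fit(vals)  # MLE estimates (std. dev. uses ddof=0)
    fit_means.append(mu)
    fit_stds.append(sigma)

fig, (ax_mean, ax_std) = plt.subplots(1, 2, figsize=(10, 4))
ax_mean.plot(months, sample_means, marker="o", label="sample")
ax_mean.plot(months, fit_means, marker="x", label="normal fit")
ax_mean.set(xlabel="month", ylabel="mean of daily max. temperature")
ax_mean.legend()
ax_std.plot(months, sample_stds, marker="o", label="sample")
ax_std.plot(months, fit_stds, marker="x", label="normal fit")
ax_std.set(xlabel="month", ylabel="std. dev. of daily max. temperature")
ax_std.legend()
fig.tight_layout()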

Last, once you’ve done this, plot the histogram for all days across all twelve months. You’ll see that, as we saw in one of the lectures, it has a double-peaked structure. Based on your histograms and fitted normal distributions for each of the 12 calendar months, explain in a few sentences how this double-peaked structure for the whole year comes about.

Note that you don’t need to appeal to physical processes or arguments; base your explanation solely on how the sample means and standard deviations vary across the twelve months. (Hint: what would the all-days distribution look like if the standard deviation were constant across months and the mean shifted smoothly up and down? Would that give you two peaks or not?)
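
For the all-days histogram, a minimal sketch (same ds_cp assumption as above):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Normalized histogram of daily maximum temperature over the entire record.
ax.hist(ds_cp["temp_max"].values, bins=50, density=True)
ax.set(xlabel="daily max. temperature", ylabel="density")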

Block maxima and other metrics of extremes for diurnal temperature range#

Compute the diurnal temperature range by taking the daily maximum temperature minus the daily minimum temperature.

Then, compute the following metrics of extreme values for this new variable:

  • the block max: the single largest value in each calendar year

  • an exceedance count: the number of days exceeding the climatological 95th percentile, meaning the 95th percentile computed using all days across all years

  • the exceedance count again, but using the 99.9th percentile

  • the 99th percentile value computed separately within each individual year

Compare these different metrics of extremes. Describe in a few sentences the extent to which they behave similarly vs. differ from one another. This is an important part of extreme value analysis: making sure that your results don’t depend sensitively on the specific definition of “extreme” or on the specific threshold used. One possible way to compute these metrics is sketched below.
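
The following is one possible (illustrative, not prescriptive) way to compute these metrics, again assuming the ds_cp dataset from above; names like dtr and block_max are placeholders.

# Diurnal temperature range: daily max minus daily min.
dtr = ds_cp["temp_max"] - ds_cp["temp_min"]

# Block max: the single largest value in each calendar year.
block_max = dtr.groupby("time.year").max()

# Climatological thresholds, computed from all days across all years.
q95 = float(dtr.quantile(0.95))
q999 = float(dtr.quantile(0.999))

# Exceedance counts: number of days per year above each climatological threshold.
n_above_95 = (dtr > q95).groupby("time.year").sum()
n_above_999 = (dtr > q999).groupby("time.year").sum()

# The 99th percentile computed separately within each individual year.
q99_by_year = dtr.groupby("time.year").quantile(0.99)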

GEV fit for the block maxima#

Use scipy.stats.genextreme to fit a GEV distribution to the diurnal temperature range block maxima computed just above. Plot the normalized histogram of these block maxima and overlay the fitted GEV curve. Describe your impressions of the goodness of fit based on visual inspection of this plot.
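
A minimal sketch of one way to do this fit and plot, assuming an array of annual block maxima like the block_max sketched in the previous section:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

vals = np.asarray(block_max)  # annual block maxima of diurnal temperature range

# Fit the GEV; `fit` returns the shape, location, and scale parameters.
c, loc, scale = stats.genextreme.fit(vals)

fig, ax = plt.subplots()
# Normalized histogram of the block maxima with the fitted GEV PDF overlaid.
ax.hist(vals, bins=15, density=True, alpha=0.6, label="block maxima")
x = np.linspace(vals.min(), vals.max(), 200)
ax.plot(x, stats.genextreme.pdf(x, c, loc=loc, scale=scale), color="black",
        label="fitted GEV")
ax.set(xlabel="annual max. diurnal temperature range", ylabel="density")
ax.legend()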

How to submit#

Submit via this Google form

Extra credit#

None this time! Spend the extra time preparing for Friday’s midterm :)