```python
# --- DATA BOOTSTRAP (pooch + SHA-256 verification) ---

!pip -q install pooch

import sys, pathlib, hashlib
import xarray as xr
import pooch

# Public file and expected checksum
DATA_URL = (
    "https://spencerahill.github.io/25f-stat-methods-course/_downloads/"
    "91803b82950d49961a65355c075439b3/central-park-station-data_1869-01-01_2023-09-30.nc"
)
DATA_HASH = "sha256:85237a4bae1202030a36f330764fd5bd0c2c4fa484b3ae34a05db49fe7721eee"

# Local storage: /content/data in Colab, ../data otherwise
DATA_DIR = pathlib.Path("/content/data" if "google.colab" in sys.modules else "../data")
DATA_DIR.mkdir(parents=True, exist_ok=True)
DATA_PATH = DATA_DIR / "central-park-station-data_1869-01-01_2023-09-30.nc"

def sha256sum(p: pathlib.Path) -> str:
    return hashlib.sha256(p.read_bytes()).hexdigest()

if DATA_PATH.exists():
    # Strict guard: verify existing file before skipping download
    if sha256sum(DATA_PATH) == DATA_HASH.split(":", 1)[1]:
        print(f"Verified existing file at {DATA_PATH}")
    else:
        print("Existing file failed checksum — refetching...")
        DATA_PATH.unlink()
        fetched = pooch.retrieve(
            url=DATA_URL,
            known_hash=DATA_HASH,
            fname=DATA_PATH.name,  # pin the on-disk name so it matches DATA_PATH
            path=DATA_DIR,
        )
        print(f"Downloaded and verified: {fetched}")
else:
    fetched = pooch.retrieve(
        url=DATA_URL,
        known_hash=DATA_HASH,
        fname=DATA_PATH.name,  # without fname, pooch saves under a hashed filename, not DATA_PATH
        path=DATA_DIR,
    )
    print(f"Downloaded and verified: {fetched}")

# Open with xarray
ds = xr.open_dataset(DATA_PATH)
ds
```
```text
Verified existing file at ../data/central-park-station-data_1869-01-01_2023-09-30.nc
<xarray.Dataset> Size: 5MB
Dimensions:        (time: 56520)
Coordinates:
  * time           (time) datetime64[ns] 452kB 1869-01-01 ... 2023-09-30
Data variables:
    temp_max       (time) int64 452kB ...
    temp_min       (time) int64 452kB ...
    temp_avg       (time) float64 452kB ...
    temp_anom      (time) float64 452kB ...
    heat_deg_days  (time) int64 452kB ...
    cool_deg_days  (time) int64 452kB ...
    precip         (time) float64 452kB ...
    snow_fall      (time) float64 452kB ...
    snow_depth     (time) int64 452kB ...
```

# HW04: Fitting normal distributions for each calendar month and fitting a GEV to annual maxima

## Introduction

In this assignment, you’ll fit normal distributions to the daily maximum temperature for each calendar month, and then fit a generalized extreme value (GEV) distribution to annual block maxima of the diurnal temperature range, all using the Central Park weather station dataset.

## Your specific tasks

### Normal distribution fits throughout the year

For each of the 12 calendar months, January through December, do the following (a sketch follows the list):

- [ ] Fit a normal distribution to the daily maximum temperature from the Central Park weather station dataset, for all days in that month across all years.

- [ ] Plot the histogram (with `density=True`) for that month, and overlay the curve of the fitted PDF.
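
Here is a minimal sketch of the per-month fit-and-plot loop, assuming the `temp_max` variable shown in the dataset printout above and the `ds` object opened in the bootstrap cell; the 3x4 panel grid and bin count are illustrative choices, not requirements:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# One panel per calendar month; groupby("time.month") yields (month, subset) pairs.
fig, axes = plt.subplots(3, 4, figsize=(16, 9), sharex=True)
for (month, da), ax in zip(ds["temp_max"].groupby("time.month"), axes.flat):
    vals = da.values
    # For scipy.stats.norm, fit() returns the MLE location and scale (mean, std).
    mu, sigma = scipy.stats.norm.fit(vals)
    ax.hist(vals, bins=30, density=True, alpha=0.5)
    x = np.linspace(vals.min(), vals.max(), 200)
    ax.plot(x, scipy.stats.norm.pdf(x, mu, sigma))
    ax.set_title(f"Month {month}")
```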

After you’ve done that, make two more plots: one for the mean, one for the standard deviation. For the mean:

- [ ] Plot the sample mean for each calendar month as a function of month: the x-axis is month, numbered 1-12, and the y-axis is the mean.

- [ ] On the same axes, plot the mean from the fitted normal distribution.

For the standard deviation, do the same:

- [ ] Plot the sample standard deviation for each calendar month as a function of month: the x-axis is month, numbered 1-12, and the y-axis is the standard deviation.

- [ ] On the same axes, plot the standard deviation from the fitted normal distribution.

Put these panels right next to each other in one single `pyplot.Figure` object: use `plt.subplots` for this.
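
One possible layout is sketched below; the fitted parameters are recomputed here so the cell stands alone, and the marker styles are arbitrary choices:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

months = np.arange(1, 13)
# Sample statistics per calendar month, straight from the data.
sample_means = ds["temp_max"].groupby("time.month").mean()
sample_stds = ds["temp_max"].groupby("time.month").std()
# Fitted (mean, std) per month from scipy.stats.norm.fit.
fits = [scipy.stats.norm.fit(da.values) for _, da in ds["temp_max"].groupby("time.month")]
fitted_means = [mu for mu, _ in fits]
fitted_stds = [sigma for _, sigma in fits]

fig, (ax_mean, ax_std) = plt.subplots(1, 2, figsize=(10, 4))
ax_mean.plot(months, sample_means, marker="o", label="sample")
ax_mean.plot(months, fitted_means, marker="x", label="fitted")
ax_mean.set(xlabel="month", ylabel="mean")
ax_mean.legend()
ax_std.plot(months, sample_stds, marker="o", label="sample")
ax_std.plot(months, fitted_stds, marker="x", label="fitted")
ax_std.set(xlabel="month", ylabel="standard deviation")
ax_std.legend()
```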

Last, once you’ve done this, plot the histogram for all days across the whole year. You’ll see that, as we saw in one of the lectures, this has a double-peaked structure. Based on your histograms and fitted normal distributions for each of the 12 calendar months, explain in a few sentences how this double-peaked structure for the whole year comes about.

Note that you don’t need to appeal to physical processes or arguments; make your argument solely in terms of how the sample means and standard deviations vary across the twelve months. (Hint: consider what the all-days distribution would look like if the standard deviation were constant across months and the mean shifted smoothly up and down. Would that get you two peaks or not?)
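
One way to visualize the hint, assuming the `fitted_means` and `fitted_stds` lists from the previous sketch: overlay the twelve fitted monthly PDFs, each weighted by roughly 1/12 (months differ slightly in length), on the all-days histogram, so you can see how the mixture of shifted normals adds up:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

all_vals = ds["temp_max"].values
fig, ax = plt.subplots()
ax.hist(all_vals, bins=50, density=True, alpha=0.5, label="all days")
x = np.linspace(all_vals.min(), all_vals.max(), 300)
for mu, sigma in zip(fitted_means, fitted_stds):  # from the previous sketch
    # Each month contributes ~1/12 of the days, hence the 1/12 weighting.
    ax.plot(x, scipy.stats.norm.pdf(x, mu, sigma) / 12, color="gray", alpha=0.6)
ax.set_xlabel("daily max temperature")
ax.legend()
```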

### Block maxima and other metrics of extremes for diurnal temperature range

Compute the diurnal temperature range by taking the daily maximum temperature minus the daily minimum temperature.

Then, compute the following metrics of extreme values for this new variable (a computational sketch follows below):

- block max: the single largest value in each calendar year

- an exceedance count: the number of days in each year exceeding the climatological 95th percentile, meaning the 95th percentile computed using all days across all years

- the exceedance count again, but using the 99.9th percentile

- the 99th percentile value computed within each individual year

Compare these different metrics of extremes. Describe in a few sentences the extent to which they behave similarly vs. differ from one another. This is an important part of extreme value analysis: making sure that your results don’t depend sensitively on the specific definition of “extreme” or the specific threshold.
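
Here is a minimal sketch of how these metrics might be computed with xarray’s `groupby`, assuming the `temp_max` and `temp_min` variable names from the dataset printout above:

```python
# Diurnal temperature range: daily max minus daily min.
dtr = ds["temp_max"] - ds["temp_min"]

# Block max: single largest value in each calendar year.
block_max = dtr.groupby("time.year").max()

# Climatological thresholds, computed from all days across all years.
p95 = float(dtr.quantile(0.95))
p999 = float(dtr.quantile(0.999))

# Exceedance counts: number of days per year above each threshold.
count_p95 = (dtr > p95).groupby("time.year").sum()
count_p999 = (dtr > p999).groupby("time.year").sum()

# 99th percentile computed within each individual year.
p99_per_year = dtr.groupby("time.year").quantile(0.99)
```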

### GEV fit for the block maxima

Use `scipy.stats.genextreme` to fit a GEV to the block maxima computed just above for the diurnal temperature range. Plot the normalized histogram of these block maxima and overlay the fitted GEV curve. Describe your impressions of the goodness of fit based on visual inspection of this plot.
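
A minimal sketch of the fit and overlay, using the `block_max` array from the sketch above; note that `genextreme.fit` returns the shape parameter `c` (in scipy’s sign convention) along with location and scale:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

vals = block_max.values  # annual block maxima from the previous sketch
c, loc, scale = scipy.stats.genextreme.fit(vals)

fig, ax = plt.subplots()
ax.hist(vals, bins=20, density=True, alpha=0.5, label="annual block maxima")
x = np.linspace(vals.min(), vals.max(), 200)
ax.plot(x, scipy.stats.genextreme.pdf(x, c, loc=loc, scale=scale), label="fitted GEV")
ax.set_xlabel("diurnal temperature range block max")
ax.legend()
```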

## How to submit

### Submission URL

A Google Form will be posted soon.

DEADLINE: Wednesday 10/22

## Extra credit

Each extra credit option below earns you up to an extra 5% on this assignment.

(The extra credit options will be posted soon.)