HW04: Fitting normal distributions for each calendar month and fitting GEV to annual maxima
DUE Wednesday, October 22nd
Introduction
In this assignment, you’ll fit a normal distribution to daily maximum temperatures for each of the 12 calendar months, and then fit a generalized extreme value (GEV) distribution to annual block maxima of the diurnal temperature range, all using the Central Park weather dataset.
Load the Central Park data into this Python session
Explanation of data downloading logic (if you’re interested)
To make these Jupyter notebooks work when launched in Google Colab (which you can do by clicking the “rocket” icon in the top right of the rendered version of this page on the web), we need some logic that downloads the data.
While we’re at it, we use the file’s “hash” to check that it has not been altered or corrupted from its original version. We do this whether or not the file was just downloaded, since it’s possible to (accidentally) modify the netCDF file on disk after downloading it.
In the rendered HTML version of the site, this cell is hidden, since otherwise it’s a bit distracting. But you can click on it to reveal its content.
If you’re in a Google Colab session, you don’t need to modify anything in that cell; just run it. Otherwise, modify the LOCAL_DATA_DIR variable defined in the next Python cell to point to where the dataset lives on your machine, or where you want it to be downloaded to if you don’t have it already.
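For reference, here is a minimal sketch of what such download-and-verify logic might look like. This is an illustration only: the URL, filename, and hash below are hypothetical placeholders, not the actual values used by the hidden cell.

import hashlib
import urllib.request
from pathlib import Path

# Hypothetical placeholders; the hidden cell defines the real values.
DATA_URL = "https://example.com/central-park-weather.nc"
LOCAL_DATA_DIR = Path("data")
DATA_PATH = LOCAL_DATA_DIR / "central-park-weather.nc"
EXPECTED_SHA256 = "<expected-hash-here>"

# Download the file only if it isn't already on disk.
LOCAL_DATA_DIR.mkdir(parents=True, exist_ok=True)
if not DATA_PATH.exists():
    urllib.request.urlretrieve(DATA_URL, str(DATA_PATH))

# Verify the hash whether or not the file was just downloaded.
file_hash = hashlib.sha256(DATA_PATH.read_bytes()).hexdigest()
if file_hash != EXPECTED_SHA256:
    raise ValueError(f"hash mismatch for {DATA_PATH}: {file_hash}")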
import xarray as xr
# `DATA_PATH` variable was created by the hidden cell just above.
# Un-hide that cell if you want to see the details.
ds_cp = xr.open_dataset(DATA_PATH)
ds_cp
<xarray.Dataset> Size: 5MB
Dimensions:        (time: 56520)
Coordinates:
  * time           (time) datetime64[ns] 452kB 1869-01-01 ... 2023-09-30
Data variables:
    temp_max       (time) int64 452kB ...
    temp_min       (time) int64 452kB ...
    temp_avg       (time) float64 452kB ...
    temp_anom      (time) float64 452kB ...
    heat_deg_days  (time) int64 452kB ...
    cool_deg_days  (time) int64 452kB ...
    precip         (time) float64 452kB ...
    snow_fall      (time) float64 452kB ...
    snow_depth     (time) int64 452kB ...

Your specific tasks
Normal distribution fits throughout the year
For each of the 12 calendar months, January through December, do the following:
[ ] Fit a normal distribution to the daily maximum temperature from the Central Park weather station dataset, for all days in that month across all years.
[ ] Plot the histogram (with density=True) for that month, and overlay the curve of the fitted PDF. (A sketch of one way to start follows this list.)
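Here is a minimal sketch of the fit-and-plot step for a single month, assuming ds_cp is loaded as above. scipy.stats.norm.fit is one option; for a normal distribution it simply returns the maximum-likelihood estimates of the mean and standard deviation.

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# Example for January; loop over range(1, 13) to cover all months.
temps = ds_cp["temp_max"].where(ds_cp["time"].dt.month == 1, drop=True)

# For a normal distribution, norm.fit returns the MLEs: the sample
# mean and the (ddof=0) standard deviation.
mean_fit, std_fit = scipy.stats.norm.fit(temps.values)

fig, ax = plt.subplots()
ax.hist(temps.values, bins=30, density=True, alpha=0.5, label="observed")
x = np.linspace(float(temps.min()), float(temps.max()), 200)
ax.plot(x, scipy.stats.norm.pdf(x, mean_fit, std_fit), label="fitted normal PDF")
ax.set_xlabel("daily maximum temperature")
ax.set_ylabel("density")
ax.legend()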
After you’ve done that, make two more plots: one for the sample mean and one for the sample standard deviation. For the sample mean:
[ ] Plot the sample mean for each calendar month as a function of month. So the x-axis is month, numbered 1-12, and the y-axis is the mean.
[ ] On the same axes, plot the mean from the fitted normal distribution.
For the sample standard deviation, do exactly the same:
[ ] Plot the sample standard deviation for each calendar month as a function of month. So the x-axis is month, numbered 1-12, and the y-axis is the sample standard deviation.
[ ] On the same axes, plot the standard deviation from the fitted normal distribution.
Put these panels side by side in a single pyplot.Figure object: use plt.subplots for this.
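One possible layout, assuming you’ve accumulated the twelve per-month values into lists with the (hypothetical) names below:

import matplotlib.pyplot as plt

months = range(1, 13)
# sample_means, fitted_means, sample_stds, fitted_stds are assumed to
# be 12-element lists filled in during the per-month loop above.
fig, (ax_mean, ax_std) = plt.subplots(1, 2, figsize=(10, 4))

ax_mean.plot(months, sample_means, marker="o", label="sample")
ax_mean.plot(months, fitted_means, marker="x", label="fitted normal")
ax_mean.set_xlabel("month")
ax_mean.set_ylabel("mean")
ax_mean.legend()

ax_std.plot(months, sample_stds, marker="o", label="sample")
ax_std.plot(months, fitted_stds, marker="x", label="fitted normal")
ax_std.set_xlabel("month")
ax_std.set_ylabel("standard deviation")
ax_std.legend()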
Last, once you’ve done all this, plot the histogram for all days combined. You’ll see that, as we saw in one of the lectures, it has a double-peaked structure. Based on your histograms and fitted normal distributions for each of the 12 calendar months, explain in a few sentences how this double-peaked structure for the whole year comes about.
Note that you don’t need to appeal to physical processes or arguments; base your explanation solely on how the sample means and standard deviations vary across the twelve months. (Hint: what would the annual distribution look like if the standard deviation were constant across months and the mean shifted smoothly up and down? Would that produce two peaks or not?)
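The all-days histogram itself requires no per-month filtering; a sketch, again assuming ds_cp from above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(ds_cp["temp_max"].values, bins=50, density=True)
ax.set_xlabel("daily maximum temperature")
ax.set_ylabel("density")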
Block maxima and other metrics of extremes for diurnal temperature range
Compute the diurnal temperature range by taking the daily maximum temperature minus the daily minimum temperature.
Then, compute the following metrics of extreme values for this new variable (a sketch of one approach follows the list):
block max (single largest value in each calendar year)
an exceedance count: the number of days exceeding the climatological 95th percentile, meaning the 95th percentile computed using all days across all years
the exceedance count again but using the 99.9th percentile
the 99th percentile value, computed separately within each individual year (unlike the climatological thresholds above)
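Here is a sketch of one way to compute these with xarray, assuming ds_cp from above; the variable names are illustrative:

# Diurnal temperature range: daily max minus daily min.
dtr = ds_cp["temp_max"] - ds_cp["temp_min"]

# Block max: the single largest value in each calendar year.
block_max = dtr.groupby("time.year").max()

# Climatological thresholds, computed over all days across all years.
thresh_95 = dtr.quantile(0.95)
thresh_999 = dtr.quantile(0.999)

# Exceedance counts: number of days per year above each threshold.
count_95 = (dtr > thresh_95).groupby("time.year").sum()
count_999 = (dtr > thresh_999).groupby("time.year").sum()

# 99th percentile computed separately within each year.
q99_per_year = dtr.groupby("time.year").quantile(0.99)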
Compare these different metrics of extremes. Describe in a few sentences the extent to which they behave similarly vs. differ from one another. This is an important part of extreme value analysis: making sure that your results don’t sensitively depend on the specific definition of “extreme” or the specific threshold used.
GEV fit for the block maxima
Use scipy.stats.genextreme to fit a GEV to the block max computed just above for the diurnal temperature range. Plot the normalized histogram of this block max and overlay the fitted GEV curve. Describe your impressions of the goodness of fit based on visual inspection of this plot.
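A minimal sketch, assuming block_max was computed as in the previous section:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

# genextreme.fit returns (shape, loc, scale) for the GEV.
shape, loc, scale = scipy.stats.genextreme.fit(block_max.values)

fig, ax = plt.subplots()
ax.hist(block_max.values, bins=20, density=True, alpha=0.5,
        label="annual block maxima")
x = np.linspace(float(block_max.min()), float(block_max.max()), 200)
ax.plot(x, scipy.stats.genextreme.pdf(x, shape, loc=loc, scale=scale),
        label="fitted GEV")
ax.set_xlabel("diurnal temperature range (block max)")
ax.set_ylabel("density")
ax.legend()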
How to submit
Submit via this Google form
Extra credit
None this time! Spend the extra time preparing for Friday’s midterm :)