3. Key concepts from probability theory and empirical probabilities computed from the Central Park weather dataset#

3.1. Introduction#

In this assignment, you’ll first answer some questions about the fundamental concepts underlying probability theory, and then you’ll compute various empirical probabilities using the Central Park weather dataset.

3.1.1. NOTICE: use the updated Central Park dataset that includes data through the end of September 2023#

I have generated a new version of the dataset that extends through the end of September 2023, in order to include the historic rain event we experienced on September 29th. Please download and use this updated version. It is labeled on Blackboard as “Central Park weather station dataset, 1869-01-01 to 2023-09-30 (as netCDF file)”. And see further below for specific instructions about the filename.

3.2. Your specific tasks#

3.2.1. Sets and probabilities#

Answer each of the following, by providing two things:

  • A symbolic answer: that is, the answer written using numbers and mathematical symbols like \(P(E)\), etc. Follow all notation that we used in class. OR, if the question is yes/no, then either “Yes” or “No”.

  • A one or two sentence explanation in plain English of what your answer means.

Here’s an example:

Question 0: Is \(\{1/6, 1/6, 1/6, 1/6, 1/6, 1/6\}\) a valid set?

Answer 0: No. Each member in a set must be unique. The closest to this that would be a valid set is \(\{1/6\}\).

OK, now here are the actual 12 questions for you to answer:

  1. If \(E_1=\{1,2,3\}\) and \(E_2=\{3,1,2\}\), what’s the relationship between \(E_1\) and \(E_2\)?

  2. What is \(\{1, 3, 5\}\cup\{1,2,3\}\)?

  3. What is \(\{1, 3, 5\}\cap\{1,2,3\}\)?

  4. Consider a single roll of a fair, standard 6-sided die. Is the event {roll a 1, roll a 6} a simple event?

  5. If some outcome is impossible, does that mean it can’t be a valid event? (Provide an example.)

  6. What is the sample space of three consecutive coin flips? (Use H to denote Heads and T to denote Tails.)

  7. If \(S_1\) and \(S_2\) are sets, and \(S_1\subseteq S_2\), what is \(S_1\cup S_2\)?

  8. If \(S_1\) and \(S_2\) are sets, and \(S_1\subseteq S_2\), what is \(S_1\cap S_2\)?

  9. What is \(P((E_1\cup E_2)^C)\)? (The Venn diagram graphical depiction of the sample space is helpful here.)

  10. What is \(P((E_2\cap E_1)^C)\)?

  11. Suppose you know \(P(E_1|E_2)\), \(P(E_1)\), and \(P(E_2)\). What is \(P(E_2|E_1)\) in terms of those three quantities?

  12. Can two events (each with nonzero probability) that are independent also be mutually exclusive?

3.2.2. Empirical probabilities in the Central Park dataset#

For each of the following probabilities:

  1. P(daily average temperature > 70°F)

  2. P(daily average temperature > 70°F | the day is in July)

  3. P(daily average temperature > 70°F | the day is in January)

  4. P(snow fall > 0”)

  5. P(snow fall > 0” | the day is in January or February)

  6. P(temp_anom magnitude exceeds 5°F)

  7. P(temp_anom magnitude exceeds 5°F | the day is a Wednesday)

  8. P(daily minimum temperature < 32°F)

  9. P(daily minimum temperature < 32°F | year is 1901-1930)

  10. P(daily minimum temperature < 32°F | year is 1991-2020)

  11. P(precip > 5”)

  12. P(precip > 5” | the day is in September)

…do each of the following:

  • [ ] Compute the empirical probability using the Central Park dataset. Include the code you used to compute it as well as the actual result. (A hedged starting-point sketch is given just after this list.)

  • [ ] Describe in 1-2 sentences your interpretation.
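As a hedged starting point for the first task above, the computation might look like the following sketch. It assumes the dataset has been loaded with xarray, that the file uses the relative filename given in the submission instructions farther below, and that the time coordinate is named time; adapt the variable names and thresholds to whichever probability you are computing.

import xarray as xr

# Open the Central Park dataset via a relative path (see the submission instructions below).
ds = xr.open_dataset("./central-park-station-data_1869-01-01_2023-09-30.nc")

# Unconditional probability: the fraction of all days whose average temperature exceeds 70 F.
exceeds_70 = ds["temp_avg"] > 70
p_exceeds_70 = float(exceeds_70.mean())

# Conditional probability: first restrict the sample space to July days only,
# then take the fraction of those days that exceed 70 F.
july_temps = ds["temp_avg"].where(ds["time"].dt.month == 7, drop=True)
p_exceeds_70_given_july = float((july_temps > 70).mean())

print(p_exceeds_70, p_exceeds_70_given_july)

For the day-of-week condition, ds["time"].dt.dayofweek (where Monday is 0) is one way to build the mask.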

3.3. Empirical PDFs and CDFs in the Central Park dataset#

For any 3 of the following variables in the Central Park daily weather dataset…

  • temp_avg (daily average temperature)

  • temp_min (daily minimum temperature)

  • temp_max (daily maximum temperature)

  • temp_anom (daily average temperature departure from “normal”, i.e. a 30-year average)

  • heat_deg_days (heating degree days)

  • cool_deg_days (cooling degree days)

  • precip (precipitation in inches; when it’s snow this is snow water equivalent)

  • snow_fall (snowfall in inches that day)

  • snow_depth (depth in inches of snow currently on the ground)

…do each of the following:

  • [ ] revisit the histogram you plotted for it from the last assignment. Experiment with the bin spacing and the density keyword argument to find what you feel is the best balance between resolving fine-grained details (recall the single vs. double peak in daily average temperature) on the one hand and excessive noise on the other (a starting-point sketch follows this list). Include only the final histogram you decide on as a plot in your notebook, and in 1-3 sentences describe any salient features, including how it compares to the one you generated last assignment.

  • [ ] compute and plot its empirical cumulative distribution function, and describe it in 1-2 sentences (a sketch follows the scipy note below).
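Here is a minimal sketch of the bin-spacing experiment mentioned above, assuming the xarray-loaded ds from the earlier sketch plus matplotlib; swap in whichever variable and bin counts you want to compare.

import matplotlib.pyplot as plt

temp_avg = ds["temp_avg"].dropna("time").values  # drop any missing days

# Try a few bin spacings side by side. density=True normalizes the histogram
# so it approximates the PDF (total area 1) rather than showing raw counts.
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, nbins in zip(axes, (20, 60, 200)):
    ax.hist(temp_avg, bins=nbins, density=True)
    ax.set_title(f"{nbins} bins")
    ax.set_xlabel("daily average temperature (°F)")
axes[0].set_ylabel("density")
plt.show()

Only the version you settle on needs to go in your submitted notebook.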

Warning

scipy.stats.ecdf is only available in scipy versions >= 1.11

The scipy.stats.ecdf function was added to scipy in version 1.11, which was released on June 25, 2023. This is recent enough that some students don’t have this version of scipy installed, leading to an AttributeError when they try to call it.

If this happens to you, you have a few options:

  1. Update your scipy to the latest version, 1.11.3. If you use the GUI version of Anaconda, you can do this within Anaconda Navigator. If you use the command-line version of conda, try conda update scipy. If you use neither of these, try pip install scipy --upgrade.

  2. You can use the ECDF function implemented in the statsmodels package. (If you don’t have this installed, you’d have to install it.)

  3. You can implement the ECDF function yourself by hand…this is easier than it might sound! In fact, I’ve added it below as an extra credit option to entice one or more of you to try it.
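Whichever option you go with, the ECDF step itself is short. Here is a hedged sketch of options 1 and 2, again assuming the ds loaded in the earlier sketch; the scipy call requires scipy >= 1.11 as noted above.

import matplotlib.pyplot as plt
from scipy import stats  # stats.ecdf requires scipy >= 1.11

temp_avg = ds["temp_avg"].dropna("time").values

# Option 1: scipy's implementation.
res = stats.ecdf(temp_avg)
fig, ax = plt.subplots()
ax.step(res.cdf.quantiles, res.cdf.probabilities, where="post")

# Option 2 (fallback): statsmodels gives essentially the same curve.
# from statsmodels.distributions.empirical_distribution import ECDF
# ecdf = ECDF(temp_avg)
# ax.step(ecdf.x[1:], ecdf.y[1:], where="post")  # skip the leading -inf point

ax.set_xlabel("daily average temperature (°F)")
ax.set_ylabel("empirical CDF")
plt.show()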

3.4. How to submit#

3.4.1. Submission URLs#

Use the Google Form link in the “Links for homework submissions” document on the course Blackboard site to submit. It’s the one labeled “HW3”. (You must be logged into your CUNY citymail account in Google to be granted access to the submission form.)

Submit the notebook as a single .ipynb file with a filename matching exactly the pattern eas42000_hw3_lastname-firstname.ipynb, replacing lastname with your last name and firstname with your first name.

3.4.2. Use a relative path to the netCDF file in your code#

Important: you must copy the Central Park dataset netCDF file into the same directory/folder that holds your homework .ipynb file, and your code must refer to that file using a relative rather than absolute filepath. I.e.:

path_to_cp = "./central-park-station-data_1869-01-01_2023-09-30.nc"  # this is good; it will work on my computer

NOT the absolute path to where the file lives on your computer:

path_to_cp = "/Users/jane-student/eas4200/central-park-station-data_1869-01-01_2023-09-30.nc"  # this will NOT work on my computer

If you don’t follow this instruction, your Notebook will not run successfully when I go to run it on my computer.

3.4.3. Use this exact name for the netCDF file: central-park-station-data_1869-01-01_2023-09-30.nc#

Similar to, but distinct from, the relative vs. absolute path issue immediately above: on my computer the Central Park dataset is saved with the filename printed just above. That means the file must have that name on your computer too; if your notebook refers to a file with a different name, my computer won’t find it and your notebook will crash.

3.4.4. Your Notebook must run successfully start-to-finish on my laptop#

I will run every person’s notebook on my own computer as part of the grading. If that is unsuccessful—meaning that when I select “Run all cells”, the code execution crashes at any point with an error message—you automatically lose 5% on the assignment.

If I can easily fix the issue, I’ll proceed with that submission. If not, I will email you asking you to re-submit a version that does run. (And each subsequent submission that doesn’t run successfully loses an additional 5%.)

A crash could stem from the relative/absolute filepath and/or filename issues described immediately above, or from any other bug in your code.

Why? It takes a lot of time to debug someone else’s Notebook that doesn’t work. And meanwhile, it’s very easy for you to follow this instruction (see bold paragraph immediately below). So it’s just not fair to me if I have to spend a lot of time debugging your code.

To prevent this, as a last step before submitting your Notebook, I URGE you to restart your Jupyter Kernel, select “Run all cells”, and make sure that it runs successfully. Then save it one last time and upload it.

3.4.5. A note on ChatGPT etc.#

The first batch of questions on sets and probability are, frankly, ones that are very tempting targets to feed into ChatGPT and related tools. Do so at your own peril! Not because it will necessarily give you the wrong answers (it might or might not), but because then you probably won’t learn nearly as much.

Also, and perhaps more practically relevant to you: the upcoming midterm exam—which will be 100% closed book/notes/computers/etc., paper-and-pencil—will have quite a few questions of a very similar nature to these. So consider these as practice for the exam in that sense, meaning the more you find the discipline to answer them just on your own using the lecture slides and other course materials, the better off you’ll be come test day.

And of course refer to and adhere to the policies in the syllabus regarding these tools.

3.5. Extra credit#

Each extra credit option below earns you up to an extra 5% on this assignment.

3.5.1. Explicitly compare the empirical CDFs to the integrals of your empirical PDFs#

As discussed in class, the cumulative distribution function amounts to the integral of the probability density function. This task asks you to verify that for yourself numerically.

Pick any two of the variables in the Central Park dataset. Take the empirical PDF you generated via the histograms, and numerically integrate it using a method/package/etc. of your choosing. Plot the result against the “true” empirical CDF you generated already. Describe what you find in 1-2 sentences.
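One hedged way to set this check up, assuming the ds from the earlier sketches plus numpy and scipy: integrate the density-normalized histogram as a running sum of density times bin width, then compare it to the empirical CDF at the bin edges.

import numpy as np
from scipy import stats

x = ds["temp_avg"].dropna("time").values

# Empirical PDF from a density-normalized histogram.
density, edges = np.histogram(x, bins=60, density=True)

# Numerically integrate the PDF: running sum of (density * bin width).
cdf_from_pdf = np.cumsum(density * np.diff(edges))

# "True" empirical CDF evaluated at the right-hand bin edges.
ecdf_at_edges = stats.ecdf(x).cdf.evaluate(edges[1:])

# Any disagreement comes from the binning and should shrink as the bins narrow.
print(np.max(np.abs(cdf_from_pdf - ecdf_at_edges)))

Plotting cdf_from_pdf and ecdf_at_edges against edges[1:] on the same axes makes the comparison visual.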

3.5.2. Try out “Bayesian blocks” for your PDF bins#

Bayesian blocks is a method for choosing optimal bin sizes for histograms in which the bin sizes do not have to be uniform. (It’s much more involved than that, but that suffices for our purposes).

Return to your empirical PDFs for any two variables in the Central Park dataset. Re-generate their histograms using the bins generated by this Bayesian blocks method. It is implemented in the astropy package; see the astropy documentation.

Compare the results to the PDFs you generated before. Describe salient differences and/or similarities.
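A hedged sketch of how this might look, assuming astropy is installed and reusing the ds from earlier; astropy.stats.bayesian_blocks returns bin edges that plt.hist accepts directly.

import matplotlib.pyplot as plt
from astropy.stats import bayesian_blocks

x = ds["precip"].dropna("time").values

# Bayesian blocks chooses (possibly non-uniform) bin edges from the data itself.
# Note: the algorithm scales roughly as N**2, so 150+ years of daily data can take a while.
edges = bayesian_blocks(x)

fig, ax = plt.subplots()
ax.hist(x, bins=edges, density=True)
ax.set_xlabel("daily precipitation (inches)")
ax.set_ylabel("density")
plt.show()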

3.5.3. Create your own empirical CDF function#

That is, define a function that computes the empirical CDF value at each point in the array it is given, and does so “from scratch,” meaning that it doesn’t just call e.g. scipy.stats.ecdf.

Hint: start by sorting the array. What’s the quantile of the smallest value? Of the 2nd smallest? Of the largest?

Verify that your function gives (nearly) the same results as scipy.stats.ecdf or another available implementation.

3.5.4. Compute some key empirical probabilities on data you’re using for your final project#

Compute and report some key empirical probabilities from the data for your final project. Which probabilities would be key vs. not depends on your particular project. For example, suppose you were using the Central Park dataset to investigate extreme snow events. Then probabilities of snow exceeding various high thresholds would definitely be key, while probabilities of say exceeding various thresholds in maximum daily temperatures in July (when there is never any snow) would not.

This must include at least one of each of the following:

  • unconditional probability of a single event

  • unconditional probability of the union of two or more events

  • unconditional probability of the intersection of two or more events

  • one or more conditional probabilities

This is a double bonus: it gets you extra credit on this assignment and helps you make progress on your final project!
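To make the union and intersection items concrete, here is a hedged illustration using the Central Park variables and the ds from earlier; for your own project you would swap in events defined from your own data, but the boolean-mask pattern is the same.

# Example events: A = "measurable snow fell that day", B = "daily minimum temperature below 32 F".
event_a = ds["snow_fall"] > 0
event_b = ds["temp_min"] < 32

p_union = float((event_a | event_b).mean())         # P(A union B)
p_intersection = float((event_a & event_b).mean())  # P(A intersection B)
p_b_given_a = float((event_a & event_b).sum() / event_a.sum())  # P(B | A)

print(p_union, p_intersection, p_b_given_a)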

3.5.5. Submitting the extra credit: as the .html output of a separate notebook file#

(Please submit the extra credit this way even if you don’t do the part that involves using your own final project data.)

Each of you is using a different dataset, none of which I have on my local computer where I’ll be running your notebooks. And some of these datasets are quite large, so it’s impractical to have you include your datasets along with the notebook file. But that means that I won’t be able to execute this portion of your code on my laptop.

For that reason, please do the following:

  1. Perform all of your calculations for the extra credit in a separate .ipynb notebook file from the main one that you’ll submit as described above.

  2. Once your Extra Credit notebook is 100% ready, restart its kernel and run it start to finish as described above.

  3. At that point, export it to an HTML file using Jupyter’s built-in exporting features.

  4. Upload that HTML file via the same link described above for the main submission.
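Regarding step 3: depending on your Jupyter interface, the export option lives under the File menu (for example, “Save and Export Notebook As → HTML” in JupyterLab); the command-line equivalent is jupyter nbconvert --to html followed by the path to your extra credit notebook.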