4. Probability distributions and hypothesis tests homework assignment

In this assignment, we’ll dig further into \(t\) tests.

Please read through the entire assignment carefully. Note that the instructions for this assignment are deliberately less explicit than previous ones. This is to help you build the critical scientific skills of reading carefully, thinking critically, and making decisions under uncertainty. In other words, it’s on you to find what’s being asked, make your own plan for how to tackle it, and then execute that plan.

4.1. Compare the \(t\) and normal distributions

In class, we learned that for hypothesis testing of differences in means, the test statistic is called Student’s \(t\) or just \(t\). The \(t\) distribution is similar but not identical to the standard normal distribution, especially for small sample sizes. Its exact shape depends on the degrees of freedom, denoted \(\nu\). For the difference in two means, this is simply \(\nu=N_1+N_2-2\), where \(N_1\) and \(N_2\) are the sample sizes of the first and second samples, respectively.

Using scipy.stats.t.pdf and scipy.stats.norm.pdf, plot the PDFs of both distributions spanning from -5 to 5. For the \(t\) distribution, overlay 3 different versions on the same axes: \(\nu=1, 10\), and \(100\). Add a legend labeling each of these and the normal.

This demonstrates that, as the sample sizes grow, the \(t\) distribution is increasingly well approximated by the standard normal. (In fact, the \(t\) distribution converges exactly to the standard normal as \(\nu\rightarrow\infty\).)
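The convergence can also be checked numerically, without any plotting. The sketch below is only an illustration: it measures the largest gap between each \(t\) PDF and the standard normal PDF on a grid from -5 to 5. The grid resolution and the name max_diffs are arbitrary choices made for this example, not part of the assignment.

```python
import numpy as np
from scipy import stats

# Grid spanning the same range as the requested plot (1001 points is arbitrary).
x = np.linspace(-5, 5, 1001)
normal_pdf = stats.norm.pdf(x)

# Largest absolute gap between the t PDF and the standard normal PDF,
# for each number of degrees of freedom.
max_diffs = {}
for nu in (1, 10, 100):
    t_pdf = stats.t.pdf(x, df=nu)
    max_diffs[nu] = np.max(np.abs(t_pdf - normal_pdf))
    print(f"nu = {nu:>3}: max |t PDF - normal PDF| = {max_diffs[nu]:.4f}")
```

The printed gaps should shrink as \(\nu\) increases, which is the same behavior the overlaid curves show visually.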

4.2. Find the most helpful online resources on \(t\) tests

There are tons of great resources that are available online and completely free. These include blog posts, YouTube videos, tutorials in the documentation to statistical software packages, whole semester-length university classes that have been pre-recorded, and more. (The course resources page lists just a few.) An important but overlooked skill (you could call it a meta skill) for scientists and engineers is learning to make use of these resources.

For this assignment, search the web for three different resources. They can be any of the types listed above or even something else. (For example, it could be 2 YouTube videos and 1 blog post; 1 blog post, 1 tutorial, and 1 textbook chapter; etc.) Study these three resources, and then do the following:

  • For each resource: provide the URL to the resource and describe how you found it.

  • If it’s a YouTube video, embed the video in your Jupyter notebook.

  • Write summaries in your own words of what each resource conveys.

  • Describe one or more things that each resource helped you understand.

  • Describe one or more things that each resource did not help you with, either because it was missing or was included but you’re still unclear on it.

  • Rank the three resources from least helpful to most helpful for you, and explain your ranking.
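For the video-embedding step in the list above, one option is the YouTubeVideo class from IPython.display. This is a minimal sketch: "VIDEO_ID" is a placeholder, not a real video, and should be replaced with the ID that appears after `v=` in the video’s URL. The width and height values are arbitrary.

```python
from IPython.display import YouTubeVideo

# "VIDEO_ID" is a placeholder (hypothetical); substitute the ID from the
# URL of the video you actually studied.
video = YouTubeVideo("VIDEO_ID", width=640, height=360)

# When this is the last expression in a notebook cell, Jupyter renders
# the embedded player below the cell.
video
```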

Note

For this particular assignment, don’t include ChatGPT or other related tools. They are great, but they will be the focus of a subsequent assignment.

4.3. Write your own function(s) for computing \(t\) tests

Write Python functions that compute a \(t\) test for the difference in means between two samples. You can import any functions, modules, or packages that you’d like for this purpose other than any that directly compute the \(t\) statistic or corresponding \(p\) value. (For example, you can’t use scipy.stats.ttest_ind, since it directly outputs both \(t\) and \(p\).)

Your functions’ call signatures should be as follows:

def t_stat(sample1, sample2):
    """Compute the t statistic for difference in means of the two samples.
    
    Arguments:

    sample1, sample2: xarray.DataArrays, each containing one of the 
                      two samples to compute the t statistic for

    """
    # Insert your code here for computing the t statistic
    return t_stat


def pval_of_ttest(t_stat, deg_free):
    """Compute the p value of the given t statistic and degrees of freedom.
    
    Arguments:

    t_stat: scalar, the value of the t statistic
    deg_free: scalar, the number of degrees of freedom corresponding to t_stat

    """
    # Insert your code here for computing the p value
    return p_val

Here, t_stat is the value of the \(t\) statistic, and p_val is the corresponding \(p\) value.

Apply your functions to each of the following pairs of samples from the Central Park dataset:

  1. Across all years 1869-2022, monthly average precipitation in September vs. in January.

  2. Average annual sum of cooling degree days for the period 1971-2000 vs. 2001-2022.

For each pair, use numpy.isclose to explicitly test your values against the results of scipy.stats.ttest_ind.
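The comparison pattern might look something like the sketch below. Everything in it is an assumption made for illustration: the samples are synthetic NumPy arrays rather than Central Park data, and scipy.stats.ttest_ind_from_stats serves purely as a stand-in for your own functions (it computes \(t\) and \(p\) directly, so it would not be allowed inside your actual solution). With its default equal_var=True, scipy.stats.ttest_ind uses \(\nu=N_1+N_2-2\), matching the definition above.

```python
import numpy as np
from scipy import stats

# Synthetic samples (assumption): stand-ins for the real data.
rng = np.random.default_rng(seed=0)
sample1 = rng.normal(loc=0.0, scale=1.0, size=40)
sample2 = rng.normal(loc=0.5, scale=1.0, size=35)

# Reference values from SciPy (equal_var=True by default).
reference = stats.ttest_ind(sample1, sample2)

# Stand-in for your own t_stat and pval_of_ttest results (assumption):
# replace these with the values returned by the functions you wrote.
stand_in = stats.ttest_ind_from_stats(
    np.mean(sample1), np.std(sample1, ddof=1), sample1.size,
    np.mean(sample2), np.std(sample2, ddof=1), sample2.size,
)

# numpy.isclose compares floating-point values within a tolerance,
# which is safer than testing for exact equality with ==.
t_matches = np.isclose(stand_in.statistic, reference.statistic)
p_matches = np.isclose(stand_in.pvalue, reference.pvalue)
print(f"t matches: {t_matches}, p matches: {p_matches}")
```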

4.4. Extra credit

4.4.1. Compute the \(t\) test for two or more salient differences in means for your final project

Find at least two pairs of sample means from your datasets whose differences are scientifically interesting for your project. Perform the \(t\) test for each. Report the numerical values of the \(t\) statistic, \(p\) value, and degrees of freedom. Describe your interpretation of the \(t\) test from a statistical perspective. Also describe your interpretation from a physical/scientific perspective.

4.4.2. Create a dedicated conda environment for this course

The Resources page explains why it’s a good idea to create separate virtual environments for different projects, classes, etc. using the Anaconda/conda environment manager. For this extra credit opportunity, create a dedicated environment using Anaconda Navigator or the conda command line tool. The environment must be created specifically for this purpose; you could name it eas42000 or eas-a4200, for example.

To show that you have done this, include the following cell in your notebook.

import os
import sys
import jupyter_core

print(f"Path to the active Python executable: {sys.executable}") 
print(f"Path to the active Jupyter installation: {jupyter_core.__file__}")
print(f"""Currently active conda environment: {os.environ.get("CONDA_DEFAULT_ENV")}""")
Path to the active Python executable: /Users/sah2249/miniconda3/envs/stat-methods-course/bin/python
Path to the active Jupyter installation: /Users/sah2249/miniconda3/envs/stat-methods-course/lib/python3.11/site-packages/jupyter_core/__init__.py
Currently active conda environment: stat-methods-course

Note

When I run your notebook, these outputs will be overwritten with the values from my own computer (the ones printed below the cell above). But I keep a copy of your original submission, and from there I can see what the cell printed on your machine.

Note

On subsequent assignments, this will be mandatory, not extra credit. So if you do it now, you get bonus points for something that you’ll end up having to do later anyway. And you can reuse the same environment for subsequent assignments once you’ve made it; it would be overkill to create a new environment for every single assignment.

4.5. How to submit

The instructions from the last HW all apply for this one, so refer to those (available here under “How to submit”).

In addition, please read the following Warning block carefully:

Warning

Do NOT call pip, conda, or any shell commands in your notebook.

Jupyter notebooks are extremely powerful. With a simple exclamation point (!), you can use a cell block to run not Python code but any shell command. This could include commands that, for example, download something nefarious from the internet, or delete all the files on my computer.

As a less extreme example but one that actually has happened, it is possible to call pip and conda from within your Jupyter notebook. When your notebook then runs on my computer, this could potentially mess up the package installations I have.

For this reason, you MUST NOT include any code that calls these or any other command line tools. And you must not use any of Python’s built-in tools that accomplish similar things (such as the subprocess module). If you need to install a package or update an existing one, do so OUTSIDE of your notebook, either from a terminal session or using Anaconda Navigator.

I have safeguards in place to identify these kinds of things, but if any of them are included in your notebook, I will automatically deduct 20% of the possible score from your grade.
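If you want to double-check your own notebook before submitting, a small script run OUTSIDE the notebook can scan its code cells. The following is only a rough sketch and is not the safeguard used for grading: the function name find_suspect_lines and the patterns are assumptions chosen for this example, and they will not catch every possible case. A notebook file is JSON, so the example parses a small inline JSON string instead of a real file.

```python
import json
import re

# Illustrative patterns only (assumption): not an exhaustive list.
SUSPECT_PATTERNS = [
    re.compile(r"^\s*!"),                                   # shell escape
    re.compile(r"^\s*%{1,2}(pip|conda|sh|bash|system)\b"),  # shell-like magics
    re.compile(r"^\s*(import|from)\s+subprocess\b"),        # subprocess module
    re.compile(r"\bos\.system\s*\("),                       # os.system calls
]


def find_suspect_lines(notebook):
    """Return (cell_index, line) pairs for code-cell lines matching a pattern."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are never executed
        source = cell.get("source", "")
        if isinstance(source, list):  # notebooks may store source as a list of lines
            source = "".join(source)
        for line in source.splitlines():
            if any(pattern.search(line) for pattern in SUSPECT_PATTERNS):
                hits.append((index, line.strip()))
    return hits


# Tiny in-memory notebook used for demonstration.
example_json = """
{"cells": [
  {"cell_type": "markdown", "source": ["!not code"]},
  {"cell_type": "code", "source": ["import numpy as np\\n", "!pip install xarray\\n"]},
  {"cell_type": "code", "source": "import subprocess"}
]}
"""
hits = find_suspect_lines(json.loads(example_json))
for cell_index, line in hits:
    print(f"cell {cell_index}: {line}")
```

To check a real notebook, you could load the file with json.load from a terminal session and pass the resulting dictionary to the same function.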