HW06: hypothesis tests#

DEADLINE: Wednesday, November 12th, 2025

Compare the \(t\) and normal distributions#

In class, we learned that for hypothesis testing of differences in means, the test statistic is called Student’s \(t\) or just \(t\). The \(t\) distribution is similar but not identical to the standard normal distribution, and the two differ most for small sample sizes. Its exact shape depends on the degrees of freedom, denoted \(\nu\). For the difference in two means, this is simply \(\nu=N_1+N_2-2\), where \(N_1\) and \(N_2\) are the sample sizes of the first and second samples, respectively.
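As a reminder, and assuming the equal-variance (pooled) form of the test, which is the form that goes with \(\nu=N_1+N_2-2\), the statistic can be written as below. The symbols \(\bar{x}_i\) (sample means), \(s_i^2\) (sample variances computed with \(N_i-1\) in the denominator), and \(s_p^2\) (pooled variance) are notation introduced here for illustration; check them against your class notes.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{N_1} + \dfrac{1}{N_2}}},
\qquad
s_p^2 = \frac{(N_1 - 1)\,s_1^2 + (N_2 - 1)\,s_2^2}{N_1 + N_2 - 2}
\]
```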

Using scipy.stats.t.pdf and scipy.stats.norm.pdf, plot the PDFs of both distributions spanning from -5 to 5. For the \(t\) distribution, overlay 3 different versions on the same axes: \(\nu=1, 10\), and \(100\). Add a legend labeling each of these and the normal.

This demonstrates that, as the sample sizes grow, the \(t\) distribution is increasingly well approximated by the standard normal. (In fact, it holds mathematically that the \(t\) distribution converges to the standard normal as \(\nu\rightarrow\infty\).)
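If you want a number to go with the picture, the convergence can also be checked numerically. The following is a rough sketch, not a substitute for the plot: it measures the largest pointwise gap between each \(t\) PDF and the normal PDF on the same interval. The 1001-point grid and the name max_gap are choices made here for illustration.

```python
import numpy as np
from scipy import stats

# Evaluate both PDFs on the same grid spanning -5 to 5.
x = np.linspace(-5, 5, 1001)
normal_pdf = stats.norm.pdf(x)

# Largest absolute difference between the t PDF and the normal PDF,
# for each of the three degrees-of-freedom values used in the plot.
max_gap = {}
for nu in (1, 10, 100):
    t_pdf = stats.t.pdf(x, nu)
    max_gap[nu] = float(np.max(np.abs(t_pdf - normal_pdf)))
    print(f"nu = {nu:>3}: largest |t PDF - normal PDF| = {max_gap[nu]:.4f}")
```

The printed gap should shrink as \(\nu\) increases, matching what the overlaid curves show.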

Find the most helpful online resources on \(t\) tests#

There are tons of great resources that are available online and completely free. These include blog posts, YouTube videos, tutorials in the documentation to statistical software packages, whole semester-length university classes that have been pre-recorded, and more. (The course resources page lists just a few.) An important but overlooked skill (you could call it a meta skill) for scientists and engineers is learning to make use of these resources.

For this assignment, search the web to find the best two different resources you can. They can be any of the types listed above or something else.

  • For each resource: provide the URL to the resource and describe how you found it.

  • If it’s a YouTube video, embed the video in your Jupyter notebook.

  • Write summaries in your own words of what each resource conveys.

  • Describe one or more things that each resource helped you understand.

  • Describe one or more things that each resource did not help you with, either because it was missing or was included but you’re still unclear on it.

  • Rank the two resources: which was more helpful and why?
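For the YouTube case in the list above, one way to embed a video is with IPython.display.YouTubeVideo, which takes the video ID rather than the full URL. The sketch below assumes a standard youtube.com/watch?v=... link; the helper names youtube_id and embed are hypothetical, and VIDEO_ID is a placeholder you would replace with the real ID.

```python
from urllib.parse import parse_qs, urlparse


def youtube_id(url):
    """Return the video ID from a standard youtube.com/watch?v=... URL."""
    return parse_qs(urlparse(url).query)["v"][0]


def embed(url, width=640, height=360):
    """Return an embeddable player object for the video at the given URL."""
    # Imported here so that youtube_id can be used outside of Jupyter too.
    from IPython.display import YouTubeVideo

    return YouTubeVideo(youtube_id(url), width=width, height=height)


# Placeholder URL: swap in the link to the resource you actually chose.
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"
print(youtube_id(video_url))
```

In a notebook, making embed(video_url) the last line of a cell should display the player in that cell's output.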

Note

You can use LLMs to find resources, but don’t list the LLM itself as the resource you choose.

Write your own functions for computing the \(t\) statistic and \(p\) value#

Write Python functions that compute a \(t\) test for the difference in means between two samples: one for the \(t\) statistic and one for the \(p\) value. You can import any functions, modules, or packages that you’d like for this purpose other than any that directly compute the \(t\) statistic or corresponding \(p\) value. (For example, you can’t use scipy.stats.ttest_ind, since it directly outputs both \(t\) and \(p\).)

Your functions’ call signatures should be as follows:

def t_stat(sample1, sample2):
    """Compute the t statistic for difference in means of the two samples.
    
    Arguments:

    sample1, sample2: xarray.DataArrays, each containing one of the 
                      two samples to compute the t statistic for

    """
    # Insert your code here for computing the t statistic
    return t_stat


def pval_of_ttest(t_stat, deg_free):
    """Compute the p value of the given t statistic and degrees of freedom.
    
    Arguments:

    t_stat: scalar, the value of the t statistic
    deg_free: scalar, the number of degrees of freedom corresponding to t_stat

    """
    # Insert your code here for computing the p value
    return p_val

Here, t_stat is the value of the \(t\) statistic, and p_val is the corresponding \(p\) value.

Specifically, compute \(t\) and \(p\) with your functions for each of the following pairs of samples from the Central Park dataset:

  1. Across all years 1869-2022, monthly average precipitation in September vs. in January.

  2. Average annual sum of cooling degree days for the period 1971-2000 vs. 2001-2022.

Use numpy.isclose to explicitly test your values against the results of scipy.stats.ttest_ind.
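The comparison pattern can be sketched as follows. To avoid giving away the hand-written part, this example uses synthetic data (not the Central Park dataset) and lets scipy.stats.ttest_ind_from_stats stand in for your own functions; in your notebook, the candidate values would instead come from t_stat and pval_of_ttest. Note that scipy.stats.ttest_ind assumes equal variances by default, which is the case that corresponds to \(\nu=N_1+N_2-2\).

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only; sizes and parameters are arbitrary.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=10.0, scale=2.0, size=40)
sample_b = rng.normal(loc=11.0, scale=2.0, size=35)

# Reference values (equal_var=True is the default).
reference = stats.ttest_ind(sample_a, sample_b)

# Stand-in for your own functions: the same test computed from summary
# statistics. The standard deviations must use ddof=1.
candidate = stats.ttest_ind_from_stats(
    mean1=sample_a.mean(), std1=sample_a.std(ddof=1), nobs1=sample_a.size,
    mean2=sample_b.mean(), std2=sample_b.std(ddof=1), nobs2=sample_b.size,
)

# numpy.isclose compares floats within a tolerance instead of exactly.
t_matches = np.isclose(candidate.statistic, reference.statistic)
p_matches = np.isclose(candidate.pvalue, reference.pvalue)
print(f"t matches: {t_matches}, p matches: {p_matches}")
```

Wrapping each check in an assert statement is one way to make the test explicit, so that a mismatch stops the notebook rather than passing silently.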

Extra credit#

Compute the \(t\) test for two or more salient differences in means for your final project#

Find at least two pairs of sample means from your datasets whose differences are scientifically interesting for your project. Perform the \(t\) test for each. Report the numerical values of the \(t\) statistic, \(p\) value, and degrees of freedom. Describe your interpretation of the \(t\) test from a statistical perspective. Also describe your interpretation from a physical/scientific perspective.

Create a dedicated conda environment for this course#

The Resources page explains why it’s a good idea to create separate virtual environments for different projects, classes, etc. using the Anaconda/conda environment manager. For this extra credit opportunity, create a dedicated environment using Anaconda Navigator or the conda command line tool. The environment must be created specifically for this purpose; you could name it eas42000 or eas-a4200, for example.

To show that you have done this, include the following cell in your notebook.

import os
import sys
import jupyter_core

print(f"Path to the active Python executable: {sys.executable}") 
print(f"Path to the active Jupyter installation: {jupyter_core.__file__}")
print(f"""Currently active conda environment: {os.environ.get("CONDA_DEFAULT_ENV")}""")
Path to the active Python executable: /Users/sah2249/miniconda3/envs/stats-book/bin/python
Path to the active Jupyter installation: /Users/sah2249/miniconda3/envs/stats-book/lib/python3.13/site-packages/jupyter_core/__init__.py
Currently active conda environment: stats-book

How to submit#

Use the Google form here