1. Writing key quantities from descriptive statistics in Python yourself

1.1. Introduction

All of the quantities that we discussed in the lecture on descriptive statistics (mean, variance, skewness, etc.) are used ubiquitously. As such, they have already been implemented in Python as functions many times over. That’s how I was able to call precip_central_park.mean(), for example: the variable precip_central_park is an xarray.DataArray object, and DataArrays include mean as a method.

Those built-in implementations in the Python standard library and in reliable packages we’ll use like numpy, scipy, and xarray are very widely tested for bugs, optimized for performance, and include lots of nice additional features for more complex calculations. So other than in this assignment, they are what you should use when you need to compute these kinds of things.

But they’re also kind of a black box—when you call precip_central_park.mean(), it doesn’t tell you what’s happening “under the hood.” It just spits out the result, whether you actually understand the quantity conceptually or not.

In this assignment, to make sure we each understand what these quantities entail, we’ll write each one ourselves as a Python function. For each one, we’ll test it against the built-in version to make sure that it gives the right answer. And along the way we’ll learn more about how Python works.

The goal is to challenge you in two key ways: (1) understanding more deeply the key quantities in descriptive statistics, and (2) understanding more deeply the basics of how Python works.

1.1.1. A toy example: addition

By way of example, I’ll create a function right now that simply adds two numbers. Of course, there is no need to do this; you can just use Python’s standard addition operator via the plus sign +:

2 + 3

But we could also write our own function that does this:

def add(num1, num2):
    """Add the two numbers."""
    return num1 + num2

This defines (def) a function named add that accepts two arguments (num1 and num2) and spits out (returns) their sum.

Now we can call our function to perform the addition:

add(2, 3)

Finally, even though it’s plainly obvious that our function will, as desired, give you the sum of two numbers, let’s formally show that using Python’s == operator (that’s two equals signs), which returns True if two things are the same or False if they aren’t:

add(2, 3) == (2 + 3)

True
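
One caveat worth keeping in mind for later: with decimal numbers (floats), == can report False for values that are equal on paper, because of tiny rounding errors in how floats are stored. Here is a minimal sketch of a tolerance-based comparison; the helper name `nearly_equal` and the tolerance of 1e-9 are my own illustrative choices, not something the assignment requires.

```python
def nearly_equal(a, b, tol=1e-9):
    """Return True if a and b differ by less than tol."""
    return abs(a - b) < tol


print(0.1 + 0.2 == 0.3)              # False: rounding error in the last digits
print(nearly_equal(0.1 + 0.2, 0.3))  # True: the difference is tiny
```

This is why a difference that is "very close to zero" rather than exactly zero can still mean your function is correct.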

1.2. Your specific task

These are the six quantities that you will implement in Python as functions:

Measures of central tendency:

mean
median

Measures of dispersion:

range
variance
standard deviation
coefficient of variation

For each one, the function call signature (meaning what arguments it accepts) should look like this:

def my_func(arr):
    """replace w/ brief informative description of your function"""
    # Insert line(s) of code that perform the desired calculation
    # on the given 1D array named `arr`.
    return answer  # Here, `answer` should be replaced with whatever variable you define in your function that stores the final result of your function.

So, for example, write another addition function, making it add up all the elements in a 1D array rather than just two numbers. That might look like this:

def add_array(arr):
    """Sum all elements of the array."""
    sum = 0
    for arr_element in arr:  # Loop over every item in the array `arr`
        sum += arr_element  # Add that item to the running sum
    return sum  # Once the loop has run over all elements, give us ("return") the total sum.
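
As a quick sanity check in the spirit of the testing described in the introduction, here is a sketch that compares this kind of function against Python's built-in `sum`. The function is restated so the snippet runs on its own, with the running total named `total` (my choice) so that it doesn't shadow the built-in `sum` we compare against. The array values are arbitrary.

```python
def add_array(arr):
    """Sum all elements of the array."""
    total = 0
    for arr_element in arr:  # Loop over every item in the array `arr`
        total += arr_element  # Add that item to the running total
    return total


values = [0.5, 1.5, 2.0]  # arbitrary example values
print(add_array(values) == sum(values))  # True
```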

1.2.1. What each function must do

Only use things built in to Python itself, i.e. things available without importing anything. That excludes even the standard library’s math module. In other words, there should be no import statements anywhere in your code!
You can use a function you already wrote inside of another! So for example, let’s say you wanted to create a function that computed the square of the sum of all array elements. Given that add_array already exists, we could write this as follows:

def square_sum(arr):
    return add_array(arr) ** 2
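
To make the "no import statements" rule concrete, here is a short sketch of a few tools that are available in plain Python without importing anything. Which of these you end up needing is up to you; the list is illustrative, not exhaustive, and the example values are arbitrary.

```python
values = [3.0, 1.0, 2.0]  # arbitrary example values

print(len(values))     # number of elements: 3
print(sorted(values))  # new list in ascending order: [1.0, 2.0, 3.0]
print(min(values))     # smallest element: 1.0
print(max(values))     # largest element: 3.0
print(2.0 ** 0.5)      # square root via the power operator, no `math` needed
```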

1.2.2. Things you do not have to worry about

You can assume that the function will only be given one-dimensional arrays consisting entirely of decimal numbers (i.e. “floats”). You don’t have to handle other cases, for example being passed a single integer or a 2-D array.
You don’t have to worry at all about performance, i.e. how fast the code runs and how much memory it uses. Just write it in the way that is the most intuitive to you.

1.2.3. Things to look out for

Especially for those newer to Python, a couple of potential “gotchas”:

Indentation matters in Python! After the def my_func(arr): line, the lines of code that are part of the function must be indented, meaning those lines start with one or more spaces. The standard is four spaces, and so please use that.
The function must end with a line starting with return, followed by whatever it is you want that function to spit out at the end. So for example, if in my add_array function above, I had forgotten to include the return sum line, the function would compute the sum but then just not do anything with it!
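
To see the second gotcha concretely, here is a sketch of a function with the return line left out (the name `add_array_no_return` is made up for this illustration). Python doesn't raise an error; the function just quietly hands back `None`.

```python
def add_array_no_return(arr):
    """Sum all elements of the array, but forget to return the result."""
    total = 0
    for arr_element in arr:
        total += arr_element
    # Oops: there is no `return total` line here.


result = add_array_no_return([1.0, 2.0])
print(result)  # None
```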

1.2.4. What specifically you will submit

For each function, you’ll print out the results of that function computed on the following simple toy array:

arr_test = [1.0, 4.0, 10.0, 4.0, 2.]

This is small enough that you can double check all of the quantities yourself against a calculator.

Please submit this assignment as a Python script, with a filename matching the pattern eas42000_hw1_lastname-firstname.py, replacing lastname with your last name and firstname with your first name.

(Even if you normally use Jupyter and do all your scratch work on this assignment in Jupyter, you must write this as a .py script rather than a .ipynb Jupyter notebook file. This is for two reasons: (1) it’s good to occasionally practice dropping down from Jupyter notebooks to just plain-old scripts, and (2) to simplify the workflow of grading on my part.)

The script MUST run successfully from the command line when called from within the directory in which the script itself lives:

python ./eas42000_hw1_lastname-firstname.py

For example, for my toy add_array function above, this is what the file would look like (please note that the Jupyter formatting adds an extra 4-space indentation to all code blocks, which you’d remove in the actual script):

def add_array(arr):
    """Sum all elements of the array."""
    sum = 0
    for arr_element in arr:  # Loop over every item in the array `arr`
        sum += arr_element  # Add that item to the running sum
    return sum  # Once the loop has run over all elements, give us ("return") the total sum.


# You don't have to include this `if` statement, especially if you don't 
# yet know what it's doing.  We'll return to why this is a good practice later on.
# If you do comment it out, don't forget to remove the indentation of the lines
# after, otherwise you'll get an `IndentationError`.
if __name__ == "__main__":
    # Define the toy array we'll test the functions against.
    arr_test = [1.0, 4.0, 10.0, 4.0, 2.]
    # Print out its values to make sure you copy/pasted it correctly from the HW.
    print(f"Toy dataset is: {arr_test}")

    # Start with the first measure to be calculated.
    # In this toy example, it's my array-sum metric.  In yours, it will be the mean.

    # Describe the measure conceptually in words, and report the answer
    # that it should produce when computed on the toy array, which you can 
    # compute by hand or at most using a simple calculator.
    print()  # blank line for easier readability
    print("Function #1 is the sum: add all elements of the array.")
    ans_expected_sum = 21
    print(f"We can add those up without a computer or calculator; the answer is {ans_expected_sum}.")

    # Now call your function on the toy array, and print out the result.
    ans_add_array_on_toy = add_array(arr_test)
    print(f"Calling `add_array` on the toy dataset.  Answer: {ans_add_array_on_toy}")

    # Finally, print out the difference between what you expect the answer to be
    # and what the actual function computes (this should be zero or *very* close to it).
    diff_add_array_expect_actual = ans_expected_sum - ans_add_array_on_toy
    print(f"Difference between expected and actual: {diff_add_array_expect_actual}")
    print()  # A blank line at the end of each function makes the output easier to read.

    # Then repeat the lines of code starting with the "# Describe ..." comment but
    # applied to each of the remaining 5 quantities.
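
If retyping those print statements for every quantity gets tedious, one optional approach is to wrap them in a small helper function. This is only a sketch: the name `report` and its arguments are my invention, not a required part of the submission, and the built-in `max` stands in here for one of your own functions.

```python
def report(name, expected, actual):
    """Print the expected and actual values and return their difference."""
    diff = expected - actual
    print(f"{name}: expected {expected}, got {actual}, difference {diff}")
    return diff


# Example with the built-in `max` standing in for one of your own functions.
arr_test = [1.0, 4.0, 10.0, 4.0, 2.]
diff_max = report("max", 10.0, max(arr_test))
```

Returning the difference as well as printing it leaves room to check it in code later, rather than only by eye.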

1.2.5. How to submit it

Use the Google Form at (link that I’ve now taken down because this is now done :)

1.3. Extra credit

There are a few ways to earn extra credit, each one up to 5% of the total:

1.3.1. Implement the interquartile range

Implement the interquartile range. Unlike the others, here you are permitted to use something from another package, specifically the percentile function from numpy.

1.3.2. Compute your functions on the Central Park dataset

Show that your functions also give the right answers for the Central Park daily average temperature dataset. You won’t be able to compute the answers by hand for this, but you can just copy-paste in the values I provide in the corresponding lecture notes / slides.

(Even more bonus points if, rather than copy-pasting them, you explicitly compute these “reference values” yourself using built-in versions of each function in numpy, scipy, xarray, or any other package of your choosing. If you do this, for efficiency’s sake when I go to run these, please place the dataset netCDF file itself in the same directory as the script, and load it in using just its filename rather than a complete path, which will be different on your machine than on mine.)