Final project requirements#
Introduction#
Overview#
In the final project, you will apply some of the statistical methods learned throughout the course to a real-world problem in Earth and Atmospheric Sciences of your choosing.
First, you will select a topic and corresponding dataset and get those approved by the professor.
Then, you will conduct the required data analysis tasks listed further below.
You will be responsible for three things:
A final report, in the form of a self-contained, fully executable Jupyter Notebook.
A lightning (5-minute) presentation to the class
A 10-minute “technical interview” with the professor about the project
Grading#
Consult the syllabus for details.
Topic and dataset#
You are welcome to use the Central Park dataset we’ve been using in class.
You’re also welcome to use something else. It’s totally fine if this overlaps with work for another course or your own research (for me at least; consult your other professors and/or advisors if you’re unsure).
If you have a topic in mind but not a dataset, try explaining your idea to an LLM such as ChatGPT and asking it to recommend datasets. Insist that it provide links and verify that they are real! If that doesn’t help, email me or come see me right after class to discuss.
Your dataset needs to be complex/big enough for the results to be non-trivial. So, for example, a dataset with only 100 points would not be enough.
But I recommend keeping the total size below ~100 MB so that your laptops/Colab can handle the calculations without them being really slow.
The Central Park weather station dataset is a good example: long timeseries (>56,000 points) of multiple variables (so \(\geq\)500,000 points total), but the file is still only 3.9 MB.
It doesn’t have to be timeseries data, but there do need to be at least two different variables, locations, etc. Otherwise, you won’t be able to analyze relationships (correlations etc.).
Some online data portals you could check out:
NASA: https://data.nasa.gov/
NOAA climate data: https://www.ncdc.noaa.gov/cdo-web/datasets
National Centers for Environmental Information: https://www.ncei.noaa.gov/
CMIP6 (climate models): https://esgf-node.llnl.gov/projects/cmip6/
Final report#
Due date: Friday, December 12th, by 11:59pm ET#
How to submit: the submission link will be posted closer to the submission deadline#
Overall scope#
There are two goals of the final report:
(Most important) demonstrate your mastery of select concepts (listed below) we have learned in class, by applying them to your dataset in a way that is in service of the scientific question you seek to answer.
Use those calculations to actually make some progress toward answering your chosen scientific question.
Note that you will not be directly graded on the scientific outcomes. It could be that your analyses generate statistically unclear signals, even though you did them correctly. That still is progress: it’s often just as important to know what things are not related as what things are related.
Report format: a Jupyter Notebook#
You will write the final report as a Jupyter notebook file, with your text as Markdown cells and your analysis code all included as code cells. See sections further below for style and formatting guidelines.
What you will submit: the notebook itself plus a freshly executed version of your notebook exported to HTML#
The steps for submission are:
Create your report as a Jupyter Notebook.
Once your report is 100% ready, reset your Jupyter kernel and re-run the whole notebook from start to finish. (There is an option under the “Kernel” drop-down that does this for you: “Restart Kernel and run all cells”)
At that point, export it to an HTML file using Jupyter’s built-in exporting features.
Upload both the original notebook file (.ipynb) and the HTML export (.html) following the link provided.
Warning
Points WILL be deducted if the cell blocks in your notebook output do not show that they were executed all in order immediately following a reset of your kernel. In other words, the number to the left of the first code cell MUST be [1], the next code block must be [2], etc., until the very end of the notebook.
The reason for this is to ensure that the results of your code cells aren’t accidentally affected by running cells out of order (first some cells earlier in your notebook, then others later, etc.), which otherwise happens all the time when working with Jupyter notebooks. The only way to be certain this isn’t the case is to follow step 2 above.
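One way to double-check the execution order before uploading is to read the saved notebook file and inspect the numbers yourself. The following is a minimal sketch, not an official grading tool: it assumes the standard .ipynb JSON layout (a top-level "cells" list whose code cells carry an "execution_count"), it skips empty code cells on the assumption that Jupyter leaves them unnumbered, and the function names and the file name in the comment are hypothetical.

```python
import json


def executed_in_order(notebook: dict) -> bool:
    """Return True if the non-empty code cells are numbered 1, 2, 3, ... in order."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # Markdown cells have no execution count
        source = "".join(cell.get("source", ""))  # source may be a str or a list of str
        if not source.strip():
            continue  # assumption: empty code cells are never numbered
        counts.append(cell.get("execution_count"))
    return counts == list(range(1, len(counts) + 1))


def check_notebook_file(path: str) -> bool:
    """Load a saved notebook (e.g. a hypothetical 'final_report.ipynb') and check it."""
    with open(path, encoding="utf-8") as f:
        return executed_in_order(json.load(f))


# Two tiny in-memory notebooks, made up purely for illustration.
fresh = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 1},
    {"cell_type": "code", "source": ["x + 1"], "execution_count": 2},
]}
stale = {"cells": [
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 7},
    {"cell_type": "code", "source": ["x + 1"], "execution_count": 3},
]}

print(executed_in_order(fresh), executed_in_order(stale))
```

If the check fails on your own file, restart the kernel, run all cells again, save, and re-export.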
Notebook contents#
Like the homework assignments, the notebook will be a mixture of Markdown cells containing mostly text and code cells that perform your analyses and generate your plots.
Notebook structure: same as a normal written report would be#
Though it is written as an interactive Jupyter Notebook, your final report should be organized and read as if it were a standard scientific report written in say Microsoft Word or LaTeX. This means it should:
be organized into labeled and numbered sections and subsections. (Use hash signs in Markdown cells for this: starting a line with # makes it a top-level heading, ## a 2nd-level heading, ### a 3rd-level heading, etc.)
be written in complete English sentences, organized into paragraphs and sections.
be free of grammatical, spelling, or other similar errors.
Perhaps a useful way to think about it is: imagine that the code blocks were stripped out, and only their outputs kept. The resulting document should look and read basically like an old-school printed final report would look and read.
Code blocks: excess code, style, commenting, etc.#
Your code should be pruned down to only the lines required to generate the values you report and plots that you generate. All other blocks or lines of code should be removed! Otherwise it makes understanding and ultimately grading your code much more difficult.
Adding explanatory comments or adhering to style conventions is less important. Of course, you are encouraged to include helpful comments as appropriate and to use style-enforcement tools such as black (which you can configure to run automatically on your whole notebook via the jupyterlab_code_formatter tool).
Required calculations to incorporate#
The following calculations must be executed, presented, and described:
From Intro and numeracy:
thorough explanation of the dataset: source, time span, location, physical quantities being observed, instruments used to measure them
any other salient metadata: e.g. changes in instrumentation, change in location, calibration issues, spatial coverage
From Descriptive statistics:
mean and median of at least one key quantity
range, IQR, sample variance, sample standard deviation of at least one key quantity
skewness and kurtosis of at least one key quantity
From Data visualization:
at least one each of: histogram, boxplot, scatterplot, and timeseries (unless the data are not defined in time)
From Probability theory:
at least two empirical unconditional probabilities
at least two empirical conditional probabilities
From Probability distributions:
empirical PDFs and CDFs
a fitted parametric distribution
a block maxima/minima analysis and corresponding GEV fit
From Hypothesis testing:
at least one \(t\) test of differences in means between two samples
From Linear regression:
at least two correlation coefficients
at least one variable modeled by linear regression on another variable
From Time series:
discussion of how the decomposition into deterministic and random components of a time series would be applied to at least one key variable, even though you don’t have to actually perform that decomposition
calculation and discussion of the autocorrelation function of at least one key variable
From Spectral analysis:
a periodogram computed, plotted, and discussed for at least one key variable
a running average applied to your periodogram
From Neural networks and machine learning:
A feedforward neural network with one or more hidden layers trained on your data
A discussion of one other machine learning technique we discussed in class: how you could use that tool. Note that you don’t actually have to implement the tool.
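To make a few of the items above concrete, here is a minimal sketch of the descriptive statistics, an empirical conditional probability, and the autocorrelation at a single lag. It uses only the Python standard library so that it runs anywhere; the function names are hypothetical, the toy numbers are made up for illustration (they are not real observations), and choices such as the quantile method and the 1/n moment estimators for skewness and kurtosis are assumptions that other tools may make differently.

```python
import statistics


def describe(values):
    """Summary statistics for one variable given as a list of floats."""
    n = len(values)
    mean = statistics.fmean(values)
    # Quartiles use the standard library's default ("exclusive") method;
    # other packages may follow a different quantile convention.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    # Central moments with 1/n normalisation, used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std": statistics.stdev(values),          # sample standard deviation
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }


def conditional_probability(event, given, records):
    """Empirical P(event | given) = count(event and given) / count(given)."""
    selected = [r for r in records if given(r)]
    return sum(1 for r in selected if event(r)) / len(selected)


def autocorrelation(values, lag):
    """Sample autocorrelation at one lag, normalised by the lag-0 sum of squares."""
    mean = statistics.fmean(values)
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[t] - mean) * (values[t + lag] - mean)
              for t in range(len(values) - lag))
    return num / denom


# Made-up toy data: eight "daily temperatures" and whether it "rained".
temps = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rained = [True, True, False, True, False, False, False, False]
days = list(zip(temps, rained))

stats = describe(temps)
p_rain = sum(rained) / len(rained)  # unconditional probability
p_rain_given_cool = conditional_probability(
    lambda d: d[1], lambda d: d[0] < 4.5, days)  # P(rain | temp < 4.5)
r1 = autocorrelation(temps, 1)
print(stats, p_rain, p_rain_given_cool, r1)
```

In the report itself, each number should be accompanied by a sentence or two interpreting it in the context of your scientific question.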
Grading#
(credit: copied nearly verbatim from Teaching Statistics: A Bag of Tricks by Andrew Gelman and Deborah Nolan)
Rubric#
The table below is a competency matrix for this report. The first column describes each critical task for the assignment, and the 2nd, 3rd, and 4th columns respectively describe what work in that task would constitute Needing Improvement, Basic Competency, and Surpassed Expectations.
Critical task | Needs Improvement | Basic | Surpassed |
---|---|---|---|
Computation. Perform computations necessary for the data analysis. | Computations contain errors and extraneous code. | Computations correct but contain extraneous/unnecessary code. | Computations correct, clear, and properly labeled. |
Analysis. Choose and carry out analysis appropriate for data and context. | Choice of analysis is overly simplistic, irrelevant, inappropriate for the data, or missing key component. | Analysis appropriate but incomplete, and important features and assumptions not made explicit. | Analysis appropriate, complete, advanced, relevant, and informative. |
Synthesis. Identify key features of the analysis, and interpret results in context. | Conclusions are missing, incorrect, or not based on analysis. | Conclusions reasonable, but partially correct or partially complete. | Relevant conclusions explicitly connected to analysis and context. |
Visual. Communicate findings graphically clearly, precisely, and concisely. | Inappropriate choice of plots; poorly labeled plots; plots missing. | Plots convey information correctly but lack context for interpretation. | Plots convey information correctly with adequate and appropriate reference information. |
Written. Communicate findings in writing clearly, precisely, and concisely. | Explanation is illogical, incorrect, or incoherent. | Explanation is partially correct but incomplete or unconvincing. | Explanation is correct, complete, and convincing. |
Assigning points#
Basic competency in all five categories results in 75 points.
Five points are added for each task in the Surpassed category.
Similarly, five points are deducted for each competency in the Needs Improvement category.
As such, the maximum possible score is 100, and the minimum possible score is 50.
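The point rules above amount to simple arithmetic, sketched below for anyone who wants to sanity-check a score. This is only an illustration of the stated rules, not the professor's actual grading code; the function name and the lowercase rating labels are hypothetical.

```python
BASELINE = 75  # Basic competency in all five categories
ADJUSTMENT = {"needs improvement": -5, "basic": 0, "surpassed": 5}
TASKS = ("computation", "analysis", "synthesis", "visual", "written")


def report_score(ratings):
    """Total points from a dict mapping each critical task to its rating."""
    missing = set(TASKS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return BASELINE + sum(ADJUSTMENT[ratings[task]] for task in TASKS)


# Example: Surpassed on two tasks, Needs Improvement on one, Basic on the rest.
example = {
    "computation": "surpassed",
    "analysis": "surpassed",
    "synthesis": "basic",
    "visual": "needs improvement",
    "written": "basic",
}
print(report_score(example))  # 75 + 5 + 5 + 0 - 5 + 0 = 80
```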
In-class presentation#
Presentation instructions will be amended soon: <5 min “lightning” talk, not 12 min “conference” talk
I will make an announcement on Github once these have been corrected; until then just ignore this section really.
Each student will present a “conference-style” oral presentation to the class summarizing their final project. “Conference style” means that it follows the format of a standard oral presentation typical of major conferences in Earth Sciences such as the American Geophysical Union Fall Meeting and the American Meteorological Society Annual Meeting.
Deadlines#
You must submit your final slides before midnight just prior to your presentation.
Specific submission instructions will be posted later.
The professor will download the submitted slides to his computer the morning before class, and everyone will use the same computer to present (rather than each person trying to connect their own computer to the A/V system one after the other).
Format#
Conference style means the following:
Total duration: 10 minutes
Presentation: 8 minutes
Questions from the audience: 2 minutes
(The more standard conference length is 15 minutes, 12 for the talk and then 3 for Q&A, but to get the whole class in we have to do a shorter style. Also, since COVID, more conferences are doing shorter talks for various reasons.)
Logistics#
The presentations will be split across two class days:
Monday, December 4th (6 students)
Wednesday, December 6th (7 students)
You must attend BOTH days to receive full credit, because you will be submitting a written question or comment on every other student’s presentation. These questions will be graded, as described below.
Presentation requirements#
You must include slides. These can be in Powerpoint, Keynote, Google Slides, or anything else. You will submit these and they will be evaluated on their own in addition to your actual delivery of the presentation. Guidelines below offer recommendations, but ultimately there are no hard requirements on the actual content of your slides.
Instructions for how to submit your slides have been posted to the course Blackboard.
Guidelines#
Note
Everything in this section is meant to be helpful, but none of it is strictly required. You can deviate from e.g. the slide template if you feel like your presentation will be better served by a different structure.
Presentation scope#
Whereas the written report for this project is meant to be fairly exhaustive, where you document all the important analyses that you performed, an oral presentation has to be more targeted. Eight minutes will fly by! So ask yourself: if you had to pick just one thing that you want to convey about your project, what would it be? Then build your talk around that.
Narrative#
Human beings are storytellers. We understand and retain things best when they are presented as a coherent narrative, with a beginning, middle, and end. (Actually, we retain them best of all when they are put to music, but I won’t ask you to sing your presentation.)
This approach of creating a narrative can be contrasted with what’s unfortunately more typical in scientific presentations (and teaching writ large): a “data dump” listing one fact after the other without linking things together.
So before you start hacking away at slides, a useful first step is to write down the thesis statement of your presentation: what’s the 1 sentence summary of what you want to convey? From there, you can craft a narrative with an intro that motivates that problem (“beginning”), a main, middle section that actually conveys it (“middle”), and an end that synthesizes the individual things you presented (“end”) and leaves the audience wanting more.
Presentation structure#
Tell the audience what you’re going to say, say it; then tell them what you’ve said.
—Dale Carnegie
A good rule of thumb is 1 minute per slide on average over your whole presentation. So for an 8 minute talk (not 10! the last 2 are for Q&A), you should aim for 8 slides. This suggests the following template:
Title slide: who are you, what’s the overall topic you’ll be presenting
Motivation: what is the big-picture topic you’re addressing, and why is it important and/or interesting? (I.e. why should we care?)
Introducing your project (what, in broad strokes, did you do to address the topic?) and talk outline (“tell the audience what you’re going to say”): \(\leq\)1 sentence summary of each of your \(\leq\)3 main points. (In a longer talk, these would be split into separate slides.)
Main point 1
More on main point 1
Main point 2
More on main point 2 (or maybe a 3rd main point if you have it.)
Recap (“tell them what you’ve said”) and Discussion (Where is this going / could this go from here? What are the implications?) (In a longer talk, these would be split into separate slides.)
Slides#
Some guidelines:
Less is more.
Make each slide do one thing, not multiple things.
Make each slide’s title a complete sentence that summarizes the main point you want the slide to convey.
Some text is helpful, but usually people include too much. Boil it down to the essentials.
Plots: describe in words every single image and table you include. This is for accessibility, but also because almost always audience members can’t tell as fast as you think they can what it is you’re showing.
Make all text, plot labels, plotted symbols, and images big enough that everyone in the room can read them.
Give your slides some breathing room: a slide that’s totally full with text, multiple plots, etc. is overwhelming and results in less information being effectively conveyed than if you had less.
Related, dispense with “slidejunk”: you don’t need slide numbers, logos, the date, etc. on every slide. (Except for a logo of your institution on the title slide, you don’t even need these anywhere!)
If the professor’s own slides for this class fail to meet these recommendations sometimes, well, “Do as I say, not as I do” ;)
Delivering the presentation#
Try not to worry! Public speaking can be intimidating, but especially in this setting everyone, the professor and the other students alike, is there to support you and learn from you.
Practice the talk at least once ahead of time with a timer. Make sure that you’re within the time limit. Nobody likes a talk that goes way beyond its allotted time; it’s rude to the audience and the other presenters.
Don’t be afraid of silence. For the audience, it’s actually a huge relief when a speaker takes a few seconds between slides or to take a sip of water. It helps the audience take a second to gather their thoughts.
Answering questions#
It can be helpful for everybody, yourself included, to repeat the question back to the person in your own words, for two reasons: (1) you make sure everyone in the audience heard it, and (2) you make sure that you interpreted the question correctly.
Once you’ve confirmed you understand what they’re asking, take a second (or a few)! There’s no need to answer as soon as the last word is out of their lips.
If you don’t know the answer to a question, that’s OK! Take a few seconds to think hard about it, and then just give it your best shot.
Grading#
Your presentation#
Your presentation will be graded based on the following:
Narrative quality: do you tell a single, coherent “scientific story” with a clear beginning, middle, and end? Or do you try to pack in too many different things?
Science quality: are the arguments, calculations, and plots presented valid? Or do they include errors or other problems?
Slide quality: does each slide convey a message in service of your story? Is there enough text, plots, etc. on each slide to convey that message? Is there too much on each slide for the audience to digest? Are the fonts big enough?
Length: did you complete your slides within the 8 minute time limit? Or did you go over? (To ensure we get through them all, you’ll get 2-minute and 1-minute warnings, and at the 8 minute mark I’ll ask you to wrap up essentially right away, regardless of how far you’ve gotten.)
Answering questions: do you make a good-faith attempt to understand and address each question? Do your answers cohere with what you presented?
Participation#
You will be required to submit in writing one question and one piece of constructive criticism for each classmate’s presentation.
Question: what in the presentation would you like to know more about? Or something you didn’t understand that you’d like clarified?
Constructive criticism: this could be positive—something the presenter did well. Or, if delivered respectfully and fairly, something you’d recommend changing or that didn’t quite land for you.
Afterwards, all students will be provided the ANONYMIZED questions and constructive criticism from their classmates.
Instructions for how to submit this feedback have been posted to the course Blackboard.
Overall grade#
Your grade will be based on a rubric nearly identical to that for the final report provided above, but using each of the five categories immediately above under “Your presentation.” The “Answering questions” category will be combined with your feedback to other students into a single category.
Technical interview#
Overview#
You will sit down with me for approximately 10 minutes in my office, with a live running Jupyter or Colab session with your final report, exactly as you submitted it.
I will ask you various questions about your report. Some of these will be conceptual, some more technical.
This will include requests to compute additional things on the fly.
Motivation#
LLMs have gotten very good at analyzing data—at least if you pick the right model and prompt it well. As such, it’s possible to have ChatGPT etc. generate much or perhaps all of this report with little effort or understanding on your part.
As such, this “technical interview” will enable me to evaluate how well you yourself understand the dataset, the analyses you performed, their implications, etc.
As a secondary motivation, technical interviews are commonplace in scientific and computing jobs. (And they usually are more like an hour rather than 10 minutes.) So this will give you some experience with that format; hopefully that will help you prepare for future technical interviews.
Dates and other logistics#
These will all be scheduled on Tuesday, December 16th and Wednesday, December 17th, in person in Marshak.
More details#
More details will be posted later.