Learning more about linear regressions using large language models

5. Learning more about linear regressions using large language models

5.1. Preliminaries

5.1.1. Introduction

In this assignment, you’ll use ChatGPT and (optionally) a partner from class to teach yourself (and the rest of us!) some key concepts regarding correlations and linear regressions that were not directly covered in class.

Disclaimer: this assignment is an experiment

Obviously this is not your typical problem set. And your professor has never tried anything remotely like this in the past. So we’ll all have to see together how it goes: great, total fail, or anything in between.

What that means in practical terms: this will be graded on a completion basis. As long as you make a good-faith effort on each required component of the assignment, you’ll receive full credit for that component. What constitutes a good-faith effort? I’ll leave that to your judgment.

5.1.2. Objectives

Learn how to perform hypothesis tests on correlation coefficients and linear regression.
Learn how to assess the “goodness of fit” of linear regression models.
Learn how to use ChatGPT and other AI tools effectively to teach yourself scientific concepts and the techniques for implementing them.
Practice effectively sharing questions and/or findings with your peers, and subsequently providing constructive, professional responses to what your peers share.

5.2. The scientific concepts

5.2.1. Hypothesis tests for correlation coefficients and linear regression slopes

We’ve already covered hypothesis tests for differences in mean. And separately we’ve covered correlation coefficients and linear regression. Now, use ChatGPT to learn about hypothesis testing for correlation coefficients and for linear regression slopes.

Potential questions to explore: How are they computed in general? In Python? Can you generate synthetic data that helps illustrate the concepts? Or upload the Central Park dataset (or perhaps a subset of it) to GPT or Claude and see what
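As one possible starting point for the synthetic-data question, here is a minimal sketch of both tests in Python. Everything in it is an assumption made for illustration: the data are randomly generated, the variable names are arbitrary, and SciPy is just one library that provides these tests. It checks the null hypothesis that the correlation is zero and the null hypothesis that the regression slope is zero, then reproduces the slope test by hand from the slope and its standard error.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration, not course data):
# y depends weakly on x, plus random noise.
rng = np.random.default_rng(seed=0)
n = 30
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

# Test H0: the population correlation coefficient is zero.
r, p_corr = stats.pearsonr(x, y)

# Fit y = slope * x + intercept; fit.pvalue tests H0: slope is zero.
fit = stats.linregress(x, y)

# Reproduce the slope test by hand: t = slope / standard error,
# compared against a t distribution with n - 2 degrees of freedom.
t_stat = fit.slope / fit.stderr
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.3f}, p-value = {p_corr:.3g}")
print(f"slope = {fit.slope:.3f} +/- {fit.stderr:.3f}, p-value = {fit.pvalue:.3g}")
print(f"slope p-value computed by hand = {p_manual:.3g}")
```

For a regression with a single predictor, the two tests should give the same p-value, which is itself a nice thing to ask your LLM to explain. Try changing the slope, the noise level, or the sample size and see how the p-values respond.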

5.2.2. Model goodness of fit

A key aspect of linear regression models is quantifying their accuracy; this is known as goodness of fit. Use ChatGPT or another LLM to teach yourself more about this topic. I’d recommend starting with three acronyms: SSE, SSR, and SST. What do they mean and how do they relate?
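To make the three acronyms concrete, here is a minimal sketch that computes them for a least-squares line fit to synthetic data. The data and variable names are assumptions for illustration. The naming is also an assumption: here SSR means the regression (explained) sum of squares and SSE means the error (residual) sum of squares, but some textbooks swap or rename these, so check which convention your source uses.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy straight line.
rng = np.random.default_rng(seed=1)
x = np.linspace(0, 10, 40)
y = 1.5 * x + 3.0 + rng.normal(scale=2.0, size=x.size)

# Ordinary least-squares line with an intercept.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares
sse = np.sum((y - y_hat) ** 2)         # error (residual) sum of squares

# Coefficient of determination: fraction of the variance explained by the fit.
r_squared = 1 - sse / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"SSR + SSE = {ssr + sse:.2f}")
print(f"R^2 = {r_squared:.3f}")
```

For a least-squares fit that includes an intercept, SST should equal SSR + SSE, and R^2 should equal the square of the correlation coefficient between x and y. Why that holds is a good question to put to your LLM.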

5.3. What you’ll submit

5.3.1. Monday, Nov. 20th, by 10pm ET: Jupyter notebook including screenshots of GPT session

Submit your assignment as a Jupyter notebook file as usual. It should include two sections, one on the hypothesis tests and one on model fit. Each section should be your summary of what you learned, NOT just a transcript of your conversation with Chat. (It’s fine if you include some text taken verbatim from Chat in this particular case. And doubly so for Python code.)

Please also include somewhere one annotated screenshot of your interactions with ChatGPT. The particular prompt(s) and response(s) you pick are up to you: choose anything you think is noteworthy (e.g. something surprisingly helpful, something it gets wrong, a hallucination, or anything else really).

5.3.2. The next day: post one question or finding and one response to others’ posts on this assignment on Blackboard

Each of you likely knows already or will learn in this process something that most or all of the rest of us do not, whether about linear regression or about using LLMs. And surely each of you will find at least one thing puzzling or worth more thought as you go about this exercise—one that one or more of your classmates may well be able to help with. So let’s crowdsource these questions, comments, and answers into a shared resource you all can use.

Specifically, by the first deadline listed below, post on the course Blackboard at least one substantive question or interesting finding that emerged during this assignment. Access this via the “Discussions” tab in Blackboard, where there’s already a forum set up for this assignment.

Then, by the end of the following day, post a reply to another student’s post. Your reply could be an answer to the question they pose, a request for clarification, an agreement or disagreement with the argument they make, or really anything else so long as it is civil, professional, and constructive.

5.4. Logistics

5.4.1. Optional: work with a partner

I encourage you to pair up with one classmate on this assignment. If you do so, the two of you will submit a single assignment (just make sure both your names are on it), and you will both receive an identical grade. You don’t have to specify in advance if you’re working with a partner or not.

5.4.2. Using ChatGPT (or an alternative tool)

If you don’t already have an account with OpenAI, you’ll need to create one. Just follow their prompts. You can use the free version; this gives you access to GPT-3.5, which is one generation prior to the current one, GPT-4. If you already have access to GPT-4, then of course use it, but otherwise not to worry. You do not have to purchase anything for this assignment.

You are also welcome to use another available AI tool of your choice. Options include Claude.

5.5. Extra credit

5.5.1. Go above and beyond

If you really go the extra mile on this assignment, you can earn up to 5% extra credit. Because it is so open-ended and experimental, it’s hard to say more concretely than that. But broadly, I’m looking here for analyses, explanations, derivations, plots, or other material that goes far beyond the “good-faith” effort that will otherwise get you full credit.

5.5.2. Apply these tools to your final project

Compute one or more correlation coefficients and one or more linear regressions. Perform hypothesis tests for each, and assess the goodness of fit of the linear regression model.