Appendix E — Multiple Linear Regression Project
THIS IS A LIVE DOCUMENT, anything that is changed from its original wording will be highlighted and part that is released will be appended to the bottom of this document.
Part | Release Date | Due Date |
---|---|---|
Part 1 | Monday, April 22 | Wednesday, April 24 |
Part 2 | Wednesday, April 24 | Friday, April 26 |
Part 3 | Monday, April 29 | Wednesday, May 1 |
Part 4 | Wednesday, May 1 | Friday, May 3 |
Part 5 | Monday, May 6 | Wednesday, May 8 |
In this project, you will investigate multiple quantitative variables and construct a LSRL from the multiple variables. You will apply all Chapter 3 and 12 knowledge that you have learned and use Google Sheets to analyze your data by creating scatter plots and other visualizations to check the assumptions of linear regression inference.
E.1 Part 1: Choosing your data
In this project , you will each choose data to analyze from the CORGIS Datasets Project
Look at the descriptions of the dataset and look at what variables you can analyze. Before making a final decision, make sure that your data set has at least 4 quantitative variables that you can work with.
When you are done with that, please write your choice on the attached spreadsheet in Google Classroom. Only 1 person per data set
Visit https://stapplet.com/quant2v.html and explore some relationships between the quantitative variables that you have. You will have to download the data set. Please create a new Google Doc, paste screenshots of the 3 - 4 scatter plots that you create, and write down your thoughts about them.
E.2 Part 2: Regression Lines
Now that you’ve explored multiple relationships between variables, now you can create some LSRLs for your selected variables.
Considering what you found out in Part 1 about some of the relationships, decide whether you want to keep exploring those relationships or not. You want to keep some or look at a new relationship(s).
For this part, you will explore 3 of these relationships and calculate and interpret several things about them.
For each of these 3 relationships, you will use the same stapplet app to
- calculate the LSRL,
- provide a screenshot of what the scatter plot looks like with the LSRL now,
- write the LSRL in context of the problem,
- interpret the intercept \(a\)
- interpret the slope \(b\)
- use the value of \(s\) and \(r^2\) to talk about how well the line predicts the values
- use the residual plot to talk about how well the line fits the data.
You’re expected to write each of these correctly as you have in chapter 3.
Use the template Google Doc already created for you on this part’s Google Classroom Assignment
E.3 Part 3: Starting Multiple Regression
You’ve explored multiple relationships, explored multiple LSRLs for them, and now you’re going to put everything into one equation!
Imagine our original goal: to just take two variables and find out about whether they have a true relationship in the form of \[y= a + bx + \epsilon \].
We’re now going to take this a step further and take multiple variables to see if they have a linear relationship with \(y\): \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \]
Where, for the simplicity of notation, we’re now going to write \(a\) as \(\beta_0\), which is our intercept. And we now are able to use any number, \(p\), of explanatory variables (\(x_1\), \(x_2\), \(\cdots\) , \(x_p\)) and each one has their own “slope” (\(\beta_1\), \(\beta_2\), \(\cdots\), \(\beta_p\)).
And we are going to use our same method of Least Squares Regression to estimate each of the to achieve: \[ \hat y = \hat \beta _0 + \hat \beta _1 x_1 + \hat \beta _2 x_2 + \cdots + \hat \beta _p x_p \]
With all of your insights from the previous parts, decide on one response variable. Leave what you think is interesting as explanatory variables. If you’re not sure what explanatory variables might be interesting to keep, continue reading…
Now use this applet https://stapplet.com/multreg.html and calculate a multiple regression line. In this applet, you’re able to enter multiple data for explanatory variables. If you weren’t sure about which explanatory variables to have before, you enter your data first! And you decide not to go through with it, just uncheck the “Included in model” checkbox for that variable.
Answer the following questions using the output for your model:
- What is the equation for multiple regression line?
- Interpret each of the coefficients (\(\hat \beta _0\), \(\hat \beta _1\), \(\cdots\), \(\hat \beta _p\)) that the output has provided to you. In other words, interpret the intercept and each of the slopes.
- Compare your intercepts and coefficients to what you had before in the previous parts. Is there anything surprising?
- Use the values of \(s\) and \(r^2\) to evaluate how well your model predicts your response variable
- What do you notice about your value for \(r^2\) in comparison to what you did in Part 2 when you had a LSRL with the same response variable?
- Use the residual plot to evaluate how well your model fits linearly.
You’re expected to complete these interpretations as clearly and correctly as you have previously. There’s nothing different!!!
Use the template Google Doc already created for you on this part’s Google Classroom Assignment
E.4 Part 4: Filtering and One-hot Encoding
In this part of the project, we will explore how to filter out data in Google Sheets using the Google Visualization API Query Language. We will also explore adding a categorical variable to our multiple regression line, allowing us to finally take in account of non-quantitative data.
E.4.1 Filtering data
First, convert your data file into a Google Spreadsheet (open up your file within Google Chrome, and click on open in Google Sheets). I’ll update this instruction during 4th period when I can see what a chromebook does.
Create a new sheet in the spreadsheet (click on the plus that you see on the bottom toolbar of Google Sheets).
In the new sheet, you will only edit the first cell of the new sheet.
You are going to use the QUERY
function. Here is the documentation that Google has created with multiple examples https://support.google.com/docs/answer/3093343?hl=en
Here’s an example of what I did with the UFO sightings data set from CORGIS. 1. I have only data from the first, second, fourth, and fifth columns (select A, B, D, E
). 2. And I filtered out the data such that I keep only the rows where the state is Alaska (where B='AK'
), the UFO shape is “circle” (where D='circle'
), and the year is between 2007 and 2012 (where I > 2007 and I < 2012
).
I wrote this all as one statement:
=QUERY(ufo_sightings!1:60635,"select A, B, C, D where B='AK' and D='circle' and I > 2007 and I < 2012")
Here’s what the spreadsheet looks like: https://docs.google.com/spreadsheets/d/1KxDIefR9i1YhRER5f6zdrxlrYDij5-w5i6GCx69hEas/edit?usp=sharing
ufo_sightings!1:60635
, and how do you do it?
- Start off your new sheet by editing just the first cell, and type
=QUERY(
- At this point (don’t close the cell, don’t press enter yet), go into the sheet that contains your data and click on the top left cell that contains no data (the cell to the left of column “A” and above row “1”).
- Google Sheets automatically inputs the correct range of the data for you, and you can go back to the new sheet and continue typing.
Your task is to, at the very least, show that you know how to filter out the data using two conditions (there should be an and
). If you obviously had bad data / outliers, I expect to see you use the filtered data in your modeling in Part 5.
Follow the steps above to filter your data and attach the final spreadsheet to your Google Classroom Assignment for this part.
You still have more to do below for this part!!!
E.4.2 One-hot encoding
If you didn’t have any bad data / outliers (if you didn’t have to filter your data), please do this section without the work that you just did to filter the data.
So… how do we use categorical variables in a quantitative model? The idea is that we can add a certain value to the predicted value if we see a certain category for some observation. We are going to transform our data set by one-hot encoding our categorical variables so that we can accomplish this. You are expected to include a categorical variable in your final model for Part 5.
I’ve made it simple for you. Using your data set (if you needed to filter it, it should now be filtered), now make sure to include at least one categorical variable (please choose one that doesn’t have too many categories). Follow the previous section about filtering to help yourself with that again. select
is a powerful tool and you can use it to get rid of everything that you don’t need.
If there are too many categories, feel free to filter out some of the categories to only incldue data that has the categories that you’re interested in.
One-hot encoding is easy! All data scientists or anyone working with data analysis have to do normally is one step. And you’ve also been provided an easy one step.
Visit https://onehot.echa.ng.
- Download your final data from your spreadsheet (go to file > download > Comma Separated Values (.csv))
- “Choose” the file on the website
- Hit upload
- Only checkmark the column names that correspond to your categorical variables
- Hit Encode and download
Once you’ve done this, you’ve successfully one-hot encoded and have a new file for you data set that has transformed your data!
Three things to submit to the Google Classroom assignment:
The Google Spreadsheet that contains the work that you did to filter the data using the
QUERY
functionThe One-Hot encoded data.
The attached Google Doc that has some things for you to answer:
Take a moment to open up your one-hot encoded data file. Explain what one-hot encoding does. Hint: your original column(s) has been split up into multiple columns–compare the original to the new columns. Also, provide an example of the original data and what the one hot encoded values for that original data now looks like and describe what happened with each category.
Open up your one-hot encoded data, and do a multiple regression with your one-hot encoded columns that correspond to just one of your categorical variable. You should use the same response variable as Part 3 and just do the regression with one categorical variable (which you will input as 2+ variables).
- What’s the regression line?
- Interpret the first “slope,” \(\hat \beta _1\). It’s now slightly different from what we have before (it’s no longer, “for every 1…”).
It’ll be pretty obvious if you use AI to answer the question… The question isn’t at all hard to answer.
Let me know if you think there are any bugs on https://onehot.echa.ng!!! ChatGPT basically wrote all the code for me.
E.5 Part 5: The Product
Using the all of your previous insights into the relationships that you saw, filtered data (if neccessary), and one-hot encoding, construct a multiple regression model that predicts 1 response variable using at least 3 quantitative explanatory variables and at least 1 categorical variable that has been one-hot encoded.
Complete the slide template provided on the Google Classroom assignment for this part of the project. The template has you answer these questions about your multiple regression model:
- Describe your dataset.
- What piqued your interest in picking this data?
- What are some interesting characteristics about it?
- What is the equation for your multiple regression model?
- Use your values of \(r^2\) and \(s\) to assess how well your model predicts your response variable.
- Use your residual plot to assess how well this linear model fits the data.
- What’s a useful skill that you might use this for?
- What were some challenges that came up with your model (from previous parts to this final part)?