In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and
nscc_student_data. Do some exploratory analysis of each
when you load them into your report.
The following chunk of code will load the mtcars dataset
into your environment. Familiarize yourself with the dataset before
proceeding with the next four questions.
# Store mtcars dataset into environment
mtcars <- mtcars
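A quick first look at the data can be sketched as follows. This is only one possible way to do the exploratory step, using base R functions; the choice of which columns to summarize is an assumption based on the variables used in the questions below.

```r
# Quick exploratory look at mtcars (ships with base R's datasets package)
data(mtcars)

str(mtcars)                         # variable names and types
print(summary(mtcars[, c("wt", "mpg")]))  # numeric summaries of the two variables used below
print(head(mtcars))                 # first six rows
print(colSums(is.na(mtcars)))       # number of missing values in each column
```

Note that wt is recorded in units of 1000 lbs, which matters when plugging weights into the regression equation later.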
a.) Create a scatterplot of the weight variable (explanatory
variable) and the miles per gallon variable (response variable) in the
mtcars dataset.
# Scatterplot of mpg (response) vs. weight (explanatory)
plot(mtcars$wt, mtcars$mpg, main = "MPG vs Weight", xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")
b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?
Yes, the scatterplot shows a strong negative linear relationship between a car’s weight and its miles per gallon. As the weight increases, the mpg tends to decrease.
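One way to check the visual impression of linearity is to overlay a straight-line fit and a smoothed trend on the same plot. The sketch below is an optional check, not part of the assigned question; the object name `fit` is hypothetical and used only here.

```r
# Redraw the scatterplot and overlay two trend lines as a visual check
plot(mtcars$wt, mtcars$mpg, main = "MPG vs Weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")

fit <- lm(mpg ~ wt, data = mtcars)             # hypothetical name, used only for this check
abline(fit, col = "red", lwd = 2)              # least-squares straight line
lines(lowess(mtcars$wt, mtcars$mpg), lty = 2)  # smoothed trend for comparison
```

If the dashed smoothed curve stays close to the red line, a linear description of the relationship is reasonable.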
a.) Calculate the linear correlation coefficient between weight and mpg.
# Correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The correlation coefficient is approximately -0.8677, which indicates a strong negative linear relationship between weight and mpg, as heavier cars tend to get lower gas mileage.
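For reference, the coefficient returned by cor() is the sample covariance divided by the product of the two sample standard deviations. A minimal sketch that reproduces it by hand (the names `x`, `y`, and `r_manual` are introduced here for illustration):

```r
# Reproduce the correlation coefficient from its definition
x <- mtcars$wt
y <- mtcars$mpg

r_manual <- cov(x, y) / (sd(x) * sd(y))  # covariance scaled by both standard deviations
print(r_manual)
print(all.equal(r_manual, cor(x, y)))    # TRUE if the manual value matches cor()
```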
a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.
# Fit and store a linear model of mpg ~ weight
lm_wt <- lm(mpg ~ wt, data = mtcars)
# View the coefficients
coef(lm_wt)
## (Intercept) wt
## 37.285126 -5.344472
The regression equation is: y = -5.344*x + 37.285, where x is the car’s weight (in 1000 lbs) and y is the predicted mpg.
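The coefficients reported by lm() can also be reproduced from summary statistics: the least-squares slope is the correlation times the ratio of standard deviations, and the line passes through the point of means. The sketch below shows this as a cross-check; the names `b0` and `b1` are introduced here for illustration.

```r
# Least-squares coefficients from summary statistics
x <- mtcars$wt
y <- mtcars$mpg

b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * (s_y / s_x)
b0 <- mean(y) - b1 * mean(x)      # intercept: the line passes through (mean(x), mean(y))
print(c(intercept = b0, slope = b1))
```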
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Prediction for 2,000 lb car (wt is in 1000 lbs, so x = 2)
-5.344*2 + 37.285
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# Prediction for 7,000 lb car (wt is in 1000 lbs, so x = 7)
-5.344*7 + 37.285
## [1] -0.123
d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.
The 2,000 lb prediction is reliable because 2,000 lbs falls within the range of weights in the dataset (about 1,513 to 5,424 lbs). The 7,000 lb prediction is not reliable: it lies far outside the observed range of weights, which makes it an extrapolation, and the predicted value of -0.123 mpg is physically impossible.
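Both predictions, and the extrapolation check, can also be done with predict() and range() instead of typing rounded coefficients by hand. This is a sketch of one way to do it; the names `new_cars`, `preds`, `wt_range`, and `inside_range` are introduced here for illustration.

```r
# Predict mpg with the fitted model and check each weight against the observed range
lm_wt <- lm(mpg ~ wt, data = mtcars)

new_cars <- data.frame(wt = c(2, 7))          # 2,000 lb and 7,000 lb, in units of 1000 lbs
preds <- predict(lm_wt, newdata = new_cars)   # uses the full-precision coefficients
print(preds)

wt_range <- range(mtcars$wt)                  # smallest and largest observed weight
print(wt_range)

# TRUE means the weight is inside the observed range (interpolation),
# FALSE means the prediction is an extrapolation
inside_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]
print(inside_range)
```

Because predict() uses unrounded coefficients, its results differ slightly in the third decimal place from the hand calculations with -5.344 and 37.285.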
What percent of the variability in a car’s mpg is explained by the car’s weight?
# R-squared value from the model summary
summary(lm_wt)$r.squared
## [1] 0.7528328
About 75.3% of the variability in a car’s mpg is explained by the car’s weight.
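In simple linear regression with one explanatory variable, R-squared equals the square of the correlation coefficient, so the value can be cross-checked against the earlier cor() result. A minimal sketch (the name `r` is introduced here for illustration):

```r
# Cross-check: R-squared equals the squared correlation in simple linear regression
lm_wt <- lm(mpg ~ wt, data = mtcars)
r <- cor(mtcars$wt, mtcars$mpg)

print(r^2)
print(all.equal(r^2, summary(lm_wt)$r.squared))  # TRUE if the two values agree
print(round(100 * r^2, 1))                       # R-squared expressed as a percentage
```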
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.
# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")
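A first look at the NSCC data could follow the sketch below. It assumes the CSV file sits in the working directory under the same name as in the chunk above, and that the columns are named Height, ShoeLength, and PulseRate as in the later questions. The helper `na_summary` is hypothetical, written here only to count missing values.

```r
# Hypothetical helper: count missing values in the chosen columns of a data frame
na_summary <- function(data, cols) {
  colSums(is.na(data[, cols, drop = FALSE]))
}

csv_path <- "nscc_student_data.csv"   # assumed to be in the working directory
if (file.exists(csv_path)) {
  nscc <- read.csv(csv_path)
  vars <- c("Height", "ShoeLength", "PulseRate")
  str(nscc)                       # variable names and types
  print(summary(nscc[, vars]))    # numeric summaries, including NA counts
  print(na_summary(nscc, vars))   # missing values per variable
}
```

Knowing how many values are missing explains why the correlation calls later need the use argument.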
I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.
a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.
# Scatterplot: height vs. shoe length
plot(nscc$ShoeLength, nscc$Height, main = "Height vs Shoe Length", xlab = "Shoe Length (inches)", ylab = "Height (inches)")
# Scatterplot: height vs. pulse rate
plot(nscc$PulseRate, nscc$Height, main = "Height vs Pulse Rate", xlab = "Pulse Rate", ylab = "Height (inches)")
b.) Discuss the two scatterplots individually. Based only on the scatterplots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatterplots alone, which explanatory variable appears to be the better predictor of height?
The scatterplot of height vs shoe length shows a moderate positive linear relationship: as shoe length increases, height generally increases as well.
The scatterplot of height vs pulse rate shows little to no linear relationship; the points appear to be randomly scattered without a clear trend.
Based on the scatterplots, shoe length appears to be a better predictor of height.
a.) Calculate the correlation coefficients for each pair of variables
from Question 5. Use the argument
use = "pairwise.complete.obs" in your cor()
call to handle missing values.
# Correlation coefficient: height vs. shoe length
cor(nscc$Height, nscc$ShoeLength, use = "pairwise.complete.obs")
## [1] 0.2695881
# Correlation coefficient: height vs. pulse rate
cor(nscc$Height, nscc$PulseRate, use = "pairwise.complete.obs")
## [1] 0.2028639
b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
Based on the correlation coefficients, shoe length is the better predictor of height because its correlation with height (about 0.27) is stronger than the correlation between height and pulse rate (about 0.20). Both correlations are weak, however.
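To see what the use argument does, the sketch below runs cor() on two small made-up vectors that each contain a missing value. The numbers are invented for illustration and are not from the NSCC data; the names `h` and `s` are hypothetical.

```r
# Made-up toy vectors (NOT from the NSCC data), each with one missing value
h <- c(60, 62, NA, 68, 70)
s <- c(9, 9.5, 10, NA, 11)

print(cor(h, s))                                  # NA: by default missing values propagate
print(cor(h, s, use = "pairwise.complete.obs"))   # computed only from pairs with both values present
print(sum(complete.cases(h, s)))                  # number of complete pairs actually used
```

With only two variables, "pairwise.complete.obs" simply drops every observation in which either value is missing.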
a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.
# Fit and store a linear model of height ~ shoe length
lm_sl <- lm(Height ~ ShoeLength, data = nscc)
# View the coefficients
coef(lm_sl)
## (Intercept) ShoeLength
## 60.3654950 0.5660485
The regression equation is: y = 0.566*x + 60.3655, where x is the shoe length (in inches) and y is the predicted height (in inches).
b.) Use that model to predict the height of someone with a 10-inch shoe length.
# Prediction for shoe length = 10
0.566*10 + 60.3655
## [1] 66.0255
c.) Do you think that prediction is accurate? Explain why or why not.
The prediction is reasonable as a rough estimate because a shoe length of 10 inches falls within the range of the observed data, so it is not an extrapolation. It is unlikely to be precise, though: the correlation between height and shoe length is only about 0.27, so shoe length explains only about 7% of the variability in height.
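One way to put a number on how precise the prediction is would be a prediction interval around the fitted value. The sketch below assumes the CSV file is in the working directory and that the columns are named Height and ShoeLength as above; the helper `height_interval` is hypothetical, and the 95% level is an assumption rather than something the assignment specifies.

```r
# Hypothetical helper: fit Height ~ ShoeLength and return a 95% prediction interval
height_interval <- function(data, shoe_length) {
  fit <- lm(Height ~ ShoeLength, data = data)   # rows with missing values are dropped by default
  predict(fit, newdata = data.frame(ShoeLength = shoe_length),
          interval = "prediction", level = 0.95)
}

if (file.exists("nscc_student_data.csv")) {
  nscc <- read.csv("nscc_student_data.csv")
  print(height_interval(nscc, 10))              # columns: fit, lwr, upr
  print(range(nscc$ShoeLength, na.rm = TRUE))   # confirms whether 10 inches is inside the observed range
}
```

A wide interval between lwr and upr would indicate that the point estimate, while not an extrapolation, carries a lot of uncertainty.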
a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?
Based on common sense, I would’ve expected the height vs pulse rate to have little to no relationship. A person’s pulse rate is determined by factors such as health, exercise, activity, or stress, not by height.
b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?
Although shoe length and height are related, the relationship is not necessarily strong because body proportions vary from person to person. Also, the sample is small and includes both male and female students of different ages and backgrounds, which increases the variability.