Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.


Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each when you load them into your report.
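
A minimal sketch of what that exploratory pass might look like, shown here for mtcars because it ships with base R. The choice of columns and calls is only an illustration; the same calls should apply to nscc_student_data once it has been loaded.

```r
# Quick exploratory look at mtcars (included in base R's datasets package)
data(mtcars)

dim(mtcars)                        # number of rows and columns
head(mtcars, 3)                    # first few observations
summary(mtcars[, c("mpg", "wt")])  # min, quartiles, mean, max for two columns
colSums(is.na(mtcars))             # count of missing values per column
```

Checking for missing values early matters more for the student dataset, where later questions need the `use` argument of cor().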


Part 1 – mtcars Dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next four questions.

# Store mtcars dataset into environment
mtcars <- mtcars

# Preview dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Question 1 – Scatterplot

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Scatterplot of mpg (response) vs. weight (explanatory)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon", main = "Car Weight vs. MPG")

b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?

Yes, there appears to be a negative linear relationship between a car’s weight and its mpg.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient between weight and mpg.

# Correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation is approximately -0.868. This suggests a strong negative linear relationship: cars that weigh more tend to be less fuel efficient.
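
As a cross-check, the coefficient can be rebuilt from its definition, the covariance divided by the product of the two standard deviations. The sketch below is illustrative only; the names `r_manual` and `r_builtin` are introduced here and are not part of the assignment.

```r
# Recompute Pearson's r from its definition and compare with cor()
x <- mtcars$wt
y <- mtcars$mpg

r_manual  <- cov(x, y) / (sd(x) * sd(y))  # definition of the sample correlation
r_builtin <- cor(x, y)                    # built-in calculation

all.equal(r_manual, r_builtin)            # TRUE when the two agree
```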

Question 3 – Regression Model

a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.

# Fit and store a linear model of mpg ~ weight
lmcars <- lm(mpg~wt, data=mtcars)

# View the coefficients
lmcars
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

Regression equation:

mpg = 37.285 - 5.344 * wt
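
The coefficients can also be pulled out programmatically with coef() and checked against the textbook least-squares formulas (slope = r * sd(y) / sd(x), intercept = mean(y) - slope * mean(x)). This is a sketch for verification; `fit`, `slope_manual`, and `intercept_manual` are names chosen here for illustration.

```r
# Extract the fitted coefficients and verify them by hand
fit <- lm(mpg ~ wt, data = mtcars)
b   <- coef(fit)   # named vector: "(Intercept)" and "wt"

# Least-squares formulas for simple linear regression
slope_manual     <- cor(mtcars$wt, mtcars$mpg) * sd(mtcars$mpg) / sd(mtcars$wt)
intercept_manual <- mean(mtcars$mpg) - slope_manual * mean(mtcars$wt)

round(b, 3)
round(c(intercept_manual, slope_manual), 3)
```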

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Prediction for 2,000 lb car
37.285 + (-5.344) * 2
## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Prediction for 7,000 lb car
37.285 + (-5.344) * 7
## [1] -0.123

d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.

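No. The 7,000 lb prediction is unreliable because it is an extrapolation: the heaviest car in mtcars weighs about 5,424 lbs, so 7,000 lbs lies well outside the observed data, and the predicted value (-0.123 mpg) is physically impossible. The 2,000 lb prediction is more reasonable because that weight falls within the observed range of roughly 1,513 to 5,424 lbs.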
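
One way to make the extrapolation check explicit is to compare each new weight against range(mtcars$wt) and let predict() do the arithmetic. This is a sketch, with `new_cars`, `pred`, and `in_range` as illustrative names. Because predict() uses the unrounded coefficients, its results may differ slightly in the later decimals from the hand calculations above.

```r
# Predict mpg for new weights and flag any that fall outside the observed data
fit      <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2, 7))   # weights in 1000 lbs

pred     <- predict(fit, newdata = new_cars)
wt_range <- range(mtcars$wt)           # smallest and largest observed weight
in_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]

data.frame(wt = new_cars$wt, predicted_mpg = pred, in_range = in_range)
```

A value of FALSE in the in_range column marks a prediction that relies on extrapolation.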

Question 4 – Explained Variation

What percent of the variability in a car’s mpg is explained by the car’s weight?

# R-squared value from the model summary
summary(lmcars)$r.squared
## [1] 0.7528328

R^2 is approximately 0.7528, so about 75.28% of the variability in mpg is explained by the weight of the car.
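
In simple linear regression, R^2 equals the square of the correlation coefficient, and it can also be written as 1 - SSE/SST. The sketch below checks both identities on mtcars; the variable names are chosen here for illustration.

```r
# Three routes to the same R-squared value
fit <- lm(mpg ~ wt, data = mtcars)

r_squared <- summary(fit)$r.squared          # reported by the model summary
r         <- cor(mtcars$wt, mtcars$mpg)      # correlation from Question 2

sse <- sum(residuals(fit)^2)                       # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)      # total variation

all.equal(r^2, r_squared)            # squared correlation matches R-squared
all.equal(1 - sse / sst, r_squared)  # so does 1 - SSE/SST
```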


Part 2 – NSCC Student Dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.

# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")

Question 5 – Scatterplots

I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.

# Scatterplot: height vs. shoe length
plot(nscc$ShoeLength, nscc$Height, xlab="Shoe Length (in)", ylab="Height (in)", main="NSCC Student Shoe Length vs Height")

# Scatterplot: height vs. pulse rate
plot(nscc$PulseRate, nscc$Height, xlab="Pulse (bpm)", ylab="Height (in)", main="NSCC Student Pulse Rate vs Height")

b.) Discuss the two scatterplots individually. Based only on the scatterplots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatterplots alone, which explanatory variable appears to be the better predictor of height?

There appears to be a moderate positive linear relationship between shoe length and height for NSCC students. There appears to be little to no linear relationship between pulse rate and height for NSCC students. Based on the scatterplots alone, shoe length appears to be a better predictor of height.

Question 6 – Correlation Coefficients

a.) Calculate the correlation coefficients for each pair of variables from Question 5. Use the argument use = "pairwise.complete.obs" in your cor() call to handle missing values.

# Correlation coefficient: height vs. shoe length
cor(nscc$ShoeLength, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2695881

# Correlation coefficient: height vs. pulse rate
cor(nscc$PulseRate, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2028639

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Based on the correlation coefficients alone, shoe length is the better predictor of height (r ≈ 0.270 versus r ≈ 0.203 for pulse rate), although both correlations are weak.
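
To see what use = "pairwise.complete.obs" does, the sketch below uses a small made-up data frame. The `toy` values are hypothetical and are not taken from nscc_student_data; only the column names mirror the real dataset. Without the `use` argument, a single missing value makes cor() return NA; with it, only rows where both variables are present are used.

```r
# Hypothetical data with missing values (NOT the real NSCC data)
toy <- data.frame(
  Height     = c(62, 65, NA, 70, 68, 64),
  ShoeLength = c(9, 10, 11, NA, 11.5, 9.5)
)

r_default  <- cor(toy$ShoeLength, toy$Height)   # NA because of missing values
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")

# Number of rows that actually contribute to r_pairwise
n_used <- sum(complete.cases(toy))

r_default
r_pairwise
n_used
```

Reporting the number of complete rows alongside the coefficient makes it clear how much data each correlation is based on.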

Question 7 – Regression Equation and Prediction

a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.

# Fit and store a linear model of height ~ shoe length
lmNSCC <- lm(Height~ShoeLength, data=nscc)

# View the coefficients
lmNSCC
## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Regression equation:

Height = 60.365 + 0.566 * ShoeLength

b.) Use that model to predict the height of someone with a 10-inch shoe length.

# Prediction for shoe length = 10
60.365 + 0.566 * 10
## [1] 66.025

c.) Do you think that prediction is accurate? Explain why or why not.

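Only roughly. A 10-inch shoe length is within the range of the observed data, so the prediction is not an extrapolation. However, the correlation between shoe length and height is weak (r ≈ 0.27), so shoe length explains only a small share of the variability in height, and any one person's actual height could be far from 66 inches.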
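
One way to gauge how much trust a prediction deserves is the share of variability the model explains. The sketch below squares the correlation coefficients reported in Question 6, assuming cor() and lm() use the same complete rows, in which case r^2 equals R^2 for a simple linear regression. The values are copied from the output above rather than recomputed from the data.

```r
# Correlations copied from the Question 6 output
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# Squared correlation = proportion of height variability explained
r2_shoe  <- r_shoe^2
r2_pulse <- r_pulse^2

round(100 * c(shoe_length = r2_shoe, pulse_rate = r2_pulse), 1)  # in percent
```

Both percentages come out in the single digits, which is consistent with calling both Part 2 models weak.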

Question 8 – Reflecting on Poor Models

a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?

I would have expected the PulseRate and Height variables to show little to no relationship before the analysis. Based on common sense alone, there is no obvious reason for someone’s pulse rate to be related to their height.

b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?

The relationship between shoe length and height may not have been as strong as expected due to the small sample size, natural variation, or outliers in the data.