Project #6 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each when you load them into your report.

Part 1 – mtcars Dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next four questions.

# Store mtcars dataset into environment
mtcars <- mtcars
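
As a starting point for the exploratory analysis requested under Preparation, the sketch below uses only base R. Which columns to summarize is an assumption made here for illustration; the assignment does not prescribe specific checks.

```r
# Structure of the built-in mtcars data: 32 cars, 11 numeric variables
str(mtcars)

# Summary statistics for the two variables used in Part 1
summary(mtcars[, c("mpg", "wt")])

# Observed range of weight, in units of 1,000 lbs
range(mtcars$wt)
```

The weight range is worth noting early, since it determines which predictions later count as extrapolation.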

Question 1 – Scatterplot

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Scatterplot of mpg (response) vs. weight (explanatory)
plot(mtcars$wt, mtcars$mpg, main = "Car Weight by Miles per Gallon", xlab = "Weight (1000 lbs)", ylab = "MPG (miles/gallon)")

b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?

There is a clear negative linear relationship: mpg tends to fall as weight rises. The points show some scatter around the trend, so the relationship is not perfectly linear.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient between weight and mpg.

# Correlation coefficient
cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

With a correlation coefficient of about -0.87, the linear relationship is negative and strong: heavier cars tend to have lower mpg.
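
To see what cor() is computing, the coefficient can be rebuilt from its definition. This is a minimal sketch; the names x, y, and r_manual are introduced here for illustration and are not part of the assignment.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson's r: sum of cross-deviations divided by the square root of the
# product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual

# Should agree with the built-in function up to floating-point tolerance
all.equal(r_manual, cor(x, y))
```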

Question 3 – Regression Model

a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.

# Fit and store a linear model of mpg ~ weight
lm_mw <- lm(mpg ~ wt, data = mtcars)

# View the coefficients
lm_mw

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

# Add the fitted regression line to the scatterplot
plot(mtcars$wt, mtcars$mpg, main = "Car Weight by Miles per Gallon", xlab = "Weight (1000 lbs)", ylab = "MPG (miles/gallon)")
# Passing the model object avoids retyping rounded coefficients
abline(lm_mw, col = "red")

Regression Equation: \(y = -5.344x + 37.285\), where \(x\) is the car's weight in units of 1,000 lbs and \(y\) is the predicted mpg.
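
For a single explanatory variable, the least-squares slope and intercept can also be obtained from summary statistics: the slope is r times the ratio of standard deviations, and the line passes through the point of means. The sketch below checks this against lm(); the names slope and intercept are introduced here for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Slope: correlation times (sd of response / sd of explanatory)
slope <- cor(x, y) * sd(y) / sd(x)

# Intercept: the fitted line passes through (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

c(intercept = intercept, slope = slope)

# Coefficients from lm() for comparison
coef(lm(mpg ~ wt, data = mtcars))
```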

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Prediction for a 2,000 lb car (wt is measured in 1,000 lbs, so x = 2)
-5.344*2 + 37.285

## [1] 26.597

The estimated fuel economy of a 2,000 lb car is about 26.6 miles per gallon.

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Prediction for 7,000 lb car
-5.344*7 + 37.285

## [1] -0.123

The estimated fuel economy of a 7,000 lb car is -0.12 miles per gallon.

d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.

The prediction for part b falls well within the observed weights (roughly 1,500 to 5,400 lbs in mtcars) and is therefore reliable. The prediction for part c is an extrapolation beyond the data and gives a nonsensical negative mpg, making it unreliable.
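
The same two predictions can be made with predict(), which uses the unrounded coefficients, and the extrapolation check can be made explicit by comparing each weight with the observed range. This is a sketch; the names new_cars and wt_range and the extrapolated column are introduced here for illustration.

```r
lm_mw <- lm(mpg ~ wt, data = mtcars)

# Weights are in units of 1,000 lbs, so 2,000 lbs and 7,000 lbs are 2 and 7
new_cars <- data.frame(wt = c(2, 7))
new_cars$predicted_mpg <- predict(lm_mw, newdata = new_cars)

# Flag any weight that lies outside the range observed in mtcars
wt_range <- range(mtcars$wt)
new_cars$extrapolated <- new_cars$wt < wt_range[1] | new_cars$wt > wt_range[2]

new_cars
```

Because predict() does not round the coefficients, its results may differ from the hand calculations in the third decimal place.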

Question 4 – Explained Variation

What percent of the variability in a car’s mpg is explained by the car’s weight?

# R-squared value from the model summary
summary(lm_mw)$r.squared

## [1] 0.7528328

About 75.3% of the variability in a car’s mpg is explained by weight.
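
In a regression with one explanatory variable, R-squared equals the square of the correlation coefficient from Question 2. A short check, with r as an illustrative name:

```r
lm_mw <- lm(mpg ~ wt, data = mtcars)
r <- cor(mtcars$wt, mtcars$mpg)

# Squared correlation coefficient
r^2

# Should match the model's R-squared up to floating-point tolerance
all.equal(r^2, summary(lm_mw)$r.squared)
```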

Part 2 – NSCC Student Dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.

# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")

# Replace two obviously erroneous values with NA
nscc[15, 5] <- NA
nscc[21, 6] <- NA

Question 5 – Scatterplots

I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.

# Scatterplot: height vs. shoe length
plot(nscc$ShoeLength, nscc$Height, main = "NSCC Student Shoe Length by Height", xlab = "Shoe Length (in.)", ylab = "Height (in.)")

# Scatterplot: height vs. pulse rate
plot(nscc$PulseRate, nscc$Height, main = "NSCC Student Pulse Rate by Height", xlab = "Pulse Rate (BPM)", ylab = "Height (in.)")

b.) Discuss the two scatterplots individually. Based only on the scatterplots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatterplots alone, which explanatory variable appears to be the better predictor of height?

For height against shoe length, there may be a positive linear relationship, but the points are widely scattered, so it is weak at best. For height against pulse rate, there is no evident trend at all. Of the two, shoe length appears to be the better predictor of height.

Question 6 – Correlation Coefficients

a.) Calculate the correlation coefficients for each pair of variables from Question 5. Use the argument use = "pairwise.complete.obs" in your cor() call to handle missing values.

# Correlation coefficient: height vs. shoe length
cor(nscc$ShoeLength, nscc$Height, use = "pairwise.complete.obs")

## [1] 0.3816193

# Correlation coefficient: height vs. pulse rate
cor(nscc$PulseRate, nscc$Height, use = "pairwise.complete.obs")

## [1] -0.2065758
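
The effect of use = "pairwise.complete.obs" is easiest to see on a small example. The values below are made up for illustration only and are not taken from nscc_student_data.csv; the names shoe, height, and keep are likewise hypothetical.

```r
# Made-up values for illustration only; NOT from the NSCC dataset
shoe   <- c(9.5, 10.0, NA, 11.0, 12.5, 10.5)
height <- c(64, 66, 70, NA, 73, 68)

# By default, any missing value makes the result NA
cor(shoe, height)

# With pairwise deletion, only rows where both values are present are used
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# Equivalent manual approach: keep only the complete pairs, then correlate
keep <- complete.cases(shoe, height)
r_manual <- cor(shoe[keep], height[keep])

all.equal(r_pairwise, r_manual)
```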

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

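The correlation coefficients suggest that shoe length is the better predictor: its coefficient of 0.38 is larger in absolute value than the -0.21 for pulse rate, although both indicate weak relationships.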
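
The comparison rests on the magnitude of each coefficient, not its sign. The sketch below uses the two values reported above; r_shoe and r_pulse are illustrative names.

```r
# Correlation coefficients reported in part a
r_shoe  <- 0.3816193
r_pulse <- -0.2065758

# Strength of a linear relationship depends on |r|, not on the sign
abs(r_shoe) > abs(r_pulse)

# Squaring gives the share of variation in height explained by each
# variable on its own
round(c(shoe = r_shoe^2, pulse = r_pulse^2), 3)
```

Even the better predictor accounts for only about 15% of the variation in height, compared with about 4% for pulse rate.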

Question 7 – Regression Equation and Prediction

a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.

# Fit and store a linear model of height ~ shoe length
lm_hs <- lm(Height ~ ShoeLength, data = nscc)

# View the coefficients
lm_hs

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      53.768        1.236

# Scatterplot with the fitted regression line
plot(nscc$ShoeLength, nscc$Height, main = "NSCC Student Shoe Length by Height", xlab = "Shoe Length (in.)", ylab = "Height (in.)")
# Passing the model object avoids retyping rounded coefficients
abline(lm_hs, col = "red")

Regression Equation: \(y = 1.236x + 53.768\), where \(x\) is shoe length in inches and \(y\) is the predicted height in inches.

b.) Use that model to predict the height of someone with a 10-inch shoe length.

# Prediction for shoe length = 10
1.236*10 + 53.768

## [1] 66.128

The predicted height is about 66.1 inches.

c.) Do you think that prediction is accurate? Explain why or why not.

The prediction seems reasonable given how the data are spread, but in general I don’t think the trend in the data is strong enough for any single prediction to hold much weight.

Question 8 – Reflecting on Poor Models

a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?

I expected the pulse rate and height variables to have basically no relationship, and I wasn’t surprised by the results.

b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?

I didn’t really expect shoe length and height to have a strong relationship, though I also wouldn’t have been surprised if it had been stronger. Data quality could be weakening the relationship: I already removed a couple of values that were obviously wrong, but other questionable values may remain. The sample is also fairly small, which may be too few students for a clear trend to emerge.