Project #6 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each when you load them into your report.

Part 1 – mtcars Dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next four questions.

# Store mtcars dataset into environment
mtcars <- mtcars
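
One possible way to do the exploratory step is sketched below. It uses only base R functions; the choice of which summaries to show is an assumption, not a requirement of the assignment.

```r
# Quick exploratory look at mtcars (ships with base R in the datasets package)
data(mtcars)

str(mtcars)                         # variable names and types
head(mtcars)                        # first six rows
summary(mtcars[, c("wt", "mpg")])   # summaries of the two variables used in Part 1
range(mtcars$wt)                    # wt is recorded in units of 1000 lbs
```

The range of `wt` is worth noting now, since it determines which weights the model can reasonably be used to predict later.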

Question 1 – Scatterplot

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Scatterplot of mpg (response) vs. weight (explanatory)

plot(mtcars$wt, mtcars$mpg, main = "Automobiles' Miles per Gallon by Weight", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")

b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?

There appears to be a strong, negative linear relationship between a car’s weight and its miles per gallon.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient between weight and mpg.

# Correlation coefficient

cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient of approximately -0.868 indicates that there is a strong, negative linear relationship between a car’s weight and its miles per gallon.
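
As a check on what cor() is computing, the sketch below rebuilds Pearson's r from z-scores. The variable names (x, y, z_x, z_y, r_manual) are illustrative only.

```r
# Pearson's r as the sum of z-score products divided by n - 1
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

z_x <- (x - mean(x)) / sd(x)   # standardized weight
z_y <- (y - mean(y)) / sd(y)   # standardized mpg

r_manual <- sum(z_x * z_y) / (n - 1)

r_manual                        # should agree with cor(x, y)
all.equal(r_manual, cor(x, y))  # compares the two up to floating-point tolerance
```

Because heavier cars tend to sit above the mean weight and below the mean mpg, most of the z-score products are negative, which is why r comes out negative.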

Question 3 – Regression Model

a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.

# Fit and store a linear model of mpg ~ weight

lm_mpg <- lm(mpg ~ wt, data = mtcars)

# View the coefficients

lm_mpg

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

Regression Equation: predicted mpg = 37.285 - 5.344 * wt, where wt is the car’s weight in thousands of pounds
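
Rather than copying the coefficients from the printed output, they can be extracted with coef(). The sketch below also recomputes the slope and intercept from summary statistics to show where the least-squares values come from; the names b, slope, and intercept are illustrative.

```r
# Refit the model so this chunk runs on its own
lm_mpg <- lm(mpg ~ wt, data = mtcars)

# Named vector with elements "(Intercept)" and "wt"
b <- coef(lm_mpg)
b

# Least-squares formulas: slope = r * (s_y / s_x), intercept = ybar - slope * xbar
slope     <- cor(mtcars$wt, mtcars$mpg) * sd(mtcars$mpg) / sd(mtcars$wt)
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$wt)

c(intercept = intercept, slope = slope)   # should match coef(lm_mpg)
```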

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Prediction for 2,000 lb car

-5.344*2 + 37.285

## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Prediction for 7,000 lb car

-5.344*7 + 37.285

## [1] -0.123

d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.

The prediction in part b seems reliable. A 2,000 lb car falls within the range of weights observed in mtcars (roughly 1,500 to 5,400 lbs), and the estimate of about 26.6 miles per gallon is consistent with the scatterplot. The prediction in part c is not reliable. A 7,000 lb car falls outside the range of the observed data, so the estimate is an extrapolation. The predicted value of -0.123 miles per gallon confirms the problem, since miles per gallon cannot be negative.
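
The same predictions can be obtained with predict(), along with an explicit check of whether each weight lies inside the observed range. This is a sketch: the data frame new_cars and the helper columns are illustrative names. Since predict() uses the unrounded coefficients, its results may differ slightly from the hand calculations in later decimal places.

```r
# Refit the model so this chunk runs on its own
lm_mpg <- lm(mpg ~ wt, data = mtcars)

# New weights to predict for, in units of 1000 lbs (2,000 lbs and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))
pred <- predict(lm_mpg, newdata = new_cars)

# Flag any weight outside the observed range as an extrapolation
wt_range <- range(mtcars$wt)
in_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]

data.frame(wt = new_cars$wt,
           predicted_mpg = pred,
           within_observed_range = in_range)
```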

Question 4 – Explained Variation

What percent of the variability in a car’s mpg is explained by the car’s weight?

# R-squared value from the model summary

summary(lm_mpg)$r.squared

## [1] 0.7528328

About 75.3% of the variability in a car’s mpg is explained by the car’s weight.
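
For a regression with a single explanatory variable, R-squared equals the square of the correlation coefficient. The sketch below checks this against the model summary and against the definition 1 - SSE/SST; the variable names are illustrative.

```r
# Refit the model so this chunk runs on its own
lm_mpg <- lm(mpg ~ wt, data = mtcars)

r  <- cor(mtcars$wt, mtcars$mpg)
r2 <- summary(lm_mpg)$r.squared

# R-squared from its definition: 1 - (residual sum of squares / total sum of squares)
sse <- sum(residuals(lm_mpg)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - sse / sst

c(from_cor = r^2, from_summary = r2, from_definition = r2_manual)
```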

Part 2 – NSCC Student Dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.

# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")

Question 5 – Scatterplots

I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.

# Scatterplot: height vs. shoe length

plot(nscc$ShoeLength, nscc$Height, main = "NSCC Students' Height by Shoe Length", xlab = "Shoe Length (inches)", ylab = "Height (inches)")

# Scatterplot: height vs. pulse rate

plot(nscc$PulseRate, nscc$Height, main = "NSCC Students' Height by Pulse Rate", xlab = "Pulse Rate (bpm)", ylab = "Height (inches)")

b.) Discuss the two scatterplots individually. Based only on the scatterplots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatterplots alone, which explanatory variable appears to be the better predictor of height?

The first scatterplot shows a weak, positive linear relationship between a student’s shoe length and their height.

The second scatterplot shows an extremely weak relationship, if any, between a student’s pulse rate and their height. No clear linear pattern is apparent.

Based on the scatterplots alone, shoe length appears to be the better predictor of height.

Question 6 – Correlation Coefficients

a.) Calculate the correlation coefficients for each pair of variables from Question 5. Use the argument use = "pairwise.complete.obs" in your cor() call to handle missing values.

# Correlation coefficient: height vs. shoe length

cor(nscc$ShoeLength, nscc$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

# Correlation coefficient: height vs. pulse rate

cor(nscc$PulseRate, nscc$Height, use = "pairwise.complete.obs")

## [1] 0.2028639
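
To illustrate what use = "pairwise.complete.obs" does, here is a small sketch on made-up vectors. The values below are hypothetical and are not taken from the NSCC data; they exist only to show how missing values are handled.

```r
# Hypothetical toy vectors (NOT the NSCC data), each with one missing value
height <- c(60, 62, NA, 67, 70, 72)
shoe   <- c( 8,  9, 10, NA, 11, 12)

# Default behaviour (use = "everything") returns NA when any value is missing
cor(shoe, height)

# Only rows where both values are present contribute to the correlation
cor(shoe, height, use = "pairwise.complete.obs")

# Equivalent manual version: filter to complete pairs first
ok <- complete.cases(shoe, height)
sum(ok)                       # number of pairs actually used
cor(shoe[ok], height[ok])
```

Checking sum(ok) on the real data is a reasonable habit, because a correlation based on few complete pairs is less stable than the full sample size suggests.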

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Based strictly on the correlation coefficients, shoe length is the better predictor of height. The correlation between shoe length and height is approximately 0.2696, which is stronger than the correlation between pulse rate and height, about 0.2029. However, both coefficients indicate only a weak positive linear relationship. Shoe length is therefore the better of the two explanatory variables, but neither is a good predictor of height.

Question 7 – Regression Equation and Prediction

a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.

# Fit and store a linear model of height ~ shoe length

lm_shoe <- lm(Height ~ ShoeLength, data = nscc)

# View the coefficients

lm_shoe

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Regression Equation: predicted height = 60.365 + 0.566 * (shoe length), with height and shoe length both in inches

b.) Use that model to predict the height of someone with a 10-inch shoe length.

# Prediction for shoe length = 10

0.566*10 + 60.365

## [1] 66.025

c.) Do you think that prediction is accurate? Explain why or why not.

The predicted value itself (about 66.0 inches) is a plausible height, but the prediction is probably not very accurate. The linear relationship between shoe length and height in this sample is weak, so actual heights are likely to vary widely around the value given by the regression equation.
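
One way to quantify how weak the model is: square the correlation reported in Question 6 to get the share of variation in height explained by shoe length. This sketch copies r from the earlier output instead of recomputing it, and it assumes the model was fit on the same complete pairs that cor() used.

```r
# Correlation between shoe length and height, copied from the Question 6 output
r_shoe <- 0.2695881

# For a one-variable linear model, R-squared is the square of r
r_squared_shoe <- r_shoe^2

round(100 * r_squared_shoe, 1)   # percent of variation in height explained
```

The result is about 7%, which leaves most of the variation in height unexplained by shoe length.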

Question 8 – Reflecting on Poor Models

a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?

Even before my analysis, I would have expected a poor relationship or no relationship between a student’s pulse rate and their height. There is no obvious reason to expect a person’s pulse rate to tell us much about how tall they are, so common sense suggests that the two variables would show little or no relationship.

b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?

I expected shoe length and height to show a stronger positive linear relationship than they did. For this specific sample of NSCC students, the relationship may have been weakened by the small sample size and by missing values. The sample also mixes students of different genders and ages. Additionally, if the values were self-reported and not measured, the responses may differ from the students’ actual heights and shoe lengths. Finally, even with accurate measurements, people of the same height can have quite different shoe lengths.