Project #6 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each when you load them into your report.

Part 1 – mtcars Dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next four questions.

# Store mtcars dataset into environment
mtcars <- mtcars
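Before moving on to the questions, a quick look at the data helps. The chunk below is only a minimal sketch of the exploratory step mentioned in the Preparation section; limiting the summary to the mpg and wt columns is my own choice, since those are the two variables used in Part 1.

```r
# Quick exploratory look at the built-in mtcars dataset
dim(mtcars)    # number of rows (cars) and columns (variables)
str(mtcars)    # variable names and types

# Summaries of the two variables used in Part 1
# (wt is recorded in thousands of pounds)
summary(mtcars[, c("mpg", "wt")])
```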

Question 1 – Scatterplot

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Scatterplot of mpg (response) vs. weight (explanatory)
plot(mpg ~ wt, data = mtcars, xlab = "Weight (1000 lbs)", ylab = "MPG", main = "MTCARS: Weight vs MPG")

b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?

There does seem to be a strong negative linear relationship between the two variables. The MPG of the car decreases as its weight increases.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient between weight and mpg.

# Correlation coefficient
cor(mtcars$mpg, mtcars$wt)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient is -0.87. Its closeness in value to -1 indicates that there is a strong negative linear relationship between the weight and MPG of a car.
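To see where that number comes from, the correlation coefficient can also be computed directly from its definition. This is only a sketch for checking the result; the names x, y and r_manual are introduced here for illustration.

```r
# Pearson's r from its definition, compared with cor()
x <- mtcars$wt
y <- mtcars$mpg

r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
all.equal(r_manual, cor(x, y))  # TRUE when the two agree up to floating-point error
```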

Question 3 – Regression Model

a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.

# Fit and store a linear model of mpg ~ weight
lmc <- lm(mpg ~ wt, data = mtcars)

# View the coefficients
lmc

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

The regression equation is: MPG = 37.285 - 5.344*Weight, where Weight is measured in thousands of pounds.
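As a visual check, the fitted line can be drawn over the scatterplot from Question 1. The sketch below refits the same model so that it runs on its own; the name b and the line color are illustrative choices.

```r
# Refit the model and pull out the coefficients by name
lmc <- lm(mpg ~ wt, data = mtcars)
b <- coef(lmc)
b  # named vector: (Intercept) and wt

# Scatterplot with the least-squares line overlaid
plot(mpg ~ wt, data = mtcars, xlab = "Weight (1000 lbs)", ylab = "MPG")
abline(lmc, col = "red")
```

The line slopes downward through the middle of the point cloud, which matches the negative slope in the equation.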

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Prediction for 2,000 lb car
predict(lmc, newdata = data.frame(wt = 2))

##        1 
## 26.59618

37.285 - 5.344*2

## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Prediction for 7,000 lb car
predict(lmc, newdata = data.frame(wt = 7))

##          1 
## -0.1261748

37.285 - 5.344*7

## [1] -0.123
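The hand calculations differ from predict() in the third decimal place because they use coefficients rounded to three decimals. A sketch that repeats the arithmetic with the unrounded coefficients (the names b, manual and from_predict are introduced here) should reproduce the predict() values:

```r
lmc <- lm(mpg ~ wt, data = mtcars)
b <- coef(lmc)  # unrounded intercept and slope

# Same arithmetic as above, but without rounding the coefficients
manual <- b[["(Intercept)"]] + b[["wt"]] * c(2, 7)
from_predict <- predict(lmc, newdata = data.frame(wt = c(2, 7)))

manual
all.equal(manual, unname(from_predict))  # TRUE when the two agree
```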

d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.

The prediction from part b is reliable because 2,000 lbs falls within the range of weights in the dataset (about 1,513 to 5,424 lbs), and a nearby data point agrees with it: the car weighing 2,140 lbs gets 26 MPG.

However, the prediction from part c is unreliable because it is an instance of extrapolation. A weight of 7,000 lbs lies well outside the observed range, and a negative MPG is not possible.
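Whether a prediction is an extrapolation can be checked by comparing the new weight with the range of weights in the data. A minimal sketch, with wt_range, new_wt and in_range as names made up for this check:

```r
# Observed range of weights, in thousands of pounds
wt_range <- range(mtcars$wt)
wt_range

# Weights used in parts b and c
new_wt <- c(2, 7)
in_range <- new_wt >= wt_range[1] & new_wt <= wt_range[2]

data.frame(wt = new_wt, in_range = in_range)
```

A value flagged FALSE lies outside the data the model was fit on, so the linear pattern is not guaranteed to hold there.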

Question 4 – Explained Variation

What percent of the variability in a car’s mpg is explained by the car’s weight?

# R-squared value from the model summary
summary(lmc)$r.squared

## [1] 0.7528328

The \(R^2\) is approximately 0.753, meaning that weight explains about 75.3% of the variability in a car’s MPG. The remaining 24.7% is attributable to other factors that are not included in the model.
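For a model with a single explanatory variable, \(R^2\) is the square of the correlation coefficient from Question 2, and it can also be written as one minus the ratio of residual variation to total variation. The sketch below checks both; the names r, sse and sst are introduced for illustration.

```r
lmc <- lm(mpg ~ wt, data = mtcars)
r_squared <- summary(lmc)$r.squared

# 1) Square of the correlation coefficient
r <- cor(mtcars$mpg, mtcars$wt)
all.equal(r^2, r_squared)

# 2) One minus (residual sum of squares / total sum of squares)
sse <- sum(residuals(lmc)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(1 - sse / sst, r_squared)
```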

Part 2 – NSCC Student Dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.

# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")
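Since Question 6 mentions missing values, it is worth counting them in the three columns used below. This is a sketch only: count_missing is a hypothetical helper written for this report, and the chunk assumes nscc_student_data.csv is in the working directory, skipping the check otherwise.

```r
# Hypothetical helper: number of missing values in selected columns
count_missing <- function(df, cols) {
  colSums(is.na(df[, cols, drop = FALSE]))
}

# Assumption: the CSV sits in the working directory, as in the chunk above
if (file.exists("nscc_student_data.csv")) {
  nscc <- read.csv("nscc_student_data.csv")
  str(nscc)
  print(count_missing(nscc, c("Height", "ShoeLength", "PulseRate")))
}
```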

Question 5 – Scatterplots

I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.

# Scatterplot: height vs. shoe length
plot(nscc$Height ~ nscc$ShoeLength, xlab = "Shoe Length (in)", ylab = "Height (in)", main = "Height vs. Shoe Length")

# Scatterplot: height vs. pulse rate
plot(nscc$Height ~ nscc$PulseRate, xlab = "Pulse Rate (bpm)", ylab = "Height (in)", main = "Height vs. Pulse Rate")

b.) Discuss the two scatter plots individually. Based only on the scatter plots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatter plots alone, which explanatory variable appears to be the better predictor of height?

The first scatter plot shows a weak, slightly positive linear relationship. As shoe length increases, height tends to increase slightly, and the points become somewhat more clustered. Of the two, shoe length appears to be the better predictor of height.

The second scatter plot shows no apparent relationship between the two variables; a line through the points would have a slope close to zero. Height does not seem to depend on pulse rate.

Question 6 – Correlation Coefficients

a.) Calculate the correlation coefficients for each pair of variables from Question 5. Use the argument use = "pairwise.complete.obs" in your cor() call to handle missing values.

# Correlation coefficient: height vs. shoe length
cor(nscc$Height, nscc$ShoeLength, use = "pairwise.complete.obs")

## [1] 0.2695881

# Correlation coefficient: height vs. pulse rate
cor(nscc$Height, nscc$PulseRate, use = "pairwise.complete.obs")

## [1] 0.2028639

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Shoe length is the better predictor of height because its correlation with height (about 0.27) is greater than the correlation between pulse rate and height (about 0.20). However, both values are close to zero, which indicates only a weak linear relationship in either case.
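The use = "pairwise.complete.obs" argument tells cor() to drop any observation where either value is missing. The sketch below illustrates this on made-up vectors (not NSCC data); height, shoe and complete are names invented for the example.

```r
# Toy vectors, made up for illustration only
height <- c(60, 65, NA, 70, 72)
shoe   <- c(9, NA, 10, 11, 12)

cor(height, shoe)  # NA, because missing values are present

# Keep only the rows where both values are observed
complete <- !is.na(height) & !is.na(shoe)
r_manual <- cor(height[complete], shoe[complete])

r_pairwise <- cor(height, shoe, use = "pairwise.complete.obs")
all.equal(r_pairwise, r_manual)  # TRUE: both use the same complete pairs
```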

Question 7 – Regression Equation and Prediction

a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.

# Fit and store a linear model of height ~ shoe length
lmn <- lm(Height ~ ShoeLength, data = nscc)

# View the coefficients
lmn

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

The regression equation is: Height = 60.365 + 0.566*ShoeLength, with both Height and ShoeLength measured in inches.

b.) Use that model to predict the height of someone with a 10-inch shoe length.

# Prediction for shoe length = 10
predict(lmn, newdata = data.frame(ShoeLength = 10))

##        1 
## 66.02598

60.365 + 0.566*10

## [1] 66.025

c.) Do you think that prediction is accurate? Explain why or why not.

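The predicted height for someone with a shoe length of 10 inches is about 66 inches. The data contain two students with a shoe length of exactly 10 inches, and their heights were 60 and 67 inches. The prediction is close to one of them but 6 inches off for the other, so it seems only somewhat reliable, which is consistent with the weak correlation found in Question 6.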
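One way to put a number on how much to trust this model is to square the correlation coefficients reported in Question 6, since for a model with one explanatory variable that square equals \(R^2\). The sketch below only reuses the two values printed above; r_shoe, r_pulse and pct_explained are names introduced here.

```r
# Correlation coefficients copied from the Question 6 output
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# Percent of the variability in height explained by each variable
pct_explained <- 100 * c(shoe_length = r_shoe^2, pulse_rate = r_pulse^2)
round(pct_explained, 1)
```

Shoe length explains only about 7% of the variability in height and pulse rate about 4%, so individual predictions from either model can be far off.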

Question 8 – Reflecting on Poor Models

a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?

The one that I definitely expected to show a poor relationship was that between height and pulse rate.

b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?

I did expect the relationship between shoe length and height to be stronger because, in my experience, taller people I’ve known have typically had longer feet, while shorter people have had shorter feet.

I think the relationship was not stronger because of our data set. All of the information is self-reported, and the sample is not very large. We also don’t know whether shoe lengths were reported in men’s or women’s sizes, or in UK or US sizes, and a few shoe length values were missing. This sample of students seemed to have generally larger feet, even those who were shorter. I feel that if a larger study were done, the relationship between shoe length and height would be stronger.