Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

# Let's see what is going on with this dataset!
summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

plot(mtcars$wt, mtcars$mpg, main = "Weight and MPG of cars", xlab = "Weight", ylab = "Miles Per Gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes. The points follow a clear downward trend that looks roughly linear: as the weight of a car increases, its miles per gallon decreases.
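One optional way to support a judgment made by eye is to overlay a smoothed curve on the scatterplot. The sketch below uses `lowess()` from base R; the name `smooth_fit` and the axis label noting that `wt` is recorded in thousands of pounds are additions for illustration, not part of the assignment. If the smoothed curve is close to a straight line, a linear model is a reasonable choice.

```r
# Scatterplot of weight (recorded in 1000 lbs) against miles per gallon
plot(mtcars$wt, mtcars$mpg,
     main = "Weight and MPG of cars",
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")

# Overlay a lowess smooth; a roughly straight curve suggests a linear trend
smooth_fit <- lowess(mtcars$wt, mtcars$mpg)
lines(smooth_fit, col = "blue", lwd = 2)
```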

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

The linear correlation coefficient of these two variables is approximately -0.868.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong, negative linear relationship between these variables.
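As a check on what `cor()` reports, the coefficient can be rebuilt from its definition: the covariance of the two variables divided by the product of their standard deviations. This is only a sketch; the names `x`, `y`, `r_manual`, and `r_builtin` are chosen here for illustration.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson correlation from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The built-in function for comparison
r_builtin <- cor(x, y)

# TRUE when the two values agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```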

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Fit the model once, store it, then print it
lm1 <- lm(formula = mpg ~ wt, data = mtcars)
lm1

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Regression equation: wt is recorded in 1000 lbs, so 2,000 lbs is wt = 2
37.285 - 5.344 * 2

## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# 7,000 lbs is wt = 7
37.285 - 5.344 * 7

## [1] -0.123

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b is reasonable, but the one in part c is not. A weight of 2,000 lbs falls inside the range of weights in the dataset (1,513 to 5,424 lbs), and about 27 miles per gallon is realistic for a small car. A weight of 7,000 lbs is far heavier than any car in the dataset, so that estimate is an extrapolation beyond the data the model was built from. The result confirms the problem: -0.123 miles per gallon is impossible, because a car cannot get negative mileage.
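The estimates in parts b and c can also be computed with `predict()`, which uses the unrounded coefficients, and each requested weight can be compared with the range of weights the model was fitted on. The sketch below refits the model under the name `fit` so it runs on its own; `new_cars`, `preds`, `wt_range`, and `in_range` are names chosen here for illustration.

```r
# Refit the model so this chunk is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in 1000 lbs, so 2,000 lbs and 7,000 lbs are wt = 2 and wt = 7
new_cars <- data.frame(wt = c(2, 7))
preds <- predict(fit, newdata = new_cars)

# Flag any weight outside the observed range (a sign of extrapolation)
wt_range <- range(mtcars$wt)
in_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]

data.frame(wt = new_cars$wt,
           predicted_mpg = preds,
           within_observed_range = in_range)
```

A prediction flagged as outside the observed range is not automatically wrong, but the data give no evidence that the linear pattern continues that far.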

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

summary(lm1)$r.squared

## [1] 0.7528328

The multiple R-squared is 0.7528328, so about 75.3% of the variation in the mpg of cars is explained by the weight of the cars.
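For a regression with a single explanatory variable, R-squared equals the square of the linear correlation coefficient. The sketch below checks this on mtcars; `fit`, `r`, and `r_squared` are names chosen here for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Correlation between the explanatory and response variables
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared reported by the fitted model
r_squared <- summary(fit)$r.squared

# TRUE when r^2 and R-squared agree up to floating-point tolerance
all.equal(r^2, r_squared)
```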

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("~/Desktop/honorsStats/nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

# One plot
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")

# Other plot
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")

Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.
Shoe length and height appear to have a weak, positive linear relationship. One outlier stands out: a student with a very large shoe length who is not especially tall. Pulse rate and height do not appear to have any linear relationship. All of the students are between about 56 and 76 inches tall, yet their pulse rates range from about 56 to about 98 beats per minute with no visible pattern. Based on the scatterplots, shoe length appears to be the better predictor of height, although not a strong one.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
Shoe length is a better predictor of height than pulse rate because its correlation coefficient is larger (about 0.270 versus about 0.203). However, neither variable is a good predictor of height: both have only a weak, positive linear relationship with height.
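To see what `use = "pairwise.complete.obs"` does, the sketch below uses two short made-up vectors (`shoe` and `height` are hypothetical values, not taken from nscc_student_data). With any missing value, the default call returns `NA`. The pairwise option drops every pair in which either value is missing, which for two vectors is the same as keeping only the complete cases.

```r
# Hypothetical values for illustration only, each with one missing entry
shoe   <- c(9.5, 10, NA, 11, 12, 10.5)
height <- c(64, 66, 70, NA, 72, 67)

# Default behaviour: any NA makes the result NA
r_default <- cor(shoe, height)

# Drop pairs where either value is missing
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# Equivalent by hand: keep only the rows where both values are present
keep <- complete.cases(shoe, height)
r_manual <- cor(shoe[keep], height[keep])

all.equal(r_pairwise, r_manual)
```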

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

# Fit the model once, store it, then print it
lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_student_data)

lm2

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Use that model to predict the height of someone who has a 10-inch shoe length.

60.365 + 0.566*10

## [1] 66.025

Do you think that prediction is an accurate one? Explain why or why not.
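Not very. A height of about 66 inches is plausible and is consistent with the scatterplot of height and shoe length, but the model behind it is weak. With a correlation of about 0.27, shoe length explains only about 7% of the variation in height (0.27^2 is about 0.07), so the actual height of someone with a 10-inch shoe length could differ from 66 inches by a wide margin.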
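One way to express how much a single prediction can be trusted is a prediction interval, which `predict()` returns when called with `interval = "prediction"`. Because the student data file is not available outside this report, the sketch below uses simulated data: the data frame `sim`, its coefficients, and the noise level are all assumptions made up for illustration, so the numbers it prints say nothing about the real students.

```r
set.seed(1)

# Simulated data for illustration only (not the NSCC student data)
n <- 40
shoe <- runif(n, min = 8, max = 13)
height <- 60 + 0.6 * shoe + rnorm(n, sd = 4)
sim <- data.frame(ShoeLength = shoe, Height = height)

fit <- lm(Height ~ ShoeLength, data = sim)

# Point prediction plus a 95% prediction interval for a 10-inch shoe length
pred <- predict(fit,
                newdata = data.frame(ShoeLength = 10),
                interval = "prediction",
                level = 0.95)
pred
```

A wide interval around the fitted value indicates that the model leaves most of the variation in the response unexplained, even when the point prediction looks plausible.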

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected little or no relationship between pulse rate and height, because there is no obvious reason a person's heart rate would depend on how tall they are.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected to see a strong correlation between height and shoe length. My guess is that the relationship was weaker than expected because some college students may still be growing, and height and foot size do not increase at the same rate during growth. The outlier in the scatterplot (a student with a very large shoe length who is not especially tall) may also have weakened the correlation.