Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.


Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

plot(mtcars$wt,mtcars$mpg,main = "Mileage v. Car Weight", xlab = "weight (1000 lbs)", ylab = "miles per gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

There appears to be a negative linear relationship between the weight and mpg of cars. To my eye, the three cars with weights above 5000 lbs appear to be outliers that don’t fall on the line predicted by the other points. If I exclude these points, I see a strong negative linear relationship.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

# We use the cor() function to find the correlation strength between the two variables.

cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
#I wonder how much this changes if I exclude the three outliers.
mtcarsLight <- subset(mtcars,mtcars$wt <= 4.250)
cor(mtcarsLight$wt, mtcarsLight$mpg)
## [1] -0.8791779

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient of -0.8677 indicates a relatively strong negative relationship between the weight and mileage of cars. Removing the three high-weight cars had very little effect.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

mpgFit <- lm(mtcars$mpg ~ mtcars$wt)
mpgFit$coefficients
## (Intercept)   mtcars$wt 
##   37.285126   -5.344472
#How much does this change if we remove the cars that weigh more than 5,000 lbs?  
mpgFitLight <- lm(mtcarsLight$mpg ~ mtcarsLight$wt)
mpgFitLight$coefficients
##    (Intercept) mtcarsLight$wt 
##      41.392930      -6.821287

The regression equation for the full data set: mpg = 37.29 - 5.34*wt
The regression equation for cars under 5000 lbs: mpg = 41.39 - 6.82*wt

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

#Estimate based on all data
round(mpgFit$coefficients[[1]] + mpgFit$coefficients[[2]]*2, 1)
## [1] 26.6
#Estimate based on truncated data set
round(mpgFitLight$coefficients[[1]] + mpgFitLight$coefficients[[2]]*2, 1)
## [1] 27.8

From the best fit to all the data, the gas mileage of a 2000 lb car is predicted to be 26.6 mpg. Based on the smaller data set, the predicted mileage is 27.8 mpg.

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

round(mpgFit$coefficients[[1]] + mpgFit$coefficients[[2]]*7, 2)
## [1] -0.13

The best fit line to the full data set predicts a gas mileage of -0.13 mpg for a car of 7000 lbs.

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
I’ll look at the predictions based on the full data set. The prediction for mpg of a 2,000 lb car of 26.6 mpg falls in the general area of data points, so I feel that this is a valid estimate. The prediction for a 7,000 lb car is -0.13 mpg, which is a terrible estimate – it’s impossible to have a negative mileage value. Extrapolating to weights above the data set is not a good idea.

I want to know how different the estimates are for cars of 5,500 lbs.

#Estimate based on all data
round(mpgFit$coefficients[[1]] + mpgFit$coefficients[[2]]*5.5, 1)
## [1] 7.9
#Estimate based on truncated data set
round(mpgFitLight$coefficients[[1]] + mpgFitLight$coefficients[[2]]*5.5, 1)
## [1] 3.9

Estimate based on full data set: 7.9 mpg.
Estimate based on truncated data set: 3.9 mpg.
The linear fits both underpredict the gas mileage at 5,500 lbs. The data points are all above 10 mpg. It makes sense that the second fit (without data in the 5500 lb range), is a worse predictor at 5500lbs because we excluded data in this region to calculate the best fit line.

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

summary(mpgFit)
## 
## Call:
## lm(formula = mtcars$mpg ~ mtcars$wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## mtcars$wt    -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The R-squared value is 0.7528, which indicates that 75.28% of the variation in a car’s mileage is explained by the car’s weight.

I suspect that this value increases if we exclude the cars that weigh more than 5000 lbs. Let’s see…

summary(mpgFitLight)
## 
## Call:
## lm(formula = mtcarsLight$mpg ~ mtcarsLight$wt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9694 -2.0409 -0.6723  1.3505  6.0139 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     41.3929     2.1923  18.881  < 2e-16 ***
## mtcarsLight$wt  -6.8213     0.7115  -9.587 3.48e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.733 on 27 degrees of freedom
## Multiple R-squared:  0.773,  Adjusted R-squared:  0.7645 
## F-statistic: 91.92 on 1 and 27 DF,  p-value: 3.479e-10

Now the R-squared value is 0.773. I expected that dropping those three points would make more of a difference in the R-squared value! Not much improvement at all.


Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment

nscc <- read.csv("../nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

  1. Create two scatterplots, both with height as the response variable. One with shoelength as the explanatory variable and the other with pulse rate as the explanatory variable.
plot(nscc$Height ~ nscc$ShoeLength, main= "NSCC Height v. Shoe Length", xlab = "Shoe Length (in.)", ylab = "Height (in.)")

plot(nscc$Height ~ nscc$PulseRate, main= "NSCC Height v. Pulse Rate", xlab = "Pulse Rate (bpm)", ylab = "Height (in.)")

  1. Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

Let’s start with the plot of height v. shoe length. Values for shoe length are clustered between 7" and 13“, except for one outlier at 20”. All values for height fall between 60" and 76“. There appears to be no/weak relationship between the variables. The two largest values for height (75” and 76“) both fall at shoe length = 12”, which is one of the larger values for shoe length. So, perhaps we will find a slightly positive linear realtionship.

Now, let’s turn to the plot of height v. pulse rate. Again, there appears to be little if any relationship between the two variables. Pulse rate varies between 56 bpm and 98 bpm, with the exception of one outlier at 50 bpm. Height varies between 60" and 76“, except for one outlier at 6”. Even when this outlier is removed, my eye doesn’t see a relationship.

Neither explanatory variable appears to predict height. If I have to choose one over the other, I would say there is perhaps more of a relationship with shoe length than with pulse rate. We shall see…

Question 6 – Calculate correlation coefficients

  1. Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
cor(nscc$ShoeLength,nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc$PulseRate,nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2028639

The correlation coefficient for shoe length and height is 0.27. The correlation coefficient for pulse rate and height is 0.20.

  1. Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Neither explanatory variable predicts height. Correlation coefficients in this range indicate a very weak or no relationship. Technically, the correlation between shoe length and height is better than the correlation between pulse rate and height.

Let’s see what happens if I remove the outliers…

#Create subsets of the data that exclude outliers.  First the shoe length set.
nscc_shoe_cleaned <- subset(nscc, nscc$ShoeLength <= 14)

#Now the pulse rate subset
nscc_pulse_cleaned <- subset(nscc, nscc$PulseRate >= 55)

#Let's calculate new correlation coefficients
cor(nscc_shoe_cleaned$ShoeLength, nscc_shoe_cleaned$Height, use = "pairwise.complete.obs")
## [1] 0.3816193
cor(nscc_pulse_cleaned$PulseRate, nscc_pulse_cleaned$Height, use = "pairwise.complete.obs")
## [1] -0.2065758

Without the outlier, the correlation coefficient between shoe length and height increased from 0.27 to 0.38, but the relationship is still weak. The correlation between pulse rate and height stayed at the same absolute value, but flipped sign. Since the value of r already indicates a weak or no relationship, neglecting the outlier did not improve the correlation between height and pulse rate.

Question 7 – Creating and using a regression equation

  1. Create a linear model for height as the response variable with shoe length as a predictor variable.
lmHSL <- lm(Height ~ ShoeLength, data = nscc)

coefficients(lmHSL)
## (Intercept)  ShoeLength 
##  60.3654950   0.5660485

The best fit line to the data is Height = 60.37 + 0.57*ShoeLength.

  1. Use that model to predict the height of someone who has a 10" shoelength
round(lmHSL$coefficients[[1]] + lmHSL$coefficients[[2]]*10, 2)
## [1] 66.03

The linear least squares fit predicts a height of 66.03" based on a shoe length of 10“.

  1. Do you think that prediction is an accurate one? Explain why or why not.
    While the predicted value of 66" falls within the data points and appears to be a good prediction, I would not use shoe length to predict height because the relationship between the two variables is weak.

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?

I expected shoe length to be a better predictor of height than pulse rate. In general, we think of sizes to be better correlated – taller people wear larger shoes than shorter people. It seemed like pulse rate would be completely unrelated to height. The correlation coefficient of 0.20 supports this impression.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?

I did expect shoe length to be a better predictor of height than it turned out to be. Even when I eliminated one outlier, the correlation correlation improved to only 0.38.

I wonder what happens if we separate men and women?

#Create two subsets of data, one for men and one for women, continuing to exclude the outlier at shoe length = 20".

nscc_women <- subset(nscc_shoe_cleaned,nscc_shoe_cleaned$Gender == "Female")

nscc_men <- subset(nscc_shoe_cleaned,nscc_shoe_cleaned$Gender == "Male")

plot(nscc_women$Height ~ nscc_women$ShoeLength, main= "NSCC Women Height v. Shoe Length", xlab = "Shoe Length (in.)", ylab = "Height (in.)")

plot(nscc_men$Height ~ nscc_men$ShoeLength, main= "NSCC Men Height v. Shoe Length", xlab = "Shoe Length (in.)", ylab = "Height (in.)")

The plot for women doesn’t support more of a relationship than before. The plot for men might have a stronger positive relationship. Let’s run the numbers.

#Calculate correlation coefficients.
round(cor(nscc_women$Height,nscc_women$ShoeLength), 2)
## [1] 0.18
round(cor(nscc_men$Height,nscc_men$ShoeLength), 2)
## [1] 0.03

Nope! Separating the data by gender did not result in a better relationship!

Another possible reason for the weak relationship could be the sample sizes. The sample size of each data set (full, women, men) is rather small: 34 observations for both sexes (without the outlier), 23 women, and 11 men. Perhaps with a larger sample of students we would see a relationship between shoe size and height.