Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.


Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each when you load them into your report.


Part 1 – mtcars Dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next four questions.

# Store mtcars dataset into environment
mtcars <- mtcars

# Structure of mtcars dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The mtcars dataset includes 32 observations of 11 numeric variables, all descriptors of the characteristics of different types of cars.
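Because Part 1 uses only the wt and mpg columns, a quick numeric summary of those two is a reasonable first exploratory step. The sketch below uses base R and the built-in mtcars data; the names `part1_vars` and `n_missing` are illustrative and not part of the assignment.

```r
# Columns used throughout Part 1
part1_vars <- mtcars[, c("mpg", "wt")]

# Min, quartiles, median, mean, and max for each column
summary(part1_vars)

# First few rows, to see what individual observations look like
head(part1_vars)

# Count missing values per column (cor() and lm() are sensitive to NAs)
n_missing <- colSums(is.na(part1_vars))
n_missing
```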

Question 1 – Scatterplot

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Scatterplot of mpg (response) vs. weight (explanatory)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "MPG (Miles per Gallon)", main = "MPG in Relation to Weight of Cars")

b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?

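Yes. The points fall roughly along a downward-sloping straight line, so there appears to be a negative linear relationship between the variables: heavier cars tend to get fewer miles per gallon.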
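One way to check the visual impression is to overlay the least-squares line on the scatterplot. This is a minimal sketch using base R graphics; the object name `fit` is illustrative, and the line color and width are arbitrary choices.

```r
# Redraw the scatterplot of mpg against weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "MPG (Miles per Gallon)",
     main = "MPG in Relation to Weight of Cars")

# Fit a simple linear model and draw its line over the points
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lwd = 2)
```

If the points scatter evenly around the line with no obvious curve, a linear description is plausible.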

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient between weight and mpg.

# Correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient (r ≈ -0.87) is close to -1, indicating a strong negative linear relationship between the two variables: as weight increases, mpg tends to decrease.
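To see where that number comes from, the correlation coefficient can also be computed as the covariance divided by the product of the two standard deviations. The sketch below does this by hand and compares it with cor(); the names `x`, `y`, and `r_manual` are illustrative.

```r
x <- mtcars$wt   # explanatory variable
y <- mtcars$mpg  # response variable

# Pearson correlation: covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual

# Should agree with the built-in function
all.equal(r_manual, cor(x, y))
```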

Question 3 – Regression Model

a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.

# Fit and store a linear model of mpg ~ weight
lm1 <- lm(mpg ~ wt, data = mtcars)

# View the coefficients
lm1
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

Regression Equation: mpg = 37.285 - 5.344 * wt
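Rather than retyping the coefficients, they can be pulled from the fitted model with coef(). The sketch below also checks the standard identity for simple regression that the slope equals r times sd(y)/sd(x); the names `b` and `slope_check` are illustrative.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# Named vector of coefficients: "(Intercept)" and "wt"
b <- coef(lm1)
round(b, 3)

# In simple linear regression, slope = r * sd(y) / sd(x)
slope_check <- cor(mtcars$wt, mtcars$mpg) * sd(mtcars$mpg) / sd(mtcars$wt)
all.equal(slope_check, b[["wt"]])
```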

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Prediction for 2,000 lb car
37.285 - 5.344 * 2
## [1] 26.597
# Prediction using the predict function
predict(lm1, newdata = data.frame(wt = 2))
##        1 
## 26.59618

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Prediction for 7,000 lb car
37.285 - 5.344 * 7
## [1] -0.123
# Prediction using the predict function
predict(lm1, newdata = data.frame(wt = 7))
##          1 
## -0.1261748

d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.

The prediction in part b appears reliable because 2,000 lbs lies within the range of weights observed in the dataset, and there are observations for cars that weigh around 2,000 lbs. The prediction in part c is unreliable because it is an extrapolation: no car in the dataset weighs anywhere near 7,000 lbs, so we cannot know whether the data continue in the same linear trend that far out. The predicted value itself, a negative mpg, is physically impossible, which shows that the model breaks down outside the observed range.
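Whether a prediction is an interpolation or an extrapolation can be checked directly against the observed weights. The helper below, `in_observed_range`, is a hypothetical function written for this illustration; it is not part of base R.

```r
# Hypothetical helper: TRUE when a new weight falls inside the observed range
in_observed_range <- function(new_wt, observed = mtcars$wt) {
  new_wt >= min(observed) & new_wt <= max(observed)
}

# Observed weights, in units of 1000 lbs
range(mtcars$wt)

# Check the two weights from parts b and c (2 and 7, i.e. 2,000 and 7,000 lbs)
in_observed_range(c(2, 7))
## [1]  TRUE FALSE
```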

Question 4 – Explained Variation

What percent of the variability in a car’s mpg is explained by the car’s weight?

# R-squared value from the model summary
summary(lm1)$r.squared
## [1] 0.7528328

About 75% of the variability in a car’s mpg (R-squared ≈ 0.753) can be explained by the car’s weight.
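For a regression with a single explanatory variable, R-squared is simply the square of the correlation coefficient from Question 2. The short check below illustrates this; the names `r` and `r_squared` are illustrative.

```r
# Correlation between weight and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared reported by the fitted model
r_squared <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

# The two quantities should match
all.equal(r^2, r_squared)
```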


Part 2 – NSCC Student Dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.

# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")

# Structure of NSCC Student dataset
str(nscc)
## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...

The NSCC student dataset is a random sample of 40 students described by 15 variables: a mix of numeric (integer and decimal) and character variables. Several variables, including PulseRate and ShoeLength, contain missing values (NA).

Question 5 – Scatterplots

I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.

# Scatterplot: height vs. shoe length
plot(nscc$ShoeLength, nscc$Height, xlab = "Shoe Length (in)", ylab = "Height (in)", main = "Height in Relation to Shoe Length in NSCC Students")

# Scatterplot: height vs. pulse rate
plot(nscc$PulseRate, nscc$Height, xlab = "Pulse Rate (BPM)", ylab = "Height (in)", main = "Height in Relation to Pulse Rate in NSCC Students")

b.) Discuss the two scatterplots individually. Based only on the scatterplots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatterplots alone, which explanatory variable appears to be the better predictor of height?

For height and shoe length, there appears to be a weak positive linear relationship. The points are scattered loosely around what is most likely the mean height and mean shoe length of the students, with only a slight upward trend. For height and pulse rate, the heights stay at about the same level as pulse rate increases, so there is little or no visible linear relationship. Based on the scatterplots alone, shoe length appears to be the better explanatory variable for predicting height.

Question 6 – Correlation Coefficients

a.) Calculate the correlation coefficients for each pair of variables from Question 5. Use the argument use = "pairwise.complete.obs" in your cor() call to handle missing values.

# Correlation coefficient: height vs. shoe length
cor(nscc$ShoeLength, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
# Correlation coefficient: height vs. pulse rate
cor(nscc$PulseRate, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2028639

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Based on the correlation coefficients, shoe length (r ≈ 0.27) is a better predictor of height than pulse rate (r ≈ 0.20). Both values are close to 0, however, so both linear relationships are weak and neither variable is a strong predictor.
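The use argument matters because cor() returns NA by default when either vector has a missing value. The sketch below shows this on two short made-up vectors (toy values for illustration only, not taken from the NSCC data); all variable names are illustrative.

```r
# Made-up toy vectors, each with one missing value
x <- c(10, 11, NA, 12, 9.5)
y <- c(62, 66, 70, NA, 61)

# Default behaviour: any NA makes the result NA
r_default <- cor(x, y)
r_default

# Drop incomplete pairs before computing the correlation
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

# Equivalent by hand: keep only rows where both values are present
keep <- complete.cases(x, y)
r_by_hand <- cor(x[keep], y[keep])
all.equal(r_pairwise, r_by_hand)
```

With only two variables, "pairwise.complete.obs" and "complete.obs" drop the same rows; they can differ when cor() is given a whole data frame.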

Question 7 – Regression Equation and Prediction

a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.

# Fit and store a linear model of height ~ shoe length
lm2 <- lm(Height ~ ShoeLength, data = nscc)

# View the coefficients
lm2
## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Regression Equation: Height = 60.365 + 0.566 * ShoeLength

b.) Use that model to predict the height of someone with a 10-inch shoe length.

# Prediction for shoe length = 10
60.365 + 0.566 * 10
## [1] 66.025
# Prediction for shoe length = 10 using predict function
predict(lm2, newdata = data.frame(ShoeLength = 10))
##        1 
## 66.02598

c.) Do you think that prediction is accurate? Explain why or why not.

In the dataset, the students who had a shoe length of 10 inches had heights of 60 inches and 67 inches, so the prediction of about 66 inches is plausible at a glance, though it may be a slight overestimate. However, a student with a shoe length of 10.5 inches had a height of 70 inches, and another student with a shoe length of 10.75 inches had a height of 62 inches. Since most people's heights and shoe lengths cluster around typical values, the regression equation can predict height from shoe length to an extent. However, the correlation coefficient (r ≈ 0.27) indicates only a weak linear relationship, so individual predictions from the regression equation can easily be off by several inches.
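Squaring the correlation coefficients from Question 6 gives a rough sense of how little of the variation in height each explanatory variable accounts for. This is plain arithmetic on the values reported above; the names `r_shoe` and `r_pulse` are illustrative.

```r
# Correlations reported in Question 6
r_shoe  <- 0.2695881  # height vs. shoe length
r_pulse <- 0.2028639  # height vs. pulse rate

# Share of the variation in height explained by each variable (r squared)
round(r_shoe^2, 3)
round(r_pulse^2, 3)
```

Both shares are well under 10%, which is consistent with treating the prediction in part b with caution.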

Question 8 – Reflecting on Poor Models

a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?

I did not expect height and pulse rate to have anything to do with each other. I expected shoe length and height to have somewhat of a relationship, but not a very strong one.

b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?

I think height and shoe length vary too much between men and women, so the analysis should probably be done separately by gender. Also, shoe sizes are not very consistent across manufacturers and brands. Someone could be a size 8 in one brand but a size 10 in another, so shoe measurements may not reflect foot length accurately.
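The idea of separating the analysis by gender could be tried by computing the correlation within each group. The function `cor_by_group` below is a hypothetical helper written for this illustration, and the commented call assumes the data frame is loaded as `nscc` with the column names shown in str(nscc).

```r
# Hypothetical helper: correlation between columns x and y within each level of group
cor_by_group <- function(df, x, y, group) {
  sapply(split(df, df[[group]]), function(g) {
    cor(g[[x]], g[[y]], use = "pairwise.complete.obs")
  })
}

# With the NSCC data loaded as `nscc`, the call would look like:
# cor_by_group(nscc, "ShoeLength", "Height", "Gender")
```

With only 40 students, each group would be small, so any within-group correlation should be read cautiously.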