Project #6 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each when you load them into your report.

Part 1 – mtcars Dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next four questions.

# Store mtcars dataset into environment
mtcars <- mtcars
# Display the first few rows of the dataset
head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Display dataset structure
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Question 1 – Scatterplot

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Scatterplot of mpg (response) vs. weight (explanatory)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "MPG",
     pch = 19,
     main = "MPG vs Weight", col = "blue")

b.) Based only on the scatterplot, does there appear to be a linear relationship between a car’s weight and its mpg?

Based on the scatterplot, there appears to be a strong, negative linear relationship between a car’s weight and its mpg: as weight increases, mpg decreases.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient between weight and mpg.

# Correlation coefficient
cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

A correlation coefficient of -0.868 indicates a strong, negative linear relationship between weight and mpg.
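As a sanity check, the coefficient can be reproduced from its definition: the covariance of the two variables divided by the product of their standard deviations. The following is a minimal sketch on the built-in mtcars data; the names r_manual and r_builtin are illustrative and not part of the assignment.

```r
# Pearson's r from its definition: cov(x, y) / (sd(x) * sd(y))
x <- mtcars$wt
y <- mtcars$mpg
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in calculation for comparison
r_builtin <- cor(x, y)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```

Both routes use the same n - 1 denominator, which cancels in the ratio, so the agreement does not depend on the sample size.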

Question 3 – Regression Model

a.) Create a least-squares regression equation to model the relationship between the weight and mpg of a car. Clearly state the full equation.

# Fit and store a linear model of mpg ~ weight
lm_cars <- lm(mpg ~ wt, data = mtcars)

# View the coefficients
coef(lm_cars)

## (Intercept)          wt 
##   37.285126   -5.344472

y = -5.344*x + 37.285, where x is the car’s weight (in 1000 lbs) and y is the predicted mpg.

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Prediction for 2,000 lb car (wt is measured in 1000 lbs, so x = 2)
-5.344*2+37.285

## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Prediction for 7,000 lb car (x = 7)
-5.344*7+37.285

## [1] -0.123

d.) Do you think both predictions in parts b and c are reliable? Explain why or why not.

The 2,000 lb estimate appears reliable because that weight falls within the range of weights in the dataset (roughly 1,500 to 5,400 lbs). The 7,000 lb estimate is not reliable. That weight lies well outside the observed range, so the prediction is an extrapolation, and the result is a negative mpg, which is physically impossible.
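The reliability check described above can be sketched in code: predict with the fitted model and flag any weight that falls outside the observed range of wt. This is only an illustration. The names new_wt, wt_range, and extrapolated are assumptions introduced here, and predict() uses the unrounded coefficients, so its results differ slightly from the hand calculation.

```r
# Refit the model so this sketch is self-contained
lm_cars <- lm(mpg ~ wt, data = mtcars)

# Weights (in 1000 lbs) at which to predict
new_wt <- data.frame(wt = c(2, 7))

# Predictions from the fitted model, using unrounded coefficients
pred <- predict(lm_cars, newdata = new_wt)

# Flag any weight outside the observed range as an extrapolation
wt_range <- range(mtcars$wt)
extrapolated <- new_wt$wt < wt_range[1] | new_wt$wt > wt_range[2]

data.frame(wt = new_wt$wt, predicted_mpg = pred, extrapolated = extrapolated)
```

A range check like this only catches extrapolation in the explanatory variable. It says nothing about how well the line fits inside that range.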

Question 4 – Explained Variation

What percent of the variability in a car’s mpg is explained by the car’s weight?

# R-squared value from the model summary
summary(lm_cars)$r.squared

## [1] 0.7528328

The R-squared value is approximately 0.753, meaning about 75.3% of the variability in mpg is explained by a car’s weight.
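For a regression with a single explanatory variable, R-squared equals the square of the correlation coefficient from Question 2. The sketch below checks that on mtcars; the variable names r and r_squared are illustrative.

```r
# Refit the model so this sketch is self-contained
lm_cars <- lm(mpg ~ wt, data = mtcars)

# Correlation coefficient and model R-squared
r <- cor(mtcars$wt, mtcars$mpg)
r_squared <- summary(lm_cars)$r.squared

# In simple linear regression, r^2 and R-squared coincide
all.equal(r^2, r_squared)
```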

Part 2 – NSCC Student Dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next questions.

# Store NSCC Student dataset into environment
nscc <- read.csv("nscc_student_data.csv")
# Display first few rows of data
head(nscc)

##   Gender PulseRate CoinFlip1 CoinFlip2 Height ShoeLength Age Siblings RandomNum
## 1 Female        64         5         5     62      11.00  19        4       797
## 2 Female        75         4         6     62      11.00  21        3       749
## 3 Female        74         6         1     60      10.00  25        2        13
## 4 Female        65         4         4     62      10.75  19        1       613
## 5 Female        NA        NA        NA     66         NA  26        6        53
## 6 Female        72         6         5     67       9.75  21        1       836
##   HoursWorking Credits    Birthday ProfsAge Coffee VoterReg
## 1           35      13      July 5       31     No      Yes
## 2           25      12 December 27       30    Yes      Yes
## 3           30       6  January 31       29    Yes       No
## 4           18       9        6-13       31    Yes      Yes
## 5           24      15       02-15       32     No      Yes
## 6           15       9    april 14       32     No      Yes

# Display structure of data
str(nscc)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...

Question 5 – Scatterplots

I’m curious whether a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable – one with shoe length as the explanatory variable, and one with pulse rate as the explanatory variable.

# Scatterplot: height vs. shoe length
plot(nscc$ShoeLength,
     nscc$Height,
     xlab = "Shoe Length (inches)",
     ylab = "Height (inches)",
     main = "Height vs Shoe Length",
     pch = 19,  col = "darkgreen")

# Scatterplot: height vs. pulse rate
plot(nscc$PulseRate,
     nscc$Height,
     xlab = "Pulse Rate",
     ylab = "Height (inches)",
     main = "Height vs Pulse Rate",
     pch = 19, col = "purple")

b.) Discuss the two scatterplots individually. Based only on the scatterplots, does there appear to be a linear relationship between the variables? Is the relationship weak, moderate, or strong? Based on the scatterplots alone, which explanatory variable appears to be the better predictor of height?

The height vs shoe length scatterplot shows a weak, positive linear relationship. Taller people appear to have slightly longer shoes, but the relationship is not strong.

The scatterplot of height vs pulse rate shows no clear linear relationship, so pulse rate does not appear to predict height well. Of the two, shoe length looks like the better predictor of height.

Question 6 – Correlation Coefficients

a.) Calculate the correlation coefficients for each pair of variables from Question 5. Use the argument use = "pairwise.complete.obs" in your cor() call to handle missing values.

# Correlation coefficient: height vs. shoe length
cor(nscc$Height, nscc$ShoeLength, use = "pairwise.complete.obs")

## [1] 0.2695881

# Correlation coefficient: height vs. pulse rate
cor(nscc$Height, nscc$PulseRate, use = "pairwise.complete.obs")

## [1] 0.2028639

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Shoe length is the better predictor of height, since its correlation coefficient is larger in absolute value (0.270 > 0.203).
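To see what use = "pairwise.complete.obs" does, here is a small sketch using only the six Height and ShoeLength values printed by head(nscc) above, one of which is missing. The vector names height, shoe, and complete are illustrative, and the resulting coefficient describes these six rows only, not the full dataset.

```r
# First six Height and ShoeLength values as shown by head(nscc)
height <- c(62, 62, 60, 62, 66, 67)
shoe   <- c(11, 11, 10, 10.75, NA, 9.75)

# By default, a single NA makes the result NA
r_default <- cor(height, shoe)

# With pairwise.complete.obs, pairs containing an NA are dropped first
r_pairwise <- cor(height, shoe, use = "pairwise.complete.obs")

# Equivalent manual approach: keep only rows where both values are present
complete <- complete.cases(height, shoe)
r_manual <- cor(height[complete], shoe[complete])

is.na(r_default)
all.equal(r_pairwise, r_manual)
```

Because rows are dropped pair by pair, the two correlations in part a may be based on different numbers of students if the missing values fall in different rows.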

Question 7 – Regression Equation and Prediction

a.) Create a linear model for height as the response variable with shoe length as the explanatory variable. State the full regression equation.

# Fit and store a linear model of height ~ shoe length
lm_nscc <- lm(Height ~ ShoeLength, data = nscc)

# View the coefficients
coef(lm_nscc)

## (Intercept)  ShoeLength 
##  60.3654950   0.5660485

y = 0.566*x + 60.365, where x is shoe length (inches) and y is the predicted height (inches).

b.) Use that model to predict the height of someone with a 10-inch shoe length.

# Prediction for shoe length = 10
predict(lm_nscc,
        newdata = data.frame(ShoeLength = 10))

##        1 
## 66.02598

The predicted height is slightly above 66 inches.

c.) Do you think that prediction is accurate? Explain why or why not.

The prediction is probably only roughly accurate. The relationship between shoe length and height in this sample is weak (r is about 0.27), so shoe length explains little of the variation in height, and other factors influence how tall a person is.

Question 8 – Reflecting on Poor Models

a.) You hopefully found that both models in Part 2 were relatively weak. Which pair of variables, based on common sense alone, would you have expected to show a poor or no relationship even before your analysis?

I expected height and pulse rate to show little or no relationship, since pulse rate is unrelated to how tall a person is.

b.) Perhaps you expected the other pair to show a stronger relationship than it did. Can you offer any reasoning – based on the specific sample of NSCC students – for why that relationship was not stronger?

I thought shoe length and height would have a stronger relationship. There is a wide variety of people at NSCC, which may weaken the relationship. It is also possible for people with similar shoe sizes to vary in height, which weakens the relationship further.