Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

#Create scatterplot of the weight variable and miles per gallon variable of the dataset mtcars.

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

plot(mtcars$wt, mtcars$mpg, main = "Car Weight and Miles Per Gallon", xlab = "Car Weight (1000 lbs)", ylab = "Miles Per Gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

Yes, by looking at the scatterplot it appears that there is a moderate, negative linear relationship between the weight and mpg of a car.
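As a visual aid for judging linearity, a smoothed trend can be drawn over the same scatterplot. The sketch below is one possible approach, not part of the original assignment: it uses base R's lowess() smoother, and the object name `smooth_fit` is chosen here for illustration. If the smoothed curve stays close to a straight, downward-sloping line, that supports reading the relationship as linear and negative.

```r
# Scatterplot of weight (explanatory) against mpg (response)
plot(mtcars$wt, mtcars$mpg,
     main = "Car Weight and Miles Per Gallon",
     xlab = "Car Weight (1000 lbs)",
     ylab = "Miles Per Gallon")

# lowess() returns a list with x and y coordinates of a locally smoothed trend
smooth_fit <- lowess(mtcars$wt, mtcars$mpg)

# Draw the smoothed trend on top of the points
lines(smooth_fit, col = "red", lwd = 2)
```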

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

#Correlation coefficient of the weight and mpg variables.
cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient of -0.8676594 indicates a strong negative linear relationship between the two variables.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Create regression line with lm() function and store into object called lm1
lm1 <- lm(mpg ~ wt, data = mtcars)

lm1

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

The regression equation that models the relationship between the weight and mpg of a car is: y = 37.285 - 5.344x, where x is the weight in thousands of pounds and y is the predicted mpg.

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

37.285 + (-5.344*2)

## [1] 26.597

According to the regression equation, a car that weighs 2,000 lbs is estimated to get 26.597 mpg, or about 26.6 mpg.

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

37.285 + (-5.344*7)

## [1] -0.123

According to the regression equation, a car that weighs 7,000 lbs is estimated to get -0.123 mpg.
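The two estimates above were typed by hand from the rounded coefficients. An alternative sketch is to let predict() apply the stored model, which uses the unrounded coefficients, so the results may differ from the hand calculations in the later decimal places. The names `new_cars` and `pred` are introduced here for illustration only.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in thousands of pounds, so 2,000 lbs is 2 and 7,000 lbs is 7
new_cars <- data.frame(wt = c(2, 7))

# predict() applies the fitted equation to each row of new_cars
pred <- predict(lm1, newdata = new_cars)
pred
```

The column name in `new_cars` has to match the predictor name in the model formula (`wt`), otherwise predict() cannot find the values to plug in.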

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

I believe the prediction in part b is reliable. The estimate is realistic, and 2,000 lbs falls within the range of car weights shown in the scatterplot, so the model is predicting among data that were actually collected. The prediction in part c is not reliable. A negative mpg is impossible, and the data include no car as heavy as 7,000 lbs, so that estimate is an extrapolation beyond the observed data.
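Whether a prediction is an extrapolation can also be checked directly against the observed weights rather than by eye. The following is a minimal sketch; the helper `inside_range` is a hypothetical name defined here, not a built-in function.

```r
# Smallest and largest observed weights, in thousands of pounds
wt_range <- range(mtcars$wt)
wt_range

# Hypothetical helper: TRUE when a weight lies within the observed range
inside_range <- function(w) w >= wt_range[1] & w <= wt_range[2]

# 2 (2,000 lbs) is inside the observed range; 7 (7,000 lbs) is not
inside_range(c(2, 7))
```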

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

summary(lm1)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The multiple R-squared value is 0.7528, and therefore 75.28% of the variation in a car’s mpg is explained by the car’s weight.
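Instead of reading the value off the printed summary, R-squared can be pulled out of the summary object. For a regression with a single predictor it should also equal the square of the correlation coefficient from Question 2, which the sketch below checks. The names `r_squared` and `r` are introduced here for illustration.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# The summary object stores R-squared as a named element
r_squared <- summary(lm1)$r.squared
round(r_squared, 4)

# With one predictor, R-squared is the squared correlation coefficient
r <- cor(mtcars$wt, mtcars$mpg)
all.equal(r_squared, r^2)
```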

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("~/Desktop/stats/nscc_student_data.csv")
if (interactive()) View(nscc_student_data)  # open the data viewer only in an interactive session
str(nscc_student_data)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 39 levels "03.14.1984","11-Jul",..: 28 22 25 5 11 8 15 13 23 19 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

# Create a scatterplot with height as the response variable and shoe length as the explanatory variable

plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe length and Height", xlab = "Shoe length", ylab = "Height")

# Create a scatterplot with height as the response variable and pulse rate as the explanatory variable

plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")

Discuss the two scatterplots individually. Based only on glancing at the scatter plots, does there appear to be a linear relationship between the variables? If so, is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

In the scatterplot of shoe length and height, there appears to be a weak positive linear relationship between the variables.

In the scatterplot of pulse rate and height, there appears to be no linear relationship.

Based on the scatterplots, neither explanatory variable appears to be a good predictor of a student's height. However, if I had to choose one of the two, it would be shoe length. Even though its linear relationship with height is weak, it looks stronger than the relationship between pulse rate and height.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

# Find the correlation coefficient between Shoe length and Height
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

#Find the correlation coefficient between Pulse Rate and Height
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

The correlation coefficient between Shoe length and Height is 0.27 and the correlation coefficient between Pulse Rate and Height is 0.20.

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Based on the correlation coefficients, shoe length is the better predictor of height because its coefficient (0.27) is farther from zero than the coefficient for pulse rate (0.20).
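To illustrate what use = "pairwise.complete.obs" does, the sketch below uses a small made-up data frame rather than the NSCC data. The data frame `toy` and its values are hypothetical and serve only to show the behavior: without the argument, a single missing value makes the result NA; with it, only rows where both values are present are used.

```r
# Hypothetical toy data (NOT the NSCC dataset) with one NA in each column
toy <- data.frame(
  shoe   = c(10, 11, NA, 9.5, 12, 10.5),
  height = c(64, 68, 66, NA, 72, 65)
)

# Default behavior: any missing value makes the correlation NA
cor(toy$shoe, toy$height)

# Use only the pairs where both values are present
r_pairwise <- cor(toy$shoe, toy$height, use = "pairwise.complete.obs")

# The same calculation done by hand on the complete rows
complete <- complete.cases(toy)
r_manual <- cor(toy$shoe[complete], toy$height[complete])

# Number of pairs that actually contribute to the correlation
sum(complete)
```

Checking how many complete pairs remain is worthwhile with real data, since a correlation computed from only a few pairs is less trustworthy.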

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

#Create a linear model by using lm function
lm2 <- lm(Height ~ ShoeLength, data =  nscc_student_data)

lm2

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

The regression equation is: y = 60.365 + 0.566x, where x is the shoe length and y is the predicted height, both in inches.

Use that model to predict the height of someone who has a 10" shoe length.

# Compute the estimated height of someone who has a 10" shoe length.

60.365 + (0.566 * 10)

## [1] 66.025

According to the regression equation with height as the response variable and shoe length as the predictor variable, someone with a 10" shoe length is predicted to be 66.025" tall, or about 66".

Do you think that prediction is an accurate one? Explain why or why not.

No, I don't think the prediction is an accurate one. The scatterplot of shoe length and height shows only a weak linear relationship between the variables, and the correlation coefficient found in Question 6 was 0.2696. Both of these points imply that shoe length is not a strong predictor of height.
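One way to quantify how weak these models are is to square the correlation coefficients reported in Question 6, which for a single-predictor regression gives the proportion of variation in height that each variable explains. The sketch below uses only the values printed earlier in this report; the names `r_shoe` and `r_pulse` are introduced here for illustration.

```r
# Correlation coefficients as reported in Question 6
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# Squaring r gives the proportion of variation explained by each predictor
explained <- c(shoe_length = r_shoe^2, pulse_rate = r_pulse^2)
round(explained, 3)
```

Both proportions come out well under 10%, which is consistent with treating the 66" estimate as a rough guess rather than an accurate prediction.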

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?

Before my analysis, I expected pulse rate and height to have a poor relationship or no relationship, because pulse rate doesn’t play a role in determining someone’s height, even though height may potentially have an impact on someone’s heart rate. Many other variables, such as gender and genes, may have a stronger relationship to height, so I did not expect pulse rate to be a strong predictor of height.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?

I expected shoe length to be a better predictor of height than it was because, typically, the taller someone is, the larger their shoe size is likely to be. Even though that statement makes sense and is true most of the time, it is not always true. Additionally, many other factors determine someone’s height.