In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and
nscc_student_data. Do some exploratory analysis of each of
them on your own when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
#Look at the structure of the dataset
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
#Generate a summary of the data
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
#Create a scatterplot
plot(mtcars$wt, mtcars$mpg)
b.) Only by looking at the scatterplot, does there appear to be a
linear relationship between the weight and mpg of a car?
There appears to be a negative linear relationship between the weight
and mpg of a car.
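One way to make the visual check easier is to label the axes and overlay the least squares line on the scatterplot. The following is a minimal sketch using base R graphics; the axis labels and the object name fit are illustrative choices, not part of the assignment.

```r
# Scatterplot of mpg against weight with labeled axes
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Overlay the least squares line; 'fit' is an illustrative name
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit)
```

If the points scatter fairly evenly around the downward-sloping line, that supports describing the relationship as negative and roughly linear.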
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
#Find the correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear
relationship between the two variables.
Based on the correlation coefficient of -0.8677, we can describe the
relationship as a strong negative linear relationship.
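As a cross-check on what cor() reports, the coefficient can also be computed from its definition: the covariance of the two variables divided by the product of their standard deviations. This is a small sketch; the names x, y, r_manual, and r_builtin are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Correlation from its definition: covariance over the product of standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in calculation for comparison
r_builtin <- cor(x, y)

# TRUE if the two agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```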
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
#Create regression line between wt and mpg
(lmm <- lm(mpg ~ wt, data = mtcars))
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
y = -5.344x + 37.285, where x is the car's weight in thousands of pounds (wt) and y is the predicted mpg.
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
#Estimate with equation (wt is in 1000s of lbs, so 2,000 lbs is wt = 2)
(-5.344*(2)) + 37.285
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
#Estimate with equation (7,000 lbs is wt = 7)
(-5.344*(7)) + 37.285
## [1] -0.123
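The same estimates can be obtained without typing the coefficients by hand, by passing new weights to predict(). This is a sketch rather than the required method; new_cars and preds are illustrative names, and because predict() uses the unrounded coefficients, the results may differ from the hand calculations in the third decimal place.

```r
# Refit the model from part a
lmm <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in thousands of pounds, so 2,000 lbs is wt = 2 and 7,000 lbs is wt = 7
new_cars <- data.frame(wt = c(2, 7))   # 'new_cars' is an illustrative name

# Predicted mpg for each new weight
preds <- predict(lmm, newdata = new_cars)
preds
```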
d.) Do you think the predictions in parts b and c are reliable ones?
Explain why or why not.
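The prediction in part b is reasonably reliable: 2,000 lbs falls inside
the range of weights in the data (1,513 to 5,424 lbs), and weight and mpg
are strongly correlated. The prediction in part c is not reliable: 7,000
lbs lies well outside the observed range, so it is an extrapolation, and
the predicted value of -0.123 mpg is impossible. Even within the range, I
would not call the estimate exact, because the least squares regression
line is only a line of best fit and some variation in mpg is always left
unexplained.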
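A quick way to judge whether a prediction is an extrapolation is to compare the new weights with the range of weights the model was fitted on. The sketch below does that for the two weights from parts b and c; wt_range, new_wt, and in_range are illustrative names.

```r
# Observed range of the explanatory variable (thousands of pounds)
wt_range <- range(mtcars$wt)
wt_range

# Weights used in parts b and c, in thousands of pounds
new_wt <- c(2, 7)

# TRUE means the value lies inside the observed range
in_range <- new_wt >= wt_range[1] & new_wt <= wt_range[2]
in_range
```

A FALSE here does not prove that a prediction is wrong, only that the data offer no evidence that the linear pattern continues that far.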
What percent of the variation in a car’s mpg is explained by the car’s weight?
#Take a summary of the regression line to find Multiple R-Squared
summary(lmm)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
75.28% of the variation in a car’s mpg can be explained by the variation in a car’s weight.
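For a regression with a single explanatory variable, R-squared equals the square of the correlation coefficient, so the value can be pulled directly from the model summary and checked against cor(). A minimal sketch, with r_squared and r_squared_from_r as illustrative names:

```r
# Refit the model and extract R-squared from its summary
lmm <- lm(mpg ~ wt, data = mtcars)
r_squared <- summary(lmm)$r.squared

# Square of the correlation coefficient between wt and mpg
r_squared_from_r <- cor(mtcars$wt, mtcars$mpg)^2

# TRUE if the two agree up to floating-point tolerance
all.equal(r_squared, r_squared_from_r)
```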
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc <- read.csv("nscc_student_data.csv")
#Look at the structure of the dataset
str(nscc)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : chr "July 5" "December 27" "January 31" "6-13" ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : chr "No" "Yes" "Yes" "Yes" ...
## $ VoterReg : chr "Yes" "Yes" "No" "Yes" ...
#Take a summary
summary(nscc)
## Gender PulseRate CoinFlip1 CoinFlip2 Height
## Length:40 Min. :50.00 Min. :2 Min. :1.000 Min. : 6.00
## Class :character 1st Qu.:64.25 1st Qu.:4 1st Qu.:4.000 1st Qu.:62.00
## Mode :character Median :70.50 Median :5 Median :5.000 Median :66.00
## Mean :73.47 Mean :5 Mean :4.897 Mean :64.52
## 3rd Qu.:83.75 3rd Qu.:6 3rd Qu.:6.000 3rd Qu.:68.75
## Max. :98.00 Max. :8 Max. :8.000 Max. :76.00
## NA's :2 NA's :1 NA's :1 NA's :1
## ShoeLength Age Siblings RandomNum HoursWorking
## Min. : 7.00 Min. :18.00 Min. :0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 9.03 1st Qu.:19.75 1st Qu.:1.00 1st Qu.: 14.0 1st Qu.:17.25
## Median : 9.89 Median :21.50 Median :2.00 Median :273.0 Median :25.00
## Mean :10.33 Mean :24.70 Mean :2.15 Mean :313.7 Mean :25.65
## 3rd Qu.:11.00 3rd Qu.:28.00 3rd Qu.:2.25 3rd Qu.:531.5 3rd Qu.:32.75
## Max. :20.00 Max. :49.00 Max. :7.00 Max. :999.0 Max. :64.00
## NA's :5 NA's :1
## Credits Birthday ProfsAge Coffee
## Min. : 3.00 Length:40 Min. :26.00 Length:40
## 1st Qu.:10.00 Class :character 1st Qu.:28.00 Class :character
## Median :13.00 Mode :character Median :30.50 Mode :character
## Mean :11.78 Mean :31.10
## 3rd Qu.:15.00 3rd Qu.:32.25
## Max. :16.00 Max. :39.00
##
## VoterReg
## Length:40
## Class :character
## Mode :character
##
##
##
##
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
#First plot, shoe length explanatory (x-axis), height response (y-axis)
plot(nscc$ShoeLength, nscc$Height)
#Second plot, pulse rate explanatory (x-axis), height response (y-axis)
plot(nscc$PulseRate, nscc$Height)
Pass use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
#Shoe length correlation coefficient
cor(nscc$ShoeLength, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
#Pulse rate correlation coefficient
cor(nscc$PulseRate, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
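To see what use = "pairwise.complete.obs" does, it helps to try it on a few values. The vectors below are made up for illustration and are not taken from nscc_student_data.csv; the names shoe, height, keep, r_pairwise, and r_manual are likewise illustrative.

```r
# Made-up values for illustration only; not taken from nscc_student_data.csv
shoe   <- c(9.0, 10.5, NA, 11.0, 9.5, 12.0)
height <- c(62, 67, 66, NA, 64, 72)

# With the default use = "everything", any NA makes the result NA
r_default <- cor(shoe, height)
r_default

# Pairwise deletion keeps only the rows where both values are present
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# Equivalent manual approach for two vectors
keep <- complete.cases(shoe, height)
r_manual <- cor(shoe[keep], height[keep])

# TRUE if the two approaches agree
all.equal(r_pairwise, r_manual)
```

With only two variables, pairwise deletion and dropping incomplete rows give the same result; the two can differ when a whole correlation matrix is computed.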
#Create the linear model
(lmn <- lm(Height ~ ShoeLength, data = nscc))
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
#Calculate using regression equation
.566*(10) + 60.365
## [1] 66.025
a.) You hopefully found that these were both poor models. Which pair
of variables, based on common sense, would you have expected to have a
poor/no relationship before your analysis?
Based on common sense, I would have expected pulse rate and height to
have a poor/no relationship before my analysis.
b.) Perhaps you expected the other pair of variables to have a
stronger relationship than it ended up having. Can you come up with any
reasoning based on the specific sample of data for why the relationship
did not turn out to be very strong?
I did not expect either pair of variables to have a strong
relationship.
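The summary of the NSCC data above shows a minimum Height of 6.00 and a maximum ShoeLength of 20.00, values that look out of line with the rest of the sample. As a hedged illustration of how a single such value can weaken a correlation, the sketch below uses made-up numbers (not the NSCC data) and replaces one height with 6, as might happen if someone entered feet instead of inches. That explanation is an assumption; the dataset itself does not confirm it.

```r
# Made-up heights and shoe lengths in inches; NOT taken from nscc_student_data.csv
height <- c(62, 64, 66, 68, 70, 72, 74)
shoe   <- c(9.0, 9.6, 9.9, 10.6, 10.9, 11.6, 12.0)

# Correlation with all heights recorded in inches
r_clean <- cor(shoe, height)

# Simulate one data-entry error: 72 inches recorded as 6 (feet)
height_err <- height
height_err[6] <- 6

# Correlation after the single bad value is introduced
r_err <- cor(shoe, height_err)

c(clean = r_clean, with_error = r_err)
```

In a sample of only 40 students, one or two values like this could plausibly pull the correlation between shoe length and height well below what would otherwise be expected.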