In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
# Store datasets into environment
mtcars <- mtcars
nscc_students <- read.csv("C:/Users/naltidor01/Downloads/nscc_student_data.csv")
a.) How many rows are there and what does each row represent?
There are 32 rows in the mtcars dataset, each one representing one car model.
There are 40 rows in the nscc_student_data dataset, each one representing one student who participated in the survey.
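b.) How many variables are there and what do they represent?
There are 11 variables in mtcars, each representing a characteristic of the cars (for example mpg, weight, and horsepower).
There are 15 variables in nscc_student_data, each recording a measurement or survey response for the sampled NSCC students.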
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
dim(mtcars)
## [1] 32 11
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Scatterplot of the weight variable and the miles per gallon variable
plot(mtcars$wt, mtcars$mpg, main = "MPG vs. Weight", xlab = "Weight (1000 lbs)", ylab = "MPG")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes. There appears to be a moderately strong negative linear relationship between the two variables: as weight increases, miles per gallon tends to decrease.
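A visual check of linearity can be made a little easier by overlaying a fitted line and a smoother on the scatterplot. The following is a minimal sketch, not part of the original assignment; the object names `fit_line` and `smooth` are chosen here for illustration. If the lowess curve stays close to the straight line, a linear description is plausible.

```r
# Scatterplot of mpg against weight with two overlays
plot(mpg ~ wt, data = mtcars,
     main = "MPG vs. Weight", xlab = "Weight (1000 lbs)", ylab = "MPG")

# Least-squares straight line
fit_line <- lm(mpg ~ wt, data = mtcars)
abline(fit_line, col = "blue")

# Lowess smoother, which follows the local trend without assuming linearity
smooth <- lowess(mtcars$wt, mtcars$mpg)
lines(smooth, col = "red", lty = 2)
```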
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
# Correlation coefficient of the weight and mpg variables
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong negative linear relationship between the weight and mpg variables.
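To see where the coefficient comes from, the sketch below recomputes it from its definition, the covariance divided by the product of the two standard deviations, and compares the result with `cor()`. The names `r_manual` and `r_builtin` are illustrative only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson correlation from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in calculation for comparison
r_builtin <- cor(x, y)

# TRUE when the two values agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```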
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Regression equation to model the relationship between the weight and mpg of a car
lm1 <- lm(formula = mpg ~ wt, data = mtcars)
lm1
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
1.) Equation
\(\hat{y} = 37.285 - 5.344x\), where \(x\) is the weight of the car in 1000 lbs and \(\hat{y}\) is the predicted mpg.
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Predicted mpg at wt = 2 (wt is measured in 1000 lbs, so 2,000 lbs is 2)
37.285 - 5.344*2
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# Predicted mpg at wt = 7 (wt is measured in 1000 lbs, so 7,000 lbs is 7)
37.285 - 5.344*7
## [1] -0.123
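The same estimates can also be obtained from the fitted model object rather than by typing rounded coefficients by hand. This is a sketch of that alternative; `new_cars` and `pred` are names introduced here. Because `predict()` uses the unrounded coefficients, the results may differ from the hand calculations in the third decimal place.

```r
# Refit the model so this block is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# Weights of interest, in 1000 lbs (2,000 lbs and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))

# Predicted mpg for each weight, using the full-precision coefficients
pred <- predict(lm1, newdata = new_cars)
pred
```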
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
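The prediction in part b is reasonably reliable: a weight of 2,000 lbs falls within the range of weights in the data (about 1,513 to 5,424 lbs), and the linear relationship is strong. The prediction in part c is not reliable: 7,000 lbs is well beyond the heaviest car in the data, so it is an extrapolation, and the resulting estimate of -0.123 mpg is impossible.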
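One way to judge reliability is to check whether each weight lies inside the range of weights the model was fitted on, since predictions outside that range are extrapolations. A minimal sketch of that check follows; the names `wt_range`, `new_wt`, and `inside` are illustrative.

```r
# Observed range of weights in the data (in 1000 lbs)
wt_range <- range(mtcars$wt)

# Weights used in parts b and c (2,000 lbs and 7,000 lbs)
new_wt <- c(2, 7)

# TRUE if the weight is within the observed range, FALSE if it is an extrapolation
inside <- new_wt >= wt_range[1] & new_wt <= wt_range[2]
data.frame(wt = new_wt, within_observed_range = inside)
```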
What percent of the variation in a car’s mpg is explained by the car’s weight?
# Percentage of variation
summary(lm1)$r.squared
## [1] 0.7528328
About 75.28% of the variation in the cars’ mpg is explained by their weight.
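In a regression with a single explanatory variable, this percentage is the square of the correlation coefficient. The sketch below checks that relationship on mtcars; `r` and `r2_model` are names introduced for illustration.

```r
lm1 <- lm(mpg ~ wt, data = mtcars)

# Correlation between weight and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared reported by the model summary
r2_model <- summary(lm1)$r.squared

# TRUE when r squared matches the model's R-squared
all.equal(r^2, r2_model)
```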
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_students <- read.csv("C:/Users/naltidor01/Downloads/nscc_student_data.csv")
dim(nscc_students)
## [1] 40 15
str(nscc_students)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
## $ VoterReg : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
summary(nscc_students)
## Gender PulseRate CoinFlip1 CoinFlip2 Height
## Female:27 Min. :50.00 Min. :2 Min. :1.000 Min. : 6.00
## Male :13 1st Qu.:64.25 1st Qu.:4 1st Qu.:4.000 1st Qu.:62.00
## Median :70.50 Median :5 Median :5.000 Median :66.00
## Mean :73.47 Mean :5 Mean :4.897 Mean :64.52
## 3rd Qu.:83.75 3rd Qu.:6 3rd Qu.:6.000 3rd Qu.:68.75
## Max. :98.00 Max. :8 Max. :8.000 Max. :76.00
## NA's :2 NA's :1 NA's :1 NA's :1
## ShoeLength Age Siblings RandomNum
## Min. : 7.00 Min. :18.00 Min. :0.00 Min. : 1.0
## 1st Qu.: 9.03 1st Qu.:19.75 1st Qu.:1.00 1st Qu.: 14.0
## Median : 9.89 Median :21.50 Median :2.00 Median :273.0
## Mean :10.33 Mean :24.70 Mean :2.15 Mean :313.7
## 3rd Qu.:11.00 3rd Qu.:28.00 3rd Qu.:2.25 3rd Qu.:531.5
## Max. :20.00 Max. :49.00 Max. :7.00 Max. :999.0
## NA's :5 NA's :1
## HoursWorking Credits Birthday ProfsAge Coffee
## Min. : 0.00 Min. : 3.00 02-15 : 1 Min. :26.00 No :10
## 1st Qu.:17.25 1st Qu.:10.00 03.14.1984: 1 1st Qu.:28.00 Yes:30
## Median :25.00 Median :13.00 03/13 : 1 Median :30.50
## Mean :25.65 Mean :11.78 04/15 : 1 Mean :31.10
## 3rd Qu.:32.75 3rd Qu.:15.00 05/07 : 1 3rd Qu.:32.25
## Max. :64.00 Max. :16.00 05/23 : 1 Max. :39.00
## (Other) :34
## VoterReg
## No : 9
## Yes:31
##
##
##
##
##
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Scatterplot with shoe length as the explanatory variable
plot(nscc_students$ShoeLength, nscc_students$Height, main = "Height vs. Shoe Length", xlab = "Shoe Length", ylab = "Height")
# Scatterplot with pulse rate as the explanatory variable
plot(nscc_students$PulseRate, nscc_students$Height, main = "Height vs. Pulse Rate", xlab = "Pulse Rate", ylab = "Height")
Include use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Correlation coefficients for each pair of variables
cor(nscc_students$ShoeLength, nscc_students$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_students$PulseRate, nscc_students$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
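When `cor()` is given two vectors, `use = "pairwise.complete.obs"` keeps only the observations where both values are present. The sketch below illustrates this on small made-up vectors, which are hypothetical and not taken from nscc_student_data, by comparing the result with a manual filter.

```r
# Hypothetical toy vectors (not from nscc_student_data) with missing values
shoe   <- c(9.5, 10, NA, 11, 12, 10.5)
height <- c(62, 64, 66, NA, 72, 65)

# Rows where both values are present
keep <- complete.cases(shoe, height)

# Correlation using cor()'s own handling of missing values
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# Correlation after dropping incomplete pairs by hand
r_manual <- cor(shoe[keep], height[keep])

all.equal(r_pairwise, r_manual)  # the two approaches agree
sum(keep)                        # number of pairs actually used
```

Without the `use` argument, `cor()` returns `NA` as soon as either vector contains a missing value.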
# Linear model with shoe length as a predictor variable
(lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_students))
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_students)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
# Predicted height for a shoe length of 10, using the rounded coefficients above
60.365 + 0.566*10
## [1] 66.025
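To compare the two candidate predictors directly, the R-squared of each simple regression can be put side by side. The helper `compare_r2()` below is hypothetical and written only for this illustration; it is demonstrated on the built-in mtcars data so the block runs without the CSV file. For a single predictor, R-squared is the squared correlation, so the correlations reported above correspond to roughly 7% (shoe length) and 4% (pulse rate) of the variation in height.

```r
# Hypothetical helper: R-squared of a simple regression for each predictor
compare_r2 <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    # Build the formula response ~ p and fit it; lm() drops rows with NA by default
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Demonstration on mtcars: how well do wt and qsec each explain mpg?
r2_mtcars <- compare_r2(mtcars, "mpg", c("wt", "qsec"))
r2_mtcars

# Analogous call for the student data, assuming nscc_students is loaded:
# compare_r2(nscc_students, "Height", c("ShoeLength", "PulseRate"))
```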
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
The pulse rate and height pair was expected to have a poor or no relationship, since shoe length is far more likely than pulse rate to be related to height.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
The shoe length variable was expected to have a stronger relationship with height than it did, which implies that shoe length is not the determining factor of height. Another possible explanation for the lower-than-expected correlation coefficient is that the survey respondents might not have actually measured their shoe length, so the responses might not be as accurate as they should be.
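The summary output above also shows a minimum Height of 6.00 and a maximum ShoeLength of 20.00, which may be data-entry problems (for instance, a height typed in feet), although the source data does not confirm this. The sketch below shows how one could check the effect of such values on the correlation. The helper `cor_within_limits()`, the plausibility limits, and the toy vectors are all assumptions made for illustration.

```r
# Hypothetical helper: correlation after dropping values outside plausible limits
cor_within_limits <- function(x, y, x_limits, y_limits) {
  keep <- !is.na(x) & !is.na(y) &
    x >= x_limits[1] & x <= x_limits[2] &
    y >= y_limits[1] & y <= y_limits[2]
  cor(x[keep], y[keep])
}

# Made-up toy vectors; the last pair mimics a height entered in feet
shoe   <- c(9, 9.5, 10, 10.5, 11, 12, 10)
height <- c(61, 63, 65, 66, 68, 72, 6)

# Correlation with and without the implausible pair
r_all   <- cor(shoe, height)
r_clean <- cor_within_limits(shoe, height, c(5, 15), c(48, 84))
c(all = r_all, cleaned = r_clean)

# Possible call on the student data (the limits are assumptions):
# cor_within_limits(nscc_students$ShoeLength, nscc_students$Height, c(5, 15), c(48, 84))
```

In the toy data a single implausible pair pulls the correlation down sharply; whether the same happens in the student data would need to be checked on the actual file.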