In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
# Let's see what is going on with this dataset!
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
plot(mtcars$wt, mtcars$mpg, main = "Weight and MPG of cars", xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, there appears to be a negative linear relationship between the weight and miles per gallon of a car: heavier cars tend to get fewer miles per gallon. The points do not fall tightly on a line, but the downward trend is clear.
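As a visual aid for judging linearity, a fitted line can be drawn over the scatterplot. The following is a minimal sketch using the built-in mtcars data; the object name `fit` and the line color and width are illustrative choices, not part of the assignment.

```r
# Scatterplot of weight (in 1000s of lbs) against miles per gallon
plot(mtcars$wt, mtcars$mpg,
     main = "Weight and MPG of cars",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles Per Gallon")

# Fit a simple linear regression (fit is an illustrative name)
fit <- lm(mpg ~ wt, data = mtcars)

# Overlay the least-squares line on the existing plot
abline(fit, col = "red", lwd = 2)
```

If the points scatter roughly evenly around the line with no obvious curve, a linear model is a reasonable first description.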
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
The linear correlation coefficient of these two variables is approximately -0.868.
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong, negative linear relationship between these variables.
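To see where the coefficient comes from, it can be rebuilt from its definition: the covariance of the two variables divided by the product of their standard deviations. This is a short sketch on the built-in mtcars data; the names `x`, `y`, `r_manual`, and `r_builtin` are illustrative.

```r
# Pearson's r from its definition: cov(x, y) / (sd(x) * sd(y))
x <- mtcars$wt
y <- mtcars$mpg

r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# Both approaches should agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```

The sign of the coefficient comes from the covariance, which is negative here because mpg tends to fall as weight rises.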
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
lm1 <- lm(formula = mpg ~ wt, data = mtcars)
# Print the fitted model
lm1
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt
##      37.285       -5.344
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Regression equation: mpg = 37.285 - 5.344 * wt, where wt is in 1000s of lbs (2,000 lbs = 2)
37.285 + -5.344 * 2
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
37.285 + -5.344*7
## [1] -0.123
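The same estimates can be obtained from the fitted model with `predict()`, which avoids retyping rounded coefficients. This is a sketch on the built-in mtcars data; `fit`, `new_cars`, and `pred` are illustrative names, and the weights are entered in 1000s of lbs to match the `wt` variable.

```r
# Refit the model so this example is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# New cars to predict for: 2,000 lbs and 7,000 lbs (wt is in 1000s of lbs)
new_cars <- data.frame(wt = c(2, 7))

# predict() applies the unrounded coefficients to the new weights
pred <- predict(fit, newdata = new_cars)
pred
```

Because `predict()` uses the unrounded coefficients, its results can differ from the hand calculation in the third decimal place.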
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b is reasonable, but the one in part c is not. A weight of 2,000 lbs falls within the range of weights in the dataset (about 1,513 to 5,424 lbs), and roughly 27 miles per gallon is realistic for a small car. A weight of 7,000 lbs is well beyond the heaviest car in the data, so that prediction is an extrapolation, and its result (-0.123 mpg) is impossible because a car cannot get negative miles per gallon.
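One way to check whether a prediction requires extrapolation is to compare the new weights with the range of weights the model was fitted on. The sketch below does this for the built-in mtcars data; `wt_range`, `new_wt`, and `inside` are illustrative names.

```r
# Observed range of the explanatory variable (in 1000s of lbs)
wt_range <- range(mtcars$wt)

# Weights used in parts b and c (2,000 lbs and 7,000 lbs)
new_wt <- c(2, 7)

# TRUE when the weight lies inside the observed range, FALSE otherwise
inside <- new_wt >= wt_range[1] & new_wt <= wt_range[2]

data.frame(wt = new_wt, within_observed_range = inside)
```

A prediction flagged as outside the observed range relies on the linear trend continuing where there is no data to support it.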
What percent of the variation in a car’s mpg is explained by the car’s weight?
summary(lm1)$r.squared
## [1] 0.7528328
Our R-squared is 0.7528328, so about 75.3% of the variation in the mpg of cars can be explained by the weight of the cars.
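For a regression with a single explanatory variable, R-squared equals the square of the linear correlation coefficient. The sketch below checks this on the built-in mtcars data; `fit` and `r` are illustrative names.

```r
# Refit the model so this example is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# Correlation between weight and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# The squared correlation and the model's R-squared should match
c(r_squared_from_cor = r^2,
  r_squared_from_lm  = summary(fit)$r.squared)
```

This identity holds only for simple linear regression; with several explanatory variables, R-squared is no longer the square of any single pairwise correlation.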
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("~/Desktop/honorsStats/nscc_student_data.csv")
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# One plot
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")
# Other plot
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")
Include use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
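To illustrate what `use = "pairwise.complete.obs"` does, the sketch below uses a small made-up data frame (`toy`, with hypothetical values that are not taken from nscc_student_data). By default `cor()` returns NA when either variable has a missing value; with the pairwise option, only rows where both values are present are used.

```r
# Hypothetical toy data with missing values (not from nscc_student_data)
toy <- data.frame(
  ShoeLength = c(9.5, 10.0, NA, 11.0, 12.0),
  Height     = c(62, 65, 70, NA, 72)
)

# Default behavior: any missing value makes the result NA
cor(toy$ShoeLength, toy$Height)

# Pairwise option: drop rows where either value is missing
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")

# Equivalent manual approach: keep only rows complete in both variables
keep <- complete.cases(toy$ShoeLength, toy$Height)
r_manual <- cor(toy$ShoeLength[keep], toy$Height[keep])

all.equal(r_pairwise, r_manual)
```

With only two variables, the pairwise option is the same as removing incomplete rows first, which is also how `lm()` handles missing values by default.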
lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_student_data)
# Print the fitted model
lm2
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept)   ShoeLength
##      60.365        0.566
60.365 + 0.566*10
## [1] 66.025
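Squaring the two correlation coefficients reported above shows how much of the variation in height each variable explains. The sketch below uses those reported values directly, so it does not need the data file; `r_shoe` and `r_pulse` are illustrative names.

```r
# Correlations with Height, copied from the cor() output above
r_shoe  <- 0.2695881   # ShoeLength
r_pulse <- 0.2028639   # PulseRate

# Squared correlation = share of variation in Height explained
round(c(shoe_length = r_shoe^2, pulse_rate = r_pulse^2), 4)
```

Shoe length explains roughly 7% of the variation in height and pulse rate roughly 4%, so shoe length is the better of the two predictors, but neither explains much.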
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected little or no correlation between pulse rate and height because I could not think of a reason the two would be related.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected to see a strong correlation between height and shoe length. One possible reason the relationship was weak is that college students may (I'm assuming) still be growing, and height and foot size increase at different rates while people grow. This is only my guess.