In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
# Familiarizing with the data
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
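Beyond summary(), a few other base R calls can help with getting familiar with a dataset. The following is a minimal sketch using only base R functions; which of these checks is most useful is a matter of preference.

```r
# Structure: variable names, types, and the first few values
str(mtcars)

# First six rows of the data
head(mtcars)

# Dimensions: number of rows (cars) and columns (variables)
dim(mtcars)

# Number of missing values in each column
colSums(is.na(mtcars))
```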
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
plot(mpg ~ wt, data = mtcars, main = "Weight and Miles Per Gallon of Cars", xlab = "Weight (1,000 lbs)", ylab = "Miles Per Gallon")
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
The linear correlation coefficient of weight and miles per gallon in the mtcars dataset is approximately -0.868.
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The two variables have a strong negative linear relationship: as the weight of a car increases, its mpg tends to decrease.
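To see where the coefficient comes from, the sketch below recomputes it from its definition: the covariance of the two variables divided by the product of their standard deviations. The names r_manual and r_builtin are introduced here for illustration only.

```r
# Pearson's r from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(mtcars$wt, mtcars$mpg) / (sd(mtcars$wt) * sd(mtcars$mpg))

# Built-in equivalent
r_builtin <- cor(mtcars$wt, mtcars$mpg)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```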
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
lm1 <- lm(formula = mpg ~ wt, data = mtcars)
lm1
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt
##      37.285       -5.344
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# wt is measured in 1,000 lbs, so 2,000 lbs corresponds to wt = 2
37.285 + -5.344 * 2
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# wt is measured in 1,000 lbs, so 7,000 lbs corresponds to wt = 7
37.285 + -5.344 * 7
## [1] -0.123
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
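The prediction in part b seems reliable, but the prediction in part c does not. A weight of 2,000 lbs falls within the range of weights in the data (1,513 to 5,424 lbs), and the estimate of 26.597 mpg is plausible. A weight of 7,000 lbs is well beyond the heaviest car in the data, so the model is extrapolating, and the estimate of -0.123 is not a possible mpg for a car.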
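The hand calculations in parts b and c can also be done with predict(), and comparing the new weights against the observed range of wt shows which prediction is an extrapolation. This is a minimal sketch: fit and new_cars are names introduced here, and the model is refit so the chunk runs on its own.

```r
# Refit the model so this chunk is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# New weights in units of 1,000 lbs (2,000 lbs and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))

# Predicted mpg from the fitted line; this uses the unrounded
# coefficients, so values differ slightly from the hand calculations
predict(fit, newdata = new_cars)

# Observed range of weights in the data
range(mtcars$wt)

# TRUE where the new weight lies inside the observed range
new_cars$wt >= min(mtcars$wt) & new_cars$wt <= max(mtcars$wt)
```

A FALSE in the last result flags a prediction made outside the range of the data, which is where a linear model is least trustworthy.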
What percent of the variation in a car’s mpg is explained by the car’s weight?
summary(lm1)$r.squared
## [1] 0.7528328
About 75.3% of the variation in a car’s mpg is explained by the car’s weight.
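For a simple linear regression with a single explanatory variable, R-squared equals the square of the linear correlation coefficient, which ties this answer back to the correlation computed earlier. A short sketch, refitting the model so the chunk stands alone (fit, r_squared, and r are names introduced here):

```r
# Refit the model so this chunk is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# Proportion of variation in mpg explained by wt
r_squared <- summary(fit)$r.squared

# Linear correlation coefficient between wt and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# In simple linear regression, R-squared equals r squared
all.equal(r_squared, r^2)

# Expressed as a percentage, rounded to one decimal place
round(100 * r_squared, 1)
```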
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Scatter Plot for Shoe Length and Height
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")
# Scatter Plot for Pulse Rate and Height
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")
Shoe length and height appear to have a weak positive linear correlation. Pulse rate and height don’t seem to have any correlation at all. So, based on the scatterplots, shoe length and height have the stronger correlation.
Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
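The effect of the use argument can be illustrated on a small example. The sketch below uses a hypothetical toy data frame (toy, with made-up values that are not taken from nscc_student_data) to show that cor() returns NA by default when values are missing, and that "pairwise.complete.obs" gives the same result as dropping the incomplete pairs by hand.

```r
# Hypothetical toy data (NOT from nscc_student_data) with missing values
toy <- data.frame(
  shoe   = c(9.5, 10, NA, 11, 12, 10.5),
  height = c(64, 66, 70, NA, 72, 67)
)

# With the default use = "everything", any NA makes the result NA
cor(toy$shoe, toy$height)

# Pairs with a missing value are dropped before computing r
r_pairwise <- cor(toy$shoe, toy$height, use = "pairwise.complete.obs")

# Same result by keeping only the complete pairs manually
keep <- complete.cases(toy$shoe, toy$height)
r_complete <- cor(toy$shoe[keep], toy$height[keep])

all.equal(r_pairwise, r_complete)
```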
lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_student_data)
lm2
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept)   ShoeLength
##      60.365        0.566
60.365 + 0.566*10
## [1] 66.025
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I didn’t think that pulse rate and height would have any relationship. The analysis showed that my expectations were justified.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected shoe length and height to have a strong relationship, or at least a stronger one than the analysis found. I don’t think there was enough data to show the relationship clearly.