In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets: mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
plot(mpg ~ wt, data=mtcars, main="Miles Per Gallon and Weight of Cars", xlab="Weight of Car (in 1000 lbs)", ylab="Miles Per Gallon of Car")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Looking at the scatterplot, there appears to be a strong negative linear relationship: heavier cars tend to have lower mpg.
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
Based on the correlation coefficient of about -0.868, the linear relationship between weight and mpg is strong and negative.
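As a rough check on the coefficient above, Pearson's r can be reproduced from its definition (covariance divided by the product of the standard deviations). This is only a sketch; the variable names `r_manual` and `r_builtin` are introduced here for illustration.

```r
# Pearson's r computed from its definition and compared with cor()
x <- mtcars$wt   # explanatory variable: weight (1000 lbs)
y <- mtcars$mpg  # response variable: miles per gallon

r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# TRUE when the two agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```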
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
lm1 <- lm(mpg ~ wt, data=mtcars)
lm1
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
\(y = 37.285 - 5.344x\)
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
37.285+(-5.344*2)
## [1] 26.597
The estimated fuel efficiency of a 2,000 lb car is 26.597 mpg.
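The same estimate can also be obtained with predict() instead of typing the coefficients by hand. A minimal sketch, assuming the model is refit as in part a; `new_car` and `pred_2000` are names made up for this example. Because predict() uses the unrounded coefficients, the result may differ slightly from the hand calculation in the last decimal place.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in units of 1000 lbs, so 2,000 lbs is wt = 2
new_car <- data.frame(wt = 2)

pred_2000 <- predict(lm1, newdata = new_car)
pred_2000
```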
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
37.285+(-5.344*7)
## [1] -0.123
The fuel efficiency based on the equation is -0.123 mpg.
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction from part b is reliable because a weight of 2,000 lbs falls within the range of car weights in the dataset. The prediction from part c is not reliable: 7,000 lbs is outside the range of weights in the dataset, so the model is being extrapolated, and a negative mpg is not physically possible.
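One way to back up this reasoning is to compare each weight with the range of weights actually observed in mtcars. The helper `in_range` below is a hypothetical function written for this sketch, not part of the assignment.

```r
# Observed range of car weights (in 1000 lbs)
wt_range <- range(mtcars$wt)
wt_range

# Hypothetical helper: is a given weight inside the observed range?
in_range <- function(w) w >= wt_range[1] & w <= wt_range[2]

in_range(2)  # 2,000 lbs: inside the data, so interpolation
in_range(7)  # 7,000 lbs: outside the data, so extrapolation
```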
What percent of the variation in a car’s mpg is explained by the car’s weight?
summary(lm1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
About 75.28% of the variation in a car's mpg is explained by the car's weight (Multiple R-squared = 0.7528).
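Rather than reading the value off the printed summary, R-squared can be pulled out of the summary object directly. For a regression with a single explanatory variable it equals the square of the correlation coefficient, which the sketch below checks. The names `r_squared` and `r` are chosen here for illustration.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# R-squared as stored in the model summary
r_squared <- summary(lm1)$r.squared

# In simple linear regression, R-squared is the squared correlation
r <- cor(mtcars$wt, mtcars$mpg)
all.equal(r_squared, r^2)

# Express as a percentage of variation explained
round(100 * r_squared, 2)
```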
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main="Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main="Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")
Shoe length and height appear to have a weak positive linear correlation. Pulse rate and height appear to have no linear correlation at all. Based on these observations, shoe length appears to be more strongly correlated with height than pulse rate is.
Include use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
Shoe length has a correlation coefficient of 0.2695881 with height, and pulse rate has a correlation coefficient of 0.2028639. Based solely on these numbers, shoe length is the better predictor, although both correlations are weak. This matches the prediction I made from the graphs.
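To see what use = "pairwise.complete.obs" does, here is a small sketch on made-up data. The data frame `toy` and its values are hypothetical and are not taken from the NSCC dataset; they only illustrate how missing values affect cor().

```r
# Hypothetical toy data (NOT the NSCC data) with one NA in each column
toy <- data.frame(
  ShoeLength = c(9.5, 10, NA, 11, 12),
  Height     = c(64, 66, 70, NA, 72)
)

# By default, any missing value makes the result NA
r_default <- cor(toy$ShoeLength, toy$Height)

# With pairwise.complete.obs, only rows where both values are present are used
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")

# The same result, computed by dropping incomplete rows by hand
complete_rows <- complete.cases(toy)
r_by_hand <- cor(toy$ShoeLength[complete_rows], toy$Height[complete_rows])

is.na(r_default)
all.equal(r_pairwise, r_by_hand)
```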
lm2 <- lm(Height ~ ShoeLength, data=nscc_student_data)
lm2
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
The equation is \(y = 60.365 + 0.566x\). Plugging a shoe length of 10'' into that equation gives a predicted height of 66.025.
No, the prediction does not appear to be accurate. The correlation coefficient was low, and the data sample was small. Both of these point toward the model being a poor predictor.
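The arithmetic behind the prediction, and a rough measure of how little the model explains, can be sketched as follows. The numbers are copied from the output above rather than recomputed from the data file, and the variable names are chosen here for illustration. Squaring the correlation coefficient gives the share of variation in height explained by shoe length, assuming the regression and the correlation use the same complete observations.

```r
# Values copied from the printed output above (not recomputed from the CSV)
intercept <- 60.365
slope     <- 0.566
r         <- 0.2695881  # correlation between shoe length and height

# Predicted height for a shoe length of 10 inches
predicted_height <- intercept + slope * 10
predicted_height

# Share of variation in height explained by shoe length
r_squared <- r^2
round(100 * r_squared, 1)
```

A value of roughly 7% explained variation supports the conclusion that the model is a poor predictor.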
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected Pulse Rate and Height to have no correlation before analysis. Taking what I know about the two subjects I did not have high hopes for there being any correlation, and after testing it appears that the hypothesis was right.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected the relationship between shoe length and height to be strong, but it does not appear to be. This makes sense because the pattern of taller people having longer feet is generally true, but not always true. I also imagine that many other factors play into height.