In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
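A minimal sketch of the exploratory step, using only base R functions. Which summaries to inspect is a judgment call rather than part of the assignment; the choice of `mpg` and `wt` simply anticipates the questions that follow. Note that the `mtcars` help page gives `wt` in units of 1,000 lbs.

```r
# Quick look at the built-in mtcars dataset
str(mtcars)                        # variable names and types
head(mtcars)                       # first six rows
summary(mtcars[, c("mpg", "wt")])  # summaries of the two variables used below
```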
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Creating a scatterplot
plot(mpg ~ wt, data = mtcars, main = "Weight and MPG of Cars", xlab = "Weight (1,000 lbs)", ylab = "Miles per gallon")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, there appears to be a negative linear relationship between the weight and the mpg of cars: heavier cars tend to get fewer miles per gallon. From the plot alone it looks moderate.
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
# Calculating the linear correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
The linear correlation coefficient is -0.8676594.
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The relationship is strong and negative: as weight increases, mpg tends to decrease.
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Regression Equation
(lm1 <- lm(mpg ~ wt, data = mtcars))
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
The equation is:
y = 37.285 - 5.344x, where x is the weight of the car (in 1,000 lbs) and y is the predicted mpg.
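As a cross-check on where these coefficients come from, the least-squares slope and intercept can also be computed from summary statistics: slope = r × sd(y) / sd(x) and intercept = mean(y) - slope × mean(x). The sketch below uses the illustrative variable names `r`, `slope`, and `intercept`, which are not part of the original report.

```r
# Slope and intercept of the least-squares line from summary statistics
r <- cor(mtcars$wt, mtcars$mpg)
slope <- r * sd(mtcars$mpg) / sd(mtcars$wt)
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$wt)
c(intercept = intercept, slope = slope)

# These should agree with the coefficients reported by lm()
coef(lm(mpg ~ wt, data = mtcars))
```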
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Predicted mpg of a car that weighs 2,000 lbs (wt is in 1,000 lbs, so wt = 2)
37.285 - 5.344 * 2
## [1] 26.597
26.597 mpg
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# Predicted mpg of a car that weighs 7,000 lbs (wt is in 1,000 lbs, so wt = 7)
37.285 - 5.344 * 7
## [1] -0.123
-0.123 mpg
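The same estimates can be obtained without retyping rounded coefficients by passing new weights to `predict()`. This is a sketch, not the method the assignment requires; `fit`, `new_cars`, and `preds` are illustrative names. Because `predict()` uses the unrounded coefficients, the results may differ from the hand calculations in the third decimal place.

```r
# Refit the model and predict mpg for cars weighing 2,000 and 7,000 lbs
fit <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2, 7))  # wt is measured in 1,000 lbs
preds <- predict(fit, newdata = new_cars)
preds
```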
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
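I think the prediction for the car that weighs 2,000 lbs is reliable because it is a small car, its weight falls within the range of weights in the dataset, and approximately 27 mpg is a plausible value. On the other hand, I don’t think the prediction for the car that weighs 7,000 lbs is reliable: a negative mpg is impossible, and 7,000 lbs is well beyond the heaviest car in mtcars (about 5,400 lbs), so the model is being extrapolated outside the data it was fit on.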
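One way to check whether a prediction is an extrapolation is to compare the new weights with the range of weights the model was fit on. A minimal sketch, with `wt_range`, `new_wt`, and `inside` as illustrative names:

```r
# Range of observed weights (in 1,000 lbs)
wt_range <- range(mtcars$wt)
wt_range

# Are the weights from parts b and c inside that range?
new_wt <- c(2, 7)
inside <- new_wt >= wt_range[1] & new_wt <= wt_range[2]
inside
```

A value of `FALSE` flags a weight for which the regression line is being used outside the observed data.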
What percent of the variation in a car’s mpg is explained by the car’s weight?
# percent of variation
summary(lm1)$r.squared
## [1] 0.7528328
About 75.3% of the variation in a car’s mpg is explained by the car’s weight.
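For a regression with a single explanatory variable, R-squared equals the square of the linear correlation coefficient, so this figure can be cross-checked against the correlation computed earlier. A short sketch (the names `r` and `fit` are illustrative):

```r
# R-squared as the squared correlation coefficient
r <- cor(mtcars$wt, mtcars$mpg)
r^2

# Compare with the value reported by summary() of the fitted model
fit <- lm(mpg ~ wt, data = mtcars)
all.equal(r^2, summary(fit)$r.squared)
```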
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("~/Desktop/nscc_student_data.csv")
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Scatterplot of shoe length and height
plot(Height ~ ShoeLength, data = nscc_student_data, main = "Shoe Length and Height", xlab = "Shoe Length")
# Scatterplot of pulse rate and height
plot(Height ~ PulseRate, data = nscc_student_data, main = "Pulse Rate and Height", xlab = "Pulse Rate")
Include use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Correlation of shoe length and height
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
# correlation of Pulse rate and height
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
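To illustrate what `use = "pairwise.complete.obs"` does, here is a sketch on a small made-up data frame. The data frame `toy` and its values are hypothetical and unrelated to the NSCC dataset. With the default settings `cor()` returns `NA` when either variable has a missing value; with pairwise deletion it uses only the rows where both values are present.

```r
# Hypothetical data with missing values (not from nscc_student_data)
toy <- data.frame(
  shoe   = c(9, 10, NA, 11, 12),
  height = c(62, 65, 70, NA, 72)
)

# Default behaviour: any NA makes the result NA
cor(toy$shoe, toy$height)

# Pairwise deletion: only rows with both values present are used
r_pairwise <- cor(toy$shoe, toy$height, use = "pairwise.complete.obs")

# Same result as filtering to complete rows by hand
complete <- complete.cases(toy)
r_manual <- cor(toy$shoe[complete], toy$height[complete])
all.equal(r_pairwise, r_manual)
```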
# Linear model
(lm2 <- lm(Height ~ ShoeLength, data = nscc_student_data))
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
The equation is:
y = 60.365 + 0.566x, where x is the shoe length and y is the predicted height.
# Predicted height of someone with a 10" shoe length
60.365 + 0.566 * 10
## [1] 66.025
The height would be 66.025 inches.
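To put the strength of the two candidate predictors on a common scale, the correlations reported above can be squared, which for a one-variable regression gives the share of variation in height each predictor explains. The sketch below only reuses the two printed correlation values; `r_shoe` and `r_pulse` are illustrative names. It assumes each model is fit on the same rows that were used for the corresponding correlation.

```r
# Correlations copied from the output above
r_shoe  <- 0.2695881  # shoe length vs. height
r_pulse <- 0.2028639  # pulse rate vs. height

# Share of variation in height explained by each predictor
r_sq <- c(shoe = r_shoe^2, pulse = r_pulse^2)
round(r_sq, 4)
```

Both values come out well under 10%, so shoe length is the better of the two predictors here but still explains little of the variation in height.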
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
Height and pulse rate.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
Usually when we make that assumption, it is based on the people around us. I don’t know whether the NSCC student data was randomly selected, but it lets us look at more people at once than everyday experience does. So there are many possible reasons why the relationship turned out weaker than expected.