The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Plot mpg by wt
plot(mpg ~ wt, data = mtcars)
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
It appears to be a negative linear relationship, although it is not very tight.
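One way to support a visual judgment like this is to overlay a smoothed trend on the scatterplot. The sketch below is only an illustration, not part of the assignment: it uses base R's lowess() smoother, and the object name wt_mpg_smooth is made up for this example.

```r
# Redraw the scatterplot of mpg against wt
plot(mpg ~ wt, data = mtcars)

# Compute a lowess smooth (wt_mpg_smooth is a hypothetical name)
wt_mpg_smooth <- lowess(mtcars$wt, mtcars$mpg)

# Overlay the smooth; a roughly straight curve suggests a linear trend
lines(wt_mpg_smooth, col = "red")
```

If the smoothed curve bends noticeably, that is a hint that a straight line may not capture the whole relationship.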
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
# calculate correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong negative linear relationship between the two variables: as weight increases, mpg tends to decrease.
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Create linear model relating miles per gallon to weight
lm_wt_mpg <- lm(mpg ~ wt, data=mtcars)
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Extract coefficients from linear model
summary(lm_wt_mpg)$coefficients[,1]
## (Intercept) wt
## 37.285126 -5.344472
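# Predict mpg at 2,000 lbs (wt is recorded in units of 1,000 lbs, so wt = 2)
37.285 - 5.344472*2
## [1] 26.59606
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# Extrapolate linear model to 7,000 lbs (wt = 7)
37.285 - 5.344472*7
## [1] -0.126304
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b is reasonably reliable, because 2,000 lbs falls within the range of weights observed in the dataset (roughly 1,500 to 5,400 lbs). The prediction in part c is not reliable: 7,000 lbs is well beyond the heaviest car in the data, so the model is being extrapolated, and the resulting negative miles per gallon is meaningless.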
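The wt variable in mtcars is recorded in units of 1,000 lbs, so a 2,000 lb car corresponds to wt = 2 and a 7,000 lb car to wt = 7. As a cross-check on the hand calculation, here is a minimal sketch using predict(); the names new_cars and preds are made up for this example.

```r
# Fit the same model as above
lm_wt_mpg <- lm(mpg ~ wt, data = mtcars)

# Weights to predict for, in 1,000s of lbs (new_cars is a hypothetical name)
new_cars <- data.frame(wt = c(2, 7))

# Predicted mpg for each weight
preds <- predict(lm_wt_mpg, newdata = new_cars)
preds

# Range of observed weights, to see which predictions are extrapolations
range(mtcars$wt)
```

Comparing the requested weights with range(mtcars$wt) shows which predictions fall inside the observed data and which lie outside it.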
What percent of the variation in a car’s mpg is explained by the car’s weight?
# Extract explained variation from linear model
summary(lm_wt_mpg)$r.squared
## [1] 0.7528328
About 75% of the variation in a car’s mpg can be explained by the car’s weight.
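For a simple linear regression with one explanatory variable, R-squared equals the square of the correlation coefficient. The short sketch below checks this on mtcars; the names r and r_squared are made up for this example.

```r
# Correlation between weight and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared from the simple linear regression of mpg on wt
r_squared <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

# The squared correlation should match R-squared up to rounding error
all.equal(r^2, r_squared)
```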
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
plot(Height ~ ShoeLength, data=nscc_student_data)
plot(Height ~ PulseRate, data=nscc_student_data)
Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Calculate correlation coefficients of pulse rate and shoe length with height
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use="pairwise.complete.obs")
## [1] 0.2028639
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use="pairwise.complete.obs")
## [1] 0.2695881
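To see why the use argument matters, here is a small sketch with made-up vectors (x and y are toy data, not taken from the student dataset). By default cor() returns NA when any value is missing; with use = "pairwise.complete.obs" it keeps only the pairs where both values are present.

```r
# Toy vectors with missing values (hypothetical data for illustration)
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 6, 8, NA)

# Default behaviour: any missing value makes the result NA
r_default <- cor(x, y)

# Only the complete pairs (1,2), (2,4), (3,6) are used here
r_pairwise <- cor(x, y, use = "pairwise.complete.obs")

r_default
r_pairwise
```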
# Create linear model of shoe length to height
lm_sl_height <- lm(Height ~ ShoeLength, data=nscc_student_data)
# Extract coefficients from linear model
summary(lm_sl_height)$coefficients[,1]
## (Intercept) ShoeLength
## 60.3654950 0.5660485
# Calculate predicted value
60.3655 + 0.5660485*10
## [1] 66.02598
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected shoe length to have a poor relationship with height.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected the model using weight to predict mpg to be stronger. The relationship would most likely be better described by a nonlinear model, such as exponential decay.
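The exponential decay idea can be checked rather than assumed. The sketch below is one possible comparison, not the only one: it fits a straight line and a log-linear model (equivalent to exponential decay in mpg) on mtcars, then compares their errors on the original mpg scale. The helper rmse() and the object names are made up for this example.

```r
# Straight-line model and exponential-decay model (linear in log(mpg))
lm_linear <- lm(mpg ~ wt, data = mtcars)
lm_exp <- lm(log(mpg) ~ wt, data = mtcars)

# Fitted values on the mpg scale; the log model is back-transformed with exp()
pred_linear <- fitted(lm_linear)
pred_exp <- exp(fitted(lm_exp))

# Root mean squared error on the mpg scale (rmse is a hypothetical helper)
rmse <- function(pred) sqrt(mean((mtcars$mpg - pred)^2))

# Smaller values indicate a closer fit to the observed mpg
model_rmse <- c(linear = rmse(pred_linear), exponential = rmse(pred_exp))
model_rmse
```

Comparing errors on the mpg scale avoids setting an R-squared computed on log(mpg) against one computed on mpg, which are not directly comparable.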