The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
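As a starting point for the exploratory analysis, a minimal sketch along these lines could be used. It looks only at the two variables used in the questions that follow; the object names `focus` and `missing_counts` are illustrative and not part of the original report.

```r
# Overall structure of the built-in dataset: 32 cars, 11 numeric variables
str(mtcars)

# Keep only the two variables used in the questions below
focus <- mtcars[, c("mpg", "wt")]

# Five-number summary and mean for mpg and wt (wt is recorded in 1000 lbs)
summary(focus)

# Count missing values per column before computing correlations
missing_counts <- colSums(is.na(focus))
missing_counts
```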
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# MPG by weight
plot(mpg ~ wt, data = mtcars, main="Miles Per Gallon and Weight of Cars", xlab="Weight of Car (in 1000 lbs)", ylab="Miles Per Gallon of Car")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, there appears to be a linear relationship between the weight and miles per gallon of a car. The relationship is negative and looks fairly strong: mpg decreases steadily as weight increases.
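One way to support a visual judgment of linearity is to overlay a straight least-squares line and a smoothed curve on the scatterplot. If the two roughly agree, a linear description is reasonable. This is only a sketch: the object name `fit_line` and the colors are illustrative choices.

```r
# Redraw the scatterplot of mpg against weight
plot(mpg ~ wt, data = mtcars,
     main = "Miles Per Gallon and Weight of Cars",
     xlab = "Weight of Car (in 1000 lbs)",
     ylab = "Miles Per Gallon of Car")

# Straight least-squares line
fit_line <- lm(mpg ~ wt, data = mtcars)
abline(fit_line, col = "blue")

# Smoothed (lowess) curve for comparison with the straight line
lines(lowess(mtcars$wt, mtcars$mpg), col = "red", lty = 2)
```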
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
# Correlation coefficient
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
The linear correlation coefficient of these two variables is -0.868.
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong, negative linear relationship between these variables.
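To see where the coefficient comes from, it can be rebuilt from its definition, the covariance divided by the product of the two standard deviations, and compared with the result of cor(). The names `r_manual` and `r_builtin` are illustrative.

```r
# Correlation from its definition: cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(mtcars$mpg, mtcars$wt) / (sd(mtcars$mpg) * sd(mtcars$wt))

# Same quantity from the built-in function
r_builtin <- cor(mtcars$mpg, mtcars$wt)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```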
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Create regression line with lm() function and store into object called lm1
lm1 <- lm(formula = mpg ~ wt, data = mtcars)
lm1
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
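Regression line is \(\hat{y} = 37.285 - 5.344x\), where \(x\) is the weight of the car in 1000 lbs, so a 2,000 lb car corresponds to \(x = 2\).
37.285 - 5.344*2
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
37.285 + -5.344*7
## [1] -0.123
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b is reasonably reliable because a 2,000 lb car falls within the range of weights in the dataset (about 1,500 to 5,400 lbs). The prediction in part c is not reliable: 7,000 lbs is well beyond the observed weights, so it is an extrapolation, and it produces a negative mpg, which is impossible.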
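The same estimates can be obtained with predict(), which uses the unrounded coefficients, and each new weight can be checked against the range of weights in the data to flag extrapolation. Note that wt is recorded in thousands of pounds, so 2,000 lbs and 7,000 lbs are entered as 2 and 7. The names `new_cars`, `predicted_mpg`, `wt_range`, and `inside_range` are illustrative.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# New weights in units of 1000 lbs (2,000 lbs and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))

# Predicted mpg from the fitted model
predicted_mpg <- predict(lm1, newdata = new_cars)

# A prediction is an extrapolation if the weight lies outside the observed range
wt_range <- range(mtcars$wt)
inside_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]

data.frame(new_cars, predicted_mpg, inside_range)
```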
What percent of the variation in a car’s mpg is explained by the car’s weight?
summary(lm1)$r.squared
## [1] 0.7528328
About 75.3% of the variation in a car’s mpg is explained by the car’s weight.
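For a regression with a single explanatory variable, R-squared equals the square of the correlation coefficient, and it can also be written as 1 minus the ratio of residual to total sum of squares. The sketch below checks both relationships on mtcars; the object names are illustrative.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# R-squared as reported by the model summary
r_squared_model <- summary(lm1)$r.squared

# R-squared as the squared correlation coefficient
r_squared_cor <- cor(mtcars$mpg, mtcars$wt)^2

# R-squared as 1 - SSE / SST
sse <- sum(residuals(lm1)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared_ss <- 1 - sse / sst

c(r_squared_model, r_squared_cor, r_squared_ss)
```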
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("C:/Users/samura641/Desktop/Honor Statistics/nscc_student_data.csv")
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Create scatterplot with Height as response and ShoeLength as explanatory
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")
# Create scatterplot with Height as response and PulseRate as explanatory
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")
b) Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.
Based on the scatterplot of height and shoe length, there appears to be a positive linear relationship between the two variables, but it is weak: the correlation of 0.2695881 is close to 0.
Based on the scatterplot of height and pulse rate, there is at most a weak, poorly defined linear relationship between the two variables: the correlation of 0.2028639 is even closer to 0.
Include use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Correlation coefficients
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
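To illustrate what use = "pairwise.complete.obs" does, the sketch below uses a small made-up data frame (`toy`, which is not the NSCC data) containing missing values. By default cor() returns NA when any value is missing; with the pairwise option it uses only the rows where both variables are present, which for two variables matches dropping the incomplete rows by hand.

```r
# Hypothetical toy data with missing values (not the NSCC dataset)
toy <- data.frame(
  Height     = c(60, 62, NA, 66, 68, 70),
  ShoeLength = c(9, NA, 10, 10.5, 11, 12)
)

# Default behaviour: any NA makes the result NA
r_default <- cor(toy$ShoeLength, toy$Height)

# Use only rows where both values are observed
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")

# Equivalent manual version: drop incomplete rows first
complete_rows <- complete.cases(toy)
r_manual <- cor(toy$ShoeLength[complete_rows], toy$Height[complete_rows])

c(r_default, r_pairwise, r_manual)
```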
The better predictor of height is shoe length because it has the larger correlation coefficient, although both correlations are weak.
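Squaring each correlation gives a rough sense of how little of the variation in height either variable explains. The sketch below uses the two coefficients reported above rather than the raw data, and assumes a simple linear model with one explanatory variable in each case; the object names are illustrative.

```r
# Correlation coefficients reported above
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# Share of variation in height explained by each variable (r squared)
r2_shoe  <- r_shoe^2
r2_pulse <- r_pulse^2

# Roughly 7% for shoe length and 4% for pulse rate
round(c(ShoeLength = r2_shoe, PulseRate = r2_pulse), 3)
```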
lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_student_data)
lm2
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
60.365 + 0.566*10
## [1] 66.025
Yes, this seems reasonable: the scatterplot I made of height and shoe length shows that someone with a 10-inch shoe length will more than likely be at or around 66 inches tall. However, because the correlation is weak (about 0.27), the estimate is not very precise.
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected no correlation to exist between pulse rate and height because there is no obvious reason for a person’s pulse rate to be related to how tall they are.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected to see a strong correlation between height and shoe length. Perhaps there wasn’t a strong correlation because college students are (on average, I’m assuming) still growing. This is just my guess, but height and foot size may increase at different rates while people grow.