In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
wt <- mtcars$wt
mpg <- mtcars$mpg

ss_plot <- function(x, y, x1, y1, x2, y2, showSquares = FALSE, leastSquares = FALSE){
  plot(y ~ x, asp = 1) # xlab = paste(substitute(x)), ylab = paste(substitute(y))
  if(leastSquares){
    # Fit the least-squares line
    m1 <- lm(y ~ x)
    y.hat <- m1$fit
  } else {
    # Otherwise draw a line through the two supplied points
    # (interactive selection via locator() is disabled)
    points(x1, y1, pch = 21, col = 'red', bg = 'red', cex = 1.5)
    points(x2, y2, pch = 21, col = 'red', bg = 'red', cex = 1.5)
    pts <- data.frame("x" = c(x1, x2), "y" = c(y1, y2))
    m1 <- lm(y ~ x, data = pts)
    y.hat <- predict(m1, newdata = data.frame(x))
  }
  r <- y - y.hat # residuals
  abline(m1)
  oSide <- x - r
  LLim <- par()$usr[1]
  RLim <- par()$usr[2]
  # Move squares to the other side of the line to avoid the plot margins
  oSide[oSide < LLim | oSide > RLim] <- c(x + r)[oSide < LLim | oSide > RLim]
  n <- length(y.hat)
  for(i in 1:n){
    # Dashed blue segment: residual from each point down to the line
    lines(rep(x[i], 2), c(y[i], y.hat[i]), lty = 2, col = "blue")
    if(showSquares){
      # Dotted orange segments: the square whose side length is the residual
      lines(rep(oSide[i], 2), c(y[i], y.hat[i]), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y.hat[i], 2), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y[i], 2), lty = 3, col = "orange")
    }
  }
  SS <- round(sum(r^2), 3)
  cat("\r ")
  print(m1)
  cat("Sum of Squares: ", SS)
}
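Before answering the questions, it helps to look at the data directly. The sketch below also demonstrates the two ways ss_plot() can be called; the two anchor points for the hand-drawn line are arbitrary values chosen for illustration.
# Explore the dataset first
str(mtcars)
head(mtcars)
# Demo of ss_plot(): a hand-picked line through two arbitrary points,
# then the least-squares line
ss_plot(wt, mpg, x1 = 2, y1 = 30, x2 = 5, y2 = 10)
ss_plot(wt, mpg, leastSquares = TRUE)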
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Create a scatter plot of weight and miles per gallon.
plot(mtcars$wt, mtcars$mpg, xlab = "weight", ylab = "miles per gallon")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes. The scatterplot shows what appears to be a fairly strong negative linear relationship between the weight and miles per gallon of the cars in mtcars.
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
# Calculate the correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
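As a quick sanity check, cor.test() reports the same coefficient along with a test of whether the true correlation is zero:
# Optional: test whether the correlation differs significantly from zero
cor.test(mtcars$wt, mtcars$mpg)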
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
With r of about -0.87, I would say there is a strong negative linear relationship between the two variables.
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Create the regression equation using the least-squares line
ss_plot(mtcars$wt, mtcars$mpg, leastSquares = TRUE)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 37.285 -5.344
##
## Sum of Squares: 278.322
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Estimate mpg for a car that weighs 2,000 lbs; wt is measured in
# 1,000-lb units, so wt = 2: mpg-hat = 37.285 - 5.344*wt
37.285 - 5.344*2
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# Estimate mpg for a car that weighs 7,000 lbs (wt = 7)
37.285 - 5.344*7
## [1] -0.123
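The same estimates can be obtained with predict() instead of typing the coefficients by hand; fit is a name introduced here just for this sketch:
# Sketch: both estimates via predict() on a fitted model
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = c(2, 7)))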
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b is reliable: a weight of 2,000 lbs falls within the range of weights in the data. The prediction in part c is not reliable, because 7,000 lbs is well outside the range of the data (extrapolation), and the predicted value of -0.123 mpg is impossible for a real car.
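A quick check of the observed weight range backs this up:
# Observed range of car weights (in 1,000 lbs)
range(mtcars$wt)
## [1] 1.513 5.424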
What percent of the variation in a car’s mpg is explained by the car’s weight?
# Create regression line with lm() function and store into object called lm1
lm1 <- lm(mpg~wt, data = mtcars)
# Summary of lm1 model
summary(lm1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
summary(lm1)$r.squared
## [1] 0.7528328
The multiple R-squared is 0.7528, so about 75.3% of the variation in a car’s mpg is explained by the car’s weight.
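For simple linear regression, R-squared is just the square of the correlation coefficient from earlier, which gives a quick consistency check:
# R-squared equals the squared correlation for simple regression
cor(mtcars$wt, mtcars$mpg)^2
## [1] 0.7528328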
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("C:/Users/selma/Desktop/Stats/nscc_data.csv")
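A quick exploratory look before modeling; the columns referenced below (Height, ShoeLength, PulseRate) are the ones used in the rest of the analysis:
# Familiarize with the structure and summary statistics
str(nscc_student_data)
summary(nscc_student_data[, c("Height", "ShoeLength", "PulseRate")])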
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Scatterplot with Height as the response variable and ShoeLength as the explanatory variable
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, xlab = "ShoeLength", ylab = "Height")
# Scatterplot with Height as the response variable and PulseRate as the explanatory variable
plot(nscc_student_data$PulseRate, nscc_student_data$Height, xlab = "PulseRate", ylab = "Height")
Based on the scatterplot of Height versus ShoeLength, there appears to be a moderate positive linear relationship, so this explanatory variable looks like the better predictor of height.
Based on the scatterplot of Height versus PulseRate, there is at most a weak, poorly defined linear relationship, so this explanatory variable appears to be a weak predictor of height.
Use use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Calculate the correlation coefficients for each pair in #5
cor(nscc_student_data$Height, nscc_student_data$ShoeLength, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
Shoe length is the better predictor of height because it has the larger correlation coefficient (0.27 versus 0.20), although both correlations are weak.
# Scatterplot of Height ~ ShoeLength
plot(nscc_student_data$Height ~ nscc_student_data$ShoeLength)
# Add the regression line to the plot with abline()
lm2 <- lm(Height ~ ShoeLength, data = nscc_student_data)
abline(lm2)
summary(lm2)
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0260 -4.4458 0.2759 3.2665 8.8419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.365 3.716 16.246 <2e-16 ***
## ShoeLength 0.566 0.352 1.608 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.533 on 33 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.07268, Adjusted R-squared: 0.04458
## F-statistic: 2.586 on 1 and 33 DF, p-value: 0.1173
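As with the mtcars model, the R-squared can be pulled out directly; at roughly 0.07 it confirms a weak relationship:
# Proportion of variation in Height explained by ShoeLength
summary(lm2)$r.squared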
Based on the model, the predicted height of a person with a 10-inch shoe length is 60.365 + 0.566*10, which is about 66 inches, or roughly 5.5 ft.
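The same estimate via predict():
# Predicted height (inches) for ShoeLength = 10
predict(lm2, newdata = data.frame(ShoeLength = 10))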
A 10-inch shoe length is within the range of the observed data, so the prediction is an interpolation and should be relatively reliable in that sense. However, the correlation coefficient is only 0.2695881, which is not a strong relationship, so the prediction should not be trusted too far.
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected Height and PulseRate to have little or no relationship.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I think outliers in the data weakened the relationship: a few unusual observations can pull the regression line away from the bulk of the points and shrink the correlation, so the model did not turn out to be very strong.
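One way to check this explanation is to look for unusual points directly; a minimal sketch:
# Look for outliers: boxplot of ShoeLength and the largest
# absolute residuals from the fitted model
boxplot(nscc_student_data$ShoeLength, ylab = "ShoeLength")
head(sort(abs(resid(lm2)), decreasing = TRUE))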