In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
wt <- mtcars$wt
mpg <- mtcars$mpg

ss_plot <- function(x, y, x1, y1, x2, y2, showSquares = FALSE, leastSquares = FALSE){
  plot(y ~ x, asp = 1) # xlab = paste(substitute(x)), ylab = paste(substitute(y))
  if(leastSquares){
    # Fit the least-squares line
    m1 <- lm(y ~ x)
    y.hat <- m1$fit
  } else {
    # Otherwise draw a line through the two supplied points
    # (interactive selection via locator() is disabled)
    points(x1, y1, pch = 21, col = 'red', bg = 'red', cex = 1.5)
    points(x2, y2, pch = 21, col = 'red', bg = 'red', cex = 1.5)
    pts <- data.frame("x" = c(x1, x2), "y" = c(y1, y2))
    m1 <- lm(y ~ x, data = pts)
    y.hat <- predict(m1, newdata = data.frame(x))
  }
  r <- y - y.hat # residuals
  abline(m1)
  oSide <- x - r
  LLim <- par()$usr[1]
  RLim <- par()$usr[2]
  # Move squares to the other side of the line to avoid the plot margins
  oSide[oSide < LLim | oSide > RLim] <- c(x + r)[oSide < LLim | oSide > RLim]
  n <- length(y.hat)
  for(i in 1:n){
    # Dashed blue segment: residual from each point down to the line
    lines(rep(x[i], 2), c(y[i], y.hat[i]), lty = 2, col = "blue")
    if(showSquares){
      # Dotted orange segments: the square whose side length is the residual
      lines(rep(oSide[i], 2), c(y[i], y.hat[i]), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y.hat[i], 2), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y[i], 2), lty = 3, col = "orange")
    }
  }
  SS <- round(sum(r^2), 3)
  cat("\r ")
  print(m1)
  cat("Sum of Squares: ", SS)
}
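Before answering the questions, it helps to look at the data directly. The sketch below also demonstrates the two ways ss_plot() can be called; the two anchor points for the hand-drawn line are arbitrary values chosen for illustration.
# Explore the dataset first
str(mtcars)
head(mtcars)
# Demo of ss_plot(): a hand-picked line through two arbitrary points,
# then the least-squares line
ss_plot(wt, mpg, x1 = 2, y1 = 30, x2 = 5, y2 = 10)
ss_plot(wt, mpg, leastSquares = TRUE)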
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Create a scatter plot of weight and miles per gallon.
plot(mtcars$wt, mtcars$mpg, xlab = "weight", ylab = "miles per gallon")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes. The scatterplot shows what appears to be a fairly strong negative linear relationship between the weight and miles per gallon of the cars in mtcars.
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
# Calculate the correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
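As a quick sanity check, cor.test() reports the same coefficient along with a test of whether the true correlation is zero:
# Optional: test whether the correlation differs significantly from zero
cor.test(mtcars$wt, mtcars$mpg)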
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
With r of about -0.87, I would say there is a strong negative linear relationship between the two variables.
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Create the regression equation using the least-squares line
ss_plot(mtcars$wt, mtcars$mpg, leastSquares = TRUE)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 37.285 -5.344
##
## Sum of Squares: 278.322
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
# Estimate mpg for a car that weighs 2,000 lbs; wt is measured in
# 1,000-lb units, so wt = 2: mpg-hat = 37.285 - 5.344*wt
37.285 - 5.344*2
## [1] 26.597
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
# Estimate mpg for a car that weighs 7,000 lbs (wt = 7)
37.285 - 5.344*7
## [1] -0.123
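The same estimates can be obtained with predict() instead of typing the coefficients by hand; fit is a name introduced here just for this sketch:
# Sketch: both estimates via predict() on a fitted model
fit <- lm(mpg ~ wt, data = mtcars)
predict(fit, newdata = data.frame(wt = c(2, 7)))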
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b is reliable: a weight of 2,000 lbs falls within the range of weights in the data. The prediction in part c is not reliable, because 7,000 lbs is well outside the range of the data (extrapolation), and the predicted value of -0.123 mpg is impossible for a real car.
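A quick check of the observed weight range backs this up:
# Observed range of car weights (in 1,000 lbs)
range(mtcars$wt)
## [1] 1.513 5.424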
What percent of the variation in a car’s mpg is explained by the car’s weight?
# Create regression line with lm() function and store into object called lm1
lm1 <- lm(mpg~wt, data = mtcars)
# Summary of lm1 model
summary(lm1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
summary(lm1)$r.squared
## [1] 0.7528328
The multiple R-squared is 0.7528, so about 75.3% of the variation in a car’s mpg is explained by the car’s weight.
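For simple linear regression, R-squared is just the square of the correlation coefficient from earlier, which gives a quick consistency check:
# R-squared equals the squared correlation for simple regression
cor(mtcars$wt, mtcars$mpg)^2
## [1] 0.7528328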
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("C:/Users/selma/Desktop/Stats/nscc_data.csv")
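A quick exploratory look before modeling; the columns referenced below (Height, ShoeLength, PulseRate) are the ones used in the rest of the analysis:
# Familiarize with the structure and summary statistics
str(nscc_student_data)
summary(nscc_student_data[, c("Height", "ShoeLength", "PulseRate")])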
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Scatterplot with Height as the response variable and ShoeLength as the explanatory variable
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, xlab = "ShoeLength", ylab = "Height")
# Scatterplot with Height as the response variable and PulseRate as the explanatory variable
plot(nscc_student_data$PulseRate, nscc_student_data$Height, xlab = "PulseRate", ylab = "Height")
Based on the scatterplot of Height versus ShoeLength, there appears to be a moderate positive linear relationship, so this explanatory variable looks like the better predictor of height.
Based on the scatterplot of Height versus PulseRate, there is at most a weak, poorly defined linear relationship, so this explanatory variable appears to be a weak predictor of height.
Use use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Calculate the correlation coefficients for each pair in #5
cor(nscc_student_data$Height, nscc_student_data$ShoeLength, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
Shoe length is the better predictor of height because it has the larger correlation coefficient (0.27 versus 0.20), although both correlations are weak.
# Scatterplot of Height ~ ShoeLength
plot(nscc_student_data$Height ~ nscc_student_data$ShoeLength)
# Add the regression line to the plot with abline()
lm2 <- lm(Height ~ ShoeLength, data = nscc_student_data)
abline(lm2)
summary(lm2)
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0260 -4.4458 0.2759 3.2665 8.8419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.365 3.716 16.246 <2e-16 ***
## ShoeLength 0.566 0.352 1.608 0.117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.533 on 33 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.07268, Adjusted R-squared: 0.04458
## F-statistic: 2.586 on 1 and 33 DF, p-value: 0.1173
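As with the mtcars model, the R-squared can be pulled out directly; at roughly 0.07 it confirms a weak relationship:
# Proportion of variation in Height explained by ShoeLength
summary(lm2)$r.squared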
Based on the model, the predicted height of a person with a 10-inch shoe length is 60.365 + 0.566*10, which is about 66 inches, or roughly 5.5 ft.
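The same estimate via predict():
# Predicted height (inches) for ShoeLength = 10
predict(lm2, newdata = data.frame(ShoeLength = 10))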
A 10-inch shoe length is within the range of the observed data, so the prediction is an interpolation and should be relatively reliable in that sense. However, the correlation coefficient is only 0.2695881, which is not a strong relationship, so the prediction should not be trusted too far.
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected Height and PulseRate to have little or no relationship.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I think outliers in the data weakened the relationship: a few unusual observations can pull the regression line away from the bulk of the points and shrink the correlation, so the model did not turn out to be very strong.
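One way to check this explanation is to look for unusual points directly; a minimal sketch:
# Look for outliers: boxplot of ShoeLength and the largest
# absolute residuals from the fitted model
boxplot(nscc_student_data$ShoeLength, ylab = "ShoeLength")
head(sort(abs(resid(lm2)), decreasing = TRUE))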