Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

plot(mpg ~ wt, data=mtcars, main="Miles Per Gallon and Weight of Cars", xlab="Weight of Car (in 1000 lbs)", ylab="Miles Per Gallon of Car")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

Judging from the scatterplot alone, there appears to be a strong negative linear relationship between the weight and mpg of a car.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

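Based on the correlation coefficient of about -0.868, the two variables have a strong negative linear relationship: as the weight of a car increases, its mpg tends to decrease.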

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

lm1 <- lm(mpg ~ wt, data=mtcars)
lm1

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

\(y= 37.285+ (-5.344*x)\)

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

37.285+(-5.344*2)

## [1] 26.597

26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

37.285+(-5.344*7)

## [1] -0.123

Based on the equation, the estimated fuel efficiency is -0.123 mpg, a negative value that is not physically possible.
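As a cross-check on the hand calculations in parts b and c, the same estimates can be obtained from the fitted model with `predict()`. This is a minimal sketch that refits the model so it runs on its own; the names `new_cars` and `preds` are illustrative and not part of the original report. Because `predict()` uses the unrounded coefficients, its results can differ from the hand calculations in the third decimal place.

```r
# Refit the model from Question 3 so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is measured in 1000 lbs, so 2,000 lbs is 2 and 7,000 lbs is 7
new_cars <- data.frame(wt = c(2, 7))

# predict() applies the unrounded intercept and slope to each new weight
preds <- unname(predict(lm1, newdata = new_cars))
round(preds, 3)
```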

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

The prediction from part b is reliable because a weight of 2,000 lbs falls within the range of car weights in the dataset. The prediction from part c is not reliable because 7,000 lbs lies outside the range of weights in the dataset, so the model is extrapolating beyond the data it was fit on.
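One way to support the answer to part d is to compare each weight against the range of weights actually observed in mtcars. The sketch below does that; the helper `in_range` is a hypothetical name introduced here for illustration.

```r
# Observed range of car weights in mtcars (units: 1000 lbs)
wt_range <- range(mtcars$wt)
wt_range

# Hypothetical helper: TRUE when a weight lies inside the observed range
in_range <- function(w) w >= wt_range[1] & w <= wt_range[2]

# 2 (2,000 lbs) is inside the range; 7 (7,000 lbs) is outside it
in_range(c(2, 7))
```

A prediction for a weight outside this range is an extrapolation, which is why the part c estimate should not be trusted.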

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

summary(lm1)

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

About 75.28% of the variation in a car’s mpg is explained by the car’s weight (Multiple R-squared = 0.7528).
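The explained variation can also be pulled directly from the model summary rather than read off the printed output. The following sketch refits the model so it is self-contained, and also checks that, for a model with a single predictor, R-squared equals the square of the correlation coefficient from Question 2. The names `r_squared` and `r` are illustrative.

```r
# Refit the model from Question 3 so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# Extract Multiple R-squared from the summary object
r_squared <- summary(lm1)$r.squared
round(r_squared, 4)

# With one predictor, R-squared is the squared correlation coefficient
r <- cor(mtcars$wt, mtcars$mpg)
round(r^2, 4)
```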

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main="Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")

plot(nscc_student_data$PulseRate, nscc_student_data$Height, main="Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")

Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

Shoe length and height appear to have a weak positive linear correlation. Pulse rate and height appear to have no linear correlation at all. Based on these observations, shoe length appears to be the better predictor of height.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Shoe length has a correlation coefficient of 0.2695881, and pulse rate has a correlation coefficient of 0.2028639. Based solely on these numbers, shoe length is the better predictor, although both correlations are weak. This matches the prediction I made from the scatterplots.
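To put the two coefficients on a more interpretable scale, they can be squared to give the share of variation in height that each variable would explain in a one-predictor linear model. This sketch uses the correlation values reported above rather than reloading the CSV file; the variable names are illustrative.

```r
# Correlation coefficients as reported in the output above
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# Squared correlations: proportion of variation in height explained
r_sq <- c(ShoeLength = r_shoe^2, PulseRate = r_pulse^2)

# Express as percentages, rounded to one decimal place
round(100 * r_sq, 1)
```

Both percentages are in the single digits, which suggests that neither variable explains much of the variation in height.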

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

lm2 <- lm(Height ~ ShoeLength, data=nscc_student_data)

Use that model to predict the height of someone who has a 10" shoe length.

lm2

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

The equation is \(y=60.365+0.566*x\). Plugging a shoe length of 10" into that equation gives a predicted height of 66.025.
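The substitution can be written out in R using the rounded coefficients printed above. This is a sketch only: `predict_height` is a hypothetical helper, and it hard-codes the rounded intercept and slope instead of using the fitted `lm2` object, so it does not need the CSV file.

```r
# Hypothetical helper built from the rounded coefficients of lm2
predict_height <- function(shoe_length) {
  60.365 + 0.566 * shoe_length
}

# Predicted height for a 10" shoe length
height_10 <- predict_height(10)
height_10
```

With the dataset loaded, `predict(lm2, newdata = data.frame(ShoeLength = 10))` would give the same estimate from the unrounded coefficients.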

Do you think that prediction is an accurate one? Explain why or why not.

No, the prediction does not appear to be accurate. The correlation coefficient was low (about 0.27), and the data sample was small. Both of these point toward the model being a poor predictor.

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?

I expected pulse rate and height to have no correlation before the analysis. Based on what I know about the two variables, I did not expect to find a relationship between them, and after testing it appears that this expectation was right.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?

I expected the relationship between shoe length and height to be strong, but it does not appear to be. This does make sense, because the idea that taller people have longer shoes is generally true but not always true. I also imagine that many other factors play into height.