Project #7 - Linear Correlation and Regression

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars
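A quick exploratory pass might look like the sketch below. It uses only base R functions and assumes the built-in `mtcars` data; the name `wt_range` is introduced here for illustration. It is worth checking the units of each column at this stage: the `mtcars` help page documents `wt` as weight in units of 1000 lbs.

```r
# Load the built-in dataset and inspect its structure and first rows
data(mtcars)
str(mtcars)
head(mtcars)

# Summaries of the two variables used in Part 1
summary(mtcars[, c("mpg", "wt")])

# Range of observed weights; wt is recorded in units of 1000 lbs
wt_range <- range(mtcars$wt)
wt_range
```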

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Plot mpg by wt
plot(mpg ~ wt, data = mtcars)

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
It appears to be a negative linear relationship, although it is not very tight.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

# calculate correlation coefficient
cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong negative linear relationship between the two variables (r ≈ -0.87): heavier cars tend to get fewer miles per gallon.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Create linear model relating miles per gallon to weight
lm_wt_mpg <- lm(mpg ~ wt, data=mtcars)

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Extract coefficients from linear model
summary(lm_wt_mpg)$coefficients[,1]

## (Intercept)          wt 
##   37.285126   -5.344472

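# Predict mpg at 2,000 lbs (wt is recorded in units of 1000 lbs, so wt = 2)
37.285126 - 5.344472*2

## [1] 26.59618

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Extrapolate linear model to 7,000 lbs (wt = 7)
37.285126 - 5.344472*7

## [1] -0.126178

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b (about 26.6 mpg) is plausible, because 2,000 lbs lies within the range of weights in the dataset (roughly 1,500 to 5,400 lbs). The prediction in part c is not reliable: 7,000 lbs is well beyond the heaviest car in the data, so the model is being extrapolated, and the result is a negative mpg, which is meaningless.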
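As a cross-check on hand calculations like those in parts b and c, `predict()` can be given the new weights directly. The sketch below assumes `wt` is in units of 1000 lbs, as documented on the `mtcars` help page, so 2,000 lbs and 7,000 lbs are entered as 2 and 7. The names `new_cars`, `pred`, and `in_range` are introduced here for illustration.

```r
# Fit the model, then predict with weights expressed in 1000-lb units
lm_wt_mpg <- lm(mpg ~ wt, data = mtcars)
new_cars <- data.frame(wt = c(2, 7))   # 2,000 lbs and 7,000 lbs
pred <- predict(lm_wt_mpg, newdata = new_cars)

# Flag whether each new weight falls inside the observed weight range
in_range <- new_cars$wt >= min(mtcars$wt) & new_cars$wt <= max(mtcars$wt)

data.frame(new_cars, mpg_hat = pred, in_range = in_range)
```

A prediction flagged as out of range is an extrapolation and should be treated with caution, whatever its value.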

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

# Extract explained variation from linear model
summary(lm_wt_mpg)$r.squared

## [1] 0.7528328

About 75% of the variation in a car’s mpg (R² ≈ 0.753) can be explained by the car’s weight.
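In simple linear regression with one predictor, R² is the square of the correlation coefficient, which ties this answer back to Question 2. A minimal check, using the names `r` and `r_squared` for illustration:

```r
# Refit the model so this chunk runs on its own
lm_wt_mpg <- lm(mpg ~ wt, data = mtcars)

r <- cor(mtcars$wt, mtcars$mpg)            # correlation from Question 2
r_squared <- summary(lm_wt_mpg)$r.squared  # explained variation from the model

# The two quantities agree up to floating-point error
c(r_squared_from_cor = r^2, r_squared_from_lm = r_squared)
isTRUE(all.equal(r^2, r_squared))
```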

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

plot(Height ~ ShoeLength, data=nscc_student_data)

plot(Height ~ PulseRate, data=nscc_student_data)

Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.
Neither Shoe Length nor Pulse Rate appears to have a clear linear relationship with Height. Neither looks like the better predictor, since both plots show wide scatter with no obvious overall trend.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

# Calculate correlation coefficients of pulse rate and shoe length with height
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use="pairwise.complete.obs")

## [1] 0.2028639

cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use="pairwise.complete.obs")

## [1] 0.2695881

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
According to the correlation coefficients, Shoe Length (r ≈ 0.27) is a better predictor of height than Pulse Rate (r ≈ 0.20), although both correlations are weak.
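To see what `use = "pairwise.complete.obs"` does, the sketch below runs `cor()` on a small made-up data frame. The `toy` values are hypothetical and are not taken from nscc_student_data.csv; they exist only to show how missing values are handled.

```r
# Hypothetical toy data (not from nscc_student_data.csv) with missing values
toy <- data.frame(
  Height     = c(64, 70, NA, 67, 72, 61),
  ShoeLength = c(9.5, 11, 10, NA, 12, 9)
)

# Default behaviour: any NA in the inputs makes the result NA
r_default <- cor(toy$ShoeLength, toy$Height)

# Pairwise deletion: only rows where both values are present are used
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")

# Equivalent manual approach: drop incomplete rows first
ok <- complete.cases(toy)
r_manual <- cor(toy$ShoeLength[ok], toy$Height[ok])

c(default = r_default, pairwise = r_pairwise, manual = r_manual)
```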

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

# Create linear model of shoe length to height
lm_sl_height <- lm(Height ~ ShoeLength, data=nscc_student_data)

Use that model to predict the height of someone who has a 10" shoe length.

# Extract coefficients from linear model
summary(lm_sl_height)$coefficients[,1]

## (Intercept)  ShoeLength 
##  60.3654950   0.5660485

# Calculate predicted value
60.3655 + 0.5660485*10

## [1] 66.02598

Do you think that prediction is an accurate one? Explain why or why not.
I do not believe that this prediction is accurate because, at shoe lengths close to 10", heights range from 60" to 70".
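One way to quantify that spread is a prediction interval around the point estimate. The sketch below uses simulated stand-in data so that it runs without the CSV file; the data frame `sim` and its numbers are assumptions for illustration only, and in the real report the model fitted to `nscc_student_data` would be used instead.

```r
# Simulated stand-in data; replace with nscc_student_data in the real report
set.seed(1)
sim <- data.frame(ShoeLength = runif(40, min = 8, max = 13))
sim$Height <- 60 + 0.6 * sim$ShoeLength + rnorm(40, sd = 3)

fit <- lm(Height ~ ShoeLength, data = sim)

# 95% prediction interval for one new person with a 10" shoe length
pi_10 <- predict(fit, newdata = data.frame(ShoeLength = 10),
                 interval = "prediction", level = 0.95)
pi_10   # columns: fit (point estimate), lwr, upr
```

A wide interval relative to the range of plausible heights indicates that the point prediction carries little practical information.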

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected shoe length to have a poor relationship with height.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I believed the model for weight predicting mpg would be stronger. The relationship would most likely be better captured by a nonlinear model, such as exponential decay.
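The exponential decay idea can be tried by regressing `log(mpg)` on `wt`, which corresponds to mpg = exp(a + b * wt). This is a rough sketch, not a full model comparison: the helper `rmse()` and the object names are introduced here for illustration, and both models are compared on the original mpg scale so the errors are in the same units.

```r
# Straight-line model and a log-linear (exponential decay) alternative
lm_linear <- lm(mpg ~ wt, data = mtcars)
lm_exp    <- lm(log(mpg) ~ wt, data = mtcars)   # mpg = exp(a + b * wt)

# Root-mean-square error on the original mpg scale
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_linear <- rmse(mtcars$mpg, fitted(lm_linear))
rmse_exp    <- rmse(mtcars$mpg, exp(fitted(lm_exp)))  # back-transform from log scale

c(linear = rmse_linear, exponential = rmse_exp)
```

A smaller error for the exponential model would support the nonlinear explanation, though a residual plot for each model is a useful complement to a single summary number.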