Project #7 - Linear Correlation and Regression

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# MPG by weight
plot(mpg ~ wt, data = mtcars, main="Miles Per Gallon and Weight of Cars", xlab="Weight of Car (in 1000 lbs)", ylab="Miles Per Gallon of Car")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

Yes, there appears to be a negative linear relationship between the weight and miles per gallon of a car: as weight increases, mpg decreases, and the points fall fairly close to a downward-sloping line.
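
One way to check a visual impression like this is to overlay the least-squares line on the scatterplot. The following is a minimal sketch using base R's plot(), lm(), and abline(); the object name fit_check is a hypothetical one chosen for illustration and is not part of the assignment.

```r
# Scatterplot of mpg against weight (wt is recorded in 1000 lbs in mtcars)
plot(mpg ~ wt, data = mtcars,
     main = "Miles Per Gallon and Weight of Cars",
     xlab = "Weight of Car (in 1000 lbs)",
     ylab = "Miles Per Gallon of Car")

# fit_check is a hypothetical name used only for this illustration
fit_check <- lm(mpg ~ wt, data = mtcars)

# Overlay the fitted least-squares line on the existing plot
abline(fit_check, col = "red", lwd = 2)
```

If the points scatter evenly around the line with no obvious curve, a linear description is reasonable.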

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

# Correlation coefficient
cor(mtcars$mpg, mtcars$wt)

## [1] -0.8676594

The linear correlation coefficient of these two variables is about -0.868.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

There is a strong, negative linear relationship between these variables.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Create regression line with lm() function and store into object called lm1
lm1 <-  lm(formula = mpg ~ wt, data = mtcars)
lm1

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

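The regression line is \(y = 37.285 - 5.344x\), where \(x\) is the car's weight in thousands of pounds (the units of wt in mtcars) and \(y\) is the predicted mpg. A 2,000 lb car therefore has \(x = 2\).

# Predicted mpg for a 2,000 lb car (wt = 2)
37.285 - 5.344*2

## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Predicted mpg for a 7,000 lb car (wt = 7)
37.285 - 5.344*7

## [1] -0.123

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

The prediction in part b is reasonable because a weight of 2,000 lbs falls within the range of weights in the dataset (about 1,500 to 5,400 lbs). The prediction in part c is not reliable: 7,000 lbs is well outside that range, so the model is extrapolating, and the resulting negative mpg is impossible.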
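
The same estimates can be obtained from the fitted model itself rather than from rounded coefficients. The sketch below assumes the weights are entered in thousands of pounds, matching how wt is recorded in mtcars; new_cars, predicted_mpg, wt_range, and outside_range are hypothetical names introduced only for this illustration.

```r
# Refit the model from Question 3 so this block runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# new_cars is a hypothetical data frame; wt is in 1000 lbs (2,000 and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))

# Predicted mpg using the unrounded coefficients stored in lm1
predicted_mpg <- predict(lm1, newdata = new_cars)
predicted_mpg

# Observed weight range, used to flag predictions that require extrapolation
wt_range <- range(mtcars$wt)
outside_range <- new_cars$wt < wt_range[1] | new_cars$wt > wt_range[2]
outside_range
```

A TRUE value in outside_range marks a weight beyond the observed data, where the fitted line should not be trusted.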

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

summary(lm1)$r.squared

## [1] 0.7528328

About 75.3% of the variation in a car’s mpg is explained by the car’s weight.
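
For a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient, which links this answer to Question 2. The short check below is a sketch of that relationship; r and r_squared are hypothetical names used only here.

```r
# Refit the model from Question 3 so this block runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# Correlation coefficient between mpg and weight
r <- cor(mtcars$mpg, mtcars$wt)

# Proportion of variation in mpg explained by weight
r_squared <- summary(lm1)$r.squared

# Compare the squared correlation with R-squared (up to floating-point tolerance)
all.equal(r^2, r_squared)
```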

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("C:/Users/samura641/Desktop/Honor Statistics/nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

# Scatterplot with Height as the response and ShoeLength as the explanatory variable
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")

# Scatterplot with Height as the response and PulseRate as the explanatory variable
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")

b.) Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

Based on the scatterplot of height and shoe length, there appears to be a weak positive linear relationship between the two variables; the correlation of 0.2695881 is close to 0.

Based on the scatterplot of height and pulse rate, there is no clearly defined linear relationship between the two variables; the correlation of 0.2028639 is also close to 0. Neither variable looks like a strong predictor of height, though shoe length appears slightly better.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

# correlation coefficients 
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Shoe length is the better predictor of height because its correlation coefficient is larger in absolute value (about 0.27 versus 0.20), although both correlations are weak.
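
To see what use = "pairwise.complete.obs" does, the sketch below runs cor() on two small made-up vectors that each contain a missing value. The vectors shoe_demo and height_demo are hypothetical toy data, not values from nscc_student_data.

```r
# Hypothetical toy vectors, each with one missing value
shoe_demo   <- c(8, 9, NA, 11, 12)
height_demo <- c(62, 64, 66, NA, 70)

# With the default use = "everything", any NA makes the result NA
default_cor <- cor(shoe_demo, height_demo)
default_cor

# pairwise.complete.obs drops every pair in which either value is missing
pairwise_cor <- cor(shoe_demo, height_demo, use = "pairwise.complete.obs")
pairwise_cor

# Number of pairs that were actually used in the calculation
n_used <- sum(complete.cases(shoe_demo, height_demo))
n_used
```

Because rows with missing values are dropped, the two correlations in this question may be based on slightly different sets of students.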

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

# Create the linear model, store it in an object called lm2, and print it
lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_student_data)
lm2

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Use that model to predict the height of someone who has a 10" shoe length.

60.365 + 0.566*10

## [1] 66.025

Do you think that prediction is an accurate one? Explain why or why not.

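The estimate of about 66 inches is reasonable as a rough guess, and the scatterplot I made of height and shoe length shows heights at or around that value for shoe lengths near 10 inches. It is not a precise prediction, though: with a correlation of only 0.2695881, shoe length explains about 7% of the variation in height (0.2695881² ≈ 0.073), so actual heights vary widely around 66 inches.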

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?

I expected little or no correlation between pulse rate and height because there is no obvious reason why a person’s pulse rate would be related to their height.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?

I expected to see a strong correlation between height and shoe length. This is only a guess, but the correlation may have been weak because some college students are (I’m assuming) still growing, and height and foot size increase at different rates while people grow.