Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
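A quick first look at a dataset can be done with a few base R calls. The sketch below uses the built-in mtcars data; the same calls should work on nscc_student_data once it has been read in. The names `dims` and `na_counts` are illustrative only.

```r
# Exploratory look at the built-in mtcars dataset
data(mtcars)

dims <- dim(mtcars)                  # number of rows and columns
str(mtcars)                          # variable names and types
summary(mtcars[, c("mpg", "wt")])    # summaries of the two variables used in Part 1
head(mtcars)                         # first six rows

# Count missing values per column (worth knowing before calling cor())
na_counts <- colSums(is.na(mtcars))
na_counts
```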

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Creating a scatterplot
plot(mpg ~ wt, data = mtcars, main = "Weight and MPG of the cars", xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, there appears to be a negative linear relationship between weight and mpg (mpg tends to fall as weight rises), but from the plot alone it looks only moderate.
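One optional way to judge linearity by eye is to overlay a least-squares line on the scatterplot. This is a sketch going slightly beyond what the question asks; the object name `fit` and the axis labels are my own choices.

```r
# Scatterplot of mpg against weight with the least-squares line overlaid
fit <- lm(mpg ~ wt, data = mtcars)

plot(mpg ~ wt, data = mtcars,
     main = "Weight and MPG of the cars",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit, col = "red", lwd = 2)   # fitted straight line for visual comparison
```

If the points scatter roughly evenly around the line with no obvious curve, a linear description is reasonable.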

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

# Calculating the linear correlation coefficient
cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

The linear correlation coefficient is -0.8676594.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The relationship is strong and negative: as the weight of a car increases, its mpg tends to decrease.
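Labels such as "strong" are a judgment call. As a supplementary sketch (not required by the question), cor.test() reports the same coefficient together with a confidence interval and a p-value, which can help support the description. The object name `ct` is illustrative.

```r
# Correlation test for weight and mpg in the built-in mtcars data
ct <- cor.test(mtcars$wt, mtcars$mpg)

ct$estimate    # the Pearson correlation coefficient
ct$conf.int    # 95% confidence interval for the correlation
ct$p.value     # p-value for the null hypothesis of zero correlation
```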

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Regression Equation
(lm1 <- lm(mpg ~ wt, data = mtcars))

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

The equation, with x as the weight in units of 1,000 lbs and y as the predicted mpg, is:
y = 37.285 - 5.344x

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Predicted mpg of a car that weighs 2,000 lbs (wt is in 1,000 lbs, so wt = 2)
37.285 - 5.344 * 2

## [1] 26.597

26.597 mpg

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Predicted mpg of a car that weighs 7,000 lbs (wt is in 1,000 lbs, so wt = 7)
37.285 - 5.344 * 7

## [1] -0.123

-0.123 mpg
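The same estimates can also be obtained with predict() instead of typing the rounded coefficients by hand. This is a minimal sketch that refits the model so it runs on its own; `fit`, `new_cars`, and `preds` are illustrative names. Because predict() uses the unrounded coefficients, the results may differ from the hand calculations in the third decimal place.

```r
# Refit the model and predict mpg for cars weighing 2,000 and 7,000 lbs
fit <- lm(mpg ~ wt, data = mtcars)

new_cars <- data.frame(wt = c(2, 7))   # wt is recorded in units of 1,000 lbs
preds <- predict(fit, newdata = new_cars)
preds
```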

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction for the 2,000 lb car is reasonably reliable: that weight falls within the range of weights in the dataset (about 1,500 to 5,400 lbs), the correlation is strong, and roughly 27 mpg is plausible for a small car. The prediction for the 7,000 lb car is not reliable: that weight is well beyond the heaviest car in the data, so the model is extrapolating, and the result is a negative mpg, which is impossible.
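Whether a prediction is an interpolation or an extrapolation can be checked against the range of weights actually observed in the data. The sketch below does this for the two weights from parts b and c; the variable names are illustrative.

```r
# Compare the weights used for prediction with the observed range of wt
wt_range <- range(mtcars$wt)    # smallest and largest weight, in 1,000 lbs
new_wt   <- c(2, 7)             # 2,000 lbs and 7,000 lbs

inside <- new_wt >= wt_range[1] & new_wt <= wt_range[2]
data.frame(wt = new_wt, within_observed_range = inside)
```

A weight outside the observed range means the regression line is being extended past the data, where there is no evidence that the linear pattern continues.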

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

# percent of variation
summary(lm1)$r.squared

## [1] 0.7528328

About 75.3% of the variation in a car’s mpg is explained by the car’s weight.
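For a regression with a single predictor, the proportion of explained variation equals the square of the correlation coefficient. A minimal sketch confirming this on mtcars (object names are illustrative):

```r
# R-squared from the model versus the squared correlation coefficient
fit <- lm(mpg ~ wt, data = mtcars)

r                    <- cor(mtcars$wt, mtcars$mpg)
r_squared_from_cor   <- r^2
r_squared_from_model <- summary(fit)$r.squared

all.equal(r_squared_from_cor, r_squared_from_model)
```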

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("~/Desktop/nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

# scatterplot of shoe length and height
plot(Height ~ ShoeLength, data = nscc_student_data, main = "Shoe Length and Height", xlab = "Shoe Length")

# scatterplot of pulse rate and height
plot(Height ~ PulseRate, data = nscc_student_data, main = "Pulse Rate and Height", xlab = "Pulse Rate")

Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.
Shoe length and height have a weak positive linear relationship.
There appears to be no relationship at all between pulse rate and height.
If we had to choose between the two, shoe length would be the better predictor of height, but compared with other possible variables I don’t think either would be a good predictor, because neither relationship is strong enough.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

# correlation of shoe length and height
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

# correlation of Pulse rate and height
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
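Shoe length would be the better predictor of height, because its correlation with height (about 0.27) is larger than that of pulse rate (about 0.20), although both are weak.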
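To see what use = "pairwise.complete.obs" does, here is a small sketch on hypothetical toy data (made-up values, not the NSCC dataset). With the default settings cor() returns NA as soon as either vector contains a missing value; the pairwise option drops the incomplete pairs first.

```r
# Hypothetical toy data with one missing value in each column
toy <- data.frame(
  shoe   = c(9, 10, NA, 11, 12),
  height = c(62, 65, 66, NA, 71)
)

# Default behaviour: any NA makes the result NA
r_default <- cor(toy$shoe, toy$height)

# Pairwise deletion: only rows where both values are present are used
r_pairwise <- cor(toy$shoe, toy$height, use = "pairwise.complete.obs")

# The same result, computed by removing incomplete rows by hand
ok       <- complete.cases(toy)
r_manual <- cor(toy$shoe[ok], toy$height[ok])

c(default = r_default, pairwise = r_pairwise, manual = r_manual)
```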

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

# Linear model
(lm2 <- lm(Height ~ ShoeLength, data = nscc_student_data))

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

The equation, with x as the shoe length and y as the predicted height (both in inches), is:
y = 60.365 + 0.566x

Use that model to predict the height of someone who has a 10" shoe length.

# Height of someone with a 10" shoe length
60.365 + 0.566 * 10

## [1] 66.025

The height would be 66.025 inches.

Do you think that prediction is an accurate one? Explain why or why not.
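Not very. The prediction of about 66 inches is a plausible height and falls within the cloud of points in the scatterplot, but the correlation between shoe length and height is weak (about 0.27). The points are widely scattered around the line, so an individual’s actual height could differ noticeably from the prediction.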
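As a rough check on how much weight to put on this prediction, the correlation coefficients reported in Question 6 can be squared to give the share of variation in height that each predictor explains. The sketch below simply reuses those two printed values, so it runs without the data file; the variable names are illustrative.

```r
# Correlation coefficients copied from the Question 6 output
r_shoe  <- 0.2695881   # shoe length and height
r_pulse <- 0.2028639   # pulse rate and height

# Squared correlations: proportion of variation in height explained
r2_shoe  <- r_shoe^2
r2_pulse <- r_pulse^2

# Expressed as percentages, rounded to one decimal place
round(100 * c(shoe_length = r2_shoe, pulse_rate = r2_pulse), 1)
```

Both percentages are in the single digits, which is consistent with treating these as poor models.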

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
Height and pulse rate.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
Usually that expectation is based on the people around us. I don’t know whether the NSCC student data was randomly selected, but it lets us look at more people at once than our everyday observations do. So there are many possible explanations.