Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.


Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 – mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars
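
One possible way to get familiar with the dataset is sketched below using base R functions only. Which summaries to look at is an assumption here, not something the assignment prescribes.

```r
# Structure: number of observations, variable names, and types
str(mtcars)

# First few rows
head(mtcars)

# Five-number summary and mean for the two variables used in Questions 1-4
summary(mtcars[, c("wt", "mpg")])
```

Running `?mtcars` also opens the help page, which lists the units of each variable.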

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Create scatterplot
plot(mtcars$wt, mtcars$mpg, main = "Miles Per Gallon and Weight of Cars", xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, there appears to be a negative linear relationship: the points follow a roughly straight, downward trend, so heavier cars tend to get fewer miles per gallon.
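
A visual check that may help with this judgment is to overlay a least-squares line on the scatterplot. The sketch below uses lm() ahead of Question 3 purely as a drawing aid; the color and line width are arbitrary choices.

```r
# Redraw the scatterplot from part a
plot(mtcars$wt, mtcars$mpg,
     main = "Miles Per Gallon and Weight of Cars",
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon")

# Overlay the least-squares line; points scattered evenly around it
# are consistent with a linear relationship
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)
```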

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

# Calculate correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient is about -0.87, which is close to -1. This indicates a strong negative linear relationship between the two variables: as weight increases, mpg tends to decrease.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Find the regression equation
(lm1<-lm(mpg ~ wt, data = mtcars))
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

y = 37.285 - 5.344x, where x is the weight in thousands of pounds and y is the predicted mpg.

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Estimate the mpg of 2,000 lb cars
-5.344*2 + 37.285
## [1] 26.597

A 2,000 lb car can get an estimated 26.6 mpg.
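
As an alternative to typing the rounded coefficients by hand, the same estimate can be obtained from the fitted model with predict(). This is a sketch; the model is refit here only so the chunk runs on its own.

```r
# Refit the model from part a so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in thousands of pounds, so 2,000 lbs corresponds to wt = 2
predict(lm1, newdata = data.frame(wt = 2))

# The same value computed from the stored (unrounded) coefficients
unname(coef(lm1)["(Intercept)"] + coef(lm1)["wt"] * 2)
```

Because predict() uses the unrounded coefficients, its result can differ from the hand calculation in the later decimal places.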

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Estimate the mpg of 7,000 lb cars
-5.344*7 + 37.285
## [1] -0.123

The model estimates -0.12 mpg for a 7,000 lb car, which is an impossible value.

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

The estimate in part b is reasonably reliable: 26.6 mpg is a realistic value, and a 2,000 lb car falls within the range of weights in the dataset. The estimate in part c is not reliable. A 7,000 lb car is heavier than any car in mtcars, so the prediction is an extrapolation, and the negative mpg it produces is impossible. The regression equation should only be trusted for weights within the range of the observed data.
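
One way to check whether a prediction is an extrapolation is to compare the input against the observed range of the explanatory variable. In the sketch below, in_observed_range() is a hypothetical helper written for this illustration, not a base R function.

```r
# Observed weights in mtcars, in thousands of pounds (minimum, maximum)
wt_range <- range(mtcars$wt)
wt_range

# Hypothetical helper: TRUE when a weight lies inside the observed range
in_observed_range <- function(wt) wt >= wt_range[1] & wt <= wt_range[2]

# Check the weights from parts b and c (2,000 lbs and 7,000 lbs)
in_observed_range(c(2, 7))
```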

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

# Find the summary of the lm model
summary(lm1)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The Multiple R-squared value from the summary is 0.7528, so approximately 75.3% of the variation in mpg is explained by the car’s weight.
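
The R-squared value can also be pulled out of the summary object directly instead of being read off the printout. The sketch below also checks it against the squared correlation coefficient, which is the same quantity when the model has a single predictor.

```r
# Refit the model so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# Coefficient of determination stored in the model summary
r_squared <- summary(lm1)$r.squared

# Express it as a percentage, rounded to one decimal place
round(100 * r_squared, 1)

# With one predictor, R-squared equals the squared correlation coefficient
all.equal(r_squared, cor(mtcars$wt, mtcars$mpg)^2)
```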

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("E:/nscc_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

  1. Create two scatterplots, both with height as the response variable: one with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

# Create a scatterplot of height and shoe length
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "NSCC Students' Shoe Lengths and Heights", xlab = "Shoe Length in Inches", ylab = "Height in Inches")

# Create a scatterplot of height and pulse rate
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "NSCC Students' Pulse Rates and Heights", xlab = "Pulse Rate in BPM", ylab = "Height in Inches")

  2. Discuss the two scatterplots individually. Based only on glancing at the scatterplots, does there appear to be a linear relationship between the variables? If so, is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

Based on the scatterplot of shoe length and height, there appears to be a linear relationship. However, the plot includes points that may be data-entry errors (a 6’2" male with a 7" shoe and a 5’7" female with a 20" shoe), and these make the relationship look weak. With those points removed, the relationship appears to be moderate and positive.

Based on the scatterplot of pulse rate and height, there is no apparent linear relationship. Height does not seem to depend on pulse rate. Again, there is one record that is probably a typo (a male recorded as 6 inches tall with a pulse rate of 50; it looks like a zero was dropped from the height). Even with that outlier removed, there is little to no linear relationship between height and pulse rate.
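
A sketch of how suspicious records might be flagged and set aside before re-examining the relationship. The data frame below is made up for illustration and is NOT the real nscc_student_data; it only borrows the column names. The plausibility limits are assumptions chosen for this example.

```r
# Made-up rows for illustration only; not the real nscc_student_data
toy <- data.frame(
  Height     = c(74, 67, 6, 65, 70, 62),
  ShoeLength = c(7, 20, 11, 9, 11.5, 8.5),
  PulseRate  = c(72, 80, 50, 64, 70, 88)
)

# Assumed plausibility limits, in inches; adjust to suit the real data
plausible <- with(toy, Height >= 48 & Height <= 84 &
                        ShoeLength >= 8 & ShoeLength <= 15)

# Display the flagged rows rather than silently deleting them
toy[!plausible, ]

# Correlation before and after setting the flagged rows aside
cor(toy$ShoeLength, toy$Height)
cor(toy$ShoeLength[plausible], toy$Height[plausible])
```

Any rows excluded this way should be reported along with the reason, since dropping observations changes the results.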

Question 6 – Calculate correlation coefficients

  1. Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

  2. Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

# Find the correlation coefficients for shoe length and pulse rate in comparison to height
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639

Based on the correlation coefficients, shoe length (r = 0.27) is a better predictor of a student’s height than pulse rate (r = 0.20). However, both are weak predictors: each coefficient is well below 0.5 and far from 1, which indicates a weak linear relationship.
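
To illustrate what the use argument does, the sketch below runs cor() on two short made-up vectors with missing values. The numbers are invented for this example and are not taken from nscc_student_data.

```r
# Made-up vectors with one missing value each
x <- c(9, 10, NA, 11, 12, 8)
y <- c(64, 66, 70, NA, 71, 62)

# Default behaviour: any NA in the inputs makes the result NA
cor(x, y)

# Use only the pairs where both values are present
cor(x, y, use = "pairwise.complete.obs")

# Equivalent by hand: drop the incomplete pairs first
keep <- complete.cases(x, y)
cor(x[keep], y[keep])
```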

Question 7 – Creating and using a regression equation

  1. Create a linear model for height as the response variable with shoe length as a predictor variable.
# Create a linear model for height and shoe length
(lm3<-lm(Height ~ ShoeLength, data = nscc_student_data))
## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

y = 60.365 + 0.566x

  2. Use that model to predict the height of someone who has a 10" shoe length.

# Estimate the height of student with a 10 inch shoe length
0.566*10 + 60.365
## [1] 66.025

A student with a 10" shoe length can expect to be approximately 66" tall.

  3. Do you think that prediction is an accurate one? Explain why or why not.

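A height of 5’6" is plausible for a student with a 10" shoe length, so the predicted value itself is reasonable. However, the correlation between shoe length and height is only 0.27, so the model is a weak one and the prediction should not be treated as precise.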

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?

I expected pulse rate to have a poor relationship with height. Heart rate is determined more by a person’s cardiovascular activity and strength, diet, cardiothoracic size, and genetics.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?

I expected shoe length and height to be more strongly related than they turned out to be. Naturally, I think of taller people having (and needing) bigger feet and longer shoes. However, foot size is influenced more by genetics than by height, and it is realistic for a shorter person to have a longer foot or wear a bigger shoe than someone taller. In this sample, the questionable entries noted in Question 5 (a 7" shoe for a 6’2" student and a 20" shoe for a 5’7" student) also weaken the relationship.