Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.


Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars
# Familiarize with the data
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

# Formula is response ~ explanatory, so mpg goes on the y-axis and wt on the x-axis
plot(mpg ~ wt, data = mtcars, main = "Weight and Miles Per Gallon of Cars", xlab = "Weight (1,000 lbs)", ylab = "Miles Per Gallon")
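
Overlaying the least-squares line can make it easier to judge linearity by eye. The following is a minimal sketch using the built-in mtcars data, with weight (recorded in 1,000 lbs) on the horizontal axis; the name fit is illustrative and not part of the assignment.

```r
# Fit the simple linear regression of mpg on weight
fit <- lm(mpg ~ wt, data = mtcars)

# Scatterplot with the explanatory variable (wt) on the x-axis
plot(mpg ~ wt, data = mtcars,
     main = "Weight and Miles Per Gallon of Cars",
     xlab = "Weight (1,000 lbs)", ylab = "Miles Per Gallon")

# Add the fitted regression line to the existing plot
abline(fit, col = "red", lwd = 2)
```

If the points scatter roughly evenly around the line with no obvious curve, a linear model is a reasonable description of the relationship.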

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

The linear correlation coefficient of weight and miles per gallon in the mtcars dataset is approximately -0.868.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The two variables have a strong negative linear relationship: as the weight of a car increases, its mpg tends to decrease.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

lm1 <-  lm(formula = mpg ~ wt, data = mtcars)
lm1
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# wt is recorded in 1,000 lbs, so 2,000 lbs is wt = 2
37.285 + -5.344 * 2
## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# wt is recorded in 1,000 lbs, so 7,000 lbs is wt = 7
37.285 + -5.344 * 7
## [1] -0.123

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

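The prediction in part b is reasonably reliable, but the prediction in part c is not. A weight of 2,000 lbs (wt = 2) falls within the range of weights in the data (1.513 to 5.424, in 1,000 lbs), so the estimate of 26.597 mpg is an interpolation. A weight of 7,000 lbs lies well beyond the heaviest car in the data, so part c is an extrapolation, and the resulting estimate of -0.123 is not a possible mpg for a car.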
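
The extrapolation concern can be checked directly. The sketch below assumes only the built-in mtcars data and the same model as lm1. It compares each requested weight with the observed range of wt and uses predict() instead of hand-typed, rounded coefficients. The names fit, new_cars, wt_range, in_range, and pred are illustrative.

```r
# Refit the same model as lm1 so this block is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in 1,000 lbs, so 2,000 lbs and 7,000 lbs are wt = 2 and wt = 7
new_cars <- data.frame(wt = c(2, 7))

# Observed range of the explanatory variable
wt_range <- range(mtcars$wt)

# TRUE when the requested weight lies inside the observed data
in_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]

# Point predictions using the unrounded coefficients
pred <- predict(fit, newdata = new_cars)

data.frame(wt = new_cars$wt, in_range = in_range, mpg_hat = unname(pred))
```

Because predict() uses the unrounded coefficients, its results may differ from the hand calculations above in the third decimal place.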

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

summary(lm1)$r.squared
## [1] 0.7528328

About 75.3% of the variation in a car’s mpg is explained by the car’s weight.
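
For a simple linear regression with one explanatory variable, R-squared equals the square of the correlation coefficient, which gives a quick consistency check between Questions 2 and 4. The following is a minimal sketch using the built-in mtcars data; the names r, fit, and r_squared are illustrative.

```r
# Correlation coefficient from Question 2
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared from the regression model in Question 3
fit <- lm(mpg ~ wt, data = mtcars)
r_squared <- summary(fit)$r.squared

# The two quantities should agree up to floating-point error
all.equal(r^2, r_squared)

# Express R-squared as a percentage rounded to one decimal place
round(100 * r_squared, 1)
```

Squaring -0.8676594 gives 0.7528328, which matches the r.squared value reported above.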

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

  1. Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.
# Scatter Plot for Shoe Length and Height 
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe Length and Height", xlab = "Shoe Length", ylab = "Height")

# Scatter Plot for Pulse Rate and Height
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")

  2. Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

Shoe length and height appear to have a weak positive linear relationship. Pulse rate and height do not appear to have any linear relationship at all. Based on the scatterplots, shoe length appears to be the better predictor of height.

Question 6 – Calculate correlation coefficients

  1. Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
  2. Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
    Based on the correlation coefficients, shoe length (r ≈ 0.270) is still a better predictor of height than pulse rate (r ≈ 0.203), although both correlations are weak.
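
The use = "pairwise.complete.obs" argument matters because cor() returns NA by default when either vector contains a missing value. The following small sketch uses made-up numbers: the vectors shoe and height are hypothetical and are not taken from nscc_student_data.

```r
# Hypothetical measurements; one height is missing
shoe   <- c(9.5, 10.0, 11.0, 12.0, 10.5)
height <- c(64, 66, NA, 72, 67)

# Default behaviour: a single NA makes the result NA
cor(shoe, height)

# Drop pairs with a missing value before computing r
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# Equivalent manual approach: keep only the complete pairs
ok <- complete.cases(shoe, height)
r_manual <- cor(shoe[ok], height[ok])

# Both approaches should give the same coefficient
all.equal(r_pairwise, r_manual)
```

With only two vectors, the pairwise option simply discards each observation where either value is missing, so the coefficient is based on fewer students than the full dataset contains.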

Question 7 – Creating and using a regression equation

  1. Create a linear model for height as the response variable with shoe length as a predictor variable.
lm2 <-  lm(formula = Height ~ ShoeLength, data = nscc_student_data)
lm2
## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566
  2. Use that model to predict the height of someone who has a 10" shoe length.
60.365 + 0.566*10
## [1] 66.025
  3. Do you think that prediction is an accurate one? Explain why or why not.
    I think this prediction is only roughly accurate, and closer to inaccurate than accurate. Looking back at the scatterplot, the heights of students with a shoe length of about 10 inches are spread loosely between 60 and 70 inches, so the model cannot pin down a single height very well. I also don’t think there was enough data to support a more precise prediction.

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?

I didn’t think that pulse rate and height would have any relationship. The analysis showed that my expectations were justified.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?

I expected shoe length and height to have a strong relationship, or at least a stronger one than what was found. I don’t think there was enough data to show the relationship clearly.