Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.


Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them on your own when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

#Look at the structure of the dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
#Generate a summary of the data
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

#Create a scatterplot
plot(mtcars$wt, mtcars$mpg)

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
There appears to be a negative linear relationship between the weight and mpg of a car.
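
As a visual check on the answer above, the same scatterplot can be drawn with labeled axes and a fitted line overlaid. This is only a sketch: the object name `fit_wt` and the axis labels are my own choices, not part of the assignment.

```r
# Scatterplot of mpg (response) against weight (explanatory) with labeled axes
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "mtcars: mpg vs. weight")

# Overlay the least-squares line as a visual guide
# (fit_wt is a hypothetical name used only in this sketch)
fit_wt <- lm(mpg ~ wt, data = mtcars)
abline(fit_wt, lty = 2)
```

If the points scatter roughly evenly around the dashed line, with no obvious curve, a linear description is reasonable.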

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

#Find the correlation coefficient 
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
Based on the correlation coefficient of about -0.8677, weight and mpg have a strong negative linear relationship: heavier cars tend to get fewer miles per gallon.
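
To show where that number comes from, the sketch below recomputes Pearson's r directly from its definition and compares it with `cor()`. The variable names `x`, `y`, and `r_manual` are illustrative only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson's r: sum of cross-products of deviations, divided by the
# square root of the product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y)  # should agree with r_manual up to floating-point error
```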

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

#Create regression line between wt and mpg
(lmm <- lm(mpg ~ wt, data = mtcars))
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

mpg = -5.344(wt) + 37.285, where wt is the car's weight in thousands of pounds

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

#Estimate with equation (wt is in 1000s of lbs, so 2,000 lbs is wt = 2)
(-5.344*(2)) + 37.285
## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

#Estimate with equation (7,000 lbs is wt = 7)
(-5.344*(7)) + 37.285
## [1] -0.123

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
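The prediction in part b is reasonably reliable, but the prediction in part c is not. A weight of 2,000 lbs (wt = 2) falls within the range of weights in the data (1.513 to 5.424, per the summary above), so the model is interpolating, and the correlation of about -0.87 indicates a fairly strong fit. The estimate is still not exact, because a least squares line leaves some variation unexplained. A weight of 7,000 lbs (wt = 7) is well beyond the heaviest car in the data, so that estimate is an extrapolation, and the result of -0.123 mpg is physically impossible.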
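
One way to check whether a prediction is an interpolation or an extrapolation is to compare the new weights against the observed range of `wt`. The sketch below also uses `predict()` instead of typing rounded coefficients by hand, so its values may differ slightly from the hand calculations. The names `fit_wt`, `new_cars`, `preds`, and `in_range` are hypothetical.

```r
# Refit the model so this block stands on its own
fit_wt <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in thousands of pounds: 2,000 lbs -> 2 and 7,000 lbs -> 7
new_cars <- data.frame(wt = c(2, 7))
preds <- predict(fit_wt, newdata = new_cars)

# Flag whether each new weight lies inside the observed range of wt
wt_range <- range(mtcars$wt)
in_range <- new_cars$wt >= wt_range[1] & new_cars$wt <= wt_range[2]

data.frame(new_cars, predicted_mpg = preds, within_observed_wt = in_range)
```

A prediction flagged `FALSE` relies on the line continuing unchanged beyond the data, which the sample cannot confirm.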

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

#Take a summary of the regression line to find Multiple R-Squared
summary(lmm)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

75.28% of the variation in a car’s mpg can be explained by the variation in a car’s weight.
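
In simple linear regression with one predictor, Multiple R-squared equals the square of the correlation coefficient. The sketch below pulls R-squared out of the model summary programmatically and checks it against the squared correlation. The names `fit_wt`, `r_squared`, and `r` are illustrative.

```r
fit_wt <- lm(mpg ~ wt, data = mtcars)

# Extract Multiple R-squared from the summary object
r_squared <- summary(fit_wt)$r.squared

# Correlation between the explanatory and response variables
r <- cor(mtcars$wt, mtcars$mpg)

c(r_squared = r_squared, r_times_r = r^2)  # the two values should match
round(100 * r_squared, 2)                  # percent of variation explained
```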

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc <- read.csv("nscc_student_data.csv")

#Look at the structure of the dataset
str(nscc)
## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : chr  "Female" "Female" "Female" "Female" ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : chr  "July 5" "December 27" "January 31" "6-13" ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : chr  "No" "Yes" "Yes" "Yes" ...
##  $ VoterReg    : chr  "Yes" "Yes" "No" "Yes" ...
#Take a summary
summary(nscc)
##     Gender            PulseRate       CoinFlip1   CoinFlip2         Height     
##  Length:40          Min.   :50.00   Min.   :2   Min.   :1.000   Min.   : 6.00  
##  Class :character   1st Qu.:64.25   1st Qu.:4   1st Qu.:4.000   1st Qu.:62.00  
##  Mode  :character   Median :70.50   Median :5   Median :5.000   Median :66.00  
##                     Mean   :73.47   Mean   :5   Mean   :4.897   Mean   :64.52  
##                     3rd Qu.:83.75   3rd Qu.:6   3rd Qu.:6.000   3rd Qu.:68.75  
##                     Max.   :98.00   Max.   :8   Max.   :8.000   Max.   :76.00  
##                     NA's   :2       NA's   :1   NA's   :1       NA's   :1      
##    ShoeLength         Age           Siblings      RandomNum      HoursWorking  
##  Min.   : 7.00   Min.   :18.00   Min.   :0.00   Min.   :  1.0   Min.   : 0.00  
##  1st Qu.: 9.03   1st Qu.:19.75   1st Qu.:1.00   1st Qu.: 14.0   1st Qu.:17.25  
##  Median : 9.89   Median :21.50   Median :2.00   Median :273.0   Median :25.00  
##  Mean   :10.33   Mean   :24.70   Mean   :2.15   Mean   :313.7   Mean   :25.65  
##  3rd Qu.:11.00   3rd Qu.:28.00   3rd Qu.:2.25   3rd Qu.:531.5   3rd Qu.:32.75  
##  Max.   :20.00   Max.   :49.00   Max.   :7.00   Max.   :999.0   Max.   :64.00  
##  NA's   :5                                      NA's   :1                      
##     Credits        Birthday            ProfsAge        Coffee         
##  Min.   : 3.00   Length:40          Min.   :26.00   Length:40         
##  1st Qu.:10.00   Class :character   1st Qu.:28.00   Class :character  
##  Median :13.00   Mode  :character   Median :30.50   Mode  :character  
##  Mean   :11.78                      Mean   :31.10                     
##  3rd Qu.:15.00                      3rd Qu.:32.25                     
##  Max.   :16.00                      Max.   :39.00                     
##                                                                       
##    VoterReg        
##  Length:40         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

  1. Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.
#First plot: shoe length (explanatory, x-axis) vs. height (response, y-axis)
plot(nscc$ShoeLength, nscc$Height)

#Second plot: pulse rate (explanatory, x-axis) vs. height (response, y-axis)
plot(nscc$PulseRate, nscc$Height)

  2. Discuss the two scatterplots individually. Based only on glancing at the scatter plots, does there appear to be a linear relationship between the variables? If so, is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.
    Neither scatterplot shows a clear linear relationship, regardless of whether pulse rate or shoe length is the explanatory variable. Based on the scatterplots alone, neither variable appears to be a better predictor of height.

Question 6 – Calculate correlation coefficients

  1. Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
#Shoe length correlation coefficient 
cor(nscc$ShoeLength, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
#Pulse rate correlation coefficient 
cor(nscc$PulseRate, nscc$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
  2. Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
    Strictly based on the correlation coefficients, shoe length (r of about 0.27) is a slightly better predictor of height than pulse rate (r of about 0.20), although both correlations are weak.
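
To illustrate what `use = "pairwise.complete.obs"` does, here is a small sketch on made-up vectors. The values below are invented purely for illustration and are not taken from nscc_student_data. With two variables, the argument drops every row where either value is missing before computing the correlation.

```r
# Made-up illustrative values (NOT from nscc_student_data)
shoe   <- c(9.5, 10, NA, 11, 10.5, 12)
height <- c(62, 64, 66, NA, 67, 70)

# Without the use argument, any NA makes the result NA
cor(shoe, height)

# Rows where both values are present
keep <- complete.cases(shoe, height)

r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")
r_manual   <- cor(shoe[keep], height[keep])

c(r_pairwise = r_pairwise, r_manual = r_manual)  # the two should match
```

Because rows are dropped, the correlation rests on fewer observations than the data frame contains, which is worth keeping in mind with a sample of only 40 students.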

Question 7 – Creating and using a regression equation

  1. Create a linear model for height as the response variable with shoe length as a predictor variable.
#Create the linear model
(lmn <- lm(Height ~ ShoeLength, data = nscc))
## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566
  2. Use that model to predict the height of someone who has a 10" shoe length.
#Calculate using regression equation
.566*(10) + 60.365
## [1] 66.025
  3. Do you think that prediction is an accurate one? Explain why or why not.
    I do not think the prediction is very accurate. The correlation between shoe length and height in this sample is only about 0.27, so the model explains roughly 7% of the variation in height (0.27 squared is about 0.07). The estimate of 66.025 is close to the median height of 66 in the sample, but any one person's actual height could differ from it considerably.

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?
Based on common sense, I would have expected pulse rate and height to have a poor/no relationship before my analysis.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?
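I did not expect either pair of variables to have a strong relationship. That said, the summary output shows a minimum Height of 6.00 and a maximum ShoeLength of 20.00, which look like data-entry errors. In a sample of only 40 students, with several missing shoe lengths, a few such values could noticeably weaken the correlation between shoe length and height.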