Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

# Store datasets into environment
mtcars <- mtcars
nscc_students <- read.csv("C:/Users/naltidor01/Downloads/nscc_student_data.csv")

a.) How many rows are there and what does each row represent?
There are 32 rows in the mtcars dataset, each representing one car model.
There are 40 rows in the nscc_student_data dataset, each representing one student who took part in the survey.

b.) How many variables are there and what do they represent?
There are 11 variables in mtcars, each describing a characteristic of the cars (for example mpg, number of cylinders, and weight).
15 variables representing the statistics of the sample of nscc students.
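
The row and variable counts above can be checked directly in R. The following is a minimal sketch for the built-in mtcars data; the variable names `n_rows`, `n_vars`, and `var_names` are chosen here for illustration only. The same calls should work on nscc_students once the CSV file has been read in.

```r
# Count rows (car models) and columns (measured characteristics)
n_rows <- nrow(mtcars)
n_vars <- ncol(mtcars)
n_rows
n_vars

# List the variable names
var_names <- names(mtcars)
var_names

# Each row is labelled with the car model it describes
head(rownames(mtcars), 3)
```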

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars
dim(mtcars)

## [1] 32 11

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

summary(mtcars)

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

#  Scatterplot of the weight variable and the miles per gallon variable
plot(mtcars$wt, mtcars$mpg, main = "MPG vs. Weight", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
There appears to be a negative linear relationship between the two variables: miles per gallon tends to decrease as weight increases.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

# Correlation coefficient of the weight and mpg variables
cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
There is a strong negative linear relationship between the weight and mpg variables.
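
To show what `cor()` computes, the coefficient can also be worked out from the definition of Pearson's r. This is only an illustrative sketch; the names `x`, `y`, and `r_manual` are introduced here and are not part of the assignment.

```r
x <- mtcars$wt   # explanatory variable (weight)
y <- mtcars$mpg  # response variable (miles per gallon)

# Pearson's r: sum of cross-deviations divided by the square root of
# the product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual

# Should agree with the built-in function up to floating-point error
all.equal(r_manual, cor(x, y))
```

The negative sign reflects that above-average weights tend to pair with below-average mpg values.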

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

# Regression equation to model the relationship between the weight and mpg of a car
lm1 <- lm(formula = mpg ~ wt, data = mtcars)
lm1

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

1.) Equation
\(\hat{y} = 37.285 - 5.344x\), where \(x\) is the weight of the car in 1,000 lbs and \(\hat{y}\) is the predicted mpg.

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

# Estimated mpg for wt = 2 (weight is recorded in 1,000 lbs)
37.285 - 5.344 * 2

## [1] 26.597

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

# Estimated mpg for wt = 7 (weight is recorded in 1,000 lbs)
37.285 - 5.344 * 7

## [1] -0.123
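
As an alternative to typing the rounded coefficients by hand, the estimates in parts b and c could be obtained with `predict()`. The sketch below refits the model under the illustrative name `fit` so that it runs on its own, and also compares each weight with the range of weights in the data. The names `new_cars`, `pred`, `wt_range`, and `outside` are assumptions made for this example. Because `predict()` uses the unrounded coefficients, its results may differ slightly from the hand calculations in the third decimal place.

```r
# Refit the model so this block is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in units of 1,000 lbs: 2,000 lbs -> 2 and 7,000 lbs -> 7
new_cars <- data.frame(wt = c(2, 7))
pred <- predict(fit, newdata = new_cars)
pred

# Flag any weight that falls outside the observed range of the data
wt_range <- range(mtcars$wt)
outside <- new_cars$wt < wt_range[1] | new_cars$wt > wt_range[2]
outside
```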

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
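The prediction in part b is reasonably reliable: a weight of 2,000 lbs lies within the range of weights in the data (1,513 to 5,424 lbs), and the correlation between weight and mpg is strong. The prediction in part c is not reliable: 7,000 lbs is well beyond the heaviest car in the data, so it is an extrapolation, and the resulting negative mpg is impossible.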

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

# Percentage of variation 
summary(lm1)$r.squared

## [1] 0.7528328

About 75.28% of the variation in the cars’ mpg is explained by the cars’ weight.
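
In a regression with a single predictor, the R-squared value equals the square of the correlation coefficient. The short check below illustrates this; the names `fit`, `r`, and `r_squared` are introduced only for this sketch.

```r
# Refit the model so this block is self-contained
fit <- lm(mpg ~ wt, data = mtcars)

r <- cor(mtcars$wt, mtcars$mpg)        # correlation coefficient
r_squared <- summary(fit)$r.squared    # proportion of variation explained

# For simple linear regression these two quantities should match
all.equal(r^2, r_squared)

# Express as a percentage
round(100 * r_squared, 2)
```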

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_students <- read.csv("C:/Users/naltidor01/Downloads/nscc_student_data.csv")
dim(nscc_students)

## [1] 40 15

str(nscc_students)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

summary(nscc_students)

##     Gender     PulseRate       CoinFlip1   CoinFlip2         Height     
##  Female:27   Min.   :50.00   Min.   :2   Min.   :1.000   Min.   : 6.00  
##  Male  :13   1st Qu.:64.25   1st Qu.:4   1st Qu.:4.000   1st Qu.:62.00  
##              Median :70.50   Median :5   Median :5.000   Median :66.00  
##              Mean   :73.47   Mean   :5   Mean   :4.897   Mean   :64.52  
##              3rd Qu.:83.75   3rd Qu.:6   3rd Qu.:6.000   3rd Qu.:68.75  
##              Max.   :98.00   Max.   :8   Max.   :8.000   Max.   :76.00  
##              NA's   :2       NA's   :1   NA's   :1       NA's   :1      
##    ShoeLength         Age           Siblings      RandomNum    
##  Min.   : 7.00   Min.   :18.00   Min.   :0.00   Min.   :  1.0  
##  1st Qu.: 9.03   1st Qu.:19.75   1st Qu.:1.00   1st Qu.: 14.0  
##  Median : 9.89   Median :21.50   Median :2.00   Median :273.0  
##  Mean   :10.33   Mean   :24.70   Mean   :2.15   Mean   :313.7  
##  3rd Qu.:11.00   3rd Qu.:28.00   3rd Qu.:2.25   3rd Qu.:531.5  
##  Max.   :20.00   Max.   :49.00   Max.   :7.00   Max.   :999.0  
##  NA's   :5                                      NA's   :1      
##   HoursWorking      Credits            Birthday     ProfsAge     Coffee  
##  Min.   : 0.00   Min.   : 3.00   02-15     : 1   Min.   :26.00   No :10  
##  1st Qu.:17.25   1st Qu.:10.00   03.14.1984: 1   1st Qu.:28.00   Yes:30  
##  Median :25.00   Median :13.00   03/13     : 1   Median :30.50           
##  Mean   :25.65   Mean   :11.78   04/15     : 1   Mean   :31.10           
##  3rd Qu.:32.75   3rd Qu.:15.00   05/07     : 1   3rd Qu.:32.25           
##  Max.   :64.00   Max.   :16.00   05/23     : 1   Max.   :39.00           
##                                  (Other)   :34                           
##  VoterReg
##  No : 9  
##  Yes:31  
##          
##          
##          
##          
##

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

# Scatterplot with shoe length as the explanatory variable
plot(nscc_students$ShoeLength, nscc_students$Height, main = "Height vs. Shoe Length", xlab = "Shoe Length", ylab = "Height")

# Scatterplot with pulse rate as the explanatory variable
plot(nscc_students$PulseRate, nscc_students$Height, main = "Height vs. Pulse Rate", xlab = "Pulse Rate", ylab = "Height")

Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.
Neither scatterplot shows a strong linear pattern. In the scatterplot with shoe length as the explanatory variable, the points are widely scattered, with at most a weak positive trend. The scatterplot with pulse rate as the explanatory variable is similarly scattered and shows little or no linear trend. Based on the scatterplots alone, neither variable appears to be a reliable predictor of height, although shoe length looks slightly better.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

# Correlation coefficients for each pair of variables
cor(nscc_students$ShoeLength, nscc_students$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

cor(nscc_students$PulseRate, nscc_students$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?
Neither variable is correlated strongly enough with height to be a good predictor, but shoe length has the higher correlation coefficient (0.27 versus 0.20), so it is the better predictor of the two.
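
The effect of the `use = "pairwise.complete.obs"` argument can be seen on a small example. The vectors `shoe` and `height` below are made-up toy values, not the NSCC data, and the other names are likewise chosen only for this sketch.

```r
# Hypothetical toy data with missing values (NOT the NSCC dataset)
shoe   <- c(9.5, 10.0, NA, 11.0, 12.0, 10.5)
height <- c(63, 65, 66, NA, 72, 67)

# With the default settings, any missing value makes the result NA
r_default <- cor(shoe, height)
r_default

# Pairwise-complete: use only the pairs where both values are present
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# Equivalent by hand: drop incomplete pairs first, then correlate
keep <- complete.cases(shoe, height)
r_complete <- cor(shoe[keep], height[keep])
all.equal(r_pairwise, r_complete)
```

For two vectors, this amounts to dropping every student who is missing either measurement before computing the correlation.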

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

# Linear model with shoe length as a predictor variable
(lm2 <- lm(formula = Height ~ ShoeLength, data = nscc_students))

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_students)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Use that model to predict the height of someone who has a “10” shoelength.

# Prediction of height
60.365 + 0.566*10

## [1] 66.025

Do you think that prediction is an accurate one? Explain why or why not.
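The prediction is not an extrapolation, since a shoe length of 10 falls within the observed range of shoe lengths (7 to 20). However, the correlation between shoe length and height is weak (r ≈ 0.27), so shoe length explains only about 7% of the variation in height, and the prediction is unlikely to be very accurate.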

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
The pulse rate variable was expected to have a poor or no relationship with height, since shoe length is much more likely to be related to height.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
The shoe length variable was expected to have a stronger relationship with height than it did, which implies that shoe length is not the determining factor of height. Another possible explanation for the lower-than-expected correlation coefficient is that the survey respondents might not have actually measured their shoe length, so the responses may not be accurate.