Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

#Review dataset mtcars.
str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

#Create scatterplot of the "wt" and "mpg" variables of the dataset mtcars, where "mpg" is the response variable, and "wt" is the explanatory variable.
plot(mtcars$wt, mtcars$mpg)

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

Based on the scatterplot alone, the relationship between the response variable “mpg” and the explanatory variable “wt” appears linear, negative, and moderate to strong: heavier cars tend to get fewer miles per gallon.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

#Calculate the linear correlation coefficient of the "wt" and "mpg" variables. 
cor1 <- cor(mtcars$wt, mtcars$mpg)

The correlation coefficient of the “wt” and “mpg” variables is -0.87.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient is negative and close to -1, so the linear relationship is negative and strong: as weight increases, mpg tends to decrease.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

#To model the relationship between the "wt" and "mpg" variables, first find the slope (b1) and the intercept (b0).
b1 <- cor1*sd(mtcars$mpg)/sd(mtcars$wt)   #slope: r * sd(y) / sd(x)
b0 <- mean(mtcars$mpg) - b1*mean(mtcars$wt)   #intercept: mean(y) - b1 * mean(x)

Now we can write the regression equation that models the relationship between the “wt” and “mpg” variables, with “wt” measured in thousands of pounds: \(mpg = 37.285 - 5.344 \times wt\).
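As a cross-check, the same coefficients can be obtained from R's built-in lm() function. The sketch below assumes only the built-in mtcars data; the names r and fit are introduced here for illustration and are not part of the original report.

```r
# Manual least-squares coefficients for mpg as a function of wt
r  <- cor(mtcars$wt, mtcars$mpg)
b1 <- r * sd(mtcars$mpg) / sd(mtcars$wt)        # slope
b0 <- mean(mtcars$mpg) - b1 * mean(mtcars$wt)   # intercept

# Same model fitted with lm(); `fit` is a name chosen for this sketch
fit <- lm(mpg ~ wt, data = mtcars)

# Compare the two sets of coefficients (they should agree up to
# floating-point error)
all.equal(unname(coef(fit)), c(b0, b1))
```

If the comparison holds, the manual formulas and lm() describe the same fitted line.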

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

#Estimate the mpg of a car that weighs 2000 lbs using the regression equation ("wt" is in units of 1000 lbs, so wt = 2).
b0+b1*2

## [1] 26.59618

According to the regression equation, the estimated mpg of a car that weighs 2000 lbs is about 26.6.

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

#Estimate the mpg of a car that weighs 7000 lbs using the regression equation ("wt" is in units of 1000 lbs, so wt = 7).
b0+(b1*7)

## [1] -0.1261748

According to the regression equation, the estimated mpg of a car that weighs 7000 lbs is about -0.126.
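The two estimates can also be produced in one step with predict(). This is a minimal sketch that assumes the model is refitted with lm(); the names fit, new_cars, and predicted_mpg are hypothetical and chosen only for this example.

```r
# Fit mpg as a function of wt on the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in units of 1000 lbs, so 2000 lbs -> 2 and 7000 lbs -> 7
new_cars <- data.frame(wt = c(2, 7))

# Point predictions for both weights, in the same order as new_cars
predicted_mpg <- predict(fit, newdata = new_cars)
predicted_mpg
```

Using predict() avoids retyping rounded coefficients, which matters when a model has more than one predictor.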

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

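The prediction in part b is likely reasonable: 2,000 lbs lies within the range of weights observed in mtcars (about 1,500 to 5,400 lbs), and 26.6 mpg is a realistic value. The prediction in part c is definitely not reliable: 7,000 lbs is well outside the observed range, so the model is being extrapolated, and the result is negative, which mpg cannot be.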
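One way to judge predictions like these is to check whether each weight falls inside the range of weights actually observed in mtcars. The sketch below does that; wt_range and in_range are hypothetical helper names used only here.

```r
# Observed range of car weights, in units of 1000 lbs
wt_range <- range(mtcars$wt)
wt_range

# TRUE means the weight lies inside the observed data (interpolation);
# FALSE means a prediction at that weight is an extrapolation
in_range <- function(w) w >= wt_range[1] & w <= wt_range[2]

# Check the two weights from parts b and c (2000 lbs and 7000 lbs)
in_range(c(2, 7))
```

A weight outside the observed range does not prove the prediction is wrong, but the data give no evidence that the linear pattern continues there.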

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

#Find the percentage of the variation in a car's mpg that is explained by the car's weight ("mpg" is the response, "wt" is the explanatory variable).
summary(lm(mpg ~ wt, data = mtcars))$r.squared

## [1] 0.7528328

About 75.28% of the variation in mpg is explained by its linear relationship with the car’s weight.
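For a simple linear regression with one explanatory variable, R-squared equals the square of the correlation coefficient. The following sketch checks this on mtcars; the variable names are chosen for illustration only.

```r
# Correlation between weight and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared computed two ways
r_squared_from_cor <- r^2
r_squared_from_lm  <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

# Compare the two values (they should agree up to floating-point error)
all.equal(r_squared_from_cor, r_squared_from_lm)
```

Because the correlation is symmetric in its two arguments, swapping the response and explanatory variables gives the same R-squared in this one-predictor case, even though the fitted line itself differs.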

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")

#Review dataset nscc_student_data.
str(nscc_student_data)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

#Create a scatterplot with "Height" as the response variable and "ShoeLength" as the explanatory variable.
plot(nscc_student_data$ShoeLength, nscc_student_data$Height)

#Create a scatterplot with "Height" as the response variable and "PulseRate" as the explanatory variable.
plot(nscc_student_data$PulseRate, nscc_student_data$Height)

Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

The scatterplot of “Height” against “ShoeLength” appears to show a linear relationship between the variables, but the relationship looks weak.

The scatterplot of “Height” against “PulseRate” also appears to show a linear relationship, and it looks weak as well.

I think that neither explanatory variable is a very good predictor of the height of an NSCC student. Because both relationships with the response variable “Height” are weak, the points are widely scattered around any fitted line, so predictions of “Height” from either variable would be imprecise.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

#Calculate the correlation coefficient between the variables "Height" and "ShoeLength".
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2695881

#Calculate the correlation coefficient between the variables "Height" and "PulseRate".
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")

## [1] 0.2028639

The correlation coefficient between the explanatory variable “ShoeLength” and the response variable “Height” is 0.27. The correlation coefficient between the explanatory variable “PulseRate” and the response variable “Height” is 0.20.

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Based on the correlation coefficients, “ShoeLength” (r = 0.27) is a better predictor of the height of an NSCC student than “PulseRate” (r = 0.20), although both correlations are weak.
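The use = "pairwise.complete.obs" argument in this question tells cor() to drop observations where either value is missing. The sketch below illustrates the effect on a small made-up data frame; toy and its values are hypothetical and are NOT taken from the NSCC dataset.

```r
# Hypothetical toy data (not the NSCC dataset), with missing values
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6),
  y = c(2, 1, 4, NA, 5, 7)
)

# With the default settings, any NA makes the result NA
r_default <- cor(toy$x, toy$y)

# With pairwise.complete.obs, rows with an NA in either variable are dropped
r_pairwise <- cor(toy$x, toy$y, use = "pairwise.complete.obs")

# Equivalent by hand: keep only rows where both values are present
ok <- complete.cases(toy$x, toy$y)
r_manual <- cor(toy$x[ok], toy$y[ok])

# Compare the two approaches
all.equal(r_pairwise, r_manual)
```

Note that the correlation is then based on fewer observations than the full dataset, which is worth keeping in mind when interpreting it.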

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

#Create a linear model for "Height" as the response variable and "ShoeLength" as a predictor variable with lm(), which reports b0 (the intercept) and b1 (the slope).
lm(Height ~ ShoeLength, data = nscc_student_data)

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

The linear model for “Height” as the response variable with “ShoeLength” as a predictor variable is \(Height = 60.365 + 0.566 \times ShoeLength\).

Use that model to predict the height of someone who has a 10" shoe length.

#Calculate the estimated height of a student at NSCC who has a 10" shoe length.
60.365+0.566*10

## [1] 66.025

According to the regression equation for the response variable “Height” and the explanatory variable “ShoeLength”, an NSCC student whose shoe length is 10" is predicted to be about 66" tall.

Do you think that prediction is an accurate one? Explain why or why not.

The prediction is most likely not accurate, because the linear relationship between the variables “Height” and “ShoeLength” is weak (r is about 0.27), so individual heights vary widely around the regression line.
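To put a number on how weak these models are, the correlations reported in Question 6 can be squared to give the share of variation in height that each predictor explains. This is a small arithmetic sketch using the reported values; r_shoe, r_pulse, and explained are names introduced here for illustration.

```r
# Correlation coefficients reported in Question 6
r_shoe  <- 0.2695881   # ShoeLength vs Height
r_pulse <- 0.2028639   # PulseRate vs Height

# In a one-predictor linear model, R-squared is the squared correlation
explained <- round(c(shoe = r_shoe^2, pulse = r_pulse^2), 3)
explained
```

Under that assumption, shoe length accounts for roughly 7% of the variation in height and pulse rate for roughly 4%, which leaves most of the variation unexplained by either model.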

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?

I expected both variables to be poor explanatory variables, because many other factors need to be accounted for when trying to predict somebody’s height. In other words, we cannot infer a person’s height from their pulse rate or shoe length alone. Of the two, I expected “PulseRate” to be the worse predictor of a person’s height.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?

I expected both explanatory variables to have a weak relationship or no relationship with the response variable “Height”, not a strong one. Some relationship is plausible: the taller and bigger a person is, the harder the heart has to work to pump blood through the body, so pulse rate may be higher in a taller person. The same goes for shoe length: in general, a taller person is expected to have a bigger shoe size. However, this is not always the case, and other variables, such as gender, also have to be considered.

According to the study “Gender and Height in Relation to Blood Pressure and Heart Rate of Medical Students of University of Abuja”, “the blood pressure and heart rate increased with increasing height in males but both reduced with increasing height in females”. Gender is therefore something the prediction should also take into account when “PulseRate” is used as an explanatory variable for a person’s height.

The pulse rate also depends on a person’s health and physical condition, which can affect the normal resting heart rate.