In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
#Review dataset mtcars.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
#Create scatterplot of the "wt" and "mpg" variables of the dataset mtcars, where "mpg" is the response variable, and "wt" is the explanatory variable.
plot(mtcars$wt, mtcars$mpg)
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Judging only from the scatterplot, the relationship between the response variable “mpg” and the explanatory variable “wt” appears linear, negative, and moderate to strong: as weight increases, mpg tends to decrease.
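To make the visual judgment easier, the least-squares line can be drawn on top of the scatterplot. This is a minimal sketch in base R graphics; the axis labels, the line color, and the object name `fit` are my own choices and are not part of the assignment.

```r
# Scatterplot of weight (explanatory) against mpg (response)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Fit a simple linear model and overlay its line on the plot
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red")
```

If the points scatter fairly evenly around the line with no obvious curve, a linear description is reasonable.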
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
#Calculate the linear correlation coefficient of the "wt" and "mpg" variables.
cor1 <- cor(mtcars$wt, mtcars$mpg)
The correlation coefficient of the “wt” and “mpg” variables is -0.87.
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The correlation coefficient is negative and close to -1, so the linear relationship is negative and strong: heavier cars tend to have lower mpg.
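As a cross-check on what cor() reports, the coefficient can be rebuilt from its definition as the covariance divided by the product of the two standard deviations. The names `x`, `y`, and `r_manual` below are illustrative only.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Pearson's r is the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Should agree with the built-in function up to floating-point error
all.equal(r_manual, cor(x, y))

# The order of the variables does not matter for r
all.equal(cor(x, y), cor(y, x))
```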
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
#To create a regression equation to model the relationship between the weight and mpg variables, we have to find b0 and b1 first.
b1 <- cor1*sd(mtcars$mpg)/sd(mtcars$wt)
b0 <- mean(mtcars$mpg) - b1*mean(mtcars$wt)
Now we can create a regression equation to model the relationship between the “wt” and “mpg” variables: \(mpg=37.285+(-5.344*wt)\), where wt is measured in units of 1000 lbs.
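The hand-computed slope and intercept can be checked against R's own least-squares fit. The sketch below recomputes both so that it runs on its own; `fit` is an illustrative name.

```r
# Slope and intercept from the summary-statistic formulas
r  <- cor(mtcars$wt, mtcars$mpg)
b1 <- r * sd(mtcars$mpg) / sd(mtcars$wt)
b0 <- mean(mtcars$mpg) - b1 * mean(mtcars$wt)

# Same line fitted by least squares with lm()
fit <- lm(mpg ~ wt, data = mtcars)

# Compare the two sets of coefficients (names dropped before comparing)
all.equal(c(b0, b1), unname(coef(fit)))
```

Both routes should give the same line, since the formulas for b0 and b1 are the closed-form least-squares solution for a single predictor.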
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
#Estimate the mpg of a car that weighs 2000 lbs using the regression equation (wt is in units of 1000 lbs, so wt = 2).
b0+b1*2
## [1] 26.59618
According to the regression equation, the estimated mpg of a car that weighs 2,000 lbs is about 26.6.
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
#Estimate the mpg of a car that weighs 7000 lbs using the regression equation (wt = 7).
b0+(b1*7)
## [1] -0.1261748
According to the regression equation, the estimated mpg of a car that weighs 7,000 lbs is -0.126.
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
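The prediction in part b is likely reliable: 2,000 lbs falls within the range of weights observed in mtcars (roughly 1,513 to 5,424 lbs), and the estimate is realistic. The prediction in part c is not reliable: 7,000 lbs is well beyond the heaviest car in the dataset, so the estimate is an extrapolation, and it is negative, which is impossible for mpg.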
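One way to make the reliability question concrete is to compare each new weight with the range of weights the model was fitted on. The sketch below does that and also gets the predictions through predict(); the names `new_cars`, `wt_range`, `extrapolated`, and `mpg_hat` are illustrative.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Weights to predict for, in units of 1000 lbs (2,000 and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))

# Observed range of weights in the data
wt_range <- range(mtcars$wt)

# Flag predictions that would require extrapolating beyond that range
new_cars$extrapolated <- new_cars$wt < wt_range[1] | new_cars$wt > wt_range[2]

# Point predictions from the fitted line
new_cars$mpg_hat <- predict(fit, newdata = new_cars)
new_cars
```

A prediction flagged as extrapolated relies on the linear pattern continuing outside the observed data, which the data cannot confirm.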
What percent of the variation in a car’s mpg is explained by the car’s weight?
#Find the percentage of the variation in a car's mpg that is explained by the car's weight (mpg is the response, wt the explanatory variable).
summary(lm(mpg ~ wt, data = mtcars))$r.squared
## [1] 0.7528328
75.28% of the variation in mpg can be explained by the linear relationship between the car’s weight and car’s mpg.
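With a single explanatory variable, the coefficient of determination should equal the square of the correlation coefficient, which gives a quick consistency check. A small sketch, with `fit`, `r`, and `r_squared` as illustrative names:

```r
fit <- lm(mpg ~ wt, data = mtcars)

r_squared <- summary(fit)$r.squared
r         <- cor(mtcars$wt, mtcars$mpg)

# With a single predictor, R-squared is the square of Pearson's r
all.equal(r_squared, r^2)

# Swapping response and explanatory variable changes the coefficients
# but leaves R-squared unchanged
all.equal(r_squared, summary(lm(wt ~ mpg, data = mtcars))$r.squared)
```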
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data.csv")
#Review dataset nscc_student_data.
str(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
## $ VoterReg : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
#Create a scatterplot with "Height" as the response variable and "ShoeLength" as the explanatory variable.
plot(nscc_student_data$ShoeLength, nscc_student_data$Height)
#Create a scatterplot with "Height" as the response variable and "PulseRate" as the explanatory variable.
plot(nscc_student_data$PulseRate, nscc_student_data$Height)
The scatterplot of “Height” against “ShoeLength” shows, at most, a weak linear relationship between the variables.
The scatterplot of “Height” against “PulseRate” also shows, at most, a weak linear relationship.
I think neither explanatory variable is a very good predictor of the height of an NSCC student. Because both relationships with the response variable “Height” look weak, neither variable can be expected to predict it well.
Pass use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
#Calculate correlation coefficient between variables "Height" and "ShoeLength".
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
#Calculate correlation coefficient between variables "Height" and "PulseRate".
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
The correlation coefficient between the explanatory variable “ShoeLength” and the response variable “Height” is 0.27. The correlation coefficient between the explanatory variable “PulseRate” and the response variable “Height” is 0.20. Both indicate weak positive linear relationships.
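To see why the `use` argument matters, here is a small sketch on a made-up data frame. The values in `toy` are invented for illustration only and are not taken from nscc_student_data.

```r
# Made-up values for illustration only; not taken from nscc_student_data
toy <- data.frame(
  shoe   = c(10, 11, NA, 9.5, 12, 10.5),
  height = c(64, 68, 66, NA, 72, 65)
)

# With the default use = "everything", any NA makes the result NA
cor(toy$shoe, toy$height)

# Pairwise deletion keeps only rows where both values are present
r_pairwise <- cor(toy$shoe, toy$height, use = "pairwise.complete.obs")

# Equivalent by hand: filter to complete rows first
ok <- complete.cases(toy$shoe, toy$height)
r_manual <- cor(toy$shoe[ok], toy$height[ok])

all.equal(r_pairwise, r_manual)
```

For a single pair of variables, pairwise deletion amounts to dropping every row in which either value is missing.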
#Fit a linear model with "Height" as the response variable and "ShoeLength" as the predictor variable; lm() estimates b0 and b1.
lm(Height ~ ShoeLength, nscc_student_data)
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
The linear model for “Height” as the response variable with “ShoeLength” as a predictor variable is \(Height=60.365+0.566*ShoeLength\).
#Calculate the estimated height of a student at NSCC who has a 10" shoe length.
60.365+0.566*10
## [1] 66.025
According to the regression equation for the response variable “Height” and the explanatory variable “ShoeLength”, an NSCC student whose shoe length is 10" is estimated to be about 66" tall.
The prediction is most likely not accurate, because the linear relationship between the variables “Height” and “ShoeLength” is weak.
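How weak these models are can be quantified by squaring the correlation coefficients, since in simple linear regression R-squared equals the square of r. The sketch below uses the two correlations copied from the output above; the variable names are illustrative.

```r
# Correlations reported above (copied from the output, not recomputed)
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# In simple linear regression, R-squared equals r squared
r2_shoe  <- r_shoe^2
r2_pulse <- r_pulse^2

# Express as percentages of the variation in Height
round(100 * c(ShoeLength = r2_shoe, PulseRate = r2_pulse), 1)
```

Under these numbers, each variable accounts for well under a tenth of the variation in height, which is consistent with treating the 66" estimate as a rough guess.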
a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?
I expected both to be poor explanatory variables, because pulse rate and shoe length alone often do not predict a person’s height; many other variables need to be accounted for. In other words, we cannot infer a person’s height from pulse rate or shoe length alone. Of the two, I expected “PulseRate” to be the worse predictor.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?
I expected both explanatory variables to have a weak relationship or no relationship with the response variable “Height”, not a strong one. The taller and bigger a person is, the harder the heart has to work to pump blood through the body, so pulse rate may be higher in a taller person. Likewise, a taller person is generally expected to have a larger shoe size, but this is not always the case, and other variables, such as gender, also have to be considered.
According to the study “Gender and Height in Relation to Blood Pressure and Heart Rate of Medical Students of University of Abuja”, “the blood pressure and heart rate increased with increasing height in males but both reduced with increasing height in females”. Gender therefore also needs to be taken into account when using “PulseRate” as an explanatory variable to predict a person’s height.
The pulse rate also depends on health and physical conditions, which could also affect the normal resting heart rate.