In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Inspect the structure of mtcars before plotting
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# Scatterplot of weight (explanatory) against miles per gallon (response)
plot(mtcars$wt, mtcars$mpg, main = "Car Weight and Miles Per Gallon", xlab = "Car Weight (1,000 lbs)", ylab = "Miles Per Gallon")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, by looking at the scatterplot it appears that there is a moderate, negative linear relationship between the weight and mpg of a car.
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
#Correlation coefficient of the weight and mpg variables.
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The correlation coefficient of -0.8676594 indicates a strong negative linear relationship between the two variables.
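As a cross-check, the value returned by cor() can be reproduced from the definition of the Pearson correlation coefficient. The sketch below is only illustrative; the names x, y and r_manual are introduced here and are not part of the assignment.

```r
# Pearson correlation from its definition, using the built-in mtcars data
x <- mtcars$wt   # explanatory variable: weight
y <- mtcars$mpg  # response variable: miles per gallon

# Sum of cross-deviations divided by the square root of the
# product of the two sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual

# Compare with the built-in function
all.equal(r_manual, cor(x, y))
```

The manual value should match the -0.8676594 reported by cor() above.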
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
# Create regression line with lm() function and store into object called lm1
lm1 <- lm(mpg ~ wt, data = mtcars)
lm1
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
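The regression equation that models the relationship between the weight and mpg of a car is: y = 37.285 + (-5.344 * x), where x is the car’s weight in thousands of pounds (the units of wt in mtcars) and y is the predicted mpg.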
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
37.285 + (-5.344*2)
## [1] 26.597
According to the regression equation, a car that weighs 2,000 lbs is estimated to get 26.597 mpg, or about 26.6 mpg.
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
37.285 + (-5.344*7)
## [1] -0.123
According to the regression equation, a car that weighs 7,000 lbs is estimated to get -0.123 mpg.
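The two estimates above were computed by hand from the rounded coefficients. A possible alternative, sketched here, is to let predict() do the arithmetic with the full-precision coefficients. The data frame name new_cars and the object pred are introduced only for this illustration.

```r
# Refit the model from the report (mtcars ships with base R)
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in units of 1,000 lbs, so 2,000 lbs -> 2 and 7,000 lbs -> 7
new_cars <- data.frame(wt = c(2, 7))

# predict() applies the fitted equation to each row of new_cars
pred <- predict(lm1, newdata = new_cars)
pred
```

Because predict() uses unrounded coefficients, its results may differ from the hand calculations in the third decimal place.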
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
I believe the prediction in part b is reliable because a weight of 2,000 lbs falls within the range of car weights in the data, as the scatterplot shows, and the estimate looks realistic. The prediction in part c is not reliable: it is an extrapolation, since the data include no cars weighing anywhere near 7,000 lbs, and a negative mpg value is impossible.
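One way to back up the extrapolation argument is to compare each weight with the range of weights actually observed. This is a minimal sketch; wt_range, candidates and inside are names chosen for this illustration.

```r
# Observed range of car weights in mtcars (units of 1,000 lbs)
wt_range <- range(mtcars$wt)
wt_range

# Weights from parts b and c, also in units of 1,000 lbs
candidates <- c(2, 7)

# TRUE when a weight lies inside the observed range (interpolation),
# FALSE when it lies outside (extrapolation)
inside <- candidates >= wt_range[1] & candidates <= wt_range[2]
inside
```

The observed weights run from 1.513 to 5.424 (thousand lbs), so the check is TRUE for 2 and FALSE for 7.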
What percent of the variation in a car’s mpg is explained by the car’s weight?
summary(lm1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
The multiple R-squared value is 0.7528, and therefore 75.28% of the variation in a car’s mpg is explained by the car’s weight.
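The R-squared value can also be pulled out of the summary object directly instead of being read off the printed output. The sketch below additionally checks that, for a regression with a single predictor, R-squared equals the square of the correlation coefficient. The names r_squared and r are introduced here for illustration.

```r
# Refit the model from the report
lm1 <- lm(mpg ~ wt, data = mtcars)

# Extract R-squared from the summary object
r_squared <- summary(lm1)$r.squared
r_squared

# With one predictor, R-squared is the squared correlation coefficient
r <- cor(mtcars$wt, mtcars$mpg)
all.equal(r_squared, r^2)
```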
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
# Store nscc_student_data into environment
nscc_student_data <- read.csv("~/Desktop/stats/nscc_student_data.csv")
View(nscc_student_data)
str(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : Factor w/ 39 levels "03.14.1984","11-Jul",..: 28 22 25 5 11 8 15 13 23 19 ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
## $ VoterReg : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
# Create a scatterplot with height as the response variable and shoe length as the explanatory variable
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main = "Shoe length and Height", xlab = "Shoe length", ylab = "Height")
# Create a scatterplot with height as the response variable and pulse rate as the explanatory variable
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main = "Pulse Rate and Height", xlab = "Pulse Rate", ylab = "Height")
The scatterplot of shoe length and height appears to show a weak linear relationship between the variables.
The scatterplot of pulse rate and height appears to show no linear relationship.
Based on the scatterplots, neither explanatory variable appears to be a good predictor of a student’s height. However, if I had to choose one of the two variables to predict height, it would be shoe length. Even though the linear relationship is weak, it appears stronger than the relationship between pulse rate and height.
Set use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
# Find the correlation coefficient between shoe length and height
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
#Find the correlation coefficient between Pulse Rate and Height
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639
The correlation coefficient between Shoe length and Height is 0.27 and the correlation coefficient between Pulse Rate and Height is 0.20.
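To see what the use argument does, the sketch below runs cor() on two small made-up vectors containing missing values. The vectors shoe and height are hypothetical toy data, not the NSCC dataset, and the other names are introduced only for this example.

```r
# Hypothetical toy data with missing values (NOT the NSCC student data)
shoe   <- c(10, 11, NA, 9.5, 12, 10.5)
height <- c(64, 68, 66, 62, NA, 65)

# By default cor() returns NA if any value is missing
cor(shoe, height)

# With use = "pairwise.complete.obs", only pairs where both values
# are present enter the calculation
r_pairwise <- cor(shoe, height, use = "pairwise.complete.obs")

# For two vectors this matches dropping incomplete pairs by hand
keep <- complete.cases(shoe, height)
r_manual <- cor(shoe[keep], height[keep])
all.equal(r_pairwise, r_manual)
```

For a single pair of variables, "pairwise.complete.obs" and "complete.obs" give the same result; the two options differ only when cor() is given more than two columns.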
#Create a linear model by using lm function
lm2 <- lm(Height ~ ShoeLength, data = nscc_student_data)
lm2
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
The regression equation is: y = 60.365 + (0.566 * x), where x is shoe length and y is the predicted height.
# Compute the estimated height of someone who has a 10" shoe length.
60.365 + (0.566 * 10)
## [1] 66.025
According to the regression equation with height as the response variable and shoe length as the predictor variable, someone who has a 10" shoe length is estimated to be 66.025", or about 66", tall.
a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?
Before my analysis, I expected pulse rate and height to have a poor relationship or no relationship, because pulse rate doesn’t play a role in determining someone’s height, even though height may potentially have an impact on someone’s heart rate. There are many other variables, such as gender and genes, that may have a stronger relationship to height, so I did not expect pulse rate to be a strong predictor of height.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?
I expected shoe length to be a better predictor of height than it was, because typically the taller someone is, the larger their shoe size is likely to be. Even though that statement makes sense and holds most of the time, it is not always true. In this sample there are only 40 students, and some shoe lengths are missing, so a few unusual observations can noticeably weaken the correlation. Additionally, there are many other factors that determine someone’s height.