In this project, students will demonstrate their understanding of linear correlation and regression.
The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.
The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.
# Store mtcars dataset into environment
mtcars <- mtcars
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.
# Creating scatterplot
plot(mtcars$wt, mtcars$mpg, main="Car Weight and Fuel Efficiency", xlab="Weight (1,000 lbs)", ylab="Miles per Gallon")
b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?
Yes, there appears to be a moderately strong, negative linear relationship between the two variables.
a.) Calculate the linear correlation coefficient of the weight and mpg variables.
#Calculating the correlation coefficient
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
b.) Based on that correlation coefficient, describe the linear relationship between the two variables.
The correlation coefficient is -0.868, which is relatively close to -1. Because of this, we can say that the linear relationship between car weight and miles per gallon is negative and strong.
a.) Create a regression equation to model the relationship between the weight and mpg of a car.
(lm1 <- lm(mpg ~ wt, data=mtcars))
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept) wt
## 37.285 -5.344
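The regression equation for car weight and miles per gallon is \(\hat{y} = 37.285-5.344x\), where \(x\) is the car's weight in units of 1,000 lbs. and \(\hat{y}\) is the predicted miles per gallon.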
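As a rough visual check of the equation above, the sketch below pulls the unrounded coefficients out of the model with `coef()` and overlays the fitted line on the scatterplot with `abline()`. The model is refit inside the chunk so it runs on its own; the object name `b` and the line colour are illustrative choices, not part of the original report.

```r
# Refit the model from the report so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# Extract the unrounded intercept and slope from the model object
b <- coef(lm1)
round(b, 3)

# Redraw the scatterplot and overlay the fitted regression line
plot(mtcars$wt, mtcars$mpg,
     main = "Car Weight and Fuel Efficiency",
     xlab = "Weight (1,000 lbs)", ylab = "Miles per Gallon")
abline(lm1, col = "red", lwd = 2)
```

If the line passes through the middle of the point cloud and slopes downward, that agrees with the negative slope in the equation.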
b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.
#Plugging in 2,000 lbs for the x value. We use 2 because the variable is per 1,000 lbs.
37.285 - (5.344*2)
## [1] 26.597
Using the regression equation, we can estimate that a car weighing 2,000 lbs. will have a fuel efficiency of 26.6 miles per gallon (mpg).
c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.
#Plugging in 7,000 lbs for the x value. We use 7 because the variable is per 1,000 lbs.
37.285 - (5.344*7)
## [1] -0.123
A car weighing 7,000 lbs. will have an approximate fuel efficiency of -0.12 mpg.
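The two hand calculations above can also be done with `predict()`, which uses the model's unrounded coefficients. This is a minimal sketch; the data frame name `new_cars` is an illustrative choice. Because the coefficients are not rounded here, the results may differ from the hand calculations in the third decimal place.

```r
# Refit the model from the report so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in units of 1,000 lbs, so 2,000 lbs -> 2 and 7,000 lbs -> 7
new_cars <- data.frame(wt = c(2, 7))

# Predicted mpg for each weight, using the unrounded coefficients
preds <- predict(lm1, newdata = new_cars)
round(preds, 2)
```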
d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.
The prediction in part b seems reliable. A weight of 2,000 lbs. falls within the range of weights in the dataset, and on the scatterplot the predicted 26.6 mpg sits close to the observed data for cars of similar weight.
The prediction in part c, however, is unreliable. The heaviest car in the mtcars dataset weighs 5,424 lbs., so a 7,000 lb. car lies well outside the observed range and the estimate is an extrapolation. In this case, the regression equation predicts a negative mpg which is, in reality, impossible.
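One way to check whether a prediction is an interpolation or an extrapolation is to compare the new weight against the range of weights in the data. The helper `inside()` below is a hypothetical convenience function written for this sketch, not something from the assignment.

```r
# Smallest and largest observed weights, in units of 1,000 lbs
wt_range <- range(mtcars$wt)
wt_range

# Hypothetical helper: TRUE if a weight lies within the observed range
inside <- function(w) w >= wt_range[1] & w <= wt_range[2]

# Check the two weights used in parts b and c (2,000 lbs and 7,000 lbs)
inside(c(2, 7))
```

A `FALSE` result means the model is being asked about a weight it has no data for, so the prediction should be treated with caution.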
What percent of the variation in a car’s mpg is explained by the car’s weight?
#Using the summary function to call on r-squared, which describes the strength of fit
summary(lm1)$r.squared
## [1] 0.7528328
Our multiple R-squared is 0.7528. Therefore, 75.28% of the variation in MPG in cars is explained by the variation in the cars’ weight.
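For a regression with a single explanatory variable, R-squared is the square of the correlation coefficient, so the value above can be cross-checked against the earlier `cor()` result. A short sketch, with the model refit so the chunk runs on its own:

```r
# Refit the model from the report so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# Correlation coefficient and R-squared, computed separately
r <- cor(mtcars$wt, mtcars$mpg)
r_squared <- summary(lm1)$r.squared

# The two values should agree up to floating-point error
c(r_squared_from_r = r^2, r_squared_from_lm = r_squared)
all.equal(r^2, r_squared)
```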
Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.
#Store nscc_student_data into environment
nscc_student_data <- read.csv("C:/Users/jessi/Music/Statistics/nscc_student_data.csv")
str(nscc_student_data)
## 'data.frame': 40 obs. of 15 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
## $ PulseRate : int 64 75 74 65 NA 72 72 60 66 60 ...
## $ CoinFlip1 : int 5 4 6 4 NA 6 6 3 7 6 ...
## $ CoinFlip2 : int 5 6 1 4 NA 5 6 5 8 5 ...
## $ Height : num 62 62 60 62 66 ...
## $ ShoeLength : num 11 11 10 10.8 NA ...
## $ Age : int 19 21 25 19 26 21 19 24 24 20 ...
## $ Siblings : int 4 3 2 1 6 1 2 2 3 1 ...
## $ RandomNum : int 797 749 13 613 53 836 423 16 12 543 ...
## $ HoursWorking: int 35 25 30 18 24 15 20 0 40 30 ...
## $ Credits : int 13 12 6 9 15 9 15 15 13 16 ...
## $ Birthday : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
## $ ProfsAge : int 31 30 29 31 32 32 28 28 31 28 ...
## $ Coffee : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
## $ VoterReg : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...
I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.
#Creating shoe length and height scatterplot
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main="Shoe Length and Height of NSCC Students", xlab = "Shoe Length in Inches", ylab = "Height in Inches", ylim=c(55, 80))
#Creating pulse rate and height scatterplot
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main="Pulse Rate and Height of NSCC Students", xlab = "Pulse Rate in Beats per Minute", ylab = "Height in Inches", ylim=c(55, 80))
In scatterplot A (Shoe Length and Height), we can see that the data loosely clusters around an upward-sloping line. Therefore, there is a moderately weak, positive linear relationship between the two variables.
In scatterplot B (Pulse Rate and Height), the data is scattered all over the plot, thus indicating that there is no linear relationship between the two variables.
In comparing the two scatterplots, I would say that shoe length is a better predictor of height, since scatterplot A shows evidence of some linear relationship between the variables, while scatterplot B does not show any relationship at all.
Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.
#Calculating the correlation coefficient for Shoe Length and Height variables
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use="pairwise.complete.obs")
## [1] 0.2695881
#Calculating the correlation coefficient for the Pulse Rate and Height variables
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use="pairwise.complete.obs")
## [1] 0.2028639
The correlation coefficient (CC) of shoe length & height is 0.27, while the CC of pulse rate & height is 0.20. Based strictly on these numbers, which are both relatively close to 0, each pair of variables has at best a very weak positive linear relationship.
Because the shoe length & height CC is further from 0 than pulse rate & height, we can say that shoe length is the slightly better predictor of height.
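To illustrate what `use = "pairwise.complete.obs"` does, the sketch below uses a small made-up data frame (the `toy` values are hypothetical and are not taken from the NSCC data). Without the argument, `cor()` returns `NA` as soon as either variable has a missing value; with it, `cor()` keeps only the rows where both values are present.

```r
# Hypothetical toy data (NOT the NSCC data): two measurements with gaps
toy <- data.frame(
  shoe   = c(10, 11, NA, 9.5, 12, 10.5),
  height = c(64, 68, 66, NA, 72, 65)
)

# Default behaviour: any missing value makes the result NA
r_default <- cor(toy$shoe, toy$height)

# Pairwise deletion: use only rows where both values are present
r_pairwise <- cor(toy$shoe, toy$height, use = "pairwise.complete.obs")

# The same thing done by hand, to show which rows are kept
keep <- complete.cases(toy$shoe, toy$height)
r_manual <- cor(toy$shoe[keep], toy$height[keep])

sum(keep)                       # number of complete pairs used
all.equal(r_pairwise, r_manual)
```

The count of complete pairs is worth reporting alongside the correlation, since a coefficient based on fewer observations is less stable.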
#Creating and storing linear model
(lm2 <- lm(Height ~ ShoeLength, data=nscc_student_data))
##
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
##
## Coefficients:
## (Intercept) ShoeLength
## 60.365 0.566
The regression equation for shoe length and height is \(y = 60.4+0.566x\), with the intercept rounded from 60.365. By plugging in 10 for the x value, we can predict the height of a person whose shoe is 10" long.
#Computing equation
60.4 + (0.566*10)
## [1] 66.06
According to our regression equation, a person with a 10" shoe is about 66" tall.
I do not think this is an accurate prediction. Because our CC indicates a relatively weak relationship between the variables, we cannot solely (pun intended!) rely on shoe length as a good predictor of height. The weak relationship between the two variables is also evident on the scatterplot: if we look at observations of shoe length around 10", we can see heights ranging from around 60" to upwards of 70", which is too much variation to indicate a strong relationship.
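The weakness of both models can be put in terms of explained variation by squaring the correlation coefficients reported earlier. This is a small arithmetic sketch, assuming each regression uses the same complete pairs of observations as the corresponding `cor()` call (which is the default behaviour of `lm()` when rows with missing values are dropped).

```r
# Correlation coefficients copied from the cor() output in the report
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# For one explanatory variable, R-squared is the squared correlation
r2 <- round(c(shoe = r_shoe^2, pulse = r_pulse^2), 4)
r2
```

Under that assumption, shoe length explains roughly 7% of the variation in height and pulse rate roughly 4%, so neither is a dependable predictor.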
a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?
I would have expected pulse rate & height to have absolutely no relationship. A person’s resting heart rate is based on how well their heart functions, as well as things like cardiovascular fitness. In first analyzing this data, I assumed that heart health has nothing to do with height, which is largely determined by genetics.
b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?
I expected shoe length & height to have a much stronger relationship than what the data showed. Using common sense, I assume that the taller a person is, the larger their body in general, including the length of their feet. In general, we tend to see smaller people with smaller foot sizes, and vice versa. In this data set, I think it’s very important to note that the observations measure shoe length and not foot length. There are many different kinds of shoe shapes, etc. that, when simply measured in inches, are not a great indicator of the length of the wearer’s actual foot. If these observations instead included bare foot length in inches, we theoretically could have drastically different results.