Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.


Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

#Looking at the structure of the mtcars dataset
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

#Creating a scatterplot with weight as an explanatory variable and miles per gallon as a response variable
plot(mpg ~ wt, data=mtcars, main="Miles Per Gallon and Weight of Cars", xlab="Weight of Car (in 1000 lbs)", ylab="Miles Per Gallon of Car")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

By looking at the scatterplot, it appears that there is a negative linear relationship between the weight and mpg of a car. The relationship appears to be moderately strong.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

#Finding the correlation coefficient of the weight and mpg variables
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

The correlation coefficient of the weight and mpg variables is -0.8676594.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

Based on the correlation coefficient of approximately -0.87, the two variables have a strong negative linear relationship: as the weight of a car increases, its mpg tends to decrease.
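As an optional check that goes beyond what the question asks, the correlation can also be tested for significance. The sketch below uses base R's cor.test(); the object names r and wt_mpg_test are chosen here only for illustration.

```r
# Optional check (not part of the original assignment):
# Pearson correlation plus a test of H0: true correlation = 0
r <- cor(mtcars$wt, mtcars$mpg)
wt_mpg_test <- cor.test(mtcars$wt, mtcars$mpg)

# Point estimate, 95% confidence interval, and p-value
wt_mpg_test$estimate
wt_mpg_test$conf.int
wt_mpg_test$p.value
```

A confidence interval that lies entirely below zero would support describing the relationship as negative rather than merely "strong".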

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

#Using the lm function and storing the linear model into the environment
lm1 <- lm(mpg ~ wt, data=mtcars)
#Observing the data on the relationship of the variables
lm1
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

The regression equation that models the relationship between the weight of a car (x, in 1000 lbs) and its mpg (y) is:

\(y=37.285+(-5.344*x)\).

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

#Plugging a weight of 2,000 lbs into the regression equation. x = 2 is used because wt is recorded in units of 1000 lbs.
37.285+(-5.344*2)
## [1] 26.597

Using the regression equation, a car that weighs 2,000 lbs is estimated to have a fuel efficiency of approximately 26.6 miles per gallon.
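The same estimate can be obtained without retyping the rounded coefficients. This is a minimal sketch using predict() on the fitted model; the names new_car and pred_2000 are introduced here for illustration, and the model is refit so the chunk runs on its own.

```r
# Refit the model so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# wt is recorded in units of 1000 lbs, so 2,000 lbs is wt = 2
new_car <- data.frame(wt = 2)
pred_2000 <- predict(lm1, newdata = new_car)
pred_2000
```

Because predict() uses the unrounded coefficients, its result can differ from the hand calculation in the third decimal place.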

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

#Plugging a weight of 7,000 lbs into the regression equation. x = 7 is used because wt is recorded in units of 1000 lbs.
37.285+(-5.344*7)
## [1] -0.123

Using the regression equation, a car that weighs 7,000 lbs is estimated to have a fuel efficiency of -0.123 miles per gallon.

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

I believe that the estimate in part b is reliable, whereas the estimate in part c is not. In part b, the estimate is for a car weight (2,000 lbs) that falls within the range of weights on which data was collected. The scatterplot shows that the predicted value of approximately 26.6 miles per gallon for a 2,000 pound car falls relatively close to the actual data points for cars of approximately this weight. Part c, by contrast, is a case of extrapolation. The data does not include any cars weighing close to 7,000 pounds, so there is no evidence that the linear pattern continues that far, and any estimate at that weight is unreliable. The result confirms this: in reality, a car cannot have a negative number of miles per gallon.
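Whether a prediction is an extrapolation can be checked directly against the observed weights. The sketch below is one way to do this; the helper in_range() is a hypothetical function written for this report, not a built-in.

```r
# Observed range of car weights (in 1000 lbs)
wt_range <- range(mtcars$wt)
wt_range

# Hypothetical helper: TRUE when a weight lies inside the observed range
in_range <- function(w) w >= wt_range[1] & w <= wt_range[2]

# 2 = 2,000 lbs (part b), 7 = 7,000 lbs (part c)
in_range(c(2, 7))
```

A weight that falls outside the observed range, as the 7,000 lb car does, means the model is being used where no data supports it.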

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

#Using the summary function to look at the linear model of the data
summary(lm1)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

The multiple r-squared value is 0.7528, and therefore 75.28% of the variation in mileage per gallon of a car can be explained by the car’s weight.
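The R-squared value can also be pulled out of the summary object rather than read from the printout. In a simple linear regression with one predictor it equals the square of the correlation coefficient from Question 2, which the sketch below checks. The names r_squared and r are chosen for illustration.

```r
# Refit the model so this chunk is self-contained
lm1 <- lm(mpg ~ wt, data = mtcars)

# R-squared as stored in the model summary
r_squared <- summary(lm1)$r.squared

# With a single predictor, R-squared equals the squared correlation
r <- cor(mtcars$wt, mtcars$mpg)
all.equal(r_squared, r^2)

# Expressed as a percentage
round(100 * r_squared, 2)
```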

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

# Store nscc_student_data into environment
nscc_student_data <- read.csv("nscc_student_data-2.csv")

#Getting a better sense of the NSCC Student Dataset
str(nscc_student_data)
## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

#Creating a scatterplot of shoe length and height
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main="Shoe Length and Height of Students", xlab = "Shoe Length", ylab = "Height")

#Creating a scatterplot of pulse rate and height
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main="Pulse Rate and Height of Students", xlab = "Pulse Rate", ylab = "Height")

b.) Discuss the two scatterplots individually. Does there appear to be a linear relationship between the variables? Is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

Looking at the scatterplot of shoe length and height, there is no immediately apparent linear relationship between the two variables. However, upon closer observation, there is a slight positive correlation that can be seen in the clusters of data such as those around the points (10,65), (11,70), and (12,75). Because the points are so widely scattered, the linear relationship would be categorized as very weak. Therefore, shoe length may offer some help in predicting a student’s height, but if so, it explains only a very low percentage of the variation in height.

The scatterplot of pulse rate and height shows no apparent linear correlation. As pulse rate increases, height tends to stay within the same range of values (approximately 59-77). Any linear relationship is nearly non-existent. This implies that little or none of the variation in height is explained by pulse rate.

Overall, if choosing which variable is the better predictor of height, shoe length would have to be chosen. Even though its linear relationship is weak, it appears to have a stronger correlation with the height variable than the pulse rate variable does.

Question 6 – Calculate correlation coefficients

a.) Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

#Calculating the correlation coefficient for the shoe length and height variables
cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2695881
#Calculating the correlation coefficient for the pulse rate and height variables
cor(nscc_student_data$PulseRate, nscc_student_data$Height, use = "pairwise.complete.obs")
## [1] 0.2028639

The correlation coefficient for shoe length and height is approximately 0.2696, while the correlation coefficient for pulse rate and height is approximately 0.2029.
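To see what use = "pairwise.complete.obs" does, it helps to try it on a small example. The data frame toy below is made up purely for illustration (it is not the NSCC data); the sketch shows that, for two variables, the argument gives the same result as dropping every row in which either value is missing.

```r
# Made-up illustration data with missing values (NOT the NSCC dataset)
toy <- data.frame(
  ShoeLength = c(10, 11, NA, 12, 10.5),
  Height     = c(64, 69, 66, 72, NA)
)

# Rows where both values are present
complete <- complete.cases(toy$ShoeLength, toy$Height)
sum(complete)

# Correlation using only the complete pairs, computed two ways
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")
r_manual   <- cor(toy$ShoeLength[complete], toy$Height[complete])
all.equal(r_pairwise, r_manual)
```

Without the use argument, cor() returns NA whenever either variable contains a missing value.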

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

Based strictly on the correlation coefficient, the shoe length variable is a better predictor of height when choosing between shoe length and pulse rate.

Question 7 – Creating and using a regression equation

a.) Create a linear model for height as the response variable with shoe length as a predictor variable.

#Using the lm function to create a linear model
lm2 <- lm(Height ~ ShoeLength, data=nscc_student_data)

b.) Use that model to predict the height of someone who has a 10" shoe length.

#Observing the linear model to find the intercept and coefficient for the regression equation
lm2
## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

The regression equation, with shoe length as x and predicted height as y, is:

\(y=60.365 + 0.566*x\).

#Plugging the value of 10" into the equation
60.365 + 0.566*(10)
## [1] 66.025

Based on the shoe length variable, someone with a shoe length of 10" would be predicted to have a height of approximately 66.0 inches.

c.) Do you think that prediction is an accurate one? Explain why or why not.

No, I do not think that this prediction is an accurate one. The scatterplot of the shoe length and height variables showed only a very weak linear relationship, and the correlation coefficient calculated in Question 6 was only approximately 0.2696. Both of these imply that shoe length is not a great predictor of height. Even though the scatterplot in Question 5 shows a cluster of individuals with shoe lengths of 10 and heights of approximately 66, there is also an individual with a shoe length of 10 and a height of about 60, which shows how much room there is for variation. Due to the relatively small sample size, the weak pattern in the scatterplot, and the low correlation coefficient, I would not consider this prediction an accurate one.
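Squaring the two correlation coefficients reported in Question 6 gives a rough sense of how little of the variation in height either variable explains. The sketch below simply reuses those printed values; the names r_shoe, r_pulse, and pct_explained are chosen for illustration.

```r
# Correlation coefficients as reported in Question 6
r_shoe  <- 0.2695881
r_pulse <- 0.2028639

# In a one-predictor linear model, R-squared is the squared correlation
pct_explained <- round(100 * c(shoe = r_shoe^2, pulse = r_pulse^2), 1)
pct_explained
```

By this measure, shoe length explains roughly 7% of the variation in height and pulse rate roughly 4%, compared with about 75% for weight and mpg in Part 1.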

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables would you have expected to have a poor/no relationship before your analysis?

Before my analysis, I expected that the pulse rate and height variables would have a poor/no relationship. At first I thought they might be related because of the increased work a heart would have to do to pump blood through the veins of a taller body, but then I remembered that pulse can vary greatly and individuals can have high or low resting rates no matter their height. After the analysis, it makes sense that one’s height is not closely related to their heart rate.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it did. Can you come up with any reasoning for why the relationship did not turn out to be very strong?

Concerning the shoe length variable, I did initially expect it to have a stronger correlation with the height variable than it did. This is because typically the taller someone is, the larger their shoe size is likely going to be. One possible reason for the weak relationship may be that other variables have a much stronger correlation with height than shoe length does, and therefore account for a higher percentage of the variation in height. The two may be correlated, but only weakly, and the length of someone’s shoe is not able to predict their height effectively.