Project #7 - Linear Correlation and Regression

Purpose

In this project, students will demonstrate their understanding of linear correlation and regression.

Preparation

The project will use two datasets – mtcars and nscc_student_data. Do some exploratory analysis of each of them when you load them into your report.

Part 1 - mtcars dataset

The following chunk of code will load the mtcars dataset into your environment. Familiarize yourself with the dataset before proceeding with the next 4 questions related to it.

# Store mtcars dataset into environment
mtcars <- mtcars

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
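
As part of getting familiar with the dataset, it can help to summarize the two variables used in Questions 1 through 4. The following is a minimal sketch rather than a required step of the project; the object name `wt_mpg` is an assumption made for illustration.

```r
# Keep only the two columns used in Part 1 (wt is in units of 1,000 lbs)
wt_mpg <- mtcars[, c("wt", "mpg")]

# Five-number summary and mean for each variable
summary(wt_mpg)

# Check for missing values before computing correlations
anyNA(wt_mpg)

# Observed range of weights, useful later when judging predictions
range(wt_mpg$wt)
```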

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

# Creating scatterplot

plot(mtcars$wt, mtcars$mpg, main="Car Weight and Fuel Consumption", xlab="Weight (1,000 lbs)", ylab="Miles per Gallon")

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

Yes, there appears to be a moderately strong, negative linear relationship between the two variables.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

#Calculating the correlation coefficient 

cor(mtcars$wt, mtcars$mpg)

## [1] -0.8676594

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

The correlation coefficient is -0.868, which is close to -1. Because of this, we can say that the linear relationship between car weight and miles per gallon is strong and negative.
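
The description above rests on the sample correlation alone. As a hedged supplement, `cor.test()` from base R reports the same estimate together with a confidence interval and a p-value; the object name `ct` is an assumption made for illustration.

```r
# Test whether the population correlation between wt and mpg differs from 0
ct <- cor.test(mtcars$wt, mtcars$mpg)

ct$estimate   # sample correlation, same value as cor()
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for H0: correlation = 0
```

An interval that lies entirely below 0 supports describing the relationship as negative.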

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

(lm1 <- lm(mpg ~ wt, data=mtcars))

## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

The regression equation for car weight and miles per gallon is \(y = 37.285-5.344x\), where \(x\) is the car's weight in units of 1,000 lbs and \(y\) is the predicted mpg.

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

#Plugging in 2,000 lbs for the x value. We use 2 because the variable is per 1,000 lbs.

37.285 - (5.344*2)

## [1] 26.597

Using the regression equation, we can estimate that a car weighing 2,000 lbs. will have a fuel efficiency of 26.6 miles per gallon (mpg).

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

#Plugging in 7,000 lbs for the x value. We use 7 because the variable is per 1,000 lbs.

37.285 - (5.344*7)

## [1] -0.123

Using the regression equation, a car weighing 7,000 lbs. has an estimated fuel efficiency of about -0.12 mpg.

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

The prediction in part b seems reliable. If we look at the scatterplot, the predicted value of a 2,000 lb. car having fuel efficiency of 26.6 mpg falls relatively close to the collected data.

The prediction in part c, however, is unreliable. The heaviest car in the mtcars dataset weighs about 5,400 lbs., so a 7,000 lb. car lies well outside the observed data, and predicting its mpg is extrapolation. In this case the regression equation also predicts a negative mpg, which is impossible in reality.
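
The reasoning in parts b through d can also be checked in code. This is a minimal sketch, assuming the same model as `lm1`; the names `new_cars`, `mpg_hat`, `wt_range`, and `extrapolated` are introduced here for illustration only. It uses `predict()` so the unrounded coefficients are applied, then flags any weight outside the observed range.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# Weights to predict for, in units of 1,000 lbs (2,000 and 7,000 lbs)
new_cars <- data.frame(wt = c(2, 7))

# predict() uses the unrounded coefficients from the fitted model
new_cars$mpg_hat <- predict(lm1, newdata = new_cars)

# Flag predictions whose weight falls outside the observed range
wt_range <- range(mtcars$wt)
new_cars$extrapolated <- new_cars$wt < wt_range[1] | new_cars$wt > wt_range[2]

new_cars
```

Because `predict()` does not round the coefficients, its results can differ slightly in the third decimal place from the hand calculations above.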

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

#Using the summary function to call on r-squared, which describes the strength of fit
summary(lm1)$r.squared

## [1] 0.7528328

Our multiple R-squared is 0.7528. Therefore, about 75.28% of the variation in cars’ mpg is explained by the linear relationship with the cars’ weight.
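
In a regression with a single predictor, R-squared equals the square of the correlation coefficient from Question 2. The short check below is a sketch of that relationship; the names `r` and `r_squared` are assumptions made for illustration.

```r
# Refit the model so this chunk runs on its own
lm1 <- lm(mpg ~ wt, data = mtcars)

# Correlation coefficient between weight and mpg
r <- cor(mtcars$wt, mtcars$mpg)

# R-squared reported by the model summary
r_squared <- summary(lm1)$r.squared

# The squared correlation matches R-squared (up to floating-point tolerance)
all.equal(r^2, r_squared)
```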

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

#Store nscc_student_data into environment
nscc_student_data <- read.csv("C:/Users/jessi/Music/Statistics/nscc_student_data.csv")

str(nscc_student_data)

## 'data.frame':    40 obs. of  15 variables:
##  $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 2 2 1 2 ...
##  $ PulseRate   : int  64 75 74 65 NA 72 72 60 66 60 ...
##  $ CoinFlip1   : int  5 4 6 4 NA 6 6 3 7 6 ...
##  $ CoinFlip2   : int  5 6 1 4 NA 5 6 5 8 5 ...
##  $ Height      : num  62 62 60 62 66 ...
##  $ ShoeLength  : num  11 11 10 10.8 NA ...
##  $ Age         : int  19 21 25 19 26 21 19 24 24 20 ...
##  $ Siblings    : int  4 3 2 1 6 1 2 2 3 1 ...
##  $ RandomNum   : int  797 749 13 613 53 836 423 16 12 543 ...
##  $ HoursWorking: int  35 25 30 18 24 15 20 0 40 30 ...
##  $ Credits     : int  13 12 6 9 15 9 15 15 13 16 ...
##  $ Birthday    : Factor w/ 40 levels "02-15","03.14.1984",..: 32 25 30 18 1 21 19 27 35 31 ...
##  $ ProfsAge    : int  31 30 29 31 32 32 28 28 31 28 ...
##  $ Coffee      : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 2 2 1 ...
##  $ VoterReg    : Factor w/ 2 levels "No","Yes": 2 2 1 2 2 2 2 2 2 1 ...

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

#Creating shoe length and height scatterplot
plot(nscc_student_data$ShoeLength, nscc_student_data$Height, main="Shoe Length and Height of NSCC Students", xlab = "Shoe Length in Inches", ylab = "Height in Inches", ylim=c(55, 80))

#Creating pulse rate and height scatterplot
plot(nscc_student_data$PulseRate, nscc_student_data$Height, main="Pulse Rate and Height of NSCC Students", xlab = "Pulse Rate in Beats per Minute", ylab = "Height in Inches", ylim=c(55, 80))

Discuss the two scatterplots individually. Based only on glancing at the scatter plots, does there appear to be a linear relationship between the variables? If so, is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

In scatterplot A (Shoe Length and Height), we can see that the data loosely clusters around an upward-sloping line. Therefore, there is a moderately weak, positive linear relationship between the two variables.

In scatterplot B (Pulse Rate and Height), the data is scattered all over the plot, thus indicating that there is no linear relationship between the two variables.

In comparing the two scatterplots, I would say that shoe length is a better predictor of height, since scatterplot A shows evidence of some linear relationship between the variables, while scatterplot B does not show any relationship at all.

Question 6 – Calculate correlation coefficients

Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

#Calculating the correlation coefficient for Shoe Length and Height variables

cor(nscc_student_data$ShoeLength, nscc_student_data$Height, use="pairwise.complete.obs")

## [1] 0.2695881

#Calculating the correlation coefficient for the Pulse Rate and Height variables

cor(nscc_student_data$PulseRate, nscc_student_data$Height, use="pairwise.complete.obs")

## [1] 0.2028639

Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

The correlation coefficient (CC) of shoe length & height is 0.27, while the CC of pulse rate & height is 0.20. Strictly based on these numbers, which are both relatively close to 0, the linear relationship within each pair of variables is very weak at best.

Because the shoe length & height CC is further from 0 than the pulse rate & height CC, we can say that shoe length is the slightly better predictor of height.
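
To illustrate what `use = "pairwise.complete.obs"` does, here is a small sketch on made-up values. The data frame `toy` is hypothetical and is not the NSCC dataset; it only borrows the column names.

```r
# Hypothetical toy data (not the NSCC dataset); NA marks a missing measurement
toy <- data.frame(
  ShoeLength = c(10, 11, NA, 12, 10.5, 11.5),
  Height     = c(62, 66, 70, NA, 64, 69)
)

# Default behaviour: any NA in either vector makes the result NA
cor(toy$ShoeLength, toy$Height)

# Drop incomplete pairs before computing the correlation
r_pairwise <- cor(toy$ShoeLength, toy$Height, use = "pairwise.complete.obs")

# Equivalent manual approach: keep only rows where both values are present
complete <- toy[complete.cases(toy), ]
r_manual <- cor(complete$ShoeLength, complete$Height)

nrow(complete)                   # number of pairs actually used
all.equal(r_pairwise, r_manual)  # both approaches agree
```

Checking how many complete pairs remain is worthwhile, since each correlation may be based on fewer observations than the full dataset.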

Question 7 – Creating and using a regression equation

Create a linear model for height as the response variable with shoe length as a predictor variable.

#Creating and storing linear model
(lm2 <- lm(Height ~ ShoeLength, data=nscc_student_data))

## 
## Call:
## lm(formula = Height ~ ShoeLength, data = nscc_student_data)
## 
## Coefficients:
## (Intercept)   ShoeLength  
##      60.365        0.566

Use that model to predict the height of someone who has a 10" shoe length.

The regression equation for shoe length and height is \(y = 60.4+0.566x\). By plugging in 10 for the x value, we can predict that person’s height.

#Computing equation

60.4 + (0.566*10)

## [1] 66.06

According to our regression equation, a person with a 10" shoe is about 66" tall.

Do you think that prediction is an accurate one? Explain why or why not.

I do not think this is an accurate prediction. Because our CC indicates a relatively weak relationship between the variables, we cannot solely (pun intended!) rely on shoe length as a good predictor of height. The weak relationship between the two variables is also evident on the scatterplot: among observations with a shoe length around 10", heights range from around 60" to upwards of 70", which is too much variation to indicate a strong relationship.
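
One way to put a number on this uncertainty is a prediction interval. The sketch below uses simulated data as a stand-in, since the NSCC file is not bundled with R; the values in `sim` are hypothetical and the resulting numbers say nothing about the real students. With the real data, the same `predict()` call could be applied to `lm2`.

```r
# Simulated stand-in for nscc_student_data (hypothetical values, not real data)
set.seed(1)
sim <- data.frame(ShoeLength = runif(40, min = 8, max = 13))
sim$Height <- 60 + 0.6 * sim$ShoeLength + rnorm(40, sd = 4)

# Same model form as lm2 in the report
lm_sim <- lm(Height ~ ShoeLength, data = sim)

# Point prediction plus a 95% prediction interval for a 10-inch shoe
pred <- predict(lm_sim, newdata = data.frame(ShoeLength = 10),
                interval = "prediction", level = 0.95)
pred

# Width of the interval in inches
pred[, "upr"] - pred[, "lwr"]
```

A wide interval around the point estimate is the numerical counterpart of the vertical spread seen on the scatterplot.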

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?

I would have expected pulse rate & height to have absolutely no relationship. A person’s resting heart rate is based on how well their heart functions, as well as things like cardiovascular fitness. In first analyzing this data, I assumed that heart health has nothing to do with height, which is largely determined by genetics.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?

I expected shoe length & height to have a much stronger relationship than what the data showed. Using common sense, I assume that the taller a person is, the larger their body in general, including the length of their feet. In general, we tend to see smaller people with smaller foot sizes, and vice versa. In this data set, I think it’s very important to note that the observations measure shoe length and not foot length. Shoes come in many different shapes and styles, so a shoe’s length in inches is not a great indicator of the length of the wearer’s actual foot. If these observations instead recorded bare foot length in inches, the results could be drastically different.