Purpose

In this project, Colin demonstrates his understanding of linear correlation and regression.

Preparation

The project uses two datasets – mtcars and nscc_student_data. Some exploratory analysis has been performed on each dataset after loading it into this report.

Part 1 - mtcars dataset

# Store mtcars dataset into environment
mtcars <- mtcars

Code repository for Part 1 - mtcars dataset

x <- mtcars$wt * 1000 # Store the weight as x; wt is recorded in thousands of pounds, so multiplying by 1000 expresses the weight in pounds (lb).

y <- mtcars$mpg # Store the fuel efficiency as y, displayed in miles per gallon.

plot(x, y, main="Figure 1: Vehicle weights versus miles per gallon", xlab="Vehicle Weight (lb)", ylab="Fuel Efficiency (mpg)") # create the scatterplot to show the variables in two-dimensional space, with some customization to ensure readability.

model <- lm(y ~ x) # create a linear regression model for the two variables

abline(model, col="#66AA55") # add a trendline from the model to make the linear relationship visible

(correlationcoefficient <- cor(x, y)) # calculate the correlation coefficient between the two variables; wrapping the assignment in parentheses prints the result
## [1] -0.8676594
coefficients <- coef(model) # extract the model's coefficient vector and store it as 'coefficients'

intercept <- coefficients[1] # return the first indexed part of the vector, which is the y-intercept (value of y when x is 0)

slope <- coefficients[2] # return the second indexed part of the vector, which is the slope (m) in the conventional y=mx+b slope equation

cat("The equation of this trendline is:  Y =", slope,"* X +", intercept) # concatenate the quotes strings with the variables above to display a user friendly sentence including an equation describing the vector we have calculated
## The equation of this trendline is:  Y = -0.005344472 * X + 37.28513
# calculate the mpg (y variable) from a given wt (x variable).  The following calculations apply the equation manually:
((slope * 2000) + intercept)
##        x 
## 26.59618
((slope * 7000) + intercept)
##          x 
## -0.1261748
# The y_mpgwt_x function below applies the same equation to any input weight and prints the predicted mpg.

y_mpgwt_x <- function(xinput){
  x_mpgwt_y <- ((slope * xinput) + intercept)
  cat("The fuel economy of a vehicle weighing ", xinput, " pounds is predicted to be ", x_mpgwt_y, "miles per gallon.")
}

y_mpgwt_x(2000)
## The fuel economy of a vehicle weighing  2000  pounds is predicted to be  26.59618 miles per gallon.
y_mpgwt_x(7000)
## The fuel economy of a vehicle weighing  7000  pounds is predicted to be  -0.1261748 miles per gallon.
# Summarizing the model and then calculating the R-squared value from it
model_summary <- summary(model)
r_squared <- model_summary$r.squared
(r_squared)
## [1] 0.7528328

Question 1 – Create Scatterplots

a.) Create a scatterplot of the weight variable (explanatory variable) and the miles per gallon variable (response variable) in the mtcars dataset.

A. See Figure 1.

b.) Only by looking at the scatterplot, does there appear to be a linear relationship between the weight and mpg of a car?

B. There does appear to be a linear relationship between the weight and miles per gallon of a car according to these data. Linear regression was used to superimpose a trendline to illustrate the negative correlation between these variables.

Question 2 – Correlation Coefficient

a.) Calculate the linear correlation coefficient of the weight and mpg variables.

b.) Based on that correlation coefficient, describe the linear relationship between the two variables.

A. The linear correlation coefficient for these data is -0.8676594.

B. They exhibit a strong negative linear relationship: as vehicles increase in curb weight, their fuel efficiency decreases.

Question 3 – Create a Regression Model

a.) Create a regression equation to model the relationship between the weight and mpg of a car.

A. The regression equation to model the relationship between the weight and fuel efficiency of a car is:

\(\color{#66AA55}{\textit{Y} = -0.005344472 \times \textit{X} + 37.28513}\)

b.) Use the regression equation to estimate the mpg of a car that weighs 2,000 lbs.

B. Using this model, the fuel economy of a car that weighs 2,000 lbs is predicted to be roughly 26.6 mpg (26.59618).

c.) Use the regression equation to estimate the mpg of a car that weighs 7,000 lbs.

C. Using this model, the fuel economy of a car that weighs 7,000 lbs is predicted to be roughly -0.13 mpg (-0.1261748).
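Both estimates can also be checked with R's built-in predict() function instead of applying the slope and intercept by hand; a quick sketch using the model object from the code repository above:

# Sketch: the model formula was y ~ x, so the new data frame must use the column name x.
# This should reproduce the two values calculated manually above.
predict(model, newdata = data.frame(x = c(2000, 7000)))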

d.) Do you think the predictions in parts b and c are reliable ones? Explain why or why not.

D. Given that the second prediction yielded a nonsensical result, a negative miles per gallon, it is reasonable to conclude that this model is not reliable for predicting the fuel economy of vehicles with curb weights outside the range of the given dataset. The model is suited to interpolation within the observed weight range, not extrapolation beyond it.
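One way to see why the 7,000 lb prediction is an extrapolation is to check the range of the observed weights; a quick sketch using the x variable defined above:

# Sketch: the observed weights span roughly 1,500 to 5,400 lbs, so 2,000 lbs falls inside
# the data while 7,000 lbs falls well outside it.
range(x)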

Question 4 – Explained Variation

What percent of the variation in a car’s mpg is explained by the car’s weight?

The \(r^2\) value for this model is 0.7528328. This can be interpreted to mean that the model explains roughly 75.3% of the variation in the fuel efficiency of these cars.
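For a simple linear regression with a single predictor, \(r^2\) is just the square of the correlation coefficient, which can be confirmed directly; a quick sketch:

# Sketch: squaring the correlation coefficient should match model_summary$r.squared above.
cor(x, y)^2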

Part 2 – NSCC Student dataset

Use the following chunk of code to load the NSCC student dataset into your environment. Familiarize yourself with the dataset before proceeding with the next few questions related to it.

Code repository for Part 2 - NSCC Student dataset

# Store nscc_student_data into environment and extract several columns into more manageable variables
nscc_student_data <- read.csv('nscc_student_data.csv')
shoe <- nscc_student_data$ShoeLength
pulse <- nscc_student_data$PulseRate
height <- nscc_student_data$Height

plot(shoe, height, main="Figure 2: NSCC Student Height as predicted by Shoe Length", xlab="Shoe Length (in)", ylab="Student Height (in)", ylim=c(60,76)) # Plotting the data as described.  Note that one data point for a male student with a height reported as "6.00" and a shoe length given as "NA" is not plotted due to the missing shoe length, and it would also be off the scale of the y-axis because of the ylim argument.

model <- lm(height ~ shoe) # Question 7, create a linear model for height as response with shoe length as the predictor
# Question 7b uses the predict() function to evaluate the response variable, which differs from the manual approach used in Part 1 above.  I prefer this method; it does not require manipulating the algebra by hand and is probably better suited to large data workflows.
# abline(model, col="#66AA55") # create a standard trendline on the plot() to show in green the trend including obvious outliers mentioned in the answer to Qu5AB.
plot(pulse,height, main="Figure 3: NSCC Student Height as predicted by Pulse Rate", xlab="Pulse Rate (bpm)", ylab="Student Height (in)", ylim=c(60,76))

correlationcoefficient <- cor(shoe,height, use="pairwise.complete.obs")
(correlationcoefficient) # The correlation coefficient for Figure 2
## [1] 0.2695881
correlationcoefficient <- cor(pulse,height, use="pairwise.complete.obs")
(correlationcoefficient) # The correlation coefficient for Figure 3
## [1] 0.2028639
# The following code was written in an attempt to find a way to exclude the obvious outliers without attempting to edit the dataset or provide justification to manually remove them.  This was unsuccessful but I kept it in here to show my work.
# 
# model2 <- lm(height ~ shoe)
# residuals <- resid(model2)
# IQR <- IQR(residuals)
# lower_bound <- quantile(residuals, 0.25) - 1.5 * IQR
# upper_bound <- quantile(residuals, 0.75) + 1.5 * IQR
# non_outliers <- residuals > lower_bound & residuals < upper_bound
# outlier_threshold <- 1.1 * sd(residuals)
# non_outliers <- abs(residuals) < outlier_threshold
# stronger_model <- lm(height ~ shoe, data = nscc_student_data, subset = non_outliers)
# plot(shoe, height, main="Test", ylim=c(60,76))
# abline(stronger_model, col="#AA6655")
# 
# End of the tangential code. 
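# A possible fix for the tangential code above (a sketch only, not used for the graded answers):
# lm() silently drops rows with missing values, so the residuals no longer line up with the
# rows of the original vectors.  Restricting to complete cases first keeps the indices aligned,
# after which a single residual-based IQR filter can be applied before refitting.
# complete <- complete.cases(shoe, height)
# shoe_c <- shoe[complete]
# height_c <- height[complete]
# model2 <- lm(height_c ~ shoe_c)
# res <- resid(model2)
# lower_bound <- quantile(res, 0.25) - 1.5 * IQR(res)
# upper_bound <- quantile(res, 0.75) + 1.5 * IQR(res)
# keep <- res > lower_bound & res < upper_bound
# stronger_model <- lm(height_c[keep] ~ shoe_c[keep])
# abline(stronger_model, col = "#AA6655")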

predict_height <- data.frame(shoe = 10) 
predicted_height <- predict(model, newdata = predict_height)
print(predicted_height) # predicted height for a shoe length of 10 inches, in inches. 
##        1 
## 66.02598

Question 5 – Create Scatterplots

I’m curious if a person’s height is better predicted by their shoe length or by their pulse rate.

a.) Create two scatterplots, both with height as the response variable. One with shoe length as the explanatory variable and the other with pulse rate as the explanatory variable.

A. Figure 2 shows a scatter plot with shoe length, in inches (in), as the explanatory variable. Figure 3 shows a scatter plot with pulse rate, in beats per minute (bpm), as the explanatory variable. Both Figures 2 and 3 show the response variable as student height, in inches (in). The ylim argument for both figures has been set to span the lowest and highest plotted heights.

b.) Discuss the two scatterplots individually. Based only on glancing at the scatter plots, does there appear to be a linear relationship between the variables? If so, is the relationship weak/moderate/strong? Based on the scatterplots, does either explanatory variable appear to be a better predictor of height? Explain your reasoning. Answers may vary here.

B. Glancing at the plots and determining relationships:

Figure 2: There appears to be a positive relationship between shoe length and student height, with an overall upward trend but several outliers. Without calculating a correlation coefficient, I would evaluate this relationship as weak to moderate.

Figure 3: There does not appear to be any relationship between pulse rate and student height. Without calculating a correlation coefficient, I would evaluate this as a weak relationship or none at all.

Question 6 – Calculate correlation coefficients

a.) Calculate the correlation coefficients for each pair of variables in #5. Use the argument use = "pairwise.complete.obs" in your call to the cor() function to deal with the missing values.

A. The correlation coefficient for Figure 2 (shoe length) is 0.2695881.

The correlation coefficient for Figure 3 (pulse rate) is 0.2028639.

b.) Strictly based on the correlation coefficients, which explanatory variable is the better predictor of height?

B. The correlation coefficients are unequal, and the shoe length value is greater than the pulse rate value. Therefore, shoe length is the better predictor of height according to these data. That said, a correlation coefficient between 0.20 and 0.30 does not indicate a particularly good predictor in either case.

Question 7 – Creating and using a regression equation

a.) Create a linear model for height as the response variable with shoe length as a predictor variable.

A. This model is created in the code repository section above (model <- lm(height ~ shoe)).

b.) Use that model to predict the height of someone who has a 10” shoe length.

B. A person with a 10-inch shoe length is predicted to have a height of approximately 66.03 inches.

c.) Do you think that prediction is an accurate one? Explain why or why not.

C. That prediction is not very accurate, but it is somewhat useful for estimating height. It would not capture the high variation seen in these data, but it would capture some of the students, namely those who land within the interquartile range. There is a central positive trend of increasing height with increasing shoe length, with many outliers, but I was unable to remove them by selecting out values outside of the central range. Student 21 is apparently wearing either clown shoes or diving flippers, I imagine. Students 10 and 38 also appear to have some odd measurements, but those are not as obvious as the outlier in Student 21’s shoe length or Student 15’s height.
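One way to quantify how imprecise that single-number prediction is would be to ask predict() for a prediction interval; with this weak a correlation, the interval should come out quite wide. A quick sketch using the model and data from the code repository above:

# Sketch: a 95% prediction interval for an individual student's height at a 10-inch shoe length.
# A wide interval here would reflect the large residual scatter visible in Figure 2.
predict(model, newdata = data.frame(shoe = 10), interval = "prediction", level = 0.95)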

Question 8 – Poor Models

a.) You hopefully found that these were both poor models. Which pair of variables, based on common sense, would you have expected to have a poor/no relationship before your analysis?

Both could plausibly have some meager correlation with a student’s height, but pulse rate would be expected to have little or no bearing on the height of the individual. Pulse rate is more closely related to overall health, cardiovascular fitness, and age, all of which are mostly independent of height.

b.) Perhaps you expected the other pair of variables to have a stronger relationship than it ended up having. Can you come up with any reasoning based on the specific sample of data for why the relationship did not turn out to be very strong?

I expected shoe length and height to have a stronger predictive connection. When viewing the scatter plot of the data, I can see an apparent trend in the center of the cluster of points that shows a defined positive relationship between most students’ shoe lengths and their heights. However, when applying more rigorous statistical analysis, I found that I could not create a justifiable trendline to describe that relationship without manually editing out data points. Using common sense, it seems likely that students wear shoes that are only loosely matched to their anatomical foot length. Women’s currently popular styles, for example, are often open-toed shoes intended to be roughly equal to the length from heel to toe, whereas men’s more common styles, sneakers and Crocs, often cover the entire foot with some extra room in the front. Some students may have been wearing alternate footwear, such as medical braces, scuba gear, or even long pointy witch shoes the likes of which you haven’t seen since it was October in Salem. It is also possible that many factors affected the accuracy of these measurements, as the data appear to be self-reported and incomplete. If the data were collected at different times of year, then open-toed women’s shoes and closed-toed men’s boots would introduce very low and very high figures.