Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

# Height and weight data
heightWeight <- read.csv("https://gist.githubusercontent.com/nstokoe/7d4717e96c21b8ad04ec91f361b000cb/raw/bf95a2e30fceb9f2ae990eac8379fc7d844a0196/weight-height.csv")

1. Visualize the data

This plot shows the relationship between height and weight. Height was measured in inches and weight in pounds. The relationship appears to be strong and positive: taller people tend to weigh more.

plot(heightWeight[,'Height'], heightWeight[,'Weight'], main = "Height vs. Weight", xlab="Height (inches)", ylab="Weight (pounds)")
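
One way to put a number on how strong the relationship looks is the Pearson correlation coefficient. The sketch below is an illustration, not part of the original analysis. `correlation_strength` is a hypothetical helper name, and it assumes the data frame has the numeric columns Height and Weight used in the plot.

```r
# Hypothetical helper (not from the original analysis): Pearson correlation
# between two numeric columns of a data frame, ignoring incomplete rows.
correlation_strength <- function(df, x = "Height", y = "Weight") {
  cor(df[[x]], df[[y]], use = "complete.obs", method = "pearson")
}

# Runs only if the data frame from the earlier step has been loaded.
if (exists("heightWeight")) {
  print(correlation_strength(heightWeight))
}
```

Values near 1 indicate a strong positive linear relationship. For a model with a single predictor, the square of this value equals the Multiple R-squared reported by summary().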

2. Linear Model Function

This section shows the linear model, the output of the model, and a plot of the fitted line over the data. The regression line appears to follow the upward-sloping trend of the data points well. I colored the data points by gender to show the differences between the two groups.

library(ggplot2)

heightWeightLM <- lm(Weight ~ Height, data = heightWeight)

heightWeightLM
## 
## Call:
## lm(formula = Weight ~ Height, data = heightWeight)
## 
## Coefficients:
## (Intercept)       Height  
##    -350.737        7.717
ggplot(heightWeight, mapping = aes(x = Height, y = Weight)) +
  geom_point(alpha = 0.5, aes(colour = factor(Gender))) +
  geom_abline(slope = heightWeightLM$coefficients[2], intercept = heightWeightLM$coefficients[1]) +
  ggtitle("Height vs. Weight")
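
As a rough illustration of what the fitted line means, the coefficients printed above give Weight = -350.737 + 7.717 * Height. For a height of 70 inches (a value chosen here only as an example), that works out to about -350.737 + 7.717 * 70 = 189.5 pounds. The sketch below does the same calculation with predict(). `predict_from_line` is a hypothetical helper name, not something from the original analysis.

```r
# Hypothetical helper: predicted response from a one-predictor lm() fit.
# x_name must match the predictor name used in the model formula.
predict_from_line <- function(model, x_values, x_name = "Height") {
  new_data <- data.frame(x_values)
  names(new_data) <- x_name
  unname(predict(model, newdata = new_data))
}

# Runs only if the model from the earlier step has been fitted.
# The heights 60, 65, and 70 are arbitrary example values.
if (exists("heightWeightLM")) {
  print(predict_from_line(heightWeightLM, c(60, 65, 70)))
}
```

Predictions are only trustworthy inside the range of heights present in the data. The negative intercept shows that the line is not meaningful near a height of zero.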

3. Evaluating the Quality of the Linear Model

In this summary, you can see the residuals, the coefficients, and other statistics. The residuals section summarizes the differences between the actual measured values and the values on the regression line. Each residual is the vertical distance between the regression line and one actual data point.

The coefficients section shows the values for the linear regression equation. Each value in the Std. Error column is the standard error of the corresponding coefficient estimate. As a rule of thumb, a standard error at least five to ten times smaller than the coefficient is a good sign. The ratio of the estimate to the standard error is the test statistic, shown as the t value in the summary table. The last column, Pr(>|t|), shows the probability of seeing a test statistic as extreme as or more extreme than the one observed if the true coefficient were zero.

The last several lines provide statistical information about the model as a whole. The residual standard error measures the typical size of the residuals, that is, their standard deviation adjusted for the degrees of freedom. The multiple R-squared value measures how well the model describes the measured data: it is the proportion of the variation in the response that the model explains, and values closer to one indicate a better fit. The adjusted R-squared value is similar, except that it takes into account the number of predictors used in the model. The F-statistic compares the current model to one with only the intercept. It carries more information for models with multiple predictors.

In this model for the height and weight data, the standard error for Height was well within that guideline at about 1/243 the size of the coefficient. The p-values for Height and the intercept were both significant (below 2e-16). The R-squared values were strong at 0.8552, meaning the model explains about 86% of the variation in weight. Because this model contains only one predictor, the F-statistic adds little beyond the t-test for Height, but its small p-value indicates that the model with Height as the independent variable fits significantly better than an intercept-only model.
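
The ratio of each estimate to its standard error can also be computed directly instead of being read from the table. The sketch below is one way to do that. `coef_ratios` is a hypothetical helper name, and the column names it reads ("Estimate", "Std. Error") are the ones R uses in the coefficient table of summary().

```r
# Hypothetical helper: size of each coefficient relative to its standard error.
coef_ratios <- function(model) {
  tab <- coef(summary(model))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(
    term      = rownames(tab),
    estimate  = tab[, "Estimate"],
    std_error = tab[, "Std. Error"],
    ratio     = abs(tab[, "Estimate"] / tab[, "Std. Error"]),
    row.names = NULL
  )
}

# Runs only if the model from the earlier step has been fitted.
if (exists("heightWeightLM")) {
  print(coef_ratios(heightWeightLM))
}
```

The ratio column is the absolute value of the t value, so for Height it should match the 243.0 shown in the summary output (7.71729 / 0.03176).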

summary(heightWeightLM)
## 
## Call:
## lm(formula = Weight ~ Height, data = heightWeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.934  -8.236  -0.119   8.260  46.844 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -350.73719    2.11149  -166.1   <2e-16 ***
## Height         7.71729    0.03176   243.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.22 on 9998 degrees of freedom
## Multiple R-squared:  0.8552, Adjusted R-squared:  0.8552 
## F-statistic: 5.904e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
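
To make the last lines of the summary less of a black box, the sketch below recomputes the Multiple R-squared and the residual standard error from the residuals. The function names are hypothetical, and the formulas assume an ordinary unweighted lm() fit that includes an intercept, as the model here does.

```r
# Hypothetical helper: R-squared as 1 - (residual sum of squares / total sum of squares).
r_squared_by_hand <- function(model) {
  y <- model.response(model.frame(model))  # observed response values
  ss_res <- sum(resid(model)^2)            # variation left unexplained
  ss_tot <- sum((y - mean(y))^2)           # total variation around the mean
  1 - ss_res / ss_tot
}

# Hypothetical helper: residual standard error using the residual degrees of freedom.
residual_se_by_hand <- function(model) {
  sqrt(sum(resid(model)^2) / df.residual(model))
}

# Runs only if the model from the earlier step has been fitted.
if (exists("heightWeightLM")) {
  print(r_squared_by_hand(heightWeightLM))
  print(residual_se_by_hand(heightWeightLM))
}
```

If the assumptions hold, these should agree with the 0.8552 and 12.22 reported in the summary output.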

4. Residual Analysis

This analysis looks at the residuals to help determine the quality of the model. A model that fits well will have residuals scattered around 0. A better model will also have residuals that are nearly uniform across all of the data, so the residuals should not show a trend. When the residuals meet both conditions, the model sufficiently explains the data. Along with a residual plot, a Q-Q plot shows whether the residuals are normally distributed. When the residuals are normally distributed, the points on the Q-Q plot fall on a straight line.

In the graphs below, the residuals are not uniform across the data, but the points on the Q-Q plot follow a straight line extremely well.

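# Residuals vs. fitted values, with a dashed reference line at 0
plot(fitted(heightWeightLM), resid(heightWeightLM), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Q-Q plot of the residuals with a reference line
qqnorm(resid(heightWeightLM))
qqline(resid(heightWeightLM))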

par(mfrow=c(2,2))
plot(heightWeightLM)
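
Judging whether the residuals are uniform across the data can be hard to do by eye with 10,000 points. One rough numeric check, sketched below, is to split the fitted values into equal-count bins and compare the mean and standard deviation of the residuals in each bin. `residual_spread` and the default of four bins are hypothetical choices for illustration, not part of the original analysis.

```r
# Hypothetical helper: mean and spread of residuals within bins of fitted values.
residual_spread <- function(model, bins = 4) {
  res <- resid(model)
  fit <- fitted(model)
  # Bin edges at evenly spaced quantiles so each bin holds roughly the same count
  breaks <- unique(quantile(fit, probs = seq(0, 1, length.out = bins + 1)))
  grp <- cut(fit, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin        = levels(grp),
    n          = as.vector(table(grp)),
    mean_resid = as.vector(tapply(res, grp, mean)),
    sd_resid   = as.vector(tapply(res, grp, sd))
  )
}

# Runs only if the model from the earlier step has been fitted.
if (exists("heightWeightLM")) {
  print(residual_spread(heightWeightLM))
}
```

Bin means far from 0 suggest a trend the line is missing, and standard deviations that change steadily from bin to bin suggest non-constant variance. This is an informal check and does not replace the diagnostic plots.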

Conclusion:

This linear model was a very good fit for the data. The p-values were significant, the standard errors were very small relative to the coefficients, and the residuals were approximately normally distributed, although they were not perfectly uniform across the data. This was expected based on the typical relationship between height and weight, but I needed to verify it with statistical analysis.

References:

Lilja, David J., & Linse, Greta M. (2022). Linear Regression Using R: An Introduction to Data Modeling, 2nd Edition. University of Minnesota Libraries Publishing. Retrieved from the University of Minnesota Digital Conservancy, https://hdl.handle.net/11299/189222.

Stokoe, Nathaniel. (2019). weight-height.csv. GitHub Gist, https://gist.github.com/nstokoe/7d4717e96c21b8ad04ec91f361b000cb.