Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I chose the “Height of Fall and Injury Severity in Singapore 1991-1992” data set from the University of Florida stat site: http://www.stat.ufl.edu/~winner/data/singapore_falls.dat. The data dictionary/descriptions can be found at this link: http://www.stat.ufl.edu/~winner/data/singapore_falls.txt.

I took the data set and put it into a .csv on my desktop called “sg_falls.csv”. The data set contains a frequency column which needs to be exploded out (if frequency is 9, the first two columns should be turned into 9 rows) to use it for regression, which I did in R. “df_sg_falls”" is the final data set.

# bring in raw data
df <- read.csv("C:/Users/Lelan/Desktop/SG_falls.csv", col.names = c("severity", "height", "count"))
# explode rows so that there is one row per event instead of
# a single row and a frequency for the data points on tnat row
df_sg_falls <- data.frame(severity=as.integer(integer()), height=as.integer(integer()))
for(i in 1:nrow(df)) {
  row <- df[i,1:2]
  n.times = df[i,3]
  df_temp <- row[rep(1, n.times),]
  df_sg_falls <- rbind(df_sg_falls, df_temp)
  }

I’m modeling the data using height as the independent variable and severity of injury as the dependent variable.

# simple linear reqression model
severity <- df_sg_falls[,1]
height <- df_sg_falls[,2]
severity.lm <- lm(severity ~ height, data=df_sg_falls)
# scatterplot
install.packages("hexbin", repos='https://mirrors.nics.utk.edu/cran/')
## Installing package into 'C:/Users/Lelan/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'hexbin' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Lelan\AppData\Local\Temp\RtmpmS0Pba\downloaded_packages
library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.2
height_category <- height
severity_category <- severity
bin <- hexbin(height_category, severity_category, xbins=50) 
plot(bin, main="Singapore Falls 1991-2")

# display summary
summary(severity.lm) 
## 
## Call:
## lm(formula = severity ~ height, data = df_sg_falls)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3485 -0.9498  0.2529  1.0502  1.8474 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.75399    0.14996  25.033   <2e-16 ***
## height       0.39862    0.04299   9.271   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.158 on 414 degrees of freedom
## Multiple R-squared:  0.1719, Adjusted R-squared:  0.1699 
## F-statistic: 85.96 on 1 and 414 DF,  p-value: < 2.2e-16

The p-values suggest that there is a significant relationship between the height and severity in the linear regression model.

Next, I plotted the residuals,

# get the residuals
severity.res = resid(severity.lm)
# plot the residuals
plot(df_sg_falls$height, severity.res, ylab="Residuals", xlab="Height", main="Residual Plot") 
abline(0, 0) 

The residual plot seems fairly random. It isn’t a U or an inverted U or any other non-random shape.

Lastly, I plotted standardized residuals (Q-Q plot).

# get standardized residuals
severity.stdres = rstandard(severity.lm)
qqnorm(severity.stdres, ylab="Standardized Residuals", xlab="Normal Scores") 
qqline(severity.stdres)

The plot suggests normality with perhaps some left skew.