Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I chose the “Height of Fall and Injury Severity in Singapore 1991-1992” data set from the University of Florida stat site: http://www.stat.ufl.edu/~winner/data/singapore_falls.dat. The data dictionary/descriptions can be found at this link: http://www.stat.ufl.edu/~winner/data/singapore_falls.txt.
I took the data set and put it into a .csv on my desktop called “sg_falls.csv”. The data set contains a frequency column which needs to be exploded out (if frequency is 9, the first two columns should be turned into 9 rows) to use it for regression, which I did in R. “df_sg_falls”" is the final data set.
# bring in raw data
df <- read.csv("C:/Users/Lelan/Desktop/SG_falls.csv", col.names = c("severity", "height", "count"))
# explode rows so that there is one row per event instead of
# a single row and a frequency for the data points on tnat row
df_sg_falls <- data.frame(severity=as.integer(integer()), height=as.integer(integer()))
for(i in 1:nrow(df)) {
row <- df[i,1:2]
n.times = df[i,3]
df_temp <- row[rep(1, n.times),]
df_sg_falls <- rbind(df_sg_falls, df_temp)
}
I’m modeling the data using height as the independent variable and severity of injury as the dependent variable.
# simple linear reqression model
severity <- df_sg_falls[,1]
height <- df_sg_falls[,2]
severity.lm <- lm(severity ~ height, data=df_sg_falls)
# scatterplot
install.packages("hexbin", repos='https://mirrors.nics.utk.edu/cran/')
## Installing package into 'C:/Users/Lelan/Documents/R/win-library/3.4'
## (as 'lib' is unspecified)
## package 'hexbin' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Lelan\AppData\Local\Temp\RtmpmS0Pba\downloaded_packages
library(hexbin)
## Warning: package 'hexbin' was built under R version 3.4.2
height_category <- height
severity_category <- severity
bin <- hexbin(height_category, severity_category, xbins=50)
plot(bin, main="Singapore Falls 1991-2")
# display summary
summary(severity.lm)
##
## Call:
## lm(formula = severity ~ height, data = df_sg_falls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3485 -0.9498 0.2529 1.0502 1.8474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.75399 0.14996 25.033 <2e-16 ***
## height 0.39862 0.04299 9.271 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.158 on 414 degrees of freedom
## Multiple R-squared: 0.1719, Adjusted R-squared: 0.1699
## F-statistic: 85.96 on 1 and 414 DF, p-value: < 2.2e-16
The p-values suggest that there is a significant relationship between the height and severity in the linear regression model.
Next, I plotted the residuals,
# get the residuals
severity.res = resid(severity.lm)
# plot the residuals
plot(df_sg_falls$height, severity.res, ylab="Residuals", xlab="Height", main="Residual Plot")
abline(0, 0)
The residual plot seems fairly random. It isn’t a U or an inverted U or any other non-random shape.
Lastly, I plotted standardized residuals (Q-Q plot).
# get standardized residuals
severity.stdres = rstandard(severity.lm)
qqnorm(severity.stdres, ylab="Standardized Residuals", xlab="Normal Scores")
qqline(severity.stdres)
The plot suggests normality with perhaps some left skew.