The data used for this investigation comes from UCLA’s Institute for Digital Research and Education. It was collected by statisticians Alan Agresti and Barbara Finlay in 1997. To access this data, we can use the read.dta function from the foreign package.
library(foreign)  # read.dta for Stata files lives in the foreign package
crimeData <- read.dta("https://stats.idre.ucla.edu/stat/data/crime.dta")
head(crimeData)
## sid state crime murder pctmetro pctwhite pcths poverty single
## 1 1 ak 761 9.0 41.8 75.2 86.6 9.1 14.3
## 2 2 al 780 11.6 67.4 73.5 66.9 17.4 11.5
## 3 3 ar 593 10.2 44.7 82.9 66.3 20.0 10.7
## 4 4 az 715 8.6 84.7 88.6 78.7 15.4 12.1
## 5 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5
## 6 6 co 567 5.8 81.8 92.5 84.4 9.9 12.1
This data set gives the violent crime rate for all 50 states as well as Washington DC, along with characteristics of the states such as murder rates, the percent of people with a high school education, and the percent of people below the poverty line. For the investigation, we will use crime as the response variable. The crime variable measures how many violent crimes occur per 100,000 people in the given area. Looking at the first observational unit in our table, Alaska, we see that it has a crime value of 761, which means 761 violent crimes were committed per 100,000 people in Alaska. For the explanatory variable, we will use pctmetro, which gives the percent of people that live in a metropolitan area.
Research Question: How does the percent of the population living in metropolitan areas impact violent crime rates and how can we create a model for the data without excluding any observational units?
While the first part of the question could be answered in a variety of ways, such as with simple linear regression, we can address both parts by using robust regression, a statistical modeling technique that is less influenced by outliers.
When to use Robust Regression: Robust regression analysis (specifically, iteratively reweighted least squares robust regression) will make the most sense if we address when to use it before we address how to use it. Robust regression is useful when we are working with data for which the equal variance condition is not satisfied, or when there are outliers that distort the fit. With linear regression, we fit the line by minimizing the sum of squared residuals. Because squaring magnifies large residuals, an outlier can pull the line of best fit toward itself and thus has far more impact than any other point. Robust regression lessens the weight of outliers in order to fit a better line while still including all data points.
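As a tiny illustration of that pull, consider some made-up data (simulated here, not the crime data):
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)  # clean linear trend with slope 2
coef(lm(y ~ x))         # slope estimate close to 2
y[20] <- 200            # plant a single extreme outlier
coef(lm(y ~ x))         # slope dragged well above 2 by one point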
Let’s now take a look at a plot of our data that shows the violent crime rates based on the percent of people living in a metropolitan area.
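A minimal sketch of that plot, assuming the ggplot2 package (which we also use further down):
library(ggplot2)
# Violent crime rate against percent metropolitan population;
# the extreme point at the top of the plot is Washington DC
ggplot(crimeData, aes(x = pctmetro, y = crime)) +
  geom_point()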
Looking at this plot, it becomes clear that our data has an obvious outlier. Washington DC has a violent crime rate of almost 3000 per 100,000 people, compared to the next highest crime rate of 1206. Thus, it would be appropriate to use robust regression to model violent crime rates.
How is Robust Regression different? When using linear regression, we calculate the line of best fit by minimizing the sum of squared residuals. Robust regression uses a very similar method, but each squared residual is multiplied by a weight that gives larger residuals less influence.
For linear regression, the quantity we minimize for the line of best fit is \[\sum^{n}_{i=1}(y_i-\hat{y}_i)^2\] where \(y_i\) is the observed value for the \(i\)th observation and \(\hat{y}_i\) is the corresponding predicted value.
For iteratively reweighted least squares (IRLS) robust regression, the quantity we minimize is \[\sum^{n}_{i=1}w_i(y_i-\hat{y}_i)^2.\] Notice that this is essentially the same expression with a weight \(w_i\) added. Different forms of robust regression produce the weight \(w_i\) in different ways, but in each case the weight depends on the size of the residual: when the residual is large, the weight is small, so outliers with large residuals have less impact on the line we fit for our model. The "iteratively reweighted" part of the name describes how the fit is computed: fit a weighted least squares line, recompute the weights from the new residuals, and repeat until the estimates stabilize. Overall, it is simplest to think of this technique as linear regression with a line-fitting method that is less swayed by outliers or unequal variance.
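To make the iteration concrete, here is a bare-bones IRLS sketch using Huber's weights. This is illustrative only; the rlm function we use below handles the scale estimate and convergence checking more carefully.
huberWeight <- function(e, k = 1.345) ifelse(abs(e) <= k, 1, k / abs(e))
fit <- lm(crime ~ pctmetro, data = crimeData)  # start from ordinary least squares
for (i in 1:25) {
  s <- mad(resid(fit))               # robust estimate of the residual scale
  w <- huberWeight(resid(fit) / s)   # large residuals get small weights
  fit <- lm(crime ~ pctmetro, data = crimeData, weights = w)
}
coef(fit)  # should land near the Huber rlm fit reported below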
Forms of IRLS robust regression: Two common forms of IRLS robust regression use Huber's weight function and the bisquare weight function.
knitr::include_graphics("RR.png")
The two graphs above show how the weights are calculated using Huber's weight function and the bisquare weight function. These graphs show the size of a residual on the x-axis and the corresponding weight \(w\) on the y-axis. Both plots are centered at 0, which means that a point with a residual of 0 is weighted the most, and the more extreme a residual gets, the less weight it has when fitting the model. Since outliers have large residuals, they appear far from the center on these graphs. Thus we can see how they receive a smaller weight and influence our robust regression model much less than they would a linear regression model.
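We can also draw these weight curves ourselves. The cutoffs below, k = 1.345 for Huber and c = 4.685 for bisquare, are the defaults rlm applies to standardized residuals.
# Weight assigned to a scaled residual e under each scheme
huberWeight    <- function(e, k = 1.345) ifelse(abs(e) <= k, 1, k / abs(e))
bisquareWeight <- function(e, c = 4.685) ifelse(abs(e) <= c, (1 - (e / c)^2)^2, 0)
e <- seq(-8, 8, by = 0.01)
plot(e, huberWeight(e), type = "l", xlab = "scaled residual", ylab = "weight")
lines(e, bisquareWeight(e), lty = 2)
legend("topright", legend = c("Huber", "bisquare"), lty = c(1, 2))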
Now that we have two weight functions for IRLS robust regression, Huber's and the bisquare, we can apply robust regression to our crime data.
Theoretical Model: \(Crime=\beta_0+\beta_1pctMetro+\epsilon_i\) where \(\epsilon_i \sim NI(0,\sigma_i^2)\). This model is nearly identical to the one we would use for linear regression: both modeling techniques require the residuals to follow a normal distribution and be independent. However, our model conditions now have the term \(\sigma_i\) rather than just \(\sigma\), which means that the variance can differ between residuals. For linear regression this would violate the equal variance (homogeneity) condition, but we do not have to worry about equal variance when using robust regression.
Fitting models with both Huber’s method and Bisquare method as well as a simple linear regression model yields the following:
library(MASS)  # rlm for robust regression lives in the MASS package
summary(rr.huber <- rlm(crime ~ pctmetro, data = crimeData))
##
## Call: rlm(formula = crime ~ pctmetro, data = crimeData)
## Residuals:
## Min 1Q Median 3Q Max
## -401.40 -183.27 -11.03 182.58 2062.96
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -10.2964 128.0393 -0.0804
## pctmetro 8.6934 1.8082 4.8078
##
## Residual standard error: 278.3 on 49 degrees of freedom
rr.bisquare <- rlm(crime ~ pctmetro, data = crimeData, psi = psi.bisquare)  # bisquare weights instead of the default Huber
summary(rr.bisquare)
##
## Call: rlm(formula = crime ~ pctmetro, data = crimeData, psi = psi.bisquare)
## Residuals:
## Min 1Q Median 3Q Max
## -381.799 -180.491 4.117 180.937 2085.767
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 16.9520 119.1010 0.1423
## pctmetro 8.1928 1.6820 4.8710
##
## Residual standard error: 269.8 on 49 degrees of freedom
linearCrimeModel <- lm(crime ~ pctmetro, data = crimeData)  # ordinary least squares for comparison
summary(linearCrimeModel)
##
## Call:
## lm(formula = crime ~ pctmetro, data = crimeData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -497.30 -226.15 -24.61 133.02 1952.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -123.683 170.511 -0.725 0.472
## pctmetro 10.929 2.408 4.539 3.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 373.9 on 49 degrees of freedom
## Multiple R-squared: 0.296, Adjusted R-squared: 0.2816
## F-statistic: 20.6 on 1 and 49 DF, p-value: 3.685e-05
Fitted Models:
IRLS Robust Regression with Huber’s Weight Function: \(\widehat{Crime}=-10.30+8.69pctMetro\)
IRLS Robust Regression with Bisquare Weight Function: \(\widehat{Crime}=16.95+8.19pctMetro\)
Simple Linear Regression: \(\widehat{Crime}=-123.68+10.93pctMetro\)
Based on our fitted model equations, we can see that simple linear regression, which minimizes the sum of squared residuals, produces a noticeably higher slope of 10.93 because its line is pulled up toward the outlier. Both methods of IRLS robust regression produce a slope of about 8, which better fits the majority of the data rather than adjusting for the Washington DC outlier. The robust regression methods have slightly different intercepts but still produce very similar lines to each other, especially considering how large the scale on the y-axis is. Now that we have our fitted models, we can look at a plot of these lines.
library(ggplot2)
ggplot(data = crimeData, aes(x = pctmetro, y = crime)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = '#56B4E9') +
  geom_smooth(method = "rlm", method.args = list(psi = psi.bisquare),
              se = FALSE, color = '#D55E00')
This plot shows our line of best fit from simple linear regression compared to the line from robust regression using the bisquare weight function. The steeper line belongs to the simple linear regression model; the robust regression line has a shallower slope and therefore does a better job of fitting the bottom half of the plot, where the majority of the data lies. Thus, our robust regression model would be better than the simple linear regression model for predicting crime based on the percent of people that live in a metropolitan area.
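To put numbers on that difference, we can compare predictions from the three fitted models at a hypothetical value, say pctmetro = 60 (an arbitrary choice for illustration):
newState <- data.frame(pctmetro = 60)  # hypothetical state, 60% metropolitan
predict(rr.huber, newState)            # Huber robust fit
predict(rr.bisquare, newState)         # bisquare robust fit
predict(linearCrimeModel, newState)    # least-squares fit
Plugging pctmetro = 60 into the fitted equations gives roughly 511, 508, and 532 violent crimes per 100,000 people respectively, so the least-squares model predicts noticeably higher rates.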
One way that we could check the effectiveness of our robust regression models is to remove the outlier from the data set and use simple linear regression. Then we could compare this fitted model to our robust regression models to see whether robust regression was useful in creating an accurate model without excluding any observational units.
noWashDCcrimeData <- crimeData[-51, ]  # row 51 is the Washington DC observation
noDClinearModel <- lm(crime ~ pctmetro, data = noWashDCcrimeData)
summary(noDClinearModel)
##
## Call:
## lm(formula = crime ~ pctmetro, data = noWashDCcrimeData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -382.45 -182.81 1.09 163.98 450.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.558 111.117 0.230 0.819
## pctmetro 8.108 1.585 5.115 5.44e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240.5 on 48 degrees of freedom
## Multiple R-squared: 0.3528, Adjusted R-squared: 0.3393
## F-statistic: 26.16 on 1 and 48 DF, p-value: 5.44e-06
Fitting a simple linear regression model without the Washington DC data point, we get the fitted model \(\widehat{Crime}=25.56+8.11pctMetro\). This slope of 8.11 is very similar to the slopes obtained from robust regression, 8.19 and 8.69. Thus it becomes clear that our fitted models using robust regression, which kept every observation, were just as effective as removing the outlying data point and using linear regression.
Conclusion: We evaluate our model and its predictors exactly as we would with simple linear regression. We are testing the null hypothesis that \(\beta_1\) is equal to 0, but we encounter the problem that R does not give us a p-value for the robust regression methods. However, it does give us a t-value of 4.81 with 49 degrees of freedom, from which we find that the coefficient for our percent metro variable yields a p-value of .000015, or approximately 0. This holds for Huber's weight function as well as the bisquare weight function, so our conclusion applies to both. Since the p-value is approximately 0, the probability of observing a t-value of 4.81 or greater (or -4.81 or less) is approximately 0. Thus we have sufficient evidence to reject the null hypothesis in favor of the alternative that \(\beta_1\) is not equal to 0: the percent of people living in a metropolitan area is a significant predictor of a state's violent crime rate. While this is the same conclusion we would have reached with simple linear regression, and is not surprising given the plot, the models do differ when it comes to predicting values. Because the simple linear regression model has a slope of about 10 while robust regression yields a slope of about 8, the models predict different crime rates for a given percent of people living in metropolitan areas. Because the robust method was not as influenced by the Washington DC observational unit as the simple linear regression model was, the IRLS robust regression model would be more useful for predicting violent crime rates; we know this because the simple linear regression model that excluded Washington DC closely matched the robust regression models.
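Rather than an external p-value calculator, the same two-sided p-value can be computed directly in R:
2 * pt(4.81, df = 49, lower.tail = FALSE)  # roughly 1.5e-05, the .000015 quoted above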
Discussion and Critique: With this project, I was looking at the relationship between the percent of the population in metropolitan areas and violent crime rates, and although I found a significant relationship between these two variables, the bigger takeaway from this study is how we can fit the best model without simply removing outliers. We are able to identify outliers through studentized and standardized residuals, and we could then remove those points so they don't impact our line of best fit. With robust regression, a clear advantage is that we don't have to worry about these outliers because their impact on the line of best fit is weighted appropriately based on the size of the residual. Considering that robust regression is not commonly taught, that it's hard to find good sources about it, and that there aren't many resources in R for it, there must be some disadvantages to this modeling technique. I'm not surprised that before the statistical software we have now this method was not taught, because iteratively refitting the line after reweighting each residual is computationally demanding. Another possibility for why robust regression isn't seen as frequently is how simple it is to remove outliers from a data set and then evaluate the data. Looking back at the data I used, if someone was trying to predict crime rates, there's no reason not to remove Washington DC and then use least squares and simple linear regression to get a model. Overall, robust regression seems like an efficient and simple way to model data with outliers or unequal error variance without having to manipulate the data.