Data selection

  1. Data description The data I selected was compiled from the 2010 census. It details the access to nationwide access to supermarkets at the level of a total of 72,864 census tracts. I’ll be focusing only on tracts denoted as “urbanized areas”" (containing >50,000 people). Within this subpopulation, the variables I’m including in the model are as follows:
  1. Null and alternate hypotheses
  1. Explanatory vs. forcasting model? The model is explanatory, rather than for forcasting. We are trying to determine whether either proximity (or lack thereof) to a supermarket or percentage of the population living in group housing has any effect on a census tract’s status as “low-income”.

  2. Data cleanup and filtering Because I’m only interested in few of the columns from this many-columned table, I’m just selecting those. I’m also filtering out any census tracts that are not categorized as “urbanized areas”. After I remove any rows for which any of the columns contain NA values, I multiply the percentages by 100 to get more intuitive values to work with. After all this is done, it leaves us with a total of 49,552 rows.

data = read.delim("~/Documents/MSSM/Applied_Regression_Analysis/Project4/food_desert_census.csv", sep=",")
# Keep only the columns of interest
data = data[,c("CensusTract", "County", "State", "Urban", "UATYP10", "LowIncomeTracts", "PCTGQTRS", "lapophalfshare")]
# Restrict to only the urbanized areas
data = data[data$Urban==1,]
data = data[data$UATYP10=="U",]
data = data[complete.cases(data),]
data$PCTGQTRS = data$PCTGQTRS*100
data$lapophalfshare = data$lapophalfshare*100
data = data[which(data$PCTGQTRS < 50),]

Creating the model

  1. Describe how the model is built.
sample_size=311
vector=nrow(data)
set.seed(100)
idx=sample(vector, sample_size, replace=FALSE)
data=data[idx,]
mylogit <- glm(LowIncomeTracts ~ lapophalfshare + PCTGQTRS, data=data, family="binomial")

In order to assess the relevance of each independent variable, we’ll run three chi-squared tests: 1. Using only the first independent variable (lapophalfshare) only to predict LowIncomeTract status:

library(aod)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 2)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 16.2, df = 1, P(> X2) = 5.8e-05
  1. Next, we’ll use only the second independent variable (PCTGQTRS) to predict LowIncomeTract status:
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 9.0, df = 1, P(> X2) = 0.0026
  1. Finally, we’ll use both independent variables together:
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 2:3)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 24.9, df = 2, P(> X2) = 3.9e-06

Residual plots

mylogit.res = resid(mylogit)
mylogit.stdres = rstandard(mylogit)
par(mfrow=c(3,2))
plot(fitted(mylogit), mylogit.stdres, xlab="Fitted values", ylab="Standardized residual", main="(a) Std. residuals vs. fitted values")
abline(0,0,col='red')
hist(mylogit.stdres,breaks=80, xlab="Standardized residual", main="(b) Std. residuals distribution")
boxplot(mylogit.stdres, xlab="LowIncomeStatus ~ lapophalfshare + PCTGQTRS", ylab="Standardized residual", main="(c) Std. residuals boxplot")
qqnorm(mylogit.stdres, xlab="Normal quantiles", ylab="Standardized residual quantiles", main="(d) QQ plot: std. residuals vs. normal")
plot(mylogit.stdres, ylab="Standard Residuals")
par(mfrow=c(1,1))

par(mfrow=c(1,2))
plot(data$lapophalfshare, mylogit.stdres, xlab="Index", ylab="Standardized residual", main="StdResid vs. lapophalfshare", pch=19,cex=0.5)
abline(0,0,col='red')
plot(data$PCTGQTRS, mylogit.stdres, xlab="Index", ylab="Standardized residual", main="StdResid vs. PCTGQTRS", pch=19,cex=0.5)
abline(0,0,col='red')

par(mfrow=c(1,1))

Interpretting the model

Here’s the model summary:

summary(mylogit)
## 
## Call:
## glm(formula = LowIncomeTracts ~ lapophalfshare + PCTGQTRS, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.002  -1.011  -0.854   1.130   1.540  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.564363   0.245053   2.303  0.02128 *  
## lapophalfshare -0.013853   0.003446  -4.021 5.81e-05 ***
## PCTGQTRS        0.108907   0.036225   3.006  0.00264 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 430.59  on 310  degrees of freedom
## Residual deviance: 397.99  on 308  degrees of freedom
## AIC: 403.99
## 
## Number of Fisher Scoring iterations: 4

The coeffecient for each independent variable in the model summary represents the change in \(ln(OR)\) of the low-income tract status associated with a percentage point increase in that independent variable.

Using these coeffecients, we can compute the confidence intervals, as well as the odds ratios associated with each independent variable:

confint(mylogit)
## Waiting for profiling to be done...
##                      2.5 %       97.5 %
## (Intercept)     0.08970080  1.053382630
## lapophalfshare -0.02072471 -0.007189041
## PCTGQTRS        0.04594645  0.189265975
exp(coef(mylogit))
##    (Intercept) lapophalfshare       PCTGQTRS 
##      1.7583275      0.9862424      1.1150590
  1. According to our model parameters:
  1. Using the LINE analysis, which assumptions are satisfied and which are not?
  1. We’ll use the Breusch-Pagan test of heteroscedasticity.
library(lmtest)
bptest(mylogit)
## 
##  studentized Breusch-Pagan test
## 
## data:  mylogit
## BP = 20.064, df = 2, p-value = 4.397e-05
  1. Are we satisfying all eight assumptions?

  2. Which of the four issues, if any, present problems for your analysis?
  1. Are there interaction effects (i.e. moderators) between the independent variables?
mylogit_int <- glm(LowIncomeTracts ~ lapophalfshare * PCTGQTRS, data=data, family="binomial")
summary(mylogit_int)
## 
## Call:
## glm(formula = LowIncomeTracts ~ lapophalfshare * PCTGQTRS, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9930  -1.0114  -0.8562   1.1317   1.5371  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.5548043  0.2598408   2.135 0.032747 *  
## lapophalfshare          -0.0136962  0.0037250  -3.677 0.000236 ***
## PCTGQTRS                 0.1165525  0.0791370   1.473 0.140807    
## lapophalfshare:PCTGQTRS -0.0001228  0.0011177  -0.110 0.912485    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 430.59  on 310  degrees of freedom
## Residual deviance: 397.98  on 307  degrees of freedom
## AIC: 405.98
## 
## Number of Fisher Scoring iterations: 5
  1. What is your interpretation of this explanatory model?
library(ROCR)
predicted = predict(mylogit)
prob <- prediction(predicted, data$LowIncomeTracts)
tprfpr <- performance(prob, "tpr", "fpr")
tpr <- unlist(slot(tprfpr, "y.values"))
fpr <- unlist(slot(tprfpr, "x.values"))
roc <- data.frame(tpr, fpr)
library(ggplot2)
ggplot(roc) + geom_line(aes(x = fpr, y = tpr)) + geom_abline(intercept = 0, slope = 1, colour = "gray") + ylab("Sensitivity") + xlab("1 - Specificity")

slot(performance(prob,"auc"),"y.values")[[1]]
## [1] 0.6630002