Data selection
- Data description The data I selected was compiled from the 2010 census. It details the access to nationwide access to supermarkets at the level of a total of 72,864 census tracts. I’ll be focusing only on tracts denoted as “urbanized areas”" (containing >50,000 people). Within this subpopulation, the variables I’m including in the model are as follows:
- lapophalfshare: Percentage of the tract population that lives within 1/2 mile of a supermarket (continuous)
- PCTGQTRS: Percentage of the tract population that lives in a group housing structure (continuous)
- LowIncomeTracts Whether the tract is designated as “low-income” (categorical; 0=no, 1=yes)
- Null and alternate hypotheses
- \(H_0\) = The low-income classification of a census tract is unrelated to either its constituents’ proximity to a supermarket or how many of them reside in group housing complexes.
- \(H_1\) = The low-income classification is related to BOTH the supermarket proximity and the percentage of people living in group housing complexes.
Explanatory vs. forcasting model? The model is explanatory, rather than for forcasting. We are trying to determine whether either proximity (or lack thereof) to a supermarket or percentage of the population living in group housing has any effect on a census tract’s status as “low-income”.
Data cleanup and filtering Because I’m only interested in few of the columns from this many-columned table, I’m just selecting those. I’m also filtering out any census tracts that are not categorized as “urbanized areas”. After I remove any rows for which any of the columns contain NA values, I multiply the percentages by 100 to get more intuitive values to work with. After all this is done, it leaves us with a total of 49,552 rows.
data = read.delim("~/Documents/MSSM/Applied_Regression_Analysis/Project4/food_desert_census.csv", sep=",")
# Keep only the columns of interest
data = data[,c("CensusTract", "County", "State", "Urban", "UATYP10", "LowIncomeTracts", "PCTGQTRS", "lapophalfshare")]
# Restrict to only the urbanized areas
data = data[data$Urban==1,]
data = data[data$UATYP10=="U",]
data = data[complete.cases(data),]
data$PCTGQTRS = data$PCTGQTRS*100
data$lapophalfshare = data$lapophalfshare*100
- When I look at the PCTGQTRS values, there are a few outliers where the percentage of people living in group housing is well over 70%. However, the vast majority of data points have this number at <50%. So I’m going to remove these outliers prior to proceeding with the analysis.
data = data[which(data$PCTGQTRS < 50),]
Creating the model
- Describe how the model is built.
- First, I suspect that I have way too many data points (~50,000). So I’m going to randomly downsample to 311 data points (see the power analysis below) and proceed with the analysis.
sample_size=311
vector=nrow(data)
set.seed(100)
idx=sample(vector, sample_size, replace=FALSE)
data=data[idx,]
- Next, I’ll use the GLM function to create the generalized linear model using the two continuous independent variables (lapophalfshare & PCTGQTRS) and the single categorical response variable (LowIncomeTracts):
mylogit <- glm(LowIncomeTracts ~ lapophalfshare + PCTGQTRS, data=data, family="binomial")
In order to assess the relevance of each independent variable, we’ll run three chi-squared tests: 1. Using only the first independent variable (lapophalfshare) only to predict LowIncomeTract status:
library(aod)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 2)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 16.2, df = 1, P(> X2) = 5.8e-05
- This gives us a \(X^2\) of 16.2 with one degree of freedom, corresponding to a P-value=5.8e-05, meaning that the contribution of lapophalfshare is indeed significant to the quality of the model fit.
- Next, we’ll use only the second independent variable (PCTGQTRS) to predict LowIncomeTract status:
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 9.0, df = 1, P(> X2) = 0.0026
- This gives us a \(X^2\) of 9.0 with one degree of freedom, corresponding to a P-value=0.0026, so the contribution of PCTGQTRS is less significant to the model fit than lapophalfshare.
- Finally, we’ll use both independent variables together:
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 2:3)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 24.9, df = 2, P(> X2) = 3.9e-06
- This gives us a \(X^2\) of 24.9 with two degrees of freedom, corresponding to a P-value=3.9e-06, meaning that the combination of both independent variables is better than either one individually.
Residual plots
mylogit.res = resid(mylogit)
mylogit.stdres = rstandard(mylogit)
par(mfrow=c(3,2))
plot(fitted(mylogit), mylogit.stdres, xlab="Fitted values", ylab="Standardized residual", main="(a) Std. residuals vs. fitted values")
abline(0,0,col='red')
hist(mylogit.stdres,breaks=80, xlab="Standardized residual", main="(b) Std. residuals distribution")
boxplot(mylogit.stdres, xlab="LowIncomeStatus ~ lapophalfshare + PCTGQTRS", ylab="Standardized residual", main="(c) Std. residuals boxplot")
qqnorm(mylogit.stdres, xlab="Normal quantiles", ylab="Standardized residual quantiles", main="(d) QQ plot: std. residuals vs. normal")
plot(mylogit.stdres, ylab="Standard Residuals")
par(mfrow=c(1,1))

par(mfrow=c(1,2))
plot(data$lapophalfshare, mylogit.stdres, xlab="Index", ylab="Standardized residual", main="StdResid vs. lapophalfshare", pch=19,cex=0.5)
abline(0,0,col='red')
plot(data$PCTGQTRS, mylogit.stdres, xlab="Index", ylab="Standardized residual", main="StdResid vs. PCTGQTRS", pch=19,cex=0.5)
abline(0,0,col='red')

par(mfrow=c(1,1))
Interpretting the model
Here’s the model summary:
summary(mylogit)
##
## Call:
## glm(formula = LowIncomeTracts ~ lapophalfshare + PCTGQTRS, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.002 -1.011 -0.854 1.130 1.540
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.564363 0.245053 2.303 0.02128 *
## lapophalfshare -0.013853 0.003446 -4.021 5.81e-05 ***
## PCTGQTRS 0.108907 0.036225 3.006 0.00264 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 430.59 on 310 degrees of freedom
## Residual deviance: 397.99 on 308 degrees of freedom
## AIC: 403.99
##
## Number of Fisher Scoring iterations: 4
The coeffecient for each independent variable in the model summary represents the change in \(ln(OR)\) of the low-income tract status associated with a percentage point increase in that independent variable.
Using these coeffecients, we can compute the confidence intervals, as well as the odds ratios associated with each independent variable:
confint(mylogit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.08970080 1.053382630
## lapophalfshare -0.02072471 -0.007189041
## PCTGQTRS 0.04594645 0.189265975
exp(coef(mylogit))
## (Intercept) lapophalfshare PCTGQTRS
## 1.7583275 0.9862424 1.1150590
- According to our model parameters:
- \(b_0\): The intercept value is 0.564, which is the \(ln(OR)\) likelihood that a census tract will be classified as low-income if zero of its residents lived within 1/2 mile of a supermarket AND zero of its residents lived in group housing. This translates to an odds ratio of 1.76.
- \(b_1\): The value associated with the variable lapophalfshare is -0.014, which is the \(ln(OR)\) increase in likelihood of having low-income status with a single percentage point increase in the amount of the tract population living within 1/2 mile of a supermarket. This translates to an odds ratio of 0.986.
- \(b_2\): The value associated with the variable PCTGQTRS is 0.11, which is the \(ln(OR)\) increase in likelihood of having low-income status with a single percentage point increase in amount of the tract population living in group housing. This translates to an odds ratio of 1.115.
- Using the LINE analysis, which assumptions are satisfied and which are not?
- Linearity
- Based on the plot of the standardized residuals vs. fitted values (above), we don’t seem to have linearity of the residuals. If it were linear, we would expect a uniform distribution centered around zero. Instead, what we see are these two separate groups of residuals (corresponding to our two categorical outcomes). Fortunately, logistic linear regression does NOT require a linear relationship between the independent and dependent variables.
- Independence of errors
- The scatter plot of the residuals (above) does not reveal any dependence of the error on the index, mostly following a uniform distribution. So the errors seem independent.
- Normality of error
- We do not have a normal distribution of the residuals in this case. We actually have what looks like two distinct distributions, both of which skew heavily to the right. So this assumption is certainly not satisfied (but that’s OK for logistic regression).
- Equal variance
- There isn’t really visible heteroscedasticity, but I’ll check using the Breush-Pagan test next.
- We’ll use the Breusch-Pagan test of heteroscedasticity.
library(lmtest)
bptest(mylogit)
##
## studentized Breusch-Pagan test
##
## data: mylogit
## BP = 20.064, df = 2, p-value = 4.397e-05
- The Breusch-Pagan test gives a p-value = 4.397e-05, meaning that there is significant heteroscedasticity. Fortunately, this is acceptable for a logistic regression model.
Are we satisfying all eight assumptions?
- Which of the four issues, if any, present problems for your analysis?
- Causal relationship? Neither the proximity of residents to a supermarket or the percentage of residents that live in a group domicile is likely to be causal for low-income status of a census tract. What is more likely is that the low-income status of the census tract contributes a probabalistic causality to the grocery store proximity and group domicile share. This is not deterministic causality because some low-income tracts will contain many grocery stores and it is entirely possible to have a low-income population that lives in single family homes.
- Sample size too large/small? My original sample size was too large (49,552 rows), which led to extremely high values of \(X^2\) in the Chi-squared test of the coefficient contributions. So I used G*power to determine the required sample size for observing an effect size of 0.02 at an \(alpha\)=0.05, \(power\)=0.8, using two predictors. The statistical test I’m using is “Linear multiple regression: Fixed model, single regression coefficient”. Using these parameters, the required sample size is 311. So I’ve downsampled to 311 entries for the analysis.
- Colinearity among independent variables? The correlation value of my two independent variables is only -0.088, so I don’t think colinearity is an issue.
- Measurement error? These data are based on 2010 census responses, so there is likely a bit of measurement error involved. The most likely variable to contain error is the percentage of residents that live within 1/2 mile of a supermarket. It’s not clear how they define a supermarket and it’s also not clear whether the data is based on self-reported proximity (very susceptible to error) or integration with separate GIS information (less prone to error).
- Are there interaction effects (i.e. moderators) between the independent variables?
mylogit_int <- glm(LowIncomeTracts ~ lapophalfshare * PCTGQTRS, data=data, family="binomial")
summary(mylogit_int)
##
## Call:
## glm(formula = LowIncomeTracts ~ lapophalfshare * PCTGQTRS, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9930 -1.0114 -0.8562 1.1317 1.5371
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5548043 0.2598408 2.135 0.032747 *
## lapophalfshare -0.0136962 0.0037250 -3.677 0.000236 ***
## PCTGQTRS 0.1165525 0.0791370 1.473 0.140807
## lapophalfshare:PCTGQTRS -0.0001228 0.0011177 -0.110 0.912485
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 430.59 on 310 degrees of freedom
## Residual deviance: 397.98 on 307 degrees of freedom
## AIC: 405.98
##
## Number of Fisher Scoring iterations: 5
- When I add an interaction between the two independent variables to the model, the summary above shows that (1) the interaction is not significant (p-value = 0.912) and (2) the PCTGQTRS contribution to the model is less significant than in the model without the interaction. So the model seems to fit better without the interaction term, meaning that I likely don’t have any moderators.
- What is your interpretation of this explanatory model?
- Given the odds ratios generated for each independent variable, we can draw the following conclusions from this explanatory model:
- Both the 1/2 mile proximity to supermarkets and percentage of residents in group domiciles are significant predictors of a census tract’s low-income designation status.
- According to the model, if 40% of the tract residents live in group housing and only 10% live within 1/2 mile of a supermarket, the odds ratio that the tract will be classified as low-income is 119.4. But if only 20% live in group housing and 50% live within 1/2 mile of a supermarket, the odds ratio that the tract is low-income plummets to 7.77.
- Finally, if we wanted to use this model to try to predict low-income status based on these two independent variables, the following ROC curve suggests that it would not do terribly well (but better than random! AUC = 0.663):
library(ROCR)
predicted = predict(mylogit)
prob <- prediction(predicted, data$LowIncomeTracts)
tprfpr <- performance(prob, "tpr", "fpr")
tpr <- unlist(slot(tprfpr, "y.values"))
fpr <- unlist(slot(tprfpr, "x.values"))
roc <- data.frame(tpr, fpr)
library(ggplot2)
ggplot(roc) + geom_line(aes(x = fpr, y = tpr)) + geom_abline(intercept = 0, slope = 1, colour = "gray") + ylab("Sensitivity") + xlab("1 - Specificity")

slot(performance(prob,"auc"),"y.values")[[1]]
## [1] 0.6630002