Project 4: Logistic Regression


Food Environment Atlas

Jane Braun

RPI

May 7th - Initial Draft


1. Select

US Government’s Open Data

The dataset I chose is called the Food Environment Atlas. http://catalog.data.gov/dataset/food-environment-atlas-f4a22

Reading in dataset

x <- read.csv("~/Spring2015/Applied Regression/project4.csv")
colnames(x) <- c("County","Persistent_Poverty","Grocery","Supercenter","PreschoolObesity","AdultObesity","65PlusPercent","18MinusPercent","ReducedLunch","NoCarAccess","LowIncomeAccess")
x$Persistent_Poverty<- as.numeric(x$Persistent_Poverty)
x <- x[complete.cases(x),]

This dataset is quite large. Therefore, it is possible that the large amount of data is making the results of the analysis significant. For this reason, the data was subsetted down to a sample size calculated later on in the power analysis section.

N <- 485
M <- nrow(x)
set.seed(99)
indicies <- sample(M,N, replace = FALSE)
x <- x[indicies,]

Variables of Interest

Because the initial dataset was extremely large (over 212 variables), it was subsetted down prior to being imported into R Studio. Therefore, there are 11 variables of interest in this dataset.

The goal of this project is to complete a logistic regression. It will include two continuous independent variables, along with a single categorical dependent variable.

str(x)
## 'data.frame':    485 obs. of  11 variables:
##  $ County            : Factor w/ 1941 levels "Abbeville","Acadia",..: 1717 1089 227 1386 1176 1176 1814 1316 951 1754 ...
##  $ Persistent_Poverty: num  0 0 0 0 0 1 0 0 1 1 ...
##  $ Grocery           : num  0.775 0.197 0.134 0.272 0.229 ...
##  $ Supercenter       : num  0 0.0203 0 0.0227 0 ...
##  $ PreschoolObesity  : num  18 17.3 8.8 10.5 15.4 12.8 16.2 22.1 16.7 13.4 ...
##  $ AdultObesity      : num  27.8 20.1 36.4 32 32.1 31.1 24.8 32.2 36.6 30.9 ...
##  $ 65PlusPercent     : num  25.8 27.3 14.4 16 18.6 ...
##  $ 18MinusPercent    : num  21.4 17.6 24.6 23.8 23.3 ...
##  $ ReducedLunch      : num  8.31 4.33 8.59 9.82 10.59 ...
##  $ NoCarAccess       : num  1.91 1.53 4.03 1.93 2.34 ...
##  $ LowIncomeAccess   : num  8.814 7.373 4.274 0.729 1.328 ...

As seen above, the subsetted data set includes 485 observations (2,700 originally), with 11 variables. There is one factor variable, the US county. There is one binary variable, whether or not the county is a persistent poverty county. The rest of the variables are numeric/continuous variables.

Independent Variables

We will be looking at all of the numeric variables available in the dataset for the independent variables. For this dataset, these include: grocery stores per 1000 people, supercenters per 1000, the preschool obesity rate, the adult obesity rate, percentage of population >65, percentage of population < 18, the percentage of children in reduced price lunch programs, percentage of households without access to cars, and the percentage with low access to stores.

For this analysis, the independent variables of interest will be the adult obesity rate (in percent) for each county, and percent of households that have access to a car (also in percent).

Dependent Variable

For this analysis, the dependent variable needed to be categorial. In this case, the dependent variable selected was whether or not the county is described as a persistent poverty county.

2. Describe

Selected Dataset

Located at data.gov, the US government has a multitude of free, large datasets under different categories. The dataset selected from this source is called the Food Environment Atlas. This dataset focuses on different elements of the food environment in various United States counties. It includes factors such as how close food stores and restaurants are, socioeconomic conditions of the counties, and poverty assistance programs available.

The dataset was very large, and therefore was subsetted down to only include some of the variables.

Independent and Dependent Variables

The independent variables are adult obesity rate (in percent) for each county, and percent of households that have access to a car (also in percent). Adult obesity rate is fairly self-explanatory - it is the percentage of the adult population that can be considered to be obese or very overweight. The “NoCarAccess” independent variable is the percent of households within the county which have no access to a car.

The dependent variable in the dataset did not need to be created. It was a binary variable that is part of the original dataset. The variable determines whether or not the given county is described as a “persistent poverty county”. Because this was binary in the original dataset, no manipulation or segmentation was necessary.

Hypotheses

The null hypothesis states that the variation in the dependent variable, persistent poverty county, cannot be explained by anything other than randomization.

The alternative hypothesis is that something other than randomization can help to explain the variation in the dependent variable - in this case, those are the adult obesity rate and the percentage with acces to a car.

Model

The model for this analysis will be an explanatory multiple logistic regression model. This model is explanatory, rather than predictive, because we are not trying to use certain factors in order to predict whether or not a county will be labeled as “persistent poverty” in the future. Instead, we are trying to demonstrate if certain factors explain why a particular county may have persistent poverty compared to others.

3. Model

Model Creation

For this model, I used a hierarchical method to build the model. This is because for a previous class, I had to complete a lot of reading on impoverished areas of the US. Therefore, I believe that I know enough theoretically in order to provide some reasoning behind when to add various factors.

I chose the first independent variable - no car access - because automobiles can have a significant impact on whether or not an individual can maintain employment. If they do not have access to a car, it becomes more difficult for them to commute. Even if they are qualified for a certain job, they will not be able to work there unless they can get to the place of work.

The second independent variable that I was interested in was adult obesity. There have been fairly significant ties to obesity and poverty, so I was interested if this would be demonstrated in the analysis.

Logistic regression is a form of a general linear model. Therefore, the function “glm” was used instead of “lm” for this analysis. Also in the function for general linear models, we must call a specific family - in this case “binomial” - in order to represent the logit method we are using.

Linear Model with 1 Independent Variable

model1 <- glm(Persistent_Poverty ~ NoCarAccess, data=x, family = "binomial")

Linear Model with 2 Independent Variables

model2 <- glm(Persistent_Poverty ~ NoCarAccess + AdultObesity, data=x, family = binomial(logit))

Model Summary

Model 1

summary(model1)
## 
## Call:
## glm(formula = Persistent_Poverty ~ NoCarAccess, family = "binomial", 
##     data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0585  -0.6076  -0.4466  -0.2848   2.5970  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4074     0.2878 -11.838   <2e-16 ***
## NoCarAccess   0.6587     0.0751   8.772   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 528.95  on 484  degrees of freedom
## Residual deviance: 399.71  on 483  degrees of freedom
## AIC: 403.71
## 
## Number of Fisher Scoring iterations: 5

For this primary model, the estimate of y-intercept is -3.41, and the estimate of the NoCarAccess slope parameter is 0.659. Both of these estimates have p-values of less than 2 x 10^-16.

Model 2

summary(model2)
## 
## Call:
## glm(formula = Persistent_Poverty ~ NoCarAccess + AdultObesity, 
##     family = binomial(logit), data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1030  -0.6067  -0.4085  -0.1393   2.6011  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.31236    1.24845  -6.658 2.77e-11 ***
## NoCarAccess   0.56089    0.07727   7.259 3.91e-13 ***
## AdultObesity  0.16695    0.03966   4.209 2.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 528.95  on 484  degrees of freedom
## Residual deviance: 379.64  on 482  degrees of freedom
## AIC: 385.64
## 
## Number of Fisher Scoring iterations: 5

For this second model, the estimate of y-intercept is -8.31, the estimate of the NoCarAccess slope parameter is 0.56, and the estimate of the AdultObesity slope parameter is 0.167. All three of these estimate have p-values are very small: 2.77 x 10-11, 3.91 x 10-13, and 2.56 x 10-5 respectively.

4. Plots

Standardized Residuals versus Fitted

par(mfrow=c(1,1))
model.st.res <- rstandard(model2)
plot(fitted(model2), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Standardized Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

Diagnostic Plots

attach(x)
hist(Persistent_Poverty)

boxplot(Persistent_Poverty)

plot(x[c(-1)])

NEED STANDARDIZED RESIDUALS

par(mfrow = c(1,1))
qqnorm(rstandard(model2))
qqline(rstandard(model2))   

hist(rstandard(model2),ylab = "Standardized Residuals", main = "Histogram of Standardized Residuals")

boxplot(rstandard(model2),ylab = "Standardized Residuals", main = "Boxplot of Standardized Residuals")

5. Interpretation

Statistical Analysis

As seen in the summary of the second model, the estimate of the y-intercept is -8.31, the slope estimate for NoCarAccess is 0.561, and the slope estimate for AdultObesity is 0.167. The intercept indicates the value of the log odds with a values of 0 for No Car Access percent and Adult Obesity rate This means that for every unit increase (one %) of household without access to cars, the log odds of a county being a persistent poverty county increases by 0.561. Also, for every unit increase (one %) of adult obesity rate, the log odds of being a persistent poverty county increases by 0.167.

Another indicator of the strength of the model are the p-values for the estimate, which in this case are all very small ( 2.77 x 10-11, 3.91 x 10-13, and 2.56 x 10-5 respectively). These are very small values, and they indicate the probability that the variation in the dependent variable is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in persistent poverty counties cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in no car access and adult obesity rate may help to explain the variation in persistent poverty counties.

LINE Analysis of Assumptions

  1. Linear
  2. Independent
  3. Normally Distributed
  4. Equal Variances

1. Linear

According to this assumption, E(Yi) at each set of values Xji is a linear function of the predictors Xji. In other words, the expected value of the errors is 0.

par(mfrow=c(1,1))
model.st.res <- rstandard(model2)
plot(fitted(model2), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Standardized Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

The plot of the fitted vs. residuals definitely does not appear to be linear. The residuals have a clear downward trend that does not look scattered or random across the dynamic range. Also, there are two distinct curves in the graph. This indicates that there is some sort of nonlinear effect.

2. Independent

The errors are independent.

This assumption suggests that there is a lack of autocorrelation in the data. In order to analyze this, we can look at the plot of the residuals in order that they were observed:

plot(model2$residuals, type = "p", col = "black", main = "Residuals plot", ylim=c(-10,10))

There does not appear to be any pattern (i.e. groups of data, seasonality, etc.). For this reason, I believe that we can assume independence in the data.

3. Normally Distributed

The errors at each Xji are normally distributed.

Looking at the histogram of the residuals, I would not classify them as a normal distribution. The data appear to be skew right, with a tail going off to the right side of the plot. The kurtosis of the histogram does not seem very significant. It appears there is a relatively normal level of kurtosis.

hist(rstandard(model2),ylab = "Standardized Residuals", main = "Histogram of Standardized Residuals")

The boxplot below confirms what was seen in the histogram. There is a positive skew to the plot with trails upwards into the positive range.

boxplot(rstandard(model2),ylab = "Standardized Residuals", main = "Boxplot of Standardized Residuals")

4. Equal Variances

The errors at each Xji have equal variances.

Looking below at plot of the residuals it seems like there is some heteroskedasticity. The variance of the residuals seems greater initially in the lower part of the range, but the variance decreases in the greater part of the range.

plot(model2$residuals, type = "p", col = "black", main = "Residuals plot")

The Breusch-Pagan test can also be used in order to test the heteroskedasticity of the data. the null hypothesis of the Breusch-Pagan states that the there is homoskedasticity in the data. The alternative, however, states the opposite - that there is heteroskedasticity in the data.

Breusch-Pagan Test
#install.packages("lmtest")
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.1.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.1.2
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
attach(x)
## The following objects are masked from x (pos = 5):
## 
##     18MinusPercent, 65PlusPercent, AdultObesity, County, Grocery,
##     LowIncomeAccess, NoCarAccess, Persistent_Poverty,
##     PreschoolObesity, ReducedLunch, Supercenter
bptest(Persistent_Poverty ~ NoCarAccess)
## 
##  studentized Breusch-Pagan test
## 
## data:  Persistent_Poverty ~ NoCarAccess
## BP = 10.3197, df = 1, p-value = 0.001316
bptest(Persistent_Poverty ~ AdultObesity)
## 
##  studentized Breusch-Pagan test
## 
## data:  Persistent_Poverty ~ AdultObesity
## BP = 43.8031, df = 1, p-value = 3.631e-11

The first run of the Breusch-Pagan test looks at Persistent Poverty explained by No Car Access, while the second run looks at Persistent Poverty explained by Adult Obesity.The null hypothesis states that the data in homoskedastic, while the alternative states that the data is heteroskedastic. In both runs, the p-values were less than an alpha significance level of 0.05. Therefore, we must reject the null hypothesis. This suggests that we should support the alternative in both cases - that the data presents heteroskedasticity.

The Four Issues

  1. Causality
  2. Sample Sizes
  3. Collinearity
  4. Measurement Error
1. Causality

Ultimate versus proximal cause

Both of the independent variables - no car access and adult obesity would be considered proximal causes. They are not ultimate causes because they are not the initial point of causation of being an impoverished county - there are many factors which would affect this.

Determinate versus probabilistic Cause

Both of these independent variables are also probabilistic causes. It is possible that there will be exceptions for these independent variables causing poverty in a county. For example - many residents of New York city do not own cars and they don’t have access to cars. this does not necessarily mean that that county will be impoverished. Instead, some areas of NYC are quite wealthy.

2. Sample Sizes

The dataset that was used is extremely large. Because of this, it can be unclear whether the independent variables are significant as a result of their actual effect, or if it is because there are so many data points.

In order to solve this problem, we must complete a power analysis using G* Power to find the sample size needed for certain levels of beta, alpha, etc.

We will use a small effect size 0.02, an alpha level of 0.05, and a power level of 0.8, with 2 predictors, we can then calculate the sample size using the G*Power Program.

Using G*Power, a sample size of 485 was calculated. This sample size was used to subset the data earlier in this analysis.

3. Collinearity

Collinearity deals with the issue that the two independent variables may be correlated. If the R^2 value of a regression using one IV to predict the other, then there would likely be collinearity.

collinearity <- lm(AdultObesity ~ NoCarAccess, data = x)
summary(collinearity)
## 
## Call:
## lm(formula = AdultObesity ~ NoCarAccess, data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3916  -2.1282   0.4417   2.6239  11.4333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.83236    0.30111   92.43   <2e-16 ***
## NoCarAccess  0.80896    0.08039   10.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.867 on 483 degrees of freedom
## Multiple R-squared:  0.1733, Adjusted R-squared:  0.1716 
## F-statistic: 101.3 on 1 and 483 DF,  p-value: < 2.2e-16

The R^2 value looking at Adult Obesity predicted by No Car Access is 0.1716, meaning that 17.16% of variation in Adult Obesity rates can be explained by No Car Access percents. This is a relatively low R^2 value. Therefore, I tend to think that there is not a significant amount of multicollinearity in the data.

4. Measurement Error

There is definitely potential for measurement error in this data. The data was compiled by the Economic Research Service within the U.S. Department of Agriculture, and was collected from multiple sources, including census data and surveys. The independent variables, adult obesity rate and no car access percent were both likely the result of surveys. Surveys can have the problem of self-reporting, resulting in inaccurate results. The persistent poverty index is likely fairly accurate because it is the result of a government designation.

Interaction Effects

model_interaction <- glm(Persistent_Poverty ~ NoCarAccess * AdultObesity, data=x, family = binomial(logit))
summary(model_interaction)
## 
## Call:
## glm(formula = Persistent_Poverty ~ NoCarAccess * AdultObesity, 
##     family = binomial(logit), data = x)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3179  -0.5996  -0.4161  -0.1982   2.5133  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)  
## (Intercept)              -5.82927    2.63388  -2.213   0.0269 *
## NoCarAccess              -0.18227    0.72816  -0.250   0.8023  
## AdultObesity              0.08772    0.08463   1.037   0.2999  
## NoCarAccess:AdultObesity  0.02347    0.02301   1.020   0.3077  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 528.95  on 484  degrees of freedom
## Residual deviance: 378.58  on 481  degrees of freedom
## AIC: 386.58
## 
## Number of Fisher Scoring iterations: 5

Conclusion

Given all of these analyses, what is your interpretation of this (predictive or explanatory) model?

6. References