Introduction

In this project, we investigate, detect and attempt to predict an interesting urban phenomenon. Over the past 50 years or so, some of the nation’s biggest metropolitan areas have experienced a unique change resulting from a set of urban redevelopment and rehabilitation processes known as gentrification.

Gentrification is defined as ‘Middle class settlement in renovated or redeveloped properties in older, inner-city districts formerly occupied by lower-income populations’ (Gregory et al., 2009). Although there are many alternative definitions, most researchers agree that gentrification takes place when socially and economically affluent new settlers move into rehabilitated and redeveloped low-income neighborhoods (Hammel & Wyly, 1996; Freeman, 2005). The influx of high-income residents over time puts pressure on housing prices and living expenses, which leads lower-income people to migrate out of their neighborhoods.

The nation’s capital, Washington, D.C., is among the most well-known examples of gentrification. Over the years, the District has been going through unprecedented urban transformation as young, educated and well-paid people continue to settle in (Arévalo et al., 2012). According to a report by the Census Bureau, the District has received at least 100,000 new settlers since the year 2000. The increase, especially in the population group that is better off than the existing residents, has driven up housing prices sharply. As a result, new settlers began to move into relatively affordable neighborhoods, causing low-income residents to migrate out (Guerrieri et al., 2010).

In addition to several case studies, empirical analyses by Hammel & Wyly (1996, 2001) and other researchers indicate that statistical methods can also be used to study gentrification. In this project, therefore, we attempt to empirically detect gentrification through the change in median household income that is associated with the influx of high-income residents.

To do so, we built two different kinds of regression models, examined their outputs and assessed the relationship between the dependent variable, income, and the explanatory variables. The project is organized into three parts. In the first, we introduce the data, lay out the null hypothesis, run exploratory data analysis and conduct feature selection. The second part is model building: we start with an Ordinary Least Squares regression model and then move to logistic regression. The third part covers results and summary, where we also outline limitations and potential solutions for future similar projects.

The data

The data for this project was collected from the Census Bureau. To select the variables we need, in addition to the features of gentrification discussed above, we referred to similar work conducted by other researchers (Heidkamp & Lucas, 2006; Arévalo et al., 2012; Hammel & Wyly, 2001). Although gentrification is taking place in many cities around the country, our data collection considers the changes in the context of Washington, DC.

The variables are:

Key             Description
GEO.id          Geographic (unique) id
p_chg_wh        Percentage change in the white population
p_chg_bk        Percentage change in the black population
p_chg_edc       Percentage change in people 25 years old and above with a Bachelor's degree or higher
change_incm     Change in median household income (USD, adjusted for inflation)
change_hhsval   Change in median house value (USD, adjusted for inflation)
change_rent     Change in median gross rent (USD, adjusted for inflation)
p_chg_ownd      Percentage change in the number of owner-occupied households
p_chg_rentd     Percentage change in the number of renter-occupied households
p_chg_pvt       Percentage change in the number of people below the poverty line
change_vcri     Change in the number of violent crimes
p_chg_chfm      Percentage change in the number of families with children
chg_md_age      Change in median age

Now, let us load the data in R and explore!
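For reference, here is a minimal sketch of how the data could be read in; the file name dc_census_tract_changes.csv and the object name gd are placeholders, not the actual working files.

# read the census-tract-level dataset (hypothetical file name)
gd <- read.csv("dc_census_tract_changes.csv", stringsAsFactors = FALSE)

# inspect the first 10 observations and the overall structure
head(gd, 10)
str(gd)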

Description

The dataset has 179 observations, each representing one of the 179 census tracts in DC, and a total of 13 columns. Let’s take a look at the first 10 observations and the overall structure of the data.

'data.frame':   179 obs. of  13 variables:
 $ GEO.id       : num  1.1e+10 1.1e+10 1.1e+10 1.1e+10 1.1e+10 ...
 $ p_chg_wh     : num  1.9 -14.4 -9.14 -1.96 -10.71 ...
 $ p_chg_bk     : num  -3.53 2.22 6.03 3.19 0.84 -0.66 -0.11 -3.28 1.94 -3.5 ...
 $ p_chg_edc    : num  -4.51 -1.48 -6.94 1.86 -2.6 ...
 $ change_incm  : num  44567 14714 25884 21361 -3328 ...
 $ change_hhsval: int  402630 309790 287069 300978 473559 352800 196087 139104 -203462 -164277 ...
 $ change_rent  : num  598 476 359 428 484 ...
 $ p_chg_ownd   : num  -3.77 1.72 3.78 0.43 -5.13 2.05 -3.86 7.01 -3.25 2.69 ...
 $ p_chg_rentd  : num  -6.98 -1.33 -9.71 -0.83 1.96 ...
 $ p_chg_pvt    : num  -1.59 -1.52 -6.21 -0.18 6.89 0.69 -2.22 4.33 4.12 -6.5 ...
 $ change_vcri  : int  -132 1 -54 -28 0 -53 -21 -60 -5 -9 ...
 $ p_chg_chfm   : num  10.21 0 10.33 19.04 -3.11 ...
 $ chg_md_age   : num  -3.9 -0.2 -8 2 0.7 1.8 -1.1 -0.4 3.6 -2 ...

Except for the first column, which is a unique geo-id, the dataset contains values representing the social, economic and demographic factors associated with gentrification. As we shall see later, not all of them are equally relevant for building our models, but our initial data collection included as many predictors as the literature covers.

Exploratory Data Analysis

Next, we will run basic descriptive statistics, visualize relationships and detect some patterns in the dataset.
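The descriptive statistics below can be produced with a single call, assuming the data frame is named gd as in the loading sketch; the geographic id column is dropped since summarizing an identifier is not meaningful.

# descriptive statistics for every variable except the geographic id
summary(gd[ , -1])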

    p_chg_wh          p_chg_bk         p_chg_edc       change_incm      
 Min.   :-14.400   Min.   :-74.570   Min.   :-15.38   Min.   :-37259.1  
 1st Qu.: -0.155   1st Qu.:-21.255   1st Qu.:  3.40   1st Qu.:  -823.7  
 Median :  3.870   Median : -8.800   Median :  8.00   Median : 10834.1  
 Mean   :  9.720   Mean   :-13.530   Mean   : 12.51   Mean   : 15897.3  
 3rd Qu.: 18.385   3rd Qu.: -1.275   3rd Qu.: 20.80   3rd Qu.: 31858.0  
 Max.   : 66.850   Max.   :  8.030   Max.   : 56.07   Max.   : 95491.5  
 change_hhsval      change_rent        p_chg_ownd      p_chg_rentd     
 Min.   :-373601   Min.   : -33.39   Min.   :-22.70   Min.   :-67.820  
 1st Qu.:  86860   1st Qu.: 245.85   1st Qu.: -3.22   1st Qu.: -5.725  
 Median : 192824   Median : 421.40   Median :  0.21   Median : -1.270  
 Mean   : 191139   Mean   : 480.83   Mean   :  1.72   Mean   : -1.328  
 3rd Qu.: 319432   3rd Qu.: 637.17   3rd Qu.:  5.39   3rd Qu.:  3.755  
 Max.   : 508399   Max.   :1951.96   Max.   : 68.39   Max.   : 33.050  
   p_chg_pvt        change_vcri        p_chg_chfm       chg_md_age      
 Min.   :-50.460   Min.   :-391.00   Min.   :-45.83   Min.   :-22.2000  
 1st Qu.: -7.830   1st Qu.:-141.00   1st Qu.:  8.03   1st Qu.: -2.5500  
 Median : -1.520   Median : -86.00   Median : 17.06   Median : -0.4000  
 Mean   : -2.424   Mean   : -97.79   Mean   : 18.57   Mean   : -0.3955  
 3rd Qu.:  3.680   3rd Qu.: -43.00   3rd Qu.: 28.23   3rd Qu.:  2.0500  
 Max.   : 17.140   Max.   :  30.00   Max.   : 65.97   Max.   : 12.5000  

In the past 16 years, Washington DC has seen some interesting changes. At the census tract level, the percentage share of the white population increased by an average of 10% while the share of the black population declined by 13.5%. Similarly, the percentage share of degree holders went up by, on average, 12.5% while the share of the population below the poverty line declined by 2.4%. The median house value and gross rent also increased significantly. Interestingly, the median household income went up by, on average, $15,897.

Let’s visualize some of the associations among the variables!
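As a sketch of the kind of plots behind the observations that follow (assuming the data frame gd and the ggplot2 package), two of the associations could be drawn as:

library(ggplot2)

# change in white vs. black population share across tracts
ggplot(gd, aes(x = p_chg_wh, y = p_chg_bk)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# change in the share of degree holders vs. change in median income
ggplot(gd, aes(x = p_chg_edc, y = change_incm)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)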

The changes in the percentage shares of the white and black populations exhibit a negative association, while the change in income and the share of degree holders tend to be positively associated. Most census tracts saw a decline in the number of violent crimes, although the District still has a high crime rate due to theft-related incidents. Most places also saw an increase in rent, ranging from a few hundred dollars to a couple of thousand. The share of low-income people also declines as rent increases.

Dependent Variable

Gentrification takes many forms, depending on the location of the urban transformation, and can be seen from many angles. Most of the literature, however, agrees that gentrified neighborhoods can be identified by the settlement of high-income new residents in neighborhoods that were once considered poor and deteriorating (Ellen & Ding, 2016; Hammel & Wyly, 1996; Smith, 1982). Hence, a sharp increase in household income in low-income neighborhoods can be explained by factors that capture this change.

The null hypothesis is that there is no statistically significant relationship between the change in income and the social, economic and demographic changes in a neighborhood. In line with this hypothesis, we would also like to assess the potential of statistical methods to study gentrification in dynamic cities such as the District.

Checking Distribution

Before moving on, we would like to check the normality of the distribution of our dependent variable (change in income). To do so, we’ll examine its distribution, first using a set of four graphical outputs and then by running skewness, kurtosis and Shapiro-Wilk tests.
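A sketch of those checks is shown below; Income_change is the dependent variable (as named in the test output), and the skewness and kurtosis functions are assumed to come from the e1071 package.

library(e1071)  # skewness() and kurtosis()

# four graphical checks of the distribution
par(mfrow = c(2, 2))
hist(Income_change, main = "Histogram")
boxplot(Income_change, main = "Boxplot")
plot(Income_change, main = "Scatter plot")
qqnorm(Income_change); qqline(Income_change)

# formal tests
shapiro.test(Income_change)
data.frame(Skewness = skewness(Income_change), Kurtosis = kurtosis(Income_change))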

Accordingly, the change in income is slightly right-tailed. In the scatter plot, some extreme values are also visible. The QQ-Plot also indicates a distribution that deviates from normal.


    Shapiro-Wilk normality test

data:  Income_change
W = 0.93917, p-value = 6.973e-07
   Skewness Kurtosis
1 0.9339091 0.831103

Additional diagnostics using the Shapiro-Wilk test indicate that the null hypothesis of normality is rejected, meaning the variable is not normally distributed. The skewness and kurtosis values also show that the distribution of our dependent variable is skewed and tailed to the right. We shall later see how the residuals from the OLS model behave and decide conclusively whether the distribution is truly skewed or not.

Independent Variables

Now that we have examined the distribution of the dependent variable, let us first see the relationships among the independent variables and then select the best predictors of income among them.

Multicollinearity

One of the underlying assumptions of Ordinary Least Squares regression is that there is no multicollinearity among the predictor variables. In other words, we don’t want our independent variables to have moderate to high correlation with one another. If such correlations exist, we shall use the Variance Inflation Factor (VIF) test to identify the multicollinear variables we may need to exclude from our model.
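A quick way to inspect those pairwise correlations (again assuming the data frame gd) is a rounded correlation matrix:

# pairwise correlations among the candidate predictors (id column dropped)
round(cor(gd[ , -1]), 2)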

It looks like there are some highly correlated variables. By running the VIF test, we can see by how much the variance of each coefficient generated by the model is inflated because of collinearity. Just to give you a heads up, the VIF test in R is found in the car library.
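A minimal sketch of the VIF computation, assuming the full model is fit on every candidate predictor in gd:

library(car)  # vif()

# fit income on every candidate predictor except the geographic id
full_model <- lm(change_incm ~ . - GEO.id, data = gd)
vif(full_model)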

              vif.value.
p_chg_wh       11.638064
p_chg_bk        7.301797
p_chg_edc       6.649509
change_hhsval   1.253591
change_rent     2.013559
p_chg_ownd      3.040424
p_chg_rentd     2.882633
p_chg_pvt       1.982138
change_vcri     1.230919
p_chg_chfm      1.242959
chg_md_age      1.149341

In this case, at least three of our variables show a VIF value greater than 4 and are either moderately or highly multicollinear: the changes in the percentage shares of the white population, the black population and degree holders. Before hastily excluding any variables from our analysis, let us see which subset might actually make good predictors.

Feature Selection

The multicollinearity we detected could be due to the presence of redundant or irrelevant variables, and this can be addressed through feature (variable) selection. By selecting a subset of predictors, we can eliminate redundant variables that might otherwise cause overfitting and biased estimates, as sketched below.
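One way to run this subset search, assuming the leaps package and the gd data frame, is best-subset selection scored by BIC:

library(leaps)  # regsubsets()

# exhaustive best-subset search over all candidate predictors
best_subsets <- regsubsets(change_incm ~ . - GEO.id, data = gd, nvmax = 11)

# BIC of the best model of each size; lower values indicate better subsets
summary(best_subsets)$bic
plot(best_subsets, scale = "bic")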

Based on the Bayesian Information Criterion (BIC), one subset of our variables adds much more value to our model than the others. This subset includes the changes in the white population, rent, house value, degree holders and low-income people. We can also look at the R-squared values and gather the same information. For our purposes, we’ll regress change in median household income (change_incm) on change in rent (change_rent), poverty (p_chg_pvt) and education (p_chg_edc).

OLS Regression Model

We will begin with the Ordinary Least Squares (OLS) regression model and see whether the estimates hold up to the assumptions of a linear regression model.
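The fitted model reported below can be reproduced with the following call; model1 and sub_data are the object names used elsewhere in this write-up.

# OLS: regress change in median income on rent, education and poverty changes
model1 <- lm(change_incm ~ change_rent + p_chg_edc + p_chg_pvt, data = sub_data)
summary(model1)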


Call:
lm(formula = change_incm ~ change_rent + p_chg_edc + p_chg_pvt, 
    data = sub_data)

Residuals:
   Min     1Q Median     3Q    Max 
-32848  -8234  -1134   7007  36625 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6129.280   1814.832  -3.377 0.000902 ***
change_rent    28.238      3.624   7.792 5.61e-13 ***
p_chg_edc     535.072     96.393   5.551 1.04e-07 ***
p_chg_pvt    -724.475    136.765  -5.297 3.50e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12470 on 175 degrees of freedom
Multiple R-squared:  0.7061,    Adjusted R-squared:  0.7011 
F-statistic: 140.2 on 3 and 175 DF,  p-value: < 2.2e-16

As you can see from the p-value of each of the variables we chose, the changes associated with rent, education and poverty are statistically significant predictors of the change in income. The model also has a relatively high R-squared, indicating that these variables can potentially explain the variation in income. Before accepting the validity of this model, let’s examine the residuals.

Residuals

We’ll first look at their overall pattern and then check for heteroscedasticity using the Breusch-Pagan test.
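The standard diagnostic plots for the fitted model (residuals vs. fitted values, QQ-plot, scale-location and Cook's distance/leverage) can be drawn with:

# residual diagnostics for the OLS model fitted above
par(mfrow = c(2, 2))
plot(model1)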

Some of the observations in our data seem to skew the overall distribution of the residuals. Interestingly, the Cook’s distance plot shows that the U Street Corridor (64) and Navy Yard (91) neighborhoods deviate more than the rest. Since these two neighborhoods went through rapid urban transformation in the last 20 years or so, their deviation reflects a high level of gentrification.

Let’s now test for heteroscedasticity and assess the statistical significance of the pattern in the residuals. To do so, we will use the Breusch-Pagan test.
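A minimal sketch of the test, assuming the lmtest package; the non-studentized version is used here to match the output header below.

library(lmtest)  # bptest()

# classical (non-studentized) Breusch-Pagan test on the OLS residuals
bptest(model1, studentize = FALSE)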


    Breusch-Pagan test

data:  model1
BP = 10.401, df = 3, p-value = 0.01545

The result suggests that we should reject the null hypothesis of homoscedasticity. Since this undermines the validity of our model, a few potential solutions might help us correct the issue:

  1. Remove or impute the highly deviating observations. This solution, however, might lead to misleading estimates because those neighborhoods are theoretically significant to our understanding of gentrification in Washington DC.

  2. Fit the data using different modeling techniques. We chose to go this route to see if other techniques could help us detect the change in income better. In regional and econometric studies, an alternative approach known as the general IV/GMM model is also used. Since that technique is beyond the scope of this course, we’ll test whether K-Nearest Neighbor classification and logistic regression will do the trick.

K-Nearest Neighbor

Our second approach is to use the K-Nearest Neighbor (KNN) algorithm to attempt to detect gentrification. First, we will see the baseline chance of correctly classifying the neighborhoods based on income.
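The baseline shown below is simply the share of the single most frequent value of change_incm in the dataset (here assumed to be stored in knn_data):

# proportion of the most common income-change value: the "guess one class" baseline
head(sort(table(knn_data$change_incm), decreasing = TRUE) / nrow(knn_data), 1)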

   95491.51 
0.005586592 

Accordingly, we have about a 0.6% chance of classifying correctly by guessing the most common value. It’s extremely low, but we would like to see if KNN will do a better job. First, we have to split our data into training and testing groups. We’ll use 80% of our dataset for training and 20% for testing the KNN algorithm.

set.seed(1)
# randomly draw 80% of the row indices for the training set
data_train_rows = sample(1:nrow(knn_data),
                         round(0.8 * nrow(knn_data), 0),
                         replace = FALSE)

data_train = gd_v2[data_train_rows, ]   # training set (80% of the tracts)
data_test  = gd_v2[-data_train_rows, ]  # test set (the remaining 20%)

Training_data <- nrow(data_train)
Testing_data  <- nrow(data_test)

data.frame(Training_data, Testing_data)
  Training_data Testing_data
1           143           36

Accordingly, 143 of the 179 observations will be used to train the algorithm while the remaining 36 observations will be used for testing.

Classifying using the data

We shall now use the algorithm to classify the neighborhoods. Later on, we will compare the classification results to the true classes using a confusion matrix.

set.seed(1)
library(class)  # knn()

# classify test tracts using the three selected predictors and the 5 nearest neighbors
bank_3NN = knn(train = data_train[, c("p_chg_edc", "p_chg_pvt", "change_rent")],
               test  = data_test[, c("p_chg_edc", "p_chg_pvt", "change_rent")],
               cl    = data_train[, "change_incm"],  # continuous income change used as the class label
               k     = 5,
               use.all = TRUE)

# confusion matrix of predicted vs. actual values and the resulting accuracy
kNN_res = table(bank_3NN,
                data_test$change_incm)
kNN_acc = sum(kNN_res[row(kNN_res) == col(kNN_res)]) / sum(kNN_res)
kNN_acc
[1] 0

As it turns out, the data we fed into the algorithm cannot be classified this way: with a continuous income change serving as the class label, KNN never reproduces the exact held-out values. Our next step is to fit the data using a logistic regression modeling technique.

Logistic Regression Model

In simple terms, a logistic regression model can be understood as estimating the probability of an outcome. Now that we know some neighborhoods exhibit a sharp difference from the others, we want the logistic regression to identify which neighborhoods have seen a sharp increase in income compared to the rest.

Modifying dataset

The first task is to create a factor variable that categorizes the income change into groups. We’ll begin by looking at the quantile distribution of the income change.
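Assuming the original data frame gd, the quartiles below are produced by:

# quartile breakdown of the change in median household income
quantile(gd$change_incm)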

       0%       25%       50%       75%      100% 
-37259.10   -823.70  10834.06  31857.96  95491.51 

About 75% of the census tracts saw a decline or only a modest change, while the remaining 25% show a sharp increase. Let’s divide this variable into those showing a decline or a slight to moderate increase (‘low’) and those showing a sharp increase (‘high’), using the 75th percentile as the break point.
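A sketch of that recoding step, again assuming the original data frame gd; the id and raw income columns are dropped and a two-level factor y is added.

# cut point: the 75th percentile of the income change
cut_point <- quantile(gd$change_incm, 0.75)

# keep the predictors and add the binary outcome ("high" = sharp increase)
gentrification <- gd[ , !(names(gd) %in% c("GEO.id", "change_incm"))]
gentrification$y <- factor(ifelse(gd$change_incm > cut_point, "high", "low"),
                           levels = c("low", "high"))

head(gentrification)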

  p_chg_wh p_chg_bk p_chg_edc change_hhsval change_rent p_chg_ownd
1     1.90    -3.53     -4.51        402630      597.79      -3.77
2   -14.40     2.22     -1.48        309790      475.50       1.72
3    -9.14     6.03     -6.94        287069      358.64       3.78
4    -1.96     3.19      1.86        300978      427.51       0.43
5   -10.71     0.84     -2.60        473559      483.63      -5.13
6    -1.35    -0.66      4.89        352800      443.84       2.05
  p_chg_rentd p_chg_pvt change_vcri p_chg_chfm ch_md_age    y
1       -6.98     -1.59        -132      10.21      -3.9 high
2       -1.33     -1.52           1       0.00      -0.2  low
3       -9.71     -6.21         -54      10.33      -8.0  low
4       -0.83     -0.18         -28      19.04       2.0  low
5        1.96      6.89           0      -3.11       0.7  low
6       -4.87      0.69         -53       1.72       1.8  low

Accordingly, 90 census tracts show a sharp increase. Before moving on, we need to make sure that this new categorical variable is recognized as a factor variable.

'data.frame':   179 obs. of  12 variables:
 $ p_chg_wh     : num  1.9 -14.4 -9.14 -1.96 -10.71 ...
 $ p_chg_bk     : num  -3.53 2.22 6.03 3.19 0.84 -0.66 -0.11 -3.28 1.94 -3.5 ...
 $ p_chg_edc    : num  -4.51 -1.48 -6.94 1.86 -2.6 ...
 $ change_hhsval: int  402630 309790 287069 300978 473559 352800 196087 139104 -203462 -164277 ...
 $ change_rent  : num  598 476 359 428 484 ...
 $ p_chg_ownd   : num  -3.77 1.72 3.78 0.43 -5.13 2.05 -3.86 7.01 -3.25 2.69 ...
 $ p_chg_rentd  : num  -6.98 -1.33 -9.71 -0.83 1.96 ...
 $ p_chg_pvt    : num  -1.59 -1.52 -6.21 -0.18 6.89 0.69 -2.22 4.33 4.12 -6.5 ...
 $ change_vcri  : int  -132 1 -54 -28 0 -53 -21 -60 -5 -9 ...
 $ p_chg_chfm   : num  10.21 0 10.33 19.04 -3.11 ...
 $ ch_md_age    : num  -3.9 -0.2 -8 2 0.7 1.8 -1.1 -0.4 3.6 -2 ...
 $ y            : Factor w/ 2 levels "low","high": 2 1 1 1 1 1 1 2 1 1 ...

Best GLM technique

We will start with a logistic regression approach that searches over subsets of the inputs and selects the one with the smallest deviance. The output below reports the best model found by comparing the AIC of all candidate models.
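The search below is a sketch using the bestglm package (the ‘Morgan-Tatar search’ message in the output is what bestglm prints for non-Gaussian families); the response must be the last column of the data frame, which y already is.

library(bestglm)

# exhaustive search for the logistic model with the best AIC
income.bglm <- bestglm(Xy = gentrification, family = binomial, IC = "AIC")

# summary of the winning model
summary(income.bglm$BestModel)

# odds ratios for the selected predictors
exp(coef(income.bglm$BestModel))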

Morgan-Tatar search since family is non-gaussian.

Call:
glm(formula = y ~ ., family = family, data = Xi, weights = weights)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.77621  -0.36874  -0.17568  -0.02026   2.50161  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -5.837e+00  9.525e-01  -6.128  8.9e-10 ***
p_chg_edc      8.518e-02  2.618e-02   3.254  0.00114 ** 
change_hhsval  4.037e-06  1.805e-06   2.236  0.02533 *  
change_rent    3.429e-03  1.107e-03   3.096  0.00196 ** 
p_chg_rentd   -3.553e-02  2.471e-02  -1.438  0.15040    
p_chg_pvt     -1.186e-01  4.095e-02  -2.896  0.00377 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 201.863  on 178  degrees of freedom
Residual deviance:  96.622  on 173  degrees of freedom
AIC: 108.62

Number of Fisher Scoring iterations: 6
  (Intercept)     p_chg_edc change_hhsval   change_rent   p_chg_rentd 
  0.002917577   1.088912955   1.000004037   1.003434847   0.965094244 
    p_chg_pvt 
  0.888148009 

The next test we would like to run is the Hosmer-Lemeshow goodness-of-fit test. This evaluation, alongside the others, helps examine whether the observed event rates match the expected event rates in subgroups of the model population.
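A sketch of the test using the ResourceSelection package; note that hoslem.test() expects a 0/1 response, so the ‘low’/‘high’ factor is recoded here (an assumption, since the printed call passes the factor directly).

library(ResourceSelection)  # hoslem.test()

# Hosmer-Lemeshow GOF test on the best model's fitted probabilities
hoslem.test(as.numeric(gentrification$y) - 1,   # recode low/high to 0/1
            fitted(income.bglm$BestModel),
            g = 10)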


    Hosmer and Lemeshow goodness of fit (GOF) test

data:  gentrification$y, fitted(income.bglm$BestModel)
X-squared = 179, df = 8, p-value < 2.2e-16

The test rejects the null hypothesis that the observed and expected rates agree, so the fitted probabilities should be interpreted with some caution. Our next task is to find the probabilities for each response, calculate the hit rate and plot the ROC curve.

library(pROC)  # roc()

income.prob.final <- predict(income.bglm$BestModel, type = c("response"))  # fitted probabilities
View(income.prob.final)                                                    # inspect them interactively
income.hit.final <- roc(y ~ income.prob.final, data = gentrification)      # ROC curve for the best model
income.hit.final
plot(income.hit.final)

With the information from the best GLM, we’re going to see whether we can partition the data and test our model. This time we will also restrict the model to a handful of the predictors.

# the first 125 tracts are used for training, the remaining 54 for testing
train.income.final <- gentrification[1:125, ]
test.income.final  <- gentrification[126:179, ]

income.model.final <- glm(y ~ p_chg_wh + change_rent + p_chg_rentd + p_chg_pvt + p_chg_chfm,
                          family = binomial(link = "logit"),
                          data = train.income.final)
summary(income.model.final)

Call:
glm(formula = y ~ p_chg_wh + change_rent + p_chg_rentd + p_chg_pvt + 
    p_chg_chfm, family = binomial(link = "logit"), data = train.income.final)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5508  -0.4553  -0.2019   0.1867   2.4108  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.9948051  0.9616268  -4.154 3.26e-05 ***
p_chg_wh     0.0978332  0.0274229   3.568  0.00036 ***
change_rent  0.0032999  0.0012803   2.577  0.00995 ** 
p_chg_rentd  0.0008607  0.0326292   0.026  0.97896    
p_chg_pvt   -0.1261501  0.0541597  -2.329  0.01985 *  
p_chg_chfm  -0.0303358  0.0235003  -1.291  0.19675    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 153.554  on 124  degrees of freedom
Residual deviance:  75.989  on 119  degrees of freedom
AIC: 87.989

Number of Fisher Scoring iterations: 6

To generate the odds ratios (exponentiated coefficients), we’ll run the following code as well.

income.output.final <- exp(coef(income.model.final))  # odds ratios
income.output.final
(Intercept)    p_chg_wh change_rent p_chg_rentd   p_chg_pvt  p_chg_chfm 
 0.01841103  1.10277887  1.00330535  1.00086107  0.88148251  0.97011967 

Prediction

Now that we have a sense of our model, we’ll use it to predict probability values. These values are the estimated chances that each census tract falls into the higher or lower income-change group.
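A sketch of the prediction and evaluation steps behind the output below, assuming a 0.5 cutoff for labeling a tract ‘high’ and the ROCR package for the AUC:

library(ROCR)  # prediction(), performance()

# predicted probabilities of a sharp income increase for the held-out tracts
test.prob <- predict(income.model.final, newdata = test.income.final, type = "response")

# convert probabilities to class labels with a 0.5 cutoff (assumption)
test.pred <- factor(ifelse(test.prob > 0.5, "high", "low"), levels = c("low", "high"))

# confusion table and hit rate on the test set
table(Predicted = test.pred, Actual = test.income.final$y)
mean(test.pred == test.income.final$y)

# area under the ROC curve
pred.obj <- prediction(test.prob, test.income.final$y)
performance(pred.obj, measure = "auc")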

     high
low     0
high    1
126 127 128 129 130 131 
  1   1   1   0   0   0 
# A tibble: 2 x 2
       y no_rows
  <fctr>   <int>
1    low      47
2   high       7
[1] high high low  low  low  low 
Levels: low high
[1] 0.9444444

An object of class "performance"
Slot "x.name":
[1] "None"

Slot "y.name":
[1] "Area under the ROC curve"

Slot "alpha.name":
[1] "none"

Slot "x.values":
list()

Slot "y.values":
[[1]]
[1] 0.05471125


Slot "alpha.values":
list()

The very low AUC value suggests that the small number of observations makes it difficult to validate our model on held-out data. However, the model has overall made workable progress, and further research can improve upon these findings and develop a model that performs well.

Summary and Conclusion

In this project, we attempted to detect gentrification in Washington DC using empirical analysis. There is ample evidence of gentrification in the District, and our goal was to leverage statistical tools to assess its overall pattern. Our findings can be summarized as follows:

  1. Overall, our analysis indicates that regression modeling techniques are indeed useful for studying gentrification. This is an important finding because it addresses the criticism that studies of gentrification lack macro-scale empirical analysis. The modeling techniques we used are clearly relevant to future research on this topic.

  2. The OLS regression model didn’t give us reliable estimates: because of the wide differences in the scale of changes across Washington DC, the residuals failed to show a normal distribution. However, by introducing other techniques such as Two-Stage Least Squares (2SLS), the Generalized Method of Moments (GMM) and instrumental variables, this linear regression approach could be improved to yield better results.

  3. We also detected a statistically significant relationship between the change in household income and explanatory variables that include the changes in the share of people with higher education, the share of people living below the poverty line and the median gross rent.

  4. We didn’t manage to cross-validate the outputs of the logistic regression model because the number of observations we used is relatively small. As a result, we recommend that future empirical studies of small cities such as the District use values at the census-block level rather than the census-tract level.

In summary, our attempt to study and detect gentrification in Washington DC through the attributes and features associated with the change suggests that further research in this area will help us understand the pattern better.


References

Arévalo, J. C., Pető, B., Suaya, A., & Mann, L. M. (2012). Demographic changes and gentrification in Washington DC between 2000 and 2010. Papers of the Applied Geography Conferences.

Ellen, I., & Ding, L. (2016). Advancing our understanding of gentrification. Cityscape: A Journal of Policy Development and Research, 18(3), 1-8.

Freeman, L. (2005). Displacement or succession? Residential mobility in gentrifying neighborhoods. Urban Affairs Review, 40(4), 463-491.

Guerrieri, V., Hartley, D., & Hurst, E. (2010). Endogenous gentrification and housing price dynamics. National Bureau of Economic Research, Working Paper 16237. Retrieved from http://www.nber.org/papers/w16237 on 19 Nov 2017.

Gregory, D., Johnston, R., Pratt, G., Watts, M., & Whatmore, S. (2009). The Dictionary of Human Geography. Hoboken, NJ: Wiley-Blackwell.

Hammel, D. J., & Wyly, E. K. (1996). A model for identifying gentrified areas with census data. Urban Geography, 17(3), 248-268.

Hammel, D. J., & Wyly, E. K. (2001). Modeling the context and contingency of gentrification. Journal of Urban Affairs, 20(3), 303-326.

Heidkamp, C. P., & Lucas, S. (2006). Finding the gentrification frontier using census data: The case of Portland, Maine. Urban Geography, 27(2), 101-125. DOI: 10.2747/0272-3638.27.2.101

Smith, N. (1982). Gentrification and uneven development. Economic Geography, 58(2), 139-155.


@Tidy Insights
Introduction to Data Science
The George Washington University
Dec-2017