## CLEAR ENVIRONMENT ##
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 530530 28.4 1181100 63.1 NA 669400 35.8
## Vcells 983996 7.6 8388608 64.0 16384 1851696 14.2
cat("\f") # Clear the console
## IMPORT DATA ##
library(readxl)
# Read excel file and store into df
df_panel <- read_excel("Downloads/Take-home assignment #3-4/Panel data - slum population.xlsx")
# View(df_panel)
# 360 observations of 22 variables
The following report will focus on the causal effect of urban growth on the prevalence of slums over time. Both OLS and fixed effects models will be used to explore this relationship, controlling for country demographics, and individual and time fixed effects.
The analysis uses a country-level panel data set that records the proportion of the urban population living in slums for several countries over several years. The data cover sixty developing countries over six years, for a total of 360 observations of 22 variables.
An initial pass over the data set in R shows that approximately 29% of all values are missing across the variables.
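The share of missing values can be computed directly from a data frame. The sketch below uses a small made-up data frame (`toy`, a stand-in for `df_panel`; its column values are invented for illustration) and shows both the overall share and the share per variable.

```r
# Toy data frame standing in for df_panel (all values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(1990, 1995, 1990, 1995),
  proportion_slums = c(50.2, NA, 33.1, 30.0),
  gini = c(NA, NA, 4.8, NA)
)

# is.na() returns a logical matrix; its mean is the share of missing cells
overall_missing <- mean(is.na(toy))

# Share of missing values per variable, useful for spotting sparse controls
missing_by_var <- colMeans(is.na(toy))

print(round(100 * overall_missing, 1))
print(round(100 * missing_by_var, 1))
```

Applied to the real data, the per-variable shares would indicate which controls drive the drop in usable observations once they enter a regression.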
Additionally, I created a new variable called country_id, which sorts the countries in the country variable and assigns a numeric ID to each unique country. This will allow country names to be linked to their corresponding numeric IDs for further analysis and modeling when running the fixed effects models.
## CREATE COUNTRY UNIQUE IDENTIFIER ##
# Load required package (library() stops with an error if dplyr is missing)
library(dplyr)
# Sort the data by the 'country' variable (if it's not already sorted)
df_panel <- df_panel %>% arrange(country)
# Get the unique countries, in sorted order
unique_countries <- unique(df_panel$country)
# Create the new variable 'country_id': each country's position in unique_countries
df_panel <- df_panel %>%
  mutate(country_id = match(country, unique_countries))
Overall, the data consist of 360 observations spanning 1990 to 2009. The dependent variable, the proportion of slums within a country, is approximately normally distributed and widely dispersed, with values ranging from 3.3% to 98.9% and a standard deviation of 23%. GDP per capita in USD ranges from $434 to $16,633 and is right-skewed, meaning that most countries' GDP per capita values are clustered toward the lower end. Annual GDP growth is approximately normal but widely dispersed, ranging from -6.9% to 36% and centered around 4.4%. The number of refugees ranges from 1 to 3,255,975; the distribution is skewed to the right, with most data points near the median of 17,800, so the maximum of over 3,000,000 refugees is extreme relative to the median. Total population is also right-skewed, with an extremely large standard deviation, meaning that the values are widely scattered.
Next, annual urban growth ranges from -0.8% to 18.7%, with most of the data around 3.5%, which tells us that some countries are urbanizing much faster than others. Population density ranges from 1.4 to 1,190 people per square km, so there are some very heavily and some very lightly populated countries, with most around 53.9 people per square km. The GINI index is a measure of inequality, with a score of 0 representing perfect equality and a score of 1 (or 100%) indicating perfect inequality. Among the countries in the data set, the GINI index ranges from 0.8 to 26.7, with most countries having an index around 5. The urban poverty rate varies from 1% to 61.5%, with the majority of data points clustered around 28%. This suggests that some countries have very high urban poverty rates and others very low rates, possibly due to differences in their urban population sizes. Both the Government Effectiveness Index and the Political Stability Index are generally on the lower side, with most scores negative. This indicates that many countries face challenges in terms of government effectiveness and political stability. Below is a summary table of the statistics for the panel data discussed above:
##
## Summary Statistics of Panel Data
## ======================================
## Statistic Mean Median St. Dev. Min Max
## ======================================
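The columns of the summary table (mean, median, standard deviation, minimum, maximum) can be reproduced in base R. The following is a minimal sketch, not the code used for the report: `summarise_numeric` is a hypothetical helper, and the toy data are invented. Missing values are dropped variable by variable.

```r
# Hypothetical helper: one row per numeric variable, NA values ignored
summarise_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  stats <- vapply(
    num,
    function(x) c(mean(x, na.rm = TRUE), median(x, na.rm = TRUE),
                  sd(x, na.rm = TRUE), min(x, na.rm = TRUE),
                  max(x, na.rm = TRUE)),
    FUN.VALUE = c(Mean = 0, Median = 0, SD = 0, Min = 0, Max = 0)
  )
  t(stats)  # variables in rows, statistics in columns
}

# Made-up example data; the 'country' column is skipped because it is not numeric
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  urban_growth = c(1, 2, 3, NA),
  gdp_pc = c(500, 1500, 2500, 3500)
)

summary_tab <- summarise_numeric(toy)
print(round(summary_tab, 2))
```

Comparing the mean and median columns gives a quick check on skewness: a mean well above the median points to a right-skewed variable.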
Many variables in the data set are correlated with one another, which could affect model performance in the regression analysis. The following correlation matrix shows which variables in the data set are correlated:
Firstly, the proportion of slums is negatively correlated with GDP
per capita in US dollars. Intuitively, this makes sense: a higher GDP
per capita means that the country and its population are wealthier, so
the proportion of slums should be smaller. Interestingly, the proportion
of slums is also negatively correlated with urban population percentage,
while it is positively correlated with urban growth. In rapidly
urbanizing regions, the influx of people to cities can outpace the
development of formal housing, so the proportion of slums rises as urban
areas expand.
Next, GDP per capita in USD is positively correlated with urban
population percentage and the HDI, meaning that increases in GDP per
capita are linked with increases in the urban population share and HDI.
Oddly, urban population percentage is negatively correlated with both
urban growth and population density. The negative correlation between
urban population and urban growth could mean that urbanization
eventually reaches a saturation point, where urban growth slows down
due to limited resources or a decrease in migration from rural to urban
areas. Finally, political stability and government effectiveness are
positively correlated, signifying that countries with strong political
stability typically have effective governments, and vice versa.
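Because roughly 29% of values are missing, a correlation matrix for data like these needs an explicit rule for incomplete rows. The sketch below uses synthetic data (the variable names and coefficients are assumptions chosen only to mimic the signs discussed above) and computes pairwise correlations using all rows available for each pair.

```r
set.seed(1)
n <- 200

# Synthetic variables built so that slums falls with GDP and rises with growth
gdp <- rnorm(n)
slums <- -0.6 * gdp + rnorm(n, sd = 0.5)
growth <- 0.5 * slums + rnorm(n, sd = 0.5)
toy <- data.frame(slums, gdp, growth)

# Knock out some values to mimic missing data
toy$gdp[sample(n, 20)] <- NA

# Each pairwise correlation uses every row where both variables are observed
corr <- cor(toy, use = "pairwise.complete.obs")
print(round(corr, 2))
```

With `use = "complete.obs"` instead, any row with a missing value in any variable would be dropped, which can shrink the sample considerably when many variables are included.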
To measure the causal impact of urban growth on the proportion of
slums, a fixed effects model was used, controlling for country
demographics. OLS models were also estimated so that their results can
be compared with those of the fixed effects models, showing how fixed
effects can remove bias and change the results.
The equation for the fixed effects model is
\[ \text{Slum Prevalence}_{it} = \beta_0 + \beta_1 \text{Urban Growth}_{it} + \gamma_{it} + \alpha_i + \lambda_t + \epsilon_{it} \]
where the dependent variable, slum prevalence, is the proportion of the urban population living in slums in country i at time t. The independent variable, urban growth, is the annual urban growth rate (in percent) of country i at time t. γ represents the country demographic controls, which are correlated with urban growth and affect the prevalence of slums: GDP per capita in US dollars, the infant mortality rate, HDI, urban population percentage, government effectiveness, and population density. α represents the individual (country) fixed effects, indexed by the country_id variable I created for every country, which absorb country characteristics that do not change over time. In addition, λ represents the time fixed effects, which capture time-specific factors that affect all countries in the panel equally.
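A two-way fixed effects specification of this form can be estimated in base R either with country and year dummies or by demeaning the data. The sketch below uses a simulated balanced panel (60 countries, 6 years; all values, the variable names `slums` and `urban_growth`, and the "true" coefficient of -2 are assumptions for illustration) and shows that both routes give the same slope.

```r
set.seed(42)

# Simulated balanced panel: 60 countries x 6 years (all values are synthetic)
n_country <- 60
n_year <- 6
panel <- expand.grid(country_id = seq_len(n_country), year = seq_len(n_year))

alpha <- rnorm(n_country)   # country fixed effects (alpha_i)
lambda <- rnorm(n_year)     # year fixed effects (lambda_t)
beta_true <- -2             # assumed "true" effect, chosen arbitrarily

panel$urban_growth <- 0.8 * alpha[panel$country_id] +
  0.5 * lambda[panel$year] + rnorm(nrow(panel))
panel$slums <- beta_true * panel$urban_growth +
  5 * alpha[panel$country_id] + 3 * lambda[panel$year] + rnorm(nrow(panel))

# (a) Least-squares dummy variables: one dummy per country and per year
fit_lsdv <- lm(slums ~ urban_growth + factor(country_id) + factor(year),
               data = panel)
beta_lsdv <- coef(fit_lsdv)[["urban_growth"]]

# (b) Two-way within transformation (exact only for a balanced panel)
demean2 <- function(v, i, t) v - ave(v, i) - ave(v, t) + mean(v)
y_dm <- demean2(panel$slums, panel$country_id, panel$year)
x_dm <- demean2(panel$urban_growth, panel$country_id, panel$year)
beta_within <- coef(lm(y_dm ~ x_dm - 1))[["x_dm"]]

print(c(lsdv = beta_lsdv, within = beta_within))
```

The demeaning shortcut in (b) relies on the panel being balanced; with missing observations, as in the report's data, the dummy-variable route or a dedicated panel package is the safer choice.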
Firstly, simple OLS is used to get a raw estimate of the effect of urban growth on the prevalence of slums without controls. As expected, the results differ from the fixed effects model. Based on the OLS results, a 1 percentage point increase in urban growth is associated with a 6.8 percentage point increase in the prevalence of slums. This is statistically significant at the 1% level. In this model, however, the residual standard error is relatively high (20.3), meaning that the observed data vary widely around the regression line. Additionally, urban growth does not necessarily mean there will be more slums: urban growth also entails new economic opportunities, infrastructure development, and government spending. Therefore, the coefficient in this model is likely biased by omitted variables.
The second OLS model controls for country demographics that are correlated with urban growth and affect the prevalence of slums: GDP per capita in US dollars, the infant mortality rate, HDI, urban population percentage, government effectiveness, and population density. Based on this model, a 1 percentage point increase in urban growth increases the prevalence of slums by 2.1 percentage points, holding all other variables constant. This is statistically significant at the 10% level. Here, the coefficient is much smaller; however, the results are still biased, as unobserved differences across countries and time are not accounted for.
The third model uses individual (country) fixed effects, without the demographic controls, to account for country-specific characteristics that remain constant over time. In this model, the coefficient decreases only modestly relative to the second OLS model (from 2.06 to 0.90) and remains significant at the 10% level, plausibly because all the countries in the sample are developing countries and are therefore relatively similar.
Model 4, a fixed effects model that controls for both demographics and country fixed effects, is where the coefficient flips from positive to negative (-0.79), although this estimate is not statistically significant. The sign change shows how accounting for both observed and time-invariant unobserved differences across the countries brings us closer to the true relationship.
Lastly, in model 5, which includes both individual and time fixed effects as well as the controls, the coefficient becomes even more negative, indicating that accounting for both time and individual fixed effects significantly changes the estimated relationship between slum prevalence and urban growth. This suggests that the initial positive relationship observed in the absence of fixed effects was likely driven by unobserved country- and time-specific factors; once these fixed effects are included, the estimated coefficient reverses substantially. Now, a 1 percentage point increase in urban growth decreases the prevalence of slums by 2.9 percentage points, holding all other variables constant. This is significant at the 5% level. Given that the sample follows countries over time, including time fixed effects is crucial. Notably, events that could affect urban growth rates in a particular period include economic booms or recessions and policy shifts that are common across countries at that time. By including time fixed effects, the model captures the influence of these period-specific events and controls for their potential impact on the observed outcomes.
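The sign reversal between pooled OLS and fixed effects is what omitted variable bias predicts when an unobserved country characteristic raises both urban growth and slum prevalence. The simulation below is a sketch of that mechanism only: all values are synthetic, the "true" effect of -2 is an arbitrary assumption, and nothing here uses the report's data.

```r
set.seed(7)

# Synthetic balanced panel: 60 countries x 6 years
n_country <- 60
n_year <- 6
panel <- expand.grid(country_id = seq_len(n_country), year = seq_len(n_year))

# Unobserved, time-invariant country trait that drives BOTH variables
alpha <- rnorm(n_country)[panel$country_id]

panel$urban_growth <- alpha + rnorm(nrow(panel), sd = 0.5)
# Assumed true effect of urban growth on slums is negative (-2)
panel$slums <- -2 * panel$urban_growth + 6 * alpha + rnorm(nrow(panel))

# Pooled OLS ignores alpha, so its slope absorbs the country trait
beta_pooled <- coef(lm(slums ~ urban_growth, data = panel))[["urban_growth"]]

# Country dummies remove alpha and recover the negative within-country effect
beta_fe <- coef(lm(slums ~ urban_growth + factor(country_id),
                   data = panel))[["urban_growth"]]

print(c(pooled = beta_pooled, fixed_effects = beta_fe))
```

In this setup the pooled slope is positive while the fixed effects slope is negative, the same pattern as the move from models 1 and 2 to models 4 and 5, though the simulation cannot say whether this mechanism is what drives the report's estimates.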
Based on its F-statistic, the final model 5 is jointly statistically significant at the 1% level, meaning that there is strong evidence to reject the null hypothesis that all slope coefficients are equal to zero, that is, that the model explains none of the variation in the data. Intuitively, it makes sense that an increase in urban growth decreases the prevalence of slums, since urban growth is associated with positive economic and demographic outcomes.
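The significance levels quoted for model 5 can be checked by hand from the reported statistics. The sketch below plugs in the values from the regression table (F = 7.730 on 10 and 42 degrees of freedom; urban growth coefficient -2.903 with standard error 1.235) and assumes the reported standard errors are conventional ones, so that the t-ratio can be compared against a t distribution with 42 degrees of freedom.

```r
# Values taken from the regression table for model 5
f_stat <- 7.730
df1 <- 10
df2 <- 42

# Upper-tail probability of the F distribution: p-value of the joint test
p_f <- pf(f_stat, df1, df2, lower.tail = FALSE)

# t-ratio and two-sided p-value for the urban growth coefficient
beta_hat <- -2.903
se_hat <- 1.235
t_stat <- beta_hat / se_hat
p_t <- 2 * pt(abs(t_stat), df = df2, lower.tail = FALSE)

print(c(F_p_value = p_f, t_statistic = t_stat, t_p_value = p_t))
```

The F-test p-value falls below 0.01, and the t-test p-value falls between 0.01 and 0.05, matching the three-star and two-star markers in the table.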
The table below reports the five models discussed above:
## Warning in pdata.frame(df_panel, index = c("country_id", "year")): at least one NA in at least one index dimension in resulting pdata.frame
## to find out which, use, e.g., table(index(your_pdataframe), useNA = "ifany")
##
## Summary Statistics Table of OLS and Fixed Effects Models
## =====================================================================================================================================================
##                                                                  Dependent variable:
##                           ---------------------------------------------------------------------------------------------------------------------------
##                                                                   proportion_slums
##                           OLS                                               panel
##                                                                             linear
##                           (1)                      (2)                      (3)                      (4)                      (5)
## -----------------------------------------------------------------------------------------------------------------------------------------------------
## Urban Growth %            6.798***                 2.056*                   0.900*                   -0.794                   -2.903**
##                           (0.739)                  (1.077)                  (0.542)                  (1.695)                  (1.235)
## GDP per capita USD                                 -0.002***                                         -0.002                   -0.002
##                                                    (0.001)                                           (0.002)                  (0.001)
## Infant Mortality Rate                              0.403***                                          0.524*                   0.465*
##                                                    (0.111)                                           (0.311)                  (0.232)
## HDI                                                -6.879                                            48.331                   -30.568
##                                                    (32.403)                                          (37.237)                 (29.589)
## Urban Population %                                 0.328**                                           0.525                    -0.749
##                                                    (0.158)                                           (0.799)                  (0.619)
## Government Effectiveness                           -10.953***                                        0.578                    5.846
##                                                    (3.286)                                           (5.905)                  (4.731)
## Population Density                                 0.021**                                           -0.065                   0.013
##                                                    (0.008)                                           (0.109)                  (0.079)
## Constant                  29.271***                9.613
##                           (2.847)                  (20.302)
## -----------------------------------------------------------------------------------------------------------------------------------------------------
## Ind. FE                   NO                       NO                       YES                      YES                      YES
## Time FE                   NO                       NO                       NO                       NO                       YES
## Observations              310                      78                       310                      78                       78
## R2                        0.215                    0.728                    0.011                    0.239                    0.648
## Adjusted R2               0.213                    0.701                    -0.223                   -0.302                   0.355
## Residual Std. Error       20.304 (df = 308)        13.047 (df = 70)
## F Statistic               84.531*** (df = 1; 308)  26.793*** (df = 7; 70)   2.759* (df = 1; 250)     2.018* (df = 7; 45)      7.730*** (df = 10; 42)
## =====================================================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This report provides evidence that countries that experience urban growth will likely observe a decrease in the prevalence of slums. These findings could be important in building urban development policies, such as government investment in urban planning to accommodate urban population growth. For instance, investing in sanitation, healthcare, and affordable housing projects could help reduce the proportion of slums.