Internal migration driven by economic conditions is a well-known phenomenon, observed in countries all across the world. In this analysis, we use the ‘county’ data set (available in the ‘usdata’ package for R) to visualize this phenomenon in the context of the United States and to corroborate it. We will also attempt, to a certain degree, to explore and identify key factors that encourage such behavior.
The ‘county’ data set contains data for 3142 counties in the United States. Among other variables, it contains the population in the years 2000, 2010, and 2017, population change, median household income, unemployment rate, poverty rate, and whether or not the county contains a metropolitan area. For our analysis, we look at data for the states of Alabama, Texas, California, Alaska, New Jersey and Colorado.
We first create histograms of the median household income, partitioned by whether the county lost or gained population. From this initial analysis, we can notice a clear difference between the two distributions. The median for counties that gained population looks considerably higher than the median for counties that lost population. Both distributions are right-skewed, implying that a small share of counties sits in the upper quantiles of the income distribution, which is consistent with existing theories of income distribution. However, the fatter tail of the distribution for counties that gained population indicates that a larger share of those counties lies in the upper quantiles, relative to counties that lost population.
As the distributions above suggest, the median household income for counties that gained population ($52,716.5) is far greater than for counties that lost population ($43,712). The standard deviation of median household income is also larger for counties that gained population, indicating a wider spread of incomes across those counties.
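The group comparison above can be sketched in a few lines. The example below is only an illustration in Python using the standard library, not the code behind this analysis: the rows are made-up toy values rather than figures from the ‘county’ data set, and the field names (pop_change, median_hh_income) and the choice to count zero change as a loss are assumptions.

```python
from statistics import median, stdev

# Hypothetical toy rows (NOT values from the 'county' data set):
# (pop_change in percent, median_hh_income in dollars)
rows = [
    (-1.2, 38000),
    (-0.4, 42000),
    (-2.1, 45000),
    (0.8, 47000),
    (1.5, 52000),
    (3.0, 61000),
    (5.2, 78000),
]

def summarize(rows):
    """Return {group: (median, sample standard deviation)} of income."""
    groups = {"gain": [], "loss": []}
    for pop_change, income in rows:
        # Assumption: a county with exactly zero change is counted as 'loss'.
        groups["gain" if pop_change > 0 else "loss"].append(income)
    return {name: (median(vals), stdev(vals)) for name, vals in groups.items()}

summary = summarize(rows)
for name, (med, sd) in summary.items():
    print(f"{name}: median={med:,.0f}, sd={sd:,.0f}")
```

With the toy rows, the gain group has both the higher median and the larger standard deviation, which is the pattern described above for the real data.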
Box plots are another useful way to visualize the heterogeneity in median household incomes between the counties. The box plots below support our hypothesis that there is a marked difference in median household incomes between counties that gained population and counties that lost population.
The height of each box corresponds to the interquartile range (IQR); the larger the IQR, the greater the variability in the data. The box for counties that gained population is larger than the box for counties that lost population. This supports the earlier observation that median household income is more widely spread among counties that gained population than among those that lost population.
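As a rough illustration of what the box heights measure, the IQR can be computed directly. This is a minimal Python sketch on made-up toy incomes, not values from the ‘county’ data set. Quartile conventions differ between tools, so the numbers produced by a plotting library may differ slightly from this calculation.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    # method="inclusive" treats the data as the full population of interest;
    # other quartile conventions can give slightly different results.
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical toy median household incomes (dollars), for illustration only.
gain = [47000, 52000, 61000, 78000]
loss = [38000, 42000, 45000, 46000]

print("IQR, gained population:", iqr(gain))
print("IQR, lost population:", iqr(loss))
```

A larger IQR means the middle half of the values is more spread out. It is a different measure from the variance, although the two usually move together.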
We now break the data down further by the following 4 conditions:
1. The county does not contain a metro, and lost population
2. The county contains a metro, and lost population
3. The county does not contain a metro, and gained population
4. The county contains a metro, and also gained population
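Crossing the two yes/no attributes (metro and population gain) gives four groups. The sketch below shows one way to label each county and compare group medians. It is a hedged Python illustration on made-up toy rows. The field layout, the "yes"/"no" coding of the metro flag, and the helper names classify and median_income_by_group are assumptions introduced for this example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy rows (NOT from the 'county' data set):
# (metro flag "yes"/"no", pop_change in percent, median_hh_income in dollars)
rows = [
    ("no", -1.5, 36000),
    ("no", -0.3, 39000),
    ("yes", -0.8, 44000),
    ("yes", -0.2, 46000),
    ("no", 1.1, 45000),
    ("no", 2.4, 49000),
    ("yes", 3.5, 58000),
    ("yes", 6.0, 72000),
]

def classify(metro, pop_change):
    """Label a county by metro status and direction of population change."""
    place = "metro" if metro == "yes" else "no metro"
    # Assumption: exactly zero change is treated as 'lost'.
    trend = "gained" if pop_change > 0 else "lost"
    return f"{place}, {trend}"

def median_income_by_group(rows):
    """Median of median household income within each of the four groups."""
    groups = defaultdict(list)
    for metro, pop_change, income in rows:
        groups[classify(metro, pop_change)].append(income)
    return {label: median(vals) for label, vals in groups.items()}

result = median_income_by_group(rows)
for label in sorted(result):
    print(f"{label}: {result[label]:,.0f}")
```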
Let’s look at the histograms to verify consistency with the box plots created above.
The histograms are consistent with our findings from the box plots. In summary, we have the following key observations:
1. Counties that did not contain a metro had, on average, a lower median household income than counties that did.
2. Counties that gained population also had, on average, a higher median household income than counties that lost population.
3. The counties with the highest median household income are those that both contained a metro and gained population.
4. Conversely, the counties with the lowest median household income had no metro and also lost population.
Education adds another dimension to our analysis. We would like to identify whether any relationship exists between an individual’s education level and their probability of migrating to another county. We can create ridge plots to look for such relationships.
The ridge plots above further support our hypothesis. We can immediately notice that counties that lost population had no representation of individuals with bachelor’s degrees in the median household income distribution, whereas counties that gained population had a larger share of that distribution made up of individuals with bachelor’s degrees.
But do higher levels of education correspond to an increase in median household income?
We can answer this question using a simple box plot:
As suspected, there is a strong association between education level and median household income: higher education levels correspond to higher median household incomes. This finding is again consistent with the existing literature in labor economics, which states that higher levels of education are expected to be associated with higher levels of productivity and, consequently, higher income levels (https://www.pc.gov.au/research/supporting/education-health-effects-wages/education-health-effects-wages.pdf).
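The comparison across education levels can also be checked numerically by computing the median income within each level and testing whether it rises from one level to the next. The following Python sketch uses made-up toy rows. The level names in EDU_ORDER and their ordering are assumptions for illustration and may not match the labels used in the data set.

```python
from statistics import median

# Assumed ordering of education levels, lowest to highest (hypothetical labels).
EDU_ORDER = ["below_hs", "hs_diploma", "some_college", "bachelors"]

# Hypothetical toy rows (NOT from the 'county' data set):
# (education level, median_hh_income in dollars)
rows = [
    ("hs_diploma", 39000), ("hs_diploma", 43000), ("hs_diploma", 41000),
    ("some_college", 47000), ("some_college", 52000), ("some_college", 50000),
    ("bachelors", 68000), ("bachelors", 81000),
]

def median_income_by_edu(rows):
    """Return [(level, median income)] in EDU_ORDER, skipping absent levels."""
    by_level = {}
    for level, income in rows:
        by_level.setdefault(level, []).append(income)
    return [(lvl, median(by_level[lvl])) for lvl in EDU_ORDER if lvl in by_level]

medians = median_income_by_edu(rows)
# True when each level's median exceeds that of the level below it.
is_increasing = all(a[1] < b[1] for a, b in zip(medians, medians[1:]))

for level, med in medians:
    print(f"{level}: {med:,.0f}")
print("strictly increasing:", is_increasing)
```

A monotone rise in group medians is a descriptive pattern only. It does not by itself establish that education causes the higher income.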
From the above analysis, we can draw a preliminary conclusion that counties with low median household incomes and no metro were likely to lose population. On the other hand, counties with higher median household incomes and a metro were likely to gain population. This fits well with the economic theory that labor will migrate (flow) from regions of low economic activity to those of higher economic activity.
This simple yet powerful approach of using visualizations has allowed us to corroborate our a priori beliefs regarding this phenomenon. Further analysis using more advanced statistical techniques, such as factor analysis and logistic regression, among others, may help us drill down and identify the latent variables that motivate such diffusion of labor.
The results of this analysis can be used to further investigate the causes of labor flow within a given geographic area. While other factors, such as the Gini coefficient, GDP per capita, and ease of doing business, may help create more accurate predictive models of such diffusion, the purpose of this initial data analysis is to understand and confirm the existence of the issue rather than to solve it. This is all the more important for a country like Bangladesh, where there is a seemingly high inflow of labor from regions with low economic activity (rural areas) to those with higher activity (metropolitan cities such as Dhaka).
Such diffusion of labor puts enormous pressure on a city’s infrastructure, and understanding its rate would help policy makers, planners, and other key stakeholders appropriately allocate the resources needed to sustain existing infrastructure.