Description

Internal migration is the movement of households from one address to another within the same town, county, or state, or between states, without leaving the country. The U.S. Census Bureau estimates that about 14% of the people living in the U.S. move within the country each year (as of Oct 4, 2019).

In this analysis, we use the ‘county’ data set (available in the ‘usdata’ package for R) to visualize this phenomenon in the context of the United States and to assess its validity. We also aim, to a certain degree, to explore and identify key factors that encourage such behavior.

Task-1: Load the relevant packages and create clusters

Load the relevant packages: “usdata”, “openintro”, “tidyverse”. The “county” dataset can be found in both “usdata” and “openintro” packages; load the dataset from either package. Once you have loaded the dataset, create 2 more clusters with each containing 6 States. Call them Cluster_2, Cluster_3. The clusters should be sampled Without Replacement, i.e., each cluster should contain unique States, and no State should be repeated twice across the clusters.

Answer-1:

  • The usdata R package: Demographic data on the United States at the county and state levels spanning multiple years.

  • The openintro R package: for data and custom functions with the OpenIntro resources.

  • The tidyverse “umbrella” package which houses a suite of many different R packages: for data wrangling and data visualization.

  • The ggplot2 package: a system for declaratively creating graphics, based on The Grammar of Graphics.

  • The sf package: provides a table format for simple features, where feature geometries are carried in a list-column.

  • The ggsn package: provides north symbols and scale bars for maps created with ‘ggplot2’ or ‘ggmap’.

  • The ggpubr package: provides some easy-to-use functions for creating and customizing ‘ggplot2’-based publication-ready plots.

  • The ggridges package provides two main geoms, geom_ridgeline and geom_density_ridges. The former takes height values directly to draw ridgelines, and the latter first estimates data densities and then draws those using ridgelines.

library(usdata)
library(openintro)
library(tidyverse)
library(ggplot2)              
library(sf)
library(ggsn)
library(ggpubr)
library(ggridges)

The following command discourages R from printing numbers in scientific notation, so that large values such as populations and incomes are shown in full.

options(scipen = 999) 
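As a small illustration of what `scipen` changes, the sketch below formats the same number under R's default setting and under `scipen = 999`. The value `1e10` is an arbitrary example, not a figure from the dataset.

```r
# Save the current options so they can be restored afterwards
old <- options(scipen = 0)      # R's default penalty
default_fmt <- format(1e10)     # scientific notation is shorter, so R uses it

options(scipen = 999)           # heavily penalise scientific notation
fixed_fmt <- format(1e10)       # the number is now written out in full

options(old)                    # restore the previous setting

default_fmt   # "1e+10"
fixed_fmt     # "10000000000"
```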

US population census data

The “county” data set contains data for 3142 counties in the United States. Among other variables, it contains populations in the years 2000, 2010, and 2017, population change, median household income, unemployment rate, poverty rate, and whether or not the county contains a metropolitan area. For our analysis, we take the states of Alabama, Texas, California, Alaska, New Jersey, and Colorado as Cluster_1. We will also draw two additional clusters (Cluster_2 and Cluster_3), with no state shared between clusters, using Sampling Without Replacement (SWOR). These data were collected from Census Quick Facts.

data("county")

Cluster selection

Step-1: First, let’s create Cluster_1, for which the cluster units have already been provided.

Cluster_1 <-county %>%
  filter(state %in% c("Alabama", "Texas", "California", 
                       "Alaska", "New Jersey", "Colorado"))
unique(Cluster_1$state)

Step-2:

Randomly choose 6 states, without replacement, for Cluster_2, excluding the states already in Cluster_1.

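# Sample 6 states (not counties) from those not already in Cluster_1
Cluster_2_states <- setdiff(unique(as.character(county$state)),
                            c("Alabama", "Texas", "California",
                              "Alaska", "New Jersey", "Colorado")) %>%
  sample(size = 6, replace = FALSE)

# Keep every county in the sampled states
Cluster_2 <- county %>%
  filter(state %in% Cluster_2_states)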

unique(Cluster_2$state)

Step-3:

Randomly choose 6 further states, without replacement, for Cluster_3, excluding the states already in Cluster_1 and Cluster_2.

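# Sample 6 states from those not already in Cluster_1 or Cluster_2
Cluster_3_states <- setdiff(unique(as.character(county$state)),
                            c("Alabama", "Texas", "California",
                              "Alaska", "New Jersey", "Colorado",
                              "Wyoming", "New York", "Mississippi",
                              "Indiana", "Montana", "Kansas")) %>%
  sample(size = 6, replace = FALSE)

# Keep every county in the sampled states
Cluster_3 <- county %>%
  filter(state %in% Cluster_3_states)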

unique(Cluster_3$state)

I selected the cluster units randomly, so rerunning the code produces different cluster units each time. To keep the analysis consistent, I stored the results of one draw. The following analysis uses Cluster_2 (Wyoming, New York, Mississippi, Indiana, Montana, and Kansas) and Cluster_3 (West Virginia, New Hampshire, Pennsylvania, Rhode Island, and Minnesota, plus Colorado as coded in the analysis below, which also belongs to Cluster_1).
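One possible way to make the random draw repeatable, rather than storing the result by hand, is to fix the random seed and draw both clusters in a single call. The sketch below is an illustration under assumptions: it uses R's built-in `state.name` vector (50 states) as a stand-in for the states in `county`, and the seed value 2022 is arbitrary, so it will not reproduce the clusters listed above.

```r
# States already assigned to Cluster_1 (taken from the text)
cluster1_states <- c("Alabama", "Texas", "California",
                     "Alaska", "New Jersey", "Colorado")

# Assumption: state.name (built into R) stands in for unique(county$state)
remaining_states <- setdiff(state.name, cluster1_states)

set.seed(2022)  # arbitrary seed, chosen only so that reruns give the same draw

# Draw 12 distinct states at once, then split them into two groups of 6;
# a single draw without replacement guarantees the clusters cannot overlap
drawn <- sample(remaining_states, size = 12, replace = FALSE)
cluster2_states <- drawn[1:6]
cluster3_states <- drawn[7:12]

cluster2_states
cluster3_states
```

The county rows for a cluster could then be selected with, for example, `filter(state %in% cluster2_states)`.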

GIS Mapping for the three clusters (Extra work)

To map the clusters, I downloaded shapefiles of U.S. state boundaries from the publicly available Opendatasoft website.

rm(list = ls())

mapshape <- st_read("us-state-boundaries.shp")
## Reading layer `us-state-boundaries' from data source 
##   `C:\Nasif\BIGM COURSE\2022\Assignment\Assginment-3\us-state-boundaries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS:  WGS 84
names(mapshape)[5]<-paste("state")

mapshape$ComCode<-rep(1, 56)

state<-c("Alabama", "Texas", "California", 
          "Alaska", "New Jersey", "Colorado",
          "Wyoming", "New York", "Mississippi",
           "Indiana", "Montana", "Kansas",
         "West Virginia", "New Hampshire", "Rhode Island",
        "Minnesota", "Pennsylvania", "Colorado")

Cluster<-rep(1:3,each=6)

Cluster<-data.frame(state, Cluster)

mapshape<-merge(mapshape, Cluster, by="state", all.x=TRUE)

names(mapshape)[1]<-paste("Study Area")

Cluster1<-mapshape%>%
  filter(`Study Area` %in% c("Alabama", "Texas", "California", 
                            "Alaska", "New Jersey", "Colorado"))

Cluster2<-mapshape%>%
  filter(`Study Area` %in% c("Wyoming", "New York", "Mississippi",
                              "Indiana", "Montana", "Kansas"))

Cluster3<-mapshape%>%
  filter(`Study Area` %in% c("West Virginia", "New Hampshire", "Rhode Island",
                                "Minnesota", "Pennsylvania", "Colorado"))


Cluster1_map<-mapshape%>%
  ggplot()+
  geom_sf(aes(fill="ComCode"), color="black")+
  geom_sf(data=Cluster1, aes(fill = "value"),color="Red")+
  scale_fill_manual(values = c("white", "green"), name= "value")+
  xlim(125, 68)+ylim(22, 50)+
  scalebar( st.size = 3, dist = 500,
            dist_unit = "km",transform = TRUE,location = "bottomright",
            #model = "WGS84", 
            border.size = 0.4,
            #specify based on minimum and maximum coordinates of map
            x.min=70, x.max=120, y.min=23, y.max= 45)+
  labs(title = "Cluster 1" ,x="Longitude", y = "Latitude")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="none")


Cluster2_map<-mapshape%>%
  ggplot()+
  geom_sf(aes(fill="ComCode"), color="black")+
  geom_sf(data=Cluster2, aes(fill = "value"),color="Red")+
  scale_fill_manual(values = c("white", "orange"), name= "value")+
  xlim(125, 68)+ylim(22, 50)+
  scalebar( st.size = 3, dist = 500,
            dist_unit = "km",transform = TRUE,location = "bottomright",
            #model = "WGS84", 
            border.size = 0.4,
            #specify based on minimum and maximum coordinates of map
            x.min=70, x.max=120, y.min=23, y.max= 45)+
  labs(title = "Cluster 2" ,x="Longitude", y = "Latitude")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="none")


Cluster3_map<-mapshape%>%
  ggplot()+
  geom_sf(aes(fill="ComCode"), color="black")+
  geom_sf(data=Cluster3, aes(fill = "value"),color="Red")+
  scale_fill_manual(values = c("white", "blue"), name= "value")+
  xlim(125, 68)+ylim(22, 50)+
  scalebar( st.size = 3, dist = 500,
            dist_unit = "km",transform = TRUE,location = "bottomright",
            #model = "WGS84", 
            border.size = 0.4,
            #specify based on minimum and maximum coordinates of map
            x.min=70, x.max=120, y.min=23, y.max= 45)+
  labs(title = "Cluster 3" ,x="Longitude", y = "Latitude")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="none")

Cluster1_map

Cluster2_map

Cluster3_map
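The `merge(..., all.x = TRUE)` call used above behaves like a left join: states without a cluster label receive `NA`, and a state listed twice in the lookup table is duplicated in the result. A minimal sketch with made-up toy tables (the `area` column and the three-state subset are hypothetical) shows both effects:

```r
# Toy stand-in for the shapefile attribute table (hypothetical values)
shapes <- data.frame(state = c("Alabama", "Colorado", "Ohio"),
                     area  = c(1, 2, 3))

# Toy lookup table in which Colorado is listed under two clusters
clusters <- data.frame(state   = c("Alabama", "Colorado", "Colorado"),
                       Cluster = c(1, 1, 3))

# Left join: keep every row of `shapes`, matched or not
merged <- merge(shapes, clusters, by = "state", all.x = TRUE)

merged
nrow(merged)                 # 4 rows: Colorado appears twice
sum(is.na(merged$Cluster))   # 1 unmatched state (Ohio)
```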

The following analysis continues only for Cluster_2 and Cluster_3 because a similar analysis has already been done for Cluster_1 in the demo of Assignment 1. However, we will compare the results from Cluster_2 and Cluster_3 with those from Cluster_1.

Task 2: Create new variable(s)

Create a new variable called “Population_Change”. This variable should be categorical (factor) in nature, and it should indicate whether the population change in the county is positive (Gain) or negative (No-Gain).

Answer-2:

I have created two new data frames, Cluster2_update and Cluster3_update. In each of them, I have also created a new variable, Population_Change, as a factor derived from pop_change; counties with zero or positive change are coded as Gain and counties with negative change as No-Gain.

Cluster2_update<-county %>%
  filter(state %in% c("Wyoming", "New York", "Mississippi", 
                      "Indiana", "Montana", "Kansas"))%>%
  mutate(Population_Change=factor(if_else(pop_change>=0, "Gain", "No-Gain")))


Cluster3_update<-county %>%
  filter(state %in% c("West Virginia", "New Hampshire", "Pennsylvania", 
                      "Rhode Island", "Minnesota", "Colorado"))%>%
  mutate(Population_Change=factor(if_else(pop_change>=0, "Gain", "No-Gain")))
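To make the coding rule explicit, the sketch below applies the same threshold to a few made-up `pop_change` values. It uses base R's `ifelse()` as a stand-in for `dplyr::if_else()` so that it runs without extra packages; the numbers are illustrative only.

```r
# Hypothetical population-change values, including zero and a missing value
pop_change <- c(2.5, -1.3, 0, NA, -0.1)

# Same rule as above: zero or positive change is "Gain", negative is "No-Gain"
Population_Change <- factor(ifelse(pop_change >= 0, "Gain", "No-Gain"))

Population_Change
levels(Population_Change)                    # "Gain" "No-Gain"
table(Population_Change, useNA = "ifany")    # counts per level, plus missing
```

Under this rule a county with exactly zero change is labelled Gain, and a missing `pop_change` stays missing.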

Task 3: Perform an Initial Data Investigation on both clusters

Use the format in the analysis report provided above, to perform an Initial Data Investigation. Create boxplots, histograms and other appropriate visualizations to identify relationships between interesting variables. Make sure to comment on what the variables might be, and use theory to justify your selection.

Answer-3:

Handling missing observation

Before the initial data analysis, we should check for missing observations in the dataset. If a row contains a missing cell, we omit that row.

sum(is.na(Cluster2_update))
## [1] 81
sum(is.na(Cluster3_update))
## [1] 95

The datasets “Cluster2_update” and “Cluster3_update” contain 81 and 95 missing cells, respectively.

Cluster2_update<-Cluster2_update[complete.cases(Cluster2_update),]
Cluster3_update<-Cluster3_update[complete.cases(Cluster3_update),]

# Check again total missing observation

sum(is.na(Cluster2_update)) # Missing observation is 0
## [1] 0
sum(is.na(Cluster3_update)) # Missing observation is 0
## [1] 0

Now that both datasets contain only complete cases, we can proceed with further analysis.
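Note that `sum(is.na(x))` counts missing cells, whereas `complete.cases()` works on rows, so the two numbers differ whenever a row has more than one missing value. A small sketch on a made-up data frame (column names borrowed from `county`, values hypothetical):

```r
# Toy data frame: row 2 has two missing cells, row 4 has one
toy <- data.frame(
  median_hh_income = c(50000, NA, 42000, NA),
  pop_change       = c(1.2, NA, -0.4, 0.3)
)

sum(is.na(toy))              # 3 missing cells in total
colSums(is.na(toy))          # missing cells per column
sum(!complete.cases(toy))    # 2 rows contain at least one missing value

# Keep only rows with no missing values
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)           # 2 complete rows remain
```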

Histograms

In both clusters, the histogram for counties that gained population is lower than the one for counties that lost population, because fewer counties gained population (108 vs 231 in Cluster 2; 92 vs 102 in Cluster 3). Both distributions are right-skewed, implying that a small percentage of the population are in the upper quantile of the income distribution, which is consistent with existing income distribution theories. In Cluster 2, the fatter tail of the distribution for counties with population gains indicates that a larger share of the population in those counties is in the upper quantile of the distribution, relative to counties that witnessed a loss in population. The income distribution of Cluster 3 is shifted to the right relative to Cluster 2, which indicates a higher upper quantile of the income distribution.

C2<-Cluster2_update %>%
  ggplot(aes(x=median_hh_income, fill=Population_Change, color=Population_Change))+
  geom_histogram(position = "identity", alpha=0.5)+
  labs(title = "Cluster 2", x="Household Income $ (Median)", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="bottom")

summary(Cluster2_update$Population_Change)
##    Gain No-Gain 
##     108     231
C3<-Cluster3_update %>%
  ggplot(aes(x=median_hh_income, fill=Population_Change, color=Population_Change))+
  geom_histogram(position = "identity", alpha=0.5)+
  labs(title = "Cluster 3", x="Household Income $ (Median)", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="bottom")

summary(Cluster3_update$Population_Change)
##    Gain No-Gain 
##      92     102
ggarrange(C2,C3, nrow = 1, ncol = 2)

Cluster 2

By visual inspection of the histogram above and the summary of Population_Change (108 vs 231), fewer counties gained population than lost it (the opposite of Cluster_1).

However, the following summary shows that the median household income for counties that saw population increases is higher ($51,948) than for those that saw population loss ($46,666). The standard deviation of median household income is also larger for counties that gained population than for counties that lost population, indicating a higher variance in the income distribution (similar to the findings for Cluster_1). This is consistent with the Cluster_2 histogram, where the distribution for counties that gained population is shifted further to the right than the one for counties that did not. Extreme values may also inflate the standard deviation, although the median is largely unaffected by them.

Cluster2_update %>%
  group_by(Population_Change) %>%
  summarise(median=median(median_hh_income), std_dev=sd(median_hh_income))
## # A tibble: 2 x 3
##   Population_Change median std_dev
##   <fct>              <dbl>   <dbl>
## 1 Gain               51948  11131.
## 2 No-Gain            46666   9460.

Cluster 3

By visual inspection of the histogram above and the summary of Population_Change (92 vs 102), fewer counties gained population than lost it (the opposite of Cluster_1).

However, the following summary shows that the median household income for counties that saw population increases is far higher ($59,530) than for those that saw population loss ($46,209). The standard deviation of median household income is also larger for counties that gained population than for counties that lost population, indicating a higher variance in the income distribution (similar to the findings for Cluster_1). This is consistent with the Cluster_3 histogram, where the distribution for counties that gained population is right-skewed while the one for counties that did not is left-skewed. Extreme values may also inflate the standard deviation, although the median is largely unaffected by them.

Cluster3_update %>%
  group_by(Population_Change) %>%
  summarise(median=median(median_hh_income), std_dev=sd(median_hh_income))
## # A tibble: 2 x 3
##   Population_Change median std_dev
##   <fct>              <dbl>   <dbl>
## 1 Gain               59530  15054.
## 2 No-Gain            46209   8381.

Boxplots for Median Household Income Distribution

Box plots are another great way to visualize the heterogeneity in median household incomes between the counties. The box plots below support our hypothesis that median household incomes differ between counties that gained and lost population.

C2<-Cluster2_update %>%
  ggplot(aes(x=median_hh_income, y= Population_Change,
             color=Population_Change))+
  geom_boxplot()+
  labs(title = "Cluster 2", x="Household Income $ (Median)", y="Change in population")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="right")

C2

C3<-Cluster3_update %>%
  ggplot(aes(x=median_hh_income, y= Population_Change,
             color=Population_Change))+
  geom_boxplot()+
  labs(title = "Cluster 3", x="Household Income $ (Median)", y="Change in population")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="right")
C3

The size of each box corresponds to the IQR; the bigger the IQR, the greater the variability in the data. In both Cluster_2 and Cluster_3, the box for counties that gained population is wider than the box for counties that lost population. This reaffirms that there is greater variance in the income distribution for counties that gained population than for those that lost population (similar to the findings for Cluster_1). No-Gain in Cluster 2 has outliers beyond both the upper and lower bounds, whereas Cluster 1 and Cluster 3 exhibit only upper-bound outliers for both Gain and No-Gain.
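The box sizes and outlier points can also be read off numerically. The sketch below computes the IQR per group and the points that base R's `boxplot.stats()` would flag as outliers (beyond 1.5 times the box length from the hinges), using made-up incomes in thousands of dollars rather than the cluster data.

```r
# Hypothetical median household incomes (in $1000s) for two groups
income <- c(40, 45, 50, 55, 60, 70, 95,    # "Gain" counties
            42, 44, 46, 47, 48, 50, 52)    # "No-Gain" counties
group  <- factor(rep(c("Gain", "No-Gain"), each = 7))

# Interquartile range per group: a larger IQR means a bigger box
iqr_by_group <- tapply(income, group, IQR)
iqr_by_group

# Values a box plot would draw as individual outlier points
outliers_gain    <- boxplot.stats(income[group == "Gain"])$out
outliers_no_gain <- boxplot.stats(income[group == "No-Gain"])$out
outliers_gain       # 95
outliers_no_gain    # none
```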

We now further break down the data by the following 4 conditions:

1. The county does not contain a metro, and lost population

2. The county contains a metro, and lost population

3. The county does not contain a metro, and gained population

4. The county contains a metro, and also gained population
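These four conditions are the cells of a two-by-two cross-tabulation of metro status against population change. A minimal sketch on a hypothetical six-county table (values invented for illustration):

```r
# Hypothetical counties: metro status and direction of population change
toy <- data.frame(
  metro             = c("yes", "yes", "no", "no", "no", "yes"),
  Population_Change = c("Gain", "No-Gain", "No-Gain", "Gain", "No-Gain", "Gain")
)

# Each cell counts the counties falling into one of the four conditions
conditions <- table(metro = toy$metro,
                    Population_Change = toy$Population_Change)
conditions
```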

C2<-Cluster2_update %>%
  ggplot(aes(x=median_hh_income, y= Population_Change,
             color=Population_Change))+
  geom_boxplot()+
  facet_grid(~metro, labeller = label_both)+
  labs(title = "Cluster 2", x="Household Income $ (Median)", y="Change in population")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="bottom")

C2

C3<-Cluster3_update %>%
  ggplot(aes(x=median_hh_income, y= Population_Change,
             color=Population_Change))+
  geom_boxplot()+
  facet_grid(~metro, labeller = label_both)+
  labs(title = "Cluster 3", x="Household Income $ (Median)", y="Change in population")+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(legend.position="bottom")

C3

Metro counties gained more population in all three clusters. Cluster 3 shows comparatively fewer outliers.

Let’s have a look at histograms, to verify consistency with the box plots created above.

Histograms to describe Median Household Income

Cluster2_update %>%
  ggplot(aes(x=median_hh_income,fill= Population_Change,
             color=Population_Change))+
  geom_histogram(position="identity")+
  facet_grid(Population_Change~metro, labeller=labeller(.cols=label_both))+
  labs(title = "Cluster 2", x="Household Income $ (Median)", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))

Cluster3_update %>%
  ggplot(aes(x=median_hh_income,fill= Population_Change,
             color=Population_Change))+
  geom_histogram(position="identity")+
  facet_grid(Population_Change~metro, labeller=labeller(.cols=label_both))+
  labs(title = "Cluster 3", x="Household Income $ (Median)", y="Count")+
  theme(plot.title = element_text(hjust = 0.5))

Both histograms, for ‘Cluster 2’ and ‘Cluster 3’, are consistent with our findings from the box plots. In summary, we have the following key observations:

1. Counties that did not contain a metro had, on average, a lower median household income than counties that did.

2. Counties that gained population also had, on average, a higher median household income than counties that lost population.

3. Counties with the highest median household income are the ones that both contained a metro and saw a gain in population.

4. Conversely, counties with the lowest median income had no metro and also witnessed a loss in population.

The results are consistent with Cluster_1.

Median Household Income by Education

Education adds another dimension to our analysis. We would like to identify whether there is any relationship between education level and the probability of migration to another county. We can create ridge plots to identify such relationships.

Cluster2_update %>%
  ggplot(aes(x=median_hh_income, y=median_edu, fill=median_edu))+
  geom_density_ridges()+
  facet_grid(Population_Change~metro, labeller=labeller(.cols=label_both))+
  labs(title = "Cluster 2", x="Household Income $ (Median)", y="Median Education Level")+
  scale_fill_viridis_d()+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5))

Cluster 2

The ridge plots above further reaffirm our hypothesis. We can immediately notice that counties whose median education level is a Bachelor’s degree show a bimodal median household income distribution, and that they appear only among metro counties that gained population.

Cluster3_update %>%
  ggplot(aes(x=median_hh_income, y=median_edu, fill=median_edu))+
  geom_density_ridges()+
  facet_grid(Population_Change~metro, labeller=labeller(.cols=label_both))+
  labs(title = "Cluster 3", x="Household Income $ (Median)", y="Median Education Level")+
  scale_fill_viridis_d()+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5))

Cluster 3

The ridge plots above further reaffirm our hypothesis. We can immediately notice that counties that saw population loss had no representation of Bachelor’s degrees in the median household income distribution, whereas counties that saw population gain had a larger share of the distribution made up of Bachelor’s degrees (consistent with the results for Cluster_1). Surprisingly, the Bachelor’s degree distribution for metro counties that gained population shows a trimodal pattern.

But do higher levels of education correspond to an increase in median household income?

We can answer this question using a simple Box Plot:

Cluster2_update %>%
  ggplot(aes(x=median_hh_income, y=median_edu, color=median_edu))+
  geom_boxplot()+
  labs(title = "Cluster 2", x="Household Income $ (Median)", y="Median Education Level")+
  theme(plot.title = element_text(hjust = 0.5))

Cluster 2

As suspected, there is a clear association between education level and median household income: a higher education level corresponds to a higher median household income (similar to the findings for Cluster_1). The median of median household income lies close to the 1st quartile, and the interquartile range is very narrow. No county has a median education level below high school.

Cluster3_update %>%
  ggplot(aes(x=median_hh_income, y=median_edu, color=median_edu))+
  geom_boxplot()+
  labs(title = "Cluster 3", x="Household Income $ (Median)", y="Median Education Level")+
  theme(plot.title = element_text(hjust = 0.5))

Cluster 3

As suspected, there is a clear association between education level and median household income: a higher education level corresponds to a higher median household income (similar to the findings for Cluster_1). The median of median household income lies close to the 1st quartile, and the interquartile range is quite wide. No county has a median education level below high school. This finding is again consistent with existing literature in labor economics, which states that higher levels of education are expected to be associated with higher levels of productivity and, consequently, higher income levels (https://www.pc.gov.au/research/supporting/education-health-effects-wages/education-health-effects-wages.pdf).

Task 4: Write a conclusion

Answer-4:

In this assignment, we provided a comparative assessment of population change (positive or negative) and cluster-wise median household income, as a data exploration of internal migration in the United States.

In every cluster, counties that gained population showed a higher variance in median household income than counties that lost population. Estimates of measures of income inequality and polarization are often required in studies of income distributions, and one should account for their sampling variability when income distributions are compared from region to region or through time.

Median household income was higher and covered a broader range in metro counties that gained population. Increases in urbanization may contribute to higher levels of inequality between metro and non-metro counties. Our results are supported by two different visual approaches (box plots and histograms) across three dimensions: income, population change, and metro status. We found that population gains are more common in urban areas, which is associated with higher household income.

Higher levels of education usually translate into better employment opportunities and higher earnings (see the ridge plots). Most higher education institutions (universities) are located in urban areas. We inspected the relationship between education level and income across the combinations of metro status and population change, and found that income increases with education level in both urban and rural settings.

Limitation

  1. We did not quantify our findings. Relying on visual inspection alone may lead to misleading interpretations. Further analysis using more advanced statistical techniques, such as Factor Analysis or Logistic Regression, may help us drill down further and identify the latent variables that motivate such diffusion of labor.

  2. We did not consider other relevant variables in this scientific report, such as poverty, unemployment rate, and per-capita income. We might be able to gain a better understanding of US internal migration by considering those factors.
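As a rough indication of what the quantification mentioned in the first limitation could look like, the sketch below fits a logistic regression of population gain on income with base R's `glm()`. The data are simulated purely for illustration (the sample size, coefficients, and seed are all assumptions), so the output says nothing about the actual clusters.

```r
set.seed(1)  # arbitrary seed so the simulated example is repeatable

# Simulated stand-in data: income in $1000s and a gain indicator whose
# probability rises with income (assumed relationship, for illustration only)
n      <- 500
income <- runif(n, min = 30, max = 90)
p_gain <- plogis(-6 + 0.1 * income)
gain   <- rbinom(n, size = 1, prob = p_gain)

# Logistic regression: log-odds of gaining population as a function of income
fit <- glm(gain ~ income, family = binomial)

coef(fit)                   # intercept and slope on the log-odds scale
exp(coef(fit)["income"])    # odds ratio per additional $1000 of income
```

With the real data, the same idea could be tried with a call along the lines of `glm(Population_Change == "Gain" ~ median_hh_income, data = Cluster2_update, family = binomial)`.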

Importance of this study (Bangladesh Perspective)

The population of Bangladesh is increasing at an alarming rate. Bangladesh is one of the ten most populous countries in the world. The country may have already reduced its population growth, but this reduction is not nearly enough to avoid dire consequences. Currently, the country adds about 3 million to its population every year.

Excessive centralization and a growing population are making cities uninhabitable. Dhaka, the capital of Bangladesh, in particular is ranked the 2nd most uninhabitable city. Because of highly unequal income distribution, internal migration may be boosting population density in Dhaka.

The capital of Bangladesh is seen as a frenzied economic engine, with the city’s skyline thrusting up aggressively and the sprawling markets bustling with activity. However, the rural agriculture sector contributes a lion’s share of the GDP.

In this assignment, the lowest education level is not represented in the household income distributions. In Bangladesh, however, people with a low level of education account for a significant portion of household income. For example, most Ready-Made Garments (RMG) workers have educational qualifications below the high school level.

Internal migration is essential to understanding population dynamics and the multifaceted relationship between population and development in a nation like Bangladesh. However, due to the unavailability of quality data, exploring the actual scenario of domestic migration and income distribution is challenging.