Project 1: Migration Analysis

Author

Rebecca Jipdjio

Published

March 10, 2024

source: facinghistory.org

Introduction

This project explores the relationship between state population and the number of refugees, using data from the Department of Homeland Security for fiscal year 2022. The dataset includes variables such as Population and Refugees for each state. The aim is to investigate how the number of refugees varies with population size across states. (Text set off with asterisks is commentary on the code.)

# load the necessary libraries 
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(RColorBrewer)

* This chunk loads the required libraries such as tidyverse for data manipulation, ggplot2 for creating static plots, plotly for creating interactive plots, and RColorBrewer for generating color palettes.*

# Reading the data
immigration_data <- read_csv("/Users/rebeccambaho/Downloads/State Immigration Data Sheets- 2022/immigration_data .csv")

Rows: 52 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): State, R Rank, Asylees, A Rank
dbl (3): LPR Rank, NI Rank, N Rank
num (5): Population, Lawful Permanent Residents, Nonimmigrants, Naturalizati...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

* This chunk reads the immigration data from a CSV file into the R environment for further analysis.*

Data Cleaning

# Check for missing values
sum(is.na(immigration_data))

[1] 0

* This section checks for missing values in the dataset, which is essential for data quality assurance and preprocessing.*
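
The single total above does not show which column any missing values would come from. A minimal sketch of a per-column count follows, using a small made-up data frame (`toy_data` and its values are hypothetical, not the DHS figures). Note that `is.na()` only detects values R has parsed as `NA`; placeholder text in a character column would not be counted.

```r
# Hypothetical toy data standing in for immigration_data (values are made up)
toy_data <- data.frame(
  State      = c("A", "B", "C", "D"),
  Population = c(1000, 2000, NA, 4000),
  Refugees   = c(10, NA, NA, 40)
)

# Total number of missing cells, as in the check above
total_missing <- sum(is.na(toy_data))

# Missing cells per column, to see where the gaps are
missing_by_column <- colSums(is.na(toy_data))

print(total_missing)
print(missing_by_column)
```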

Color Scale

# Set up color scale
n_colors <- 5
color_palette <- brewer.pal(n_colors, "GnBu")  # sequential green-blue palette
color_breaks <- pretty(range(immigration_data$Refugees), n = n_colors)  # round legend breaks
color_labels <- color_breaks  # label each break with its own value

* This section sets up a color scale for visualizing the number of refugees in the subsequent plot, ensuring effective representation of data.*
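
To make the scale setup concrete, the sketch below shows how `pretty()` builds legend breaks for a hypothetical set of refugee counts (made-up values, not the DHS figures). `pretty()` treats `n` as a target rather than an exact count, so the number of breaks may differ from `n_colors`. That is fine here because `scale_fill_gradientn()` interpolates between the palette colors independently of the breaks. The comma-formatted labels are an optional variation, not what the original code does.

```r
# Hypothetical refugee counts (made up for illustration)
refugees <- c(0, 120, 850, 2300, 5000)

n_colors <- 5

# pretty() picks round break values covering the range; n is only a target
color_breaks <- pretty(range(refugees), n = n_colors)

# Optional: format labels with thousands separators for the legend
color_labels <- format(color_breaks, big.mark = ",", trim = TRUE)

print(color_breaks)
print(color_labels)
```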

Creating the Scatter Plot

scatter_plot <- ggplot(immigration_data, aes(x = Population, y = Refugees, text = paste("State: ", State, "<br>Population: ", Population, "<br>Refugees: ", Refugees))) +
  geom_point(aes(fill = Refugees), color = "black", shape = 21, size = 3) + # Shape 21 takes a fill color plus a black border
  scale_fill_gradientn(name = "Number of Refugees", colors = color_palette, breaks = color_breaks, labels = color_labels) +
  labs(x = "Population", y = "Number of Refugees", title = "Population vs Number of Refugees by State (2022)",
       caption = "Source: Department of Homeland Security") +
  theme_minimal() +
  geom_smooth(aes(group = 1), method = "lm", se = TRUE, color = "darkblue", linewidth = 0.5, fill = "lightblue") +  # Linear regression line with a light blue confidence band; linewidth replaces the deprecated size argument
  scale_x_continuous(labels = scales::comma) +
  guides(fill = "legend")  # Show the fill scale as a legend of points

scatter_plot

`geom_smooth()` using formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation: text
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

* This code creates a scatter plot with a regression line using ggplot2. It plots 'Population' against 'Refugees' from the 'immigration_data' dataset. Points are filled according to the number of refugees, using a custom color gradient. The 'text' aesthetic stores the state name, population, and refugee count for each point; it is not drawn in the static plot and serves as the tooltip once the plot is made interactive. Axis labels, a title, and a caption are added, and the plot theme is set to minimal. A linear regression line with a confidence band is included, and x-axis labels are formatted with commas for readability. Overall, the code visualizes the relationship between population and the number of refugees by state.*

# Convert ggplot to Plotly for interactivity
scatter_plot <- ggplotly(scatter_plot, tooltip = "text")

`geom_smooth()` using formula = 'y ~ x'

Warning: The following aesthetics were dropped during statistical transformation: text
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

* This section converts the static ggplot scatter plot into an interactive plot using Plotly, enhancing user engagement and exploration.*
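
The "aesthetics were dropped" warning above appears because `text` is mapped in the global `aes()`, so `geom_smooth()` inherits it and discards it while fitting the line. One common workaround, sketched below on made-up data (`toy_data` is hypothetical, not the DHS figures), is to map `text` only inside `geom_point()`. ggplot2 may then report "Ignoring unknown aesthetics: text" for the point layer instead, which is harmless because `ggplotly()` still reads the mapping for the tooltip.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, not the DHS figures)
toy_data <- data.frame(
  State      = c("A", "B", "C", "D", "E"),
  Population = c(1e6, 3e6, 5e6, 8e6, 1.2e7),
  Refugees   = c(100, 250, 400, 600, 800)
)

p <- ggplot(toy_data, aes(x = Population, y = Refugees)) +
  # text is mapped only on the point layer, so geom_smooth() never sees it
  geom_point(aes(fill = Refugees,
                 text = paste0("State: ", State,
                               "<br>Population: ", Population,
                               "<br>Refugees: ", Refugees)),
             color = "black", shape = 21, size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE,
              color = "darkblue", linewidth = 0.5, fill = "lightblue")

# Convert to an interactive plot only if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  interactive_plot <- plotly::ggplotly(p, tooltip = "text")
}
```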

# Show the interactive scatter plot
scatter_plot

* This chunk displays the interactive scatter plot, allowing users to hover over points for detailed information.*

Linear Regression Analysis

# Linear regression analysis
linear_model <- lm(Refugees ~ Population, data = immigration_data)

* This chunk performs linear regression analysis to model the relationship between population size and the number of refugees, providing insights into their association.*

# Summary of the regression model
summary(linear_model)


Call:
lm(formula = Refugees ~ Population, data = immigration_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-708.90 -132.02  -44.66  138.15  929.93 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 9.325e+01  5.168e+01   1.804   0.0772 .  
Population  6.135e-05  5.300e-06  11.574 9.38e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 278.8 on 50 degrees of freedom
Multiple R-squared:  0.7282,    Adjusted R-squared:  0.7228 
F-statistic:   134 on 1 and 50 DF,  p-value: 9.379e-16

* This chunk prints the summary of the fitted model: the residual distribution, the coefficient estimates with their standard errors and p-values, the R-squared values, and the overall F-test.*
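
The figures printed by `summary()` can also be extracted as numbers for reuse in text or tables. The sketch below fits the same kind of model on made-up data (`toy_data` and `toy_model` are hypothetical names and values, not the DHS figures) and pulls out the coefficient table, the R-squared values, and a confidence interval for each coefficient.

```r
# Hypothetical toy data (made-up values, not the DHS figures)
toy_data <- data.frame(
  Population = c(1e6, 3e6, 5e6, 8e6, 1.2e7),
  Refugees   = c(120, 260, 380, 640, 790)
)

toy_model   <- lm(Refugees ~ Population, data = toy_data)
toy_summary <- summary(toy_model)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(toy_summary)

# R-squared and adjusted R-squared as plain numbers
r_squared     <- toy_summary$r.squared
adj_r_squared <- toy_summary$adj.r.squared

# 95% confidence intervals for the intercept and slope
coef_ci <- confint(toy_model, level = 0.95)

print(coef_table)
print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
print(coef_ci)
```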

The linear regression model equation is: Refugees = 93.25 + 0.00006135 × Population.

The summary output of the linear regression model supports several findings. The slope estimate means that each additional resident is associated with about 0.00006135 more refugees, or roughly 61 additional refugees per one million residents. Because Population is the only predictor in the model, no other variables are being held constant. The coefficient is statistically significant (p < 0.001), which is strong evidence of a positive relationship between population size and the number of refugees. The multiple R-squared of 0.7282 (adjusted: 0.7228) indicates that about 73% of the variance in the number of refugees is explained by population size, so population is a substantial predictor of refugee numbers across states.

In a model with a single predictor, the F-test is equivalent to the t-test on the slope, which is why its p-value (9.379e-16) matches the one reported for Population. The intercept (93.25) is not significant at the 0.05 level (p = 0.0772). Overall, the analysis suggests that larger states tend to receive a greater number of refugees. One possible explanation is that they have more resources available, although this model does not test that.
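
As a worked example of the fitted equation, the sketch below plugs in the coefficients reported in the summary output. The helper `predict_refugees()` and the 10-million-resident state are hypothetical, introduced only for illustration.

```r
# Coefficients copied from the summary output above
intercept <- 93.25
slope     <- 6.135e-05

# Hypothetical helper: predicted number of refugees for a given population
predict_refugees <- function(population) {
  intercept + slope * population
}

# Slope expressed per one million residents
per_million <- slope * 1e6

# Predicted count for a hypothetical state of 10 million residents
predicted_10m <- predict_refugees(1e7)

print(per_million)    # about 61.35
print(predicted_10m)  # about 706.75
```

Given the residual standard error of 278.8, a single prediction like this carries considerable uncertainty and is best read as a rough expectation.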

Conclusion

Cleaning the dataset involved several steps to ensure its reliability and usability. Initially, I manually reconstructed the data due to errors encountered in previous code runs. This process included removing unnecessary rows from the Excel sheet and retaining only the relevant ones, which were then converted into a CSV file format. Although I initially retained more rows anticipating their use, I found that reducing the dataset helped resolve errors. Additionally, during coding, I implemented a line to check for any remaining missing data, further ensuring the dataset’s accuracy.

The visualization depicts a clear positive correlation between state population size and the number of refugees. Notably, larger states such as Texas and California have higher refugee counts, while Wyoming, Hawaii, Mississippi, and Puerto Rico show no refugees. This pattern matches the expected trend in which states with larger populations tend to receive more refugees. However, the absence of refugees in some places raises questions about which policies or demographic factors influence refugee settlement there.

Despite my efforts, I encountered several challenges while attempting to include additional features in the visualization. I faced errors at various stages, which compelled me to rewrite my code multiple times. For instance, the legend disappeared from the final interactive graph without any apparent reason. Initially, I aimed to include more categories, but coding them proved trickier than anticipated. For example, I intended to visualize state rankings across different metrics, such as the refugee category, but due to technical difficulties, I had to prioritize my primary graph and omit these extra elements.