Russia’s war on Ukraine has displaced millions of people from their country. Most of them have entered Poland, which borders Ukraine. I wanted to use the border crossing records from the Polish border to find out which border crossing saw the largest number of refugees entering, so that international aid agencies can plan resource allocation based on the needs in each area. I also wanted to analyze the citizenship of these refugees so that the governments of their countries can stand up against Russia’s aggression.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(stringr)
library(ggplot2)
library(dplyr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(usmap)
Read data: The data for this analysis was obtained from kaggle.com. The web link is https://www.kaggle.com/datasets/krystianadammolenda/refugees-from-ukraine-poland?resource=download. The data was first uploaded to GitHub and then loaded into an R data frame.
df<- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-607/main/border_traffic_UA_PL_01_03.csv")
head(df)
## Border.Guard.Post Border.crossing Type.of.border.crossing Border.Guard.Unit
## 1 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 2 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 3 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 4 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 5 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 6 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## Date Direction.to...from.Poland Citizenship..code. UE...Schengen
## 1 2022-01-01 arrival in Poland BY 0
## 2 2022-01-01 arrival in Poland DE UE
## 3 2022-01-01 arrival in Poland TR 0
## 4 2022-01-01 arrival in Poland UA 0
## 5 2022-01-01 arrival in Poland LV UE
## 6 2022-01-01 arrival in Poland NL UE
## Number.of.persons..checked.in. Number.of.people..evacuated.
## 1 7 0
## 2 29 0
## 3 2 0
## 4 389 0
## 5 3 0
## 6 3 0
Below is the summary of the dataset.
summary(df)
## Border.Guard.Post Border.crossing Type.of.border.crossing
## Length:54233 Length:54233 Length:54233
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Border.Guard.Unit Date Direction.to...from.Poland
## Length:54233 Length:54233 Length:54233
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Citizenship..code. UE...Schengen Number.of.persons..checked.in.
## Length:54233 Length:54233 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Median : 1.00
## Mean : 78.06
## 3rd Qu.: 4.00
## Max. :24662.00
## Number.of.people..evacuated.
## Min. : 0.00
## 1st Qu.: 0.00
## Median : 0.00
## Mean : 33.79
## 3rd Qu.: 1.00
## Max. :13641.00
The summary above shows that the column headers are awkward to work with (for example, Direction.to...from.Poland), so I ran the code below to give the columns meaningful names.
colnames(df) <- c("Border_Guard_Post", "Border_Crossing","Type_Of_Border_Crossing","Border_Guard_Unit","Date","Direction","Code","EU_Schengen","Number_Of_Persons_Checked_In","Number_Of_Persons_Evacuated")
head(df)
## Border_Guard_Post Border_Crossing Type_Of_Border_Crossing Border_Guard_Unit
## 1 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 2 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 3 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 4 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 5 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## 6 Dorohusk Dorohusk-Jagodzin road Nadbuzanski
## Date Direction Code EU_Schengen Number_Of_Persons_Checked_In
## 1 2022-01-01 arrival in Poland BY 0 7
## 2 2022-01-01 arrival in Poland DE UE 29
## 3 2022-01-01 arrival in Poland TR 0 2
## 4 2022-01-01 arrival in Poland UA 0 389
## 5 2022-01-01 arrival in Poland LV UE 3
## 6 2022-01-01 arrival in Poland NL UE 3
## Number_Of_Persons_Evacuated
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
str(df)
## 'data.frame': 54233 obs. of 10 variables:
## $ Border_Guard_Post : chr "Dorohusk" "Dorohusk" "Dorohusk" "Dorohusk" ...
## $ Border_Crossing : chr "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" ...
## $ Type_Of_Border_Crossing : chr "road" "road" "road" "road" ...
## $ Border_Guard_Unit : chr "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" ...
## $ Date : chr "2022-01-01" "2022-01-01" "2022-01-01" "2022-01-01" ...
## $ Direction : chr "arrival in Poland" "arrival in Poland" "arrival in Poland" "arrival in Poland" ...
## $ Code : chr "BY" "DE" "TR" "UA" ...
## $ EU_Schengen : chr "0" "UE" "0" "0" ...
## $ Number_Of_Persons_Checked_In: int 7 29 2 389 3 3 4 2 4 2 ...
## $ Number_Of_Persons_Evacuated : int 0 0 0 0 0 0 0 0 0 0 ...
I converted blank Border_Crossing fields into NA so that I can remove these rows, and then counted the NA values in each column.
df$Border_Crossing[df$Border_Crossing == ""] = NA
cbind(lapply(lapply(df, is.na), sum))
## [,1]
## Border_Guard_Post 0
## Border_Crossing 81
## Type_Of_Border_Crossing 0
## Border_Guard_Unit 0
## Date 0
## Direction 0
## Code 26
## EU_Schengen 0
## Number_Of_Persons_Checked_In 0
## Number_Of_Persons_Evacuated 0
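One thing worth checking at this step: read.csv() treats the literal string NA as a missing value by default, and NA is also the ISO two-letter code for Namibia. The sketch below uses a small made-up CSV (not the real data) to show the effect and one way to avoid it. Whether the 26 missing Code values above are in fact Namibian citizens is an assumption that I have not verified.

```r
# Toy CSV (hypothetical rows): the second row carries the country code "NA" (Namibia).
csv_text <- "Code,Count\nUA,10\nNA,3\nPL,5"

# Default behaviour: the string "NA" is read as a missing value.
default_read <- read.csv(text = csv_text)
sum(is.na(default_read$Code))

# Treat only empty strings as missing, so "NA" survives as a real code.
safe_read <- read.csv(text = csv_text, na.strings = "")
sum(is.na(safe_read$Code))
safe_read$Code
```

If the same pattern holds in the real file, reading it with na.strings = "" would keep those rows instead of dropping them in the na.omit() step.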
I wanted to keep my data clean by removing rows where any column is NA. The initial data frame had 54233 rows; after the command below it has 54126 rows, so 107 rows were removed.
df1 <- na.omit(df)
cbind(lapply(lapply(df1, is.na), sum))
## [,1]
## Border_Guard_Post 0
## Border_Crossing 0
## Type_Of_Border_Crossing 0
## Border_Guard_Unit 0
## Date 0
## Direction 0
## Code 0
## EU_Schengen 0
## Number_Of_Persons_Checked_In 0
## Number_Of_Persons_Evacuated 0
The Date field is stored as character. I want to convert it into Date format.
class(df1$Date)
## [1] "character"
# Date is already in YYYY-MM-DD form, so no "-01" suffix needs to be pasted on
df1$Date <- as.Date(df1$Date, format = "%Y-%m-%d")
class(df1$Date)
## [1] "Date"
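When converting with as.Date(), it can help to state the expected format and then count how many values failed to parse. A minimal sketch on made-up date strings (the last one is deliberately malformed):

```r
# Toy date strings (hypothetical); the last one is malformed on purpose.
raw_dates <- c("2022-01-01", "2022-02-24", "24/02/2022")

# An explicit format makes the expectation clear; non-matching strings become NA.
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")
parsed

# Count values that failed to parse before relying on the column.
sum(is.na(parsed))
```

A non-zero count here would be a sign that some rows need cleaning before any date-based grouping.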
The data frame df1 has two columns that give the number of persons who checked in themselves and the number who were evacuated. I want to create a new column that adds these two columns to give the total number of persons per row.
df1$Total_Arrival <- rowSums(cbind(df1$Number_Of_Persons_Checked_In, df1$Number_Of_Persons_Evacuated), na.rm = TRUE)
str(df1)
## 'data.frame': 54126 obs. of 11 variables:
## $ Border_Guard_Post : chr "Dorohusk" "Dorohusk" "Dorohusk" "Dorohusk" ...
## $ Border_Crossing : chr "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" ...
## $ Type_Of_Border_Crossing : chr "road" "road" "road" "road" ...
## $ Border_Guard_Unit : chr "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" ...
## $ Date : Date, format: "2022-01-01" "2022-01-01" ...
## $ Direction : chr "arrival in Poland" "arrival in Poland" "arrival in Poland" "arrival in Poland" ...
## $ Code : chr "BY" "DE" "TR" "UA" ...
## $ EU_Schengen : chr "0" "UE" "0" "0" ...
## $ Number_Of_Persons_Checked_In: int 7 29 2 389 3 3 4 2 4 2 ...
## $ Number_Of_Persons_Evacuated : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Total_Arrival : num 7 29 2 389 3 3 4 2 4 2 ...
## - attr(*, "na.action")= 'omit' Named int [1:107] 11720 12731 12900 12944 13610 13747 13875 14417 14652 16450 ...
## ..- attr(*, "names")= chr [1:107] "11720" "12731" "12900" "12944" ...
I separated the Date column into Year, Month and Day. Note: This code will not run a second time, because the result overwrites df1 and the Date column no longer exists after the first run.
df1 <- separate(df1, "Date", c("Year", "Month", "Day"), sep = "-")
summary(df1)
## Border_Guard_Post Border_Crossing Type_Of_Border_Crossing
## Length:54126 Length:54126 Length:54126
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Border_Guard_Unit Year Month Day
## Length:54126 Length:54126 Length:54126 Length:54126
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Direction Code EU_Schengen
## Length:54126 Length:54126 Length:54126
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Number_Of_Persons_Checked_In Number_Of_Persons_Evacuated Total_Arrival
## Min. : 0.00 Min. : 0.00 Min. : 1.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 1.0
## Median : 1.00 Median : 0.00 Median : 2.0
## Mean : 78.22 Mean : 33.84 Mean : 112.1
## 3rd Qu.: 4.00 3rd Qu.: 1.00 3rd Qu.: 8.0
## Max. :24662.00 Max. :13641.00 Max. :24662.0
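A re-runnable alternative to the separate() step is to derive the parts with format() and leave the Date column in place. This is only a sketch on made-up dates, not the code used for the results in this post:

```r
# Toy data frame (hypothetical rows) with a Date column like the one in df1.
toy <- data.frame(Date = as.Date(c("2022-01-01", "2022-02-24", "2022-03-15")))

# Derive the parts with format(); Date itself is kept, so this can be re-run safely.
toy$Year  <- format(toy$Date, "%Y")
toy$Month <- format(toy$Date, "%m")
toy$Day   <- format(toy$Date, "%d")
toy
```

Keeping Date also leaves the door open for time-series plots later on.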
Number of people who arrived in Poland and who departed from Poland
arrive_in_poland <- df1 %>%
  group_by(Direction) %>%
  dplyr::summarise(Total_Arrival = sum(Total_Arrival)) %>%
  as.data.frame()
arrive_in_poland
## Direction Total_Arrival
## 1 arrival in Poland 5090593
## 2 departure from Poland 974864
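The Total_Arrival column holds counts for both directions, so any breakdown that sums it without looking at Direction mixes arrivals with departures. If the goal is to count only people entering Poland, the rows can be restricted to arrivals first. A minimal base R sketch on made-up rows (the numbers are hypothetical):

```r
# Toy rows (hypothetical numbers) mimicking the relevant columns of df1.
toy <- data.frame(
  Direction = c("arrival in Poland", "departure from Poland",
                "arrival in Poland", "departure from Poland"),
  Type_Of_Border_Crossing = c("road", "road", "railway", "railway"),
  Total_Arrival = c(100, 20, 30, 5)
)

# Keep arrivals only, then total by crossing type.
arrivals_only <- toy[toy$Direction == "arrival in Poland", ]
arrivals_by_type <- aggregate(Total_Arrival ~ Type_Of_Border_Crossing,
                              data = arrivals_only, FUN = sum)
arrivals_by_type
```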
The code below determines the total number of people who crossed the border by road and by railway. Note that these totals combine both directions (arrivals in Poland and departures from Poland).
datagroup <- df1 %>%
  group_by(Type_Of_Border_Crossing) %>%
  dplyr::summarise(Total = sum(Total_Arrival)) %>%
  as.data.frame()
datagroup
## Type_Of_Border_Crossing Total
## 1 railway 570047
## 2 road 5495410
Below is a graph to show this data.
cvpalette <- c("#80bfff", "#6666ff", "#85adad")
# datagroup was computed above; make sure Total is numeric for plotting
datagroup$Total <- as.numeric(datagroup$Total)
ggplot(datagroup, aes(Type_Of_Border_Crossing, Total, fill = Type_Of_Border_Crossing)) +
  geom_col() +
  scale_fill_manual(values = cvpalette) +
  theme(axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0)) +
  labs(x = "Type Of Border Crossing", y = "Total", title = "Border Crossing by Type")
Total border crossings by country of citizenship
Arrival_By_Citizenship <- df1 %>%
  group_by(Code) %>%
  dplyr::summarise(Total = sum(Total_Arrival)) %>%
  arrange(Total)  # ascending order; arrange() has no `increasing` argument
Arrival_By_Citizenship <- data.frame(Arrival_By_Citizenship)
head(Arrival_By_Citizenship)
## Code Total
## 1 BN 1
## 2 PG 1
## 3 TT 1
## 4 AN 2
## 5 CF 2
## 6 GL 2
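The table above is sorted in ascending order, so it lists the smallest citizenship groups first. To see the largest groups first, the sort column can be wrapped in dplyr's desc(). A small sketch on made-up totals, assuming dplyr is installed:

```r
library(dplyr)

# Toy totals (hypothetical numbers) in the same shape as Arrival_By_Citizenship.
toy <- data.frame(Code = c("BN", "UA", "PL", "DE"),
                  Total = c(1, 500, 40, 7))

# desc() reverses the sort order so the largest totals come first.
largest_first <- toy %>% arrange(desc(Total))
head(largest_first, 3)
```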
## install.packages("writexl")
## library(writexl)
## write_xlsx(Arrival_By_Citizenship,"C:\\Users\\dkbs0\\OneDrive\\Desktop\\file_name.xlsx")
# read in the country name and ISO code list
codeList<-read.csv("https://datahub.io/core/country-list/r/data.csv")
head(codeList)
## Name Code
## 1 Afghanistan AF
## 2 Åland Islands AX
## 3 Albania AL
## 4 Algeria DZ
## 5 American Samoa AS
## 6 Andorra AD
str(codeList)
## 'data.frame': 249 obs. of 2 variables:
## $ Name: chr "Afghanistan" "Åland Islands" "Albania" "Algeria" ...
## $ Code: chr "AF" "AX" "AL" "DZ" ...
##write_xlsx(codeList,"C:\\Users\\dkbs0\\OneDrive\\Desktop\\codeList.xlsx")
mapdata<-map_data("world")
head(mapdata)
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
str(mapdata)
## 'data.frame': 99338 obs. of 6 variables:
## $ long : num -69.9 -69.9 -69.9 -70 -70.1 ...
## $ lat : num 12.5 12.4 12.4 12.5 12.5 ...
## $ group : num 1 1 1 1 1 1 1 1 1 1 ...
## $ order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ region : chr "Aruba" "Aruba" "Aruba" "Aruba" ...
## $ subregion: chr NA NA NA NA ...
Since the source data obtained from Kaggle did not align with the map package data, I had to export the data to Excel, work on it manually, and load it to GitHub so that I could pull it back into R.
df3 <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-607/main/ukraine_refugee.csv")
colnames(df3) <- c("region", "Code","Total")
head(df3)
## region Code Total
## 1 Andorra AD 21
## 2 United Arab Emirates AE 7
## 3 Afghanistan AF 3860
## 4 Antigua and Barbuda AG 5
## 5 Anguilla AI 0
## 6 Albania AL 88
mapdata1 <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-607/main/mapdata.csv")
head(mapdata1)
## region long lat group order subregion Code Total
## 1 Andorra 1.706055 42.50332 10 939 AD 21
## 2 Andorra 1.678515 42.49668 10 940 AD 21
## 3 Andorra 1.586426 42.45596 10 941 AD 21
## 4 Andorra 1.534082 42.44170 10 942 AD 21
## 5 Andorra 1.486231 42.43447 10 943 AD 21
## 6 Andorra 1.448828 42.43745 10 944 AD 21
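The manual Excel step could in principle be replaced by joins inside R. The sketch below uses tiny made-up stand-ins for Arrival_By_Citizenship, codeList and mapdata (all values are hypothetical) and base R's merge(). In the real data, country names that differ between the code list and the map data would still need to be reconciled by hand.

```r
# Hypothetical stand-ins for the three tables used above.
totals   <- data.frame(Code = c("UA", "PL", "XX"), Total = c(500, 40, 3))
codes    <- data.frame(Name = c("Ukraine", "Poland"), Code = c("UA", "PL"))
map_rows <- data.frame(region = c("Ukraine", "Ukraine", "Poland"),
                       long = c(30.5, 31.0, 21.0),
                       lat  = c(50.4, 49.9, 52.2))

# Step 1: attach country names to the totals; all.x = TRUE keeps unmatched codes.
named <- merge(totals, codes, by = "Code", all.x = TRUE)

# Codes with no matching name are the ones that need manual attention.
unmatched <- named$Code[is.na(named$Name)]
unmatched

# Step 2: attach totals to the map rows by country name.
map_with_totals <- merge(map_rows, named, by.x = "region", by.y = "Name")
map_with_totals
```

merge() does not preserve the original row order, so with real map data the result should be re-sorted by the order column before plotting polygons.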
The code below creates a new data frame for a plot of the top 10 countries of citizenship among the people recorded at the border.
df4 <- df3 %>% top_n(10,Total)
df4 <- df4 %>% select(-Code)
df4 = df4[order(-df4$Total), ]
head(df4)
## region Total
## 8 Ukraine 5624176
## 5 Poland 139741
## 2 Germany 23923
## 6 Romania 19290
## 10 Uzbekistan 14841
## 9 United States 14657
ggplot(df4, aes(region, log(Total), fill = region)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Region", y = "log(Total)", subtitle = "Ukraine Refugees by Citizenship")
Analysis of refugee count by border crossing
arrive_in_poland_by_border_crossing <- df1 %>%
  group_by(Border_Guard_Post) %>%
  dplyr::summarise(Total_Arrival = sum(Total_Arrival)) %>%
  as.data.frame()
arrive_in_poland_by_border_crossing
## Border_Guard_Post Total_Arrival
## 1 Dolhobyczow 482305
## 2 Dorohusk 925025
## 3 Horyniec-Zdroj 454
## 4 Hrebenne 804245
## 5 Hrubieszow 497972
## 6 Korczowa 958894
## 7 Kroscienko 254821
## 8 Lubaczow 506880
## 9 Medyka 1634861
head(arrive_in_poland_by_border_crossing)
## Border_Guard_Post Total_Arrival
## 1 Dolhobyczow 482305
## 2 Dorohusk 925025
## 3 Horyniec-Zdroj 454
## 4 Hrebenne 804245
## 5 Hrubieszow 497972
## 6 Korczowa 958894
ggplot(arrive_in_poland_by_border_crossing, aes(Border_Guard_Post, Total_Arrival, fill = Border_Guard_Post)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Border Guard Post", y = "Total Arrival", subtitle = "Ukraine Refugees by Border Crossing")
Note: The bar for Horyniec-Zdroj is too small to be visible in the plot because its value is very low compared to the other border crossings.
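To make the ranking and the relative sizes explicit, the totals printed above can be sorted and turned into percentage shares. This sketch copies the nine values from the output by hand rather than reusing the data frame:

```r
# Totals per border guard post, copied from the output above.
posts <- c(Dolhobyczow = 482305, Dorohusk = 925025, "Horyniec-Zdroj" = 454,
           Hrebenne = 804245, Hrubieszow = 497972, Korczowa = 958894,
           Kroscienko = 254821, Lubaczow = 506880, Medyka = 1634861)

# Rank posts from busiest to quietest.
ranked <- sort(posts, decreasing = TRUE)
names(ranked)

# Share of all recorded crossings handled by each post, in percent.
share_pct <- round(100 * ranked / sum(ranked), 2)
share_pct
```

Medyka comes out at roughly 27% of all recorded crossings, while Horyniec-Zdroj accounts for well under a tenth of a percent, which is why its bar is not visible.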
Challenges: The code below plots, on a world map, the number of refugees leaving Ukraine for Poland by their citizenship. I found that the data set used for this analysis did not match well with the world map packages available in R. For example, some country codes did not match any existing country code, and there was no way to figure out which country the data set was referring to because the country name was not available. Similarly, the border crossing names were recorded loosely, and some of these locations were not available in the map data. So I had to manually update the data, load it into GitHub and then pull it into R for the plot. Even so, there are imperfections in the map, and the layout and color coding are not as expected.
map1 <- ggplot(mapdata1, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = Total), color = "black")
map1
Conclusion: Based on the above analysis, I found that the Medyka border guard post recorded the most crossings, followed by Korczowa, Dorohusk, Hrebenne, Lubaczow, Hrubieszow, Dolhobyczow and Kroscienko. Horyniec-Zdroj recorded the fewest. These per-post totals combine arrivals and departures, since they add up to the sum of both directions reported earlier. The pattern suggests a concentration of refugee arrivals in south-eastern Poland, which makes sense because Ukraine is under attack from the north and people fleeing the conflict are travelling towards Poland through the southern part of Ukraine as far as possible.
International aid agencies can use the above analysis to make aid distribution decisions so that there is equity in such distributions based on need.
Another interesting finding from this analysis is that even Russian citizens in Ukraine who have been affected by the war have entered Poland. One thing I am interested to understand is whether these Russian citizens will eventually choose to enter Russia or seek refuge in a third country. There is also a question about how the Polish people and the international aid agencies are treating Russian refugees. I believe everyone affected by war needs to be taken care of irrespective of their national origin, but there was no data to support that analysis at this time.