Data607 Final project

Part 1:
- Introduction/Motivation:
- Libraries loaded
Part 2:
- Data Wrangling:
Part 3:
- Data Analysis:
Part 4:
- Challenges:
Part 5:
- Conclusion:

Part 1:

Introduction/Motivation:

Russia’s war on Ukraine has displaced millions of people from their country. Most of them have entered Poland, which borders Ukraine. I wanted to use the border crossing records from the Polish border to find out which border crossing received the highest number of refugees, so that international aid agencies can plan resource allocation based on the needs in each area. I also wanted to analyze the citizenship of these refugees so that the governments of those countries can stand up against Russia’s aggression.

Libraries loaded

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

library(stringr)
library(ggplot2)
library(dplyr)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(usmap)

Read data: The data for this analysis was obtained from Kaggle. The web link is https://www.kaggle.com/datasets/krystianadammolenda/refugees-from-ukraine-poland?resource=download. The CSV file was first uploaded to GitHub and then read into an R data frame.

df<- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-607/main/border_traffic_UA_PL_01_03.csv")
head(df)

##   Border.Guard.Post   Border.crossing Type.of.border.crossing Border.Guard.Unit
## 1          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 2          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 3          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 4          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 5          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 6          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
##         Date Direction.to...from.Poland Citizenship..code. UE...Schengen
## 1 2022-01-01          arrival in Poland                 BY             0
## 2 2022-01-01          arrival in Poland                 DE            UE
## 3 2022-01-01          arrival in Poland                 TR             0
## 4 2022-01-01          arrival in Poland                 UA             0
## 5 2022-01-01          arrival in Poland                 LV            UE
## 6 2022-01-01          arrival in Poland                 NL            UE
##   Number.of.persons..checked.in. Number.of.people..evacuated.
## 1                              7                            0
## 2                             29                            0
## 3                              2                            0
## 4                            389                            0
## 5                              3                            0
## 6                              3                            0

Part 2:

Data Wrangling:

Below is the summary of the dataset.

summary(df)

##  Border.Guard.Post  Border.crossing    Type.of.border.crossing
##  Length:54233       Length:54233       Length:54233           
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  Border.Guard.Unit      Date           Direction.to...from.Poland
##  Length:54233       Length:54233       Length:54233              
##  Class :character   Class :character   Class :character          
##  Mode  :character   Mode  :character   Mode  :character          
##                                                                  
##                                                                  
##                                                                  
##  Citizenship..code. UE...Schengen      Number.of.persons..checked.in.
##  Length:54233       Length:54233       Min.   :    0.00              
##  Class :character   Class :character   1st Qu.:    0.00              
##  Mode  :character   Mode  :character   Median :    1.00              
##                                        Mean   :   78.06              
##                                        3rd Qu.:    4.00              
##                                        Max.   :24662.00              
##  Number.of.people..evacuated.
##  Min.   :    0.00            
##  1st Qu.:    0.00            
##  Median :    0.00            
##  Mean   :   33.79            
##  3rd Qu.:    1.00            
##  Max.   :13641.00

The summary above shows that the column headers produced by read.csv are hard to read, so I ran the code below to give the columns meaningful names.

colnames(df) <- c("Border_Guard_Post", "Border_Crossing","Type_Of_Border_Crossing","Border_Guard_Unit","Date","Direction","Code","EU_Schengen","Number_Of_Persons_Checked_In","Number_Of_Persons_Evacuated")
head(df)

##   Border_Guard_Post   Border_Crossing Type_Of_Border_Crossing Border_Guard_Unit
## 1          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 2          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 3          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 4          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 5          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
## 6          Dorohusk Dorohusk-Jagodzin                    road       Nadbuzanski
##         Date         Direction Code EU_Schengen Number_Of_Persons_Checked_In
## 1 2022-01-01 arrival in Poland   BY           0                            7
## 2 2022-01-01 arrival in Poland   DE          UE                           29
## 3 2022-01-01 arrival in Poland   TR           0                            2
## 4 2022-01-01 arrival in Poland   UA           0                          389
## 5 2022-01-01 arrival in Poland   LV          UE                            3
## 6 2022-01-01 arrival in Poland   NL          UE                            3
##   Number_Of_Persons_Evacuated
## 1                           0
## 2                           0
## 3                           0
## 4                           0
## 5                           0
## 6                           0

str(df)

## 'data.frame':    54233 obs. of  10 variables:
##  $ Border_Guard_Post           : chr  "Dorohusk" "Dorohusk" "Dorohusk" "Dorohusk" ...
##  $ Border_Crossing             : chr  "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" ...
##  $ Type_Of_Border_Crossing     : chr  "road" "road" "road" "road" ...
##  $ Border_Guard_Unit           : chr  "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" ...
##  $ Date                        : chr  "2022-01-01" "2022-01-01" "2022-01-01" "2022-01-01" ...
##  $ Direction                   : chr  "arrival in Poland" "arrival in Poland" "arrival in Poland" "arrival in Poland" ...
##  $ Code                        : chr  "BY" "DE" "TR" "UA" ...
##  $ EU_Schengen                 : chr  "0" "UE" "0" "0" ...
##  $ Number_Of_Persons_Checked_In: int  7 29 2 389 3 3 4 2 4 2 ...
##  $ Number_Of_Persons_Evacuated : int  0 0 0 0 0 0 0 0 0 0 ...

I converted blank Border_Crossing fields into NA so that I can remove these rows.

df$Border_Crossing[df$Border_Crossing == ""] <- NA

cbind(lapply(lapply(df, is.na), sum))

##                              [,1]
## Border_Guard_Post            0   
## Border_Crossing              81  
## Type_Of_Border_Crossing      0   
## Border_Guard_Unit            0   
## Date                         0   
## Direction                    0   
## Code                         26  
## EU_Schengen                  0   
## Number_Of_Persons_Checked_In 0   
## Number_Of_Persons_Evacuated  0
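The 26 missing values in Code may not be blank fields. By default read.csv() treats the string "NA" as a missing value, and NA is also the ISO 3166-1 alpha-2 code for Namibia. I have not checked the raw file, so this is only a possibility. The sketch below uses a made-up three-row CSV (not the real data) to show how the na.strings argument changes what counts as missing.

```r
# Toy CSV with made-up counts: one row has the country code "NA",
# one row has a blank Border_Crossing field.
csv_text <- paste(
  "Border_Crossing,Code,Total",
  "Dorohusk-Jagodzin,UA,10",
  "Dorohusk-Jagodzin,NA,2",
  ",DE,1",
  sep = "\n"
)

# Default behaviour: the string "NA" is read as a missing value.
default_read <- read.csv(text = csv_text)

# Only empty fields are treated as missing; "NA" stays a country code.
strict_read <- read.csv(text = csv_text, na.strings = "")

sum(is.na(default_read$Code))           # missing codes with the default
sum(is.na(strict_read$Code))            # missing codes with na.strings = ""
sum(is.na(strict_read$Border_Crossing)) # blank crossings become NA directly
```

With na.strings = "" the blank Border_Crossing fields become NA at read time, so the separate conversion step would no longer be needed.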

I wanted to keep my data clean by removing rows where any column is NA. My initial dataframe had 54233 rows and after this command it has 54126 rows, so 107 rows were removed (81 with a missing Border_Crossing and 26 with a missing Code).

df1 <- na.omit(df) 

cbind(lapply(lapply(df1, is.na), sum))

##                              [,1]
## Border_Guard_Post            0   
## Border_Crossing              0   
## Type_Of_Border_Crossing      0   
## Border_Guard_Unit            0   
## Date                         0   
## Direction                    0   
## Code                         0   
## EU_Schengen                  0   
## Number_Of_Persons_Checked_In 0   
## Number_Of_Persons_Evacuated  0

The Date field is a character. I want to convert it into date format.

class(df1$Date)

## [1] "character"

df1$Date <- as.Date(df1$Date)  # values are already in YYYY-MM-DD format
class(df1$Date)

## [1] "Date"

In the dataframe df1, there are two columns that give the number of persons who checked in themselves and the number who were evacuated. I want to create a new column by adding these two columns, which gives the total number of people per record. Note that the new column is named Total_Arrival but it is computed for every row, including rows whose Direction is departure from Poland.

df1$Total_Arrival <- rowSums(cbind(df1$Number_Of_Persons_Checked_In, df1$Number_Of_Persons_Evacuated), na.rm = TRUE)
str(df1)

## 'data.frame':    54126 obs. of  11 variables:
##  $ Border_Guard_Post           : chr  "Dorohusk" "Dorohusk" "Dorohusk" "Dorohusk" ...
##  $ Border_Crossing             : chr  "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" "Dorohusk-Jagodzin" ...
##  $ Type_Of_Border_Crossing     : chr  "road" "road" "road" "road" ...
##  $ Border_Guard_Unit           : chr  "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" "Nadbuzanski" ...
##  $ Date                        : Date, format: "2022-01-01" "2022-01-01" ...
##  $ Direction                   : chr  "arrival in Poland" "arrival in Poland" "arrival in Poland" "arrival in Poland" ...
##  $ Code                        : chr  "BY" "DE" "TR" "UA" ...
##  $ EU_Schengen                 : chr  "0" "UE" "0" "0" ...
##  $ Number_Of_Persons_Checked_In: int  7 29 2 389 3 3 4 2 4 2 ...
##  $ Number_Of_Persons_Evacuated : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Total_Arrival               : num  7 29 2 389 3 3 4 2 4 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:107] 11720 12731 12900 12944 13610 13747 13875 14417 14652 16450 ...
##   ..- attr(*, "names")= chr [1:107] "11720" "12731" "12900" "12944" ...

I separated the Date column into Year, Month and Day. Note: This code will not run a second time, because separate() removes the Date column and the result is assigned back to the same dataframe name.

df1 <- separate(df1, "Date", c("Year", "Month", "Day"), sep = "-")

summary(df1)

##  Border_Guard_Post  Border_Crossing    Type_Of_Border_Crossing
##  Length:54126       Length:54126       Length:54126           
##  Class :character   Class :character   Class :character       
##  Mode  :character   Mode  :character   Mode  :character       
##                                                               
##                                                               
##                                                               
##  Border_Guard_Unit      Year              Month               Day           
##  Length:54126       Length:54126       Length:54126       Length:54126      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   Direction             Code           EU_Schengen       
##  Length:54126       Length:54126       Length:54126      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  Number_Of_Persons_Checked_In Number_Of_Persons_Evacuated Total_Arrival    
##  Min.   :    0.00             Min.   :    0.00            Min.   :    1.0  
##  1st Qu.:    0.00             1st Qu.:    0.00            1st Qu.:    1.0  
##  Median :    1.00             Median :    0.00            Median :    2.0  
##  Mean   :   78.22             Mean   :   33.84            Mean   :  112.1  
##  3rd Qu.:    4.00             3rd Qu.:    1.00            3rd Qu.:    8.0  
##  Max.   :24662.00             Max.   :13641.00            Max.   :24662.0
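One way to make the date split safe to re-run is to derive the parts with format() and keep the Date column. This is a minimal sketch on made-up dates, not the code used for the output above. The toy dataframe name is only for illustration.

```r
# Made-up dates for illustration only
toy <- data.frame(Date = as.Date(c("2022-01-01", "2022-02-24", "2022-03-15")))

# Derive the parts without removing Date, so this block can be run repeatedly
toy$Year  <- format(toy$Date, "%Y")
toy$Month <- format(toy$Date, "%m")
toy$Day   <- format(toy$Date, "%d")

str(toy)
```

Keeping Date as a Date class also makes later time-based grouping or plotting easier than working with three character columns.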

Part 3:

Data Analysis:

Number of people who arrived in Poland and who departed from Poland

# Total number of people per direction of travel
arrive_in_poland <- df1 %>%
  group_by(Direction) %>%
  dplyr::summarise(Total_Arrival = sum(Total_Arrival)) %>%
  as.data.frame()

arrive_in_poland

##               Direction Total_Arrival
## 1     arrival in Poland       5090593
## 2 departure from Poland        974864

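The code below determines the total number of people who crossed the border by road and by railway. Because the data is not filtered by Direction, these totals include both arrivals in Poland and departures from Poland.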

datagroup <- df1 %>%
  group_by(Type_Of_Border_Crossing ) %>%
  dplyr::summarise(Total = sum(Total_Arrival)) %>%
                     as.data.frame()

datagroup

##   Type_Of_Border_Crossing   Total
## 1                 railway  570047
## 2                    road 5495410
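To count arrivals only, the rows can be filtered on Direction before grouping. The sketch below uses a made-up four-row dataframe with the same column names as df1. The numbers are invented and the result is not from the real data.

```r
library(dplyr)

# Made-up rows that mimic the structure of df1
toy <- data.frame(
  Type_Of_Border_Crossing = c("road", "road", "railway", "railway"),
  Direction = c("arrival in Poland", "departure from Poland",
                "arrival in Poland", "departure from Poland"),
  Total_Arrival = c(100, 20, 30, 5)
)

# Keep arrivals only, then total by crossing type
arrivals_by_type <- toy %>%
  dplyr::filter(Direction == "arrival in Poland") %>%
  group_by(Type_Of_Border_Crossing) %>%
  dplyr::summarise(Total = sum(Total_Arrival)) %>%
  as.data.frame()

arrivals_by_type
```

The same filter could be placed in front of any of the grouped summaries in this section if only arrivals are of interest.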

Below is a graph to show this data.

cvpalette <- c("#80bfff", "#6666ff", "#85adad")

# datagroup was computed above and has one row per crossing type
ggplot(datagroup, aes(Type_Of_Border_Crossing, Total, fill = Type_Of_Border_Crossing)) +
  geom_col() +
  scale_fill_manual(values = cvpalette) +
  theme(axis.text.x = element_text(angle = -30, vjust = 1, hjust = 0)) +
  labs(x = "Type Of Border Crossing", y = "Total", title = "Border Crossing by Type")

Total arrival in Poland based on country of citizenship

Arrival_By_Citizenship <- df1 %>%
  group_by(Code) %>%
  dplyr::summarise(Total = sum(Total_Arrival)) %>%
  arrange(Total)  # ascending order; use arrange(desc(Total)) to list the largest first

Arrival_By_Citizenship <- data.frame(Arrival_By_Citizenship)
head(Arrival_By_Citizenship)

##   Code Total
## 1   BN     1
## 2   PG     1
## 3   TT     1
## 4   AN     2
## 5   CF     2
## 6   GL     2

## install.packages("writexl")
## library(writexl)
## write_xlsx(Arrival_By_Citizenship,"C:\\Users\\dkbs0\\OneDrive\\Desktop\\file_name.xlsx")

# read in the country code list
codeList <- read.csv("https://datahub.io/core/country-list/r/data.csv", encoding = "UTF-8")
head(codeList)

##             Name Code
## 1    Afghanistan   AF
## 2  Åland Islands   AX
## 3        Albania   AL
## 4        Algeria   DZ
## 5 American Samoa   AS
## 6        Andorra   AD

str(codeList)

## 'data.frame':    249 obs. of  2 variables:
##  $ Name: chr  "Afghanistan" "Åland Islands" "Albania" "Algeria" ...
##  $ Code: chr  "AF" "AX" "AL" "DZ" ...

##write_xlsx(codeList,"C:\\Users\\dkbs0\\OneDrive\\Desktop\\codeList.xlsx")
##states<-read.csv(curl("https://raw.githubusercontent.com/brsingh7/DATA607/main/states.csv"))

mapdata<-map_data("world")
head(mapdata)

##        long      lat group order region subregion
## 1 -69.89912 12.45200     1     1  Aruba      <NA>
## 2 -69.89571 12.42300     1     2  Aruba      <NA>
## 3 -69.94219 12.43853     1     3  Aruba      <NA>
## 4 -70.00415 12.50049     1     4  Aruba      <NA>
## 5 -70.06612 12.54697     1     5  Aruba      <NA>
## 6 -70.05088 12.59707     1     6  Aruba      <NA>

str(mapdata)

## 'data.frame':    99338 obs. of  6 variables:
##  $ long     : num  -69.9 -69.9 -69.9 -70 -70.1 ...
##  $ lat      : num  12.5 12.4 12.4 12.5 12.5 ...
##  $ group    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ order    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ region   : chr  "Aruba" "Aruba" "Aruba" "Aruba" ...
##  $ subregion: chr  NA NA NA NA ...

Since the country information in the Kaggle data did not line up with the region names in the map package data, I exported the data to Excel, worked on it manually, and uploaded the result to GitHub to pull back into R.
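Part of this manual step could probably be done in R with a join on the country code, followed by a check for codes that found no match. The sketch below uses small made-up dataframes (the code "XX" is a deliberately invalid placeholder) and base R merge(). It is an illustration of the idea, not the process used to build the files loaded below.

```r
# Made-up totals by citizenship code; "XX" stands in for an unknown code
totals <- data.frame(Code = c("UA", "PL", "XX"), Total = c(500, 40, 3))

# Made-up extract of a code-to-name lookup like codeList
codes <- data.frame(Name = c("Ukraine", "Poland"), Code = c("UA", "PL"))

# Left join: keep every row of totals, attach the name where the code matches
named <- merge(totals, codes, by = "Code", all.x = TRUE)

# Codes with no matching country name need manual review
unmatched <- named[is.na(named$Name), ]
unmatched
```

The same pattern can be repeated between the country names and the region column of the map data to list the names that need to be reconciled by hand.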

df3 <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-607/main/ukraine_refugee.csv")
colnames(df3) <- c("region", "Code","Total")
head(df3)

##                 region Code Total
## 1              Andorra   AD    21
## 2 United Arab Emirates   AE     7
## 3          Afghanistan   AF  3860
## 4  Antigua and Barbuda   AG     5
## 5             Anguilla   AI     0
## 6              Albania   AL    88

mapdata1 <- read.csv("https://raw.githubusercontent.com/deepasharma06/Data-607/main/mapdata.csv")
head(mapdata1)

##    region     long      lat group order subregion Code Total
## 1 Andorra 1.706055 42.50332    10   939             AD    21
## 2 Andorra 1.678515 42.49668    10   940             AD    21
## 3 Andorra 1.586426 42.45596    10   941             AD    21
## 4 Andorra 1.534082 42.44170    10   942             AD    21
## 5 Andorra 1.486231 42.43447    10   943             AD    21
## 6 Andorra 1.448828 42.43745    10   944             AD    21

The code below creates a new dataframe for a plot of the top 10 countries of citizenship among the people who crossed the border.

df4 <- df3 %>% top_n(10,Total)
df4 <- df4 %>% select(-Code)
df4 = df4[order(-df4$Total), ]
head(df4)

##           region   Total
## 8        Ukraine 5624176
## 5         Poland  139741
## 2        Germany   23923
## 6        Romania   19290
## 10    Uzbekistan   14841
## 9  United States   14657

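# The totals are plotted on a log scale because Ukraine dwarfs all other countries
ggplot(df4, aes(region, log(Total), fill = region)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Country of citizenship", y = "log(Total)", subtitle = "Ukraine Refugee by Citizenship")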

Analysis of refugee count by border crossing

# Total number of people per border guard post
arrive_in_poland_by_border_crossing <- df1 %>%
  group_by(Border_Guard_Post) %>%
  dplyr::summarise(Total_Arrival = sum(Total_Arrival)) %>%
  as.data.frame()

arrive_in_poland_by_border_crossing

##   Border_Guard_Post Total_Arrival
## 1       Dolhobyczow        482305
## 2          Dorohusk        925025
## 3    Horyniec-Zdroj           454
## 4          Hrebenne        804245
## 5        Hrubieszow        497972
## 6          Korczowa        958894
## 7        Kroscienko        254821
## 8          Lubaczow        506880
## 9            Medyka       1634861
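The table above is in alphabetical order, which makes the ranking hard to read. A short sketch that re-enters the nine totals printed above, sorts them from largest to smallest and adds each post's share of the overall total:

```r
# Totals copied from the output above
posts <- data.frame(
  Border_Guard_Post = c("Dolhobyczow", "Dorohusk", "Horyniec-Zdroj", "Hrebenne",
                        "Hrubieszow", "Korczowa", "Kroscienko", "Lubaczow", "Medyka"),
  Total_Arrival = c(482305, 925025, 454, 804245, 497972,
                    958894, 254821, 506880, 1634861)
)

# Sort from the busiest post to the quietest
ranked <- posts[order(-posts$Total_Arrival), ]

# Share of all recorded crossings, in percent
ranked$Share <- round(100 * ranked$Total_Arrival / sum(ranked$Total_Arrival), 1)

ranked
```

In the analysis itself the same ordering could be obtained by adding arrange(desc(Total_Arrival)) to the pipeline.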

ggplot(arrive_in_poland_by_border_crossing, aes(Border_Guard_Post, Total_Arrival, fill = Border_Guard_Post)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Border Guard Post", y = "Total Arrival", subtitle = "Ukraine Refugee by Border Crossing")

Note: Horyniec-Zdroj is in the data, but its total (454) is so small compared to the other border guard posts that its bar is too short to be visible in the plot.
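One possible way to make a very small total visible is a dot plot on a log10 axis. The sketch below is an alternative presentation, not the plot used above, and it re-enters three of the totals from the table for illustration.

```r
library(ggplot2)

# Three of the totals from the table above, enough to show the effect
posts <- data.frame(
  Border_Guard_Post = c("Medyka", "Korczowa", "Horyniec-Zdroj"),
  Total_Arrival = c(1634861, 958894, 454)
)

# Points instead of bars, ordered by total, on a log10 x axis
p <- ggplot(posts, aes(x = Total_Arrival,
                       y = reorder(Border_Guard_Post, Total_Arrival))) +
  geom_point(size = 3) +
  scale_x_log10(labels = scales::comma) +
  labs(x = "Total Arrival (log10 scale)", y = "Border Guard Post",
       subtitle = "Ukraine Refugee by Border Crossing")

p
```

Points are used instead of bars because bar length is harder to interpret on a log axis, where there is no zero baseline.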

Part 4:

Challenges: The code below plots, on a world map, the number of refugees leaving Ukraine for Poland by their citizenship. I found that the data set used for this analysis did not match well with the world map packages available in R. For example, some country codes did not match an existing country code, and there was no way to figure out which country the data set was referring to because the country name was not available. Similarly, the border crossing names were loosely defined and some of these locations were not available in the map data. So I had to manually update the data, load it into GitHub and then pull it into R for the plot. Even so, there are imperfections in the map, and the layout and color coding are not as expected.

map1 <- ggplot(mapdata1, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = Total), color = "black")
map1

Part 5:

Conclusion: Based on the above analysis, I found that most of the refugees were coming in through the Medyka border crossing, followed by Korczowa, Dorohusk, Hrebenne, Lubaczow, Hrubieszow, Dolhobyczow and Kroscienko. The number of refugees was the lowest at Horyniec-Zdroj. This suggests a concentration of refugee arrivals in south-eastern Poland, which makes sense as Ukraine is under attack from the north and people fleeing the conflict are travelling towards Poland from as far south in Ukraine as possible.

International aid agencies can use the above analysis to make aid distribution decisions so that there is equity in such distributions based on need.

Another interesting finding in this analysis is that even Russian citizens in Ukraine who have been affected by the war have entered Poland. One thing that I am interested in understanding is whether these Russian citizens will eventually choose to enter Russia or seek refuge in a third country. There is also a question about how the Polish people and the international aid agencies are treating Russian refugees. I believe everyone affected by war needs to be taken care of irrespective of their national origin, but there was no data to support that analysis at this time.