The Data

The data is the UN’s estimated refugee stock at mid-year, by country, for 1990 to 2015 in five-year increments. Key observations include:

  • Three in One - The data set comprises three distinct, yet related, data sets.
  • Need for Conversion - The numeric data is read in as character and will need to be converted back to numeric.
  • Selects to get to working sets - Certain data subsets will need to be removed.
  • Missing and Incomplete Data - Incomplete and/or missing data will need to be dealt with.
  • Skip Rows - Rows will need to be skipped during data import because of blank space and a fat header.

My Game Plan

My game plan for this data set follows:

  • Use readr to bring in the data (will need to skip numerous rows)
  • Select a subset of data for my wrangling and analysis
  • Clean up the data, including converting from text to numeric, fixing column headers, and removing missing / incomplete data
  • Employ a list column and develop models and/or plots for each region
  • Analyze the models / plots
  • Leverage chapter 20 of R for Data Science (Gapminder analysis)

My Questions

Transform Data

The steps to transform the dataset are set forth below:

  1. Import Data

Use readr with a skip parameter of 15 to import the data. The import was fairly clean.

## # A tibble: 6 x 32
##      X1 X2    X3       X4 X5    `1990` `1995` `2000` `2005` `2010` `2015`
##   <dbl> <chr> <chr> <dbl> <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1     1 WORLD <NA>    900 <NA>  18 83~ 17 85~ 15 82~ 13 27~ 15 37~ 19 57~
## 2     2 Deve~ (b)     901 <NA>  2 014~ 3 609~ 2 997~ 2 361~ 2 046~ 1 954~
## 3     3 Deve~ (c)     902 <NA>  16 82~ 14 24~ 12 83~ 10 91~ 13 32~ 17 62~
## 4     4 Leas~ (d)     941 <NA>  5 048~ 5 160~ 3 047~ 2 363~ 1 957~ 3 443~
## 5     5 Less~ <NA>    934 <NA>  11 77~ 9 084~ 9 783~ 8 551~ 11 36~ 14 17~
## 6     6 Sub-~ (e)     947 <NA>  5 516~ 5 747~ 3 421~ 2 555~ 2 215~ 3 638~
## # ... with 21 more variables: `1990_1` <chr>, `1995_1` <chr>,
## #   `2000_1` <chr>, `2005_1` <chr>, `2010_1` <dbl>, `2015_1` <dbl>,
## #   `1990-1995` <chr>, `1995-2000` <chr>, `2000-2005` <chr>,
## #   `2005-2010` <chr>, `2010-2015` <chr>, X23 <lgl>, X24 <lgl>, X25 <lgl>,
## #   X26 <lgl>, X27 <lgl>, X28 <lgl>, X29 <lgl>, X30 <lgl>, X31 <lgl>,
## #   X32 <lgl>
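An import along these lines can be sketched as follows. The file format (CSV), the temporary file, and its contents are assumptions made so the example runs on its own; only the use of readr and `skip = 15` comes from the text above.

```r
library(readr)

# Hypothetical stand-in for the UN file: 15 lines of titles and notes,
# followed by the real header row and two data rows.
path <- tempfile(fileext = ".csv")
writeLines(
  c(
    rep("title, notes and blank space", 15),
    "id,country,loc_code,1990,1995",
    "1,WORLD,900,18 836 571,17 853 840",
    "2,Developed regions,901,2 014 564,3 609 670"
  ),
  path
)

# skip = 15 drops the fat header, so line 16 supplies the column names.
refugees_raw <- read_csv(path, skip = 15)
refugees_raw
```

Because the counts use spaces as thousands separators, readr reads them in as character columns, which matches the `<chr>` types in the output above.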
  2. Get To A Good Starting Point

Clean up the column names and select the columns I’d like to work with.

| id | country | loc_code | 1990 | 1995 | 2000 | 2005 | 2010 | 2015 |
|---|---|---|---|---|---|---|---|---|
| 1 | WORLD | 900 | 18 836 571 | 17 853 840 | 15 827 803 | 13 276 733 | 15 370 755 | 19 577 474 |
| 2 | Developed regions | 901 | 2 014 564 | 3 609 670 | 2 997 256 | 2 361 229 | 2 046 917 | 1 954 224 |
| 3 | Developing regions | 902 | 16 822 007 | 14 244 170 | 12 830 547 | 10 915 504 | 13 323 838 | 17 623 250 |
| 4 | Least developed countries | 941 | 5 048 391 | 5 160 131 | 3 047 488 | 2 363 782 | 1 957 884 | 3 443 582 |
| 5 | Less developed regions excluding least developed countries | 934 | 11 773 616 | 9 084 039 | 9 783 059 | 8 551 722 | 11 365 954 | 14 179 668 |
| 6 | Sub-Saharan Africa | 947 | 5 516 042 | 5 747 830 | 3 421 165 | 2 555 099 | 2 215 890 | 3 638 433 |
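A minimal sketch of this clean-up, including the text-to-numeric conversion from the game plan, might look like this. The small `raw` tibble stands in for the imported data, and `to_number` is a hypothetical helper, not something taken from the original analysis.

```r
library(dplyr)

# Stand-in for the imported data, using the default column names
# shown in the import output (X1, X2, ...).
raw <- tibble(
  X1 = c(1, 2),
  X2 = c("WORLD", "Developed regions"),
  X3 = c(NA, "(b)"),
  X4 = c(900, 901),
  `1990` = c("18 836 571", "2 014 564"),
  `1995` = c("17 853 840", "3 609 670")
)

# Hypothetical helper: strip the whitespace used as a thousands
# separator, then convert the remaining digits to numeric.
to_number <- function(x) as.numeric(gsub("\\s", "", x))

refugees <- raw %>%
  select(id = X1, country = X2, loc_code = X4, `1990`, `1995`) %>%
  mutate(across(c(`1990`, `1995`), to_number))

refugees
```

Renaming inside `select()` keeps the column clean-up and the column selection in one step.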
  3. Data Wrangling / Transform

I’m going to use ggplot to take a look at the data to see what it tells me.

There is a lot going on in the plot: some countries have declining refugee counts, others are increasing, and for others it’s hard to tell because the change is slight. With so many countries, there is a signal-to-noise problem. To deal with the large number of countries, I will create a nested data frame and continue my analysis. Using plotly lets me inspect the plot data easily.
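A plot of this kind needs one row per country and year. The sketch below reshapes a small wide table and draws one line per country; the two-country sample (values taken from the output later in this post) and the object names are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Small wide-format sample: one row per country, one column per year.
refugees_wide <- tibble(
  country = c("Algeria", "Angola"),
  `1990` = c(169107, 12000),
  `1995` = c(192489, 11404),
  `2000` = c(167453, 12579)
)

# Reshape to long format and make year numeric for plotting and modelling.
refugees_long <- refugees_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "refugees") %>%
  mutate(year = as.integer(year))

# One semi-transparent line per country.
p <- ggplot(refugees_long, aes(x = year, y = refugees, group = country)) +
  geom_line(alpha = 0.4) +
  labs(x = "Year", y = "Refugees (mid-year stock)")

# If plotly is installed, ggplotly() adds hover and zoom to the same plot.
if (requireNamespace("plotly", quietly = TRUE)) {
  p_interactive <- plotly::ggplotly(p)
}
```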

  4. Create A List Column

Use the nest() function to create a nested data frame.

## # A tibble: 112 x 3
##    country    region                          data            
##    <chr>      <chr>                           <list>          
##  1 Albania    Europe                          <tibble [6 x 4]>
##  2 Algeria    Africa                          <tibble [6 x 4]>
##  3 Angola     Africa                          <tibble [6 x 4]>
##  4 Argentina  Latin America and the Caribbean <tibble [6 x 4]>
##  5 Australia  Oceania                         <tibble [6 x 4]>
##  6 Austria    Europe                          <tibble [6 x 4]>
##  7 Bahrain    Asia                            <tibble [6 x 4]>
##  8 Bangladesh Asia                            <tibble [6 x 4]>
##  9 Belgium    Europe                          <tibble [6 x 4]>
## 10 Belize     Latin America and the Caribbean <tibble [6 x 4]>
## # ... with 102 more rows
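The nesting step can be sketched as follows, in the style of the Gapminder example from R for Data Science. The sample data is a two-country stand-in; in the output above each nested tibble also carries the id and loc_code columns.

```r
library(dplyr)
library(tidyr)

# Long-format sample: one row per country and year.
refugees_long <- tibble(
  country = rep(c("Algeria", "Angola"), each = 3),
  region = "Africa",
  year = rep(c(1990, 1995, 2000), times = 2),
  refugees = c(169107, 192489, 167453, 12000, 11404, 12579)
)

# Group by the columns that identify a country, then nest the rest
# into a list column named `data` (one tibble per country).
by_country <- refugees_long %>%
  group_by(country, region) %>%
  nest() %>%
  ungroup()

by_country
```

Each element of `by_country$data` is a tibble holding the year and refugees columns for one country.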
  5. Create a Country Model

Here I fit the model using the country_model function and add the model residuals. This will enable me to plot the model residuals.

## # A tibble: 112 x 5
##    country                  region data             model   resids         
##    <chr>                    <chr>  <list>           <list>  <list>         
##  1 Algeria                  Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  2 Angola                   Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  3 Benin                    Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  4 Botswana                 Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  5 Burkina Faso             Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  6 Burundi                  Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  7 Côte d'Ivoire            Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  8 Cameroon                 Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
##  9 Central African Republic Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
## 10 Congo                    Africa <tibble [6 x 4]> <S3: l~ <tibble [6 x 5~
## # ... with 102 more rows
## # A tibble: 672 x 7
##    country region    id loc_code year  refugees     resid
##    <chr>   <chr>  <dbl>    <dbl> <chr>    <dbl>     <dbl>
##  1 Algeria Africa    40       12 1990    169107 -1.16e-10
##  2 Algeria Africa    40       12 1995    192489 -2.91e-11
##  3 Algeria Africa    40       12 2000    167453 -5.82e-11
##  4 Algeria Africa    40       12 2005     94101 -5.82e-11
##  5 Algeria Africa    40       12 2010     94144 -4.37e-11
##  6 Algeria Africa    40       12 2015     94144 -2.91e-11
##  7 Angola  Africa    30       24 1990     12000 -9.09e-12
##  8 Angola  Africa    30       24 1995     11404 -3.64e-12
##  9 Angola  Africa    30       24 2000     12579 -3.64e-12
## 10 Angola  Africa    30       24 2005     13867 -3.64e-12
## # ... with 662 more rows
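The body of country_model is not shown, so the sketch below assumes a simple linear trend, `refugees ~ year`, as in the Gapminder example. It also assumes `year` is numeric. In the output above `year` is a character column and the residuals are all effectively zero, which is what `lm()` would produce if `year` were treated as a categorical variable with one level per observation.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

refugees_long <- tibble(
  country = rep(c("Algeria", "Angola"), each = 3),
  region = "Africa",
  year = rep(c(1990, 1995, 2000), times = 2),  # numeric, not character
  refugees = c(169107, 192489, 167453, 12000, 11404, 12579)
)

# Assumed model: a straight-line trend in refugees over time.
country_model <- function(df) {
  lm(refugees ~ year, data = df)
}

by_country <- refugees_long %>%
  group_by(country, region) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    model = map(data, country_model),          # one lm per country
    resids = map2(data, model, add_residuals)  # adds a `resid` column
  )

# Flatten the residuals back into a regular tibble for plotting.
resids <- by_country %>%
  select(country, region, resids) %>%
  unnest(resids)

resids
```

With a numeric `year`, the residuals are no longer forced to zero, so they show how far each country departs from a straight-line trend.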
  6. Review Model Residuals

With the model residuals added in the previous step, I can now plot them. I will use plotly so I can work with the plots to identify insights.

The model seems to work fairly well, but several countries in each region appear to fit poorly. This indicates the refugee data is not linear for every country.
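One way to review the residuals is sketched below: plot them by region, then rank countries by their largest absolute residual. The sample data, the linear model, and the `worst_fit` summary are assumptions for illustration. The ranking depends on the scale of each country's counts, so it is only a rough first screen.

```r
library(dplyr)
library(ggplot2)

# Sample residuals from a per-country linear trend (numeric year assumed).
resids <- tibble(
  country = rep(c("Algeria", "Angola"), each = 3),
  region = "Africa",
  year = rep(c(1990, 1995, 2000), times = 2),
  refugees = c(169107, 192489, 167453, 12000, 11404, 12579)
) %>%
  group_by(country) %>%
  mutate(resid = residuals(lm(refugees ~ year))) %>%
  ungroup()

# Residuals over time, one line per country, one panel per region.
p <- ggplot(resids, aes(x = year, y = resid, group = country)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line(alpha = 0.5) +
  facet_wrap(~ region) +
  labs(x = "Year", y = "Residual (refugees)")

# Countries with the largest departures from a straight line come first.
worst_fit <- resids %>%
  group_by(region, country) %>%
  summarise(max_abs_resid = max(abs(resid)), .groups = "drop") %>%
  arrange(desc(max_abs_resid))

worst_fit
```

Passing `p` to `plotly::ggplotly()` would make the lines hoverable, which helps match an outlying line to a country name.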

Answers