library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)

For project 2, I chose to work with the New York Health Department Restaraunt Data, The Doctor Who Time Travel Data, and the UN Migrant Stock Data.

New York Department of Health and Mental Hygeine Restaurant Data

1.) For the New York Health Department Restaraunt data, I was asked to find the most common violations that shut down restaurants, how likely they are to be reopened if they are shut down and which zip codes have the most violations. I chose to tackle the data by looking at the classes of critical problems (mice, roaches, improper food storage temperature, things that could immediately harm someone’s health) that cause shut downs vs. non-critical problems (food storage surfaces having defects, openings that could allow in vermin, things that are not immediately hazardous, but could be so).

Given the number of restaraunts in New York, I worked with an approximately 6 month window.

The data is already quite tall, so instead of making it narrower, I instead widened it out with tidyr.

(There were some characters that R read as end of file characters that I had to edit out with a text editor, thus the file name.)

NTCdata<-read.csv("DOHMH_New_York_City_Restaurant_Inspection_Resultsfixed.csv")

There are three classes of violations in terms of whether or not they were critical. Those are critical, non-critical and not applicable. I eliminated the not applicable ones, as they were things like customers smoking, that were not actually relevant to food safety.

NTCApplicable<-NTCdata[!str_detect(NTCdata$CRITICAL.FLAG,"Applicable"),]

I then spread out the data by dividing those with the critical flag. This is making it slightly less tidy, but it allows a quick separation of critical and non-critical violations.

NTCwide<-spread(NTCApplicable,CRITICAL.FLAG,VIOLATION.DESCRIPTION)
critical<-NTCwide[!is.na(NTCwide$Critical),]
noncritical<-NTCwide[is.na(NTCwide$Critical),]

Having done this, I can now look at the closures for both critical and non-critical violations.

critclose<-critical[str_detect(critical$ACTION,"closed")|str_detect(critical$ACTION,"Closed"),]
summary(critclose)
##      CAMIS                                          DBA      
##  Min.   :40362098   RESTAURANTE & PANADERIA GUATELINDA:  25  
##  1st Qu.:41580756   BELLA PIZZA                       :  19  
##  Median :50035430   B&B HALALL RESTAURANT             :  17  
##  Mean   :47035182   THE AINSWORTH                     :  16  
##  3rd Qu.:50069068   CROWN FRIED CHICKEN               :  15  
##  Max.   :50083289   KENNEDY FRIED CHICKEN             :  15  
##                     (Other)                           :2946  
##             BORO         BUILDING              STREET        ZIPCODE     
##  BRONX        : 291   170-18 :  25   BROADWAY     :  98   Min.   :10001  
##  BROOKLYN     : 877   1      :  20   2ND AVE      :  63   1st Qu.:10029  
##  MANHATTAN    :1038   200    :  20   3RD AVE      :  57   Median :11104  
##  QUEENS       : 751   12     :  17   AMSTERDAM AVE:  44   Mean   :10744  
##  STATEN ISLAND:  96   234    :  17   8TH AVE      :  39   3rd Qu.:11232  
##                       67A    :  17   UTICA AVE    :  37   Max.   :11693  
##                       (Other):2937   (Other)      :2715   NA's   :31     
##         PHONE     
##  2124706169:  25  
##  3472958422:  17  
##  2127410646:  16  
##  7185283900:  15  
##  2125690555:  12  
##  3476172383:  12  
##  (Other)   :2956  
##                                                        CUISINE.DESCRIPTION
##  Chinese                                                         : 415    
##  American                                                        : 409    
##  Latin (Cuban, Dominican, Puerto Rican, South & Central American): 170    
##  Caribbean                                                       : 159    
##  Spanish                                                         : 157    
##  Pizza                                                           : 140    
##  (Other)                                                         :1603    
##    INSPECTION.DATE
##  09/20/2018:  59  
##  06/18/2018:  58  
##  06/14/2018:  52  
##  06/21/2018:  51  
##  08/16/2018:  50  
##  08/29/2018:  50  
##  (Other)   :2733  
##                                                                                                                                  ACTION    
##  Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.:2767  
##  Establishment re-closed by DOHMH                                                                                                   : 286  
##  Establishment re-opened by DOHMH                                                                                                   :   0  
##  No violations were recorded at the time of this inspection.                                                                        :   0  
##  Violations were cited in the following area(s).                                                                                    :   0  
##                                                                                                                                            
##                                                                                                                                            
##  VIOLATION.CODE     SCORE                   GRADE           GRADE.DATE  
##  04L    : 537   Min.   :  5.00                 :2948             :3049  
##  02G    : 315   1st Qu.: 43.00   A             :   0   05/02/2018:   4  
##  04N    : 293   Median : 54.00   B             :   0   04/03/2018:   0  
##  04M    : 258   Mean   : 57.34   C             :   4   04/04/2018:   0  
##  06C    : 236   3rd Qu.: 70.00   Not Yet Graded: 101   04/05/2018:   0  
##  02B    : 204   Max.   :156.00   P             :   0   04/06/2018:   0  
##  (Other):1210                    Z             :   0   (Other)   :   0  
##      RECORD.DATE                                           INSPECTION.TYPE
##  10/05/2018:3053   Cycle Inspection / Initial Inspection           :1515  
##                    Cycle Inspection / Re-inspection                : 612  
##                    Pre-permit (Operational) / Initial Inspection   : 425  
##                    Cycle Inspection / Reopening Inspection         : 234  
##                    Pre-permit (Operational) / Re-inspection        : 139  
##                    Pre-permit (Operational) / Compliance Inspection:  50  
##                    (Other)                                         :  78  
##                                                                                                                                                                                                                                                                                              Critical   
##  Evidence of mice or live mice present in facility's food and/or non-food areas.                                                                                                                                                                                                                 : 537  
##  Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation.                                                                                                                                                          : 315  
##  Filth flies or food/refuse/sewage-associated (FRSA) flies present in facilitys food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.: 293  
##  Live roaches present in facility's food and/or non-food areas.                                                                                                                                                                                                                                  : 258  
##  Food not protected from potential source of contamination during storage, preparation, transportation, display or service.                                                                                                                                                                      : 236  
##  Hot food item not held at or above 140º F.                                                                                                                                                                                                                                                   : 204  
##  (Other)                                                                                                                                                                                                                                                                                         :1210  
##                                                                                                                                                                                         Not Critical 
##                                                                                                                                                                                               :   0  
##  '''Wash hands sign not posted at hand wash facility.                                                                                                                                         :   0  
##  A food containing artificial trans fat, with 0.5 grams or more of trans fat per serving, is being stored, distributed, held for service, used in preparation of a menu item, or served.      :   0  
##  Accurate thermometer not provided in refrigerated or hot holding equipment.                                                                                                                  :   0  
##  Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding.:   0  
##  (Other)                                                                                                                                                                                      :   0  
##  NA's                                                                                                                                                                                         :3053
noncritclose<-noncritical[str_detect(noncritical$ACTION,"closed")|str_detect(noncritical$ACTION,"Closed"),]
summary(noncritclose)
##      CAMIS                                          DBA      
##  Min.   :40362098   RESTAURANTE & PANADERIA GUATELINDA:  10  
##  1st Qu.:41530497   B&B HALALL RESTAURANT             :   9  
##  Median :50017751   VACCAROS BAKERY                   :   9  
##  Mean   :46922419   CROWN FRIED CHICKEN               :   8  
##  3rd Qu.:50064922   KENNEDY FRIED CHICKEN             :   8  
##  Max.   :50083289   MAYA CUISINE                      :   8  
##                     (Other)                           :1509  
##             BORO        BUILDING              STREET        ZIPCODE     
##  BRONX        :146   176    :  11   BROADWAY     :  56   Min.   :10001  
##  BROOKLYN     :452   160    :  10   3RD AVE      :  34   1st Qu.:10027  
##  MANHATTAN    :544   170-18 :  10   2ND AVE      :  32   Median :11103  
##  QUEENS       :373   2805   :   9   AMSTERDAM AVE:  20   Mean   :10732  
##  STATEN ISLAND: 46   67A    :   9   8TH AVE      :  19   3rd Qu.:11226  
##                      1      :   8   STEINWAY ST  :  19   Max.   :11693  
##                      (Other):1504   (Other)      :1381   NA's   :16     
##         PHONE      CUISINE.DESCRIPTION   INSPECTION.DATE
##  2124706169:  10   American :215       09/20/2018:  37  
##  3472958422:   9   Chinese  :210       06/14/2018:  33  
##  3476172383:   9   Spanish  : 81       06/18/2018:  24  
##  7187773131:   8   Caribbean: 80       09/17/2018:  24  
##  212_933_10:   7   Indian   : 74       06/19/2018:  22  
##  2123880885:   7   Mexican  : 69       07/11/2018:  22  
##  (Other)   :1511   (Other)  :832       (Other)   :1399  
##                                                                                                                                  ACTION    
##  Establishment Closed by DOHMH.  Violations were cited in the following area(s) and those requiring immediate action were addressed.:1315  
##  Establishment re-closed by DOHMH                                                                                                   : 246  
##  Establishment re-opened by DOHMH                                                                                                   :   0  
##  No violations were recorded at the time of this inspection.                                                                        :   0  
##  Violations were cited in the following area(s).                                                                                    :   0  
##                                                                                                                                            
##                                                                                                                                            
##  VIOLATION.CODE     SCORE                   GRADE           GRADE.DATE  
##  08A    :767    Min.   :  2.00                 :1525             :1560  
##  10F    :268    1st Qu.: 38.00   A             :   0   05/02/2018:   1  
##  10B    :208    Median : 49.00   B             :   0   04/03/2018:   0  
##  08C    : 72    Mean   : 50.72   C             :   1   04/04/2018:   0  
##  10H    : 50    3rd Qu.: 61.00   Not Yet Graded:  35   04/05/2018:   0  
##  09C    : 49    Max.   :156.00   P             :   0   04/06/2018:   0  
##  (Other):147    NA's   :28       Z             :   0   (Other)   :   0  
##      RECORD.DATE                                          INSPECTION.TYPE
##  10/05/2018:1561   Cycle Inspection / Initial Inspection          :738   
##                    Cycle Inspection / Re-inspection               :305   
##                    Cycle Inspection / Reopening Inspection        :192   
##                    Pre-permit (Operational) / Initial Inspection  :146   
##                    Pre-permit (Operational) / Re-inspection       : 66   
##                    Pre-permit (Operational) / Reopening Inspection: 43   
##                    (Other)                                        : 71   
##                                                                                                                                                                                           Critical   
##                                                                                                                                                                                               :   0  
##  '''Wash hands sign not posted at hand wash facility.                                                                                                                                         :   0  
##  A food containing artificial trans fat, with 0.5 grams or more of trans fat per serving, is being stored, distributed, held for service, used in preparation of a menu item, or served.      :   0  
##  Accurate thermometer not provided in refrigerated or hot holding equipment.                                                                                                                  :   0  
##  Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding.:   0  
##  (Other)                                                                                                                                                                                      :   0  
##  NA's                                                                                                                                                                                         :1561  
##                                                                                                                                                                                                                                                                           Not Critical
##  Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.                                                                                                                                             :767  
##  Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.:268  
##  Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.                                             :208  
##  Pesticide use not in accordance with label or applicable laws. Prohibited chemical used/stored. Open bait station used.                                                                                                                                                        : 72  
##  Proper sanitization not provided for utensil ware washing operation.                                                                                                                                                                                                           : 50  
##  Food contact surface not properly maintained.                                                                                                                                                                                                                                  : 49  
##  (Other)                                                                                                                                                                                                                                                                        :147

The top critical closure reasons are mice and temperature control problems. The top non-critical reasons are not being vermin proof, bad food storage, and plumbing problems, including not having drains in the floor.

To look at which restaurants were closed and reopened, I subsetted the critical and non-critical data sets by those that the health department reopened and then did an inner join to see which ones had been closed during this time period. Oddly enough, it seems that there are more results for these than there were initial elements, this indicates that restaurants are being closed and re-opened more than once.

noncritreopen<-noncritical[str_detect(noncritical$ACTION,"re-open"),]
noncritcloseandreopen<-inner_join(noncritclose,noncritreopen,by = "CAMIS")
critreopen<-critical[str_detect(critical$ACTION,"re-open"),]
critclosereopen<-inner_join(critclose,critreopen,by="CAMIS")
nrow(noncritclose)
## [1] 1561
nrow(noncritcloseandreopen)
## [1] 1662
nrow(critclose)
## [1] 3053
nrow(critclosereopen)
## [1] 1070

In order to look only on the restaurant basis, no matter if they had multiple closures, I used the distinct() function.

NCCU<-distinct(noncritclose,CAMIS)
NCCRU<-distinct(noncritcloseandreopen,CAMIS)
nrow(NCCRU)/nrow(NCCU)
## [1] 0.7701318
CCU<-distinct(critclose,CAMIS)
CCRU<-distinct(critclosereopen,CAMIS)
nrow(CCRU)/nrow(CCU)
## [1] 0.2863014

This shows that restaurants that are shut down for critical issues are much less likely to reopen in the six month window than ones that are shut down for non-critical reasons. Apparently, once you have mice or cockroaches, it’s hard to come back from that.

Finally, I checked to see which zip codes had the most health code violations.

zips<-as.factor(NTCdata$ZIPCODE)
summary(zips)
##   10003   10019   10036   10013   10002   10016   11354   10012   10001 
##    2138    1927    1904    1748    1656    1454    1411    1373    1370 
##   10014   10011   10022   11220   10017   11372   10018   11201   11215 
##    1309    1273    1217    1191    1150    1103    1089    1020     974 
##   11368   11101   11211   10009   11209   11355   11385   11226   11237 
##     945     868     836     823     815     806     794     778     773 
##   11217   11103   11373   10025   11214   11238   10029   11432   10010 
##     768     766     722     716     684     661     658     658     654 
##   10028   10458   11222   10024   11377   10038   11375   10314   10467 
##     631     630     627     624     612     611     583     579     565 
##   11204   10023   10128   11203   10065   11223   10021   11206   10451 
##     564     555     555     536     531     523     517     507     506 
##   11249   10032   11235   10027   11216   11106   11207   10461   11218 
##     505     503     487     483     479     460     459     452     450 
##   11234   10031   11232   10301   11212   11225   11231   10033   10463 
##     438     434     431     421     417     412     412     406     404 
##   11205   11229   11213   11358   10452   11230   11419   11435   10004 
##     401     393     392     391     390     390     390     385     377 
##   11208   10306   10034   11236   11105   11219   11221   10453   11418 
##     376     372     368     364     363     361     355     354     343 
##   10040   10468   10035   10457   10472   11104   10007   10462 (Other) 
##     342     336     335     334     329     314     309     303   12838 
##    NA's 
##    1239

This is not surprising, as the top two zip codes have some of the highest number of restaraunts in New York City, even if I am slightly disturbed, given that I used to work in 10019.

If instead we want to look at which zip codes get proportionally the most health code violations, we can use

numviol<-count(NTCdata,ZIPCODE)
RestUnique<-distinct(NTCdata,CAMIS,.keep_all = TRUE)
numrest<-count(RestUnique,ZIPCODE)
violratio<-cbind(numviol[,1],numviol[,2]%/%numrest[,2])
head(violratio[order(-violratio$n),],8)
##     ZIPCODE n
## 31    10032 6
## 39    10040 6
## 49    10111 6
## 68    10271 6
## 108   10474 6
## 168   11363 6
## 3     10002 5
## 4     10003 5

The zip codes 10032, 10040 and 10111 have the highest ratio of violations to restaurants.

Docto Who Data

2.) I was asked to see whether the old series or the new one had more time traveling and which Doctor traveled the most.

More of the work with this data set is in cleaning rather than tidying. There is also a typo present that I did not bother to correct, both because I wanted to work with the initial data set and second because it did not alter the resutls. One of the Doctors is recorded as going back to -400000000000, which is before the beginning of the universe. It was meant to be -4000000000, the beginning of life on earth, but the person writing this up slipped when entering 0s.
The initial data cleaning:

TheDoctor<-read.csv("DrWhoTimeTravel.csv",stringsAsFactors = FALSE)
TheDoctor<-select(TheDoctor,Doctor.Who.season:location)
TheDoctor<-TheDoctor[,c(1:5,7,6,8:11)]
TheDoctor[,1]<-as.numeric(TheDoctor[,1])
## Warning: NAs introduced by coercion
TheDoctor<-TheDoctor[!is.na(TheDoctor[,1]),]
TheDoctor[,6]<-as.numeric(gsub(",","",TheDoctor[,6]))
head(TheDoctor)
##   Doctor.Who.season doctor.actor    ep..no           episode.title  from
## 1                10      Tennant xmas 2006       the Runaway Bride  2007
## 2                 1     Hartnell         1      An Unearthly Child  1963
## 3                 1     Hartnell        21 The Daleks' Master Plan -2500
## 4                 3      Pertwee        64        The Time Monster  1972
## 5                 1     Hartnell        20         The Myth Makers -1200
## 6                 1     Hartnell        12              The Romans    64
##         to estimated.from estimated.          planet sub.location
## 1 -5.0e+09             NA                      Earth      England
## 2 -1.0e+05             NA                                     n/a
## 3 -2.5e+03           4000                      Earth             
## 4 -2.0e+03           2900            Earth/ Atlantis             
## 5 -1.2e+03           3999          y           Earth             
## 6  6.4e+01           2493                      Earth        Italy
##          location
## 1          London
## 2             n/a
## 3           Egypt
## 4                
## 5 Asia Minor/Troy
## 6            Rome

Now, I can gather up the starts and ends for the various Doctors to see when they started and ended. I first did this with the non-estimated data.

NDoctor<-gather(TheDoctor,Extremes,Years,5:6)
head(NDoctor)
##   Doctor.Who.season doctor.actor    ep..no           episode.title
## 1                10      Tennant xmas 2006       the Runaway Bride
## 2                 1     Hartnell         1      An Unearthly Child
## 3                 1     Hartnell        21 The Daleks' Master Plan
## 4                 3      Pertwee        64        The Time Monster
## 5                 1     Hartnell        20         The Myth Makers
## 6                 1     Hartnell        12              The Romans
##   estimated.from estimated.          planet sub.location        location
## 1             NA                      Earth      England          London
## 2             NA                                     n/a             n/a
## 3           4000                      Earth                        Egypt
## 4           2900            Earth/ Atlantis                             
## 5           3999          y           Earth              Asia Minor/Troy
## 6           2493                      Earth        Italy            Rome
##   Extremes Years
## 1     from  2007
## 2     from  1963
## 3     from -2500
## 4     from  1972
## 5     from -1200
## 6     from    64
NDoctor$Doctor.Who.season<-as.numeric(NDoctor$Doctor.Who.season)
OldWho<-NDoctor[NDoctor$Doctor.Who.season<9,]
NuWho<-NDoctor[NDoctor$Doctor.Who.season>8,]
GroupNuWho<-group_by(NuWho,Doctor.Who.season) #need to know about this in terms of both doctor and in terms of old vs nu
GroupOldWho<-group_by(OldWho,Doctor.Who.season)

summarise(GroupNuWho,total.years=abs(min(Years))+abs(max(Years)))
## # A tibble: 3 x 2
##   Doctor.Who.season total.years
##               <dbl>       <dbl>
## 1                 9     5.00e 9
## 2                10     1.00e14
## 3                11     5.45e 3
summarise(GroupOldWho,total.years=abs(min(Years))+abs(max(Years)))
## # A tibble: 8 x 2
##   Doctor.Who.season   total.years
##               <dbl>         <dbl>
## 1                 1   13000004000
## 2                 2          3936
## 3                 3          4540
## 4                 4 4000000001980
## 5                 5   13700002683
## 6                 6          4352
## 7                 7          5480
## 8                 8          3998

To include the estimations, I simply broadened the gather to include column 7. With the estimations, we have

NEDoctor<-gather(TheDoctor,Extremes,Years,5:7)
head(NEDoctor)
##   Doctor.Who.season doctor.actor    ep..no           episode.title
## 1                10      Tennant xmas 2006       the Runaway Bride
## 2                 1     Hartnell         1      An Unearthly Child
## 3                 1     Hartnell        21 The Daleks' Master Plan
## 4                 3      Pertwee        64        The Time Monster
## 5                 1     Hartnell        20         The Myth Makers
## 6                 1     Hartnell        12              The Romans
##   estimated.          planet sub.location        location Extremes Years
## 1                      Earth      England          London     from  2007
## 2                                     n/a             n/a     from  1963
## 3                      Earth                        Egypt     from -2500
## 4            Earth/ Atlantis                                  from  1972
## 5          y           Earth              Asia Minor/Troy     from -1200
## 6                      Earth        Italy            Rome     from    64
NEDoctor<-NEDoctor[!is.na(NEDoctor$Years),]
OldEWho<-NEDoctor[NEDoctor$Doctor.Who.season<9,]
NuEWho<-NEDoctor[NEDoctor$Doctor.Who.season>8,]
GroupNuEWho<-group_by(NuEWho,Doctor.Who.season) #need to know about this in terms of both doctor and in terms of old vs nu
GroupOldEWho<-group_by(OldEWho,Doctor.Who.season)
summarise(GroupNuEWho,total.years=abs(max(Years))-(min(Years)))
## # A tibble: 3 x 2
##   Doctor.Who.season total.years
##               <dbl>       <dbl>
## 1                 9     5.00e 9
## 2                10     1.00e14
## 3                11     5.24e 3
summarise(GroupOldEWho,total.years=abs(max(Years))-(min(Years)))
## # A tibble: 8 x 2
##   Doctor.Who.season   total.years
##               <dbl>         <dbl>
## 1                 1   13010000700
## 2                 2           444
## 3                 3          4900
## 4                 4 4000000037166
## 5                 5   13710000000
## 6                 6           708
## 7                 7          3894
## 8                 8            12
conciseNuEWho<-summarise(GroupNuEWho,total.years=abs(min(Years))+abs(max(Years)))
conciseOldEWho<-summarise(GroupOldEWho,total.years=abs(min(Years))+abs(max(Years)))
mean(conciseNuEWho$total.years)
## [1] 3.333667e+13
mean(conciseOldEWho$total.years)
## [1] 5.0334e+11

With both the estimations and without them, the 10th Doctor traveled the most and the 8th traveled the least. This is not surprising, as the 10th Doctor has an episode where he goes 100 trillion years into the future and the 8th Doctor was only in a TV movie that involved not much time travel.

Similarly, we can see that, mostly because of the orders of magnitude more travels of the 10th Doctor, the new series has more years covered than the old.

UN Migrant Stock Data

3.)For this data set, I was asked to examine trends in migrant movement over the last 25 years. I interpretted this as seeing which countries had the most migrants, which had the greatest change in migrants and which had the greates percentage change in migrants.

I start by loading in the xlsx turned csv file from row 16 down, as the fancy formatting of the Excel document takes 15 lines. This does have the downside of missing out on the sex differentiation labels, which I address with some string substitution and regex work. Also, as noted in the comments, the labels for the first 5 columns were on row 15, so I relabled the rows at this time.

Migrants<-read.csv("UN_MigrantStockTotal_2015 - Table 1.csv",skip=16,stringsAsFactors = FALSE)#skipping to the row with the year identifiers, .1 means male, .2 means female, will need to melt/gather for this
names(Migrants)[1:5]<-c("sort.order","major.area,region,country.or.area.of.destination","notes","country.code","type.of.data.(a)") #these names appear in the line above the years, thus renaming the columns
head(Migrants)
##   sort.order           major.area,region,country.or.area.of.destination
## 1          1                                                      WORLD
## 2          2                                          Developed regions
## 3          3                                         Developing regions
## 4          4                                  Least developed countries
## 5          5 Less developed regions excluding least developed countries
## 6          6                                         Sub-Saharan Africa
##   notes country.code type.of.data.(a)        X1990        X1995
## 1                900                   152 563 212  160 801 752
## 2   (b)          901                    82 378 628   92 306 854
## 3   (c)          902                    70 184 584   68 494 898
## 4   (d)          941                    11 075 966   11 711 703
## 5                934                    59 105 261   56 778 501
## 6   (e)          947                    14 690 319   15 324 570
##          X2000        X2005        X2010        X2015     X1990.1
## 1  172 703 309  191 269 100  221 714 243  243 700 236  77 747 510
## 2  103 375 363  117 181 109  132 560 325  140 481 955  40 263 397
## 3   69 327 946   74 087 991   89 153 918  103 218 281  37 484 113
## 4   10 077 824    9 809 634   10 018 128   11 951 316   5 843 107
## 5   59 244 124   64 272 611   79 130 668   91 262 036  31 641 006
## 6   13 716 539   13 951 086   15 496 764   18 993 986   7 745 306
##       X1995.1     X2000.1     X2005.1      X2010.1      X2015.1
## 1  81 737 477  87 884 839  97 866 674  114 613 714  126 115 435
## 2  45 092 799  50 536 796  57 217 777   64 081 077   67 618 619
## 3  36 644 678  37 348 043  40 648 897   50 532 637   58 496 816
## 4   6 142 712   5 361 902   5 383 009    5 462 714    6 463 217
## 5  30 501 966  31 986 141  35 265 888   45 069 923   52 033 599
## 6   8 036 824   7 210 452   7 444 048    8 188 581   10 099 486
##       X1990.2     X1995.2     X2000.2     X2005.2      X2010.2
## 1  74 815 702  79 064 275  84 818 470  93 402 426  107 100 529
## 2  42 115 231  47 214 055  52 838 567  59 963 332   68 479 248
## 3  32 700 471  31 850 220  31 979 903  33 439 094   38 621 281
## 4   5 236 216   5 573 685   4 721 920   4 432 371    4 560 536
## 5  27 464 255  26 276 535  27 257 983  29 006 723   34 060 745
## 6   6 945 013   7 287 746   6 506 087   6 507 038    7 308 183
##        X2015.2
## 1  117 584 801
## 2   72 863 336
## 3   44 721 465
## 4    5 493 028
## 5   39 228 437
## 6    8 894 500

We can see that the numbers are written as groups of three digits with spaces separating them. While this looks good in Excel and is removed when working there, a shift through Google docs to get a tab of the Excel file renders them into the .csv file. We have to deal with them if we want any useable data.

GMig<-apply(Migrants[,6:23],MARGIN = c(1,2),FUN = function(x) gsub("\\s","",x)) #each 3 number block has a space in front of it, need to get rid of those
Migrants[,6:23]<-GMig #that works, but it only gives me columns 6:23.  Need to put this all together
NMig<-apply(Migrants[,6:23],MARGIN = c(1,2),FUN = function(x) as.numeric(x)) #what GMig produced were all character strings.  They really are numbers, fixing that.  I'm going to have to do this again after gathering, so this may be extra.  I wish I could just do Migrants[,6:23], but R gets really annoyed at passing a vector into the as.numeric function
Migrants[,6:23]<-NMig

I then made the data set tall by moving years into a single columns.

MigrantsSkinny<-gather(Migrants,Year.and.Sex,number.of.migrants,6:23,na.rm = TRUE) #the NAs come from countries like South Sudan or East Timor that didn't exist in 1990 or 1995
MigrantsSkinnySpecific<-MigrantsSkinny

As I said above, I needed to now differentiate between the men, women and combined values. Given that the combined values were simply in the format of X, I had to use their length to do this for the combined ones. I found it best to write a simple function to use in the lapply command.

addcombine<- function(x){
  y<-ifelse(regexpr("$",x)[1]==6,paste(x,".combined",sep=""),x)
  return(y)
}

MigrantsSkinnySpecific$Year.and.Sex<-lapply(MigrantsSkinnySpecific$Year.and.Sex,FUN = function(x) addcombine(x)) 
MigrantsSkinnySpecific$Year.and.Sex<-str_replace(MigrantsSkinnySpecific$Year.and.Sex,"\\.1",".male")
MigrantsSkinnySpecific$Year.and.Sex<-str_replace(MigrantsSkinnySpecific$Year.and.Sex,"\\.2",".female")

We can now finally separate the year from the sexes of the migrants. As noted, the parenthese in the column name throw things off. Having unsuitable characters in column names is an issue I deal with repeatedly after this.

MigrantsSpecific<-separate(MigrantsSkinnySpecific,Year.and.Sex,into = c("Year","Sex"),sep="\\.")
MigrantsTidy<-MigrantsSpecific[,c(2,5:8)]
names(MigrantsTidy)[2]<-"type.of.data" #the (a) throws everything off

Also, the data is both on the country and region scale. To judge things by countries, I removed the regions.

MigrantsCountriesOnly<-subset(MigrantsTidy,grepl("[A-Z]+",type.of.data))
names(MigrantsCountriesOnly)[1]<-"country"

To see which country overall took in the most migrants, be they temporary or permanent, we can use the summarize command. This will double the actual number, but because we are looking for the largest number, this will not change the result.

MigrantsCountriesOnly<-group_by(MigrantsCountriesOnly,country)
totalmigrants<-summarise(MigrantsCountriesOnly,sum(number.of.migrants))
names(totalmigrants)[2]<-"total.number.of.migrants"
totalmigrants[which.max(totalmigrants$total.number.of.migrants),]
## # A tibble: 1 x 2
##   country                  total.number.of.migrants
##   <chr>                                       <dbl>
## 1 United States of America                433170340

Similarly, I check to see which country had the largest absolute change in number of migrants over the last 25 years.

biggestchange<-summarise(MigrantsCountriesOnly,max(number.of.migrants)-min(number.of.migrants))
names(biggestchange)[2]<-"difference"
biggestchange[biggestchange$difference==max(biggestchange$difference),] #which.max was not happy, thus the kludge
## # A tibble: 1 x 2
##   country                  difference
##   <chr>                         <dbl>
## 1 United States of America   35254117

In both cases it is the US. Given that the US maintains its population growth with immigrants, combined with workers coming over here to send money back home, students going to college/graduate school, a small number of refugees, and other immigrants, this is unsurprising.

I also wanted to see what the case was percentage wise. I started with the combined sexes numbers and looked first at total migrants vs. 1990 levels and then 2015 migrants vs. 1990 numbers.

X1990<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X1990"&MigrantsCountriesOnly$Sex=="combined",]
bigper<-inner_join(biggestchange,X1990,by="country")
bigper%<>%mutate(difference/number.of.migrants)
names(bigper)[7]<-"percentage"
bigper[bigper$percentage==max(bigper$percentage),]
## # A tibble: 1 x 7
##   country  difference type.of.data Year  Sex   number.of.migra~ percentage
##   <chr>         <dbl> <chr>        <chr> <chr>            <dbl>      <dbl>
## 1 Republi~    1308518 C            X1990 comb~            43017       30.4
bigper[bigper$percentage==max(bigper$percentage),1]
## # A tibble: 1 x 1
##   country          
##   <chr>            
## 1 Republic of Korea
#the Republic of Korea has seen the most imigrants in total vs. 1990
X2015<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X2015"&MigrantsCountriesOnly$Sex=="combined",]
miniper<-inner_join(X2015,X1990,by="country")
miniper%<>%mutate(number.of.migrants.x/number.of.migrants.y)
names(miniper)[10]<-"percentage"
miniper[miniper$percentage==max(miniper$percentage),]
## # A tibble: 1 x 10
## # Groups:   country [1]
##   country    type.of.data.x Year.x Sex.x  number.of.migran~ type.of.data.y
##   <chr>      <chr>          <chr>  <chr>              <dbl> <chr>         
## 1 Republic ~ C              X2015  combi~           1327324 C             
## # ... with 4 more variables: Year.y <chr>, Sex.y <chr>,
## #   number.of.migrants.y <dbl>, percentage <dbl>
miniper[miniper$percentage==max(miniper$percentage),1]
## # A tibble: 1 x 1
## # Groups:   country [1]
##   country          
##   <chr>            
## 1 Republic of Korea

In both cases, it was the Republic of Korea. This makes sense, as South Korea became a democracy only in 1987 and really began to take off economically after that.

I finally wanted to know if the change in migrant women was also largest in the Republic of Korea, or if it was in another country.

X1990F<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X1990"&MigrantsCountriesOnly$Sex=="female",]
X2015F<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X2015"&MigrantsCountriesOnly$Sex=="female",]
miniperF<-inner_join(X2015F,X1990F,by="country")
miniperF%<>%mutate(number.of.migrants.x/number.of.migrants.y)
names(miniperF)[10]<-"percentage"
miniperF[miniperF$percentage==max(miniperF$percentage),]
## # A tibble: 1 x 10
## # Groups:   country [1]
##   country     type.of.data.x Year.x Sex.x number.of.migran~ type.of.data.y
##   <chr>       <chr>          <chr>  <chr>             <dbl> <chr>         
## 1 Republic o~ C              X2015  fema~            582547 C             
## # ... with 4 more variables: Year.y <chr>, Sex.y <chr>,
## #   number.of.migrants.y <dbl>, percentage <dbl>

I had thought that it was mostly men who were moving and visiting long term to South Korea. Women did so in just about the same percentages.

I also checked to see which country has simply been receiving the same number of migrants over the last 25 years.

miniper%<>%mutate(abs(1-percentage))
names(miniper)[11]<-"close.to.constant"
head(miniper)
## # A tibble: 6 x 11
## # Groups:   country [6]
##   country  type.of.data.x Year.x Sex.x   number.of.migrant~ type.of.data.y
##   <chr>    <chr>          <chr>  <chr>                <dbl> <chr>         
## 1 Burundi  B R            X2015  combin~             286810 B R           
## 2 Comoros  B              X2015  combin~              12555 B             
## 3 Djibouti B R            X2015  combin~             112351 B R           
## 4 Eritrea  I              X2015  combin~              15941 I             
## 5 Ethiopia B R            X2015  combin~            1072949 B R           
## 6 Kenya    B R            X2015  combin~            1084357 B R           
## # ... with 5 more variables: Year.y <chr>, Sex.y <chr>,
## #   number.of.migrants.y <dbl>, percentage <dbl>, close.to.constant <dbl>
min(miniper$close.to.constant)
## [1] 0.009743121
miniper[miniper$close.to.constant==min(miniper$close.to.constant),]
## # A tibble: 1 x 11
## # Groups:   country [1]
##   country   type.of.data.x Year.x Sex.x  number.of.migrant~ type.of.data.y
##   <chr>     <chr>          <chr>  <chr>               <dbl> <chr>         
## 1 Costa Ri~ B R            X2015  combi~             421697 B R           
## # ... with 5 more variables: Year.y <chr>, Sex.y <chr>,
## #   number.of.migrants.y <dbl>, percentage <dbl>, close.to.constant <dbl>
#miniper[miniper$]

Apparently Costa Rica gets about 400,000 migrants annually, no matter what.

Finally, I wanted to see which country had the greatest percentage drop in migrants. Where are people not going anymore?

miniper[miniper$percentage==min(miniper$percentage),]
## # A tibble: 1 x 11
## # Groups:   country [1]
##   country type.of.data.x Year.x Sex.x    number.of.migrant~ type.of.data.y
##   <chr>   <chr>          <chr>  <chr>                 <dbl> <chr>         
## 1 Somalia I R            X2015  combined              25291 I R           
## # ... with 5 more variables: Year.y <chr>, Sex.y <chr>,
## #   number.of.migrants.y <dbl>, percentage <dbl>, close.to.constant <dbl>

Given the civil war in Somalia started in 1991 and has not entirely ceased, this is unsurprising.