library(dplyr)
library(tidyr)
library(stringr)
library(magrittr)
For project 2, I chose to work with the New York Health Department Restaurant data, the Doctor Who time travel data, and the UN Migrant Stock data.
1.) For the New York Health Department Restaurant data, I was asked to find the most common violations that shut down restaurants, how likely restaurants are to be reopened once they are shut down, and which zip codes have the most violations. I chose to tackle the data by separating critical problems (mice, roaches, improper food storage temperatures, things that could immediately harm someone’s health) that cause shutdowns from non-critical problems (defective food storage surfaces, openings that could let in vermin, things that are not immediately hazardous but could become so).
Given the number of restaurants in New York, I worked with an approximately six-month window.
The data is already quite tall, so rather than making it narrower, I widened it out with tidyr.
(There were some characters that R read as end-of-file markers, which I had to edit out with a text editor; hence the file name.)
NTCdata<-read.csv("DOHMH_New_York_City_Restaurant_Inspection_Resultsfixed.csv")
Violations carry one of three critical flags: critical, not critical, and not applicable. I eliminated the not-applicable ones, as they covered things like customers smoking that are not actually relevant to food safety.
NTCApplicable<-NTCdata[!str_detect(NTCdata$CRITICAL.FLAG,"Applicable"),]
I then spread the violation descriptions across the levels of the critical flag. This makes the data slightly less tidy, but it allows a quick separation of critical and non-critical violations.
NTCwide<-spread(NTCApplicable,CRITICAL.FLAG,VIOLATION.DESCRIPTION)
critical<-NTCwide[!is.na(NTCwide$Critical),]
noncritical<-NTCwide[is.na(NTCwide$Critical),]
Having done this, I can now look at the closures for both critical and non-critical violations.
critclose<-critical[str_detect(critical$ACTION,"closed")|str_detect(critical$ACTION,"Closed"),]
summary(critclose)
## CAMIS DBA
## Min. :40362098 RESTAURANTE & PANADERIA GUATELINDA: 25
## 1st Qu.:41580756 BELLA PIZZA : 19
## Median :50035430 B&B HALALL RESTAURANT : 17
## Mean :47035182 THE AINSWORTH : 16
## 3rd Qu.:50069068 CROWN FRIED CHICKEN : 15
## Max. :50083289 KENNEDY FRIED CHICKEN : 15
## (Other) :2946
## BORO BUILDING STREET ZIPCODE
## BRONX : 291 170-18 : 25 BROADWAY : 98 Min. :10001
## BROOKLYN : 877 1 : 20 2ND AVE : 63 1st Qu.:10029
## MANHATTAN :1038 200 : 20 3RD AVE : 57 Median :11104
## QUEENS : 751 12 : 17 AMSTERDAM AVE: 44 Mean :10744
## STATEN ISLAND: 96 234 : 17 8TH AVE : 39 3rd Qu.:11232
## 67A : 17 UTICA AVE : 37 Max. :11693
## (Other):2937 (Other) :2715 NA's :31
## PHONE
## 2124706169: 25
## 3472958422: 17
## 2127410646: 16
## 7185283900: 15
## 2125690555: 12
## 3476172383: 12
## (Other) :2956
## CUISINE.DESCRIPTION
## Chinese : 415
## American : 409
## Latin (Cuban, Dominican, Puerto Rican, South & Central American): 170
## Caribbean : 159
## Spanish : 157
## Pizza : 140
## (Other) :1603
## INSPECTION.DATE
## 09/20/2018: 59
## 06/18/2018: 58
## 06/14/2018: 52
## 06/21/2018: 51
## 08/16/2018: 50
## 08/29/2018: 50
## (Other) :2733
## ACTION
## Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.:2767
## Establishment re-closed by DOHMH : 286
## Establishment re-opened by DOHMH : 0
## No violations were recorded at the time of this inspection. : 0
## Violations were cited in the following area(s). : 0
##
##
## VIOLATION.CODE SCORE GRADE GRADE.DATE
## 04L : 537 Min. : 5.00 :2948 :3049
## 02G : 315 1st Qu.: 43.00 A : 0 05/02/2018: 4
## 04N : 293 Median : 54.00 B : 0 04/03/2018: 0
## 04M : 258 Mean : 57.34 C : 4 04/04/2018: 0
## 06C : 236 3rd Qu.: 70.00 Not Yet Graded: 101 04/05/2018: 0
## 02B : 204 Max. :156.00 P : 0 04/06/2018: 0
## (Other):1210 Z : 0 (Other) : 0
## RECORD.DATE INSPECTION.TYPE
## 10/05/2018:3053 Cycle Inspection / Initial Inspection :1515
## Cycle Inspection / Re-inspection : 612
## Pre-permit (Operational) / Initial Inspection : 425
## Cycle Inspection / Reopening Inspection : 234
## Pre-permit (Operational) / Re-inspection : 139
## Pre-permit (Operational) / Compliance Inspection: 50
## (Other) : 78
## Critical
## Evidence of mice or live mice present in facility's food and/or non-food areas. : 537
## Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. : 315
## Filth flies or food/refuse/sewage-associated (FRSA) flies present in facilitys food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.: 293
## Live roaches present in facility's food and/or non-food areas. : 258
## Food not protected from potential source of contamination during storage, preparation, transportation, display or service. : 236
## Hot food item not held at or above 140º F. : 204
## (Other) :1210
## Not Critical
## : 0
## '''Wash hands sign not posted at hand wash facility. : 0
## A food containing artificial trans fat, with 0.5 grams or more of trans fat per serving, is being stored, distributed, held for service, used in preparation of a menu item, or served. : 0
## Accurate thermometer not provided in refrigerated or hot holding equipment. : 0
## Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding.: 0
## (Other) : 0
## NA's :3053
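As an aside, the two str_detect() calls used to find closures could be collapsed into one with stringr's case-insensitive regex() modifier; a sketch that should select the same rows as critclose:
#sketch: match "closed" regardless of case in a single call
critclose_alt<-critical[str_detect(critical$ACTION,regex("closed",ignore_case = TRUE)),]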
noncritclose<-noncritical[str_detect(noncritical$ACTION,"closed")|str_detect(noncritical$ACTION,"Closed"),]
summary(noncritclose)
## CAMIS DBA
## Min. :40362098 RESTAURANTE & PANADERIA GUATELINDA: 10
## 1st Qu.:41530497 B&B HALALL RESTAURANT : 9
## Median :50017751 VACCAROS BAKERY : 9
## Mean :46922419 CROWN FRIED CHICKEN : 8
## 3rd Qu.:50064922 KENNEDY FRIED CHICKEN : 8
## Max. :50083289 MAYA CUISINE : 8
## (Other) :1509
## BORO BUILDING STREET ZIPCODE
## BRONX :146 176 : 11 BROADWAY : 56 Min. :10001
## BROOKLYN :452 160 : 10 3RD AVE : 34 1st Qu.:10027
## MANHATTAN :544 170-18 : 10 2ND AVE : 32 Median :11103
## QUEENS :373 2805 : 9 AMSTERDAM AVE: 20 Mean :10732
## STATEN ISLAND: 46 67A : 9 8TH AVE : 19 3rd Qu.:11226
## 1 : 8 STEINWAY ST : 19 Max. :11693
## (Other):1504 (Other) :1381 NA's :16
## PHONE CUISINE.DESCRIPTION INSPECTION.DATE
## 2124706169: 10 American :215 09/20/2018: 37
## 3472958422: 9 Chinese :210 06/14/2018: 33
## 3476172383: 9 Spanish : 81 06/18/2018: 24
## 7187773131: 8 Caribbean: 80 09/17/2018: 24
## 212_933_10: 7 Indian : 74 06/19/2018: 22
## 2123880885: 7 Mexican : 69 07/11/2018: 22
## (Other) :1511 (Other) :832 (Other) :1399
## ACTION
## Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed.:1315
## Establishment re-closed by DOHMH : 246
## Establishment re-opened by DOHMH : 0
## No violations were recorded at the time of this inspection. : 0
## Violations were cited in the following area(s). : 0
##
##
## VIOLATION.CODE SCORE GRADE GRADE.DATE
## 08A :767 Min. : 2.00 :1525 :1560
## 10F :268 1st Qu.: 38.00 A : 0 05/02/2018: 1
## 10B :208 Median : 49.00 B : 0 04/03/2018: 0
## 08C : 72 Mean : 50.72 C : 1 04/04/2018: 0
## 10H : 50 3rd Qu.: 61.00 Not Yet Graded: 35 04/05/2018: 0
## 09C : 49 Max. :156.00 P : 0 04/06/2018: 0
## (Other):147 NA's :28 Z : 0 (Other) : 0
## RECORD.DATE INSPECTION.TYPE
## 10/05/2018:1561 Cycle Inspection / Initial Inspection :738
## Cycle Inspection / Re-inspection :305
## Cycle Inspection / Reopening Inspection :192
## Pre-permit (Operational) / Initial Inspection :146
## Pre-permit (Operational) / Re-inspection : 66
## Pre-permit (Operational) / Reopening Inspection: 43
## (Other) : 71
## Critical
## : 0
## '''Wash hands sign not posted at hand wash facility. : 0
## A food containing artificial trans fat, with 0.5 grams or more of trans fat per serving, is being stored, distributed, held for service, used in preparation of a menu item, or served. : 0
## Accurate thermometer not provided in refrigerated or hot holding equipment. : 0
## Appropriately scaled metal stem-type thermometer or thermocouple not provided or used to evaluate temperatures of potentially hazardous foods during cooking, cooling, reheating and holding.: 0
## (Other) : 0
## NA's :1561
## Not Critical
## Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist. :767
## Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.:268
## Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly. :208
## Pesticide use not in accordance with label or applicable laws. Prohibited chemical used/stored. Open bait station used. : 72
## Proper sanitization not provided for utensil ware washing operation. : 50
## Food contact surface not properly maintained. : 49
## (Other) :147
The top critical closure reasons are mice, flies and roaches, and temperature control problems. The top non-critical reasons are not being vermin proof, improperly constructed or maintained non-food contact surfaces, and plumbing problems, including floors not being properly drained.
To see which restaurants were closed and then reopened, I subsetted the critical and non-critical data sets to inspections where the health department reopened the establishment, and then did an inner join with the closures to see which restaurants had been both closed and reopened during this period. Oddly enough, the non-critical join returns more rows than there were closures; this indicates that some restaurants are being closed and reopened more than once.
noncritreopen<-noncritical[str_detect(noncritical$ACTION,"re-open"),]
noncritcloseandreopen<-inner_join(noncritclose,noncritreopen,by = "CAMIS")
critreopen<-critical[str_detect(critical$ACTION,"re-open"),]
critclosereopen<-inner_join(critclose,critreopen,by="CAMIS")
nrow(noncritclose)
## [1] 1561
nrow(noncritcloseandreopen)
## [1] 1662
nrow(critclose)
## [1] 3053
nrow(critclosereopen)
## [1] 1070
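The extra rows come from how inner_join() treats duplicate keys: every closure row for a given CAMIS is matched against every reopening row for that CAMIS. A minimal toy sketch (with made-up CAMIS values) shows the multiplication:
#sketch with invented data: one restaurant closed twice and reopened twice, one closed once and never reopened
closes<-data.frame(CAMIS=c(1,1,2),what="closed")
reopens<-data.frame(CAMIS=c(1,1),what="re-opened")
nrow(inner_join(closes,reopens,by="CAMIS")) #4 rows: 2 closures x 2 reopenings for CAMIS 1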
To count each restaurant only once, no matter how many closures it had, I used the distinct() function.
NCCU<-distinct(noncritclose,CAMIS)
NCCRU<-distinct(noncritcloseandreopen,CAMIS)
nrow(NCCRU)/nrow(NCCU)
## [1] 0.7701318
CCU<-distinct(critclose,CAMIS)
CCRU<-distinct(critclosereopen,CAMIS)
nrow(CCRU)/nrow(CCU)
## [1] 0.2863014
This shows that restaurants shut down for critical issues are much less likely to reopen within the six-month window than ones shut down for non-critical reasons. Apparently, once you have mice or cockroaches, it’s hard to come back from that.
Finally, I checked to see which zip codes had the most health code violations.
zips<-as.factor(NTCdata$ZIPCODE)
summary(zips)
## 10003 10019 10036 10013 10002 10016 11354 10012 10001
## 2138 1927 1904 1748 1656 1454 1411 1373 1370
## 10014 10011 10022 11220 10017 11372 10018 11201 11215
## 1309 1273 1217 1191 1150 1103 1089 1020 974
## 11368 11101 11211 10009 11209 11355 11385 11226 11237
## 945 868 836 823 815 806 794 778 773
## 11217 11103 11373 10025 11214 11238 10029 11432 10010
## 768 766 722 716 684 661 658 658 654
## 10028 10458 11222 10024 11377 10038 11375 10314 10467
## 631 630 627 624 612 611 583 579 565
## 11204 10023 10128 11203 10065 11223 10021 11206 10451
## 564 555 555 536 531 523 517 507 506
## 11249 10032 11235 10027 11216 11106 11207 10461 11218
## 505 503 487 483 479 460 459 452 450
## 11234 10031 11232 10301 11212 11225 11231 10033 10463
## 438 434 431 421 417 412 412 406 404
## 11205 11229 11213 11358 10452 11230 11419 11435 10004
## 401 393 392 391 390 390 390 385 377
## 11208 10306 10034 11236 11105 11219 11221 10453 11418
## 376 372 368 364 363 361 355 354 343
## 10040 10468 10035 10457 10472 11104 10007 10462 (Other)
## 342 336 335 334 329 314 309 303 12838
## NA's
## 1239
This is not surprising, as the top two zip codes have some of the highest numbers of restaurants in New York City, even if I am slightly disturbed that I used to work in 10019.
If instead we want to look at which zip codes get proportionally the most health code violations per restaurant, we can compute the ratio of violations to distinct restaurants in each zip code:
numviol<-count(NTCdata,ZIPCODE)
RestUnique<-distinct(NTCdata,CAMIS,.keep_all = TRUE)
numrest<-count(RestUnique,ZIPCODE)
violratio<-cbind(numviol[,1],numviol[,2]%/%numrest[,2])
head(violratio[order(-violratio$n),],8)
## ZIPCODE n
## 31 10032 6
## 39 10040 6
## 49 10111 6
## 68 10271 6
## 108 10474 6
## 168 11363 6
## 3 10002 5
## 4 10003 5
The zip codes 10032, 10040 and 10111, along with 10271, 10474 and 11363, are tied for the highest ratio of violations to restaurants (six per restaurant).
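The integer division above relies on numviol and numrest listing zip codes in the same order, which count() guarantees here because every zip code with a violation also has at least one restaurant. A join-based variant makes that alignment explicit; a sketch using the same objects:
#sketch: join the two counts on ZIPCODE instead of relying on row order
violratio2<-inner_join(numviol,numrest,by="ZIPCODE")
violratio2$ratio<-violratio2$n.x%/%violratio2$n.y #violations per restaurant, integer division as above
head(violratio2[order(-violratio2$ratio),],8)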
2.) I was asked to see whether the old series or the new one had more time traveling and which Doctor traveled the most.
More of the work with this data set is in cleaning rather than tidying. There is also a typo present that I did not bother to correct, both because I wanted to work with the data set as given and because it did not alter the results. One of the Doctors is recorded as going back to -400000000000, which is before the beginning of the universe; it was meant to be -4000000000, the beginning of life on Earth, but the person entering the data slipped on the zeros.
The initial data cleaning:
TheDoctor<-read.csv("DrWhoTimeTravel.csv",stringsAsFactors = FALSE)
TheDoctor<-select(TheDoctor,Doctor.Who.season:location)
TheDoctor<-TheDoctor[,c(1:5,7,6,8:11)]
TheDoctor[,1]<-as.numeric(TheDoctor[,1])
## Warning: NAs introduced by coercion
TheDoctor<-TheDoctor[!is.na(TheDoctor[,1]),]
TheDoctor[,6]<-as.numeric(gsub(",","",TheDoctor[,6]))
head(TheDoctor)
## Doctor.Who.season doctor.actor ep..no episode.title from
## 1 10 Tennant xmas 2006 the Runaway Bride 2007
## 2 1 Hartnell 1 An Unearthly Child 1963
## 3 1 Hartnell 21 The Daleks' Master Plan -2500
## 4 3 Pertwee 64 The Time Monster 1972
## 5 1 Hartnell 20 The Myth Makers -1200
## 6 1 Hartnell 12 The Romans 64
## to estimated.from estimated. planet sub.location
## 1 -5.0e+09 NA Earth England
## 2 -1.0e+05 NA n/a
## 3 -2.5e+03 4000 Earth
## 4 -2.0e+03 2900 Earth/ Atlantis
## 5 -1.2e+03 3999 y Earth
## 6 6.4e+01 2493 Earth Italy
## location
## 1 London
## 2 n/a
## 3 Egypt
## 4
## 5 Asia Minor/Troy
## 6 Rome
Now I can gather the "from" and "to" years for the various Doctors to see how far each one traveled. I first did this with the non-estimated data.
NDoctor<-gather(TheDoctor,Extremes,Years,5:6)
head(NDoctor)
## Doctor.Who.season doctor.actor ep..no episode.title
## 1 10 Tennant xmas 2006 the Runaway Bride
## 2 1 Hartnell 1 An Unearthly Child
## 3 1 Hartnell 21 The Daleks' Master Plan
## 4 3 Pertwee 64 The Time Monster
## 5 1 Hartnell 20 The Myth Makers
## 6 1 Hartnell 12 The Romans
## estimated.from estimated. planet sub.location location
## 1 NA Earth England London
## 2 NA n/a n/a
## 3 4000 Earth Egypt
## 4 2900 Earth/ Atlantis
## 5 3999 y Earth Asia Minor/Troy
## 6 2493 Earth Italy Rome
## Extremes Years
## 1 from 2007
## 2 from 1963
## 3 from -2500
## 4 from 1972
## 5 from -1200
## 6 from 64
NDoctor$Doctor.Who.season<-as.numeric(NDoctor$Doctor.Who.season)
OldWho<-NDoctor[NDoctor$Doctor.Who.season<9,]
NuWho<-NDoctor[NDoctor$Doctor.Who.season>8,]
GroupNuWho<-group_by(NuWho,Doctor.Who.season) #need to know about this in terms of both doctor and in terms of old vs nu
GroupOldWho<-group_by(OldWho,Doctor.Who.season)
summarise(GroupNuWho,total.years=abs(min(Years))+abs(max(Years)))
## # A tibble: 3 x 2
## Doctor.Who.season total.years
## <dbl> <dbl>
## 1 9 5.00e 9
## 2 10 1.00e14
## 3 11 5.45e 3
summarise(GroupOldWho,total.years=abs(min(Years))+abs(max(Years)))
## # A tibble: 8 x 2
## Doctor.Who.season total.years
## <dbl> <dbl>
## 1 1 13000004000
## 2 2 3936
## 3 3 4540
## 4 4 4000000001980
## 5 5 13700002683
## 6 6 4352
## 7 7 5480
## 8 8 3998
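Note that abs(min(Years))+abs(max(Years)) equals the span of years visited only when a season's travels straddle year 0; a season whose trips were all in positive years would be overstated. The direct span is a one-line change; a sketch:
#sketch: the span of years visited, whether or not the range crosses year 0
summarise(GroupNuWho,total.years=max(Years)-min(Years))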
To include the estimates, I simply broadened the gather to include column 7. With the estimates included, we have:
NEDoctor<-gather(TheDoctor,Extremes,Years,5:7)
head(NEDoctor)
## Doctor.Who.season doctor.actor ep..no episode.title
## 1 10 Tennant xmas 2006 the Runaway Bride
## 2 1 Hartnell 1 An Unearthly Child
## 3 1 Hartnell 21 The Daleks' Master Plan
## 4 3 Pertwee 64 The Time Monster
## 5 1 Hartnell 20 The Myth Makers
## 6 1 Hartnell 12 The Romans
## estimated. planet sub.location location Extremes Years
## 1 Earth England London from 2007
## 2 n/a n/a from 1963
## 3 Earth Egypt from -2500
## 4 Earth/ Atlantis from 1972
## 5 y Earth Asia Minor/Troy from -1200
## 6 Earth Italy Rome from 64
NEDoctor<-NEDoctor[!is.na(NEDoctor$Years),]
OldEWho<-NEDoctor[NEDoctor$Doctor.Who.season<9,]
NuEWho<-NEDoctor[NEDoctor$Doctor.Who.season>8,]
GroupNuEWho<-group_by(NuEWho,Doctor.Who.season) #need to know about this in terms of both doctor and in terms of old vs nu
GroupOldEWho<-group_by(OldEWho,Doctor.Who.season)
summarise(GroupNuEWho,total.years=abs(max(Years))-(min(Years)))
## # A tibble: 3 x 2
## Doctor.Who.season total.years
## <dbl> <dbl>
## 1 9 5.00e 9
## 2 10 1.00e14
## 3 11 5.24e 3
summarise(GroupOldEWho,total.years=abs(max(Years))-(min(Years)))
## # A tibble: 8 x 2
## Doctor.Who.season total.years
## <dbl> <dbl>
## 1 1 13010000700
## 2 2 444
## 3 3 4900
## 4 4 4000000037166
## 5 5 13710000000
## 6 6 708
## 7 7 3894
## 8 8 12
conciseNuEWho<-summarise(GroupNuEWho,total.years=abs(min(Years))+abs(max(Years)))
conciseOldEWho<-summarise(GroupOldEWho,total.years=abs(min(Years))+abs(max(Years)))
mean(conciseNuEWho$total.years)
## [1] 3.333667e+13
mean(conciseOldEWho$total.years)
## [1] 5.0334e+11
Both with and without the estimates, the 10th Doctor traveled the most and the 8th traveled the least. This is not surprising, as the 10th Doctor has an episode where he goes 100 trillion years into the future, while the 8th Doctor appeared only in a TV movie that involved little time travel.
Similarly, we can see that, mostly because the 10th Doctor traveled orders of magnitude farther, the new series covers more years than the old.
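To answer the old-vs-new question in one step rather than through per-season means, the rows can also be tagged by era and compared directly; a sketch reusing the NEDoctor frame from above:
#sketch: tag each row as old or nu series and compare the overall spans
NEDoctor$era<-ifelse(NEDoctor$Doctor.Who.season<9,"old","nu")
summarise(group_by(NEDoctor,era),total.years=max(Years)-min(Years))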
3.) For this data set, I was asked to examine trends in migrant movement over the last 25 years. I interpreted this as seeing which countries had the most migrants, which had the greatest absolute change in migrants, and which had the greatest percentage change in migrants.
I start by loading the xlsx-turned-csv file from row 16 down, as the fancy formatting of the Excel document takes up the first 15 lines. This has the downside of losing the sex-differentiation labels, which I address later with some string substitution and regex work. Also, as noted in the comments, the labels for the first five columns were on row 15, so I relabeled those columns at this point.
Migrants<-read.csv("UN_MigrantStockTotal_2015 - Table 1.csv",skip=16,stringsAsFactors = FALSE)#skipping to the row with the year identifiers, .1 means male, .2 means female, will need to melt/gather for this
names(Migrants)[1:5]<-c("sort.order","major.area,region,country.or.area.of.destination","notes","country.code","type.of.data.(a)") #these names appear in the line above the years, thus renaming the columns
head(Migrants)
## sort.order major.area,region,country.or.area.of.destination
## 1 1 WORLD
## 2 2 Developed regions
## 3 3 Developing regions
## 4 4 Least developed countries
## 5 5 Less developed regions excluding least developed countries
## 6 6 Sub-Saharan Africa
## notes country.code type.of.data.(a) X1990 X1995
## 1 900 152 563 212 160 801 752
## 2 (b) 901 82 378 628 92 306 854
## 3 (c) 902 70 184 584 68 494 898
## 4 (d) 941 11 075 966 11 711 703
## 5 934 59 105 261 56 778 501
## 6 (e) 947 14 690 319 15 324 570
## X2000 X2005 X2010 X2015 X1990.1
## 1 172 703 309 191 269 100 221 714 243 243 700 236 77 747 510
## 2 103 375 363 117 181 109 132 560 325 140 481 955 40 263 397
## 3 69 327 946 74 087 991 89 153 918 103 218 281 37 484 113
## 4 10 077 824 9 809 634 10 018 128 11 951 316 5 843 107
## 5 59 244 124 64 272 611 79 130 668 91 262 036 31 641 006
## 6 13 716 539 13 951 086 15 496 764 18 993 986 7 745 306
## X1995.1 X2000.1 X2005.1 X2010.1 X2015.1
## 1 81 737 477 87 884 839 97 866 674 114 613 714 126 115 435
## 2 45 092 799 50 536 796 57 217 777 64 081 077 67 618 619
## 3 36 644 678 37 348 043 40 648 897 50 532 637 58 496 816
## 4 6 142 712 5 361 902 5 383 009 5 462 714 6 463 217
## 5 30 501 966 31 986 141 35 265 888 45 069 923 52 033 599
## 6 8 036 824 7 210 452 7 444 048 8 188 581 10 099 486
## X1990.2 X1995.2 X2000.2 X2005.2 X2010.2
## 1 74 815 702 79 064 275 84 818 470 93 402 426 107 100 529
## 2 42 115 231 47 214 055 52 838 567 59 963 332 68 479 248
## 3 32 700 471 31 850 220 31 979 903 33 439 094 38 621 281
## 4 5 236 216 5 573 685 4 721 920 4 432 371 4 560 536
## 5 27 464 255 26 276 535 27 257 983 29 006 723 34 060 745
## 6 6 945 013 7 287 746 6 506 087 6 507 038 7 308 183
## X2015.2
## 1 117 584 801
## 2 72 863 336
## 3 44 721 465
## 4 5 493 028
## 5 39 228 437
## 6 8 894 500
We can see that the numbers are written with spaces separating the groups of three digits. While this looks good in Excel and is handled transparently there, exporting the sheet to .csv via Google Docs writes the spaces into the file. We have to remove them if we want any usable numeric data.
GMig<-apply(Migrants[,6:23],MARGIN = c(1,2),FUN = function(x) gsub("\\s","",x)) #each 3 number block has a space in front of it, need to get rid of those
Migrants[,6:23]<-GMig #that works, but it only gives me columns 6:23. Need to put this all together
NMig<-apply(Migrants[,6:23],MARGIN = c(1,2),FUN = function(x) as.numeric(x)) #what GMig produced were all character strings. They really are numbers, fixing that. I'm going to have to do this again after gathering, so this may be extra. I wish I could just do Migrants[,6:23], but R gets really annoyed at passing a vector into the as.numeric function
Migrants[,6:23]<-NMig
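The two apply() passes above could also be folded into a single pass over the columns; a sketch of an equivalent alternative using base lapply(), which handles each column as a whole vector:
#sketch: strip the grouping spaces and convert to numeric column by column, replacing the two apply() calls
Migrants[,6:23]<-lapply(Migrants[,6:23],function(col) as.numeric(gsub("\\s","",col)))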
I then made the data set tall by gathering the year columns into a single column.
MigrantsSkinny<-gather(Migrants,Year.and.Sex,number.of.migrants,6:23,na.rm = TRUE) #the NAs come from countries like South Sudan or East Timor that didn't exist in 1990 or 1995
MigrantsSkinnySpecific<-MigrantsSkinny
As I said above, I now needed to differentiate between the male, female and combined values. The combined columns are simply named X followed by the year (five characters with no suffix), the male columns end in .1, and the female columns end in .2, so I appended .combined to the bare names and then replaced the numeric suffixes:
addcombine<-function(x){
#regexpr("$",x) matches the end of the string, so its position is nchar(x)+1;
#a position of 6 means a bare five-character name like "X1990", i.e. a combined-sex column
y<-ifelse(regexpr("$",x)[1]==6,paste(x,".combined",sep=""),x)
return(y)
}
MigrantsSkinnySpecific$Year.and.Sex<-lapply(MigrantsSkinnySpecific$Year.and.Sex,FUN = function(x) addcombine(x))
MigrantsSkinnySpecific$Year.and.Sex<-str_replace(MigrantsSkinnySpecific$Year.and.Sex,"\\.1",".male")
MigrantsSkinnySpecific$Year.and.Sex<-str_replace(MigrantsSkinnySpecific$Year.and.Sex,"\\.2",".female")
We can now finally separate the year from the sex of the migrants. As noted, the parentheses in the type-of-data column name throw things off; having unsuitable characters in column names is an issue I deal with repeatedly after this.
MigrantsSpecific<-separate(MigrantsSkinnySpecific,Year.and.Sex,into = c("Year","Sex"),sep="\\.")
MigrantsTidy<-MigrantsSpecific[,c(2,5:8)]
names(MigrantsTidy)[2]<-"type.of.data" #the (a) throws everything off
Also, the data includes both individual countries and aggregate regions. To judge things by country, I removed the region rows.
MigrantsCountriesOnly<-subset(MigrantsTidy,grepl("[A-Z]+",type.of.data))
names(MigrantsCountriesOnly)[1]<-"country"
To see which country overall took in the most migrants, be they temporary or permanent, we can use the summarise command. Summing over every row double-counts each migrant, because the combined figures already equal the male plus female figures, but since we are only looking for the largest total, this does not change the result.
MigrantsCountriesOnly<-group_by(MigrantsCountriesOnly,country)
totalmigrants<-summarise(MigrantsCountriesOnly,sum(number.of.migrants))
names(totalmigrants)[2]<-"total.number.of.migrants"
totalmigrants[which.max(totalmigrants$total.number.of.migrants),]
## # A tibble: 1 x 2
## country total.number.of.migrants
## <chr> <dbl>
## 1 United States of America 433170340
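The double counting can also be avoided outright by keeping only the combined-sex rows before summing; a sketch using the same grouped frame:
#sketch: sum only the combined rows so each migrant is counted once
combinedonly<-filter(MigrantsCountriesOnly,Sex=="combined")
totalcombined<-summarise(combinedonly,total.number.of.migrants=sum(number.of.migrants))
totalcombined[which.max(totalcombined$total.number.of.migrants),]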
Similarly, I check to see which country had the largest absolute change in number of migrants over the last 25 years.
biggestchange<-summarise(MigrantsCountriesOnly,max(number.of.migrants)-min(number.of.migrants))
names(biggestchange)[2]<-"difference"
biggestchange[biggestchange$difference==max(biggestchange$difference),] #which.max was not happy, thus the kludge
## # A tibble: 1 x 2
## country difference
## <chr> <dbl>
## 1 United States of America 35254117
In both cases it is the US. Given that the US maintains its population growth through immigration, with workers coming to send money back home, students attending college and graduate school, a small number of refugees, and other immigrants, this is unsurprising.
I also wanted to see what the case was percentage-wise. I started with the combined-sex numbers and looked first at the overall change relative to 1990 levels, and then at 2015 migrant stocks relative to 1990 numbers.
X1990<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X1990"&MigrantsCountriesOnly$Sex=="combined",]
bigper<-inner_join(biggestchange,X1990,by="country")
bigper%<>%mutate(difference/number.of.migrants)
names(bigper)[7]<-"percentage"
bigper[bigper$percentage==max(bigper$percentage),]
## # A tibble: 1 x 7
## country difference type.of.data Year Sex number.of.migra~ percentage
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 Republi~ 1308518 C X1990 comb~ 43017 30.4
bigper[bigper$percentage==max(bigper$percentage),1]
## # A tibble: 1 x 1
## country
## <chr>
## 1 Republic of Korea
#the Republic of Korea has seen the largest growth in migrants relative to its 1990 level
X2015<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X2015"&MigrantsCountriesOnly$Sex=="combined",]
miniper<-inner_join(X2015,X1990,by="country")
miniper%<>%mutate(number.of.migrants.x/number.of.migrants.y)
names(miniper)[10]<-"percentage"
miniper[miniper$percentage==max(miniper$percentage),]
## # A tibble: 1 x 10
## # Groups: country [1]
## country type.of.data.x Year.x Sex.x number.of.migran~ type.of.data.y
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Republic ~ C X2015 combi~ 1327324 C
## # ... with 4 more variables: Year.y <chr>, Sex.y <chr>,
## # number.of.migrants.y <dbl>, percentage <dbl>
miniper[miniper$percentage==max(miniper$percentage),1]
## # A tibble: 1 x 1
## # Groups: country [1]
## country
## <chr>
## 1 Republic of Korea
In both cases, it was the Republic of Korea. This makes sense, as South Korea became a democracy only in 1987 and really began to take off economically after that.
I next wanted to know whether the change in migrant women was also largest in the Republic of Korea, or whether it was largest in another country.
X1990F<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X1990"&MigrantsCountriesOnly$Sex=="female",]
X2015F<-MigrantsCountriesOnly[MigrantsCountriesOnly$Year=="X2015"&MigrantsCountriesOnly$Sex=="female",]
miniperF<-inner_join(X2015F,X1990F,by="country")
miniperF%<>%mutate(number.of.migrants.x/number.of.migrants.y)
names(miniperF)[10]<-"percentage"
miniperF[miniperF$percentage==max(miniperF$percentage),]
## # A tibble: 1 x 10
## # Groups: country [1]
## country type.of.data.x Year.x Sex.x number.of.migran~ type.of.data.y
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Republic o~ C X2015 fema~ 582547 C
## # ... with 4 more variables: Year.y <chr>, Sex.y <chr>,
## # number.of.migrants.y <dbl>, percentage <dbl>
I had thought that it was mostly men who were moving to or staying long term in South Korea, but women did so in just about the same proportions.
I also checked to see which country's migrant stock has stayed closest to constant over the last 25 years.
miniper%<>%mutate(abs(1-percentage))
names(miniper)[11]<-"close.to.constant"
head(miniper)
## # A tibble: 6 x 11
## # Groups: country [6]
## country type.of.data.x Year.x Sex.x number.of.migrant~ type.of.data.y
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Burundi B R X2015 combin~ 286810 B R
## 2 Comoros B X2015 combin~ 12555 B
## 3 Djibouti B R X2015 combin~ 112351 B R
## 4 Eritrea I X2015 combin~ 15941 I
## 5 Ethiopia B R X2015 combin~ 1072949 B R
## 6 Kenya B R X2015 combin~ 1084357 B R
## # ... with 5 more variables: Year.y <chr>, Sex.y <chr>,
## # number.of.migrants.y <dbl>, percentage <dbl>, close.to.constant <dbl>
min(miniper$close.to.constant)
## [1] 0.009743121
miniper[miniper$close.to.constant==min(miniper$close.to.constant),]
## # A tibble: 1 x 11
## # Groups: country [1]
## country type.of.data.x Year.x Sex.x number.of.migrant~ type.of.data.y
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Costa Ri~ B R X2015 combi~ 421697 B R
## # ... with 5 more variables: Year.y <chr>, Sex.y <chr>,
## # number.of.migrants.y <dbl>, percentage <dbl>, close.to.constant <dbl>
Apparently Costa Rica's migrant stock has hovered right around 420,000 for the entire period, no matter what.
Finally, I wanted to see which country had the greatest percentage drop in migrants. Where are people not going anymore?
miniper[miniper$percentage==min(miniper$percentage),]
## # A tibble: 1 x 11
## # Groups: country [1]
## country type.of.data.x Year.x Sex.x number.of.migrant~ type.of.data.y
## <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Somalia I R X2015 combined 25291 I R
## # ... with 5 more variables: Year.y <chr>, Sex.y <chr>,
## # number.of.migrants.y <dbl>, percentage <dbl>, close.to.constant <dbl>
Given that the civil war in Somalia started in 1991 and has not entirely ceased, this is unsurprising.