The assignment tasks students with selecting a messy dataset posted by a peer and cleaning it for analysis, both as set out in the peer’s post and beyond it. For our dataset, we selected Maria’s dataset from UNICEF, and I took up her question about newborn, infant, and young child nutrition. Before importing the data, I made some changes to the file to make importing and analysis easier:
- changed all “-” entries to NA
- changed percentages to decimals
- stripped the footnotes from the bottom
- dropped tabs that are ancillary to the analysis at hand, so that the data could be saved as .csv
- dropped extraneous columns and rows
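The cleaning steps above were done by hand before import. As a rough sketch of how the same steps could be scripted in base R, here is a version that runs on a tiny made-up extract. The values, the footnote text, and the helper names `is_blank` and `pct_to_decimal` are all assumptions for illustration, not part of the UNICEF file.

```r
# Toy stand-in for the raw export (hypothetical values and footnote text)
raw_text <- "Country,Low.birthweight,Unweighed.at.birth
Afghanistan,-,86%
Albania,5%,13%
Footnote: data refer to the most recent year available,,"

# na.strings turns the "-" placeholders into NA at read time
raw <- read.csv(text = raw_text, na.strings = "-", stringsAsFactors = FALSE)

# Drop footnote rows: here, any row whose measure columns are all blank or NA
measure_cols <- setdiff(names(raw), "Country")
is_blank <- function(x) is.na(x) | x == ""
keep <- rowSums(!sapply(raw[measure_cols], is_blank)) > 0
clean <- raw[keep, ]

# Convert "86%" style strings to decimals
pct_to_decimal <- function(x) as.numeric(sub("%", "", x, fixed = TRUE)) / 100
clean[measure_cols] <- lapply(clean[measure_cols], pct_to_decimal)

clean
```

Scripting these steps keeps a record of exactly what was changed, which is harder to guarantee with manual edits in a spreadsheet.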
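library(dplyr)   # %>%, select(), filter(), arrange()
library(tidyr)   # gather()
library(ggplot2) # qplot(), ggplot()
nutrition = read.csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/UNICEF_Table-7-Nutrition-EN.csv")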
head(nutrition)
## Country Low.birthweight Unweighed.at.birth
## 1 Afghanistan NA 0.8626837
## 2 Albania 0.04587830 0.1316733
## 3 Algeria 0.07252492 0.1142705
## 4 Andorra 0.07445406 0.1430000
## 5 Angola 0.15256593 0.4480227
## 6 Anguilla NA NA
## Early.initiation.of.breastfeeding Exclusive.breastfeeding...6.months.
## 1 0.6280000 0.575
## 2 0.5652210 0.36538513
## 3 0.3571079 0.25391212
## 4 NA <NA>
## 5 0.4831644 0.37381931
## 6 NA <NA>
## Introduction.to.solid..semi.solid.or.soft.foods Breastfeeding...All
## 1 0.6099433 0.7380154
## 2 0.8850852 0.4320668
## 3 0.7717722 0.3554286
## 4 NA NA
## 5 0.7876945 0.6656777
## 6 NA NA
## Breastfeeding...Poorest.20. Breastfeeding...Richest.20.
## 1 0.8009494 0.6956284
## 2 0.3817219 0.3657669
## 3 0.3531280 0.3357238
## 4 NA NA
## 5 0.7436154 0.5259578
## 6 NA NA
## Minimum.diet.diversity..6.23.months. Minimum.meal.frequency..6.23.months.
## 1 0.2205860 0.5121235
## 2 0.5249851 0.5140581
## 3 NA 0.5204865
## 4 NA NA
## 5 0.2906923 0.3276945
## 6 NA NA
## Minimum.acceptable.diet..6.23.months.
## 1 0.1549437
## 2 0.2924022
## 3 NA
## 4 NA
## 5 0.1326344
## 6 NA
## Zero.vegetable.or.fruit.consumption..6.23.months. Region X X.1
## 1 0.5857060 NA NA
## 2 0.2573565 NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 0.3635851 NA NA
## 6 NA LATAM - Caribbean NA NA
## X.2 X.3 X.4 X.5 X.6 X.7 X.8
## 1 NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
#start with re-labeling countries and the first variables we're interested in analyzing
names(nutrition)[1] <- "countries"
names(nutrition)[2] <- "low_birthweight"
names(nutrition)[4] <- "early_breast"
names(nutrition)[5] <- "exclusive_breast"
names(nutrition)[7] <- "breast_all"
names(nutrition)[13] <- "zero_veg"
names(nutrition)[14] <- "region"
#head(nutrition)
#next, extract the renamed columns for a subset
nut_ext <- nutrition %>% dplyr::select(1, 2, 4, 5, 7, 13, 14)
head(nut_ext)
## countries low_birthweight early_breast exclusive_breast breast_all
## 1 Afghanistan NA 0.6280000 0.575 0.7380154
## 2 Albania 0.04587830 0.5652210 0.36538513 0.4320668
## 3 Algeria 0.07252492 0.3571079 0.25391212 0.3554286
## 4 Andorra 0.07445406 NA <NA> NA
## 5 Angola 0.15256593 0.4831644 0.37381931 0.6656777
## 6 Anguilla NA NA <NA> NA
## zero_veg region
## 1 0.5857060
## 2 0.2573565
## 3 NA
## 4 NA
## 5 0.3635851
## 6 NA LATAM - Caribbean
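Renaming and selecting by column position, as above, depends on the column order of the file staying fixed. A minimal sketch of doing the same thing by name in base R follows. The toy data frame and the `lookup` vector are hypothetical stand-ins, not the real table.

```r
# Toy data frame with a few of the original column names (values are made up)
toy <- data.frame(
  Country = c("Angola", "Anguilla"),
  Low.birthweight = c(0.15, NA),
  Unweighed.at.birth = c(0.45, NA),
  Region = c("", "LATAM - Caribbean"),
  stringsAsFactors = FALSE
)

# Old name -> new name; only the columns we want to keep
lookup <- c(
  Country = "countries",
  Low.birthweight = "low_birthweight",
  Region = "region"
)

# Fail early if the file layout changed and a column is missing
stopifnot(all(names(lookup) %in% names(toy)))

# Rename the matched columns, then subset to the renamed ones
idx <- match(names(lookup), names(toy))
names(toy)[idx] <- unname(lookup)
toy_ext <- toy[, unname(lookup)]

names(toy_ext)
```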
## let's filter the dataset for our desired LATAM - Caribbean subset
la_nut <- nut_ext %>% filter(region == 'LATAM - Caribbean')
la_nut
## countries low_birthweight early_breast
## 1 Anguilla NA NA
## 2 Antigua and Barbuda 0.09054119 NA
## 3 Argentina 0.07346758 0.5274035
## 4 Bahamas 0.13135983 NA
## 5 Barbados NA 0.4029047
## 6 Belize 0.08598801 0.6825412
## 7 Bolivia (Plurinational State of) 0.07222335 0.5500000
## 8 Brazil 0.08384270 0.4290000
## 9 British Virgin Islands NA NA
## 10 Chile 0.06246671 NA
## 11 Colombia 0.09955593 0.7200000
## 12 Costa Rica 0.07477285 0.5962632
## 13 Cuba 0.05263557 0.4787731
## 14 Dominica NA NA
## 15 Dominican Republic 0.11294179 0.3807571
## 16 Ecuador 0.11180641 0.5460000
## 17 El Salvador 0.10295944 0.4200486
## 18 Grenada NA NA
## 19 Guatemala 0.10957713 0.6313348
## 20 Haiti NA 0.4735259
## 21 Honduras 0.10897210 0.6379734
## 22 Jamaica 0.14582915 0.6471896
## 23 Mexico 0.07868743 0.5103613
## 24 Montserrat NA NA
## 25 Nicaragua 0.10669083 0.5440000
## 26 Panama 0.10087104 0.4697620
## 27 Paraguay 0.08090015 0.4954847
## 28 Peru 0.09403629 0.4970000
## 29 Saint Kitts and Nevis NA NA
## 30 Saint Lucia NA 0.4958285
## 31 Saint Vincent and the Grenadines NA NA
## 32 Suriname 0.14658528 0.4468282
## 33 Trinidad and Tobago 0.12392281 0.4600000
## 34 Turks and Caicos Islands NA NA
## 35 Uruguay 0.07616777 0.7651041
## 36 Venezuela (Bolivarian Republic of) 0.09104774 NA
## exclusive_breast breast_all zero_veg region
## 1 <NA> NA NA LATAM - Caribbean
## 2 <NA> NA NA LATAM - Caribbean
## 3 0.31951948 0.3910114 NA LATAM - Caribbean
## 4 <NA> NA NA LATAM - Caribbean
## 5 0.19696945 0.4108116 NA LATAM - Caribbean
## 6 0.33164608 0.4713115 0.30324936 LATAM - Caribbean
## 7 0.583 0.5530000 0.19808907 LATAM - Caribbean
## 8 0.386 NA NA LATAM - Caribbean
## 9 <NA> NA NA LATAM - Caribbean
## 10 <NA> NA NA LATAM - Caribbean
## 11 0.361 0.4475464 NA LATAM - Caribbean
## 12 0.32540791 0.3977990 NA LATAM - Caribbean
## 13 0.32832527 0.3068782 0.26882854 LATAM - Caribbean
## 14 <NA> NA NA LATAM - Caribbean
## 15 0.04564405 0.1994798 0.34918953 LATAM - Caribbean
## 16 0.396 NA NA LATAM - Caribbean
## 17 0.46734215 0.6678928 0.15860823 LATAM - Caribbean
## 18 <NA> NA NA LATAM - Caribbean
## 19 0.53236492 0.7204063 0.26755610 LATAM - Caribbean
## 20 0.39875298 0.5246042 0.54659348 LATAM - Caribbean
## 21 0.30749973 0.5891186 0.36059898 LATAM - Caribbean
## 22 0.23779669 0.3766995 NA LATAM - Caribbean
## 23 0.30139965 0.3607925 0.18309301 LATAM - Caribbean
## 24 <NA> NA NA LATAM - Caribbean
## 25 0.317 0.5222334 NA LATAM - Caribbean
## 26 0.21459486 0.4125676 NA LATAM - Caribbean
## 27 0.29593622 0.3283283 0.16459743 LATAM - Caribbean
## 28 0.664 0.6458324 0.07263592 LATAM - Caribbean
## 29 <NA> NA NA LATAM - Caribbean
## 30 0.034969528 0.2869475 NA LATAM - Caribbean
## 31 <NA> NA NA LATAM - Caribbean
## 32 0.027725089 0.1739862 NA LATAM - Caribbean
## 33 0.21 0.3400000 NA LATAM - Caribbean
## 34 <NA> NA NA LATAM - Caribbean
## 35 <NA> NA NA LATAM - Caribbean
## 36 <NA> NA NA LATAM - Caribbean
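The table above has many gaps, and they are spread unevenly across the columns. Before dropping incomplete rows, it can help to count the NAs per column and to check completeness only on the columns a model will actually use. The sketch below uses made-up values, so treat it as an illustration of the idea rather than a result.

```r
# Toy stand-in for la_nut (countries and values are made up)
toy <- data.frame(
  countries = c("A", "B", "C", "D"),
  low_birthweight = c(NA, 0.08, 0.10, 0.07),
  exclusive_breast = c(0.40, 0.33, NA, 0.58),
  zero_veg = c(0.55, 0.30, 0.27, 0.20)
)

# How many values are missing in each column?
colSums(is.na(toy))

# na.omit() drops a row if ANY column is NA
all_cols <- na.omit(toy)

# complete.cases() on a subset keeps rows that are complete in those columns only
model_cols <- c("exclusive_breast", "zero_veg")
model_only <- toy[complete.cases(toy[model_cols]), ]

nrow(all_cols)
nrow(model_only)
```

In the toy data, country A is missing only `low_birthweight`, so it is lost under `na.omit()` but kept when only the two model columns are checked.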
## We can start by looking for a relationship between low birthweight, the likelihood of exclusive breastfeeding in infancy, and nutrition as the child gets older, measured here by the share of children eating no vegetables or fruit
# exclusive_breast was read in as text (note the <NA> entries above), so convert it to numeric first
la_nut$exclusive_breast <- as.numeric(as.character(la_nut$exclusive_breast))
# Drop rows with an NA in any of the selected columns
la_nut <- na.omit(la_nut)
qplot(x=exclusive_breast, y=zero_veg, data=la_nut, main="Exclusive Infant Breastfeeding vs. Zero Vegetable or Fruit Consumption (6-23 Months)", xlab="Exclusive breastfeeding (share of infants)", ylab="Zero vegetable or fruit consumption (share of children)")
nut_score <- lm(exclusive_breast ~ zero_veg, data = la_nut)
nut_score
##
## Call:
## lm(formula = exclusive_breast ~ zero_veg, data = la_nut)
##
## Coefficients:
## (Intercept) zero_veg
## 0.681 -1.269
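To read the coefficients above: the fitted line is exclusive_breast = 0.681 - 1.269 × zero_veg, so a higher share of children eating no vegetables or fruit goes with a lower fitted share of exclusively breastfed infants. The sketch below plugs a few values into that line. The helper name `predict_exclusive_breast` is made up for illustration. By my count of the table above, only about ten countries survive the NA filtering, so the fit should be treated as descriptive at best.

```r
# Coefficients as printed in the lm() output above
intercept <- 0.681
slope <- -1.269

# Hypothetical helper: fitted exclusive-breastfeeding share for a given zero_veg share
predict_exclusive_breast <- function(zero_veg) {
  intercept + slope * zero_veg
}

# Fitted values at 10%, 20%, and 30% of children eating no vegetables or fruit
fitted_shares <- predict_exclusive_breast(c(0.10, 0.20, 0.30))
fitted_shares
```

With the fitted model object itself, `predict(nut_score, newdata = data.frame(zero_veg = c(0.10, 0.20, 0.30)))` would give the same kind of answer without retyping the rounded coefficients.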
The assignment tasks students with selecting a messy dataset posted by a peer and cleaning it for analysis, both as set out in the peer’s post and beyond it. For our dataset, we selected Maria’s dataset from UNICEF, and I undertook a new analysis of literacy in Latin America. Before importing the data, I made some changes to the file to make importing and analysis easier:
- changed all “-” entries to NA
- changed percentages to decimals
- stripped the footnotes from the bottom
- dropped tabs that are ancillary to the analysis at hand, so that the data could be saved as .csv
- dropped extraneous columns and rows
- concatenated the gender with each of the column names
education = read.csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/Table-10-Education-EN.csv")
head(education)
## X Completion.Primary.education.male
## 1 Afghanistan 0.67
## 2 Albania 0.91
## 3 Algeria 0.93
## 4 Andorra NA
## 5 Angola 0.53
## 6 Anguilla NA
## Completion.Primary.education.female Completion.Lower.secondary.education.male
## 1 0.40 0.49
## 2 0.93 0.97
## 3 0.94 0.57
## 4 NA NA
## 5 0.49 0.41
## 6 NA NA
## Completion.Lower.secondary.education.female
## 1 0.26
## 2 0.96
## 3 0.72
## 4 NA
## 5 0.31
## 6 NA
## Completion.Upper.secondary.education.male
## 1 0.32
## 2 0.43
## 3 0.30
## 4 NA
## 5 0.21
## 6 NA
## Completion.Upper.secondary.education.female
## 1 0.14
## 2 0.60
## 3 0.47
## 4 NA
## 5 0.15
## 6 NA
## Proportion.of.children.in.grade.2.or.3.achieving.minimum.proficiency.level.reading
## 1 0.47
## 2 0.86
## 3 NA
## 4 NA
## 5 NA
## 6 0.59
## Proportion.of.children.in.grade.2.or.3.achieving.minimum.proficiency.level.math
## 1 0.52
## 2 NA
## 3 0.41
## 4 NA
## 5 NA
## 6 0.38
## Proportion.of.children.at.the.end.of.primary.achieving.minimum.proficiency.level.reading
## 1 0.55
## 2 0.95
## 3 NA
## 4 NA
## 5 NA
## 6 0.76
## Proportion.of.children.at.the.end.of.primary.achieving.minimum.proficiency.level.math
## 1 0.63
## 2 0.97
## 3 NA
## 4 NA
## 5 NA
## 6 0.67
## Proportion.of.children.at.the.end.of.lower.secondary.achieving.minimum.proficiency.level.reading
## 1 NA
## 2 0.48
## 3 0.21
## 4 NA
## 5 NA
## 6 NA
## Proportion.of.children.at.the.end.of.lower.secondary.achieving.minimum.proficiency.level.math
## 1 NA
## 2 0.39
## 3 0.19
## 4 NA
## 5 NA
## 6 NA
## youth_literacy.male youth_literacy.female X.1
## 1 0.62 0.32
## 2 0.99 0.99
## 3 NA NA
## 4 NA NA
## 5 0.85 0.71
## 6 NA NA LATAM-Caribbean
# Let's go ahead and quickly filter for the LATAM and Caribbean nations that we are interested in comparing, as well as the literacy variables we're interested in exploring
names(education)[16] <- "region"
names(education)[1] <- "country"
names(education)[15] <- "youth_literacy_female"
names(education)[14] <- "youth_literacy_male"
education$youth_literacy_male <- as.numeric(as.character(education$youth_literacy_male))
## Warning: NAs introduced by coercion
education$youth_literacy_female <- as.numeric(as.character(education$youth_literacy_female))
## Warning: NAs introduced by coercion
education <- education %>% filter(region == 'LATAM-Caribbean')
#next, let's extract the renamed columns for a subset
la_edu <- education %>% dplyr::select(1, 14, 15)
la_lit <- la_edu %>% filter(!is.na(youth_literacy_female))
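The two coercion warnings above mean that some entries in the literacy columns were present but not numeric. A minimal sketch, on made-up values rather than the UNICEF file, of how one might list the offending entries before converting:

```r
# Hypothetical raw column: numbers stored as text, with a stray placeholder
raw_literacy <- c("0.99", "0.85", "-", NA, "0.97")

# Convert quietly, then flag entries that were present but failed to parse
converted <- suppressWarnings(as.numeric(raw_literacy))
failed <- !is.na(raw_literacy) & is.na(converted)

raw_literacy[failed]  # the text values that became NA
sum(failed)           # how many there were
```

Checking this once makes it clear whether the new NAs come from placeholders such as "-" or from something that needs a closer look.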
## We can start by trying to look at any differences between Latin American females and males in terms of youth literacy
m_male <- mean(la_lit$youth_literacy_male)
m_fem <- mean(la_lit$youth_literacy_female)
m_male
## [1] 0.9820833
m_fem
## [1] 0.9858333
# female literacy across LATAM countries is slightly higher than male literacy rate, so we'll sort by descending female literacy rate to get a good look at the region
la_lit <- la_lit %>%
  arrange(desc(youth_literacy_female))
library(reactable)
## Warning: package 'reactable' was built under R version 4.0.4
reactable(la_lit)
## All in all, youth literacy is exceptionally high across the Latin America and Caribbean region (averaging about 98% for both sexes), with little difference between males and females in any of the countries measured.
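One way to back up the claim about males and females is to compute the gap country by country. The sketch below does this on a toy table. The country labels and rates are made up, and the `gap` column is a name introduced here for illustration.

```r
# Toy stand-in for la_lit (country labels and rates are made up)
toy_lit <- data.frame(
  country = c("A", "B", "C"),
  youth_literacy_male = c(0.99, 0.97, 0.98),
  youth_literacy_female = c(0.99, 0.98, 0.99)
)

# Female minus male literacy rate, per country
toy_lit$gap <- toy_lit$youth_literacy_female - toy_lit$youth_literacy_male

# Largest absolute gaps first, then the overall spread
toy_lit[order(-abs(toy_lit$gap)), ]
gap_range <- range(toy_lit$gap)
gap_range
```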
Let’s talk about squirrels. Joseph’s post was about how squirrels in New York City live and the possible forecasts that can be made for 2019 and beyond from this 2018 data.
squirrels = read.csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/squirrels.csv")
head(squirrels)
## X Y Unique.Squirrel.ID Hectare Shift Date
## 1 -73.95613 40.79408 37F-PM-1014-03 37F PM 10142018
## 2 -73.96886 40.78378 21B-AM-1019-04 21B AM 10192018
## 3 -73.97428 40.77553 11B-PM-1014-08 11B PM 10142018
## 4 -73.95964 40.79031 32E-PM-1017-14 32E PM 10172018
## 5 -73.97027 40.77621 13E-AM-1017-05 13E AM 10172018
## 6 -73.96836 40.77259 11H-AM-1010-03 11H AM 10102018
## Hectare.Squirrel.Number Age Primary.Fur.Color Highlight.Fur.Color
## 1 3
## 2 4
## 3 8 Gray
## 4 14 Adult Gray
## 5 5 Adult Gray Cinnamon
## 6 3 Adult Cinnamon White
## Combination.of.Primary.and.Highlight.Color
## 1 +
## 2 +
## 3 Gray+
## 4 Gray+
## 5 Gray+Cinnamon
## 6 Cinnamon+White
## Color.notes
## 1
## 2
## 3
## 4 Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments.
## 5
## 6
## Location Above.Ground.Sighter.Measurement Specific.Location Running
## 1 FALSE
## 2 FALSE
## 3 Above Ground 10 FALSE
## 4 FALSE
## 5 Above Ground on tree stump FALSE
## 6 FALSE
## Chasing Climbing Eating Foraging Other.Activities Kuks Quaas Moans
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## Tail.flags Tail.twitches Approaches Indifferent Runs.from Other.Interactions
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE TRUE
## 5 FALSE FALSE FALSE FALSE FALSE
## 6 FALSE TRUE FALSE TRUE FALSE
## Lat.Long
## 1 POINT (-73.9561344937861 40.7940823884086)
## 2 POINT (-73.9688574691102 40.7837825208444)
## 3 POINT (-73.97428114848522 40.775533619083)
## 4 POINT (-73.9596413903948 40.7903128889029)
## 5 POINT (-73.9702676472613 40.7762126854894)
## 6 POINT (-73.9683613516225 40.7725908847499)
# X holds longitude and Y holds latitude (compare the Lat.Long column above)
names(squirrels)[1] <- "long"
names(squirrels)[2] <- "lat"
names(squirrels)[6] <- "date"
names(squirrels)[16] <- "running"
names(squirrels)[17] <- "chasing"
names(squirrels)[18] <- "climbing"
names(squirrels)[19] <- "eating"
names(squirrels)[20] <- "foraging"
names(squirrels)[9] <- "color"
s_data <- squirrels %>% dplyr::select(1, 2, 6, 9, 16, 17, 18, 19, 20)
head(s_data)
## long lat date color running chasing climbing eating foraging
## 1 -73.95613 40.79408 10142018 FALSE FALSE FALSE FALSE FALSE
## 2 -73.96886 40.78378 10192018 FALSE FALSE FALSE FALSE FALSE
## 3 -73.97428 40.77553 10142018 Gray FALSE TRUE FALSE FALSE FALSE
## 4 -73.95964 40.79031 10172018 Gray FALSE FALSE FALSE TRUE TRUE
## 5 -73.97027 40.77621 10172018 Gray FALSE FALSE FALSE FALSE TRUE
## 6 -73.96836 40.77259 10102018 Cinnamon FALSE FALSE FALSE FALSE TRUE
Let’s take a look at what squirrels are doing when spotters see them.
# we can visualize all of this quickly with a simple bar graph after we use gather to reshape the data into long format
# create new dataframe
squirrels_df <- gather(s_data, action, T_F, 5:9)
#squirrels_df
## now let's filter the new dataframe for squirrels that aren't clearly lazy
action_squirrels <- squirrels_df %>% filter(T_F == 'TRUE')
#action_squirrels
# label squirrels with no recorded primary fur color as "Other"
action_squirrels$color <- sub("^$", "Other", action_squirrels$color)
head(action_squirrels)
## long lat date color action T_F
## 1 -73.96400 40.78203 10142018 Gray running TRUE
## 2 -73.97686 40.77028 10102018 Cinnamon running TRUE
## 3 -73.97038 40.77875 10182018 Gray running TRUE
## 4 -73.96711 40.77849 10072018 Gray running TRUE
## 5 -73.95874 40.79085 10082018 Gray running TRUE
## 6 -73.96929 40.77695 10132018 Gray running TRUE
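The reshaping step above turns one row per squirrel into one row per squirrel per activity, then keeps only the rows where the activity was observed. For readers without tidyr, here is a minimal base R sketch of the same idea on a toy table. The three toy squirrels and the two activity columns are made up.

```r
# Toy stand-in for s_data: one row per squirrel, one logical column per activity
toy <- data.frame(
  color = c("Gray", "", "Cinnamon"),
  running = c(TRUE, FALSE, FALSE),
  eating = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
activity_cols <- c("running", "eating")

# Wide to long: repeat the id column once per activity and stack the flags
long <- data.frame(
  color = rep(toy$color, times = length(activity_cols)),
  action = rep(activity_cols, each = nrow(toy)),
  T_F = unlist(toy[activity_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep only observed activities and label blank colors as "Other"
active <- long[long$T_F, ]
active$color <- sub("^$", "Other", active$color)

active
```

Note that the long table has one row per squirrel for every activity column, so a squirrel seen doing two things appears twice among the active rows.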
# now let's go ahead and look at what these active squirrels are doing most of the time
library("rcartocolor")
## Warning: package 'rcartocolor' was built under R version 4.0.4
# geom_bar() counts the rows for each activity and stacks them by fur color
ggplot(action_squirrels, aes(x = action, fill = color)) +
  geom_bar() +
  labs(title="Most Common Squirrel Activities, By Color") +
  xlab('Activity') +
  ylab('Number of sightings') +
  scale_fill_carto_d(name = "color: ", palette = "Vivid")
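The numbers behind a stacked bar chart like this one are just a cross-tabulation of activity by color. A minimal sketch on made-up rows, assuming a long table with one row per observed activity:

```r
# Toy stand-in for action_squirrels (rows are made up)
toy_actions <- data.frame(
  color = c("Gray", "Gray", "Cinnamon", "Gray", "Black"),
  action = c("foraging", "eating", "foraging", "foraging", "running"),
  stringsAsFactors = FALSE
)

# Rows are activities, columns are fur colors
counts <- table(toy_actions$action, toy_actions$color)
counts

# Bar heights: total observations per activity
activity_totals <- rowSums(counts)
activity_totals
```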
Are squirrels who aren’t gray more likely to be lazy good-for-nothings? Do Cinnamon and Black squirrels more easily pass undetected by human eyes? Or are there just more gray squirrels than any other kind? This is an easy test.
# first let's see the percentage of gray squirrels in the action table
squ_table <- table(action_squirrels$color)
squ_table
##
## Black Cinnamon Gray Other
## 125 529 3170 38
gray_pct_act <- squ_table[["Gray"]] / nrow(action_squirrels)
gray_pct_act
## [1] 0.8208182
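# now let's look at the numbers for the full dataset (squirrels_df has one row per squirrel per activity, so each count is five times the number of squirrels, but the shares are unaffected)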
squ_table_full <- table(squirrels_df$color)
squ_table_full
##
## Black Cinnamon Gray
## 275 515 1960 12365
gray_pct_full <- squ_table_full[["Gray"]] / nrow(squirrels_df)
gray_pct_full
## [1] 0.8180615
What an enormous relief. The share of Gray squirrels in the active dataset (about 82.1%) is roughly the same as in the full dataset (about 81.8%). The last thing we need is an elitist population of Black and Cinnamon squirrels taking advantage of an industrious class of Gray squirrels.
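The same comparison can be laid out for every color at once. The sketch below retypes the counts from the two tables printed above and converts them to shares with `prop.table()`. Treating the unlabeled column of the full table as "Other" is my assumption, made so the two tables line up.

```r
# Counts copied from the tables above
active_counts <- c(Black = 125, Cinnamon = 529, Gray = 3170, Other = 38)
# Assumption: the blank label in the full table corresponds to "Other"
full_counts <- c(Other = 275, Black = 515, Cinnamon = 1960, Gray = 12365)

# Shares within each table, with the full table reordered to match
active_share <- prop.table(active_counts)
full_share <- prop.table(full_counts)[names(active_counts)]

# Side-by-side comparison, rounded for reading
round(cbind(active = active_share, full = full_share), 3)
```

Because the active rows are a subset of the full rows, this is a descriptive comparison only and not a formal test of independent samples.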