Dataset 1 - Background

The assignment tasks students with selecting a messy dataset posted by a peer and cleaning it for analysis as set forth in the peer’s post and beyond. For our dataset, we selected Maria’s dataset from UNICEF, and I took up her question regarding newborn, infant, and young child nutrition. Before importing the data, I made several changes to the source file to facilitate importing and analysis:

- change all “-” to “NA”
- change percentages to decimals
- strip out footnotes from the bottom
- drop tabs that are ancillary to the analysis at hand so that the data can be saved as .csv
- drop extraneous columns and rows
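The first two cleaning steps above were done by hand before import, but they could also be done in R. The following is a minimal sketch under stated assumptions: the inline CSV text and its values are made up for illustration, and `pct_to_decimal` is a hypothetical helper, not something from the original analysis.

```r
# Hypothetical raw extract: "-" marks missing values, percentages are text
raw_csv <- "Country,Low.birthweight,Unweighed.at.birth
Afghanistan,-,86%
Albania,5%,13%"

# Treat "-" (and empty cells) as NA at import time
toy <- read.csv(text = raw_csv,
                na.strings = c("-", "NA", ""),
                stringsAsFactors = FALSE)

# Hypothetical helper: strip the percent sign and rescale to a decimal
pct_to_decimal <- function(x) as.numeric(sub("%", "", x, fixed = TRUE)) / 100

# Apply the helper to every column except the country name
toy[-1] <- lapply(toy[-1], pct_to_decimal)
toy
```

Doing the conversion in code keeps the cleaning reproducible if the source table is updated.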

Import and Clean Data

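library(dplyr)    # %>%, select(), filter(), arrange()
library(tidyr)    # gather()
library(ggplot2)  # plotting

nutrition <- read.csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/UNICEF_Table-7-Nutrition-EN.csv")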


head(nutrition)
##       Country Low.birthweight Unweighed.at.birth
## 1 Afghanistan              NA          0.8626837
## 2     Albania      0.04587830          0.1316733
## 3     Algeria      0.07252492          0.1142705
## 4     Andorra      0.07445406          0.1430000
## 5      Angola      0.15256593          0.4480227
## 6    Anguilla              NA                 NA
##   Early.initiation.of.breastfeeding Exclusive.breastfeeding...6.months.
## 1                         0.6280000                               0.575
## 2                         0.5652210                          0.36538513
## 3                         0.3571079                          0.25391212
## 4                                NA                                <NA>
## 5                         0.4831644                          0.37381931
## 6                                NA                                <NA>
##   Introduction.to.solid..semi.solid.or.soft.foods Breastfeeding...All
## 1                                       0.6099433           0.7380154
## 2                                       0.8850852           0.4320668
## 3                                       0.7717722           0.3554286
## 4                                              NA                  NA
## 5                                       0.7876945           0.6656777
## 6                                              NA                  NA
##   Breastfeeding...Poorest.20. Breastfeeding...Richest.20.
## 1                   0.8009494                   0.6956284
## 2                   0.3817219                   0.3657669
## 3                   0.3531280                   0.3357238
## 4                          NA                          NA
## 5                   0.7436154                   0.5259578
## 6                          NA                          NA
##   Minimum.diet.diversity..6.23.months. Minimum.meal.frequency..6.23.months.
## 1                            0.2205860                            0.5121235
## 2                            0.5249851                            0.5140581
## 3                                   NA                            0.5204865
## 4                                   NA                                   NA
## 5                            0.2906923                            0.3276945
## 6                                   NA                                   NA
##   Minimum.acceptable.diet..6.23.months.
## 1                             0.1549437
## 2                             0.2924022
## 3                                    NA
## 4                                    NA
## 5                             0.1326344
## 6                                    NA
##   Zero.vegetable.or.fruit.consumption..6.23.months.            Region  X X.1
## 1                                         0.5857060                   NA  NA
## 2                                         0.2573565                   NA  NA
## 3                                                NA                   NA  NA
## 4                                                NA                   NA  NA
## 5                                         0.3635851                   NA  NA
## 6                                                NA LATAM - Caribbean NA  NA
##   X.2 X.3 X.4 X.5 X.6 X.7 X.8
## 1  NA  NA  NA  NA  NA  NA  NA
## 2  NA  NA  NA  NA  NA  NA  NA
## 3  NA  NA  NA  NA  NA  NA  NA
## 4  NA  NA  NA  NA  NA  NA  NA
## 5  NA  NA  NA  NA  NA  NA  NA
## 6  NA  NA  NA  NA  NA  NA  NA
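Two things stand out in the output above: the trailing `X` through `X.8` columns contain only `NA`, and the `<NA>` display in the exclusive breastfeeding column suggests it was read as text rather than as a number. One possible way to handle both is sketched below on a small made-up data frame; the column names and values are illustrative assumptions, not the real file.

```r
# Made-up stand-in for the imported table
toy <- data.frame(
  Country   = c("A", "B"),
  exclusive = c("0.575", NA),   # numeric values stored as text
  X         = c(NA, NA),        # empty spill-over column
  X.1       = c(NA, NA),
  stringsAsFactors = FALSE
)

# Flag columns in which every value is missing, then drop them
all_na <- vapply(toy, function(col) all(is.na(col)), logical(1))
toy <- toy[, !all_na, drop = FALSE]

# Convert the text column to numeric so it can be used in models
toy$exclusive <- as.numeric(toy$exclusive)
str(toy)
```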

Extract appropriate data for subsetting

#start with re-labeling countries and the first variables we're interested in analyzing
names(nutrition)[1] <- "countries"
names(nutrition)[2] <- "low_birthweight"
names(nutrition)[4] <- "early_breast"
names(nutrition)[5] <- "exclusive_breast"
names(nutrition)[7] <- "breast_all"
names(nutrition)[13] <- "zero_veg"
names(nutrition)[14] <- "region"

#head(nutrition)

#next, extract the renamed columns for a subset
nut_ext <- nutrition %>% dplyr::select(1, 2, 4, 5, 7, 13, 14)
head(nut_ext)
##     countries low_birthweight early_breast exclusive_breast breast_all
## 1 Afghanistan              NA    0.6280000            0.575  0.7380154
## 2     Albania      0.04587830    0.5652210       0.36538513  0.4320668
## 3     Algeria      0.07252492    0.3571079       0.25391212  0.3554286
## 4     Andorra      0.07445406           NA             <NA>         NA
## 5      Angola      0.15256593    0.4831644       0.37381931  0.6656777
## 6    Anguilla              NA           NA             <NA>         NA
##    zero_veg            region
## 1 0.5857060                  
## 2 0.2573565                  
## 3        NA                  
## 4        NA                  
## 5 0.3635851                  
## 6        NA LATAM - Caribbean
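Renaming and selecting by position, as above, breaks silently if the column order of the source file ever changes. A hedged alternative is to look columns up by their original names. The sketch below uses base R on a made-up one-row data frame; `new_names` is a hypothetical lookup vector.

```r
# Made-up stand-in with a few of the original column names
toy <- data.frame(
  Country            = "Angola",
  Low.birthweight    = 0.15,
  Unweighed.at.birth = 0.45,
  Region             = "",
  stringsAsFactors   = FALSE
)

# Hypothetical lookup: old name on the left, new name on the right
new_names <- c(Country         = "countries",
               Low.birthweight = "low_birthweight",
               Region          = "region")

# Find each old name; stop early if any of them is missing from the data
idx <- match(names(new_names), names(toy))
stopifnot(!anyNA(idx))
names(toy)[idx] <- new_names

# Keep only the renamed columns, selected by name
subset_toy <- toy[, unname(new_names), drop = FALSE]
names(subset_toy)
```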

Now that we have our data extracted, we can try to make some observations from a filtered dataset.

## let's filter the dataset for our desired LATAM - Caribbean subset
la_nut <- nut_ext %>% filter(region == 'LATAM - Caribbean')
la_nut
##                             countries low_birthweight early_breast
## 1                            Anguilla              NA           NA
## 2                 Antigua and Barbuda      0.09054119           NA
## 3                           Argentina      0.07346758    0.5274035
## 4                             Bahamas      0.13135983           NA
## 5                            Barbados              NA    0.4029047
## 6                              Belize      0.08598801    0.6825412
## 7    Bolivia (Plurinational State of)      0.07222335    0.5500000
## 8                              Brazil      0.08384270    0.4290000
## 9              British Virgin Islands              NA           NA
## 10                              Chile      0.06246671           NA
## 11                           Colombia      0.09955593    0.7200000
## 12                         Costa Rica      0.07477285    0.5962632
## 13                               Cuba      0.05263557    0.4787731
## 14                           Dominica              NA           NA
## 15                 Dominican Republic      0.11294179    0.3807571
## 16                            Ecuador      0.11180641    0.5460000
## 17                        El Salvador      0.10295944    0.4200486
## 18                            Grenada              NA           NA
## 19                          Guatemala      0.10957713    0.6313348
## 20                              Haiti              NA    0.4735259
## 21                           Honduras      0.10897210    0.6379734
## 22                            Jamaica      0.14582915    0.6471896
## 23                             Mexico      0.07868743    0.5103613
## 24                         Montserrat              NA           NA
## 25                          Nicaragua      0.10669083    0.5440000
## 26                             Panama      0.10087104    0.4697620
## 27                           Paraguay      0.08090015    0.4954847
## 28                               Peru      0.09403629    0.4970000
## 29              Saint Kitts and Nevis              NA           NA
## 30                        Saint Lucia              NA    0.4958285
## 31   Saint Vincent and the Grenadines              NA           NA
## 32                           Suriname      0.14658528    0.4468282
## 33                Trinidad and Tobago      0.12392281    0.4600000
## 34           Turks and Caicos Islands              NA           NA
## 35                            Uruguay      0.07616777    0.7651041
## 36 Venezuela (Bolivarian Republic of)      0.09104774           NA
##    exclusive_breast breast_all   zero_veg            region
## 1              <NA>         NA         NA LATAM - Caribbean
## 2              <NA>         NA         NA LATAM - Caribbean
## 3        0.31951948  0.3910114         NA LATAM - Caribbean
## 4              <NA>         NA         NA LATAM - Caribbean
## 5        0.19696945  0.4108116         NA LATAM - Caribbean
## 6        0.33164608  0.4713115 0.30324936 LATAM - Caribbean
## 7             0.583  0.5530000 0.19808907 LATAM - Caribbean
## 8             0.386         NA         NA LATAM - Caribbean
## 9              <NA>         NA         NA LATAM - Caribbean
## 10             <NA>         NA         NA LATAM - Caribbean
## 11            0.361  0.4475464         NA LATAM - Caribbean
## 12       0.32540791  0.3977990         NA LATAM - Caribbean
## 13       0.32832527  0.3068782 0.26882854 LATAM - Caribbean
## 14             <NA>         NA         NA LATAM - Caribbean
## 15       0.04564405  0.1994798 0.34918953 LATAM - Caribbean
## 16            0.396         NA         NA LATAM - Caribbean
## 17       0.46734215  0.6678928 0.15860823 LATAM - Caribbean
## 18             <NA>         NA         NA LATAM - Caribbean
## 19       0.53236492  0.7204063 0.26755610 LATAM - Caribbean
## 20       0.39875298  0.5246042 0.54659348 LATAM - Caribbean
## 21       0.30749973  0.5891186 0.36059898 LATAM - Caribbean
## 22       0.23779669  0.3766995         NA LATAM - Caribbean
## 23       0.30139965  0.3607925 0.18309301 LATAM - Caribbean
## 24             <NA>         NA         NA LATAM - Caribbean
## 25            0.317  0.5222334         NA LATAM - Caribbean
## 26       0.21459486  0.4125676         NA LATAM - Caribbean
## 27       0.29593622  0.3283283 0.16459743 LATAM - Caribbean
## 28            0.664  0.6458324 0.07263592 LATAM - Caribbean
## 29             <NA>         NA         NA LATAM - Caribbean
## 30      0.034969528  0.2869475         NA LATAM - Caribbean
## 31             <NA>         NA         NA LATAM - Caribbean
## 32      0.027725089  0.1739862         NA LATAM - Caribbean
## 33             0.21  0.3400000         NA LATAM - Caribbean
## 34             <NA>         NA         NA LATAM - Caribbean
## 35             <NA>         NA         NA LATAM - Caribbean
## 36             <NA>         NA         NA LATAM - Caribbean
## We can start by trying to discover some relationship between low birthweight, likelihood of breastfeeding exclusively as an infant, and nutritional habits as the child gets older in the form of the likelihood of zero vegetables being consumed

# Let's strip out the NAs first

la_nut <- na.omit(la_nut)

ggplot(la_nut, aes(x = exclusive_breast, y = zero_veg)) +
  geom_point() +
  labs(title = "Exclusive Infant Breast-feeding v Zero Vegetables Consumed for Older Children",
       x = "Breast-Feeding", y = "No Vegetables")
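Note that `na.omit()` drops a row when any column is missing, including columns that are not used in the plot or the model. In the table printed earlier, for example, Haiti has both `exclusive_breast` and `zero_veg` but no `low_birthweight`, so it is removed. A minimal sketch of restricting the check to the columns actually needed, using `complete.cases()` on a three-row data frame with values rounded from that table:

```r
# Three rows with values rounded from the printed la_nut table
toy <- data.frame(
  countries        = c("Haiti", "Peru", "Chile"),
  low_birthweight  = c(NA, 0.094, 0.062),
  exclusive_breast = c(0.399, 0.664, NA),
  zero_veg         = c(0.547, 0.073, NA)
)

# na.omit() checks every column, so Haiti is lost
dropped_all <- na.omit(toy)

# complete.cases() on just the two analysis columns keeps Haiti
keep <- complete.cases(toy[, c("exclusive_breast", "zero_veg")])
kept_needed <- toy[keep, ]

nrow(dropped_all)
nrow(kept_needed)
```

With so few complete rows in this region, keeping every usable observation matters.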

There does not appear to be much of a graphical relationship among these countries between the likelihood of breast-feeding infants exclusively and the likelihood that young children in those countries consume no vegetables in a given day. But we can try to derive a relationship mathematically.

nut_score <- lm(`exclusive_breast` ~ `zero_veg`, data = la_nut)

nut_score
## 
## Call:
## lm(formula = exclusive_breast ~ zero_veg, data = la_nut)
## 
## Coefficients:
## (Intercept)     zero_veg  
##       0.681       -1.269
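Printing the model object shows only the coefficients. To judge the strength of the relationship, `summary()` also reports standard errors, p-values, and R-squared. The sketch below rebuilds the model from the ten countries in the printed `la_nut` table that have no missing values (typed in by hand here, so treat the data frame as an assumption); it should give approximately the coefficients shown above.

```r
# The ten complete rows, copied from the printed la_nut table
complete_rows <- data.frame(
  countries = c("Belize", "Bolivia", "Cuba", "Dominican Republic",
                "El Salvador", "Guatemala", "Honduras", "Mexico",
                "Paraguay", "Peru"),
  exclusive_breast = c(0.33164608, 0.583, 0.32832527, 0.04564405,
                       0.46734215, 0.53236492, 0.30749973, 0.30139965,
                       0.29593622, 0.664),
  zero_veg = c(0.30324936, 0.19808907, 0.26882854, 0.34918953,
               0.15860823, 0.26755610, 0.36059898, 0.18309301,
               0.16459743, 0.07263592)
)

# Same formula as in the analysis
fit <- lm(exclusive_breast ~ zero_veg, data = complete_rows)

# summary() adds standard errors, t values, p-values and R-squared
fit_summary <- summary(fit)
fit_summary$coefficients
fit_summary$r.squared
```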

The fitted slope is negative (-1.269), so within this sample, countries where more children skip vegetables tend to have lower exclusive breast-feeding rates. Given the proven health benefits of exclusively breast-feeding infants, that is the direction we might have expected. The printed coefficients alone, however, say nothing about the strength or statistical significance of the relationship, and only ten countries remain after na.omit(), so the result is far from conclusive. Inconsistent access to vegetables in some of these countries could also confound the comparison. In any case, it deserves additional analysis, with many more variables, and ideally with more complete data.

Dataset 2 - Background

The assignment tasks students with selecting a messy dataset posted by a peer and cleaning it for analysis as set forth in the peer’s post and beyond. For our dataset, we selected Maria’s dataset from UNICEF, and I undertook a new analysis on literacy in Latin America. Before importing the data, I made several changes to the source file to facilitate importing and analysis:

- change all “-” to “NA”
- change percentages to decimals
- strip out footnotes from the bottom
- drop tabs that are ancillary to the analysis at hand so that the data can be saved as .csv
- drop extraneous columns and rows
- append the gender to each of the column names

Import and Clean Data

education = read.csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/Table-10-Education-EN.csv")

head(education)
##             X Completion.Primary.education.male
## 1 Afghanistan                             0.67 
## 2     Albania                             0.91 
## 3     Algeria                             0.93 
## 4     Andorra                               NA 
## 5      Angola                             0.53 
## 6    Anguilla                               NA 
##   Completion.Primary.education.female Completion.Lower.secondary.education.male
## 1                               0.40                                      0.49 
## 2                               0.93                                      0.97 
## 3                               0.94                                      0.57 
## 4                                 NA                                        NA 
## 5                               0.49                                      0.41 
## 6                                 NA                                        NA 
##   Completion.Lower.secondary.education.female
## 1                                       0.26 
## 2                                       0.96 
## 3                                       0.72 
## 4                                         NA 
## 5                                       0.31 
## 6                                         NA 
##   Completion.Upper.secondary.education.male
## 1                                     0.32 
## 2                                     0.43 
## 3                                     0.30 
## 4                                       NA 
## 5                                     0.21 
## 6                                       NA 
##   Completion.Upper.secondary.education.female
## 1                                       0.14 
## 2                                       0.60 
## 3                                       0.47 
## 4                                         NA 
## 5                                       0.15 
## 6                                         NA 
##   Proportion.of.children.in.grade.2.or.3.achieving.minimum.proficiency.level.reading
## 1                                                                              0.47 
## 2                                                                              0.86 
## 3                                                                                NA 
## 4                                                                                NA 
## 5                                                                                NA 
## 6                                                                              0.59 
##   Proportion.of.children.in.grade.2.or.3.achieving.minimum.proficiency.level.math
## 1                                                                           0.52 
## 2                                                                             NA 
## 3                                                                           0.41 
## 4                                                                             NA 
## 5                                                                             NA 
## 6                                                                           0.38 
##   Proportion.of.children.at.the.end.of.primary.achieving.minimum.proficiency.level.reading
## 1                                                                                    0.55 
## 2                                                                                    0.95 
## 3                                                                                      NA 
## 4                                                                                      NA 
## 5                                                                                      NA 
## 6                                                                                    0.76 
##   Proportion.of.children.at.the.end.of.primary.achieving.minimum.proficiency.level.math
## 1                                                                                 0.63 
## 2                                                                                 0.97 
## 3                                                                                   NA 
## 4                                                                                   NA 
## 5                                                                                   NA 
## 6                                                                                 0.67 
##   Proportion.of.children.at.the.end.of.lower.secondary.achieving.minimum.proficiency.level.reading
## 1                                                                                              NA 
## 2                                                                                            0.48 
## 3                                                                                            0.21 
## 4                                                                                              NA 
## 5                                                                                              NA 
## 6                                                                                              NA 
##   Proportion.of.children.at.the.end.of.lower.secondary.achieving.minimum.proficiency.level.math
## 1                                                                                           NA 
## 2                                                                                         0.39 
## 3                                                                                         0.19 
## 4                                                                                           NA 
## 5                                                                                           NA 
## 6                                                                                           NA 
##   youth_literacy.male youth_literacy.female             X.1
## 1               0.62                  0.32                 
## 2               0.99                  0.99                 
## 3                 NA                    NA                 
## 4                 NA                    NA                 
## 5               0.85                  0.71                 
## 6                 NA                    NA  LATAM-Caribbean

Extract appropriate data for subsetting

# Let's go ahead and quickly filter for the LATAM and Caribbean nations that we are interested in comparing, as well as the literacy variables we're interested in exploring

names(education)[16] <- "region"
names(education)[1] <- "country"
names(education)[15] <- "youth_literacy_female"
names(education)[14] <- "youth_literacy_male"
education$youth_literacy_male <- as.numeric(as.character(education$youth_literacy_male))
## Warning: NAs introduced by coercion
education$youth_literacy_female <- as.numeric(as.character(education$youth_literacy_female))
## Warning: NAs introduced by coercion
education <- education %>% filter(region == 'LATAM-Caribbean')

#next, let's extract the renamed columns for a subset
la_edu <- education %>% dplyr::select(1, 14, 15)
la_lit <- la_edu %>% filter(!is.na(youth_literacy_female))

Now that we have our data extracted, we can try to make some observations from a filtered dataset.

## We can start by trying to look at any differences between Latin American females and males in terms of youth literacy

m_male <- mean(la_lit$youth_literacy_male)
m_fem <- mean(la_lit$youth_literacy_female)

m_male
## [1] 0.9820833
m_fem
## [1] 0.9858333
# female literacy across LATAM countries is slightly higher than male literacy rate, so we'll sort by descending female literacy rate to get a good look at the region
la_lit <- la_lit %>% 
  arrange(desc(youth_literacy_female))
library(reactable)
## Warning: package 'reactable' was built under R version 4.0.4
reactable(la_lit)

## All in all, the youth literacy rate is exceptionally high across the Latin America and Caribbean region, with little difference between males and females in any of the countries measured.
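Comparing the two regional means hides how the gap varies from country to country. A small sketch of computing a per-country gap is shown below; the three countries and their rates are invented for illustration and are not taken from the UNICEF table.

```r
# Invented example values, not real UNICEF figures
toy_lit <- data.frame(
  country               = c("A", "B", "C"),
  youth_literacy_male   = c(0.98, 0.99, 0.97),
  youth_literacy_female = c(0.99, 0.99, 0.98)
)

# Positive gap means female literacy is higher in that country
toy_lit$gap <- toy_lit$youth_literacy_female - toy_lit$youth_literacy_male

# Average gap and its range across countries
mean_gap  <- mean(toy_lit$gap)
gap_range <- range(toy_lit$gap)
toy_lit
```

Applied to `la_lit`, the same `gap` column could be added before sorting to show where the difference is largest.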

Dataset 3 - Background

Let’s talk about squirrels. Joseph’s post was about how squirrels in New York City live and the possible forecasts that can be made for 2019 and beyond from this 2018 data.

squirrels = read.csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/squirrels.csv")

head(squirrels)
##           X        Y Unique.Squirrel.ID Hectare Shift     Date
## 1 -73.95613 40.79408     37F-PM-1014-03     37F    PM 10142018
## 2 -73.96886 40.78378     21B-AM-1019-04     21B    AM 10192018
## 3 -73.97428 40.77553     11B-PM-1014-08     11B    PM 10142018
## 4 -73.95964 40.79031     32E-PM-1017-14     32E    PM 10172018
## 5 -73.97027 40.77621     13E-AM-1017-05     13E    AM 10172018
## 6 -73.96836 40.77259     11H-AM-1010-03     11H    AM 10102018
##   Hectare.Squirrel.Number   Age Primary.Fur.Color Highlight.Fur.Color
## 1                       3                                            
## 2                       4                                            
## 3                       8                    Gray                    
## 4                      14 Adult              Gray                    
## 5                       5 Adult              Gray            Cinnamon
## 6                       3 Adult          Cinnamon               White
##   Combination.of.Primary.and.Highlight.Color
## 1                                          +
## 2                                          +
## 3                                      Gray+
## 4                                      Gray+
## 5                              Gray+Cinnamon
## 6                             Cinnamon+White
##                                                                             Color.notes
## 1                                                                                      
## 2                                                                                      
## 3                                                                                      
## 4 Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments.
## 5                                                                                      
## 6                                                                                      
##       Location Above.Ground.Sighter.Measurement Specific.Location Running
## 1                                                                   FALSE
## 2                                                                   FALSE
## 3 Above Ground                               10                     FALSE
## 4                                                                   FALSE
## 5 Above Ground                                      on tree stump   FALSE
## 6                                                                   FALSE
##   Chasing Climbing Eating Foraging Other.Activities  Kuks Quaas Moans
## 1   FALSE    FALSE  FALSE    FALSE                  FALSE FALSE FALSE
## 2   FALSE    FALSE  FALSE    FALSE                  FALSE FALSE FALSE
## 3    TRUE    FALSE  FALSE    FALSE                  FALSE FALSE FALSE
## 4   FALSE    FALSE   TRUE     TRUE                  FALSE FALSE FALSE
## 5   FALSE    FALSE  FALSE     TRUE                  FALSE FALSE FALSE
## 6   FALSE    FALSE  FALSE     TRUE                  FALSE FALSE FALSE
##   Tail.flags Tail.twitches Approaches Indifferent Runs.from Other.Interactions
## 1      FALSE         FALSE      FALSE       FALSE     FALSE                   
## 2      FALSE         FALSE      FALSE       FALSE     FALSE                   
## 3      FALSE         FALSE      FALSE       FALSE     FALSE                   
## 4      FALSE         FALSE      FALSE       FALSE      TRUE                   
## 5      FALSE         FALSE      FALSE       FALSE     FALSE                   
## 6      FALSE          TRUE      FALSE        TRUE     FALSE                   
##                                     Lat.Long
## 1 POINT (-73.9561344937861 40.7940823884086)
## 2 POINT (-73.9688574691102 40.7837825208444)
## 3 POINT (-73.97428114848522 40.775533619083)
## 4 POINT (-73.9596413903948 40.7903128889029)
## 5 POINT (-73.9702676472613 40.7762126854894)
## 6 POINT (-73.9683613516225 40.7725908847499)
# X holds longitude (about -73.96) and Y holds latitude (about 40.78)
names(squirrels)[1] <- "long"
names(squirrels)[2] <- "lat"
names(squirrels)[6] <- "date"
names(squirrels)[16] <- "running"
names(squirrels)[17] <- "chasing"
names(squirrels)[18] <- "climbing"
names(squirrels)[19] <- "eating"
names(squirrels)[20] <- "foraging"
names(squirrels)[9] <- "color"

s_data <- squirrels %>% dplyr::select(1, 2, 6, 9, 16, 17, 18, 19, 20)

head(s_data)
##        long      lat     date    color running chasing climbing eating foraging
## 1 -73.95613 40.79408 10142018            FALSE   FALSE    FALSE  FALSE    FALSE
## 2 -73.96886 40.78378 10192018            FALSE   FALSE    FALSE  FALSE    FALSE
## 3 -73.97428 40.77553 10142018     Gray   FALSE    TRUE    FALSE  FALSE    FALSE
## 4 -73.95964 40.79031 10172018     Gray   FALSE   FALSE    FALSE   TRUE     TRUE
## 5 -73.97027 40.77621 10172018     Gray   FALSE   FALSE    FALSE  FALSE     TRUE
## 6 -73.96836 40.77259 10102018 Cinnamon   FALSE   FALSE    FALSE  FALSE     TRUE

Let’s take a look at what squirrels are doing when spotters see them.

# we can visualize all of this quickly with a simple bar graph after we use gather to organize the data

# create new dataframe in long format: one row per squirrel per activity
# (gather() still works, though tidyr now recommends pivot_longer())

squirrels_df <- gather(s_data, action, T_F, 5:9)
#squirrels_df

## now let's filter the new dataframe for squirrels that aren't clearly lazy

action_squirrels <- squirrels_df %>% filter(T_F == 'TRUE')
#action_squirrels

# label blank fur colors as "Other"
action_squirrels$color <- sub("^$", "Other", action_squirrels$color)
head(action_squirrels)
##        long      lat     date    color  action  T_F
## 1 -73.96400 40.78203 10142018     Gray running TRUE
## 2 -73.97686 40.77028 10102018 Cinnamon running TRUE
## 3 -73.97038 40.77875 10182018     Gray running TRUE
## 4 -73.96711 40.77849 10072018     Gray running TRUE
## 5 -73.95874 40.79085 10082018     Gray running TRUE
## 6 -73.96929 40.77695 10132018     Gray running TRUE
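The `date` column above is stored as a number in month-day-year order (for example, 10142018), which cannot be sorted or grouped as a date directly. To look at when squirrels are spotted, it could first be converted to a proper `Date`. This is a sketch only: the first two values come from the output above, while 9302018 is a made-up value included to show why zero-padding matters.

```r
# Two values from the output above plus one made-up seven-digit value
raw_dates <- c(10142018, 10192018, 9302018)

# Pad to eight digits so a leading zero in the month is restored
date_text <- sprintf("%08d", as.integer(raw_dates))

# Parse as month, day, four-digit year
parsed <- as.Date(date_text, format = "%m%d%Y")

# Count sightings per calendar day
sightings_per_day <- table(parsed)
sightings_per_day
```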
# now let's go ahead and look at what these active squirrels are doing most of the time 

library("rcartocolor")
## Warning: package 'rcartocolor' was built under R version 4.0.4
# geom_bar() counts the rows for each activity, stacked by fur color
ggplot(action_squirrels, aes(x = action, fill = color)) +
  geom_bar() +
  labs(title="Most Common Squirrel Activities, By Color") +
  ylab('Number of Sightings') +
  xlab('Activity') +
  scale_fill_carto_d(name = "color: ", palette = "Vivid")

Are squirrels who aren’t gray more likely to be lazy good-for-nothings? Do Cinnamon and Black squirrels more easily pass undetected by human eyes? Or are there just more gray squirrels than any other kind? This is an easy test.

# first let's see the percentage of gray squirrels in the action table
squ_table <- table(action_squirrels$color)
squ_table
## 
##    Black Cinnamon     Gray    Other 
##      125      529     3170       38
gray_pct_act <- squ_table[["Gray"]] / nrow(action_squirrels)
gray_pct_act
## [1] 0.8208182
# now let's look at the numbers for the full dataset
squ_table_full <- table(squirrels_df$color)
squ_table_full
## 
##             Black Cinnamon     Gray 
##      275      515     1960    12365
gray_pct_full <- squ_table_full[["Gray"]] / nrow(squirrels_df)
gray_pct_full
## [1] 0.8180615

What an enormous relief. The percentage of Gray squirrels in the active dataset (82.1%) is roughly the same as in the full dataset (81.8%). The last thing we need is an elitist population of Black and Cinnamon squirrels taking advantage of an industrious class of Gray squirrels.
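The same comparison can be made for every color at once with `prop.table()`. The sketch below re-enters the counts from the two tables printed above; labeling the blank color in the full table as "Other" is an assumption made here so that the two vectors line up.

```r
# Counts copied from the two printed tables
active <- c(Black = 125, Cinnamon = 529, Gray = 3170, Other = 38)
# Assumption: the unlabeled (blank) color in the full table is "Other"
full   <- c(Other = 275, Black = 515, Cinnamon = 1960, Gray = 12365)

# Convert counts to shares of each total
active_share <- prop.table(active)
full_share   <- prop.table(full)

# Line the two up by color name for a side-by-side view
comparison <- rbind(active = active_share[names(active)],
                    full   = full_share[names(active)])
round(comparison, 4)
```

Indexing by name rather than position avoids mixing up colors, since the two tables list them in different orders.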