Dataset 1

The first dataset is a list of countries and their populations in past, present, and future (projected) years.

library(tidyverse)  # provides %>%, select(), pivot_longer(), and ggplot() used below

worldPopulation <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/worldPopulationData.csv')
head(worldPopulation)
##   cca2          name   pop2021   pop2020   pop2050   pop2030   pop2019
## 1   CN         China 1444216.1 1439323.8 1402405.2 1464340.2 1433783.7
## 2   IN         India 1393409.0 1380004.4 1639176.0 1503642.3 1366417.8
## 3   US United States  332915.1  331002.7  379419.1  349641.9  329064.9
## 4   ID     Indonesia  276361.8  273523.6  330904.7  299198.4  270625.6
## 5   PK      Pakistan  225199.9  220892.3  338013.2  262958.8  216565.3
## 6   BR        Brazil  213993.4  212559.4  228980.4  223852.1  211049.5
##     pop2015   pop2010   pop2000   pop1990    pop1980   pop1970    area  Density
## 1 1406847.9 1368810.6 1290550.8 1176883.7 1000089.23 827601.39 9706961 148.7815
## 2 1310152.4 1234281.2 1056575.5  873277.8  698952.84 555189.79 3287590 423.8391
## 3  320878.3  309011.5  281710.9  252120.3  229476.35 209513.34 9372610  35.5200
## 4  258383.3  241834.2  211513.8  181413.4  147447.84 114793.18 1904569 145.1046
## 5  199427.0  179424.6  142343.6  107647.9   78054.34  58142.06  881912 255.3542
## 6  204471.8  195713.6  174790.3  149003.2  120694.01  95113.26 8515767  25.1291
##   GrowthRate WorldPercentage rank
## 1     1.0034          0.1834    1
## 2     1.0097          0.1769    2
## 3     1.0058          0.0423    3
## 4     1.0104          0.0351    4
## 5     1.0195          0.0286    5
## 6     1.0067          0.0272    6

Clean up the column names, then keep only the country identifiers and the yearly population columns.

column_names <- c('country_code','country_name','2021','2020','2050','2030','2019', '2015','2010','2000','1990','1980','1970','area','density','growth_rate','world_percentage','rank')

colnames(worldPopulation) <- column_names 

worldPopulation <- worldPopulation %>% select(1:13)

Here I pivot the wide data set into a long data set, turning all of the year columns into a single column named year and storing all of the population values in a column named population.

longData <- pivot_longer(worldPopulation, c('2021','2020','2050','2030','2019','2015','2010','2000','1990','1980','1970'), names_to = 'year', values_to = 'population')

head(longData)
## # A tibble: 6 x 4
##   country_code country_name year  population
##   <chr>        <chr>        <chr>      <dbl>
## 1 CN           China        2021    1444216.
## 2 CN           China        2020    1439324.
## 3 CN           China        2050    1402405.
## 4 CN           China        2030    1464340.
## 5 CN           China        2019    1433784.
## 6 CN           China        2015    1406848.
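
Note that year comes out as a character column above. As a minimal sketch of one alternative (not the approach used in this report), `names_transform` can convert the year names to integers during the pivot. The small `wide` data frame below is an illustrative stand-in built from two rows of the head() output, and the names `wide` and `long` are assumptions made for the example.

```r
library(tidyr)

# Two rows copied from the head() output above; `wide` is an illustrative stand-in
wide <- data.frame(
  country_code = c("CN", "IN"),
  country_name = c("China", "India"),
  `2021` = c(1444216.1, 1393409.0),
  `2030` = c(1464340.2, 1503642.3),
  `2050` = c(1402405.2, 1639176.0),
  check.names = FALSE  # keep the numeric-looking column names as they are
)

# Pivot every column except the identifiers and store year as an integer
long <- pivot_longer(
  wide,
  cols = -c(country_code, country_name),
  names_to = "year",
  names_transform = list(year = as.integer),
  values_to = "population"
)

print(long)
```

With an integer year, ggplot2 would treat the x axis as continuous, so the uneven gaps between years are drawn to scale.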

The total population summed across all countries looks roughly linear when plotted over time. The slope changes slightly in the 2030-2050 time frame, which may reflect a projected decrease in birth rates in that period.

longData %>%
  group_by(year) %>%
  summarise(total_population = sum(population)) %>%
  plot(type='b')
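
Because the years are unevenly spaced (gaps of 1, 5, 10, and 20 years), one way to check how linear the trend is would be to compute the change per year for each interval. The sketch below does this for China's row from the head() output, used only as illustrative input since the world totals are not printed here. The names `china` and `slopes` are assumptions made for the example.

```r
library(dplyr)

# China's populations from the head() output above, used as illustrative input
china <- data.frame(
  year = c(1970, 1980, 1990, 2000, 2010, 2015, 2019, 2020, 2021, 2030, 2050),
  population = c(827601.39, 1000089.23, 1176883.7, 1290550.8, 1368810.6,
                 1406847.9, 1433783.7, 1439323.8, 1444216.1, 1464340.2, 1402405.2)
)

# Average change per year between consecutive observations
slopes <- china %>%
  arrange(year) %>%
  mutate(change_per_year = (population - lag(population)) / (year - lag(year)))

print(slopes)
```

A roughly constant `change_per_year` would indicate linear growth, and a negative value marks an interval where the population is projected to shrink.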

Here I wanted to drill down on specific countries and how their populations are trending. I limited the view to the 15 largest populations measured in 2021. After plotting the results, we can see that India and China are far above the rest of the world. We also see an interesting trend where China's growth begins to level off and India is projected to surpass China's population by 2030. I then remove China and India to get a closer look at the other countries.

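# Find the 15 most populous countries in 2021 (top_n() ranks by the last column, population, by default)
top15 <- longData %>%
  filter(year == "2021") %>%
  top_n(n = 15)
## Selecting by population
# Keep the rows for every year, but only for those 15 countries
top15 <- longData[is.element(longData$country_name, top15$country_name), ]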
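
The same selection can also be sketched with `slice_max()`, which names the ranking column explicitly instead of relying on the last column. This is an alternative under stated assumptions, not the code used for the plots: the small `long` data frame is a stand-in built from values in the head() output, and it keeps the top 2 rather than the top 15 so the result is easy to check by eye.

```r
library(dplyr)

# Illustrative stand-in for longData, using values from the head() output above
long <- data.frame(
  country_name = rep(c("China", "India", "United States", "Indonesia"), each = 2),
  year = rep(c("2021", "2050"), times = 4),
  population = c(1444216.1, 1402405.2, 1393409.0, 1639176.0,
                 332915.1, 379419.1, 276361.8, 330904.7)
)

# Names of the n most populous countries in 2021
top_names <- long %>%
  filter(year == "2021") %>%
  slice_max(population, n = 2) %>%
  pull(country_name)

# Keep all years for just those countries
top_rows <- long %>% filter(country_name %in% top_names)

print(top_rows)
```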

top15 %>%
  ggplot(aes(x = year, y = population, color = country_name)) +
  geom_point()

Two interesting insights from this plot are Nigeria's steep projected growth from 2021 to 2050 and Japan's decreasing population.

top15 %>%
  filter(country_name != "India") %>%
  filter(country_name != "China") %>%
  ggplot(aes(x = year, y = population, color = country_name)) +
  geom_point()
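
To put a number on growth between 2021 and 2050, a percent change per country can be computed. Nigeria and Japan do not appear in the head() output, so this sketch uses the six countries that are printed there. The names `pop`, `growth`, and `pct_change` are assumptions made for the example.

```r
library(dplyr)

# pop2021 and pop2050 for the six countries shown in the head() output above
pop <- data.frame(
  country_name = c("China", "India", "United States", "Indonesia", "Pakistan", "Brazil"),
  pop2021 = c(1444216.1, 1393409.0, 332915.1, 276361.8, 225199.9, 213993.4),
  pop2050 = c(1402405.2, 1639176.0, 379419.1, 330904.7, 338013.2, 228980.4)
)

# Percent change from 2021 to the 2050 projection, largest first
growth <- pop %>%
  mutate(pct_change = 100 * (pop2050 - pop2021) / pop2021) %>%
  arrange(desc(pct_change))

print(growth)
```

Among these six, Pakistan has the largest projected percent increase and China is the only one projected to decline.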

Dataset 2

In this case I want to know more about the attributes of the squirrels spotted and a few of their interactions, so I start by dropping the columns I do not need.

##           X        Y Unique.Squirrel.ID Hectare Shift     Date
## 1 -73.95613 40.79408     37F-PM-1014-03     37F    PM 10142018
## 2 -73.96886 40.78378     21B-AM-1019-04     21B    AM 10192018
## 3 -73.97428 40.77553     11B-PM-1014-08     11B    PM 10142018
## 4 -73.95964 40.79031     32E-PM-1017-14     32E    PM 10172018
## 5 -73.97027 40.77621     13E-AM-1017-05     13E    AM 10172018
## 6 -73.96836 40.77259     11H-AM-1010-03     11H    AM 10102018
##   Hectare.Squirrel.Number   Age Primary.Fur.Color Highlight.Fur.Color
## 1                       3                                            
## 2                       4                                            
## 3                       8                    Gray                    
## 4                      14 Adult              Gray                    
## 5                       5 Adult              Gray            Cinnamon
## 6                       3 Adult          Cinnamon               White
##   Combination.of.Primary.and.Highlight.Color
## 1                                          +
## 2                                          +
## 3                                      Gray+
## 4                                      Gray+
## 5                              Gray+Cinnamon
## 6                             Cinnamon+White
##                                                                             Color.notes
## 1                                                                                      
## 2                                                                                      
## 3                                                                                      
## 4 Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments.
## 5                                                                                      
## 6                                                                                      
##       Location Above.Ground.Sighter.Measurement Specific.Location Running
## 1                                                                   false
## 2                                                                   false
## 3 Above Ground                               10                     false
## 4                                                                   false
## 5 Above Ground                                      on tree stump   false
## 6                                                                   false
##   Chasing Climbing Eating Foraging Other.Activities  Kuks Quaas Moans
## 1   false    false  false    false                  false false false
## 2   false    false  false    false                  false false false
## 3    true    false  false    false                  false false false
## 4   false    false   true     true                  false false false
## 5   false    false  false     true                  false false false
## 6   false    false  false     true                  false false false
##   Tail.flags Tail.twitches Approaches Indifferent Runs.from Other.Interactions
## 1      false         false      false       false     false                   
## 2      false         false      false       false     false                   
## 3      false         false      false       false     false                   
## 4      false         false      false       false      true                   
## 5      false         false      false       false     false                   
## 6      false          true      false        true     false                   
##                                     Lat.Long
## 1 POINT (-73.9561344937861 40.7940823884086)
## 2 POINT (-73.9688574691102 40.7837825208444)
## 3 POINT (-73.97428114848522 40.775533619083)
## 4 POINT (-73.9596413903948 40.7903128889029)
## 5 POINT (-73.9702676472613 40.7762126854894)
## 6 POINT (-73.9683613516225 40.7725908847499)

I select the columns I want and then replace the "true" and "false" values with 1 and 0.

squirrelCensus <- squirrelCensus %>% select(3, 5, 8, 9, 10, 17, 18, 19, 20)
colnames(squirrelCensus) <- c('id','shift','age','primary_color', 'highlight_color', 'chasing', 'climbing','eating', 'foraging')

head(squirrelCensus)
##               id shift   age primary_color highlight_color chasing climbing
## 1 37F-PM-1014-03    PM                                       false    false
## 2 21B-AM-1019-04    AM                                       false    false
## 3 11B-PM-1014-08    PM                Gray                    true    false
## 4 32E-PM-1017-14    PM Adult          Gray                   false    false
## 5 13E-AM-1017-05    AM Adult          Gray        Cinnamon   false    false
## 6 11H-AM-1010-03    AM Adult      Cinnamon           White   false    false
##   eating foraging
## 1  false    false
## 2  false    false
## 3  false    false
## 4   true     true
## 5  false     true
## 6  false     true
squirrelCensus[squirrelCensus == 'false'] <- 0
squirrelCensus[squirrelCensus == 'true'] <- 1

head(squirrelCensus)
##               id shift   age primary_color highlight_color chasing climbing
## 1 37F-PM-1014-03    PM                                           0        0
## 2 21B-AM-1019-04    AM                                           0        0
## 3 11B-PM-1014-08    PM                Gray                       1        0
## 4 32E-PM-1017-14    PM Adult          Gray                       0        0
## 5 13E-AM-1017-05    AM Adult          Gray        Cinnamon       0        0
## 6 11H-AM-1010-03    AM Adult      Cinnamon           White       0        0
##   eating foraging
## 1      0        0
## 2      0        0
## 3      0        0
## 4      1        1
## 5      0        1
## 6      0        1
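
One caveat with whole-data-frame replacement: when the flag columns are character columns, assigning 0 or 1 into them stores the text "0" or "1" rather than numbers. A minimal sketch of an alternative that produces integer columns directly uses `mutate(across())`. The `squirrels` data frame below is an illustrative stand-in built from rows of the output above, with only two of the activity columns.

```r
library(dplyr)

# Illustrative stand-in with the flags stored as text, as in the output above
squirrels <- data.frame(
  id = c("37F-PM-1014-03", "11B-PM-1014-08", "32E-PM-1017-14"),
  chasing = c("false", "true", "false"),
  eating = c("false", "false", "true"),
  stringsAsFactors = FALSE
)

# Convert each flag column to 1L for "true" and 0L for anything else
flags <- squirrels %>%
  mutate(across(c(chasing, eating), ~ as.integer(.x == "true")))

str(flags)
```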

The plot shows that the majority of the squirrels are adults.

squirrelCensus %>%
  filter(age != "") %>%
  ggplot(aes(x = age)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "What age are the squirrels?"
  ) +
  coord_flip()

The most common squirrels in Central Park are gray squirrels.

squirrelCensus %>%
  filter(primary_color != "") %>%
  ggplot(aes(x = primary_color)) +
  geom_bar() +
  labs(
    x = "", y = "",
    title = "What color are the Squirrels?"
  ) +
  coord_flip()

In the last piece I pivot all of the activities into a long table and look at which are the most common among the squirrels. Foraging appears to be the most common.

activities <- squirrelCensus %>% select(1, 6, 7, 8, 9)
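activities_long <- pivot_longer(activities, c('chasing', 'climbing','eating','foraging'), names_to='activity') %>%
  mutate(value = as.numeric(value))  # the 0/1 replacement above can leave the flags as text, so convert before plotting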

ggplot(activities_long, aes(x = activity, y=value)) +
  geom_bar(stat="identity")
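
The ranking of activities can also be checked as a table by summing the flags per activity. This is a sketch on a stand-in `activities` data frame built from four rows of the output above, with the flags already numeric. The column names `observed` and `sightings` are assumptions made for the example.

```r
library(dplyr)
library(tidyr)

# Illustrative stand-in: four squirrels from the output above, flags as numbers
activities <- data.frame(
  id = c("11B-PM-1014-08", "32E-PM-1017-14", "13E-AM-1017-05", "11H-AM-1010-03"),
  chasing = c(1, 0, 0, 0),
  climbing = c(0, 0, 0, 0),
  eating = c(0, 1, 0, 0),
  foraging = c(0, 1, 1, 1)
)

# One row per (squirrel, activity), then total sightings per activity
activity_counts <- activities %>%
  pivot_longer(-id, names_to = "activity", values_to = "observed") %>%
  group_by(activity) %>%
  summarise(sightings = sum(observed)) %>%
  arrange(desc(sightings))

print(activity_counts)
```

A summarised table like this could be passed to `geom_col()` to draw one bar per activity.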

In conclusion, if you spot a squirrel in Central Park, you are most likely to see an adult gray squirrel that is foraging around :)

Dataset 3

nyDeathCauses <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/New_York_City_Leading_Causes_of_Death.csv')

colnames(nyDeathCauses) <- c('year','leading_cause','sex','race_ethnicity','deaths','death_rate','age_adjusted_death_rate')
head(nyDeathCauses)
##   year                                  leading_cause    sex
## 1 2015          Malignant Neoplasms (Cancer: C00-C97) Female
## 2 2015 Diseases of Heart (I00-I09, I11, I13, I20-I51) Female
## 3 2015      Cerebrovascular Disease (Stroke: I60-I69) Female
## 4 2015        Influenza (Flu) and Pneumonia (J09-J18) Female
## 5 2015                    Diabetes Mellitus (E10-E14) Female
## 6 2015                      Alzheimer's Disease (G30) Female
##               race_ethnicity deaths   death_rate age_adjusted_death_rate
## 1 Asian and Pacific Islander    515 79.726669113            78.865386427
## 2 Asian and Pacific Islander    498 77.094914987            81.605131438
## 3 Asian and Pacific Islander     95 14.706861293            15.337930564
## 4 Asian and Pacific Islander     89 13.778006895            14.706362334
## 5 Asian and Pacific Islander     71 10.991443703            11.537396764
## 6 Asian and Pacific Islander     50 7.7404533119            8.4169129758

Non-Hispanic White residents have the highest average death rate in this data set.

nyDeathCauses %>%
  group_by(race_ethnicity) %>%
  summarise(death_rate_avg = mean(as.numeric(death_rate))) %>%
  filter(race_ethnicity != "Not Stated/Unknown", race_ethnicity != "Other Race/ Ethnicity") %>%
  ggplot(aes(x = race_ethnicity, y = death_rate_avg)) +
    geom_bar(stat="identity") +
    labs(
      x = "", y = "",
      title = "Which race has the highest death rate?"
    ) +
  coord_flip() 
## Warning in mean(as.numeric(death_rate)): NAs introduced by coercion
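
The coercion warning means some death_rate entries are not numeric, and `mean()` returns NA for any group that contains one. A minimal sketch of handling this explicitly is shown below. The "." placeholder for a missing rate is an assumption made for illustration, not a confirmed detail of the source file, and `deaths` is a stand-in data frame.

```r
library(dplyr)

# Illustrative stand-in; "." as the missing-value marker is an assumption
deaths <- data.frame(
  race_ethnicity = c("Asian and Pacific Islander", "Asian and Pacific Islander",
                     "Not Stated/Unknown"),
  death_rate = c("79.726669113", "77.094914987", "."),
  stringsAsFactors = FALSE
)

# Convert once, then average while skipping and counting the missing values
rates <- deaths %>%
  mutate(death_rate = suppressWarnings(as.numeric(death_rate))) %>%
  group_by(race_ethnicity) %>%
  summarise(
    death_rate_avg = mean(death_rate, na.rm = TRUE),
    n_missing = sum(is.na(death_rate))
  )

print(rates)
```

Reporting `n_missing` next to the average makes it visible how many rows were skipped in each group.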

In 2015, 2016, and 2017 we observe almost exactly the same number of deaths.

nyDeathCauses %>%
  group_by(year) %>%
  summarise(total_deaths = sum(as.numeric(deaths)))%>%
  filter(year >= 2015) %>%
  ggplot(aes(x=year,y=total_deaths)) +
  geom_bar(stat="identity")
## Warning in mask$eval_all_summarise(quo): NAs introduced by coercion
