The first dataset lists countries and their populations (apparently in thousands) for past, present, and projected future years.
library(tidyverse)  # provides dplyr, tidyr, and ggplot2, which the code below relies on
worldPopulation <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/worldPopulationData.csv')
head(worldPopulation)
## cca2 name pop2021 pop2020 pop2050 pop2030 pop2019
## 1 CN China 1444216.1 1439323.8 1402405.2 1464340.2 1433783.7
## 2 IN India 1393409.0 1380004.4 1639176.0 1503642.3 1366417.8
## 3 US United States 332915.1 331002.7 379419.1 349641.9 329064.9
## 4 ID Indonesia 276361.8 273523.6 330904.7 299198.4 270625.6
## 5 PK Pakistan 225199.9 220892.3 338013.2 262958.8 216565.3
## 6 BR Brazil 213993.4 212559.4 228980.4 223852.1 211049.5
## pop2015 pop2010 pop2000 pop1990 pop1980 pop1970 area Density
## 1 1406847.9 1368810.6 1290550.8 1176883.7 1000089.23 827601.39 9706961 148.7815
## 2 1310152.4 1234281.2 1056575.5 873277.8 698952.84 555189.79 3287590 423.8391
## 3 320878.3 309011.5 281710.9 252120.3 229476.35 209513.34 9372610 35.5200
## 4 258383.3 241834.2 211513.8 181413.4 147447.84 114793.18 1904569 145.1046
## 5 199427.0 179424.6 142343.6 107647.9 78054.34 58142.06 881912 255.3542
## 6 204471.8 195713.6 174790.3 149003.2 120694.01 95113.26 8515767 25.1291
## GrowthRate WorldPercentage rank
## 1 1.0034 0.1834 1
## 2 1.0097 0.1769 2
## 3 1.0058 0.0423 3
## 4 1.0104 0.0351 4
## 5 1.0195 0.0286 5
## 6 1.0067 0.0272 6
column_names <- c('country_code','country_name','2021','2020','2050','2030','2019', '2015','2010','2000','1990','1980','1970','area','density','growth_rate','world_percentage','rank')
colnames(worldPopulation) <- column_names
worldPopulation <- worldPopulation %>% select(1:13)
Here I pivot the wide data set into a long data set, turning all of the year columns into a single column called year and storing all of the population values in a column named population.
longData <- pivot_longer(worldPopulation, c('2021','2020','2050','2030','2019', '2015','2010','2000','1990','1980','1970'), names_to = 'year', values_to ='population')
head(longData)
## # A tibble: 6 x 4
## country_code country_name year population
## <chr> <chr> <chr> <dbl>
## 1 CN China 2021 1444216.
## 2 CN China 2020 1439324.
## 3 CN China 2050 1402405.
## 4 CN China 2030 1464340.
## 5 CN China 2019 1433784.
## 6 CN China 2015 1406848.
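Note that the year column comes out as text (`<chr>`) after the pivot. A minimal sketch of one way to get a numeric year directly, assuming a tidyr version that supports the `names_transform` argument of `pivot_longer()`; the toy table and its values are made up for illustration and are not from the dataset:

```r
library(tidyr)

# Toy wide table; the countries and numbers are hypothetical.
wide <- data.frame(
  country_code = c("AA", "BB"),
  country_name = c("Country A", "Country B"),
  `2020` = c(100, 200),
  `2021` = c(110, 190),
  check.names = FALSE  # keep the numeric-looking column names as-is
)

# Pivot the year columns into rows and convert the year labels to integers.
long <- pivot_longer(
  wide,
  cols = c("2020", "2021"),
  names_to = "year",
  values_to = "population",
  names_transform = list(year = as.integer)
)

print(long)
```

With an integer year, plots place the years on a continuous axis in chronological order rather than treating them as unordered labels.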
The total population summed across all the countries looks roughly linear when plotted. The slope changes slightly over the 2030-2050 time frame, which may reflect a projected decrease in birth rates in that period.
longData %>%
group_by(year) %>%
summarise(total_population = sum(population)) %>%
plot(type='b')
Here I wanted to drill down on specific countries and how their populations are trending. I limited it to the 15 largest populations as measured in 2021. After plotting the results, we can see that India and China are far above the rest of the world. There is an interesting trend where China's growth begins to level off and India is projected to surpass its population by 2030. I want to remove China and India to get a closer look at the other countries.
top15 <- longData %>%
filter(year==2021) %>%
top_n(n=15)
## Selecting by population
top15 <- longData[is.element(longData$country_name,top15$country_name),]
top15 %>%
ggplot(aes(x = year, y = population, color = country_name)) +
geom_point()
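The same "pick the largest countries in one year, then keep all of their rows" step can also be sketched with `slice_max()` and `semi_join()` from dplyr. This is an alternative formulation on made-up toy data, not the code used for the plot above:

```r
library(dplyr)

# Toy long table; names and values are hypothetical.
toy <- data.frame(
  country_name = rep(c("A", "B", "C"), each = 2),
  year = rep(c(2021L, 2050L), times = 3),
  population = c(10, 12, 30, 28, 20, 40)
)

# Names of the 2 most populous countries in 2021.
top2_names <- toy %>%
  filter(year == 2021) %>%
  slice_max(population, n = 2) %>%
  select(country_name)

# Keep every year's row for those countries.
top2 <- semi_join(toy, top2_names, by = "country_name")

print(top2)
```

Because the ranking uses only the 2021 rows, country C is kept for its 2021 value even though its 2050 value is the largest in the toy table.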
Two interesting insights from this plot are Nigeria’s rapid projected growth from 2021 to 2050 and Japan’s decreasing population.
top15 %>%
filter(country_name != "India") %>%
filter(country_name != "China") %>%
ggplot(aes(x = year, y = population, color = country_name)) +
geom_point()
In this case I want to know more about the attributes of the squirrels spotted and a few of their interactions, so I start by dropping the columns I do not need.
## X Y Unique.Squirrel.ID Hectare Shift Date
## 1 -73.95613 40.79408 37F-PM-1014-03 37F PM 10142018
## 2 -73.96886 40.78378 21B-AM-1019-04 21B AM 10192018
## 3 -73.97428 40.77553 11B-PM-1014-08 11B PM 10142018
## 4 -73.95964 40.79031 32E-PM-1017-14 32E PM 10172018
## 5 -73.97027 40.77621 13E-AM-1017-05 13E AM 10172018
## 6 -73.96836 40.77259 11H-AM-1010-03 11H AM 10102018
## Hectare.Squirrel.Number Age Primary.Fur.Color Highlight.Fur.Color
## 1 3
## 2 4
## 3 8 Gray
## 4 14 Adult Gray
## 5 5 Adult Gray Cinnamon
## 6 3 Adult Cinnamon White
## Combination.of.Primary.and.Highlight.Color
## 1 +
## 2 +
## 3 Gray+
## 4 Gray+
## 5 Gray+Cinnamon
## 6 Cinnamon+White
## Color.notes
## 1
## 2
## 3
## 4 Nothing selected as Primary. Gray selected as Highlights. Made executive adjustments.
## 5
## 6
## Location Above.Ground.Sighter.Measurement Specific.Location Running
## 1 false
## 2 false
## 3 Above Ground 10 false
## 4 false
## 5 Above Ground on tree stump false
## 6 false
## Chasing Climbing Eating Foraging Other.Activities Kuks Quaas Moans
## 1 false false false false false false false
## 2 false false false false false false false
## 3 true false false false false false false
## 4 false false true true false false false
## 5 false false false true false false false
## 6 false false false true false false false
## Tail.flags Tail.twitches Approaches Indifferent Runs.from Other.Interactions
## 1 false false false false false
## 2 false false false false false
## 3 false false false false false
## 4 false false false false true
## 5 false false false false false
## 6 false true false true false
## Lat.Long
## 1 POINT (-73.9561344937861 40.7940823884086)
## 2 POINT (-73.9688574691102 40.7837825208444)
## 3 POINT (-73.97428114848522 40.775533619083)
## 4 POINT (-73.9596413903948 40.7903128889029)
## 5 POINT (-73.9702676472613 40.7762126854894)
## 6 POINT (-73.9683613516225 40.7725908847499)
I select the columns I want and then replace the 'false' and 'true' values with 0 and 1.
squirrelCensus <- squirrelCensus %>% select(3, 5, 8, 9, 10, 17, 18, 19, 20)
colnames(squirrelCensus) <- c('id','shift','age','primary_color', 'highlight_color', 'chasing', 'climbing','eating', 'foraging')
head(squirrelCensus)
## id shift age primary_color highlight_color chasing climbing
## 1 37F-PM-1014-03 PM false false
## 2 21B-AM-1019-04 AM false false
## 3 11B-PM-1014-08 PM Gray true false
## 4 32E-PM-1017-14 PM Adult Gray false false
## 5 13E-AM-1017-05 AM Adult Gray Cinnamon false false
## 6 11H-AM-1010-03 AM Adult Cinnamon White false false
## eating foraging
## 1 false false
## 2 false false
## 3 false false
## 4 true true
## 5 false true
## 6 false true
squirrelCensus[squirrelCensus == 'false'] <- 0
squirrelCensus[squirrelCensus == 'true'] <- 1
head(squirrelCensus)
## id shift age primary_color highlight_color chasing climbing
## 1 37F-PM-1014-03 PM 0 0
## 2 21B-AM-1019-04 AM 0 0
## 3 11B-PM-1014-08 PM Gray 1 0
## 4 32E-PM-1017-14 PM Adult Gray 0 0
## 5 13E-AM-1017-05 AM Adult Gray Cinnamon 0 0
## 6 11H-AM-1010-03 AM Adult Cinnamon White 0 0
## eating foraging
## 1 0 0
## 2 0 0
## 3 0 0
## 4 1 1
## 5 0 1
## 6 0 1
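One caveat with replacing values inside text columns is that the results may remain text ("0" and "1") rather than numbers, depending on how the file was read. A minimal sketch of an explicit conversion with `mutate(across(...))`, assuming the activity columns hold the strings "true" and "false"; the toy rows are made up:

```r
library(dplyr)

# Toy table with text flags; ids and values are hypothetical.
toy <- data.frame(
  id = c("s1", "s2", "s3"),
  eating = c("false", "true", "false"),
  foraging = c("true", "true", "false"),
  stringsAsFactors = FALSE
)

# Compare each flag to "true" and store the result as an integer 0/1.
toy_int <- toy %>%
  mutate(across(c(eating, foraging), ~ as.integer(.x == "true")))

str(toy_int)
```

Integer columns can then be summed or averaged directly, which is useful for counting how often each activity was observed.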
The plot shows that the majority of the squirrels are adults.
squirrelCensus %>%
filter(age != "") %>%
ggplot(aes(x = age)) +
geom_bar() +
labs(
x = "", y = "",
title = "What is the age of the squirrels?"
) +
coord_flip()
The most common squirrels in Central Park are gray squirrels.
squirrelCensus %>%
filter(primary_color != "") %>%
ggplot(aes(x = primary_color)) +
geom_bar() +
labs(
x = "", y = "",
title = "What color are the Squirrels?"
) +
coord_flip()
In the last piece I pivot all the activities into a long table and look at which are the most common among squirrels. Foraging appears to be the most common.
activities <- squirrelCensus %>% select(1, 6, 7, 8, 9)
activities_long <- pivot_longer(activities, c('chasing', 'climbing','eating','foraging'), names_to='activity')
ggplot(activities_long, aes(x = activity, y=value)) +
geom_bar(stat="identity")
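To read exact totals rather than bar heights, the long table can be summarised per activity. A minimal sketch, assuming the activity flags are stored as numeric 0/1 values; the toy data below is made up:

```r
library(dplyr)
library(tidyr)

# Toy table of 0/1 activity flags; ids and values are hypothetical.
toy <- data.frame(
  id = c("s1", "s2", "s3"),
  chasing = c(0L, 1L, 0L),
  foraging = c(1L, 1L, 0L)
)

# One row per squirrel and activity, then count sightings per activity.
counts <- toy %>%
  pivot_longer(c(chasing, foraging), names_to = "activity", values_to = "value") %>%
  group_by(activity) %>%
  summarise(n_sightings = sum(value))

print(counts)
```

Sorting the result by `n_sightings` gives a ranked list of activities that can be quoted alongside the chart.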
In conclusion, if you spot a squirrel in Central Park, you are most likely to see an adult gray squirrel that is foraging :)
nyDeathCauses <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/New_York_City_Leading_Causes_of_Death.csv')
colnames(nyDeathCauses) <- c('year','leading_cause','sex','race_ethnicity','deaths','death_rate','age_adjusted_death_rate')
head(nyDeathCauses)
## year leading_cause sex
## 1 2015 Malignant Neoplasms (Cancer: C00-C97) Female
## 2 2015 Diseases of Heart (I00-I09, I11, I13, I20-I51) Female
## 3 2015 Cerebrovascular Disease (Stroke: I60-I69) Female
## 4 2015 Influenza (Flu) and Pneumonia (J09-J18) Female
## 5 2015 Diabetes Mellitus (E10-E14) Female
## 6 2015 Alzheimer's Disease (G30) Female
## race_ethnicity deaths death_rate age_adjusted_death_rate
## 1 Asian and Pacific Islander 515 79.726669113 78.865386427
## 2 Asian and Pacific Islander 498 77.094914987 81.605131438
## 3 Asian and Pacific Islander 95 14.706861293 15.337930564
## 4 Asian and Pacific Islander 89 13.778006895 14.706362334
## 5 Asian and Pacific Islander 71 10.991443703 11.537396764
## 6 Asian and Pacific Islander 50 7.7404533119 8.4169129758
Non-Hispanic Whites have the highest average death rate in this data set.
nyDeathCauses %>%
group_by(race_ethnicity) %>%
summarise(death_rate_avg = mean(as.numeric(death_rate))) %>%
filter(race_ethnicity != "Not Stated/Unknown", race_ethnicity != "Other Race/ Ethnicity") %>%
ggplot(aes(x = race_ethnicity, y = death_rate_avg)) +
geom_bar(stat="identity") +
labs(
x = "", y = "",
title = "Which race has the highest death rate?"
) +
coord_flip()
## Warning in mean(as.numeric(death_rate)): NAs introduced by coercion
## Warning in mean(as.numeric(death_rate)): NAs introduced by coercion
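The coercion warnings indicate that some `death_rate` entries are not numeric. A minimal sketch of handling this explicitly, assuming the non-numeric entries are placeholders such as "." (an assumption, not verified against the file); the toy groups and values are made up:

```r
library(dplyr)

# Toy table; group labels and rates are hypothetical.
toy <- data.frame(
  race_ethnicity = c("Group A", "Group A", "Group B", "Group B"),
  death_rate = c("10.5", "20.5", ".", "30"),
  stringsAsFactors = FALSE
)

rates <- toy %>%
  # Non-numeric placeholders become NA; the expected warning is silenced.
  mutate(death_rate = suppressWarnings(as.numeric(death_rate))) %>%
  group_by(race_ethnicity) %>%
  # Average only the values that are present.
  summarise(death_rate_avg = mean(death_rate, na.rm = TRUE))

print(rates)
```

Without `na.rm = TRUE`, any group containing a placeholder would average to `NA`, so the choice of whether to drop or keep such rows affects which groups appear in the chart.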
In 2015, 2016, and 2017 we observed almost exactly the same number of deaths.
nyDeathCauses %>%
group_by(year) %>%
summarise(total_deaths = sum(as.numeric(deaths)))%>%
filter(year >= 2015) %>%
ggplot(aes(x=year,y=total_deaths)) +
geom_bar(stat="identity")
## Warning in mask$eval_all_summarise(quo): NAs introduced by coercion