For this project, I import three untidy datasets from our week 5 discussion board, tidy them, and analyze them. I started by loading my libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(ggplot2)
This first Rmd file covers the dataset from Matthew Roland’s post, which focuses on world populations. It has columns with information about each country and its population in different years.
##Step 1: Import the dataset
worldPopulation <- read_csv("world_population.csv")
## Rows: 234 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): CCA3, Country/Territory, Capital, Continent
## dbl (13): Rank, 2022 Population, 2020 Population, 2015 Population, 2010 Popu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(worldPopulation)
## [1] "Rank" "CCA3"
## [3] "Country/Territory" "Capital"
## [5] "Continent" "2022 Population"
## [7] "2020 Population" "2015 Population"
## [9] "2010 Population" "2000 Population"
## [11] "1990 Population" "1980 Population"
## [13] "1970 Population" "Area (km²)"
## [15] "Density (per km²)" "Growth Rate"
## [17] "World Population Percentage"
##Step 2: Tidy the dataset
To tidy this dataset so that each row holds one observation, I need to
collapse the separate population columns for the different years. To do
so, I used pivot_longer() to gather all the year population columns into
2 new columns: Year and Population #. Since we had 234 rows and 8 year
population columns, we now have 234 × 8 = 1,872 rows.
I also sorted the tidy data by country and wrote it to a new CSV file.
worldPopulation_tidy <-
  pivot_longer(worldPopulation,
               cols = c('2022 Population', '2020 Population', '2015 Population', '2010 Population',
                        '2000 Population', '1990 Population', '1980 Population', '1970 Population'),
               names_to = 'Year',
               values_to = 'Population #')
worldPopulation_tidy <- worldPopulation_tidy[order(worldPopulation_tidy$`Country/Territory`),]
worldPopulation_tidy
## # A tibble: 1,872 × 11
## Rank CCA3 `Country/Territory` Capital Continent `Area (km²)`
## <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 36 AFG Afghanistan Kabul Asia 652230
## 2 36 AFG Afghanistan Kabul Asia 652230
## 3 36 AFG Afghanistan Kabul Asia 652230
## 4 36 AFG Afghanistan Kabul Asia 652230
## 5 36 AFG Afghanistan Kabul Asia 652230
## 6 36 AFG Afghanistan Kabul Asia 652230
## 7 36 AFG Afghanistan Kabul Asia 652230
## 8 36 AFG Afghanistan Kabul Asia 652230
## 9 138 ALB Albania Tirana Europe 28748
## 10 138 ALB Albania Tirana Europe 28748
## # ℹ 1,862 more rows
## # ℹ 5 more variables: `Density (per km²)` <dbl>, `Growth Rate` <dbl>,
## # `World Population Percentage` <dbl>, Year <chr>, `Population #` <dbl>
worldPopulation_tidy[,c(1:3,10:11)]
## # A tibble: 1,872 × 5
## Rank CCA3 `Country/Territory` Year `Population #`
## <dbl> <chr> <chr> <chr> <dbl>
## 1 36 AFG Afghanistan 2022 Population 41128771
## 2 36 AFG Afghanistan 2020 Population 38972230
## 3 36 AFG Afghanistan 2015 Population 33753499
## 4 36 AFG Afghanistan 2010 Population 28189672
## 5 36 AFG Afghanistan 2000 Population 19542982
## 6 36 AFG Afghanistan 1990 Population 10694796
## 7 36 AFG Afghanistan 1980 Population 12486631
## 8 36 AFG Afghanistan 1970 Population 10752971
## 9 138 ALB Albania 2022 Population 2842321
## 10 138 ALB Albania 2020 Population 2866849
## # ℹ 1,862 more rows
# Strip the " Population" suffix so Year holds just the four-digit year (still stored as text)
worldPopulation_tidy$Year <- sub(" Population", "", worldPopulation_tidy$Year, fixed = TRUE)
write.csv(worldPopulation_tidy,file='/Users/Ari/Data607/project2/worldPopulation_tidy.csv')
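As a possible alternative to the two-step approach above, the pivot and the year cleanup can be combined in one pivot_longer() call. The sketch below uses a small made-up table (`toy`, with hypothetical values) rather than the real file, and it stores Year as an integer instead of text, which is an assumption on my part and not what the original code does. The value column is also named `Population` here for simplicity.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data shaped like the original: one population column per year
toy <- tibble(
  `Country/Territory` = c("A", "B"),
  `2022 Population` = c(120, 45),
  `2020 Population` = c(110, 44),
  `1970 Population` = c(60, 30)
)

toy_tidy <- pivot_longer(
  toy,
  cols = ends_with(" Population"),            # select every year column at once
  names_to = "Year",
  names_pattern = "(\\d{4}) Population",      # keep only the four-digit year
  names_transform = list(Year = as.integer),  # store the year as an integer
  values_to = "Population"
)

# Sanity check: rows after pivoting = original rows x number of year columns
stopifnot(nrow(toy_tidy) == nrow(toy) * 3)
```

With the real data, the same check would compare against 234 × 8 = 1,872 rows.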
##Step 3: Analysis
ggplot(data=worldPopulation_tidy, aes(x=Year, y=`Population #`, color=`Country/Territory`)) +
  geom_line() +
  geom_point() +
  theme(legend.position = "none")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
The graph above shows how each country’s population changed across the 8 recorded years. The countries near the bottom of the graph show little visible change at this scale. Only the two at the top show a significant curve/increase.
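The geom_line() message in the output above most likely appears because Year is still stored as text, so ggplot treats the x axis as discrete and puts each country-year pair in its own group. A minimal sketch of one way to address this, using made-up data (`toy` and its values are hypothetical), is to convert Year to an integer and group the lines by country:

```r
library(ggplot2)

# Hypothetical toy data, with Year stored as text as in the tidy table
toy <- data.frame(
  Country = rep(c("A", "B"), each = 3),
  Year = rep(c("1970", "2000", "2022"), times = 2),
  Population = c(60, 90, 120, 30, 40, 45)
)

# Convert the text years to integers so the x axis is continuous
toy$Year <- as.integer(toy$Year)

# group = Country makes the one-line-per-country intent explicit
p <- ggplot(toy, aes(x = Year, y = Population, color = Country, group = Country)) +
  geom_line() +
  geom_point() +
  theme(legend.position = "none")
```

A continuous x axis also spaces the years by their actual distance (1970 to 1980 versus 2020 to 2022), which changes how steep each curve looks.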
worldPopulation_2022 <- worldPopulation_tidy[which(worldPopulation_tidy$Year == '2022'), ]
ggplot(data=worldPopulation_2022, aes(x=`Country/Territory`, y=`Population #`)) +
  geom_bar(stat="identity")
The graph above shows each country’s population in 2022. Most countries fall within the same range; only two have significantly higher populations than the rest.
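To name the countries that stand out, rather than reading them off a crowded axis, one option is to pull the largest rows with dplyr's slice_max() and plot only those. This is a sketch on made-up data (`toy_2022` and its values are hypothetical, and the column is called `Population` for simplicity), not the analysis from the original post:

```r
library(dplyr)
library(ggplot2)

# Hypothetical 2022 snapshot: one row per country
toy_2022 <- data.frame(
  Country = c("A", "B", "C", "D", "E"),
  Population = c(1400, 1390, 330, 50, 5)
)

# The two largest populations, sorted from largest to smallest
top2 <- slice_max(toy_2022, Population, n = 2)

# Bar chart of the top 3 only, ordered by population and flipped
# so the country labels stay readable
p <- ggplot(slice_max(toy_2022, Population, n = 3),
            aes(x = reorder(Country, Population), y = Population)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country/Territory")
```

Limiting the chart to the top few rows trades completeness for legibility, which may suit a dataset with more than 200 countries.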