Data 607 Project 2: World Population Dataset

For this project, I import three untidy datasets from our Week 5 discussion board, tidy them, and analyze them. I start by loading the libraries I need.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(readr)
library(ggplot2)

World Population Dataset

This first Rmd file covers the dataset from Matthew Roland’s post, which focuses on world populations. It has columns describing each country along with its population in several different years.

## Step 1: Import the dataset

worldPopulation <- read_csv("world_population.csv")
## Rows: 234 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): CCA3, Country/Territory, Capital, Continent
## dbl (13): Rank, 2022 Population, 2020 Population, 2015 Population, 2010 Popu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(worldPopulation) 
##  [1] "Rank"                        "CCA3"                       
##  [3] "Country/Territory"           "Capital"                    
##  [5] "Continent"                   "2022 Population"            
##  [7] "2020 Population"             "2015 Population"            
##  [9] "2010 Population"             "2000 Population"            
## [11] "1990 Population"             "1980 Population"            
## [13] "1970 Population"             "Area (km²)"                 
## [15] "Density (per km²)"           "Growth Rate"                
## [17] "World Population Percentage"

## Step 2: Tidy the dataset
To tidy this dataset so that each row holds one observation, I need to collapse the separate year population columns. To do so, I used pivot_longer() to gather all the year population columns into two new columns: Year and Population #. Since we had 234 rows and 8 year population columns, we now have 234 × 8 = 1,872 rows.
I also sorted the tidy data by country and wrote it to a new CSV file.

worldPopulation_tidy <- 
  pivot_longer(worldPopulation, 
               cols=c('2022 Population', '2020 Population', '2015 Population', '2010 Population', 
                      '2000 Population', '1990 Population', '1980 Population', '1970 Population'),
               names_to = 'Year',
               values_to = 'Population #')
worldPopulation_tidy <- worldPopulation_tidy[order(worldPopulation_tidy$`Country/Territory`),]
worldPopulation_tidy
## # A tibble: 1,872 × 11
##     Rank CCA3  `Country/Territory` Capital Continent `Area (km²)`
##    <dbl> <chr> <chr>               <chr>   <chr>            <dbl>
##  1    36 AFG   Afghanistan         Kabul   Asia            652230
##  2    36 AFG   Afghanistan         Kabul   Asia            652230
##  3    36 AFG   Afghanistan         Kabul   Asia            652230
##  4    36 AFG   Afghanistan         Kabul   Asia            652230
##  5    36 AFG   Afghanistan         Kabul   Asia            652230
##  6    36 AFG   Afghanistan         Kabul   Asia            652230
##  7    36 AFG   Afghanistan         Kabul   Asia            652230
##  8    36 AFG   Afghanistan         Kabul   Asia            652230
##  9   138 ALB   Albania             Tirana  Europe           28748
## 10   138 ALB   Albania             Tirana  Europe           28748
## # ℹ 1,862 more rows
## # ℹ 5 more variables: `Density (per km²)` <dbl>, `Growth Rate` <dbl>,
## #   `World Population Percentage` <dbl>, Year <chr>, `Population #` <dbl>
worldPopulation_tidy[,c(1:3,10:11)]
## # A tibble: 1,872 × 5
##     Rank CCA3  `Country/Territory` Year            `Population #`
##    <dbl> <chr> <chr>               <chr>                    <dbl>
##  1    36 AFG   Afghanistan         2022 Population       41128771
##  2    36 AFG   Afghanistan         2020 Population       38972230
##  3    36 AFG   Afghanistan         2015 Population       33753499
##  4    36 AFG   Afghanistan         2010 Population       28189672
##  5    36 AFG   Afghanistan         2000 Population       19542982
##  6    36 AFG   Afghanistan         1990 Population       10694796
##  7    36 AFG   Afghanistan         1980 Population       12486631
##  8    36 AFG   Afghanistan         1970 Population       10752971
##  9   138 ALB   Albania             2022 Population        2842321
## 10   138 ALB   Albania             2020 Population        2866849
## # ℹ 1,862 more rows
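
The Year values in the output above still carry the " Population" suffix from the original column names. As a possible alternative to cleaning them up afterwards, the year can be parsed during the pivot itself. The sketch below is not the approach used in this project; it assumes a tidyr version that supports `names_pattern` and `names_transform`, and it runs on a hypothetical two-country excerpt (`wide`) built from values shown above rather than on the full file.

```r
library(tidyr)
library(tibble)

# Hypothetical two-country, two-year excerpt using values from the output above
wide <- tibble(
  `Country/Territory` = c("Afghanistan", "Albania"),
  `2022 Population`   = c(41128771, 2842321),
  `2020 Population`   = c(38972230, 2866849)
)

long <- pivot_longer(
  wide,
  cols            = ends_with("Population"),   # selects only the "<year> Population" columns
  names_to        = "Year",
  names_pattern   = "(\\d{4}) Population",     # keep just the four-digit year
  names_transform = list(Year = as.integer),   # store Year as an integer, not text
  values_to       = "Population #"
)

long
```

On the full dataset, `ends_with("Population")` would skip "World Population Percentage" because that name does not end in "Population".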
# Strip the " Population" suffix so Year holds only the four-digit year
# (the column is still stored as character, e.g. "2022")
worldPopulation_tidy$Year <- sub(" Population", "", worldPopulation_tidy$Year, fixed = TRUE)

write.csv(worldPopulation_tidy,file='/Users/Ari/Data607/project2/worldPopulation_tidy.csv')
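
One thing to be aware of when writing the tidy data out: by default `write.csv()` also writes the row names as an extra, unnamed first column, which shows up as a spurious column when the file is read back. A minimal sketch of avoiding that, using a hypothetical small data frame (`tidy_sample`) and a temporary file instead of the project path:

```r
# Hypothetical small sample standing in for the tidy data
tidy_sample <- data.frame(
  Country    = c("Afghanistan", "Albania"),
  Year       = c(2022L, 2022L),
  Population = c(41128771, 2842321)
)

# Write to a temporary location; row.names = FALSE drops the row-name column
out_path <- file.path(tempdir(), "worldPopulation_tidy_sample.csv")
write.csv(tidy_sample, file = out_path, row.names = FALSE)

# Reading the file back returns only the original three columns
round_trip <- read.csv(out_path)
names(round_trip)
```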

## Step 3: Analysis

ggplot(data=worldPopulation_tidy, aes(x=Year, y=`Population #`,color=`Country/Territory`)) +
  geom_line() +
  geom_point() +
  theme(legend.position = "none")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

The graph above shows how each country's population changed across the 8 recorded years. Near the bottom of the graph, most countries' populations appear to change very little. Only the two countries at the top show a significant increase.
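
The `geom_line()` message in the output ("Each group consists of only one observation") most likely comes from Year being stored as text. With a discrete x variable, ggplot2 groups by every combination of the discrete variables, so each country-year pair becomes its own one-point group and no lines are drawn. A minimal sketch of one way around this, on a hypothetical four-row sample (`pop`) with simplified column names: convert the year to a number and group explicitly by country. The log scale is an optional extra, not part of the original plot, that may make the smaller countries easier to tell apart.

```r
library(ggplot2)

# Hypothetical sample with simplified column names; values from the output above
pop <- data.frame(
  country    = rep(c("Afghanistan", "Albania"), each = 2),
  year       = c("2022", "2020", "2022", "2020"),
  population = c(41128771, 38972230, 2842321, 2866849),
  stringsAsFactors = FALSE
)

# Store the year as a number so the x axis is continuous
pop$year <- as.integer(pop$year)

p <- ggplot(pop, aes(x = year, y = population,
                     color = country, group = country)) +
  geom_line() +                     # one line per country
  geom_point() +
  scale_y_log10() +                 # optional: spreads out small populations
  theme(legend.position = "none")

p
```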

worldPopulation_2022 <- worldPopulation_tidy[ which(worldPopulation_tidy$Year == '2022'), ]
ggplot(data=worldPopulation_2022, aes(x=`Country/Territory`, y=`Population #`)) +
  geom_bar(stat="identity")

The graph above shows each country's population in 2022. Most of them fall within the same range. Only two countries have significantly higher populations than the rest.
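
With 234 countries on one axis, individual bars are hard to read. One possible refinement, sketched here under assumptions, is to plot only the most populous countries. The helpers `top_n_countries()` and `plot_top_n()` are hypothetical names introduced for illustration, and `sample_2022` is a two-row stand-in that reuses the column names of the tidy data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: keep the n rows with the largest population
top_n_countries <- function(df, n = 10) {
  df %>% slice_max(`Population #`, n = n, with_ties = FALSE)
}

# Hypothetical helper: horizontal bar chart of those rows, largest at the top
plot_top_n <- function(df, n = 10) {
  ggplot(top_n_countries(df, n),
         aes(x = reorder(`Country/Territory`, `Population #`),
             y = `Population #`)) +
    geom_col() +                    # same as geom_bar(stat = "identity")
    coord_flip() +                  # flip so country names are readable
    labs(x = "Country/Territory", y = "Population")
}

# Two-row stand-in using 2022 values from the output above
sample_2022 <- tibble(
  `Country/Territory` = c("Albania", "Afghanistan"),
  `Population #`      = c(2842321, 41128771)
)

top1 <- top_n_countries(sample_2022, n = 1)
bar_plot <- plot_top_n(sample_2022, n = 2)
```

On the real data, the equivalent call would be `plot_top_n(worldPopulation_2022, n = 10)`.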