Library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
library(ggplot2)
Introduction
For this project we chose three “wide” datasets to practice tidying
and transforming each dataset and then analyzing the results.
Load Dataset #1
childcare_raw <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Project-2/refs/heads/main/Childcare_Need___Supply__All.csv")
head(childcare_raw, 10)
## Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 1 County 53001 Adams County <=60% of SMI
## 2 County 53001 Adams County >60% and <=75% of SMI
## 3 County 53001 Adams County >75% and <=85% of SMI
## 4 County 53001 Adams County >85% of SMI
## 5 County 53001 Adams County <=60% of SMI
## 6 County 53001 Adams County >60% and <=75% of SMI
## 7 County 53001 Adams County >75% and <=85% of SMI
## 8 County 53001 Adams County >85% of SMI
## 9 County 53001 Adams County <=60% of SMI
## 10 County 53001 Adams County >60% and <=75% of SMI
## Age.Group Childcare.Subsidized Private.Care.Estimate
## 1 Infant 18 14
## 2 Infant 0 0
## 3 Infant 0 0
## 4 Infant 0 4
## 5 Preschool 220 23
## 6 Preschool 0 2
## 7 Preschool 0 0
## 8 Preschool NA 11
## 9 School Age 154 58
## 10 School Age 0 5
## Estimated.Children.Receiving.Childcare Estimate.of.Unserved Percent.Need.Met
## 1 32 245 11.6
## 2 0 14 0.0
## 3 0 8 0.0
## 4 4 38 9.5
## 5 243 469 34.1
## 6 2 60 3.2
## 7 0 20 0.0
## 8 NA 100 NA
## 9 212 2178 8.9
## 10 5 219 2.2
tail(childcare_raw)
## Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 15627 ZIP Code 99403 99403 >75% and <=85% of SMI
## 15628 ZIP Code 99403 99403 >85% of SMI
## 15629 ZIP Code 99403 99403 <=60% of SMI
## 15630 ZIP Code 99403 99403 >60% and <=75% of SMI
## 15631 ZIP Code 99403 99403 >75% and <=85% of SMI
## 15632 ZIP Code 99403 99403 >85% of SMI
## Age.Group Childcare.Subsidized Private.Care.Estimate
## 15627 School Age 0 0
## 15628 School Age 0 6
## 15629 Toddler 60 6
## 15630 Toddler 0 6
## 15631 Toddler 0 1
## 15632 Toddler 0 23
## Estimated.Children.Receiving.Childcare Estimate.of.Unserved
## 15627 0 99
## 15628 6 223
## 15629 66 199
## 15630 6 51
## 15631 1 15
## 15632 23 67
## Percent.Need.Met
## 15627 0.0
## 15628 2.6
## 15629 24.9
## 15630 10.5
## 15631 6.2
## 15632 25.6
Tidy and transform the Data
summary(childcare_raw) # check for missing values and column types
## Geographic.Unit Geographic.ID Geographic.Name
## Length:15632 Min. : 53001 Length:15632
## Class :character 1st Qu.: 98292 Class :character
## Mode :character Median : 98843 Mode :character
## Mean :1666602
## 3rd Qu.:5301680
## Max. :5310170
##
## State.Median.Income.Bracket Age.Group Childcare.Subsidized
## Length:15632 Length:15632 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00
## Mean : 20.75
## 3rd Qu.: 0.00
## Max. :7725.00
## NA's :2175
## Private.Care.Estimate Estimated.Children.Receiving.Childcare
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 1.00 Median : 1.00
## Mean : 30.48 Mean : 45.97
## 3rd Qu.: 11.00 3rd Qu.: 15.00
## Max. :18774.00 Max. :15645.00
## NA's :2175
## Estimate.of.Unserved Percent.Need.Met
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 3.0 1st Qu.: 0.00
## Median : 24.0 Median : 5.90
## Mean : 250.1 Mean : 11.36
## 3rd Qu.: 135.0 3rd Qu.: 16.00
## Max. :86007.0 Max. :100.00
## NA's :3922
# convert relevant columns to numeric
numeric_cols <- c("Childcare.Subsidized", "Private.Care.Estimate", "Estimated.Children.Receiving.Childcare", "Estimate.of.Unserved", "Percent.Need.Met")
childcare_raw[numeric_cols] <- lapply(childcare_raw[numeric_cols], as.numeric)
data_cleaned <- childcare_raw %>% drop_na() # remove rows with any missing value (NA)
str(data_cleaned)
## 'data.frame': 11710 obs. of 10 variables:
## $ Geographic.Unit : chr "County" "County" "County" "County" ...
## $ Geographic.ID : int 53001 53001 53001 53001 53001 53001 53001 53001 53001 53001 ...
## $ Geographic.Name : chr "Adams County" "Adams County" "Adams County" "Adams County" ...
## $ State.Median.Income.Bracket : chr "<=60% of SMI" ">60% and <=75% of SMI" ">75% and <=85% of SMI" ">85% of SMI" ...
## $ Age.Group : chr "Infant" "Infant" "Infant" "Infant" ...
## $ Childcare.Subsidized : num 18 0 0 0 220 0 0 154 0 0 ...
## $ Private.Care.Estimate : num 14 0 0 4 23 2 0 58 5 2 ...
## $ Estimated.Children.Receiving.Childcare: num 32 0 0 4 243 2 0 212 5 2 ...
## $ Estimate.of.Unserved : num 245 14 8 38 469 ...
## $ Percent.Need.Met : num 11.6 0 0 9.5 34.1 3.2 0 8.9 2.2 2.3 ...
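Calling `drop_na()` with no arguments removes a row if any column is missing. As a minimal sketch of a narrower alternative, the example below counts missing values per column and then drops rows only where the analyzed column is missing. The `toy` data frame and its values are illustrative assumptions, not rows from the childcare file.

```r
library(dplyr)
library(tidyr)

# Toy data frame standing in for childcare_raw (values are illustrative only)
toy <- data.frame(
  Age.Group = c("Infant", "Preschool", "Toddler", "Infant"),
  Childcare.Subsidized = c(18, NA, 60, 0),
  Percent.Need.Met = c(11.6, NA, 24.9, NA)
)

# Count missing values in every column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Drop rows only when the column needed for the analysis is missing
toy_kept <- toy %>% drop_na(Percent.Need.Met)
print(nrow(toy_kept))
```

Restricting `drop_na()` to named columns keeps rows that are complete for the question being asked, even if an unrelated column is missing.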
# preview the cleaned data
head(data_cleaned, 10)
## Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 1 County 53001 Adams County <=60% of SMI
## 2 County 53001 Adams County >60% and <=75% of SMI
## 3 County 53001 Adams County >75% and <=85% of SMI
## 4 County 53001 Adams County >85% of SMI
## 5 County 53001 Adams County <=60% of SMI
## 6 County 53001 Adams County >60% and <=75% of SMI
## 7 County 53001 Adams County >75% and <=85% of SMI
## 8 County 53001 Adams County <=60% of SMI
## 9 County 53001 Adams County >60% and <=75% of SMI
## 10 County 53001 Adams County >75% and <=85% of SMI
## Age.Group Childcare.Subsidized Private.Care.Estimate
## 1 Infant 18 14
## 2 Infant 0 0
## 3 Infant 0 0
## 4 Infant 0 4
## 5 Preschool 220 23
## 6 Preschool 0 2
## 7 Preschool 0 0
## 8 School Age 154 58
## 9 School Age 0 5
## 10 School Age 0 2
## Estimated.Children.Receiving.Childcare Estimate.of.Unserved Percent.Need.Met
## 1 32 245 11.6
## 2 0 14 0.0
## 3 0 8 0.0
## 4 4 38 9.5
## 5 243 469 34.1
## 6 2 60 3.2
## 7 0 20 0.0
## 8 212 2178 8.9
## 9 5 219 2.2
## 10 2 86 2.3
tail(data_cleaned)
## Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 11705 ZIP Code 99403 99403 >75% and <=85% of SMI
## 11706 ZIP Code 99403 99403 >85% of SMI
## 11707 ZIP Code 99403 99403 <=60% of SMI
## 11708 ZIP Code 99403 99403 >60% and <=75% of SMI
## 11709 ZIP Code 99403 99403 >75% and <=85% of SMI
## 11710 ZIP Code 99403 99403 >85% of SMI
## Age.Group Childcare.Subsidized Private.Care.Estimate
## 11705 School Age 0 0
## 11706 School Age 0 6
## 11707 Toddler 60 6
## 11708 Toddler 0 6
## 11709 Toddler 0 1
## 11710 Toddler 0 23
## Estimated.Children.Receiving.Childcare Estimate.of.Unserved
## 11705 0 99
## 11706 6 223
## 11707 66 199
## 11708 6 51
## 11709 1 15
## 11710 23 67
## Percent.Need.Met
## 11705 0.0
## 11706 2.6
## 11707 24.9
## 11708 10.5
## 11709 6.2
## 11710 25.6
received_childcare <- data_cleaned %>% select(Geographic.ID, Age.Group, State.Median.Income.Bracket, Estimated.Children.Receiving.Childcare)
receive_care_cleaned <- received_childcare %>% filter(Estimated.Children.Receiving.Childcare != 0) # remove rows where the value is 0
head(receive_care_cleaned, 20)
## Geographic.ID Age.Group State.Median.Income.Bracket
## 1 53001 Infant <=60% of SMI
## 2 53001 Infant >85% of SMI
## 3 53001 Preschool <=60% of SMI
## 4 53001 Preschool >60% and <=75% of SMI
## 5 53001 School Age <=60% of SMI
## 6 53001 School Age >60% and <=75% of SMI
## 7 53001 School Age >75% and <=85% of SMI
## 8 53001 School Age >85% of SMI
## 9 53001 Toddler <=60% of SMI
## 10 53001 Toddler >75% and <=85% of SMI
## 11 53003 Infant <=60% of SMI
## 12 53003 Infant >60% and <=75% of SMI
## 13 53003 Infant >75% and <=85% of SMI
## 14 53003 Infant >85% of SMI
## 15 53003 Preschool <=60% of SMI
## 16 53003 Preschool >60% and <=75% of SMI
## 17 53003 Preschool >75% and <=85% of SMI
## 18 53003 Preschool >85% of SMI
## 19 53003 School Age <=60% of SMI
## 20 53003 School Age >85% of SMI
## Estimated.Children.Receiving.Childcare
## 1 32
## 2 4
## 3 243
## 4 2
## 5 212
## 6 5
## 7 2
## 8 30
## 9 103
## 10 1
## 11 31
## 12 3
## 13 1
## 14 16
## 15 132
## 16 12
## 17 6
## 18 63
## 19 79
## 20 6
percent_met <- data_cleaned %>% select(Geographic.ID, Age.Group, State.Median.Income.Bracket, Percent.Need.Met)
percent_met_cleaned <- percent_met %>% filter(Percent.Need.Met != 0)
head(percent_met_cleaned, 20)
## Geographic.ID Age.Group State.Median.Income.Bracket Percent.Need.Met
## 1 53001 Infant <=60% of SMI 11.6
## 2 53001 Infant >85% of SMI 9.5
## 3 53001 Preschool <=60% of SMI 34.1
## 4 53001 Preschool >60% and <=75% of SMI 3.2
## 5 53001 School Age <=60% of SMI 8.9
## 6 53001 School Age >60% and <=75% of SMI 2.2
## 7 53001 School Age >75% and <=85% of SMI 2.3
## 8 53001 School Age >85% of SMI 8.5
## 9 53001 Toddler <=60% of SMI 18.8
## 10 53001 Toddler >75% and <=85% of SMI 7.1
## 11 53003 Infant <=60% of SMI 22.3
## 12 53003 Infant >60% and <=75% of SMI 12.5
## 13 53003 Infant >75% and <=85% of SMI 4.3
## 14 53003 Infant >85% of SMI 34.8
## 15 53003 Preschool <=60% of SMI 37.0
## 16 53003 Preschool >60% and <=75% of SMI 16.0
## 17 53003 Preschool >75% and <=85% of SMI 13.6
## 18 53003 Preschool >85% of SMI 48.5
## 19 53003 School Age <=60% of SMI 7.2
## 20 53003 School Age >85% of SMI 2.2
Analyze Data
# Summarize the average percentage of childcare needs met by state median income bracket.
percent_met_analysis <- percent_met_cleaned %>%
  group_by(State.Median.Income.Bracket) %>%
  summarise(Average_Need_Met = mean(Percent.Need.Met))
ggplot(percent_met_analysis, aes(x = reorder(State.Median.Income.Bracket, Average_Need_Met), y = Average_Need_Met)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average Percentage of Childcare Needs Met by Income Bracket", x = "Income Bracket", y = "Percent Need Met") +
  coord_flip() +
  theme_minimal()

Conclusion for the first dataset
Families earning at most 60% of the State Median Income (SMI) have the
highest average percentage of childcare needs met. These lowest-income
families may benefit the most from government subsidies or other support
programs aimed at childcare needs. Families earning between 60% and 85%
of SMI have the lowest percentage of needs met, while families earning
more than 85% of SMI have the second-highest percentage. Note that rows
with a percentage of zero were filtered out before averaging, so these
figures describe only areas where some need is met.
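The bracket averages in this section are unweighted means of row-level percentages. As a hedged illustration of how that choice affects the result, the sketch below uses the first four rows printed earlier (Adams County, infants) and compares three summaries. The short column names `served`, `unserved`, and `pct_met` are placeholders introduced here, and the pooled formula (children served divided by served plus unserved) is an assumption that happens to reproduce the percentages in those four rows.

```r
library(dplyr)

# First four rows of the childcare data shown earlier (Adams County, infants)
infant <- data.frame(
  bracket  = c("<=60% of SMI", ">60% and <=75% of SMI",
               ">75% and <=85% of SMI", ">85% of SMI"),
  served   = c(32, 0, 0, 4),      # Estimated.Children.Receiving.Childcare
  unserved = c(245, 14, 8, 38),   # Estimate.of.Unserved
  pct_met  = c(11.6, 0, 0, 9.5)   # Percent.Need.Met
)

summary_tbl <- infant %>%
  summarise(
    mean_all     = mean(pct_met),                     # zeros included
    mean_nonzero = mean(pct_met[pct_met != 0]),       # zeros filtered out
    pooled       = 100 * sum(served) / sum(served + unserved)  # weighted by children
  )
print(summary_tbl)
```

Filtering out zeros raises the simple mean, while the pooled figure weights each row by the number of children it represents. Which summary is appropriate depends on the question being asked.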
Load Dataset #2
hiv_raw <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Project-2/refs/heads/main/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20241020.csv")
head(hiv_raw, 10)
## YEAR Borough Neighborhood..U.H.F. SEX RACE.ETHNICITY
## 1 2010 Greenpoint Male Black
## 2 2011 Stapleton - St. George Female Native American
## 3 2010 Southeast Queens Male All
## 4 2012 Upper Westside Female Unknown
## 5 2013 Willowbrook Male Unknown
## 6 2013 East Flatbush - Flatbush Male Black
## 7 2013 East Flatbush - Flatbush Female Native American
## 8 2013 Southwest Queens Female Unknown
## 9 2012 Fordham - Bronx Park Male Unknown
## 10 2010 Flushing - Clearview All All
## TOTAL.NUMBER.OF.HIV.DIAGNOSES HIV.DIAGNOSES.PER.100.000.POPULATION
## 1 6 330.4
## 2 0 0
## 3 23 25.4
## 4 0 0
## 5 0 0
## 6 54 56.5
## 7 0 0
## 8 0 0
## 9 0 0
## 10 14 5.4
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## 1 0
## 2 0
## 3 5
## 4 0
## 5 0
## 6 8
## 7 0
## 8 0
## 9 0
## 10 5
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## 1 0
## 2 0
## 3 21.7
## 4 0
## 5 0
## 6 14.8
## 7 0
## 8 0
## 9 0
## 10 35.7
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## 1 5 275.3
## 2 0 0
## 3 14 15.4
## 4 0 0
## 5 0 0
## 6 33 34.5
## 7 0 0
## 8 0 0
## 9 0 0
## 10 12 4.6
Tidy and transform the data
summary(hiv_raw)
## YEAR Borough Neighborhood..U.H.F. SEX
## Min. :2010 Length:8976 Length:8976 Length:8976
## 1st Qu.:2013 Class :character Class :character Class :character
## Median :2017 Mode :character Mode :character Mode :character
## Mean :2016
## 3rd Qu.:2020
## Max. :2021
## RACE.ETHNICITY TOTAL.NUMBER.OF.HIV.DIAGNOSES
## Length:8976 Length:8976
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## HIV.DIAGNOSES.PER.100.000.POPULATION
## Length:8976
## Class :character
## Mode :character
##
##
##
## TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## Length:8976
## Class :character
## Mode :character
##
##
##
## PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## Length:8976
## Class :character
## Mode :character
##
##
##
## TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## Length:8976 Length:8976
## Class :character Class :character
## Mode :character Mode :character
##
##
##
numeric_cols <- c("TOTAL.NUMBER.OF.HIV.DIAGNOSES", "HIV.DIAGNOSES.PER.100.000.POPULATION", "TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES", "PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES", "TOTAL.NUMBER.OF.AIDS.DIAGNOSES", "AIDS.DIAGNOSES.PER.100.000.POPULATION")
hiv_raw[numeric_cols] <- lapply(hiv_raw[numeric_cols], as.numeric)
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
hiv_cleaned <- hiv_raw %>% drop_na() # remove rows with any missing value (NA), including values that failed numeric conversion
str(hiv_cleaned)
## 'data.frame': 7062 obs. of 11 variables:
## $ YEAR : int 2010 2011 2010 2012 2013 2013 2013 2013 2012 2010 ...
## $ Borough : chr "" "" "" "" ...
## $ Neighborhood..U.H.F. : chr "Greenpoint" "Stapleton - St. George" "Southeast Queens" "Upper Westside" ...
## $ SEX : chr "Male" "Female" "Male" "Female" ...
## $ RACE.ETHNICITY : chr "Black" "Native American" "All" "Unknown" ...
## $ TOTAL.NUMBER.OF.HIV.DIAGNOSES : num 6 0 23 0 0 54 0 0 0 14 ...
## $ HIV.DIAGNOSES.PER.100.000.POPULATION : num 330.4 0 25.4 0 0 ...
## $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES : num 0 0 5 0 0 8 0 0 0 5 ...
## $ PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES: num 0 0 21.7 0 0 14.8 0 0 0 35.7 ...
## $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES : num 5 0 14 0 0 33 0 0 0 12 ...
## $ AIDS.DIAGNOSES.PER.100.000.POPULATION : num 275.3 0 15.4 0 0 ...
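The "NAs introduced by coercion" warnings mean that some strings in these columns could not be parsed as numbers. A minimal sketch for finding out which strings are responsible is shown below. The vector `x` is a made-up example, and the `"*"` and empty-string entries are assumptions for illustration, not values confirmed to appear in the HIV file.

```r
# Toy character column; "*" and "" are hypothetical non-numeric entries
x <- c("6", "0", "23", "*", "", "54")

# suppressWarnings() hides the coercion warning while we inspect the result
x_num <- suppressWarnings(as.numeric(x))

# Entries that were present before conversion but became NA afterwards
bad <- unique(x[!is.na(x) & is.na(x_num)])
print(bad)
```

Running the same check on each real column before `drop_na()` shows whether the dropped rows carry placeholders worth handling explicitly.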
# Group by year, sex, and race/ethnicity for summary
hiv_trans <- hiv_cleaned %>%
  group_by(YEAR, SEX, RACE.ETHNICITY) %>%
  summarise(
    Total_HIV_Diagnoses = sum(TOTAL.NUMBER.OF.HIV.DIAGNOSES),
    .groups = 'drop'
  )
head(hiv_trans, 20)
## # A tibble: 20 × 4
## YEAR SEX RACE.ETHNICITY Total_HIV_Diagnoses
## <int> <chr> <chr> <dbl>
## 1 2010 All All 6391
## 2 2010 Female All 708
## 3 2010 Female Asian/Pacific Islander 10
## 4 2010 Female Black 467
## 5 2010 Female Hispanic 195
## 6 2010 Female Multiracial 0
## 7 2010 Female Native American 0
## 8 2010 Female Unknown 0
## 9 2010 Female White 35
## 10 2010 Male All 2330
## 11 2010 Male Asian/Pacific Islander 74
## 12 2010 Male Black 986
## 13 2010 Male Hispanic 762
## 14 2010 Male Multiracial 8
## 15 2010 Male Native American 1
## 16 2010 Male Unknown 0
## 17 2010 Male White 498
## 18 2011 All All 6125
## 19 2011 Female All 669
## 20 2011 Female Asian/Pacific Islander 10
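The summary contains rows where `SEX` or `RACE.ETHNICITY` is "All" alongside rows for the individual categories. Assuming those "All" rows are aggregates of the other categories (an assumption, not something the file documents here), a plot by sex or by race/ethnicity should keep only one level of aggregation so that each year has a single value per group. The sketch below uses a few 2010 rows copied from the output above.

```r
library(dplyr)

# A few 2010 rows copied from the hiv_trans output above
hiv_2010 <- tibble::tribble(
  ~YEAR, ~SEX,     ~RACE.ETHNICITY, ~Total_HIV_Diagnoses,
  2010,  "All",    "All",           6391,
  2010,  "Female", "All",           708,
  2010,  "Female", "Black",         467,
  2010,  "Male",   "All",           2330,
  2010,  "Male",   "Black",         986
)

# One row per year and sex: keep the race/ethnicity total, drop SEX == "All"
by_sex <- hiv_2010 %>%
  filter(SEX != "All", RACE.ETHNICITY == "All")

# Race/ethnicity breakdown: drop both kinds of aggregate rows
by_race <- hiv_2010 %>%
  filter(SEX != "All", RACE.ETHNICITY != "All")

print(by_sex)
print(by_race)
```

Passing a filtered table like `by_sex` to `geom_line()` gives one point per year for each line instead of several.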
Data Visualization
# HIV diagnoses over time by sex
ggplot(hiv_trans, aes(x = YEAR, y = Total_HIV_Diagnoses, color = SEX)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, shape = 21, fill = "white") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Total HIV Diagnoses Over Time by Sex",
       x = "Year", y = "Total HIV Diagnoses") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")

# HIV diagnoses over time by race/ethnicity
ggplot(hiv_trans, aes(x = YEAR, y = Total_HIV_Diagnoses, color = RACE.ETHNICITY)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, shape = 21, fill = "white") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Total HIV Diagnoses Over Time by Race/Ethnicity",
       x = "Year", y = "Total HIV Diagnoses") +
  theme_minimal()
## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_line()`).
## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).

Conclusion for the second dataset
HIV diagnoses have been consistently higher among males and within
Black and Hispanic communities, though there has been a gradual decline
over time. Disparities in diagnoses by race/ethnicity indicate a need
for targeted interventions in these affected populations.
Load Dataset #3
population_raw <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Project-2/refs/heads/main/world_population.csv")
head(population_raw)
## Rank CCA3 Country.Territory Capital Continent X2022.Population
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## X2020.Population X2015.Population X2010.Population X2000.Population
## 1 38972230 33753499 28189672 19542982
## 2 2866849 2882481 2913399 3182021
## 3 43451666 39543154 35856344 30774621
## 4 46189 51368 54849 58230
## 5 77700 71746 71519 66097
## 6 33428485 28127721 23364185 16394062
## X1990.Population X1980.Population X1970.Population Area..km..
## 1 10694796 12486631 10752971 652230
## 2 3295066 2941651 2324731 28748
## 3 25518074 18739378 13795915 2381741
## 4 47818 32886 27075 199
## 5 53569 35611 19860 468
## 6 11828638 8330047 6029700 1246700
## Density..per.km.. Growth.Rate World.Population.Percentage
## 1 63.0587 1.0257 0.52
## 2 98.8702 0.9957 0.04
## 3 18.8531 1.0164 0.56
## 4 222.4774 0.9831 0.00
## 5 170.5641 1.0100 0.00
## 6 28.5466 1.0315 0.45
Tidy and Transform Data
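Renaming columns by position, as this section does, depends on the name vector having exactly one entry per column. If a name is missing, every later column silently takes the wrong label. A minimal sketch of a guard is shown below, using a small hypothetical data frame (`df`, `new_names`, and `df2` are illustrative names, not part of the project code).

```r
# Toy data frame with three columns (hypothetical stand-in for the real file)
df <- data.frame(X2022.Population = 1:2, X2020.Population = 3:4, Area..km.. = 5:6)

new_names <- c("Population_2022", "Population_2020", "Area_km2")

# Stop early if the number of names does not match the number of columns
stopifnot(length(new_names) == ncol(df))
colnames(df) <- new_names

# Renaming by name avoids depending on column order at all
df2 <- dplyr::rename(df, Area = Area_km2)
print(names(df2))
```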
# rename all 17 columns; the source file also has a 2020 population column
colnames(population_raw) <- c("Rank", "CCA3", "Country", "Capital", "Continent", "Population_2022", "Population_2020", "Population_2015", "Population_2010", "Population_2000", "Population_1990", "Population_1980", "Population_1970", "Area_km2", "Density_per_km2", "Growth_Rate", "World_Pop_Percentage")
population_cleaned <- population_raw
colnames(population_cleaned) <- make.names(colnames(population_cleaned), unique = TRUE)
str(population_cleaned)
## 'data.frame': 234 obs. of 17 variables:
## $ Rank : int 36 138 34 213 203 42 224 201 33 140 ...
## $ CCA3 : chr "AFG" "ALB" "DZA" "ASM" ...
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Capital : chr "Kabul" "Tirana" "Algiers" "Pago Pago" ...
## $ Continent : chr "Asia" "Europe" "Africa" "Oceania" ...
## $ Population_2022 : int 41128771 2842321 44903225 44273 79824 35588987 15857 93763 45510318 2780469 ...
## $ Population_2020 : int 38972230 2866849 43451666 46189 77700 33428485 15585 92664 45036032 2805608 ...
## $ Population_2015 : int 33753499 2882481 39543154 51368 71746 28127721 14525 89941 43257065 2878595 ...
## $ Population_2010 : int 28189672 2913399 35856344 54849 71519 23364185 13172 85695 41100123 2946293 ...
## $ Population_2000 : int 19542982 3182021 30774621 58230 66097 16394062 11047 75055 37070774 3168523 ...
## $ Population_1990 : int 10694796 3295066 25518074 47818 53569 11828638 8316 63328 32637657 3556539 ...
## $ Population_1980 : int 12486631 2941651 18739378 32886 35611 8330047 6560 64888 28024803 3135123 ...
## $ Population_1970 : int 10752971 2324731 13795915 27075 19860 6029700 6283 64516 23842803 2534377 ...
## $ Area_km2 : int 652230 28748 2381741 199 468 1246700 91 442 2780400 29743 ...
## $ Density_per_km2 : num 63.1 98.9 18.9 222.5 170.6 ...
## $ Growth_Rate : num 1.026 0.996 1.016 0.983 1.01 ...
## $ World_Pop_Percentage: num 0.52 0.04 0.56 0 0 0.45 0 0 0.57 0.03 ...
head(population_cleaned)
## Rank CCA3 Country Capital Continent Population_2022
## 1 36 AFG Afghanistan Kabul Asia 41128771
## 2 138 ALB Albania Tirana Europe 2842321
## 3 34 DZA Algeria Algiers Africa 44903225
## 4 213 ASM American Samoa Pago Pago Oceania 44273
## 5 203 AND Andorra Andorra la Vella Europe 79824
## 6 42 AGO Angola Luanda Africa 35588987
## Population_2020 Population_2015 Population_2010 Population_2000
## 1 38972230 33753499 28189672 19542982
## 2 2866849 2882481 2913399 3182021
## 3 43451666 39543154 35856344 30774621
## 4 46189 51368 54849 58230
## 5 77700 71746 71519 66097
## 6 33428485 28127721 23364185 16394062
## Population_1990 Population_1980 Population_1970 Area_km2 Density_per_km2
## 1 10694796 12486631 10752971 652230 63.0587
## 2 3295066 2941651 2324731 28748 98.8702
## 3 25518074 18739378 13795915 2381741 18.8531
## 4 47818 32886 27075 199 222.4774
## 5 53569 35611 19860 468 170.5641
## 6 11828638 8330047 6029700 1246700 28.5466
## Growth_Rate World_Pop_Percentage
## 1 1.0257 0.52
## 2 0.9957 0.04
## 3 1.0164 0.56
## 4 0.9831 0.00
## 5 1.0100 0.00
## 6 1.0315 0.45
# select relevant columns and calculate population growth percentage since 2010
population_long <- population_cleaned %>%
  select(Country, Continent, Population_2010, Population_2022) %>%
  mutate(Population_Growth_Percent = ((Population_2022 - Population_2010) / Population_2010) * 100)
head(population_long, 10)
## Country Continent Population_2010 Population_2022
## 1 Afghanistan Asia 28189672 41128771
## 2 Albania Europe 2913399 2842321
## 3 Algeria Africa 35856344 44903225
## 4 American Samoa Oceania 54849 44273
## 5 Andorra Europe 71519 79824
## 6 Angola Africa 23364185 35588987
## 7 Anguilla North America 13172 15857
## 8 Antigua and Barbuda North America 85695 93763
## 9 Argentina South America 41100123 45510318
## 10 Armenia Asia 2946293 2780469
## Population_Growth_Percent
## 1 45.900140
## 2 -2.439693
## 3 25.230908
## 4 -19.282029
## 5 11.612299
## 6 52.322827
## 7 20.384148
## 8 9.414785
## 9 10.730369
## 10 -5.628225
# show top 10 countries by population growth percentage
top_10_growth <- population_long %>%
  arrange(desc(Population_Growth_Percent)) %>%
  head(10)
top_10_growth
## Country Continent Population_2010 Population_2022
## 1 Jordan Asia 6931258 11285869
## 2 Oman Asia 2881914 4576298
## 3 Niger Africa 16647543 26207977
## 4 Qatar Asia 1713504 2695122
## 5 Mayotte Africa 211786 326101
## 6 Turks and Caicos Islands North America 29726 45703
## 7 Equatorial Guinea Africa 1094524 1674908
## 8 Angola Africa 23364185 35588987
## 9 DR Congo Africa 66391257 99010212
## 10 Chad Africa 11894727 17723315
## Population_Growth_Percent
## 1 62.82569
## 2 58.79370
## 3 57.42850
## 4 57.28717
## 5 53.97666
## 6 53.74756
## 7 53.02616
## 8 52.32283
## 9 49.13140
## 10 49.00144
Analyze Data
ggplot(top_10_growth, aes(x = reorder(Country, Population_Growth_Percent),
                          y = Population_Growth_Percent, fill = Continent)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 10 Countries by Population Growth Since 2010",
       x = "Country",
       y = "Population Growth Percentage",
       fill = "Continent") +
  theme_minimal()

Conclusion for the third dataset
Among the ten fastest-growing countries in this comparison, six are in
Africa and three are in Asia. Africa's lead likely reflects high
fertility rates and a youthful population, while growth in Asia is
likely migration-driven in economically expanding Middle Eastern nations
such as Jordan, Oman, and Qatar.
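The population file stores one column per year, which is the "wide" layout mentioned in the introduction. As a hedged sketch of one way to reshape it into a long table with a single `Year` column, the example below applies `tidyr::pivot_longer()` to two rows copied from the `head(population_raw)` output, using the original column names. The names `pop_wide` and `pop_long` are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the head(population_raw) output, original column names
pop_wide <- data.frame(
  Country.Territory = c("Afghanistan", "Albania"),
  X2022.Population  = c(41128771, 2842321),
  X2010.Population  = c(28189672, 2913399),
  X2000.Population  = c(19542982, 3182021)
)

# Gather the year columns into Year / Population pairs
pop_long <- pop_wide %>%
  pivot_longer(
    cols = ends_with(".Population"),
    names_to = "Year",
    names_pattern = "X(\\d{4})\\.Population",   # keep only the four-digit year
    names_transform = list(Year = as.integer),
    values_to = "Population"
  )
print(pop_long)
```

Because the year is parsed from the column name, growth between any two years can be computed by filtering on `Year` rather than by relying on column position.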