Project 2

Library

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

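# dplyr, tidyr, and ggplot2 are already attached by library(tidyverse);
# loading them again is redundant but harmless
library(dplyr)
library(tidyr)
library(ggplot2)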

Introduction

For this project we choose three “wide” datasets, practice tidying and transforming each one, and then analyze the results.
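As a minimal sketch of what reshaping a “wide” table into tidy form can look like, the example below applies `tidyr::pivot_longer()` to a small made-up table. The data frame `wide`, its column names, and its values are hypothetical and are not taken from the project datasets.

```r
library(tidyr)

# Hypothetical wide table: one column per year (values are made up)
wide <- data.frame(
  country  = c("A", "B"),
  pop_2010 = c(100, 200),
  pop_2020 = c(110, 190)
)

# Reshape to one row per country-year observation
long <- pivot_longer(
  wide,
  cols = starts_with("pop_"),
  names_to = "year",
  names_prefix = "pop_",
  values_to = "population"
)

# The year is extracted from the column name as text, so convert it
long$year <- as.integer(long$year)

print(long)
```

In the long form each variable (country, year, population) has its own column, which is the shape that `group_by()` and `ggplot()` expect.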

Load Dataset #1

childcare_raw <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Project-2/refs/heads/main/Childcare_Need___Supply__All.csv")
head(childcare_raw, 10)

##    Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 1           County         53001    Adams County                <=60% of SMI
## 2           County         53001    Adams County       >60% and <=75% of SMI
## 3           County         53001    Adams County       >75% and <=85% of SMI
## 4           County         53001    Adams County                 >85% of SMI
## 5           County         53001    Adams County                <=60% of SMI
## 6           County         53001    Adams County       >60% and <=75% of SMI
## 7           County         53001    Adams County       >75% and <=85% of SMI
## 8           County         53001    Adams County                 >85% of SMI
## 9           County         53001    Adams County                <=60% of SMI
## 10          County         53001    Adams County       >60% and <=75% of SMI
##     Age.Group Childcare.Subsidized Private.Care.Estimate
## 1      Infant                   18                    14
## 2      Infant                    0                     0
## 3      Infant                    0                     0
## 4      Infant                    0                     4
## 5   Preschool                  220                    23
## 6   Preschool                    0                     2
## 7   Preschool                    0                     0
## 8   Preschool                   NA                    11
## 9  School Age                  154                    58
## 10 School Age                    0                     5
##    Estimated.Children.Receiving.Childcare Estimate.of.Unserved Percent.Need.Met
## 1                                      32                  245             11.6
## 2                                       0                   14              0.0
## 3                                       0                    8              0.0
## 4                                       4                   38              9.5
## 5                                     243                  469             34.1
## 6                                       2                   60              3.2
## 7                                       0                   20              0.0
## 8                                      NA                  100               NA
## 9                                     212                 2178              8.9
## 10                                      5                  219              2.2

tail(childcare_raw)

##       Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 15627        ZIP Code         99403           99403       >75% and <=85% of SMI
## 15628        ZIP Code         99403           99403                 >85% of SMI
## 15629        ZIP Code         99403           99403                <=60% of SMI
## 15630        ZIP Code         99403           99403       >60% and <=75% of SMI
## 15631        ZIP Code         99403           99403       >75% and <=85% of SMI
## 15632        ZIP Code         99403           99403                 >85% of SMI
##        Age.Group Childcare.Subsidized Private.Care.Estimate
## 15627 School Age                    0                     0
## 15628 School Age                    0                     6
## 15629    Toddler                   60                     6
## 15630    Toddler                    0                     6
## 15631    Toddler                    0                     1
## 15632    Toddler                    0                    23
##       Estimated.Children.Receiving.Childcare Estimate.of.Unserved
## 15627                                      0                   99
## 15628                                      6                  223
## 15629                                     66                  199
## 15630                                      6                   51
## 15631                                      1                   15
## 15632                                     23                   67
##       Percent.Need.Met
## 15627              0.0
## 15628              2.6
## 15629             24.9
## 15630             10.5
## 15631              6.2
## 15632             25.6

Tidy and Transform Data

summary(childcare_raw) # check for missing values and column types

##  Geographic.Unit    Geographic.ID     Geographic.Name   
##  Length:15632       Min.   :  53001   Length:15632      
##  Class :character   1st Qu.:  98292   Class :character  
##  Mode  :character   Median :  98843   Mode  :character  
##                     Mean   :1666602                     
##                     3rd Qu.:5301680                     
##                     Max.   :5310170                     
##                                                         
##  State.Median.Income.Bracket  Age.Group         Childcare.Subsidized
##  Length:15632                Length:15632       Min.   :   0.00     
##  Class :character            Class :character   1st Qu.:   0.00     
##  Mode  :character            Mode  :character   Median :   0.00     
##                                                 Mean   :  20.75     
##                                                 3rd Qu.:   0.00     
##                                                 Max.   :7725.00     
##                                                 NA's   :2175        
##  Private.Care.Estimate Estimated.Children.Receiving.Childcare
##  Min.   :    0.00      Min.   :    0.00                      
##  1st Qu.:    0.00      1st Qu.:    0.00                      
##  Median :    1.00      Median :    1.00                      
##  Mean   :   30.48      Mean   :   45.97                      
##  3rd Qu.:   11.00      3rd Qu.:   15.00                      
##  Max.   :18774.00      Max.   :15645.00                      
##                        NA's   :2175                          
##  Estimate.of.Unserved Percent.Need.Met
##  Min.   :    0.0      Min.   :  0.00  
##  1st Qu.:    3.0      1st Qu.:  0.00  
##  Median :   24.0      Median :  5.90  
##  Mean   :  250.1      Mean   : 11.36  
##  3rd Qu.:  135.0      3rd Qu.: 16.00  
##  Max.   :86007.0      Max.   :100.00  
##                       NA's   :3922

# convert relevant columns to numeric 
numeric_cols <- c("Childcare.Subsidized", "Private.Care.Estimate", "Estimated.Children.Receiving.Childcare", "Estimate.of.Unserved", "Percent.Need.Met")

childcare_raw[numeric_cols] <- lapply(childcare_raw[numeric_cols], as.numeric)

data_cleaned <- childcare_raw %>% drop_na() # remove rows that contain a missing value (NA) in any column
str(data_cleaned)

## 'data.frame':    11710 obs. of  10 variables:
##  $ Geographic.Unit                       : chr  "County" "County" "County" "County" ...
##  $ Geographic.ID                         : int  53001 53001 53001 53001 53001 53001 53001 53001 53001 53001 ...
##  $ Geographic.Name                       : chr  "Adams County" "Adams County" "Adams County" "Adams County" ...
##  $ State.Median.Income.Bracket           : chr  "<=60% of SMI" ">60% and <=75% of SMI" ">75% and <=85% of SMI" ">85% of SMI" ...
##  $ Age.Group                             : chr  "Infant" "Infant" "Infant" "Infant" ...
##  $ Childcare.Subsidized                  : num  18 0 0 0 220 0 0 154 0 0 ...
##  $ Private.Care.Estimate                 : num  14 0 0 4 23 2 0 58 5 2 ...
##  $ Estimated.Children.Receiving.Childcare: num  32 0 0 4 243 2 0 212 5 2 ...
##  $ Estimate.of.Unserved                  : num  245 14 8 38 469 ...
##  $ Percent.Need.Met                      : num  11.6 0 0 9.5 34.1 3.2 0 8.9 2.2 2.3 ...
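The drop from 15632 to 11710 rows is consistent with the 3922 missing values that `summary()` reports for `Percent.Need.Met`, because `drop_na()` called without column arguments removes a row if any column is `NA`. A minimal base R sketch of how this could be checked before dropping rows is shown below; the data frame `toy` and its column names are hypothetical, not the childcare data.

```r
# Hypothetical data with missing values in two columns
toy <- data.frame(
  subsidized = c(18, NA, 0, 5),
  need_met   = c(11.6, NA, NA, 2.2)
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with no NA in any column, i.e. the rows tidyr::drop_na() would keep
kept <- toy[complete.cases(toy), ]

# Number of rows that would be removed
print(nrow(toy) - nrow(kept))
```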

# preview the cleaned data
head(data_cleaned, 10)

##    Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 1           County         53001    Adams County                <=60% of SMI
## 2           County         53001    Adams County       >60% and <=75% of SMI
## 3           County         53001    Adams County       >75% and <=85% of SMI
## 4           County         53001    Adams County                 >85% of SMI
## 5           County         53001    Adams County                <=60% of SMI
## 6           County         53001    Adams County       >60% and <=75% of SMI
## 7           County         53001    Adams County       >75% and <=85% of SMI
## 8           County         53001    Adams County                <=60% of SMI
## 9           County         53001    Adams County       >60% and <=75% of SMI
## 10          County         53001    Adams County       >75% and <=85% of SMI
##     Age.Group Childcare.Subsidized Private.Care.Estimate
## 1      Infant                   18                    14
## 2      Infant                    0                     0
## 3      Infant                    0                     0
## 4      Infant                    0                     4
## 5   Preschool                  220                    23
## 6   Preschool                    0                     2
## 7   Preschool                    0                     0
## 8  School Age                  154                    58
## 9  School Age                    0                     5
## 10 School Age                    0                     2
##    Estimated.Children.Receiving.Childcare Estimate.of.Unserved Percent.Need.Met
## 1                                      32                  245             11.6
## 2                                       0                   14              0.0
## 3                                       0                    8              0.0
## 4                                       4                   38              9.5
## 5                                     243                  469             34.1
## 6                                       2                   60              3.2
## 7                                       0                   20              0.0
## 8                                     212                 2178              8.9
## 9                                       5                  219              2.2
## 10                                      2                   86              2.3

tail(data_cleaned)

##       Geographic.Unit Geographic.ID Geographic.Name State.Median.Income.Bracket
## 11705        ZIP Code         99403           99403       >75% and <=85% of SMI
## 11706        ZIP Code         99403           99403                 >85% of SMI
## 11707        ZIP Code         99403           99403                <=60% of SMI
## 11708        ZIP Code         99403           99403       >60% and <=75% of SMI
## 11709        ZIP Code         99403           99403       >75% and <=85% of SMI
## 11710        ZIP Code         99403           99403                 >85% of SMI
##        Age.Group Childcare.Subsidized Private.Care.Estimate
## 11705 School Age                    0                     0
## 11706 School Age                    0                     6
## 11707    Toddler                   60                     6
## 11708    Toddler                    0                     6
## 11709    Toddler                    0                     1
## 11710    Toddler                    0                    23
##       Estimated.Children.Receiving.Childcare Estimate.of.Unserved
## 11705                                      0                   99
## 11706                                      6                  223
## 11707                                     66                  199
## 11708                                      6                   51
## 11709                                      1                   15
## 11710                                     23                   67
##       Percent.Need.Met
## 11705              0.0
## 11706              2.6
## 11707             24.9
## 11708             10.5
## 11709              6.2
## 11710             25.6

received_childcare <- data_cleaned %>% select(Geographic.ID, Age.Group, State.Median.Income.Bracket, Estimated.Children.Receiving.Childcare)

receive_care_cleaned <- received_childcare %>% filter(Estimated.Children.Receiving.Childcare != 0) # remove rows where the value is 0

head(receive_care_cleaned, 20)

##    Geographic.ID  Age.Group State.Median.Income.Bracket
## 1          53001     Infant                <=60% of SMI
## 2          53001     Infant                 >85% of SMI
## 3          53001  Preschool                <=60% of SMI
## 4          53001  Preschool       >60% and <=75% of SMI
## 5          53001 School Age                <=60% of SMI
## 6          53001 School Age       >60% and <=75% of SMI
## 7          53001 School Age       >75% and <=85% of SMI
## 8          53001 School Age                 >85% of SMI
## 9          53001    Toddler                <=60% of SMI
## 10         53001    Toddler       >75% and <=85% of SMI
## 11         53003     Infant                <=60% of SMI
## 12         53003     Infant       >60% and <=75% of SMI
## 13         53003     Infant       >75% and <=85% of SMI
## 14         53003     Infant                 >85% of SMI
## 15         53003  Preschool                <=60% of SMI
## 16         53003  Preschool       >60% and <=75% of SMI
## 17         53003  Preschool       >75% and <=85% of SMI
## 18         53003  Preschool                 >85% of SMI
## 19         53003 School Age                <=60% of SMI
## 20         53003 School Age                 >85% of SMI
##    Estimated.Children.Receiving.Childcare
## 1                                      32
## 2                                       4
## 3                                     243
## 4                                       2
## 5                                     212
## 6                                       5
## 7                                       2
## 8                                      30
## 9                                     103
## 10                                      1
## 11                                     31
## 12                                      3
## 13                                      1
## 14                                     16
## 15                                    132
## 16                                     12
## 17                                      6
## 18                                     63
## 19                                     79
## 20                                      6

percent_met <- data_cleaned %>% select(Geographic.ID, Age.Group, State.Median.Income.Bracket, Percent.Need.Met)

percent_met_cleaned <- percent_met %>% filter(Percent.Need.Met != 0) 

head(percent_met_cleaned, 20)

##    Geographic.ID  Age.Group State.Median.Income.Bracket Percent.Need.Met
## 1          53001     Infant                <=60% of SMI             11.6
## 2          53001     Infant                 >85% of SMI              9.5
## 3          53001  Preschool                <=60% of SMI             34.1
## 4          53001  Preschool       >60% and <=75% of SMI              3.2
## 5          53001 School Age                <=60% of SMI              8.9
## 6          53001 School Age       >60% and <=75% of SMI              2.2
## 7          53001 School Age       >75% and <=85% of SMI              2.3
## 8          53001 School Age                 >85% of SMI              8.5
## 9          53001    Toddler                <=60% of SMI             18.8
## 10         53001    Toddler       >75% and <=85% of SMI              7.1
## 11         53003     Infant                <=60% of SMI             22.3
## 12         53003     Infant       >60% and <=75% of SMI             12.5
## 13         53003     Infant       >75% and <=85% of SMI              4.3
## 14         53003     Infant                 >85% of SMI             34.8
## 15         53003  Preschool                <=60% of SMI             37.0
## 16         53003  Preschool       >60% and <=75% of SMI             16.0
## 17         53003  Preschool       >75% and <=85% of SMI             13.6
## 18         53003  Preschool                 >85% of SMI             48.5
## 19         53003 School Age                <=60% of SMI              7.2
## 20         53003 School Age                 >85% of SMI              2.2

Analyze Data

# summarize the average percentage of childcare needs met by state median income bracket.
percent_met_analysis <- percent_met_cleaned %>%
  group_by(State.Median.Income.Bracket) %>%
  summarise(Average_Need_Met = mean(Percent.Need.Met))

ggplot(percent_met_analysis, aes(x = reorder(State.Median.Income.Bracket, Average_Need_Met), y = Average_Need_Met)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Percentage of Childcare Needs Met by Income Bracket", x = "Income Bracket", y = "Percent Need Met") +
  coord_flip() +
  theme_minimal()

Conclusion for the first dataset

Families earning less than or equal to 60% of the State Median Income (SMI) have the highest percentage of childcare needs met. The lowest-income families may benefit the most from government subsidies or other support programs aimed at childcare needs. Families earning between 60% and 85% of SMI had the lowest percentage of needs met, while families earning more than 85% of SMI had the second-highest percentage.
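One caveat is that the averages above are unweighted means of `Percent.Need.Met` taken after rows equal to 0 were filtered out, so small areas count as much as large ones and areas with no need met are excluded. In the rows printed earlier, `Percent.Need.Met` matches `Estimated.Children.Receiving.Childcare / (Estimated.Children.Receiving.Childcare + Estimate.of.Unserved)`, for example 32 / (32 + 245) ≈ 11.6%. Assuming that relationship holds throughout, a pooled percentage per bracket could be sketched as follows. The data frame `toy`, its column names, and its numbers are hypothetical.

```r
library(dplyr)

# Hypothetical rows (made-up numbers, two income brackets)
toy <- data.frame(
  bracket   = c("<=60% of SMI", "<=60% of SMI", ">85% of SMI", ">85% of SMI"),
  receiving = c(30, 0, 4, 6),
  unserved  = c(70, 100, 36, 54)
)

pooled <- toy %>%
  group_by(bracket) %>%
  summarise(
    # Pooled share: total children served over total children needing care
    pooled_need_met = 100 * sum(receiving) / sum(receiving + unserved),
    # Mean of row-level percentages after dropping zero rows,
    # which mirrors the approach used in the analysis above
    mean_nonzero = mean((100 * receiving / (receiving + unserved))[receiving > 0]),
    .groups = "drop"
  )

print(pooled)
```

In this toy example the two measures differ for the first bracket because one of its rows has no children served. Whether the ranking of brackets in the real data would change is not something this sketch can tell.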

Load Dataset #2

hiv_raw <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Project-2/refs/heads/main/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity_20241020.csv")
head(hiv_raw, 10)

##    YEAR Borough     Neighborhood..U.H.F.    SEX  RACE.ETHNICITY
## 1  2010                       Greenpoint   Male           Black
## 2  2011           Stapleton - St. George Female Native American
## 3  2010                 Southeast Queens   Male             All
## 4  2012                   Upper Westside Female         Unknown
## 5  2013                      Willowbrook   Male         Unknown
## 6  2013         East Flatbush - Flatbush   Male           Black
## 7  2013         East Flatbush - Flatbush Female Native American
## 8  2013                 Southwest Queens Female         Unknown
## 9  2012             Fordham - Bronx Park   Male         Unknown
## 10 2010             Flushing - Clearview    All             All
##    TOTAL.NUMBER.OF.HIV.DIAGNOSES HIV.DIAGNOSES.PER.100.000.POPULATION
## 1                              6                                330.4
## 2                              0                                    0
## 3                             23                                 25.4
## 4                              0                                    0
## 5                              0                                    0
## 6                             54                                 56.5
## 7                              0                                    0
## 8                              0                                    0
## 9                              0                                    0
## 10                            14                                  5.4
##    TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
## 1                                              0
## 2                                              0
## 3                                              5
## 4                                              0
## 5                                              0
## 6                                              8
## 7                                              0
## 8                                              0
## 9                                              0
## 10                                             5
##    PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
## 1                                                                    0
## 2                                                                    0
## 3                                                                 21.7
## 4                                                                    0
## 5                                                                    0
## 6                                                                 14.8
## 7                                                                    0
## 8                                                                    0
## 9                                                                    0
## 10                                                                35.7
##    TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
## 1                               5                                 275.3
## 2                               0                                     0
## 3                              14                                  15.4
## 4                               0                                     0
## 5                               0                                     0
## 6                              33                                  34.5
## 7                               0                                     0
## 8                               0                                     0
## 9                               0                                     0
## 10                             12                                   4.6

Tidy and Transform Data

summary(hiv_raw)

##       YEAR        Borough          Neighborhood..U.H.F.     SEX           
##  Min.   :2010   Length:8976        Length:8976          Length:8976       
##  1st Qu.:2013   Class :character   Class :character     Class :character  
##  Median :2017   Mode  :character   Mode  :character     Mode  :character  
##  Mean   :2016                                                             
##  3rd Qu.:2020                                                             
##  Max.   :2021                                                             
##  RACE.ETHNICITY     TOTAL.NUMBER.OF.HIV.DIAGNOSES
##  Length:8976        Length:8976                  
##  Class :character   Class :character             
##  Mode  :character   Mode  :character             
##                                                  
##                                                  
##                                                  
##  HIV.DIAGNOSES.PER.100.000.POPULATION
##  Length:8976                         
##  Class :character                    
##  Mode  :character                    
##                                      
##                                      
##                                      
##  TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES
##  Length:8976                                  
##  Class :character                             
##  Mode  :character                             
##                                               
##                                               
##                                               
##  PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES
##  Length:8976                                                        
##  Class :character                                                   
##  Mode  :character                                                   
##                                                                     
##                                                                     
##                                                                     
##  TOTAL.NUMBER.OF.AIDS.DIAGNOSES AIDS.DIAGNOSES.PER.100.000.POPULATION
##  Length:8976                    Length:8976                          
##  Class :character               Class :character                     
##  Mode  :character               Mode  :character                     
##                                                                      
##                                                                      
##

numeric_cols <- c("TOTAL.NUMBER.OF.HIV.DIAGNOSES", "HIV.DIAGNOSES.PER.100.000.POPULATION", "TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES", "PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES", "TOTAL.NUMBER.OF.AIDS.DIAGNOSES", "AIDS.DIAGNOSES.PER.100.000.POPULATION")
  
hiv_raw[numeric_cols] <- lapply(hiv_raw[numeric_cols], as.numeric)

## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
## Warning in lapply(hiv_raw[numeric_cols], as.numeric): NAs introduced by
## coercion
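The warnings indicate that some entries in these columns cannot be parsed as numbers. Which entries those are is not shown in the output, so the following is only a sketch of how they could be inspected before `drop_na()` discards the affected rows. The vector `x` and its contents (a thousands separator, a literal "NA" string, and a placeholder symbol) are hypothetical examples, not values confirmed to be in the HIV dataset.

```r
# Hypothetical character column with values that as.numeric() cannot parse
x <- c("6", "1,234", "NA", "25.4", "*")

# Convert quietly, then list the original strings that became NA
converted <- suppressWarnings(as.numeric(x))
bad_values <- unique(x[is.na(converted)])
print(bad_values)

# If thousands separators turn out to be a cause, strip them before converting
converted_clean <- suppressWarnings(as.numeric(gsub(",", "", x, fixed = TRUE)))
print(converted_clean)
```

Looking at `bad_values` first makes it possible to decide whether a row is truly missing or just formatted in a way that can be repaired.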

hiv_cleaned <- hiv_raw %>% drop_na() # remove rows that contain a missing value (NA) in any column
str(hiv_cleaned)

## 'data.frame':    7062 obs. of  11 variables:
##  $ YEAR                                                               : int  2010 2011 2010 2012 2013 2013 2013 2013 2012 2010 ...
##  $ Borough                                                            : chr  "" "" "" "" ...
##  $ Neighborhood..U.H.F.                                               : chr  "Greenpoint" "Stapleton - St. George" "Southeast Queens" "Upper Westside" ...
##  $ SEX                                                                : chr  "Male" "Female" "Male" "Female" ...
##  $ RACE.ETHNICITY                                                     : chr  "Black" "Native American" "All" "Unknown" ...
##  $ TOTAL.NUMBER.OF.HIV.DIAGNOSES                                      : num  6 0 23 0 0 54 0 0 0 14 ...
##  $ HIV.DIAGNOSES.PER.100.000.POPULATION                               : num  330.4 0 25.4 0 0 ...
##  $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES                      : num  0 0 5 0 0 8 0 0 0 5 ...
##  $ PROPORTION.OF.CONCURRENT.HIV.AIDS.DIAGNOSES.AMONG.ALL.HIV.DIAGNOSES: num  0 0 21.7 0 0 14.8 0 0 0 35.7 ...
##  $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES                                     : num  5 0 14 0 0 33 0 0 0 12 ...
##  $ AIDS.DIAGNOSES.PER.100.000.POPULATION                              : num  275.3 0 15.4 0 0 ...

# Group by year, sex, and race/ethnicity for summary
hiv_trans <- hiv_cleaned %>% 
  group_by(YEAR, SEX, RACE.ETHNICITY) %>%
  summarise(
    Total_HIV_Diagnoses = sum(TOTAL.NUMBER.OF.HIV.DIAGNOSES),
    .groups = 'drop'
  )

head(hiv_trans, 20)

## # A tibble: 20 × 4
##     YEAR SEX    RACE.ETHNICITY         Total_HIV_Diagnoses
##    <int> <chr>  <chr>                                <dbl>
##  1  2010 All    All                                   6391
##  2  2010 Female All                                    708
##  3  2010 Female Asian/Pacific Islander                  10
##  4  2010 Female Black                                  467
##  5  2010 Female Hispanic                               195
##  6  2010 Female Multiracial                              0
##  7  2010 Female Native American                          0
##  8  2010 Female Unknown                                  0
##  9  2010 Female White                                   35
## 10  2010 Male   All                                   2330
## 11  2010 Male   Asian/Pacific Islander                  74
## 12  2010 Male   Black                                  986
## 13  2010 Male   Hispanic                               762
## 14  2010 Male   Multiracial                              8
## 15  2010 Male   Native American                          1
## 16  2010 Male   Unknown                                  0
## 17  2010 Male   White                                  498
## 18  2011 All    All                                   6125
## 19  2011 Female All                                    669
## 20  2011 Female Asian/Pacific Islander                  10
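The summary includes rows where `SEX` or `RACE.ETHNICITY` is "All". These appear to be pre-aggregated totals: for 2010, the detailed Female categories sum to 707 against a Female "All" value of 708. This is an inference from the printed values, not something confirmed by the dataset documentation. If it holds, plotting or summing the "All" rows together with the detailed categories would double count. A sketch of one way to keep them apart, using a hypothetical data frame `toy` with made-up counts:

```r
library(dplyr)

# Hypothetical rows mixing detailed categories with "All" totals
toy <- data.frame(
  YEAR = c(2010, 2010, 2010, 2010, 2010, 2010),
  SEX  = c("All", "Female", "Female", "Male", "Male", "Male"),
  RACE.ETHNICITY = c("All", "All", "Black", "All", "Black", "White"),
  TOTAL.NUMBER.OF.HIV.DIAGNOSES = c(30, 10, 10, 20, 12, 8)
)

# One total per year and sex, built only from detailed race/ethnicity rows
by_sex <- toy %>%
  filter(SEX != "All", RACE.ETHNICITY != "All") %>%
  group_by(YEAR, SEX) %>%
  summarise(
    Total_HIV_Diagnoses = sum(TOTAL.NUMBER.OF.HIV.DIAGNOSES),
    .groups = "drop"
  )

print(by_sex)
```

A table like `by_sex` has exactly one value per year and sex, which gives a single clean line per group in a line chart.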

Data Visualization

# HIV diagnoses over time by sex
ggplot(hiv_trans, aes(x = YEAR, y = Total_HIV_Diagnoses, color = SEX)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, shape = 21, fill = "white") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Total HIV Diagnoses Over Time by Sex",
       x = "Year", y = "Total HIV Diagnoses") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "top")

# HIV diagnoses over time by race/ethnicity
ggplot(hiv_trans, aes(x = YEAR, y = Total_HIV_Diagnoses, color = RACE.ETHNICITY)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, shape = 21, fill = "white") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Total HIV Diagnoses Over Time by Race/Ethnicity",
       x = "Year", y = "Total HIV Diagnoses") +
  theme_minimal()

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_point()`).
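The palette warning arises because `Set1` provides at most 9 colors while `RACE.ETHNICITY` has more distinct values than that. The groups left without a color are the likely reason the 34 rows are dropped from the plot. One possible workaround, sketched here with base R's `hcl.colors()` (available in R 3.6 or later), is to generate exactly as many colors as there are groups and pass them to `scale_color_manual(values = ...)`. The vector `groups` is a hypothetical stand-in for the distinct values of the column.

```r
# Hypothetical group labels standing in for unique(hiv_trans$RACE.ETHNICITY)
groups <- paste("Group", 1:11)

# One color per group, named so each label maps to a fixed color
group_colors <- setNames(hcl.colors(length(groups), palette = "Dark 3"), groups)
print(group_colors)

# Possible use in the plot (requires ggplot2):
# ... + scale_color_manual(values = group_colors)
```

Merging duplicate or near-duplicate category labels before plotting, if any exist in the data, would be another way to stay within the 9 colors of `Set1`.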

Conclusion for the second dataset

HIV diagnoses have been consistently higher among males and within Black and Hispanic communities, though there has been a gradual decline over time. Disparities in diagnoses by race/ethnicity indicate a need for targeted interventions in these affected populations.

Load Dataset #3

population_raw <- read.csv("https://raw.githubusercontent.com/JaydeeJan/Project-2/refs/heads/main/world_population.csv")
head(population_raw)

##   Rank CCA3 Country.Territory          Capital Continent X2022.Population
## 1   36  AFG       Afghanistan            Kabul      Asia         41128771
## 2  138  ALB           Albania           Tirana    Europe          2842321
## 3   34  DZA           Algeria          Algiers    Africa         44903225
## 4  213  ASM    American Samoa        Pago Pago   Oceania            44273
## 5  203  AND           Andorra Andorra la Vella    Europe            79824
## 6   42  AGO            Angola           Luanda    Africa         35588987
##   X2020.Population X2015.Population X2010.Population X2000.Population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   X1990.Population X1980.Population X1970.Population Area..km..
## 1         10694796         12486631         10752971     652230
## 2          3295066          2941651          2324731      28748
## 3         25518074         18739378         13795915    2381741
## 4            47818            32886            27075        199
## 5            53569            35611            19860        468
## 6         11828638          8330047          6029700    1246700
##   Density..per.km.. Growth.Rate World.Population.Percentage
## 1           63.0587      1.0257                        0.52
## 2           98.8702      0.9957                        0.04
## 3           18.8531      1.0164                        0.56
## 4          222.4774      0.9831                        0.00
## 5          170.5641      1.0100                        0.00
## 6           28.5466      1.0315                        0.45

Tidy and Transform Data
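
The eight X<year>.Population columns are what make this table "wide": the year is stored in the column name rather than in a column of its own. A minimal sketch of reshaping them into one row per country and year is shown here. It works on the raw column names, uses a two-row stand-in with values copied from the head() output above, and the names `population_demo` and `population_demo_long` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for population_raw: two countries, three of the year columns
# (values copied from the head() output; not the full dataset)
population_demo <- data.frame(
  Country.Territory = c("Afghanistan", "Albania"),
  Continent         = c("Asia", "Europe"),
  X2022.Population  = c(41128771, 2842321),
  X2020.Population  = c(38972230, 2866849),
  X2015.Population  = c(33753499, 2882481)
)

# Move the year out of the column names and into a Year column
population_demo_long <- population_demo %>%
  pivot_longer(
    cols          = matches("^X[0-9]{4}\\.Population$"),
    names_to      = "Year",
    names_pattern = "X([0-9]{4})\\.Population",
    values_to     = "Population"
  ) %>%
  mutate(Year = as.integer(Year))

population_demo_long
```

Each country now contributes one row per year, which is the shape that grouping by year or plotting a trend line expects.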

# rename columns: 17 names, one per column, in the original column order
colnames(population_raw) <- c("Rank", "CCA3", "Country", "Capital", "Continent", "Population_2022", "Population_2020", "Population_2015", "Population_2010", "Population_2000", "Population_1990", "Population_1980", "Population_1970", "Area_km2", "Density_per_km2", "Growth_Rate", "World_Pop_Percentage")

population_cleaned <- population_raw

colnames(population_cleaned) <- make.names(colnames(population_cleaned), unique = TRUE)

str(population_cleaned)

## 'data.frame':    234 obs. of  17 variables:
##  $ Rank                : int  36 138 34 213 203 42 224 201 33 140 ...
##  $ CCA3                : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ Country             : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Capital             : chr  "Kabul" "Tirana" "Algiers" "Pago Pago" ...
##  $ Continent           : chr  "Asia" "Europe" "Africa" "Oceania" ...
##  $ Population_2022     : int  41128771 2842321 44903225 44273 79824 35588987 15857 93763 45510318 2780469 ...
##  $ Population_2020     : int  38972230 2866849 43451666 46189 77700 33428485 15585 92664 45036032 2805608 ...
##  $ Population_2015     : int  33753499 2882481 39543154 51368 71746 28127721 14525 89941 43257065 2878595 ...
##  $ Population_2010     : int  28189672 2913399 35856344 54849 71519 23364185 13172 85695 41100123 2946293 ...
##  $ Population_2000     : int  19542982 3182021 30774621 58230 66097 16394062 11047 75055 37070774 3168523 ...
##  $ Population_1990     : int  10694796 3295066 25518074 47818 53569 11828638 8316 63328 32637657 3556539 ...
##  $ Population_1980     : int  12486631 2941651 18739378 32886 35611 8330047 6560 64888 28024803 3135123 ...
##  $ Population_1970     : int  10752971 2324731 13795915 27075 19860 6029700 6283 64516 23842803 2534377 ...
##  $ Area_km2            : int  652230 28748 2381741 199 468 1246700 91 442 2780400 29743 ...
##  $ Density_per_km2     : num  63.1 98.9 18.9 222.5 170.6 ...
##  $ Growth_Rate         : num  1.026 0.996 1.016 0.983 1.01 ...
##  $ World_Pop_Percentage: num  0.52 0.04 0.56 0 0 0.45 0 0 0.57 0.03 ...

head(population_cleaned)

##   Rank CCA3        Country          Capital Continent Population_2022
## 1   36  AFG    Afghanistan            Kabul      Asia        41128771
## 2  138  ALB        Albania           Tirana    Europe         2842321
## 3   34  DZA        Algeria          Algiers    Africa        44903225
## 4  213  ASM American Samoa        Pago Pago   Oceania           44273
## 5  203  AND        Andorra Andorra la Vella    Europe           79824
## 6   42  AGO         Angola           Luanda    Africa        35588987
##   Population_2020 Population_2015 Population_2010 Population_2000
## 1        38972230        33753499        28189672        19542982
## 2         2866849         2882481         2913399         3182021
## 3        43451666        39543154        35856344        30774621
## 4           46189           51368           54849           58230
## 5           77700           71746           71519           66097
## 6        33428485        28127721        23364185        16394062
##   Population_1990 Population_1980 Population_1970 Area_km2 Density_per_km2
## 1        10694796        12486631        10752971   652230         63.0587
## 2         3295066         2941651         2324731    28748         98.8702
## 3        25518074        18739378        13795915  2381741         18.8531
## 4           47818           32886           27075      199        222.4774
## 5           53569           35611           19860      468        170.5641
## 6        11828638         8330047         6029700  1246700         28.5466
##   Growth_Rate World_Pop_Percentage
## 1      1.0257                 0.52
## 2      0.9957                 0.04
## 3      1.0164                 0.56
## 4      0.9831                 0.00
## 5      1.0100                 0.00
## 6      1.0315                 0.45

# select relevant columns and calculate population growth percentage since 2010
population_long <- population_cleaned %>%
  select(Country, Continent, Population_2010, Population_2022) %>%
  mutate(Population_Growth_Percent = ((Population_2022 - Population_2010) / Population_2010) * 100)

head(population_long, 10)

##                Country     Continent Population_2010 Population_2022
## 1          Afghanistan          Asia        28189672        41128771
## 2              Albania        Europe         2913399         2842321
## 3              Algeria        Africa        35856344        44903225
## 4       American Samoa       Oceania           54849           44273
## 5              Andorra        Europe           71519           79824
## 6               Angola        Africa        23364185        35588987
## 7             Anguilla North America           13172           15857
## 8  Antigua and Barbuda North America           85695           93763
## 9            Argentina South America        41100123        45510318
## 10             Armenia          Asia         2946293         2780469
##    Population_Growth_Percent
## 1                  45.900140
## 2                  -2.439693
## 3                  25.230908
## 4                 -19.282029
## 5                  11.612299
## 6                  52.322827
## 7                  20.384148
## 8                   9.414785
## 9                  10.730369
## 10                 -5.628225

# show top 10 countries by population growth percentage
top_10_growth <- population_long %>%
  arrange(desc(Population_Growth_Percent)) %>%
  head(10)

top_10_growth

##                     Country     Continent Population_2010 Population_2022
## 1                    Jordan          Asia         6931258        11285869
## 2                      Oman          Asia         2881914         4576298
## 3                     Niger        Africa        16647543        26207977
## 4                     Qatar          Asia         1713504         2695122
## 5                   Mayotte        Africa          211786          326101
## 6  Turks and Caicos Islands North America           29726           45703
## 7         Equatorial Guinea        Africa         1094524         1674908
## 8                    Angola        Africa        23364185        35588987
## 9                  DR Congo        Africa        66391257        99010212
## 10                     Chad        Africa        11894727        17723315
##    Population_Growth_Percent
## 1                   62.82569
## 2                   58.79370
## 3                   57.42850
## 4                   57.28717
## 5                   53.97666
## 6                   53.74756
## 7                   53.02616
## 8                   52.32283
## 9                   49.13140
## 10                  49.00144

Analyze Data

# horizontal bar chart of the ten fastest-growing countries, colored by continent
ggplot(top_10_growth, aes(x = reorder(Country, Population_Growth_Percent),
                          y = Population_Growth_Percent, fill = Continent)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 10 Countries by Population Growth Since 2010",
       x = "Country",
       y = "Population Growth Percentage",
       fill = "Continent") +
  theme_minimal()

Conclusion for the third dataset

Between 2010 and 2022, the fastest population growth was concentrated in Africa and Asia: six of the ten fastest-growing countries are in Africa and three are in Asia. Africa's lead is consistent with high fertility rates and a youthful population, while growth in Middle Eastern countries such as Jordan, Oman, and Qatar is more likely driven by migration into expanding economies.
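
The continent split behind this conclusion can be checked with a quick tally. This is a minimal sketch: the Country and Continent values are retyped from the top-10 table above rather than read from `top_10_growth`, and the names `top_10_check` and `continent_tally` are hypothetical.

```r
library(dplyr)

# Country/continent pairs retyped from the top-10 table (illustrative copy)
top_10_check <- data.frame(
  Country = c("Jordan", "Oman", "Niger", "Qatar", "Mayotte",
              "Turks and Caicos Islands", "Equatorial Guinea",
              "Angola", "DR Congo", "Chad"),
  Continent = c("Asia", "Asia", "Africa", "Asia", "Africa",
                "North America", "Africa", "Africa", "Africa", "Africa")
)

# Count how many of the ten fall in each continent, largest group first
continent_tally <- top_10_check %>%
  count(Continent, sort = TRUE)

continent_tally
```

On the real data, the same `count(Continent, sort = TRUE)` call could be applied directly to `top_10_growth`.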

Project 2

Jayden Jiang

2024-10-17
