DATASCI 306, Fall 2024, Final Group Project

Throughout this course, you’ve dedicated yourself to refining your analytical abilities using R programming language. These skills are highly coveted in today’s job market!

Now, for the semester project, you’ll apply your learning to craft a compelling Data Story that can enrich your portfolio and impress prospective employers. Collaborating with a team (up to 5 members of your choosing), you’ll construct a Data Story akin to the example provided here: https://ourworldindata.org/un-population-2024-revision

Data is already in the data folder. This data is downloaded from: https://population.un.org/wpp/Download/Standard/MostUsed/

You’ll conduct Exploratory Data Analysis (EDA) on the provided data. The provided article already includes 6 diagrams. Show either the line or the map option for these 6 charts. You may ignore the table view. I’m also interested in seeing how each team will expand upon the initial analysis and generate additional 12 insightful charts that includes US and any other region or country that the author did not show. For e.g., one question you may want to answer is; US population is expected to increase to 421 million by 2100. You may want to show how the fertility rate and migration may be contributing to this increase in population.

Deliverable

1. Requirement-1 (2 pt) Import the data given in the .xlxs file into two separate dataframes;

one dataframe to show data from the Estimates tab
one dataframe to show data from the Medium variant tab

Hint: Some of the steps you may take while importing include:

skip the first several comment lines in the spread sheet
Importing the data as text first and then converting the relevant columns to different datatypes in step 2 below.

#read file
library(readxl)
file <- "data/WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx"
#correct column types string
wgdic_xlsx_coltypes= c('numeric', 'text', 'text', 'text', 'numeric', 'text',
                       'text', 'numeric', 'text', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric', 'numeric', 'numeric',
                       'numeric', 'numeric', 'numeric')
estimates_df <- read_excel(file, sheet = "Estimates", skip = 16,
                           col_types = wgdic_xlsx_coltypes, na = c('...','')) 
                                  #found '...' and '' for NA values throughout 

# Read the data from the 'Medium variant' tab
medium_variant_df <- read_excel(file, sheet = "Medium variant", skip = 16,
                                col_types = wgdic_xlsx_coltypes, na = c('...',''))
                                                    #same for Second sheet

2. Requirement-2 (5 pt)

You should show at least 5 steps you adopt to clean and/or transform the data. Your cleaning should include:

Renaming column names to make it more readable; removing space, making it lowercase or completely giving a different short name; all are acceptable.
Removing rows that are irrelevant; look at rows that have Type value as ‘Label/Separator’; are those rows required?
Removing columns that are redundant; For e.g., variant column
Converting text values to numeric on the columns that need this transformation

You could also remove the countries/regions that you are not interested in exploring in this step and re-save a smaller file in the same data folder, with a different name so that working with it becomes easier going forward.

Explain your reasoning for each clean up step.

#First im going to fix the column names and select useful columns only
correct_names = c(
  'id', 'area', 'Location_id', 'ISO3', 'ISO2', 'SDMX', 'type', 'parent_id', 'year',
  'Jan_pop', 'July_pop', 'male_july_pop', 'female_july_pop', 'pop_dens_july', 'sex_ratio_july', 
  'med_age_july', 'nat_change', 'rate_nat_change', 'pop_change', 'pop_grow_rate', 'pop_doubl_time', 
  'births', 'teen_births', 'birth_rate', 'fertility_rate', 'reproduction_rate', 'mean_birthing_age', 
  'birth_sex_ratio', 'deaths', 'male_deaths', 'female_deaths', 'death_rate', 
  'newborn_life_expectancy', 'male_newborn_life_expectancy', 'female_newborn_life_expectancy', 
  'age15_life_expectancy', 'male_age15_life_expectancy', 'female_age15_life_expectancy', 
  'age65_life_expectancy', 'male_age65_life_expectancy', 'female_age65_life_expectancy', 
  'age80_life_expectancy', 'male_age80_life_expectancy', 'female_age80_life_expectancy', 
  'infant_deaths', 'infant_mortality_rate', 'live_births', 'under5_deaths', 'under5_mortality_rate', 
  'under40_mortality_rate', 'male_under40_mortality_rate', 'female_under40_mortality_rate', 
  'under60_mortality_rate', 'male_under60_mortality_rate', 'female_under60_mortality_rate', 
  'between15_50_mortality_rate', 'male_between15_50_mortality_rate', 'female_between15_50_mortality_rate', 
  'between15_60_mortality_rate', 'male_between15_60_mortality_rate', 'female_between15_60_mortality_rate', 
  'num_migrants', 'migration_rate'
)
#Get rid of a few more columns and add correct names
estimates_df <- estimates_df |>
  select(-Variant, -Notes)
colnames(estimates_df) <- correct_names
medium_variant_df<- medium_variant_df |>
  select(-Variant, -Notes)
colnames(medium_variant_df) <- correct_names

#I want to find the least common factors for each covariate and then manually 
#                                                     parse for weird one offs
estimates_df |>
  mutate(across(everything(), as.character)) |> 
  pivot_longer(everything(), names_to = "covariate", values_to = "value") |>
  group_by(covariate, value) |>
  summarise(count = n(), .groups = "drop") |>
  arrange(covariate, count) |>
  group_by(covariate) |>
  slice_head(n = 5) |>
  ungroup()

## # A tibble: 315 × 3
##    covariate value count
##    <chr>     <chr> <int>
##  1 ISO2      AD       74
##  2 ISO2      AE       74
##  3 ISO2      AF       74
##  4 ISO2      AG       74
##  5 ISO2      AI       74
##  6 ISO3      ABW      74
##  7 ISO3      AFG      74
##  8 ISO3      AGO      74
##  9 ISO3      AIA      74
## 10 ISO3      ALB      74
## # ℹ 305 more rows

medium_variant_df |>
  mutate(across(everything(), as.character)) |> 
  pivot_longer(everything(), names_to = "covariate", values_to = "value") |>
  group_by(covariate, value) |>
  summarise(count = n(), .groups = "drop") |>
  arrange(covariate, count) |>
  group_by(covariate) |>
  slice_head(n = 5) |>
  ungroup()

## # A tibble: 315 × 3
##    covariate value count
##    <chr>     <chr> <int>
##  1 ISO2      AD       77
##  2 ISO2      AE       77
##  3 ISO2      AF       77
##  4 ISO2      AG       77
##  5 ISO2      AI       77
##  6 ISO3      ABW      77
##  7 ISO3      AFG      77
##  8 ISO3      AGO      77
##  9 ISO3      AIA      77
## 10 ISO3      ALB      77
## # ℹ 305 more rows

#Identify Issues illuminated:
  #"Region, subregion, country or area *"
  #"Type: "Label/Separator"
  #"Year: NA"

#Here i will deal with the above issues
estimates_df <- estimates_df |>
  filter(type != "Label/Separator" | !is.na(year))

medium_variant_df <- medium_variant_df |>
  filter(type != "Label/Separator" | !is.na(year))

#I want to check the amount of NA values for each covarriate and deal with 
#                                                               them as needed

#Check how many NAs in each column:
estimates_df |> summarise(across(everything(), ~ sum(is.na(.)),
                                 .names = "NA_count_{col}"))

## # A tibble: 1 × 63
##   NA_count_id NA_count_area NA_count_Location_id NA_count_ISO3 NA_count_ISO2
##         <int>         <int>                <int>         <int>         <int>
## 1           0             0                    0          4440          4440
## # ℹ 58 more variables: NA_count_SDMX <int>, NA_count_type <int>,
## #   NA_count_parent_id <int>, NA_count_year <int>, NA_count_Jan_pop <int>,
## #   NA_count_July_pop <int>, NA_count_male_july_pop <int>,
## #   NA_count_female_july_pop <int>, NA_count_pop_dens_july <int>,
## #   NA_count_sex_ratio_july <int>, NA_count_med_age_july <int>,
## #   NA_count_nat_change <int>, NA_count_rate_nat_change <int>,
## #   NA_count_pop_change <int>, NA_count_pop_grow_rate <int>, …

#Filter out regions you dont need
estimates_df<- estimates_df |> filter(area != 'Oceania')

medium_variant_df<- medium_variant_df |> filter(area != 'Oceania')
#Filter out types you dont need
estimates_df<- estimates_df |> filter(type=='World' | type == 'Region')

medium_variant_df<- medium_variant_df |> filter(type=='World' | type == 'Region')

estimates_df |> group_by(type)|> summarise(n())

## # A tibble: 2 × 2
##   type   `n()`
##   <chr>  <int>
## 1 Region   370
## 2 World     74

#Write out to xlxs file
library(openxlsx)
write.xlsx(estimates_df, file = "data/estimates.xlsx")
write.xlsx(medium_variant_df, file= "data/variant.xlsx")
#both are about 190KB now!

3. Requirement-3 (3 pt) Replicate the 6 diagrams shown in the article. Show only the ‘2024’ projection values where ever you have both ‘2022’ and ‘2024’ displayed. Show only the diagrams that are shown on the webpage with default options.

#what to do:
  # use variant data
  # facet wrap based on the area
  # first check to see if there is a ggplot setting that allows for billions 
#     to be represented as 1.5B and millions to be represented as 760M...if not
    # change population numbers to billions for world, africa, and asia
    # change population to millions for Europe, NA, LA
  # Style:
    # blue line
    # No vertical hashes, only horizontal
    # x scales by 20 with 2100 as last and 2024 as first (2040 as second)
    # facet_wrap for 6 graphs
    # Show no points, just a smooth line


#Transform data by selecting needed covariates and combining pop data into 1 col
var_graph_df <- medium_variant_df |>
  select(area, year, Jan_pop, July_pop) |>
  pivot_longer(cols = c(Jan_pop, July_pop), names_to = "pop_type", values_to = "new_pop") |> 
  mutate(
    year = if_else(pop_type == "July_pop", year + 0.5, year),
    scale_factor = case_when(
      area %in% c("World", "Africa", "Asia") ~ 1e6,  # to represent in billions
      area %in% c("Europe", "Northern America", "Latin America and the Caribbean") ~ 1e3, 
      TRUE ~ 1
    ),
    unit_label = case_when(
      area %in% c("World", "Africa", "Asia") ~ "(Billions)",
      area %in% c("Europe", "Northern America", "Latin America and the Caribbean") ~ "(Millions)",
      TRUE ~ ""
    ),
    new_pop = new_pop / scale_factor,
    #get graph order correct
    area_unit = factor(
      paste(area, unit_label),
      levels = c(
        "World (Billions)",
        "Africa (Billions)",
        "Asia (Billions)",
        "Europe (Millions)",
        "Northern America (Millions)",
        "Latin America and the Caribbean (Millions)"
      )  # Explicitly order facets
    )
  )

# Create the plot

ggplot(var_graph_df, aes(x = year, y = new_pop)) +
  geom_line(color = "blue", size = 1) +
  geom_smooth(se = FALSE, method = "loess") + #suggested by debugger
  facet_wrap(~area_unit, scales = "free_y", ncol = 3, labeller = label_wrap_gen(width = 20)) +
  scale_y_continuous(labels = scales::label_number()) +
  scale_x_continuous(
    breaks = c(2024, 2040, 2060, 2080, 2100),  # Specify x-axis breaks
    limits = c(2024, 2100)                    # Set x-axis limits
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(color = "gray"),
    strip.text = element_text(size = 12, face = "bold"),
    strip.text.x = element_text(margin = margin(b = 10)),
    plot.margin = margin(20, 20, 20, 20)
  ) +
  labs(
    title = "Population Trends by Region",
    x = "Year",
    y = "Population"
  )

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 1 row containing missing values (`geom_line()`).

4. Requirement-4 (12 pt)

Select United States related data, and any other country or region(s) of your choosing to perform EDA. Chart at least 12 additional diagrams that may show relationships like correlations, frequencies, trend charts, between various variables with plots of at least 3 different types (line, heatmap, pie, etc.). Every plot should have a title and the x/y axis should have legible labels without any label overlaps for full credit.

Summarize your interpretations after each chart.

#pi charts of population proportions through years

#regrab data
estimates_df <- read_excel(file, sheet = "Estimates", skip = 16,
                           col_types = wgdic_xlsx_coltypes, na = c('...',''))
medium_variant_df <- read_excel(file, sheet = "Medium variant", skip = 16,
                              col_types = wgdic_xlsx_coltypes, na = c('...',''))
estimates_df <- estimates_df |>
  select(-Variant, -Notes)
colnames(estimates_df) <- correct_names
medium_variant_df<- medium_variant_df |>
  select(-Variant, -Notes)
colnames(medium_variant_df) <- correct_names
estimates_df <- estimates_df |>
  filter(type != "Label/Separator" | !is.na(year))
medium_variant_df <- medium_variant_df |>
  filter(type != "Label/Separator" | !is.na(year))
#get data section of data wanted for this graph
pop_prop_df <- medium_variant_df|> select(area, type, year, Jan_pop) |> 
  filter(type %in% c("Region","Subregion") | 
           (area %in% c('United States of America', 'Canada') )) 
#check the overlaps:
pop_prop_df |> group_by(type,area) |> distinct(type,area)

## # A tibble: 29 × 2
## # Groups:   type, area [29]
##    type      area           
##    <chr>     <chr>          
##  1 Region    Africa         
##  2 Subregion Eastern Africa 
##  3 Subregion Middle Africa  
##  4 Subregion Northern Africa
##  5 Subregion Southern Africa
##  6 Subregion Western Africa 
##  7 Region    Asia           
##  8 Subregion Central Asia   
##  9 Subregion Eastern Asia   
## 10 Subregion Southern Asia  
## # ℹ 19 more rows

#found that I also want US and canada if i want any subregion for northern america
#added line to data retrieval line and here ill classify them as subregions

#classify us and canada as subregions
pop_prop_df <- pop_prop_df |> mutate(type = ifelse(
                                    type== "Country/Area", "Subregion", type))

#create function to classify subregions as regions
sub_r <- c('Eastern Africa', 'Middle Africa', 'Northern Africa', 'Southern Africa',
           'Western Africa','Central Asia', 'Eastern Asia', 'Southern Asia',
           'South-Eastern Asia', 'Western Asia', 'Eastern Europe', 'Northern Europe',
           'Southern Europe', 'Western Europe', 'Caribbean', 'Central America',
           'South America', 'Northern America', 'United States of America', 'Canada',
           'Micronesia', 'Australia/New Zealand', 'Melanesia',  'Polynesia')
sub_2_reg <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,6,6,6,6)
reg <- c('Africa', 'Asia', 'Europe', 'Latin America and the Caribbean',
         'Northern America', 'Oceania')

what_region <- function(subregion){
  return(reg[sub_2_reg[match(subregion, sub_r)]])
}

#fix data for regions and subregions only using above classifications
sub_and_reg <-pop_prop_df |> filter(type == 'Subregion') |>
  mutate(Region = what_region(area), Subregion= area) |>
  select(year, Region, Subregion, Jan_pop)


#plot
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

region_data <- sub_and_reg |>
  group_by(year, Region) |>
  summarize(total_pop = sum(Jan_pop), .groups = "drop")

plot_ly() |>
  # Outer donut (SubRegion-level)
  add_pie(data = sub_and_reg,
          labels = ~Subregion,
          values = ~Jan_pop,
          frame = ~year,
          hole = 0.6,
          textinfo = 'label',
          textposition = 'inside',
          insidetextfont = list(color = '#FFFFFF'),
          marker = list(line = list(color = '#FFFFFF', width = 1)),
          direction = 'clockwise',
          sort = FALSE,
          text = ~paste(Region, Jan_pop),
          hoverinfo = 'text+percent') |>
  
  # Inner pie (Region-level)
  add_pie(data = region_data,
          labels = ~Region,
          values = ~total_pop,
          frame = ~year,
          textinfo = 'label',
          textposition = 'inside',
          direction = 'clockwise',
          sort = FALSE,
          name = "Region Data",
          marker = list(line = list(color = '#FFFFFF', width = 1)),
          # Domain places this pie inside the donut hole
          domain = list(x = c(0.2, 0.8), y = c(0.2, 0.8)),
          hoverinfo = 'text+percent') |>
  
  config( displaylogo = FALSE) |>
  
  layout(title = "Population Over Time by Region and Subregion",
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         autosize = FALSE,
         updatemenus = list(
           list(
             type = "buttons",
             showactive = FALSE,
             buttons = list(
               list(label = "Play",
                    method = "animate",
                    args = list(NULL, 
                                list(mode = "immediate",
                                     frame = list(duration = 500, redraw = FALSE),
                                     transition = list(duration = 0)))))
           )
         )) |>
  
  animation_opts(
    frame = 500,      # Duration per frame (ms)
    transition = 0,   # No transition between frames
    easing = "linear"
  )

# This graph helps us see the large increase in african population,
#                                     especially in western and eastern africa,
# and the fast population proportion decrease of eastern asia

#Slinding bar charts of pop_grow_rate for subregions (and us/canada)
#get data we want for this graph
pop_growth_df <- medium_variant_df |>
  select(area, type, year, pop_grow_rate) |>
  filter(type %in% c("Region", "Subregion") |
           (area %in% c("United States of America", "Canada"))) |>
  mutate(type = ifelse(type == "Country/Area", "Subregion", type))
pop_growth_data <- pop_growth_df |>
  filter(type == "Subregion") |>
  mutate(Region = what_region(area), Subregion = area) |>
  select(year, Region, Subregion, pop_grow_rate)
pop_growth_data <- pop_growth_data |>
  arrange(Region, Subregion)

# Update Subregion to group regions for x-axis
pop_growth_data$Subregion <- factor(pop_growth_data$Subregion,
                                    levels = unique(pop_growth_data$Subregion))

# Plot
plot_ly(data = pop_growth_data,
        x = ~Subregion,
        y = ~pop_grow_rate,
        color = ~Region,
        frame = ~year,
        type = "bar",
        text = ~paste(Subregion, "<br>Growth Rate:", round(pop_grow_rate, 2)),
        hoverinfo = "text") |>
  layout(
    title = "Population Growth Rate by Subregion (Grouped by Region)",
    xaxis = list(
      title = "Subregion",
      categoryorder = "array",  # Maintain the factor order of Subregion
      categoryarray = levels(pop_growth_data$Subregion),  # factorlevels as order
      tickangle = -45,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "Population Growth Rate",
      showgrid = TRUE,
      zeroline = TRUE
    ),
    updatemenus = list(
      list(
        type = "buttons",
        showactive = FALSE,
        buttons = list(
          list(label = "Play",
               method = "animate",
               args = list(NULL, 
                           list(mode = "immediate",
                                frame = list(duration = 500, redraw = FALSE),
                                transition = list(duration = 0)))),
          list(label = "Pause",
               method = "animate",
               args = list(NULL, 
                           list(mode = "immediate",
                                frame = list(duration = 0, redraw = FALSE),
                                transition = list(duration = 0))))
        )
      )
    )
  ) |>
  animation_opts(
    frame = 500,      # Duration per frame (ms)
    transition = 0,   # No transition between frames
    easing = "linear"
  ) |>
  config(displaylogo = FALSE)

#This may illuminate why Africa shot up in the pie chart: because it has the 
#highest growth rate right now but it is still subject to the overall trend of 
                                                    #declining growth rates.

#What I found interesting is that Oceania (and maybe Europe) seems to resist 
#this declining growth rate more than all other regions, only changing a less 
#                                   over the 75 years than other regions

# Makes me wonder about what is happening differently in Eastern Asia?

#get correct data
#Try examining reproduction_rate and fertility_rate throughout time for USA and
  #the two counties with most different changes from the US's changes
#Following code aims to identify countries with the biggest changes in these two rates 
fertility_reproduction_df <- medium_variant_df |>
  filter(type == "Country/Area") |>
  select(area, year, reproduction_rate, fertility_rate)
#find biggest changes 
change_country_df <- fertility_reproduction_df |>
  filter(year == 2024 | year == 2100) |>
  mutate(
    reproduction_rate = ifelse(year == 2024, -reproduction_rate, reproduction_rate),
    fertility_rate = ifelse(year == 2024, -fertility_rate, fertility_rate)
  ) |>
  group_by(area) |>
  summarize(
    difference_fertility_rate = sum(fertility_rate),
    difference_reproduction_rate = sum(reproduction_rate),
    .groups = "drop")

# Constants for US rates
us_dfr <- 0.025
us_drr <- 0.018

# Add distance columns and sort
change_country_df <- change_country_df |>
  mutate(
    difference_fertility_rate_dist = difference_fertility_rate - us_dfr,
    difference_reproduction_rate_dist = difference_reproduction_rate - us_drr) |>
  arrange(desc(difference_fertility_rate_dist+difference_reproduction_rate_dist))
print(change_country_df)

## # A tibble: 237 × 5
##    area     difference_fertility…¹ difference_reproduct…² difference_fertility…³
##    <chr>                     <dbl>                  <dbl>                  <dbl>
##  1 Republi…                  0.562                  0.273                  0.537
##  2 Saint B…                  0.499                  0.247                  0.474
##  3 China, …                  0.478                  0.231                  0.453
##  4 China, …                  0.47                   0.227                  0.445
##  5 China, …                  0.464                  0.228                  0.439
##  6 Puerto …                  0.414                  0.205                  0.389
##  7 Singapo…                  0.411                  0.2                    0.386
##  8 British…                  0.349                  0.173                  0.324
##  9 China                     0.332                  0.176                  0.307
## 10 Andorra                   0.324                  0.161                  0.299
## # ℹ 227 more rows
## # ℹ abbreviated names: ¹difference_fertility_rate,
## #   ²difference_reproduction_rate, ³difference_fertility_rate_dist
## # ℹ 1 more variable: difference_reproduction_rate_dist <dbl>

#we see that we want 'Republic of Korea' and 'Central African Republic' 
#                               and 'United States of America'
#Notice the amount of eastern Asian counties on page 1 (?)
  #Remember that that Asia was declining in pop and Africa was increasing
  #Why this might be initially counterintuitive but its becuase african countries
  # are starting with high reproduction rates and firtility rates and it goes to normal (<0)

#graph these three countires throughout the timeframe
all_data <- medium_variant_df |>
  filter(type == "Country/Area") |>
  select(area, year, reproduction_rate, fertility_rate)
all_data<- all_data |> mutate(Region= what_region(area))

# Define the highlight countries
highlight_countries <- c("Republic of Korea",
                         "Central African Republic",
                         "United States of America")

# Filter for just the highlighted countries
highlight_data <- all_data |>
  filter(area %in% highlight_countries)

# Plot Reproduction Rate
ggplot() +
  # All countries in gray at semi-transparency
  geom_line(data = all_data, aes(x = year, y = reproduction_rate, group = area),
            color = "gray", alpha = 0.5) +
  
  # Highlighted countries in color
  geom_line(data = highlight_data,
            aes(x = year, y = reproduction_rate, color = area), linewidth = 1) +
  geom_smooth(data = highlight_data,
              aes(x = year, y = reproduction_rate, color = area), se = FALSE,
              method = "loess") +
  
  # Format axes
  scale_y_continuous(labels = scales::label_number()) +
  scale_x_continuous(
    breaks = c(2024, 2040, 2060, 2080, 2100),
    limits = c(2024, 2100)
  ) +
  
  # Theme adjustments
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(color = "gray"),
    strip.text = element_text(size = 12, face = "bold"),
    strip.text.x = element_text(margin = margin(b = 10)),
    plot.margin = margin(20, 20, 20, 20)
  ) +
  
  # Labels
  labs(
    title = "Reproduction Rate by Year",
    x = "Year",
    y = "Reproduction Rate (surviving daughters per woman)"
  )

## `geom_smooth()` using formula = 'y ~ x'

ggplot()+
geom_line(data = all_data, aes(x = year, y = fertility_rate, group = area),
          color = "gray", alpha = 0.5) +
  
  # Highlighted countries in color
  geom_line(data = highlight_data, aes(x = year, y = fertility_rate,
                                       color = area), linewidth = 1) +
  geom_smooth(data = highlight_data,
              aes(x = year, y = fertility_rate,color = area), se = FALSE,
              method = "loess") +
  
  # Format axes
  scale_y_continuous(labels = scales::label_number()) +
  scale_x_continuous(
    breaks = c(2024, 2040, 2060, 2080, 2100),
    limits = c(2024, 2100)
  ) +
  
  # Theme adjustments
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(color = "gray"),
    strip.text = element_text(size = 12, face = "bold"),
    strip.text.x = element_text(margin = margin(b = 10)),
    plot.margin = margin(20, 20, 20, 20)
  ) +
  
  # Labels
  labs(
    title = "Fertility Rate by Year",
    x = "Year",
    y = "Fertility Rate (live births per woman)"
  )

## `geom_smooth()` using formula = 'y ~ x'

# It looks like these two variables are highly correlated and are asymptotically
#                       approaching the more stable value of the united states.

# The trends look logarithmic with time.

#This may show signs of multicollinearity (An ever-present theme  in this data)

#Heatmap that y is subregions, x is time in 20 year chunks, and the value is the 
#   gender breakdown (dark blue, blue, light blue, light pink, pink, dark pink)

#facet wrap on Region 
#get data i want
gender_heatmap_df<- medium_variant_df |>
  select(area, type, year, sex_ratio_july) |>
  filter(type %in% c("Region", "Subregion") |
           (area %in% c("United States of America", "Canada"))) |>
  mutate(type = ifelse(type == "Country/Area", "Subregion", type))
#add region column
gender_heatmap_sub <- gender_heatmap_df |> filter(type== "Subregion") |>
  mutate(Region = what_region(area), Subregion = area) |>
  select(year, Region, Subregion, sex_ratio_july)
#make year buckets
gender_heatmap_sub <- gender_heatmap_sub |>
  mutate(year_chunk = floor((year-.0001)/20)*20)
#Helper function for graphing 
#function to pick which value over 20 year chunk to pick to show 
chunk_gender <- function(Subr, yer) {
  target_year <- (floor((yer-.0001)/20)*20) + 19
  val <- gender_heatmap_sub |> 
    filter(year == target_year & Subregion == Subr) |> 
    pull(sex_ratio_july)
  return(toString(round(val)))
}
gender_heatmap_sub_with_text <- gender_heatmap_sub |>
  rowwise() |>
  mutate(label_text = chunk_gender(Subregion, year)) |>
  ungroup()

ggplot(gender_heatmap_sub_with_text, aes(x = factor(year_chunk),
                                         y = Subregion, fill = sex_ratio_july))+
  geom_tile() +
  geom_text(aes(label = label_text), color = "black", size = 7) +
  facet_wrap(~ Region, scales = "free_y") +
  scale_fill_gradientn(colors = c("steelblue4", "steelblue3", "steelblue1",
                                  "plum1", "plum3", "plum4")) +
  theme_minimal() +
  labs(
    title = "Sex Ratio Over Time by Subregion",
    x = "20-Year Time Chunks",
    y = "Subregion",
    fill = "Sex Ratio (July)",
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.text = element_text(face = "bold"),
    aspect.ratio = 1
  )

#we see a pretty normal shift between male an femeale 
#                                   (most rows going towards 100/100 over time).
  
#another interesting observation is that east asia is treding from women -> men
# whereas most of africa is trending men -> women. 

#this might be a reason for population proportion swings

#I want to look at birthing ages because i know that there is a trend in Korea 
# where younger women aren't dating because how society treats them

#plot with slider on year. x is different subregions. The y for 1 graph 
#                   is mean_birthing_age, and for the other is pop_dens_july
# Prepare the data for both mean_birthing_age and pop_dens_july
combined_data <- medium_variant_df |>
  select(area, type, year, mean_birthing_age, pop_dens_july) |>
  filter(type %in% c("Region", "Subregion") | 
           (area %in% c("United States of America", "Canada"))) |>
  mutate(type = ifelse(type == "Country/Area", "Subregion", type)) |>
  filter(type == "Subregion") |>
  mutate(Region = what_region(area), Subregion = area) |>
  arrange(Region, Subregion)

# Ensure Subregion order is fixed for plotting
combined_data$Subregion <- factor(combined_data$Subregion,
                                  levels = unique(combined_data$Subregion))

#plot
plot_ly(data = combined_data,
        x = ~Subregion,
        y = ~mean_birthing_age,
        frame = ~year,
        type = "bar",
        name = "Mean Birthing Age",
        text = ~paste(Subregion, "<br>Mean Birthing Age:", round(mean_birthing_age, 2)),
        hoverinfo = "text") |>
  add_trace(y = ~pop_dens_july,
            name = "Population Density",
            yaxis = "y2",
            text = ~paste(Subregion, "<br>Population Density:", round(pop_dens_july, 2)),
            hoverinfo = "text") |>
  layout(
    title = "Mean Birthing Age and Population Density by Subregion Over Time",
    xaxis = list(
      title = "Subregion",
      tickangle = -45,
      showgrid = FALSE
    ),
    yaxis = list(
      title = "Mean Birthing Age",
      range = c(25,35)
    ),
    yaxis2 = list(
      title = "Population Density",
      overlaying = "y",
      side = "right",
      range = c(0,410)
    ),
    barmode = "group",
    updatemenus = list(
      list(
        type = "buttons",
        showactive = FALSE,
        buttons = list(
          list(label = "Play",
               method = "animate",
               args = list(NULL, 
                           list(mode = "immediate",
                                frame = list(duration = 500, redraw = FALSE),
                                transition = list(duration = 0)))),
          list(label = "Pause",
               method = "animate",
               args = list(NULL, 
                           list(mode = "immediate",
                                frame = list(duration = 0, redraw = FALSE),
                                transition = list(duration = 0))))
        )
      )
    )
  ) |>
  animation_opts(
    frame = 500,
    transition = 0,
    easing = "linear"
  ) |>
  config(displaylogo = FALSE)

#These two seem to be not directly connected. The mean_birthing_age seems to 
#         raise (maybe more so while pop_denstiy is also growing) in all cases
#some subregions become less dense but is predicted to have a raising mean_birthing rate

#for Asia and Africa is looks like the mean_birthing_age lowering is a good sign
# of future population density growth.

5. Requirement-5 (2 pt) Having developed a strong understanding of your data, you’ll now create a machine learning (ML) model to predict a specific metric. This involves selecting the most relevant variables from your dataset.

The UN’s World Population Prospects provides a range of projected scenarios of population change. These rely on different assumptions in fertility, mortality and/or migration patterns to explore different demographic futures. Check this link for more info: https://population.un.org/wpp/DefinitionOfProjectionScenarios

You can choose to predict the same metric the UN provides (e.g., future population using fertility, mortality, and migration data). Compare your model’s predictions to the UN’s.

How significantly do your population projections diverge from those of the United Nations? Provide a comparison of the two. If you choose a different projection for which there is no UN data to compare with, then this comparison is not required.

#Im going to do my model a bit differnt here. I am going to do a forward AIC 
#                               to try to arrive at a good model
library(leaps)
lm_df <- estimates_df |> select(-ISO2, -ISO3, -type, -Location_id, -area,
                                - id, -parent_id)
lm_df|> summarise(across(everything(), ~ sum(is.na(.)), .names = "NA_count_{col}"))

## # A tibble: 1 × 56
##   NA_count_SDMX NA_count_year NA_count_Jan_pop NA_count_July_pop
##           <int>         <int>            <int>             <int>
## 1          1036             0                0                 0
## # ℹ 52 more variables: NA_count_male_july_pop <int>,
## #   NA_count_female_july_pop <int>, NA_count_pop_dens_july <int>,
## #   NA_count_sex_ratio_july <int>, NA_count_med_age_july <int>,
## #   NA_count_nat_change <int>, NA_count_rate_nat_change <int>,
## #   NA_count_pop_change <int>, NA_count_pop_grow_rate <int>,
## #   NA_count_pop_doubl_time <int>, NA_count_births <int>,
## #   NA_count_teen_births <int>, NA_count_birth_rate <int>, …

#looked into na values and I will probably just take out SDMX and then take 
                                      #out the NA tuples in pop_doubl_time
lm_df <- lm_df |> filter(!is.na(pop_doubl_time)) |> select(-SDMX)
#make population one column
lm_df <- lm_df |> pivot_longer(cols = c(Jan_pop, July_pop),
                               names_to = "pop_type", values_to = "Population")|>
  mutate(year = if_else(pop_type == "July_pop", year + 0.5, year) )|>
  select(-pop_type, -female_july_pop, -male_july_pop)
#Perform backwards step BIC on predicting  (commented out for time)
  # lm.empty = lm(Population~1, dat=lm_df)
  # lm.full = lm(Population~., dat = lm_df)
  # lm.forward.AIC = step(lm.empty, scope = list(lower = lm.empty, upper = lm.full), direction = "forward")
  # summary(lm.forward.AIC)$r.squared
  # length(lm.forward.AIC$coef)-1

FAIC_model_text <- 'Population ~ live_births + births + male_deaths + birth_sex_ratio + 
    nat_change + teen_births + under5_deaths + pop_change + under5_mortality_rate + 
    male_newborn_life_expectancy + age15_life_expectancy + mean_birthing_age + 
    death_rate + age65_life_expectancy + between15_60_mortality_rate + 
    male_under40_mortality_rate + male_age80_life_expectancy + 
    infant_mortality_rate + male_under60_mortality_rate + num_migrants + 
    male_between15_60_mortality_rate + male_between15_50_mortality_rate + 
    between15_50_mortality_rate + age80_life_expectancy + under60_mortality_rate + 
    under40_mortality_rate + female_age80_life_expectancy + year + 
    female_newborn_life_expectancy + female_between15_50_mortality_rate + 
    female_under60_mortality_rate + female_between15_60_mortality_rate + 
    sex_ratio_july + female_age15_life_expectancy + male_age15_life_expectancy + 
    male_age65_life_expectancy + female_age65_life_expectancy + 
    newborn_life_expectancy + pop_doubl_time + migration_rate + 
    reproduction_rate + rate_nat_change + female_under40_mortality_rate + 
    infant_deaths + med_age_july' 
FAIC_model <- lm(FAIC_model_text, dat = lm_df)
summary(FAIC_model)

## 
## Call:
## lm(formula = FAIC_model_text, data = lm_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -658048   -8946     777   10590  306146 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         6.520e+05  8.224e+04   7.929 2.28e-15 ***
## live_births                         1.969e+02  1.457e+01  13.517  < 2e-16 ***
## births                             -2.883e+02  1.440e+01 -20.025  < 2e-16 ***
## male_deaths                         3.831e+02  5.725e+00  66.920  < 2e-16 ***
## birth_sex_ratio                     4.436e+03  1.426e+02  31.112  < 2e-16 ***
## nat_change                          1.770e+04  1.948e+03   9.083  < 2e-16 ***
## teen_births                         2.017e+01  6.009e-01  33.558  < 2e-16 ***
## under5_deaths                      -9.172e+01  2.896e+00 -31.671  < 2e-16 ***
## pop_change                         -1.758e+04  1.949e+03  -9.023  < 2e-16 ***
## under5_mortality_rate               1.058e+03  5.763e+01  18.363  < 2e-16 ***
## male_newborn_life_expectancy        1.135e+05  2.259e+04   5.024 5.09e-07 ***
## age15_life_expectancy               1.839e+05  4.408e+04   4.173 3.01e-05 ***
## mean_birthing_age                  -2.791e+03  2.268e+02 -12.304  < 2e-16 ***
## death_rate                         -2.703e+03  2.681e+02 -10.082  < 2e-16 ***
## age65_life_expectancy              -3.641e+04  9.832e+03  -3.703 0.000213 ***
## between15_60_mortality_rate        -4.874e+03  1.811e+03  -2.691 0.007128 ** 
## male_under40_mortality_rate        -2.053e+03  4.071e+02  -5.043 4.62e-07 ***
## male_age80_life_expectancy         -3.873e+04  2.889e+03 -13.406  < 2e-16 ***
## infant_mortality_rate               3.187e+02  2.541e+01  12.541  < 2e-16 ***
## male_under60_mortality_rate        -4.539e+02  9.421e+02  -0.482 0.629941    
## num_migrants                        1.757e+04  1.949e+03   9.018  < 2e-16 ***
## male_between15_60_mortality_rate   -1.304e+03  9.317e+02  -1.400 0.161511    
## male_between15_50_mortality_rate    1.282e+03  3.241e+02   3.955 7.67e-05 ***
## between15_50_mortality_rate         1.835e+03  6.127e+02   2.994 0.002751 ** 
## age80_life_expectancy               4.995e+04  6.091e+03   8.201 2.45e-16 ***
## under60_mortality_rate              6.290e+03  1.810e+03   3.475 0.000512 ***
## under40_mortality_rate              5.418e+02  7.755e+02   0.699 0.484780    
## female_age80_life_expectancy       -1.896e+04  3.703e+03  -5.120 3.07e-07 ***
## year                               -3.753e+01  1.543e+01  -2.432 0.015027 *  
## female_newborn_life_expectancy      6.751e+04  2.119e+04   3.186 0.001446 ** 
## female_between15_50_mortality_rate -2.652e+03  3.222e+02  -8.231  < 2e-16 ***
## female_under60_mortality_rate      -5.506e+03  8.966e+02  -6.141 8.30e-10 ***
## female_between15_60_mortality_rate  4.968e+03  8.997e+02   5.522 3.38e-08 ***
## sex_ratio_july                     -2.299e+02  2.573e+01  -8.936  < 2e-16 ***
## female_age15_life_expectancy       -8.868e+04  2.152e+04  -4.120 3.79e-05 ***
## male_age15_life_expectancy         -1.260e+05  2.283e+04  -5.519 3.44e-08 ***
## male_age65_life_expectancy          3.150e+04  4.977e+03   6.329 2.50e-10 ***
## female_age65_life_expectancy        2.826e+04  5.664e+03   4.989 6.10e-07 ***
## newborn_life_expectancy            -1.714e+05  4.335e+04  -3.955 7.68e-05 ***
## pop_doubl_time                      3.884e+01  1.329e+01   2.923 0.003470 ** 
## migration_rate                      5.326e+01  2.086e+01   2.553 0.010672 *  
## reproduction_rate                   1.064e+04  1.578e+03   6.745 1.55e-11 ***
## rate_nat_change                    -6.370e+02  1.159e+02  -5.497 3.90e-08 ***
## female_under40_mortality_rate       9.781e+02  3.928e+02   2.490 0.012769 *  
## infant_deaths                      -2.490e+01  1.059e+01  -2.353 0.018646 *  
## med_age_july                        2.906e+02  1.276e+02   2.277 0.022786 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38730 on 35864 degrees of freedom
## Multiple R-squared:  0.9974, Adjusted R-squared:  0.9974 
## F-statistic: 3.071e+05 on 45 and 35864 DF,  p-value: < 2.2e-16

#check the model on the variant data
pred_df<- medium_variant_df|> select(-fertility_rate, -female_deaths, 
                                     -pop_grow_rate, -pop_dens_july, -ISO2, 
                                     -ISO3, -type, -Location_id, -area, - id, 
                                     -parent_id, -female_july_pop, 
                                     -male_july_pop, -SDMX)
pred_df |> summarise(across(everything(), ~ sum(is.na(.)), .names = "NA_count_{col}"))

## # A tibble: 1 × 49
##   NA_count_year NA_count_Jan_pop NA_count_July_pop NA_count_sex_ratio_july
##           <int>            <int>             <int>                   <int>
## 1             0                0                 0                       0
## # ℹ 45 more variables: NA_count_med_age_july <int>, NA_count_nat_change <int>,
## #   NA_count_rate_nat_change <int>, NA_count_pop_change <int>,
## #   NA_count_pop_doubl_time <int>, NA_count_births <int>,
## #   NA_count_teen_births <int>, NA_count_birth_rate <int>,
## #   NA_count_reproduction_rate <int>, NA_count_mean_birthing_age <int>,
## #   NA_count_birth_sex_ratio <int>, NA_count_deaths <int>,
## #   NA_count_male_deaths <int>, NA_count_death_rate <int>, …

pred_df <- pred_df |> filter(!is.na(pop_doubl_time))
pred_df <- pred_df |> 
  pivot_longer(cols = c(Jan_pop, July_pop),
               names_to = "pop_type", values_to = "Population") |>
  mutate(year = if_else(pop_type == "July_pop", year + 0.5, year) )|>
  select(-pop_type)
preds <-predict(FAIC_model, pred_df)

pred_comp_df <- pred_df |> mutate(predicted = preds)

# Now you can compare predicted to actual:
# For example, compute mean absolute error
mae <- mean(abs(pred_comp_df$predicted - pred_comp_df$Population))

# Or simply look at correlation
corr <- cor(pred_comp_df$predicted, pred_comp_df$Population)

# View the first few rows
head(pred_comp_df)

## # A tibble: 6 × 49
##    year sex_ratio_july med_age_july nat_change rate_nat_change pop_change
##   <dbl>          <dbl>        <dbl>      <dbl>           <dbl>      <dbl>
## 1 2024            101.         30.6     70016.            8.58     70017.
## 2 2024.           101.         30.6     70016.            8.58     70017.
## 3 2025            101.         30.9     69263.            8.41     69264.
## 4 2026.           101.         30.9     69263.            8.41     69264.
## 5 2026            101.         31.1     68866.            8.30     68866.
## 6 2026.           101.         31.1     68866.            8.30     68866.
## # ℹ 43 more variables: pop_doubl_time <dbl>, births <dbl>, teen_births <dbl>,
## #   birth_rate <dbl>, reproduction_rate <dbl>, mean_birthing_age <dbl>,
## #   birth_sex_ratio <dbl>, deaths <dbl>, male_deaths <dbl>, death_rate <dbl>,
## #   newborn_life_expectancy <dbl>, male_newborn_life_expectancy <dbl>,
## #   female_newborn_life_expectancy <dbl>, age15_life_expectancy <dbl>,
## #   male_age15_life_expectancy <dbl>, female_age15_life_expectancy <dbl>,
## #   age65_life_expectancy <dbl>, male_age65_life_expectancy <dbl>, …

# If desired, visualize predictions vs. actuals
library(ggplot2)
ggplot(pred_comp_df, aes(x = Population, y = predicted)) +
  geom_point() +
  geom_abline(color = "red") +
  labs(title = "Predicted vs Actual Population",
       x = "Actual Population",
       y = "Predicted Population")

6. Requirement-5 (1 pt)

Conclusion

Your analysis should conclude with a summary of key findings. I’m especially interested in any novel insights you uncover that go beyond the article’s original conclusions.

# What I found interesting in this data set was the way things were predicted to change. 
# Now this is a lot to do with how the UN predicts this data but most of the time 
# prediction models are used without users fully understanding the complexity of 
# the problems they are solving.
# 
# The change in population proportion was major in Africa and Asia and the USA was
# a good reference point for more stable data. We saw Africa was growing and Asia 
# was growing way slower.
# 
# We investigated the change in population growth for sub regions and found overall trends
# as well as some regions that are projected to defy some of these trends (like Oceania
# and Europe) and that eastern asia had influence on top of this overall trend. 
# 
# Looking into the fertility and reproduction rates showed us trends we expect to see 
# with some constant that we are approaching asymptotically.
# 
# The gender breakdown might be a important insight into what matter when populations
# are changing because we see the change in gender come from different sides for 
# Africa and Asia.
# 
# Finally looking and population density and mean birthing age we saw that the
# population growth (showed in the population density) usually come after some change
# in mean birthing age. This makes sense that if people are giving birth more,the 
# group that would be added onto this population would most only get younger.

7. Extra Credit (1 pt) Develop an interactive Shiny app to visualize your machine learning model’s projections. The app must include at least one interactive widget (e.g., dropdown, radio buttons, text input) allowing users to select a variable value (such as country/region) and view the corresponding projections.

Submission

You will upload the zip file containing finals.Rmd file and its PDF as a deliverable to Canvas. If you created a shiny app for predictions, you will add those files also to your zip file.
You will present your findings by creating a video of a maximum 15 minutes duration, explaining the code and the workings of your project; all team members should explain their part in the project to receive credit. You will share the URL of the video on Canvas for us to evaluate. An ideal way to create this video would be to start a Zoom meeting, start recording, and then every member share their screen and explain their contribution.

It is not necessary to prepare slides (if you do it doesn’t hurt) for the presentation. You may speak by showing the diagrams and/or code from your Posit project. Every team member should explain their part in the project along with the insights they derived by explaining the charts and summaries for full credit to each member.

Your project will be evaluated for clean code, meaningful/insightful EDA and predictions.

Note:

Each plot must be accompanied by a summary that clarifies the rationale behind its creation and what insights the plot unveils. Every diagram should possess standalone significance, revealing something more compelling than the other charts
After the deadline, instructors will select the top three outstanding analytics projects. The teams responsible for these exceptional analyses will have their video shared with the class

We will not accept submissions after the deadline; December 10th 4 pm

DATASCI 306, Fall 2024, Final Group Project

Trevor Kudlak