Kim_Roy_Homework

packages <- c("tidyverse", "fst", "modelsummary", "viridis") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
lapply(packages, library, character.only = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: viridisLite

## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "fst"       "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"    
## [13] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "modelsummary" "fst"          "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "viridis"      "viridisLite"  "modelsummary" "fst"          "lubridate"   
##  [6] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [11] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [16] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [21] "base"
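The long list printed above is the return value of `lapply()`, which R prints automatically when the call is not assigned. One way to keep the knitted output shorter, assuming nothing later depends on that return value, is to wrap the call in `invisible()`. The sketch below uses two base packages (`stats`, `utils`) as stand-ins so it runs without installing anything; `demo_packages` and `attached` are illustrative names.

```r
# Stand-in package list; the homework itself loads tidyverse, fst, modelsummary, viridis.
demo_packages <- c("stats", "utils")

# invisible() keeps the side effect (attaching the packages)
# but suppresses the printed list of search-path entries.
invisible(lapply(demo_packages, library, character.only = TRUE))

# Confirm that every package is now attached to the search path.
attached <- paste0("package:", demo_packages) %in% search()
attached
```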

ess <- read_fst("All-ESS-Data.fst")

france_data <- read.fst("france_data.fst")

task 1

Provide code and answer.

Prompt: in the tutorial, we calculated the average trust in others for France and visualized it. Using instead the variable ‘Trust in Parliament’ (trstplt) and the country of Spain (country file provided on course website), visualize the average trust by survey year. You can truncate the y-axis if you wish. Provide appropriate titles and labels given the changes. What are your main takeaways based on the visual (e.g., signs of increase, decrease, or stall)?

# First, I load the Spain data and clean the variable 'trstplt'.
spain_data <- read.fst("spain_data.fst")

spain_data <- spain_data %>%
  mutate(
    trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt) # set values 77, 88, and 99 to NA
  )

# Second, I check the distribution of the variable of interest, trstplt.
table(spain_data$trstplt)
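The recode step above matters because ESS stores refusals and "don't know" answers as large numeric codes (77, 88, 99), which would otherwise be averaged in with real 0-10 answers. A minimal sketch on a made-up vector (`toy_trust` and its values are hypothetical, not ESS data) shows the effect:

```r
# Hypothetical responses on a 0-10 scale, with 77/88/99 as missing-value codes.
toy_trust <- c(3, 5, 77, 0, 88, 10, 99, 4)

# Same recode pattern as in the homework: missing codes become NA.
toy_clean <- ifelse(toy_trust %in% c(77, 88, 99), NA, toy_trust)

# Without the recode, the missing codes inflate the mean far beyond the 0-10 range.
mean_raw   <- mean(toy_trust)
mean_clean <- mean(toy_clean, na.rm = TRUE)

c(raw = mean_raw, clean = mean_clean)
```

Running `table(toy_clean, useNA = "ifany")` afterwards is a quick way to confirm that only values from 0 to 10 and `NA` remain.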

# This code maps each ESS round (1-10) to its survey year, which sets up the x-axis for the visualization.
spain_data$year <- NA
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
for(i in 1:10){
  spain_data$year[spain_data$essround == i] <- replacements[i]
}

table(spain_data$year)

## 
## 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020 
## 1729 1663 1876 2576 1885 1889 1925 1958 1668 2283
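The loop above works, but the same mapping can be written without a loop by indexing the lookup vector with the round number. The sketch below assumes `essround` only takes integer values from 1 to 10; the `essround` vector here is a hypothetical stand-in for `spain_data$essround`.

```r
# Hypothetical round numbers standing in for spain_data$essround.
essround <- c(1, 3, 10, 2, 10, 5)

# Survey years in round order, as in the loop above.
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)

# Indexing the lookup vector by round number maps every row in one step.
year <- replacements[essround]

# Rounds are two years apart starting in 2002, so a formula gives the same result.
year_formula <- 2000 + 2 * essround

year
```

The lookup-vector form is closer to the original loop and still works if the rounds stop being evenly spaced; the formula is shorter but relies on the two-year spacing.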

# This code generates a new data frame (trust_by_year) that contains the mean trust values for each year based on the "trstplt" column in the original spain_data data frame.
trust_by_year <- spain_data %>%
  group_by(year) %>%
  summarize(mean_trust = mean(trstplt, na.rm = TRUE))
trust_by_year

## # A tibble: 10 × 2
##     year mean_trust
##    <dbl>      <dbl>
##  1  2002       3.41
##  2  2004       3.66
##  3  2006       3.49
##  4  2008       3.32
##  5  2010       2.72
##  6  2012       1.91
##  7  2014       2.23
##  8  2016       2.40
##  9  2018       2.55
## 10  2020       1.94

# Finally, this code sets all the necessary characteristics for the visualization of the graph.
ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "red", size = 3) +
  labs(title = "Trust in Parliament in Spain (2002-2020)",
       x = "Survey Year",
       y = "Average Trust (0-10 scale)") +
  ylim(0, 10) +
  theme_minimal()

# This code is used to simplify the overall look of the graph.
ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
  geom_line(aes(group = 1), color = "blue", linewidth = 1, linetype = "longdash") +
#  geom_point(aes(color = mean_trust), size = 3) +  
#  scale_color_viridis(option = "D", end = 0.9, direction = -1) + 
  labs(title = "Trust in Parliament in Spain (2002-2020)", 
       x = "Survey Year", 
       y = "Average Trust (0-10 scale)") +
  ylim(0, 10) +  
  theme_minimal() +  
  theme(legend.position = "none")
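The prompt allows truncating the y-axis, which makes the changes between years easier to see than on the full 0-10 scale. The sketch below is one possible version, assuming ggplot2 3.4.0 or later is installed. It rebuilds `trust_by_year` from the means printed above so that it runs on its own, uses `linewidth` for line thickness, and zooms with `coord_cartesian()`, which changes the visible range without dropping any data points. The 1.5-4 range and the name `truncated_plot` are illustrative choices, not part of the original homework.

```r
library(ggplot2)

# Yearly means copied from the trust_by_year output above.
trust_by_year <- data.frame(
  year = seq(2002, 2020, by = 2),
  mean_trust = c(3.41, 3.66, 3.49, 3.32, 2.72, 1.91, 2.23, 2.40, 2.55, 1.94)
)

# The 1.5-4 y-range is an illustrative choice that covers every yearly mean.
truncated_plot <- ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "red", size = 3) +
  scale_x_continuous(breaks = trust_by_year$year) +  # label every survey year
  coord_cartesian(ylim = c(1.5, 4)) +                # zoom without removing data
  labs(title = "Trust in Parliament in Spain (2002-2020)",
       subtitle = "Y-axis truncated to 1.5-4",
       x = "Survey Year",
       y = "Average Trust (0-10 scale)") +
  theme_minimal()

truncated_plot
```

Stating the truncation in the subtitle helps readers avoid over-reading the size of the drop.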

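As the plot and the trust_by_year table show, average trust in parliament in Spain declined overall between 2002 and 2020: it peaked at 3.66 in 2004, fell to a low of 1.91 in 2012, recovered partially to 2.55 by 2018, and dropped again to 1.94 in 2020. This contrasts with the much steadier average trust in France over the same period.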
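The takeaway above can be checked against the yearly means with a few lines of base R. This sketch copies the values from the `trust_by_year` output, so it describes only those rounded means; the names `overall_change`, `peak_year`, `low_year`, and `round_changes` are illustrative.

```r
# Yearly means copied from the trust_by_year output above.
trust_by_year <- data.frame(
  year = seq(2002, 2020, by = 2),
  mean_trust = c(3.41, 3.66, 3.49, 3.32, 2.72, 1.91, 2.23, 2.40, 2.55, 1.94)
)

# Change between the first and last survey years.
overall_change <- trust_by_year$mean_trust[10] - trust_by_year$mean_trust[1]

# Years with the highest and lowest average trust.
peak_year <- trust_by_year$year[which.max(trust_by_year$mean_trust)]
low_year  <- trust_by_year$year[which.min(trust_by_year$mean_trust)]

# Round-to-round changes, to locate the sharpest drop.
round_changes <- diff(trust_by_year$mean_trust)
names(round_changes) <- paste(head(trust_by_year$year, -1),
                              tail(trust_by_year$year, -1), sep = "-")

overall_change
round_changes
```

The largest single decline falls between the 2010 and 2012 rounds, which is also where the series reaches its lowest point.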

task 2

Provide answer only.

Prompt and question: Based on the figure we produced above called task2_plot, tell us: what are your main takeaways regarding France relative to Italy and Norway? Make sure to be concrete and highlight at least two important comparative trends visualized in the graph.

Based on the ESS data, France follows the same overall trend as Italy and Norway: the proportion of the population saying ‘yes’ to feeling close to a party decreases across the years 1920-2020. Specifically, Norway’s decrease over the 100 years is more gradual than the decreases in Italy and France. Italy also shows a notable outlier during the 1930s, where the proportion feeling close to a party falls below the 25th percentile, which is important to consider when analyzing the data as a whole.

task 3

Provide code and answer.

Question: What is the marginal percentage of Italian men who feel close to a particular political party?

# This code pulls the data from the ESS Italy dataset
italy_data <- read.fst("italy_data.fst")

italy_data <- italy_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  
    ),
    clsprty = case_when(
      clsprty == 1 ~ "Yes",       
      clsprty == 2 ~ "No",    
      TRUE ~ NA_character_ 
    )    
  )

# This code provides the calculations needed for forming the data table. 
clsprty_percentages <- italy_data %>% 
  filter(!is.na(clsprty), !is.na(gndr)) %>%  
  group_by(gndr, clsprty) %>%  
  summarise(count = n(), .groups = 'drop') %>%  
  mutate(percentage = count / sum(count) * 100) 

clsprty_percentages

## # A tibble: 4 × 4
##   gndr   clsprty count percentage
##   <chr>  <chr>   <int>      <dbl>
## 1 Female No       3228       34.2
## 2 Female Yes      1686       17.9
## 3 Male   No       2593       27.5
## 4 Male   Yes      1936       20.5

The table shows that men who feel close to a particular political party make up 20.50196% (about 20.5%) of all Italian respondents with valid answers on both variables.
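The percentage above divides each cell count by the total across all four cells. Whether that is the intended reading of "marginal percentage" depends on the course's definition, so the sketch below is only a check on the arithmetic: it rebuilds the counts from the table and compares the share of all respondents with the share within each gender, using base R's `prop.table()`. The object names are illustrative.

```r
# Cell counts copied from the clsprty_percentages table above.
counts <- matrix(
  c(3228, 1686,
    2593, 1936),
  nrow = 2, byrow = TRUE,
  dimnames = list(gndr = c("Female", "Male"), clsprty = c("No", "Yes"))
)

# Share of ALL respondents in each cell (what the homework code computes).
joint_pct <- prop.table(counts) * 100

# Share WITHIN each gender: each row sums to 100.
within_gender_pct <- prop.table(counts, margin = 1) * 100

joint_pct["Male", "Yes"]          # male and close to a party, out of everyone
within_gender_pct["Male", "Yes"]  # close to a party, out of men only
```

In dplyr terms, the within-gender version corresponds to adding `group_by(gndr)` before the `mutate()` that computes `percentage`.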

task 4

Provide code and output only.

Prompt: In the tutorial, we calculated then visualized the percentage distribution for left vs. right by gender for France. Your task is to replicate the second version of the visualization but for the country of Sweden instead.

sweden_data <- read.fst("sweden_data.fst")

# The following code recodes gender and the left-right scale into labeled categories for the calculation.
sweden_data <- sweden_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  
    ),
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",       
      lrscale %in% 7:10 ~ "Right",     
      TRUE ~ NA_character_  
    )    
  )

# We will take the dataset for Sweden, filter out rows with missing values, and calculate the count and percentage for each group.
lrscale_percentages <- sweden_data %>%  
  filter(!is.na(lrscale), !is.na(gndr)) %>%  
  group_by(gndr, lrscale) %>%  
  summarise(count = n(), .groups = 'drop') %>%  
  mutate(percentage = count / sum(count) * 100)

lrscale_percentages

## # A tibble: 4 × 4
##   gndr   lrscale count percentage
##   <chr>  <chr>   <int>      <dbl>
## 1 Female Left     2296       23.0
## 2 Female Right    2530       25.3
## 3 Male   Left     2062       20.6
## 4 Male   Right    3107       31.1

# This chunk of code uses the ggplot2 package to visualize the table.
lrscale_plot_v2 <- ggplot(lrscale_percentages, 
            aes(x = percentage,  
                y = reorder(gndr, -percentage),  
                fill = gndr)) +  
  geom_col() +  
  coord_flip() +
  guides(fill = "none") +  
  facet_wrap(~ lrscale, nrow = 1) + 
  labs(x = "Percentage of Respondents",
       y = NULL,  # Remove Y-axis label
       title = "Political Orientation by Gender",  
       subtitle = "Comparing the percentage distribution of left vs. right for Sweden (2002-2020)") + 
  theme(plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12),  
        axis.title.y = element_blank(), 
        legend.position = "bottom")  


lrscale_plot_v2

task 5

Provide code and answer: In Hungary, what is the conditional probability of NOT feeling close to any particular party given that the person lives in a rural area?

# I started by pulling the Hungary dataset from ESS 
hungary_data <- read.fst("hungary_data.fst")

# This code recodes clsprty to 0/1 (0 = not close, 1 = close) and sets the missing-value codes of clsprty and year born (yrbrn) to NA.
hungary_data <- hungary_data %>%
  mutate(
    clsprty = ifelse(clsprty == 2, 0, ifelse(clsprty %in% c(7, 8, 9), NA, clsprty))
  ) %>%
  mutate(
    yrbrn = ifelse(yrbrn %in% c(7777, 8888, 9999), NA, yrbrn)
  )

# We will calculate the proportions by birth year accordingly.
clsprty_proportions <- hungary_data %>%
  filter(!is.na(yrbrn) & yrbrn >= 1920 & !is.na(clsprty)) %>%
  group_by(yrbrn, clsprty) %>%
  tally() %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# This chunk of code recodes domicil into urban and rural categories and drops rows with missing values.
hungary_data <- hungary_data %>%
  mutate(
    geo = recode(as.character(domicil), 
                 '1' = "Urban", 
                 '2' = "Urban",
                 '3' = "Rural", 
                 '4' = "Rural", 
                 '5' = "Rural",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_)
  ) %>%
  filter(!is.na(clsprty), !is.na(geo))

# The last chunk of code will provide us the table with the conditional probabilities. 
cond <- hungary_data %>%
  count(clsprty, geo) %>%
  group_by(geo) %>%
  mutate(prob = n / sum(n))

cond

## # A tibble: 4 × 4
## # Groups:   geo [2]
##   clsprty geo       n  prob
##     <dbl> <chr> <int> <dbl>
## 1       0 Rural  6275 0.554
## 2       0 Urban  2395 0.512
## 3       1 Rural  5055 0.446
## 4       1 Urban  2283 0.488

The table shows that the conditional probability of not feeling close to any particular party given that the person lives in a rural area is 0.5538394, or about 55.4% (6,275 of 11,330 rural respondents).
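The same number follows from the definition of conditional probability, P(A | B) = P(A and B) / P(B). The sketch below recomputes it from the four counts in the `cond` table as a cross-check; the variable names are illustrative.

```r
# Counts copied from the cond table above.
n_rural_not_close <- 6275
n_rural_close     <- 5055
n_urban_not_close <- 2395
n_urban_close     <- 2283

n_total <- n_rural_not_close + n_rural_close + n_urban_not_close + n_urban_close

# P(not close AND rural): joint probability over all respondents.
p_joint <- n_rural_not_close / n_total

# P(rural): marginal probability of living in a rural area.
p_rural <- (n_rural_not_close + n_rural_close) / n_total

# P(not close | rural) = P(not close AND rural) / P(rural).
p_not_close_given_rural <- p_joint / p_rural

p_not_close_given_rural
```

Because `n_total` cancels, this is the same as dividing 6275 by the rural total, which is what `group_by(geo)` followed by `n / sum(n)` does in the homework code.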

Kim_Roy_Homework_2

2024-01-26