HW the 2nd

R Markdown

Setting up your environment

We always do this step first: list every package the analysis needs, install any that are missing, and load them.

# List of packages
packages <- c("tidyverse", "fst", "modelsummary", "viridis") # add any you need here

# Install packages if they aren't installed already
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the packages
invisible(lapply(packages, library, character.only = TRUE)) # invisible() keeps lapply() from printing the list of attached packages

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: viridisLite

Task 1

Provide code and answer.

Prompt: in the tutorial, we calculated the average trust in others for France and visualized it. Using instead the variable ‘Trust in Parliament’ (trstplt) and the country of Spain (country file provided on course website), visualize the average trust by survey year. You can truncate the y-axis if you wish. Provide appropriate titles and labels given the changes. What are your main takeaways based on the visual (e.g., signs of increase, decrease, or stall)?

Loading Data into R

Let’s load our data. We will work with Spain.

spain_data <- read.fst("spain_data.fst")

Reviewing & Adding

First, let’s calculate the average for a variable of interest and then visualize.

We will be working with a trust variable – trust in parliament, measured from 0-10, where 0 represents “No trust at all” and 10 is “Complete trust”.

Let’s check our work. One quick way to do so:

table(spain_data$trstplt) # if there are values that are not supposed to be there (e.g., 77, 88, 99 in this case), then we need to deal with it

## 
##    0    1    2    3    4    5    6    7    8    9   10   77   88   99 
## 5165 1830 2329 2441 2085 2890 1154  639  355   80   71   46  336   31

Before proceeding, we need to clean and transform our variable.

spain_data <- spain_data %>%
  mutate(
    trstplt = ifelse(trstplt %in% c(77, 88, 99), NA, trstplt), # set values 77, 88, and 99 to NA.
  )

table(spain_data$trstplt) # the special codes are now NA, so table() no longer shows them

## 
##    0    1    2    3    4    5    6    7    8    9   10 
## 5165 1830 2329 2441 2085 2890 1154  639  355   80   71
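To see why the special codes have to be recoded before averaging, here is a minimal base-R sketch on a made-up vector. The values in `toy` are invented for illustration and are not taken from the ESS file.

```r
# Hypothetical responses: five valid answers on the 0-10 scale plus two special codes
toy <- c(0, 2, 3, 5, 5, 77, 88)

# Averaging the raw values lets 77 and 88 inflate the mean
raw_mean <- mean(toy)

# Recode the special codes to NA, then skip the NAs when averaging
toy_clean <- ifelse(toy %in% c(77, 88, 99), NA, toy)
clean_mean <- mean(toy_clean, na.rm = TRUE)

raw_mean    # far above the top of the 0-10 scale
clean_mean  # 3
```

Without `na.rm = TRUE`, the second call to `mean()` would return NA, which is why the argument also appears in the yearly averages further down.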

Next, let’s create a ‘year’ variable from the essround variable. We will use it to visualize.

spain_data$year <- NA
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
for(i in 1:10){
  spain_data$year[spain_data$essround == i] <- replacements[i]
}

Let’s check that it worked:

table(spain_data$year)

## 
## 2002 2004 2006 2008 2010 2012 2014 2016 2018 2020 
## 1729 1663 1876 2576 1885 1889 1925 1958 1668 2283
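The loop above can also be written without iteration by indexing the lookup vector directly. A sketch, assuming `essround` only takes whole-number values from 1 to 10 as it does in this file; the short `essround` vector below is a hypothetical stand-in for `spain_data$essround`.

```r
# Same lookup vector as in the loop above
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)

# Hypothetical round numbers standing in for spain_data$essround
essround <- c(1, 3, 10, 6)

# Indexing the lookup vector by round number maps each round to its year
year <- replacements[essround]
year
```

With the real data this would read `spain_data$year <- replacements[spain_data$essround]`; a row with a missing round would stay NA.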

Now, we will calculate the average by year to then visualize:

trust_by_year <- spain_data %>%
  group_by(year) %>%
  summarize(mean_trust = mean(trstplt, na.rm = TRUE))
trust_by_year

## # A tibble: 10 × 2
##     year mean_trust
##    <dbl>      <dbl>
##  1  2002       3.41
##  2  2004       3.66
##  3  2006       3.49
##  4  2008       3.32
##  5  2010       2.72
##  6  2012       1.91
##  7  2014       2.23
##  8  2016       2.40
##  9  2018       2.55
## 10  2020       1.94

We can see from this table that the average DOES shift considerably from year to year: it ranges from a high of 3.66 in 2004 to a low of 1.91 in 2012.

Now it’s time to:

Visualize

ggplot(trust_by_year, aes(x = year, y = mean_trust)) +
  geom_line(color = "blue", linewidth = 1) +  # Line to show the trend (linewidth replaces the deprecated size for lines)
  geom_point(color = "red", size = 3) +  # Points to highlight each year's value
  labs(title = "Trust in Parliament in Spain (2002-2020)", 
       x = "Survey Year", 
       y = "Average Trust (0-10 scale)") +
  ylim(0, 10) +  # Setting the y-axis limits from 0 to 10
  theme_minimal()

Here we can see an overall decline in average trust in parliament in Spain between 2002 and 2020. Trust was relatively stable through 2008 (between 3.3 and 3.7), then fell sharply to a low of 1.91 in 2012. It recovered only partially, reaching 2.55 in 2018, and dropped again to 1.94 in 2020. Trust therefore ended the period well below where it started.
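One way to back up a reading of the plot is to compute the change between consecutive survey years. A small base-R sketch, using the rounded means printed in the table above, so the differences are approximate:

```r
# Rounded yearly means copied from the trust_by_year output
year <- seq(2002, 2020, by = 2)
mean_trust <- c(3.41, 3.66, 3.49, 3.32, 2.72, 1.91, 2.23, 2.40, 2.55, 1.94)

# diff() returns each value minus the previous one
change <- diff(mean_trust)
names(change) <- paste(head(year, -1), tail(year, -1), sep = "-")
round(change, 2)

# Period with the steepest drop
names(change)[which.min(change)]
```

Negative entries mark periods of falling trust; the steepest one here is 2010-2012.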

Task 2

Provide answer only.

Prompt and question: Based on the figure we produced above called task2_plot, tell us: what are your main takeaways regarding France relative to Italy and Norway? Make sure to be concrete and highlight at least two important comparative trends visualized in the graph.

The proportion of respondents saying yes to feeling close to a political party in France is higher than in Italy and lower than in Norway.

Moreover, Norway has the highest proportion saying yes in every cohort and Italy has the lowest, while France sits between the two in every cohort. The proportions in all three countries decline over time at roughly the same rate.

Task 3

Provide code and answer.

Question: What is the marginal percentage of Italian men who feel close to a particular political party?

Variable of interest: clsprty

Loading Data into R

Let’s load our data. We will work with Italy.

italy_data <- read.fst("italy_data.fst")

italy_data <- italy_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  # anything that is not 1 or 2 becomes NA
    ),
    clsprty = case_when(
      clsprty == 1 ~ "Yes",
      clsprty == 2 ~ "No",
      TRUE ~ NA_character_  # special codes become NA
    )
  )

Next, we calculate the percentages of the two answer categories (yes and no to feeling close to a political party) by gender (male and female).

clsprty_percentages <- italy_data %>%  
  filter(!is.na(clsprty), !is.na(gndr)) %>%  
  group_by(gndr, clsprty) %>%  
  summarise(count = n(), .groups = 'drop') %>%  
  mutate(percentage = count / sum(count) * 100)  

clsprty_percentages

## # A tibble: 4 × 4
##   gndr   clsprty count percentage
##   <chr>  <chr>   <int>      <dbl>
## 1 Female No       3228       34.2
## 2 Female Yes      1686       17.9
## 3 Male   No       2593       27.5
## 4 Male   Yes      1936       20.5

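Here we can see that the marginal percentage of Italian men who feel close to a particular political party is about 20.5% (1,936 of the 9,443 respondents with valid answers on both variables). Note that this percentage is calculated out of all respondents, not out of men only.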
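Because the denominator changes how a percentage should be read, it can help to compute the share in two ways. A base-R sketch using the counts printed above; the object names are illustrative and the matrix is typed in by hand.

```r
# Counts copied from the clsprty_percentages output
counts <- matrix(c(3228, 1686, 2593, 1936), nrow = 2, byrow = TRUE,
                 dimnames = list(gndr = c("Female", "Male"),
                                 clsprty = c("No", "Yes")))

# Men who feel close to a party as a share of ALL respondents
overall <- counts["Male", "Yes"] / sum(counts) * 100

# Men who feel close to a party as a share of MEN only
within_men <- counts["Male", "Yes"] / sum(counts["Male", ]) * 100

round(c(overall = overall, within_men = within_men), 1)
```

The first figure matches the table above (20.5); the second answers a different question, namely how common party attachment is among men.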

Task 4

Provide code and output only.

Prompt: In the tutorial, we calculated then visualized the percentage distribution for left vs. right by gender for France. Your task is to replicate the second version of the visualization but for the country of Sweden instead.

Loading Data into R

Let’s load our data. We will work with Sweden now.

sweden_data <- read.fst("sweden_data.fst")

Before proceeding, we need to clean and transform our variable.

sweden_data <- sweden_data %>%
  mutate(
    lrscale = ifelse(lrscale %in% c(77, 88, 99), NA, lrscale), # set values 77, 88, and 99 to NA.
  )

table(sweden_data$lrscale) # the special codes are now NA, so table() no longer shows them

## 
##    0    1    2    3    4    5    6    7    8    9   10 
##  673  427 1193 2069 1826 3880 1765 2656 1905  493  587

Percentages

sweden_data <- sweden_data %>%
  mutate(
    gndr = case_when(
      gndr == 1 ~ "Male",
      gndr == 2 ~ "Female",
      TRUE ~ NA_character_  # Set anything that is not 1 or 2 to NA
    ),
    lrscale = case_when(
      lrscale %in% 0:3 ~ "Left",       # Left-wing (0 to 3)
      lrscale %in% 7:10 ~ "Right",     # Right-wing (7 to 10)
      TRUE ~ NA_character_  # Moderate (4, 5, 6) and special codes (77, 88, 99) set to NA 
    )    
  )

Calculations

lrscale_percentages <- sweden_data %>%  # Begin with the dataset 'sweden_data'
  filter(!is.na(lrscale), !is.na(gndr)) %>%  # Filter out rows where 'lrscale' or 'gndr' is NA (missing data)
  group_by(gndr, lrscale) %>%  # Group the data by 'gndr' and 'lrscale' categories
  summarise(count = n(), .groups = 'drop') %>%  # Count the rows in each group, then drop the grouping
  mutate(percentage = count / sum(count) * 100)  # Divide each count by the total across all four groups and multiply by 100

lrscale_percentages  # The resulting dataframe

## # A tibble: 4 × 4
##   gndr   lrscale count percentage
##   <chr>  <chr>   <int>      <dbl>
## 1 Female Left     2296       23.0
## 2 Female Right    2530       25.3
## 3 Male   Left     2062       20.6
## 4 Male   Right    3107       31.1

Visualization

lrscale_plot <- ggplot(lrscale_percentages, aes(x = lrscale, y = percentage, fill = lrscale)) +
  geom_bar(stat = "identity", position = position_dodge()) +  # Dodged bar chart
  facet_wrap(~ gndr, scales = "fixed") +  # Fixed scales for y-axis across facets
  scale_fill_brewer(palette = "Set1") +  # Distinct colors for Left and Right
  labs(
    title = "Political Orientation (Left vs. Right) by Gender in Sweden",
    x = "Political Orientation",
    y = "Percentage of Respondents",
    fill = "Orientation"
  ) +
  theme_minimal() +  # Minimal theme for clarity
  theme(legend.position = "bottom")  # Legend at the bottom

# Display the ggplot object
lrscale_plot

Task 5

Provide code and answer: In Hungary, what is the conditional probability of NOT feeling close to any particular party given that the person lives in a rural area?

Loading Data into R

Let’s load our data. We will work with Hungary.

hungary_data <- read.fst("hungary_data.fst")

Reviewing & Adding

We will be working with the clsprty variable, which records whether a respondent feels close to any particular party. From there, we will work toward the probability of NOT feeling close to a political party.

Let’s check our work. One quick way to do so:

table(hungary_data$clsprty) # if there are values that are not supposed to be there (e.g., 7, 8, 9 in this case), then we need to deal with it

## 
##    1    2    7    8    9 
## 7342 8679  322  291    8

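The table shows the special codes 7, 8, and 9, which are not substantive answers. We deal with them below: the recode labels only the values 1 and 2, so every other value becomes NA and is then filtered out.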

# Recode domicil into an urban/rural variable (geo), then drop rows with NA in clsprty or geo
hungary_data <- hungary_data %>%
  mutate(
    geo = recode(as.character(domicil), 
                 '1' = "Urban", 
                 '2' = "Urban",
                 '3' = "Rural", 
                 '4' = "Rural", 
                 '5' = "Rural",
                 '7' = NA_character_,
                 '8' = NA_character_,
                 '9' = NA_character_)
  ) %>%
  filter(!is.na(clsprty), !is.na(geo))  # Removing rows with NA in clsprty or geo

hungary_data <- hungary_data %>%
  mutate(
    clsprty = case_when(
      clsprty == 1 ~ "Yes",
      clsprty == 2 ~ "No"  # any other code (7, 8, 9) matches nothing and becomes NA
    )
  ) %>%
  filter(!is.na(clsprty))  # drop the rows that held special codes

In the next step, we will calculate conditional probabilities where:

count(clsprty, geo) tallies the number of occurrences for each combination of clsprty and geo.
group_by(geo) groups the data by the geographical area.
mutate(prob = n / sum(n)) calculates the probability of each clsprty value given each geographical area.

# Calculate conditional probabilities, excluding NAs
cond <- hungary_data %>%
  count(clsprty, geo) %>%
  group_by(geo) %>%
  mutate(prob = n / sum(n))

cond

## # A tibble: 4 × 4
## # Groups:   geo [2]
##   clsprty geo       n  prob
##   <chr>   <chr> <int> <dbl>
## 1 No      Rural  6275 0.554
## 2 No      Urban  2395 0.512
## 3 Yes     Rural  5055 0.446
## 4 Yes     Urban  2283 0.488

So the conditional probability of NOT feeling close to any particular party, given that the person lives in a rural area, is 0.554 (about 55.4%).
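As a cross-check, the same conditional probabilities can be recovered from the printed counts with base R's `prop.table()`. A sketch; the matrix is typed in by hand from the output above, and the object names are illustrative.

```r
# Counts copied from the cond output
counts <- matrix(c(6275, 2395, 5055, 2283), nrow = 2, byrow = TRUE,
                 dimnames = list(clsprty = c("No", "Yes"),
                                 geo = c("Rural", "Urban")))

# margin = 2 divides each cell by its column total, i.e. it conditions on geo
cond_prob <- prop.table(counts, margin = 2)
round(cond_prob, 3)

# P(not close to a party | rural)
cond_prob["No", "Rural"]
```

Each column of `cond_prob` sums to 1, which is a quick sign that the conditioning is on the area type rather than on the party answer.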

Meryem Karayunusoglu

2024-01-29
