R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

packages <- c("tidyverse", "modelsummary", "forcats", "RColorBrewer", 
              "fst", "viridis", "knitr", "kableExtra", "rmarkdown", "ggridges", "viridis", "questionr")


new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)


lapply(packages, library, character.only = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: viridisLite
## 
## 
## Attaching package: 'kableExtra'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
## [[1]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "stats"     "graphics" 
## [13] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[2]]
##  [1] "modelsummary" "lubridate"    "forcats"      "stringr"      "dplyr"       
##  [6] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [11] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [16] "datasets"     "methods"      "base"        
## 
## [[3]]
##  [1] "modelsummary" "lubridate"    "forcats"      "stringr"      "dplyr"       
##  [6] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [11] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [16] "datasets"     "methods"      "base"        
## 
## [[4]]
##  [1] "RColorBrewer" "modelsummary" "lubridate"    "forcats"      "stringr"     
##  [6] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [11] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [16] "utils"        "datasets"     "methods"      "base"        
## 
## [[5]]
##  [1] "fst"          "RColorBrewer" "modelsummary" "lubridate"    "forcats"     
##  [6] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [11] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [16] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[6]]
##  [1] "viridis"      "viridisLite"  "fst"          "RColorBrewer" "modelsummary"
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [21] "methods"      "base"        
## 
## [[7]]
##  [1] "knitr"        "viridis"      "viridisLite"  "fst"          "RColorBrewer"
##  [6] "modelsummary" "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "stats"        "graphics"     "grDevices"    "utils"       
## [21] "datasets"     "methods"      "base"        
## 
## [[8]]
##  [1] "kableExtra"   "knitr"        "viridis"      "viridisLite"  "fst"         
##  [6] "RColorBrewer" "modelsummary" "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "stats"        "graphics"     "grDevices"   
## [21] "utils"        "datasets"     "methods"      "base"        
## 
## [[9]]
##  [1] "rmarkdown"    "kableExtra"   "knitr"        "viridis"      "viridisLite" 
##  [6] "fst"          "RColorBrewer" "modelsummary" "lubridate"    "forcats"     
## [11] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [16] "tibble"       "ggplot2"      "tidyverse"    "stats"        "graphics"    
## [21] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[10]]
##  [1] "ggridges"     "rmarkdown"    "kableExtra"   "knitr"        "viridis"     
##  [6] "viridisLite"  "fst"          "RColorBrewer" "modelsummary" "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [21] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [26] "base"        
## 
## [[11]]
##  [1] "ggridges"     "rmarkdown"    "kableExtra"   "knitr"        "viridis"     
##  [6] "viridisLite"  "fst"          "RColorBrewer" "modelsummary" "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "stats"       
## [21] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [26] "base"        
## 
## [[12]]
##  [1] "questionr"    "ggridges"     "rmarkdown"    "kableExtra"   "knitr"       
##  [6] "viridis"      "viridisLite"  "fst"          "RColorBrewer" "modelsummary"
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [26] "methods"      "base"
ess <- read_fst("All-ESS-Data.fst")
table(ess$essround)
## 
##     1     2     3     4     5     6     7     8     9    10 
## 42359 47537 43000 56752 52458 54673 40185 44387 49519 59685
ess$year <- NA
replacements <- c(2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, 2020)
for(i in 1:10){
  ess$year[ess$essround == i] <- replacements[i]
}
table(ess$fltlnl)
## 
##     1     2     3     4     7     8     9 
## 91011 33675  8198  3902   125   805   142
table(ess$wrhpp)
## 
##     1     2     3     4     7     8     9 
##  7040 32864 63754 32331   136  1560   173
table(ess$brncntr)
## 
##      1      2      7      8      9 
## 443915  45887     78    108    567
france_data <- ess %>% 
  filter(cntry == "FR") %>% 
  mutate(
    fltlnl = ifelse(fltlnl == 2, 0, ifelse(fltlnl %in% c(7, 8, 9), NA, fltlnl)), 
    wrhpp = ifelse(wrhpp %in% c(7, 8, 9), NA, wrhpp), 
    brncntr = ifelse(brncntr %in% c(7, 8, 9), NA, brncntr), 
  )

Task 1 Do a data summary table of three variables of interest. Discuss briefly what you note (i.e., add text in your markdown after the Task 1 code).

datasummary_skim(france_data %>% select(fltlnl, wrhpp, brncntr))
Unique (#) Missing (%) Mean SD Min Median Max
fltlnl 5 69 1.0 1.0 0.0 1.0 4.0
wrhpp 5 69 2.9 0.8 1.0 3.0 4.0
brncntr 3 0 1.1 0.3 1.0 1.0 2.0

From the summary, we can see that most of the respondents have a good relationship with their families; the average number of hours most people worked in the past week is in the middle of the histogram, which means that most people work similar hours; The last line shows all respondents were born in France.

Task 2 Choose one of the three variables you just summarized in the table. This will be your current main outcome of interest.

Produce a visual that showcases the mean (average) for your outcome of interest by survey year (can be, e.g., point + line plot or ridge plot, depending on your variable). Discuss briefly what you note (i.e., add text in your markdown after the Task 2 code).

fltlnl_by_year <- france_data %>%
  group_by(year) %>%
  summarize(mean_fltlnl = mean(fltlnl, na.rm = TRUE))

fltlnl_by_year
## # A tibble: 10 × 2
##     year mean_fltlnl
##    <dbl>       <dbl>
##  1  2002      NaN   
##  2  2004      NaN   
##  3  2006        1.05
##  4  2008      NaN   
##  5  2010      NaN   
##  6  2012        1.08
##  7  2014        1.00
##  8  2016      NaN   
##  9  2018      NaN   
## 10  2020      NaN
france_data %>%
  ggplot(aes(x = fltlnl, y = as.factor(year), fill = as.factor(year))) +
  geom_density_ridges() + 
  labs(title = "Distribution of fltlnl Across Survey Years For France", 
       x = "fltlnl Value", y = "Survey Year",
       fill = "Survey Year") +
  theme_minimal()
## Picking joint bandwidth of 0.164
## Warning: Removed 13168 rows containing non-finite values
## (`stat_density_ridges()`).

Observing the ridge plot of the output, I observed that the survey data for fltlnl all showed almost the same peak value in 2006, 2012, and 2014, indicating that the factors affecting the fltlnl variable remained relatively stable over the years.

Task 3 Provide a comparison visual of your outcome of interest with two other countries. You can choose the geom() you prefer. Discuss briefly what you note (i.e., add text in your markdown after the Task 3 code).

table(ess$fltlnl)
## 
##     1     2     3     4     7     8     9 
## 91011 33675  8198  3902   125   805   142
ess_selected <- ess %>%
  filter(cntry %in% c("FR", "DE", "CH")) %>%
  mutate(fltlnl = ifelse(fltlnl %in% c(1,2), NA, fltlnl))

task3plot <- ggplot(ess_selected, aes(x = reorder(cntry, -fltlnl, FUN=median), y = fltlnl, fill = cntry)) +
  geom_boxplot() +
  theme_minimal() + 
  theme(legend.position = "none") + 
  labs(title = "Boxplot comparison for fltlnl (France, Germany, Switzerland)", 
       x = "Country", 
       y = "Scale (0-10)")

task3plot
## Warning: Removed 69040 rows containing non-finite values (`stat_boxplot()`).

It can be seen from the box plot that the median, first quartile and third quartile of the fltlnld data in the three countries are similar; it can also be seen that the outlier of Switzerland is very high, which means that Switzerland has some specific A situation or group in which they have a different perspective than the majority.

Task 4 Produce a cross-tab between your outcome of interest and a socio-demographic variable (use datasummary_crosstab). Then, calculate column percentages using cprop(), making sure to pick a second socio-demographic variable. Discuss briefly what you note (i.e., add text in your markdown after the Task 4 code).

france_data <- france_data %>%
  mutate(
    edulvla = case_when(
      essround < 5 & edulvla == 55 ~ NA_real_,
      TRUE ~ edulvla
    ),
    edulvlb = case_when(
      essround >= 5 & edulvlb == 5555 ~ NA_real_,
      TRUE ~ edulvlb
    ),

    educ_level = case_when(
      essround < 5 & edulvla == 5 ~ "BA",
      essround >= 5 & edulvlb > 600 ~ "BA",
      TRUE ~ "No BA"
    )
  )

table(france_data$educ_level)
## 
##    BA No BA 
##  4235 14803
fltlnledu <- datasummary_crosstab(fltlnl ~ educ_level, data = france_data)
fltlnledu
fltlnl BA No BA All
0 N 346 1172 1518
% row 22.8 77.2 100.0
1 N 806 2816 3622
% row 22.3 77.7 100.0
3 N 58 355 413
% row 14.0 86.0 100.0
4 N 25 292 317
% row 7.9 92.1 100.0
All N 4235 14803 19038
% row 22.2 77.8 100.0
france_data <- france_data %>%
  mutate(religion = case_when(
    rlgblg == 2 ~ "No",
    rlgblg == 1 ~ "Yes",
    rlgblg %in% c(7, 8, 9) ~ NA_character_,
    TRUE ~ as.character(rlgblg)
  ))

table(france_data$religion)
## 
##   No  Yes 
## 9434 9509
table(france_data$fltlnl, france_data$religion) %>% cprop()
##        
##         No    Yes   All  
##   0      25.8  25.8  25.8
##   1      62.7  60.8  61.7
##   3       7.1   7.0   7.0
##   4       4.4   6.4   5.4
##   Total 100.0 100.0 100.0

When the social demographic variable is education, the number of BAs in fltlnl is 4235, and the number of No BAs is 14803; when the sociodemographic variable is religion, the number of BAs in fltlnl is 9509, and the number of No BAs is 9434.

Task 5 Choose one of the two socio-demographic variables you just worked with. Visualize the conditional probability (or column percentages) of your outcome given your selected socio-dem variable. Discuss briefly what you note (i.e., add text in your markdown after the Task 5 code).

df <- france_data %>%
  filter(!is.na(educ_level) & !is.na(fltlnl))

df <- df %>%
  mutate(fltlnl = case_when(
    fltlnl == 1 ~ "None or almost none of the time",
    fltlnl == 2 ~ "Some of the time",
    fltlnl == 3 ~ "Most of the time",
    fltlnl == 4 ~ "All or almost all of the time",
    fltlnl == 7 ~ "Refusal*",
    TRUE ~ as.character(fltlnl)  
  ))

table(df$fltlnl)
## 
##                               0   All or almost all of the time 
##                            1518                             317 
##                Most of the time None or almost none of the time 
##                             413                            3622
table(df$fltlnl, df$educ_level) %>%
  cprop() %>%
  as.data.frame() %>%
  filter(Var1 != "Total",
         Var2 != "All") %>%
  ggplot(aes(x=Var1, y=Freq, fill=Var2)) +
  geom_col(position = "dodge") +
  labs(title="Feeling Lonely by Education Level in France",
       y = "Conditional Percentage",
       x = "Feeling lonely",
       fill = "At least BA vs. Not")

It is worth noting that the number of BAs who feel lonely at none or almost none of the time in France is greater than the number of No BAs; the number of No BAs who feel lonely in Most of the time and All or almost all of the time is greater than the number of BAs.