Week 2 Data Dive

Julia Souza

Due: Jan 26, 2026

Relevant packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading the Data

library(readr)
Ess_Full <- read_csv("C:/Users/Julia/Desktop/Stats Class Files/ESS Dataset 11.csv")
## Rows: 50116 Columns: 570
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (24): name, proddate, cntry, cntbrthd, lnghom1, lnghom2, fbrncntc, mbrn...
## dbl (546): essround, edition, idno, dweight, pspwght, pweight, anweight, nws...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data cleaning

Disclaimer: This section was coded with the aid of ChatGPT, mostly to avoid retyping all the variable names, which would be very time-consuming. Empty columns were removed manually before the data was loaded.

# List the columns to be removed in a character vector
cols_to_remove <- c(
  # Administrative / interview timing
  "inwds", "ainws", "ainwe", "binwe", "cinwe", "dinwe",
  "einwe", "finwe", "hinwe", "iinwe", "kinwe", "rinwe",
  "inwde", "jinws", "jinwe", "inwtm", "mode",
  
  # Recontact + sampling design
  "recon", "domain", "prob", "stratum", "psu",
  
  # Country-specific party vote variables
  "prtvtdat", "prtvtebe", "prtvtfbg", "prtvtchr", "prtvtccy",
  "prtvtiee", "prtvtffi", "prtvtffr", "prtvgde1", "prtvgde2",
  "prtvtegr", "prtvthhu", "prtvteis", "prtvteie", "prtvteil",
  "prtvteit", "prtvtblv", "prtvclt1", "prtvclt2", "prtvclt3",
  "prtvtbme", "prtvtinl", "prtvtcno", "prtvtfpl", "prtvtept",
  "prtvtbrs", "prtvtesk", "prtvtgsi", "prtvtges", "prtvtese",
  "prtvthch", "prtvtdua", "prtvtdgb",
  
  # Country-specific party closeness variables
  "prtcleat", "prtclebe", "prtclfbg", "prtclbhr", "prtclccy",
  "prtcliee", "prtclgfi", "prtclgfr", "prtclgde", "prtclegr",
  "prtclihu", "prtcleis", "prtclfie", "prtclfil", "prtclfit",
  "prtclblv", "prtclclt", "prtclbme", "prtclhnl", "prtclcno",
  "prtcljpl", "prtclgpt", "prtclbrs", "prtclesk", "prtclgsi",
  "prtclhes", "prtclese", "prtclhch", "prtcleua", "prtcldgb",
  
  # Party closeness strength
  "prtdgcl"
)

# Remove columns from the dataset (overwriting Ess_Full)
Ess_Full <- Ess_Full %>%
  select(-any_of(cols_to_remove))

# Save clean data for future assignments
write.csv(Ess_Full, "Ess_Stats_class.csv", row.names = FALSE) # row.names = FALSE avoids an extra index column on re-import

I will remove more columns once I have a better idea of my project.
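
One caveat with the any_of() call above: it silently skips names that are not present in the data, so a misspelled entry in cols_to_remove would go unnoticed. The sketch below shows one way to check for this with setdiff(). The toy tibble, its values, and the deliberately misspelled name "inwtmm" are made up for illustration and are not part of the ESS file.

```r
library(dplyr)

# Toy stand-in for Ess_Full (hypothetical values, real ESS column names)
toy <- tibble(
  cntry = c("AT", "BE"),
  inwtm = c(55, 61),
  mode  = c(1, 1)
)

# "inwtmm" is a deliberate typo to show the problem
toy_cols_to_remove <- c("inwtmm", "mode")

# Names in the removal list that do not exist in the data
not_found <- setdiff(toy_cols_to_remove, names(toy))
not_found

# any_of() drops only the columns it finds, so inwtm survives silently
toy_clean <- toy |> select(-any_of(toy_cols_to_remove))
names(toy_clean)
```

Running the same setdiff() check against names(Ess_Full) before the select() step would list any entries that did not match a column.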

Data Dive Start

Questions:

Question 1: Investigating country differences. How different are the samples for each country? Is there substantial variation in sample size (summary = count), age of participants (quantiles, median), etc.? Do the differences reflect the national populations, or are they simply due to sample selection?

Question 2: Are there connections between different variables? Are there correlations between variables that can be analyzed in more depth?

Question 3: For respondents in the sampled countries, is there a correlation between time spent using the internet and time spent consuming political news? *This question was devised mainly to generate the scatterplot visualization. I am unsure whether the project will continue in this direction.

Numeric Summaries

# How many countries? (Numeric Summary #1)
length(unique(Ess_Full$cntry))
## [1] 30
# Sample per country (Numeric Summary #2)
Ess_Full |>
  group_by(cntry) |>
  count()
## # A tibble: 30 × 2
## # Groups:   cntry [30]
##    cntry     n
##    <chr> <int>
##  1 AT     2354
##  2 BE     1594
##  3 BG     2239
##  4 CH     1384
##  5 CY      685
##  6 DE     2420
##  7 EE     1293
##  8 ES     1844
##  9 FI     1563
## 10 FR     1771
## # ℹ 20 more rows
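
Question 1 also mentions quantiles and the median of respondent age, which the two summaries above do not cover. A minimal sketch of that summary follows, assuming the same 999 missing code for agea that is used in the boxplot code. The toy tibble and its ages are invented for illustration; on the real data, toy would be replaced by Ess_Full.

```r
library(dplyr)

# Toy stand-in for Ess_Full; ages are made up, 999 is the missing code for agea
toy <- tibble(
  cntry = c("AT", "AT", "AT", "BE", "BE", "BE"),
  agea  = c(20, 40, 999, 30, 50, 70)
)

age_summary <- toy |>
  # Recode the missing-value code to NA before summarising
  mutate(agea = if_else(agea == 999, NA_real_, agea)) |>
  group_by(cntry) |>
  summarise(
    n_valid = sum(!is.na(agea)),
    q25     = quantile(agea, 0.25, na.rm = TRUE, names = FALSE),
    median  = median(agea, na.rm = TRUE),
    q75     = quantile(agea, 0.75, na.rm = TRUE, names = FALSE)
  )
age_summary
```

The n_valid column counts non-missing ages, so it can differ from the per-country row counts shown above.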

Visualizations

Visualizing distributions: Age distribution per country

Ess_Full |>
  mutate(agea = (if_else(agea == 999, NA_real_, agea))) |>
  ggplot() +
  theme_minimal() +
  geom_boxplot(mapping = aes(x = cntry, y = agea)) +
  labs(title="Respondents' age distribution, per country", x = "Country", y= "Respondent's age") # labels!
## Warning: Removed 393 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The age distributions vary little between countries. While some countries skew slightly younger or older, the distributions are broadly similar overall, with the exception of Israel (IL), which skews noticeably younger.

Exploring correlations: Internet Use x Political News consumed

Ess_Full |>
  # Remove (by mutating to NA) administrative coding for missing values. 6666 = Not applicable, 7777 = Refusal, 8888 = Don't know, 9999 = No answer.
  mutate(nwspol = (if_else(nwspol %in% c(7777, 8888, 9999), NA_real_, nwspol))) |>
  mutate(netustm = (if_else(netustm %in% c(6666, 7777, 8888, 9999), NA_real_, netustm))) |>
  # Plot visualization
  ggplot() +
  theme_classic() +
  geom_point(mapping = aes(x = nwspol, y = netustm),
             color = '#37410f') +
  labs(title = "Time using the internet vs Time consuming political news",
       x = "Time consuming political news (in minutes)", y = "Time using the internet (in minutes)")
## Warning: Removed 11627 rows containing missing values or values outside the scale range
## (`geom_point()`).

The visualization above does not suggest a clear relationship between time spent using the internet and time spent consuming political news. My original question was whether time spent on the internet is typically used for consuming political news. To draw firmer conclusions, I would need to explore other relationships, apply filters, and control for other variables.
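
A correlation coefficient could put a number on the visual impression from the scatterplot. The sketch below uses base R's cor() with use = "complete.obs" so that rows with a missing value in either variable are dropped. The toy data frame and its values are made up; only the variable names and the missing-value codes come from the code above. Spearman's rank correlation is included as an assumption on my part, since time-use variables are often skewed.

```r
# Toy stand-in for the two ESS variables (made-up minutes per day)
toy <- data.frame(
  nwspol  = c(30, 60, 9999, 15, 120, 45),
  netustm = c(120, 180, 240, 6666, 300, 90)
)

# Recode the administrative missing codes to NA, as in the plot code
toy$nwspol[toy$nwspol %in% c(7777, 8888, 9999)] <- NA
toy$netustm[toy$netustm %in% c(6666, 7777, 8888, 9999)] <- NA

# Correlations computed only on rows where both values are present
pearson  <- cor(toy$nwspol, toy$netustm, use = "complete.obs")
spearman <- cor(toy$nwspol, toy$netustm, use = "complete.obs",
                method = "spearman")
c(pearson = pearson, spearman = spearman)
```

A coefficient near zero on the real data would be consistent with the lack of a visible pattern, though it would not by itself rule out a non-linear relationship.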

Quick note:

This data dive is slightly directionless. As discussed with the professor earlier, I will change my dataset soon to pair it with a project in a different course. Therefore, I am using the current data solely to work through and learn the content of the first few weeks, and have not conceptualized a long-term project for this specific dataset.

End of Week 2 Data Dive