Julia Souza Due: Jan 26, 2026
Relevant packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
Ess_Full <- read_csv("C:/Users/Julia/Desktop/Stats Class Files/ESS Dataset 11.csv")
## Rows: 50116 Columns: 570
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (24): name, proddate, cntry, cntbrthd, lnghom1, lnghom2, fbrncntc, mbrn...
## dbl (546): essround, edition, idno, dweight, pspwght, pweight, anweight, nws...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Disclaimer: This section was coded with the aid of ChatGPT, mostly to avoid retyping all the variable names, which would be very time-consuming. Empty columns were removed manually before upload.
# List the columns to be removed in a character vector
cols_to_remove <- c(
# Administrative / interview timing
"inwds", "ainws", "ainwe", "binwe", "cinwe", "dinwe",
"einwe", "finwe", "hinwe", "iinwe", "kinwe", "rinwe",
"inwde", "jinws", "jinwe", "inwtm", "mode",
# Recontact + sampling design
"recon", "domain", "prob", "stratum", "psu",
# Country-specific party vote variables
"prtvtdat", "prtvtebe", "prtvtfbg", "prtvtchr", "prtvtccy",
"prtvtiee", "prtvtffi", "prtvtffr", "prtvgde1", "prtvgde2",
"prtvtegr", "prtvthhu", "prtvteis", "prtvteie", "prtvteil",
"prtvteit", "prtvtblv", "prtvclt1", "prtvclt2", "prtvclt3",
"prtvtbme", "prtvtinl", "prtvtcno", "prtvtfpl", "prtvtept",
"prtvtbrs", "prtvtesk", "prtvtgsi", "prtvtges", "prtvtese",
"prtvthch", "prtvtdua", "prtvtdgb",
# Country-specific party closeness variables
"prtcleat", "prtclebe", "prtclfbg", "prtclbhr", "prtclccy",
"prtcliee", "prtclgfi", "prtclgfr", "prtclgde", "prtclegr",
"prtclihu", "prtcleis", "prtclfie", "prtclfil", "prtclfit",
"prtclblv", "prtclclt", "prtclbme", "prtclhnl", "prtclcno",
"prtcljpl", "prtclgpt", "prtclbrs", "prtclesk", "prtclgsi",
"prtclhes", "prtclese", "prtclhch", "prtcleua", "prtcldgb",
# Party closeness strength
"prtdgcl"
)
# Remove columns from the dataset (overwriting the original object)
Ess_Full <- Ess_Full %>%
select(-any_of(cols_to_remove))
# Save clean data for future assignments
write.csv(Ess_Full, "Ess_Stats_class.csv", row.names = FALSE) # row.names = FALSE avoids writing an extra index column
I will remove more columns once I have a better idea of my project.
Question 1: Investigating country differences. How different are the samples for each country? Is there substantial variation in sample size (summary = count), age of participants (quantiles, median), etc.? Do the differences reflect the national populations, or are they simply due to sample selection?
Question 2: Are there connections between different variables? Is there a correlation between variables that can be analyzed more in depth?
Question 3: For respondents in the sampled countries, is there a correlation between time spent using the internet and time spent consuming political news? *This question was devised with the goal of generating the scatterplot visualization. I am unsure whether this will remain the direction of the project.
# How many countries? (Numeric Summary #1)
length(unique(Ess_Full$cntry))
## [1] 30
# Sample per country (Numeric Summary #2)
Ess_Full |>
group_by(cntry) |>
count()
## # A tibble: 30 × 2
## # Groups: cntry [30]
## cntry n
## <chr> <int>
## 1 AT 2354
## 2 BE 1594
## 3 BG 2239
## 4 CH 1384
## 5 CY 685
## 6 DE 2420
## 7 EE 1293
## 8 ES 1844
## 9 FI 1563
## 10 FR 1771
## # ℹ 20 more rows
Ess_Full |>
mutate(agea = (if_else(agea == 999, NA_real_, agea))) |>
ggplot() +
theme_minimal() +
geom_boxplot(mapping = aes(x = cntry, y = agea)) +
labs(title="Respondents' age distribution, per country", x = "Country", y= "Respondent's age") # labels!
## Warning: Removed 393 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
The age distributions vary little between countries. While some countries skew slightly younger or older, the distributions are broadly similar, with the exception of Israel (IL), which skews noticeably younger.
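The boxplot comparison can be backed by the quantiles and medians mentioned in Question 1. The following is a minimal sketch, assuming the same 999 missing-value code for agea used in the plot. The helper name summarise_age and the toy tibble are hypothetical and exist only so that the example runs on its own; the toy values are not ESS data.

```r
library(dplyr)

# Hypothetical helper: per-country age quartiles, treating 999 as missing
summarise_age <- function(data) {
  data |>
    mutate(agea = na_if(agea, 999)) |>
    group_by(cntry) |>
    summarise(
      n_valid = sum(!is.na(agea)),
      q25     = quantile(agea, 0.25, na.rm = TRUE, names = FALSE),
      median  = median(agea, na.rm = TRUE),
      q75     = quantile(agea, 0.75, na.rm = TRUE, names = FALSE),
      .groups = "drop"
    )
}

# Toy data for illustration only (made-up country codes and ages)
toy_age <- tibble(
  cntry = c("AA", "AA", "AA", "BB", "BB", "BB"),
  agea  = c(20, 40, 999, 30, 50, 70)
)

age_summary <- summarise_age(toy_age)
age_summary
```

Calling summarise_age(Ess_Full) should give one row per country, which makes it easier to check whether a country such as IL really has a lower median than the rest.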
Exploring correlations: Internet use x political news consumed
Ess_Full |>
# Recode administrative missing-value codes to NA.
# 6666 = Not applicable (recoded for netustm only), 7777 = Refusal, 8888 = Don't know, 9999 = No answer.
mutate(nwspol = (if_else(nwspol %in% c(7777, 8888, 9999), NA_real_, nwspol))) |>
mutate(netustm = (if_else(netustm %in% c(6666, 7777, 8888, 9999), NA_real_, netustm))) |>
# Plot visualization
ggplot() +
theme_classic() +
geom_point(mapping = aes(x = nwspol, y = netustm),
color = '#37410f') +
labs(title = "Time using the internet vs Time consuming political news",
x = "Time consuming political news (in minutes)", y = "Time using the internet (in minutes)")
## Warning: Removed 11627 rows containing missing values or values outside the scale range
## (`geom_point()`).
The visualization above does not suggest a clear relationship between time spent using the internet and time spent consuming political news. My original question was whether time spent using the internet is typically used for consuming political news. To draw firmer insights, I would need to explore other relationships, apply filters, and control for other variables.
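One way to put a number on the scatterplot is a correlation coefficient computed on the same recoded variables. The following is a minimal sketch, not a definitive analysis: the helper name news_internet_cor and the toy tibble are hypothetical, and the choice of Spearman's rank correlation is an assumption made here because time-use variables are often right-skewed (pass method = "pearson" for the default linear version).

```r
library(dplyr)

# Hypothetical helper: recode ESS missing-value codes to NA, then correlate
news_internet_cor <- function(data, method = "spearman") {
  cleaned <- data |>
    mutate(
      nwspol  = if_else(nwspol %in% c(7777, 8888, 9999), NA_real_, nwspol),
      netustm = if_else(netustm %in% c(6666, 7777, 8888, 9999), NA_real_, netustm)
    )
  # use = "complete.obs" keeps only rows where both variables are present
  cor(cleaned$nwspol, cleaned$netustm, use = "complete.obs", method = method)
}

# Toy data for illustration only (made-up minutes, plus two missing-value codes)
toy_time <- tibble(
  nwspol  = c(10, 30, 60, 90, 7777, 20),
  netustm = c(60, 120, 180, 240, 100, 6666)
)

rho <- news_internet_cor(toy_time)
rho
```

Running news_internet_cor(Ess_Full) would give a single pooled coefficient across all countries; it says nothing about differences between countries or about confounding variables.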
This data dive is slightly directionless. As discussed with the professor earlier, I will change my dataset soon to pair it with a project in a different course. Therefore, I am using the current data solely to work through and learn the content of the first few weeks, and have not conceptualized a long-term project for this specific dataset.
End of Week 2 Data Dive