Visualizing the 'Young Germany during COVID-19' data from the TUI Foundation

1.) PREPARATION

In this analysis we work with the data set "Young Germany during COVID-19", published by the TUI Foundation (https://data.gesis.org/sharing/#!Detail/10.7802/2125). The data set can only be accessed after agreeing to the terms of use, so we download it first and then import it into R from the local disk. It is available in several formats; since a data file in R format is provided, we use that one. Because the data set is fairly large, it is advisable to download the codebook along with it.

ygcv19 <- readRDS("~/Downloads/youngeurope-2020-2-v100.rds")
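Since the file has to be downloaded by hand, the import fails with a rather terse error if the path is wrong. A minimal sketch of a guarded import is shown here; the path is the one used above, and the name `data_path` is introduced only for illustration.

```r
# Path from the step above; adjust it to wherever the download was saved
data_path <- path.expand("~/Downloads/youngeurope-2020-2-v100.rds")

if (file.exists(data_path)) {
  # readRDS() restores a single R object that was saved with saveRDS()
  ygcv19 <- readRDS(data_path)
  # Quick look at the size of the data set (rows, columns)
  print(dim(ygcv19))
} else {
  message("File not found: ", data_path,
          " (download it from GESIS after accepting the terms of use)")
}
```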

2.) SELECTING THE VARIABLES

In a second step we take a closer look at a few variables. We select the following ones (see the codebook):

Background variables: a.) Age group (1. Young Adults (age 16 to 26) / 2. Adults (age 27 and older)), b.) Gender, c.) Education Group, d.) Highest school certificate (DE1), e.) Highest education level (DE2)

Measurement variables: f.) lebenswelt_4: Future Outlook, g.) democracy_31_2_grid: Threat or Opportunity: Digitalisation, h.) democracy_12_3_grid: Societal conflicts: People with and without university education, i.) corona_2: Compliance with corona measurements, j.) corona_3_1_grid: Compliance: To protect your own health, k.) corona_3_2_grid: Compliance: To protect the health of others, l.) corona_4: Corona measurement.

First, we collect these variables in a reduced data set. This requires the packages dplyr and haven. We create a new data frame c19youth and then filter out the rows with missing-value codes (977/999).

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
require(haven)

## Loading required package: haven

c19youth <- ygcv19 %>% 
    select(id, age_group, gender, edu_group, edu_de_1, edu_de_2,lebenswelt_4, democracy_31_2_grid, democracy_12_3_grid, corona_2, corona_3_1_grid, corona_3_2_grid, corona_4 ) %>% 
    drop_na()
  
# Drop rows in which any variable carries the code 977 (& is needed here: with | a row would only be dropped if every variable were 977)
c19youth <- filter(c19youth, edu_de_1 != 977 & edu_de_2 != 977 & lebenswelt_4 != 977 & democracy_31_2_grid != 977 & democracy_12_3_grid != 977 & corona_2 != 977 & corona_3_1_grid != 977 & corona_3_2_grid != 977 & corona_4 != 977)

# Same for the code 999
c19youth <- filter(c19youth, edu_de_1 != 999 & edu_de_2 != 999 & lebenswelt_4 != 999 & democracy_31_2_grid != 999 & democracy_12_3_grid != 999 & corona_2 != 999 & corona_3_1_grid != 999 & corona_3_2_grid != 999 & corona_4 != 999)
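The two long filter calls can also be written more compactly. The following sketch uses `if_all()` from dplyr (available since version 1.0.4) on a small invented data frame; the values in `toy` are made up for illustration and are not taken from the survey. With the labelled columns of the real data set, the labels may have to be removed first (for example with `haven::zap_labels()`); that is an assumption and was not tested here.

```r
library(dplyr)

# Toy data standing in for the survey; all values are invented
toy <- tibble(
  id       = 1:5,
  corona_2 = c(1, 977, 3, 999, 2),
  corona_4 = c(2, 3, 977, 4, 5)
)

# Codes that mark missing answers
missing_codes <- c(977, 999)

# Keep a row only if none of the listed columns holds a missing code
toy_clean <- toy %>%
  filter(if_all(c(corona_2, corona_4), ~ !(.x %in% missing_codes)))

toy_clean
```

Only the rows with id 1 and 5 remain, because every other row contains 977 or 999 in at least one of the two columns.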

3.) First visualization with ggplot2: bar plot

We start with a simple plot of one of the measurement variables (corona_4). This requires the package ggplot2, which is part of the tidyverse. We build the plot by adding layers with the + operator and extend it systematically in the following steps.

require(ggplot2)
library(tidyverse)
c19youth2 <- as.data.frame(c19youth)

ggplot(data = c19youth2, aes(x = corona_4)) +
    geom_bar()

## Don't know how to automatically pick scale for object of type haven_labelled/vctrs_vctr/double. Defaulting to continuous.

4.) This visualization is not very informative yet: the individual response options cannot be identified. We therefore replace the numeric category codes in the data with labels.

ggplot(data = c19youth2, aes(x = corona_4)) +
  geom_bar() +
  scale_x_discrete(limits = c("1", "2", "3", "4", "5"),
                   labels = c("Not sufficient", "Rather not sufficient",
                              "Appropriate", "Rather exaggerated", "Exaggerated"))

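An alternative to relabelling the axis by hand is to convert the labelled variable into a factor before plotting, which also avoids the scale warning shown in step 3. The sketch below uses `haven::as_factor()` on an invented labelled vector; it assumes that the variable carries value labels like the ones typed in above, which should be checked against the codebook.

```r
library(haven)
library(ggplot2)

# Toy data with a labelled variable; values and labels are invented for illustration
toy <- data.frame(id = 1:6)
toy$corona_4 <- labelled(
  c(1, 2, 3, 3, 4, 5),
  labels = c("Not sufficient" = 1, "Rather not sufficient" = 2,
             "Appropriate" = 3, "Rather exaggerated" = 4, "Exaggerated" = 5)
)

# as_factor() turns the value labels into factor levels, ordered by value
toy$corona_4_f <- as_factor(toy$corona_4)

# A factor is a discrete variable, so ggplot2 picks a discrete x scale itself
p <- ggplot(toy, aes(x = corona_4_f)) +
  geom_bar()
p
```

With a factor on the x axis, the labels come from the data itself, so they cannot drift out of sync with the codes.
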
5.) We are still not satisfied with the visualization. We want to change the bar color, adjust the axis labels, and add a title and a source note. To do this, we extend the code accordingly.

ggplot(data = c19youth2, aes(x = corona_4)) +
  geom_bar(fill = "steelblue") +
  scale_x_discrete(limits = c("1", "2", "3", "4", "5"),
                   labels = c("Not sufficient", "Rather not sufficient",
                              "Appropriate", "Rather exaggerated", "Exaggerated")) +
  labs(title = "Estimation of Corona Measurement (n = 780)",
       subtitle = "Data from 2020",
       caption = "Source: Young Germany during COVID-19 (https://doi.org/10.7802/2125)",
       x = "Estimation level",
       y = "Count")
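
The sample size in the title is typed in by hand, so it no longer matches the plot if the filtering changes. A minimal sketch of deriving it from the plotted data instead is shown here; the data frame `toy` and the names `n_obs` and `plot_title` are invented for illustration.

```r
library(ggplot2)

# Toy data standing in for the filtered survey; values are invented
toy <- data.frame(corona_4 = factor(c(1, 2, 3, 3, 4, 5)))

# Number of observations that end up in the plot
n_obs <- nrow(toy)

# Build the title from the data instead of hard-coding the number
plot_title <- paste0("Estimation of Corona Measurement (n = ", n_obs, ")")

p <- ggplot(toy, aes(x = corona_4)) +
  geom_bar(fill = "steelblue") +
  labs(title = plot_title,
       x = "Estimation level",
       y = "Count")
p
```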