My coding goal for week 9 was to get started on my exploratory analyses.
I wanted to determine whether someone's country of residence influenced their life satisfaction scores at time 1 (pre-covid), and how those scores changed at time 2 (during-covid).
library(tidyverse) # for dplyr and ggplot
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr) # for data frame manipulation functions
library(ggplot2) # for graphs
To do this, I used the data from study 2 and looked at life satisfaction at time 1 and time 2, by country. I chose to analyse only countries with more than 14 participants, because country samples of fewer than 15 were too small to be meaningful (the paper's exploratory analyses also used countries with more than 14 participants).
I called the data frame LS_Country to indicate that I was measuring Life Satisfaction (LS) by country. The life satisfaction difference variable was included to compare figures later on.
LS_Country <- read.csv("Study 2.csv")
LS_Country <- LS_Country %>%
  group_by(Country) %>% # group scores by country
  select(Country, T1SWLS, T2SWLS, SWLS_Diff) %>% # keep only the country, the T1 and T2 life satisfaction scores, and their difference
  summarise(sample_size = n(), # create a new column that displays the number of participants
            "T1LS" = mean(T1SWLS), # mean life satisfaction at time 1
            "T2LS" = mean(T2SWLS), # mean life satisfaction at time 2
            "LS_Diff" = mean(SWLS_Diff) # mean difference in life satisfaction between T1 and T2
  ) %>%
  filter(sample_size > 14) %>% # drop all countries with fewer than 15 participants
  arrange(desc(sample_size)) # arrange with the highest no. of participants at the top
LS_Country
## # A tibble: 5 x 5
## Country sample_size T1LS T2LS LS_Diff
## <chr> <int> <dbl> <dbl> <dbl>
## 1 USA 104 3.98 4.03 0.0519
## 2 UK 90 4.14 4.02 -0.116
## 3 Poland 29 3.81 4 0.186
## 4 Portugal 23 3.90 3.65 -0.252
## 5 Canada 15 3.83 4.12 0.293
Of the 28 countries, only 5 have 15 or more participants: USA, UK, Poland, Portugal and Canada. Now I will compare their mean scores in a bar graph.
To create a bar graph with a side-by-side comparison of T1 and T2 for each country, I first have to convert my data frame from wide to long. Here I use the pivot_longer function, which takes the T1LS and T2LS columns, puts their column names into a new column labelled 'Time', and puts the matching means into a separate column labelled 'Means'. So now my data frame looks like this:
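As a quick sanity check on the summary table, the sketch below recomputes the difference column from the two means. It assumes that SWLS_Diff is coded as time 2 minus time 1 (the data dictionary is not shown here, so this is an assumption), and it uses the rounded values copied from the printed table rather than the raw data. The names ls_summary and ls_check are made up for this illustration.

```r
library(dplyr)

# Country means copied from the rounded summary table printed above.
ls_summary <- data.frame(
  Country = c("USA", "UK", "Poland", "Portugal", "Canada"),
  T1LS    = c(3.98, 4.14, 3.81, 3.90, 3.83),
  T2LS    = c(4.03, 4.02, 4.00, 3.65, 4.12),
  LS_Diff = c(0.0519, -0.116, 0.186, -0.252, 0.293)
)

# Assumption: the difference score is time 2 minus time 1.
ls_check <- ls_summary %>%
  mutate(recomputed = T2LS - T1LS,                      # difference of the two means
         matches    = abs(recomputed - LS_Diff) < 0.01) # small tolerance for rounding

ls_check
```

If every row of matches is TRUE, a positive LS_Diff can be read as life satisfaction going up during covid, and a negative one as it going down.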
LS_Country_Long <- pivot_longer(LS_Country,
                                cols = c(T1LS, T2LS),
                                names_to = "Time",
                                values_to = "Means")
LS_Country_Long
## # A tibble: 10 x 5
## Country sample_size LS_Diff Time Means
## <chr> <int> <dbl> <chr> <dbl>
## 1 USA 104 0.0519 T1LS 3.98
## 2 USA 104 0.0519 T2LS 4.03
## 3 UK 90 -0.116 T1LS 4.14
## 4 UK 90 -0.116 T2LS 4.02
## 5 Poland 29 0.186 T1LS 3.81
## 6 Poland 29 0.186 T2LS 4
## 7 Portugal 23 -0.252 T1LS 3.90
## 8 Portugal 23 -0.252 T2LS 3.65
## 9 Canada 15 0.293 T1LS 3.83
## 10 Canada 15 0.293 T2LS 4.12
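To convince myself that nothing gets lost in the reshape, here is a minimal sketch on a two-country toy table (values copied from the output above). It pivots to long and then back to wide with pivot_wider, which is the inverse operation in tidyr. The object names wide, long and back are just for illustration.

```r
library(tidyr)

# Toy wide table: one row per country, one column per time point.
wide <- data.frame(
  Country = c("USA", "UK"),
  T1LS    = c(3.98, 4.14),
  T2LS    = c(4.03, 4.02)
)

# Wide to long: one row per country per time point.
long <- pivot_longer(wide,
                     cols = c(T1LS, T2LS),
                     names_to = "Time",
                     values_to = "Means")

# Long back to wide: should recover the original columns.
back <- pivot_wider(long,
                    names_from = Time,
                    values_from = Means)

long
back
```

The long version has twice as many rows as the wide one (one per country per time point), which is the shape ggplot needs to map Time to the fill colour.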
I then create the graph using ggplot and geom_bar, assigning Country to the x axis and Means to the y axis. Using Time as the fill aesthetic differentiates Time 1 and Time 2 by colour (red and blue by default). I had to use stat = "identity", which tells ggplot that I will provide the values for y myself, rather than letting it count the number of rows for each x value (which is the default). The position = "dodge" argument puts the two bars for each country next to each other instead of stacking them, something I discovered after a bit of Google searching. width = 0.5 sets the width of the bars.
scale_y_continuous(limits = c(0, 5)) sets the limits of the y axis to 0 and 5, which is the scale of the life satisfaction measure, and expand = c(0, 0) removes the gap between the x axis and the bars. labs is used to label the y axis. scale_fill_discrete(labels = ) lets me relabel the fill variable (Time) and therefore change the key labels.
LS_Country_Graph <- ggplot(data = LS_Country_Long) +
  geom_bar(mapping = aes(x = Country, y = Means, fill = Time),
           stat = 'identity',
           position = "dodge",
           width = 0.5) +
  scale_y_continuous(limits = c(0, 5), expand = c(0, 0)) +
  labs(y = "Mean life satisfaction") +
  scale_fill_discrete(labels = c("Pre-Covid Life Satisfaction", "During-Covid Life Satisfaction")) # labels for the key
LS_Country_Graph
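A possible alternative, sketched here on a small toy data frame rather than the real data: ggplot2 also has geom_col, which plots the y values as given and so does the same job as geom_bar with stat = "identity". Note that position and width are passed as arguments to the geom, outside aes(), while only the column mappings go inside aes(). The names ls_long and ls_plot are made up for this example.

```r
library(ggplot2)

# Toy long-format data: two countries, two time points (values from the table above).
ls_long <- data.frame(
  Country = rep(c("USA", "UK"), each = 2),
  Time    = rep(c("T1LS", "T2LS"), times = 2),
  Means   = c(3.98, 4.03, 4.14, 4.02)
)

ls_plot <- ggplot(ls_long, aes(x = Country, y = Means, fill = Time)) +
  geom_col(position = "dodge", width = 0.5) +              # bar heights taken straight from Means
  scale_y_continuous(limits = c(0, 5), expand = c(0, 0)) + # same y axis settings as above
  labs(y = "Mean life satisfaction")

ls_plot
```

This should draw the same kind of dodged bar chart; which version to use is mostly a matter of taste.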
Overall, this week was pretty successful. I did run into some issues around the position and stat arguments and the aesthetics, but other than that, it went pretty smoothly.
Next, I will be finishing off my exploratory analyses and writing my recommendations for my verification report.