Goals for this week

My coding goals for week 9 were to get started on my exploratory analyses.

My first exploratory analysis

I wanted to determine whether someone's country of residence influenced life satisfaction scores at time 1 (pre-covid) and how that changed at time 2 (during-covid).

Loading libraries

library(tidyverse) # for dplyr and ggplot

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr) # for data frame manipulation functions
library(ggplot2) # for graphs

Reading and creating data frame

To do this, I used the data from study 2 and looked at life satisfaction at time 1 and time 2, by country. I chose to analyse only countries with more than 14 participants, because country samples smaller than 15 were too small to be meaningful (the paper's exploratory analyses also used countries with more than 14 participants).

I called the data frame LS_Country to indicate that I was measuring life satisfaction (LS) by country. The life satisfaction difference variable (SWLS_Diff) was included so that the changes could be compared later on.

LS_Country <- read.csv("Study 2.csv")

LS_Country <- LS_Country %>% 
  select(Country, T1SWLS, T2SWLS, SWLS_Diff) %>% # Keep only the country, the T1 and T2 life satisfaction scores, and their difference
  group_by(Country) %>% # Group scores by country
  summarise(sample_size = n(), # create a new column that displays number of participants
            T1LS = mean(T1SWLS), # create a column that displays mean life satisfaction at time 1
            T2LS = mean(T2SWLS), # create a column that displays mean life satisfaction at time 2
            LS_Diff = mean(SWLS_Diff) # create a column that displays the mean life satisfaction difference between T1 and T2
            ) %>% 
  filter(sample_size > 14) %>% # keep only countries with 15 or more participants
  arrange(desc(sample_size)) # arrange with the highest no. of participants at the top

LS_Country

## # A tibble: 5 x 5
##   Country  sample_size  T1LS  T2LS LS_Diff
##   <chr>          <int> <dbl> <dbl>   <dbl>
## 1 USA              104  3.98  4.03  0.0519
## 2 UK                90  4.14  4.02 -0.116 
## 3 Poland            29  3.81  4     0.186 
## 4 Portugal          23  3.90  3.65 -0.252 
## 5 Canada            15  3.83  4.12  0.293

Only 5 of the 28 countries have 15 or more participants: USA, UK, Poland, Portugal and Canada. Now we will compare these mean scores in a bar graph.
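
Before graphing, a quick sanity check on the LS_Diff column seemed worthwhile. The sketch below assumes that SWLS_Diff is coded as T2 minus T1 (this is my assumption, the data file does not say so here) and that no scores are missing, in which case the mean difference should equal the difference of the means. It uses the rounded means copied from the output above rather than the raw data, so the comparison only holds to within rounding error. The names ls_check, recomputed and matches are made up for this check.

```r
library(dplyr)
library(tibble)

# Rounded values copied from the printed LS_Country tibble above
ls_check <- tribble(
  ~Country,   ~T1LS, ~T2LS, ~LS_Diff,
  "USA",       3.98,  4.03,  0.0519,
  "UK",        4.14,  4.02, -0.116,
  "Poland",    3.81,  4.00,  0.186,
  "Portugal",  3.90,  3.65, -0.252,
  "Canada",    3.83,  4.12,  0.293
)

# Assumption: the difference is T2 minus T1.
# Tolerance of 0.01 allows for each mean being rounded to 2 decimal places.
ls_check <- ls_check %>%
  mutate(recomputed = T2LS - T1LS,
         matches = abs(recomputed - LS_Diff) < 0.01)

ls_check
```

If every row of matches is TRUE, positive LS_Diff values can be read as an increase in life satisfaction from time 1 to time 2.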

To create a bar graph with a side-by-side comparison of T1 and T2 by country, I first have to convert my data frame from wide to long. Here I use the pivot_longer function, which takes the T1LS and T2LS columns from my data frame, puts the names of those columns into a new column labelled ‘Time’, and puts the corresponding means into a separate column labelled ‘Means’. My data frame now looks like this:

LS_Country_Long <- pivot_longer(LS_Country, 
                                cols = c(T1LS, T2LS), 
                                names_to = "Time", 
                                values_to = "Means")
LS_Country_Long

## # A tibble: 10 x 5
##    Country  sample_size LS_Diff Time  Means
##    <chr>          <int>   <dbl> <chr> <dbl>
##  1 USA              104  0.0519 T1LS   3.98
##  2 USA              104  0.0519 T2LS   4.03
##  3 UK                90 -0.116  T1LS   4.14
##  4 UK                90 -0.116  T2LS   4.02
##  5 Poland            29  0.186  T1LS   3.81
##  6 Poland            29  0.186  T2LS   4   
##  7 Portugal          23 -0.252  T1LS   3.90
##  8 Portugal          23 -0.252  T2LS   3.65
##  9 Canada            15  0.293  T1LS   3.83
## 10 Canada            15  0.293  T2LS   4.12
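
To convince myself that pivot_longer only reshapes the data and does not change any values, here is a minimal sketch on a two-country toy table (values copied from the output above). It pivots to long format and then back to wide with pivot_wider. The object names wide, long and back are just for this illustration.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per country, one column per time point
wide <- tribble(
  ~Country, ~T1LS, ~T2LS,
  "USA",     3.98,  4.03,
  "UK",      4.14,  4.02
)

# Wide to long: column names go into "Time", cell values into "Means"
long <- pivot_longer(wide,
                     cols = c(T1LS, T2LS),
                     names_to = "Time",
                     values_to = "Means")

# Long back to wide: should reproduce the original table
back <- pivot_wider(long, names_from = Time, values_from = Means)

long
back
```

Each country contributes one row per time point in the long version, which is why the 5-row LS_Country table becomes a 10-row LS_Country_Long table.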

Graphing

I then create the graph using ggplot and geom_bar, assigning Country to the x axis and Means to the y axis. Using ‘Time’ as the fill aesthetic differentiates Time 1 and Time 2 with red and blue. I had to use stat = ‘identity’, which tells ggplot that I will provide the values for y, rather than letting it count the number of rows for each x value (which is the default). The position = “dodge” argument puts the bars for each country next to each other, something I discovered after a bit of Google searching. width = 0.5 sets the width of the bars.

scale_y_continuous(limits = c(0, 5)) sets the limits of the y axis to 0 to 5, which is the scale of the life satisfaction measure. expand = c(0, 0) removes the gap between the x axis and the bottom of the bars. labs is used to label the y axis. scale_fill_discrete(labels = ) relabels the levels of the fill variable (Time) and therefore changes the key labels.

LS_Country_Graph <- ggplot(data = LS_Country_Long) +
  geom_bar(mapping = aes(x = Country, y = Means, fill = Time), 
           stat = 'identity', 
           position = "dodge",
           width = 0.5) +
  scale_y_continuous(limits = c(0, 5), expand = c(0, 0)) +
  labs(y = "Mean life satisfaction") +
  scale_fill_discrete(labels = c("Pre-Covid Life Satisfaction", "During-Covid Life Satisfaction"))  # labels for the key

LS_Country_Graph
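
As an aside, ggplot2 also has geom_col, which is geom_bar with stat = 'identity' built in. The sketch below is an alternative way of writing the same graph, not what I actually ran. It rebuilds the long data by hand from the means printed above so that it runs without the Study 2 file. The names ls_long and p are made up for this example.

```r
library(ggplot2)

# Long-format means copied from the printed LS_Country_Long tibble above
ls_long <- data.frame(
  Country = rep(c("USA", "UK", "Poland", "Portugal", "Canada"), each = 2),
  Time    = rep(c("T1LS", "T2LS"), times = 5),
  Means   = c(3.98, 4.03, 4.14, 4.02, 3.81, 4.00, 3.90, 3.65, 3.83, 4.12)
)

# geom_col uses the y values as given, so no stat argument is needed
p <- ggplot(ls_long, aes(x = Country, y = Means, fill = Time)) +
  geom_col(position = "dodge", width = 0.5) +
  scale_y_continuous(limits = c(0, 5), expand = c(0, 0)) +
  labs(y = "Mean life satisfaction") +
  scale_fill_discrete(labels = c("Pre-Covid Life Satisfaction",
                                 "During-Covid Life Satisfaction"))

p
```

Note that stat, position and width are arguments to the geom itself, while x, y and fill go inside aes().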

Overall, this was pretty successful. I did run into some issues around the position and stat arguments in aesthetics, but other than that, it went pretty smoothly.

Goals for week 10

Finishing off my exploratory analyses and my recommendations for my verification report.

Week 9 learning log