PSYC3361 Verification Report

Did Social Connection Decline During the First Wave of COVID-19?: The Role of Extraversion

Authors: Saraa Al-Saddik Group: Tuesday 1-3pm Group 5 Date: 7th August 2022

R code: - Study 1 R code - Study 2 R code

CSV files: - Study 1 csv - Study 2 csv

Part 1 Summary and Reaction:

Summary (150 words)

Reaction (150 words)

Part 2:

Verification goals

The goal of the verification section of this report was to reproduce the demographic descriptives, the means/ SD reported in the text and all figures from the article. This was done to help get a better idea behind the challenges encountered when reproducing other studies where these challenges will be highlighted throughout the report.

Demographic statistics reported in the study:

Study 1:

“Our final sample included 467 participants (age: M = 20.89, SD = 3.03; 77% women)who completed both out Time1 and Time2 surveys and met our inclusion criteria”.

Study 2:

  • Time 1:
    • “Participants (N=386; age:M=31.66, SD = 11.86; 55% Male; 80% White; 46% single/ never married; 31% U.S, 27% U.K) were recruited from Prolific Academic)”
  • Time 2:
    • “Our final sample comprised 336 participants (age: M=32.03, SD= 11.94; 55% Male; 80% White; 45% single/never married; 32% U.S; 27% U.K) wjho completed both our Time 1 and Time 2 surveys and met our pre-registered inclusion criteria.”

Means/SD reported in the text:

Study 1:

In-text results: - Physical/Social distancing: - “(98.5%) reproted practicing physical/social distancing” - “No one outside their household came within 6 feet of them the day before (Mode= 0, M = 0.77, SD = 1.39)” - Changes in social connectedness due to COVID: - “…Participants reported lower levels of social connection during the COVID-19 pandemic (Time 2: M= 3.97, SD= 0.85) than before (Time 1: M = 4.11, SD = 0.88)”. - Social connectedness x extraversion levels: - Exploratory Analysis: “…found that extraversion was weakly associated with declines in social connectedness. To unpack this finding, we split our sample into those with an extraversion score at or below in the 25th percentile and those at or above the 75th percentile. Our most intraverted participants exhibited a small drop in social connectedness between Time 1 (M = 3.45, SD= 0.70) and Time 2 (M = 3.35, SD = 0.66)”…; whereas the most extraverted participants exhibited a larger drop… in social connectedness between Time 1 (M = 4.70, SD = 0.72) and Time 2 (M = 4.45, SD = 0.84)“. - Lethargy levels (wellbeing proxy): -”To test whether momentary lethargy levels changed from before to during the pandemic….lethargy increased from Time 1 (M= 2.60, SD = 1.16) to Time 2(M= 3.16, SD = 1.27)“.

Table results: - Mean and SD reported in table 1:

Study 2:

In text results: - Physical/Social distancing: - “92.9% of participants reported practicing physical/ social distancing”.
- “No one got within 6 feet of them the day before (Mode = 0, M = 1.11, SD= 0.75)” - Change in social connectedness due to COVID: - “…we found no difference between participants’ reports of relatedness prior to the pandemic (Time 1: M= 4.90, SD = 1.11) versus dueing the pandemic (Time 2: M = 4.91, SD= 1.15).” - Change in loneliness levels due to COVID: - “Participants reported feeling a little less lonely during the pandemic (Time 2: M = 2.07, SD = 0.63) than before (Time 1: M= 2.14, SD= 0.67)”. - Loneliness levels x ectraverion levels: - “…higher levels of extraversion were linked to smaller decreases in loneliness…To further investigate this finding we again subdivided our sample to focus on participants in the top and bottom quartiles for extraversion. Whereas our most extraverted participants showed no change in loneliness from Time 1 (M = 1.64, SD= 0.51) to Time 2 (M= 1.67, SD= 0.49)”…, our most introverted participants decreased in loneliness from Time 1 (M = 2.56, SD= 0.63) to Time 2(M= 2.31, SD= 0.63)“. - Changes in life satisfaction levels due to COVID: -”…life satisfaction did not change from Time 1(M= 3.96, SD= 1.56) to Time 2(M= 3.98, SD= 1.46).”

Table results:

Figures reported in the text:

Study 1:

Figure 1:

Study 2:

Figure 2:

Figure 3:

Steps to replicating data:

Locate the open data

The first step was to locate the open data for both studies which was easily located by following the OSF links under the Methods section for each study.

The OSF Repositories are: - Study 1 - Study 2

Code book:

We then went to find the datacodes for each study so we could see what each variable was labelled: - Study 1: - Study 2:

Replication process:

As mentioned earlier, there were two studies in this article. Therefore, there were two separate csv file for each study. Therefore replication for each study occurred independantly starting with Study 1.

Demographic statistics:

Study 1:

Packages and reading data

# Read the Data STUDY 1-----------------------------------------------------------

# install the packages 
install.packages("tidyverse")  # for most of the functions needed 
install.packages("dplyr")      # for efficient data manipulation
install.packages("lsr")        # used to compute mode 
install.packages("psych")      # used to create table1 

# load the packages 
library(tidyverse)
library(lsr)
library(psych)

Note that we knew to load some of these packages by looking at the R code provided by the authors in the OSF file where we used the relevant packages that we deemed we needed to reproduce the data.

Reading Study 1 data

Raw study 1 data from the OSF file was saved as “study1_data.csv” in my work directory whereby the .csv format allowed data to be read using the read_csv() function to get the data into the Rmd file.The read_csv allows data to be imported into R as a tibble (a package used to manipulate and print data frames).

The data is then named as “study1_raw” by assigning it (<-) to the read_csv function.

# read the data 
study1_raw <- read_csv(file = "study1_data.csv")

Exclusion criteria:

Lucky for us there was no issue reading the data where the exclusion criteria had already been applied to the data for both studies (where there were 467 rows of data- i.e. one row for each of the participants who had met their inclusion criteria- as mentioned earlier). This was made our job easier as, when we were reading the exclusion criteria, we had a mini panic attack when we saw the following criteria:

  • “Immediately after rating their personality, participants were asked to indicate whether they answered the scale with regard to their “typical personality prior to the COVID-19 pandemic” or their “personality during the COVID-19 pandemic.” As pre-registered, we excluded 142 participants who answered the latter or failed to answer.”

  • “Excluding inattentive participants, participants who provided the same answer 12 times in a row, those who missed more than approx. 20% of items (Cut offs: i.e. 2 out of 10 from our measure of lethargy, 3 out of 12 on extraversion, 3 out of 19 on connectedness). If participants were missing data below these cut offs, we used the mean of each individual’s remaining responses to impute their missing responses.Finally, if participants completed the Time2 survey twice (which we did not foresee), we used only their first survey response”.

Although the authors didn’t make our replication journey easier, after seeing the struggles other groups encountered with applying the exclusion criteria, I felt super greatful that we were able to skip over this obstacle. Therefore we were able to dive straight into replicating the descriptive statistics (YAY!).

Starting the actual process:

We started off by checking the number of participants to double check the number of participants (though it looked correct from the number of rows but I am an overthinker so bear with me).

#Number of participants
count(study1_raw)
# A tibble: 1 × 1
      n
  <int>
1   467

So we had the correct number of participants (shocker :) ). Now to determine the gender percentage.

Gender identity of participants

Note: Study reproted that 77% of participants were women.

Again another blessing is that the authors wrote “man”, “woman”, “other” or “decline to answer” when reporting the gender data, so we did not have have to rename variables, so we did not encounter many issues when replicating the gender percentage. Or so I thought. So as wre may have established I have very poor (even that cannot describe it) coding skills. Terrible. So Google had immediately become my best friedn from the first step. So afterf a quick google search of ‘how to find percentage in R’, it was recommended that the easiest way was to use the dplyr package using this example code:

library( dplyr )
students %>% 
    group_by( gender ) %>% 
    summarise( percent = 100 * n() / nrow( students ) )

Which is sadly what I had attempted to do based on our raw data in the code below:

# Determine gender percentage 
study1_raw %>% 
  group_by(Gender) %>% 
  summarise(percent = 100 * n()/nrow(study1raw)) %>% 
  filter(Gender == "woman")  #Fail

This only produced an error stating that

Error in group_by(., Gender) : object 'study1_raw' not found

That is when I had realised that adding the filter function was me overthinking as I believed I needed the function to only get a percentage for the gender variable only. So once I had removed that line:

study1raw %>% 
  group_by(Gender) %>% 
  summarise(percent = 100 * n()/nrow(study1raw)) #Success
# A tibble: 4 × 2
  Gender              percent
  <chr>                 <dbl>
1 [Decline to Answer]   0.428
2 Man                  21.6  
3 Other                 0.857
4 Woman                77.1  

It worked! (Just goes to just follow instructions and not doing your own thing). We can see that, similar to the data- I had also gotten 77% for women. Okay so now to move onto our main dilemma as a group.

Age descriptives (or as we call it, the ‘Age Dilemma’):

Note: Study reproted that mean age = 20.89, SD = 3.03.

This section was our main challenge in the demographics section of the replications.

So initially we decided to filter out the “decline to answer” responses (where people did not want to give researchers their ages) using the ‘age != “[Decline to Answer])”’ and use the summarise() function to find the mean age and SD of participants:

#average age attempt 1
study1_raw %>% 
  filter(age != "[Decline to Answer]") %>% 
  summarise(mean(age), sd(age))    
               #FAIL (would work if age was numeric)

To which I kept receiving this error message:

! object 'age' not found

After searching on google and double checking why the age variable cannot be found in the raw study 1 data, I decided to relook over the code until I realised I had said age and not Age (face palm):

study1_raw %>% 
  filter(Age !="[Decline to Answer]") %>% 
  summarise(mean(Age), sd(Age))

However, the joy of not getting the same error message was short-lived as the next error message was:

# A tibble: 1 × 2
  `mean(Age)` `sd(Age)`
        <dbl>     <dbl>
1          NA      3.03
Warning message:
In mean.default(Age) : argument is not numeric or logical: returning NA

BUTTTT…. we got the SD (YAY!). So we know the summarise() function is working. Now to make the age numeral numeric rather than a character variable () In our second attempt we tried to use the .subset (rather than filter) to filter out the age variable. Again did not work. In both cases, the error message was that Age was not a numeric variable. Therefore, our next steps was to convert it into a numerical value. 

So in my first attempt, the suggested process to convert to numerical was to check the class type:

# check variable type 
class(study1_raw)  # supposed to be getting [1] "character"????
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

and then to use the recode() in the dplyr package then ifelse() to convert from categorical to numerical then match() function to rename the character variable into a numeric one. However, after seeing the output of the first line…:

dplyr::recode (Age_variable, "Decline to Answer" = 0)
Error in UseMethod("recode") : 
  no applicable method for 'recode' applied to an object of class "c('spec_tbl_df', 'tbl_df', 'tbl', 'data.frame')"

….(and being heavily confused as to what I was supposed to be typing in the brackets):

data <- c("a","b","c","b","a")

dplyr::recode(data, a = 1, b = 2, c = 3)
## [1] 1 2 3 2 1

ifelse(data=="a", 1, ifelse(data=="b", 2, 3))
## [1] 1 2 3 2 1

oldvals <- c("a", "b", "c")
newvals <- c(1, 2, 3)
newvals[match(data, oldvals)]
## [1] 1 2 3 2 1       

(#example code used in [website] (https://universeofdatascience.com/how-to-recode-character-variables-in-r/)), I abandoned that method pretty quickly and ran to find an easier one. However, what did pique my interest is how recoded variables can be converted into each other using as.numeric and the as.character functions.

So then began the process of us combining different methods of filtering the “Decline to Answer” responses and trying to convert the age variable into a numeric one:

#convert age to numeric variable attempt 2
age.subset <- age.subset %>% 
  as.numeric(as.character(age))    #FAIL

#attempt 3
demographics %>% mutate(new_age = age != "[Decline to Answer]") #FAIL

#attempt 4
age.subset %>%  mutate(age_test = numeric(age.subset))  #FAIL

Here, we used the mutate function using the dplyr package to create a new variable known as new_age, then again tried to filter out the needed responses by using the above methods. Again, fail. So it was just a repetitive process of trial and error until eventually we figured out the amazing $. This allowed us to extract the specific Age part of the study1_raw data. Then we allowed Age to become a numeric variable by using the “as.numeric” function.We were now able to calculate the age means and SD and by using na.rm we were able to skip over any “Decline to Answer” values:

#THE FIX
study1_raw <- read_csv("study1_data.csv")


study1_raw$Age <- as.numeric(study1_raw$Age) 
                        #Converts "chr" to numeric form.

study1_age <- study1_raw %>% 
  summarise(mean_age = mean(Age, na.rm = TRUE), 
            sd_age = sd(Age, na.rm = TRUE), 
            min_age = min(Age, na.rm = TRUE), 
            max_age = max(Age, na.rm = TRUE)) #SUCCESS

Output:

# A tibble: 1 × 4
  mean_age sd_age min_age max_age
     <dbl>  <dbl>   <dbl>   <dbl>
1     20.9   3.03      17      44

Although the minimum and maximum age was not reported in the document, we decided to flex our newfound abilities ;) So YAY! And thus we commence to the next stage …. STUDY 2 DEMOGRAPHICS!!!

Study 2:

So we started off by loading the relavent packages and reading the data.

# load packages 
library(tidyverse) #for most of the functions needed
library(dplyr) # tools for efficient data manipulation

# read the data
study2_raw <- read_csv(file= "Study 2.csv")

Then we began the replication process starting with time 1 (pre-pandemic).

Time 1:

“Participants (N=386; age:M=31.66, SD = 11.86; 55% Male; 80% White; 46% single/ never married; 31% U.S, 27% U.K) were recruited from Prolific Academic)”

As mentioned in Study 1, the exclusion criteria had already been applied, thus we only have data for the N= 336 participants who had “completed both the Time 1 and Time 2 surveys and met (their) pre-registered inclusion criteria”. Thus, as we were missing 50 rows of data, we were unable to reproduce the demographic statistics at Time 1, only Time 2.

Time 2:

“Our final sample comprised 336 participants (age: M=32.03, SD= 11.94; 55% Male; 80% White; 45% single/never married; 32% U.S; 27% U.K) wjho completed both our Time 1 and Time 2 surveys and met our pre-registered inclusion criteria.”

For Time 2, I started off by double checking the number of participants who met the criteria were all reported in the data:

count(study2_raw)
# A tibble: 1 × 1
      n
  <int>
1   336      #YAY

As we had already reproduced the age and gender demographics in study 1, we figured it would be easier to start with those (and we were right- no problems). For gender we decided to assign a variable (which we called ‘study2_gender’) to the study2_raw environment. Using the dplyr package, we were able to use the group_by() function to group data framed, according to the specified gender variable. We then used the summarise() function to create a new data frame and creating three combinations of grouping variables- as seen in the output below (‘male’, ‘female’, ‘non-binary’). We then calculated the percentage for each grouping combination and put them in descending order in order to determine the most common gender within the participant data:

Gender:

study2_gender <- study2_raw %>%          #Ensures a new table is saved to the environment
  group_by(Gender) %>%                   #Groups the data based on the specified variable.
  summarise(n = n(),                     #Counts the number of unique instances for the variable.
  Percent = 100 * n()/nrow(study2_raw)) %>%        #Calculates the percentage - representative percentage for the instance (100 x number of instances/total rows)
  arrange(desc(n)) #arranges from largest to smallest

0utput:

# A tibble: 3 × 3
  Gender         n Percent
  <chr>      <int>   <dbl>
1 Male         184  54.8     # close enough if rounded, i guess?
2 Female       151  44.9  
3 Non-Binary     1   0.298

For the age variable, we realised that there were no “Decline to Answer” responses (thank god!), so the age variable was numeric and logical. Therefore we di not encounter the issues faced in Study 1 and were able to just assign the ‘study2_age’ variable to to study2_raw and calculate the means as per usual using the summarise() function from the dplyr package. #### Age:

study2_age <- study2_raw %>% 
  summarise(mean_age = mean(Age), 
            sd_age = sd(Age), 
            min_age = min(Age), 
            max_age = max(Age))
# A tibble: 1 × 4
  mean_age sd_age min_age max_age
     <dbl>  <dbl>   <dbl>   <dbl>
1     32.0   11.9      18      72     # SUCCESS (or close enough when rounded)

Next we needed to reproduce the percentages of ethnicity (80% White) and country’s (32% US, 27% UK) that participants belonged to. Here, since it was percentages we decided to adapt the gender method of finding percentages from study 1. Here we assigned the variables (study2_ethnicity or study2_country) to study2_raw to create a new table in the environement of the relevant data percentages for each assigned variable. The data was again arranged in descending order-using the arrange(desc(n)) via dplyr package- to ensure that the percentages are in order from most to least popular participant ethnicity/ country: #### Ethnicity:

study2_ethnicity <- study2_raw %>%  #Identical code used here as was used for gender.
  group_by(Ethnicity) %>% 
  summarise(n = n(), Percent = 100 * n()/nrow(study2_raw)) %>%
  arrange(desc(n))
# A tibble: 8 × 3
  Ethnicity                          n Percent
  <chr>                          <int>   <dbl>
1 White                            268  79.8      # SUCCESS
2 Asian                             30   8.93 
3 Hispanic                          14   4.17 
4 More than one                     10   2.98 
5 Black                              9   2.68 
6 Middle-Eastern                     3   0.893
7 American Indian/Alaskan Native     1   0.298
8 Other                              1   0.298

Country:

study2_country <- study2_raw %>% #Identical code used here as was used for gender.
  group_by(Country) %>%
  summarise(n = n(), Percent = 100 * n()/nrow(study2_raw)) %>%
  arrange(desc(n))
# A tibble: 29 × 3
   Country      n Percent
   <chr>    <int>   <dbl>
 1 USA        104   31.0     # not 32%???? not sure why (*cue OCD*)
 2 UK          90   26.8     # close enough :) 
 3 Poland      29    8.63
 4 Portugal    23    6.85
 5 Canada      15    4.46
 6 Greece      11    3.27
 7 Italy       10    2.98
 8 Germany      6    1.79
 9 Mexico       6    1.79
10 Slovenia     5    1.49
# … with 19 more rows

Although we did not get exactly 32% for the nuUploading file…_qv19tn0f1 mber of participants from the USA, it was close enough and we will just move on and blame the authors and reproducibility crisis.

Marriage stats:

The main issue we encountered when reproducing Study 2 demographics is that they did not provide their marriage data, so we were unable to reproduce the 45% single/never married data (kind of random since they provided everything else). Here, Samuel decided to email the authors asking them for their marriage data. Initially Dunigan Folk was emailed to which he responded (was very unexpected). However, he emailed us to email his coauthor Dr Okabe-Miyamoto as she had the data with her.

However, we never got a response and therefore could not replicate the data. So on to stage 2: MEAN/SD!

Mean/SD:

Fortunately for us, we were able to get through this stage with minor challenges.
## Study 1: After having identified the Mean/SD that need to be identified, we decided to start off with physical distancing:

Physical distancing:

**Physical/Social distancing: - “(98.5%) reproted practicing physical/social distancing” - “No one outside their household came within 6 feet of them the day before (Mode= 0, M = 0.77, SD = 1.39)”

As before, the physical distancing percentage was also produced by adapting the original gender percentage and therefore we were able to easily reproduce this. Again, we assigned the ‘study1_socialdistancing’ variable to study1_raw where we grouped the data based on the ‘SocialDistancing variable’ and then the summarise() function from the dplyr package to determine the percentage of people who social distanced:

# physical/social distancing 
study1_socialdistancing <- study1_raw %>%          # creates a new variable in the environment 
  group_by(SocialDistancing) %>%                   # groups data by social distancing 
  summarise(n = n(),                               # counts the number of instances 
            percent = 100 * n()/nrow(study1_raw))  # find the percentage 
A tibble: 2 × 3
  SocialDistancing     n percent
             <dbl> <int>   <dbl>
1                1   460   98.5   # YAY
2                3     7    1.50

Off to a great start! WOOO! Next we reproduced the 6 feet distancing stat. Here, I had to resort to google to determine how mode can be calculated. Here the it was suggested to install the ‘library(modeest)’ package. After assigning variables, mlv was used to used as a generic function to compute a generic estimate of the mode of the study1_sixfeet data. Here, the “mfv” method was used to return the mode of the data:

# attempt 1
library(modeest)
study1_sixfeet <- study1_raw %>% 
mlv(study1_sixfeet, method = "mfv") # FAIL

However, the following error message was produced:

Error in mlv.default(., study1_sixfeet, method = "mfv") : 
  is.numeric(x) is not TRUE

After inserting the error message into Google it was suggested to use the shapiro.test to determine whether the data is normally distributed:

#edit:
library(modeest)
study1_sixfeet <- study1_raw 
shapiro.test(study1_sixfeet) 
mlv(study1_sixfeet, method = "mfv")
Error in shapiro.test(study1_sixfeet) : is.numeric(x) is not TRUE
> mlv(study1_sixfeet, method = "mfv")
Error in mlv.default(study1_sixfeet, method = "mfv") : 
  is.numeric(x) is not TRUE

However, I got another error message. Although i could have continued looking into it I felt like this was way too complicated just to figure out a mode. So I decided to switch tactics. As R does not have a built in function to determine mode, another website suggested creating a user function to calculate the mode of the data. Here the vector study1_sixfeet is used as input and the mode is produced as an output.

#attempt 2:
# Create the function
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Calculate mode using user function 
six_feet <- getmode(study1_sixfeet)
print(six_feet)

# Create vector and summarise() 
study1_sixfeet <- study1_raw %>% 
  summarise(mode_six_feet = getmode(SixFeet)) # SUCCESS

Output:

# A tibble: 1 × 1
  mode_six_feet
          <dbl>
1             0   # SUCCESS

NOTE: - GEORGIA’S EASIER METHOD - EXPLAIN ABOVE CODE

However, at a group meeting I realised that although Samuel and I had calculated the mode in a similar way, Georgia was able to create a much more efficient code by installing the lsr package to compute mode then using the modeOf() function to get the mode. She was also able to combine her method of calculating the mode whilst simulatenougly calculating the mean and SD:

install.packages("lsr")        # used to compute mode 
six_feet <- study1_raw %>%                          # creates a variable from raw data 
  summarise(modeOf(SixFeet), mean(SixFeet), sd(SixFeet))  # calculates mode, mean and sd

However, when I tried to run the code on my R console, I would receive an error message:

! could not find function "modeOf"

When I tried replacing “modeOf” with just mode(), the output was:

six_feet <- study1_raw %>%                          # creates a variable from raw data 
  summarise(mode(SixFeet), mean(SixFeet), sd(SixFeet))  # calculates mode, mean and sd
# A tibble: 1 × 3
  `mode(SixFeet)` `mean(SixFeet)` `sd(SixFeet)`
  <chr>                     <dbl>         <dbl>
1 numeric                   0.767          1.39

The mean and SD would be calculated but not the mode? —> ask Georgia about this.

Overall, this was a lesson on the beauty of collaborating rather than dividing and conquering where although it may be a success that you were able to reproduce the data, the ultimate goal is to be able to reproduce it in a way that is efficient. It also shows that there is more than one correct way to approach the same value and that by working with your team members, you are able to see that range in diverse and new approaches.

For the follwoing reproduced values, we aimed to reproduce them using both the in-text and table 1 values.

Changes in social connectedness:

  • Changes in social connectedness:
    • “…Participants reported lower levels of social connection during the COVID-19 pandemic (Time 2: M= 3.97, SD= 0.85) than before (Time 1: M = 4.11, SD = 0.88)”.

Next we tried to replicate the change in levels of social connection. As seen in the code book, the authors had labelled ‘average levels of social connection’ as ‘SCAVERAGE.T1’ for pre-pandemic and ‘SCAVERAGE.T2’ for post-pandemic. To replicate the means/SD, we looked at the R code provided to us by the authors in the OSF file provided in the study 1 where they calculated the mean (mean()) and SD (sd()) SCAVERAGE at Time 1 and Time 2 by subseting the relevant SCAVERAGE variables using the $ sign. And after copy-pasting that and running it in R, we happily found that it did reproduce the in-text and relevant table values for social connectedness (YAY!- thank you authors!).

#changes in social connectedness:
mean(study1_raw$SCAVERAGE.T1)
mean(study1_raw$SCAVERAGE.T2)
sd(study1_raw$SCAVERAGE.T1)
sd(study1_raw$SCAVERAGE.T2)
> mean(study1_raw$SCAVERAGE.T1)
[1] 4.113212                    # mean social connectedness at Time 1
> mean(study1_raw$SCAVERAGE.T2)
[1] 3.974868                    # mean social connectedness at Time 2
> sd(study1_raw$SCAVERAGE.T1)
[1] 0.8785128                   # SD social connectedness at Time 1
> sd(study1_raw$SCAVERAGE.T2)
[1] 0.8496547                   # SD social connectedness at Time 2

Table 1 and Lethargy:

Next we decided to reproduce table 1:

Lethargy levels (wellbeing proxy): - “To test whether momentary lethargy levels changed from before to during the pandemic….lethargy increased from Time 1 (M= 2.60, SD = 1.16) to Time 2(M= 3.16, SD = 1.27)” –> note: same values as table

(Note: APA format for table1?????)

Here, we started off by creating a data frame where we assigned study1_table1 to the study1_raw data where the study1_table1 variable had the unnecessary columns filtered out where we used the select() function in dplyr to only leave the variables that were reported in Table 1 (as shown below):

study1_table1 <- study1_raw %>%       # creating a new data frame 
  select(LETHAVERAGE.T1,              # this data frame only includes the variables that are in table 1
         LETHAVERAGE.T2, 
         LethDiff, 
         SCAVERAGE.T1, 
         SCAVERAGE.T2, 
         SCdiff, 
         EXTRAVERSION)

We then added another ‘table 1’ variable to the environment where we used the describe() function- using the psych package- to get the selected descriptive statistics of the given study1_table1 dataset:

library(psych) 
table1 <- describe(study1_table1)

However, we realised that it gives you all of the varying types of descriptives:

> table1
               vars   n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se
LETHAVERAGE.T1    1 467  2.60 1.16   2.30    2.50 1.19  1.00 6.00  5.00  0.70    -0.30 0.05
LETHAVERAGE.T2    2 467  3.16 1.27   3.00    3.12 1.33  1.00 6.00  5.00  0.27    -0.90 0.06
LethDiff          3 467  0.56 1.33   0.40    0.54 1.33 -3.30 4.90  8.20  0.21     0.13 0.06
SCAVERAGE.T1      4 467  4.11 0.88   4.16    4.14 0.94  1.16 5.95  4.79 -0.29    -0.34 0.04
SCAVERAGE.T2      5 467  3.97 0.85   4.00    3.99 0.94  1.63 5.95  4.32 -0.14    -0.62 0.04
SCdiff            6 467 -0.14 0.71  -0.11   -0.12 0.62 -3.05 2.16  5.21 -0.42     1.60 0.03
EXTRAVERSION      7 467  4.17 1.01   4.17    4.15 1.11  1.50 6.75  5.25  0.09    -0.53 0.05    #Woah....but YAY!

Since we only wanted means and SD, we again decided to use the select function to filter out and leave only the mean and SD columns. Here all the data replicated with the values given in Table 1.

library(psych) 
table1 <- describe(study1_table1) %>%   # create a table that summarises table 1 variables 
  select(mean, sd)                      # only include means and sd's 
                mean   sd
LETHAVERAGE.T1  2.60 1.16
LETHAVERAGE.T2  3.16 1.27
LethDiff        0.56 1.33
SCAVERAGE.T1    4.11 0.88
SCAVERAGE.T2    3.97 0.85
SCdiff         -0.14 0.71
EXTRAVERSION    4.17 1.01        #SUCCESS

Social connectedness x extraversion levels:

  • Exploratory Analysis: “…found that extraversion was weakly associated with declines in social connectedness. To unpack this finding, we split our sample into those with an extraversion score at or below in the 25th percentile and those at or above the 75th percentile. Our most intraverted participants exhibited a small drop in social connectedness between Time 1 (M = 3.45, SD= 0.70) and Time 2 (M = 3.35, SD = 0.66)”…; whereas the most extraverted participants exhibited a larger drop… in social connectedness between Time 1 (M = 4.70, SD = 0.72) and Time 2 (M = 4.45, SD = 0.84)“.

As soon as I saw quantiles… I knew I was finished. So I decided to look at their R code to see how they had done it.

Most intraverted x social connectedness:
#Have most introverted participants experienced changes in connectedness
quantile(study1_table1$EXTRAVERSION) #finding quartiles 
BottomQuarter<- subset(study1_table1, EXTRAVERSION <= 3.41667) #Making subset of data with only participants in bottom quartile of extraversion
BottomQuarter
BottomQuarter<-as_tibble(BottomQuarter)
BottomQuarter

mean(BottomQuarter$SCAVERAGE.T2)
sd(BottomQuarter$SCAVERAGE.T2)
mean(BottomQuarter$SCAVERAGE.T1)
sd(BottomQuarter$SCAVERAGE.T1)

Output:

> mean(BottomQuarter$SCAVERAGE.T2)
[1] 3.351172                        # mean for most introverted at Time 2
> sd(BottomQuarter$SCAVERAGE.T2)
[1] 0.655226                        # SD for most introverted at Time 2
> mean(BottomQuarter$SCAVERAGE.T1)
[1] 3.446572                        # mean for most introverted at Time 1
> sd(BottomQuarter$SCAVERAGE.T1)
[1] 0.7024337                       # SD for most introverted at Time 1

# SUCCESS... they all match the in-text reported data (making life so much easier buttt...huh?)

The code was defintely far from easy to read and understand. Especially for someone as helpless at coding as me. So I decided to research what they had done line by line. –> used bottomquarter to detect people with the lowest extraversion score

Most extraverted x social connectedness:

Okay, so they provided the code for the ‘most introverted’ participants? What about the most extraverted?:

#Have most extraverted participants experienced changes in connectedness
quantile(Data$EXTRAVERSION) #finding quartiles
UpperQuarter<- subset(Data, EXTRAVERSION >= 4.8333) #Making subset of data with only participants in top quartile of extraversion
UpperQuarter<-as_tibble(UpperQuarter)
UpperQuarter

mean(UpperQuarter$SCAVERAGE.T1)
sd(UpperQuarter$SCAVERAGE.T1)
mean(UpperQuarter$SCAVERAGE.T2)
sd(UpperQuarter$SCAVERAGE.T2)
mean(UpperQuarter$SCAVERAGE.T1)
[1] 4.704611                     # mean 
> sd(UpperQuarter$SCAVERAGE.T1)
[1] 0.7228096
> mean(UpperQuarter$SCAVERAGE.T2)
[1] 4.445344
> sd(UpperQuarter$SCAVERAGE.T2)
[1] 0.835402

#SUCCESS (THANK YOU AUTHORS!!!)

However, while I had taken the lazy (or smart?) way of reproducing it- thats the point of open data after all, right?-, Georgia went the extra mile and did it from scratch:

# Introverts vs Extraverts  -----------------------------------------------

# split the sample 

# find the quantiles 
study1_quantile <- quantile(study1_raw$EXTRAVERSION) 

# Firstly i tried to split so that there were equal participants in each quantile 
# doesnt produce the right statistics 
q = c(0.25, 0.75)

study1_raw %>% 
  group_by(EXTRAVERSION) %>% 
  summarise(quant25 = quantile(EXTRAVERSION, probs = q[1]),
            quant75 = quantile(EXTRAVERSION, probs = q[2]))


introverts <- study1_raw %>% mutate(quartile = ntile(EXTRAVERSION, 4)) %>% 
  select(EXTRAVERSION, quartile, SCAVERAGE.T1, SCAVERAGE.T2) %>% 
  filter(quartile == 1)

extraverts <- study1_raw %>% mutate(quartile = ntile(EXTRAVERSION, 4)) %>% 
  select(EXTRAVERSION, quartile, SCAVERAGE.T1, SCAVERAGE.T2) %>% 
  filter(quartile == 4)

introvert_meanSC <- describe(introverts) %>% 
  select(mean, sd)

extravert_meanSC <- describe(extraverts) %>% 
  select(mean, sd)


## THE FIX
# quantile split by score (quantile cutoff)
study1_quantile <- quantile(study1_raw$EXTRAVERSION) # find quantiles 

# create data subsets 
study1_introverts <- study1_raw %>%                        # create new variable using raw data 
  select(EXTRAVERSION, SCAVERAGE.T1, SCAVERAGE.T2) %>%     # only including three variables 
  filter(EXTRAVERSION <= 3.41667)                          # filtering out data so that only the most introverted (first quantile) are included in this subset 
# this results in 119 people in this group ^

study1_extraverts <- study1_raw %>%                        # create a new variable using raw data 
  select(EXTRAVERSION, SCAVERAGE.T1, SCAVERAGE.T2) %>%     # only including three variables of interest 
  filter (EXTRAVERSION >= 4.83333)                         # only including the most extraverted (4th quatile cutoff)
# This results in 130 people in this group ^


# Finding the means and Sd's 
introvert_meanSC <- describe(study1_introverts) %>%        # new variable using most introverted data 
  select(mean, sd)                                         # calulates only means and sd's 
 
extravert_meanSC <- describe(study1_extraverts) %>%        # new variable using most extraverted data
  select(mean, sd)                                         # calulates means and sds 
  • WALK THEM THROUGH ABOVE CODE

Study 2:

Physical distancing:

Physical/Social distancing: - “92.9% of participants reported practicing physical/ social distancing”.
- “No one got within 6 feet of them the day before (Mode = 0, M = 1.11, SD= 0.75)”

Again, we started off with the replicating the social distancing descriptives. Here, we used the same method as Study 1:

study2_socialdistancing <- study2_raw %>%          # creates a new variable in the environment 
  group_by(SocialDistancing) %>%                   # groups data by social distancing 
  summarise(n = n(),                               # counts the number of instances 
            percent = 100 * n()/nrow(study2_raw))  # find the percentage 

However, when I ran the code, I got a different value to what was reported in the text (92.9%):

# A tibble: 2 × 3
  SocialDistancing     n percent
             <int> <int>   <dbl>
1                0    23    6.85
2                1   313   93.2  # not 92.9%???

After running my other grouo member’s code, they had also received the same output, so it must have been another error on the author’s part.

Next we replicated the six feet distancing values again using the same code as in Study 2:

# Create the function
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

#method 1:
  
  # Calculate mode using user function 
six_feet <- getmode(study2_sixfeet)
print(six_feet)  

Output:
  mode_six_feet
1             0
 #SUCCESS (but that just gets us the mode)
 
#method 2:

# Create vector and summarise() 
study2_sixfeet <- study2_raw %>% 
  summarise(mode_six_feet = getmode(SixFeet),mean(study2_raw$SixFeet), sd(study2_raw$SixFeet))

Output:
  mode_six_feet mean(study2_raw$SixFeet) sd(study2_raw$SixFeet)
1             0                 1.116071               1.753864      #SUCCESS (more efficient code that develops mean, sd and mode but.....)

Why is the SD= 1.75 rather than 0.75? Upon comparing my group’s code to mine they also received the same result (and their’s look a lot more efficient than mine):

study2_six_feet <- study2_raw %>%
  summarise(mode_six_feet = mode(SixFeet), #turns out, mode isn't a real function.
            mean_six_feet = mean(SixFeet), #This looks right.
            sd_six_feet = sd(SixFeet)) #Big discrepancy!

Output:
  mode_six_feet mean_six_feet sd_six_feet
1       numeric      1.116071    1.753864

We then decided to run the author’s R code:

# Social Distancing
install.packages("psych")
install.packages("plyr R")
count(study2_raw$SixFeet)
describeBy(study2_raw$SixFeet))

Although I tried to run the code multiple times and ensured I installed the right packages, typed the command correctly, etc, I could not seem to get rid of the following error code:

Error in UseMethod("count") : 
  no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
  
  or 
  
  Error in describeBy(study2_raw$SocialDistancing) : 
  could not find function "describeBy"

and having searched the internet for ways to fix the error (maybe describe.by? nope), it did still produce SD=1.75 for my group members when they ran it so I will take their word for it to reduce my anxiety levels. It really is a crisis :)

  • fix study 1 code as well and see if it can also work as well

Table 3:

Next we decided to replicate table 3. Here we started off by installing the relevant psych package to be able to use the describe() function to calculate statistical data where we used the select() function to specifically only have the relevant mean/sd columns in the output:

study2_summary <- describe(study2_raw[, c('T1SWLS', 
                                          'T2SWLS', 
                                          'SWLS_Diff', 
                                          'T1Lonely', 
                                          'T2Lonely',
                                          'Lonely_Diff',
                                          'T1BMPN',
                                          'T2BMPN',
                                          'BMPN_Diff',
                                          'T1Extraversion')]) %>% 
  select(mean,sd)  #FAIL

However, this produced the follwoing error message:

Error in describe(study2_raw[, c("T1SWLS", "T2SWLS", "SWLS_Diff", "T1Lonely",  : 
  could not find function "describe"

After googling the error message, a solution suggested was to load the ‘library(’testthat’)’ package. Since there was “no package called ‘testthat’” as R beutifully stated, I had to use ‘install.packages(“testthat”)’. It was suggested to load this package before loading the ‘psych’ package. I then used the same code as mentioned above…:

install.packages("testthat")
library(psych)
study2_summary <- describe(study2_raw[, c('T1SWLS', #T1 life satisfaction
                                          'T2SWLS',  # T2 life satisfaction 
                                          'SWLS_Diff',  # Life satisfaction change
                                          'T1BMPN',        # T1 Relatedness score 
                                          'T2BMPN',        #T2 Relatedness score 
                                          'BMPN_Diff',     # Relatedness difference score
                                          'T1Lonely',    # T1 Loneliness
                                          'T2Lonely',    # T2 Loneliness
                                          'Lonely_Diff',  # Loneliness change (T2-T1) 
                                          'T1Extraversion')]) %>%  # Extraversion level pre-pandemic
  select(mean,sd)

And it produced:

                mean   sd
T1SWLS          3.97 1.53                             
T2SWLS          3.99 1.45
SWLS_Diff       0.02 0.88
T1BMPN          4.92 1.09
T2BMPN          4.91 1.14
BMPN_Diff      -0.01 1.11
T1Lonely        2.12 0.65
T2Lonely        2.06 0.62
Lonely_Diff    -0.06 0.40
T1Extraversion  3.90 0.79   # All the same as table 3 values! Yay!

IT WORKED!

  • I then compared each of these values to the table values.
  • (samuel v natalie discepepncy)

Exploratory analysis: Extraversion x Loneliness:

# STUDY2: Split Quantile by scores
study2_quantile <- quantile(study2_raw$T1Extraversion)

# Create and filter lower quantile of extroverts (introverts)
library(dplyr)
study2_introverts <- study2_raw %>% 
  select(T1Extraversion, T1Lonely, T2Lonely) %>% 
  filter(T1Extraversion <= 3.33)
# There is 80 people as a result in this group (Quantile 1). 

# Create and filter upper quantile of extroverts (most extroverted)
study2_extroverts <- study2_raw %>%
  select(T1Extraversion, T1Lonely, T2Lonely) %>%
  filter(T1Extraversion >=4.416777)
# This results in 83 people as being categorised as the most extroverted

# Calculate mean and sd's of each of the above subsets
library(psych)
introvert_meanL <- describe(study2_introverts) %>%
  select(mean,sd)

Output for introversion:

introvert_meanL
               mean   sd
T1Extraversion 2.90 0.29
T1Lonely       2.53 0.63
T2Lonely       2.31 0.63    # T1Lonely is off by 0.02 for both mean and SD

Most Extraverted:

extrovert_meanL <- describe(study2_extroverts) %>%
  select(mean,sd)
> extrovert_meanL
               mean   sd
T1Extraversion 4.95 0.34
T1Lonely       1.63 0.49   # T1Lonely is off by 0.02 for both mean and SD 
T2Lonely       1.67 0.49
  • read comments + library package issue + walk them through it
  • look at comments on exploratory data
  • compare to wtf they did (hard to read) –> try to understand tho
  • DONT FORGET TO COMPARE EACH STAGE/ MAIN STAGES TO THE CODE THAT THE AUTHORS PROVIDED WATCH NEVARRO VIDZ TO BETTER UNDERSTAND SUMMARISE() ETC AND THEN READ OVER TO PROVIDE BETTER EXPLANATIONS

Figures:

Study 1:

Figure 1:

When replicating figure 1, I started off by identifying what type of graph it was (a histogram), and then simply search up ‘how to create a histogram in R’ and viola, I got an amazing youtube video that took us step-by-step on how to make a histogram, producing the following code using the tidyverse package:

figure1 <- study1_raw
hist(study1_raw$SCdiff,  # to make histogram
     breaks = 13,  # number of bars in histogram
     name = "Distribution of Social Connectedness Difference Scores"
     )

However, the video did not include how to label the axis or how to name the graph (which I kind of fluked by writing name = , but as you can guess that did not work so google it is!). A quick google search informed me to use the main, xlab and ylab parameters:

#attempt 2
figure1 <- study1_raw 
hist(study1_raw$SCdiff, # create a histogram fro SCdiff
     breaks = 13,   #number of bars in the histogram
     main = "Distribution of Social Connectedness Difference Scores", # name of graph
     xlab = "Social Connectedness Difference Score (T2-T1)", # x axis label
     ylab = "Frequency",   # y axis label
)   #SUCCESS

This produced the following graph:

Yes they look similar!: (Never have I been more grateful to the authors that they did not try to be fancy by adding different colours)

However, when liasing with my team members, I found that they were not so lucky and had tried to use ggplot and the geom_histogram functions to produce the graph:

# Failed Attempts 

figure1<- ggplot(study1_raw, aes(x = SCdiff)) + 
  geom_histogram(colour = "black" , fill = "grey", bins=12) +                                                 
  ggtitle("Distribution of Social Connectedness Difference Scores") +       # for the main title 
  xlab("Social Connectedness Difference Score (T2 - T1)") +                 # x axis label 
  ylab("Frequency")                                                         # y axis label 
# doesnt work ^^


#still doesnt work  
figure1 <- ggplot(study1_raw) + geom_histogram(
  mapping = aes(x = SCdiff), colour = "black", fill = "grey", bins = 13)

Until eventually they had reached the same code I had:

successful attempt:
figure1 <- hist(study1_raw$SCdiff,                                                  # create a histogram for SCdiff variable
                main = "Distruibution of Social Connectedness Difference Scores",   # for the main title 
                xlab = "Social Connectedness Difference Score (T2 - T1)")           # for the x axis label

Keep in mind that our approach was a divide and conquer - so yes it was not the best idea and we could have saved a lot of time if we had just done them together and communicated more consistently. It was also hard to see what other team members were doing in their R studio, so we would have to wait until they post it in the shared HackMD document to be able to compare codes (you live and you learn). Not very efficient but we got there in the end.

Study 2:

Figure 2:

We then began replicating figure 2 starting with the “Distribution of Relatedness Difference Scores” started off by making a bunch of mistakes for relatedness and then did the same code for loneliness (why not do what you did in study 1? dont ask)

Relatedness:

Initially we started off by installing the packages where we also installed ‘ggplot2’ package to create more complex plots using the ‘study2data’ in the ‘Relatedness_diff’ data frame where we used the select() dplyr function to extract and use only the ‘BMPN_Diff’ (relatedness difference scores) to plot the graph:

# Load packages 
library(tidyverse)
library(ggplot2)

# Read the data
study2data <- read_csv(file= "Study 2.csv") 

# Create data frame
Relatedness_diff <- study2data %>% 
  select(BMPN_Diff)

So we initially made the mistake of trying to use ggplot, specifically the geom_histogram function, to create our graph. Not sure why I thought it would work this time. We tried to add features, such as colour and the number of bars:

relatedness <- ggplot(Relatedness_diff, aes(x = BMPN_Diff)) + 
  geom_histogram(colour = "black" , fill = "grey", bins = 18) +                                                 
  ggtitle("Distribution of Relatedness Difference Scores") +       # for the main title 
  xlab("Relatedness Difference Score (T2 - T1)") +                 # x axis label 
  ylab("Frequency")   # y axis label 

And yes it did work!

Or so we thought. Upon closer inspection, we realised the graphs looked different:

Can you spot the difference? Graph in the study:

As we can see the frequencies of each social difference score in our graph looks different to what was produced in the study. This is particularly evident from the 1 to -1 relatedness difference scores where our columns show more variability. So we decided to use the Study 1 method.

Here, we used the hist() function to produce the distribution of relatedness difference scores histogram and labelled the x and y axis as in Study 1:

hist(Relatedness_diff$BMPN_Diff,           # to create the relatedness diff scores histogram 
  main = "Distribution of Relatedness Difference Scores",       # for the main title 
  xlab = "Relatedness Difference Score (T2 - T1)",                  # x axis label 
  ylab = "Frequency",                                             # y axis label
  breaks = 17)                                       # number of columns 

Which produced the following graph:

Yep definitely much better.

Loneliness:

We then adapted the same method we used for relatedness to the ‘Distribution of Loneliness Scores’ Graph. Again we created a data frame for only the loneliness scores:

Loneliness_diff <- study2_raw %>%
  select(Lonely_Diff)

And then adapted the relatedness method:

# Replicate Loneliness figure 2 graph:
hist(Loneliness_diff$Lonely_Diff,           # to create the relatedness diff scores histogram 
     main = "Distribution of Loneliness Scores",       # for the main title 
     xlab = "Loneliness Difference Score (T2 - T1)",                  # x axis label 
     ylab = "Frequency",                                             # y axis label
     breaks = 40)                                       # number of columns

And yes it produced a graph that looks like the loneliness distribution graph produced in teh study:

Relatedness and Loneliness distribution Graphs:

Now to see how we can combine the Relatedness and loneliness graphs to have them stood side-by-side.

I will admit that we initially thought we just had to reproduce the graphs and that was all. However, after watching the group presentations we figured… yeah we should probably put the graphs together like the study. So we decided to ask the other group as to how they were able to do so where they recommended using the ‘library(cowplot)’ package…:

# package to combine relatedness and loneliness scores:
install.packages("cowplot")
library(cowplot)

…in combination with the ‘plot_grid’ function to be able to arrange our graphs into a grid:

Figure2Study2 <- plot_grid(relatedness, loneliness)  # FAIL because it doesn't work with hist function, only ggplot. 

However, we got excited too soon where upon investigating the error message:

Warning messages:
1: In as_grob.default(plot) :
  Cannot convert object of class histogram into a grob.
2: In as_grob.default(plot) :
  Cannot convert object of class histogram into a grob.

we realised that this method only allows us to arrange ggplots into a grid, not graohs built using the hist() function.

Thus, we had to set out and find a different method. We decided to ask our tutor Yooki where she suggested we use the par() function to combine the two histograns and create multiple plots at once. Our use of this is seen here:

par(mfrow=c(1,2))
hist(study2_raw$Lonely_Diff, main = "Distribution of Loneliness Scores",
     xlab = "Loneliness Difference Score (T2-T1)", xlim=c(-2,2), ylim=c(0,50), breaks = 40)
hist(study2_raw$BMPN_Diff, main = "Distribution of Relatedness Difference Scores",
     xlab = "Relatedness Difference Score (T2-T1)", xlim=c(-6,4), ylim=c(0,80), breaks = 20)

This produced:

  • walk them through this code So…yay!

Or so we thought until we got to our nightmare….. figure 3.

Figure 3:

Exploratory:

Relationship between age and levels of social connectedness:

  • Your question:
    • The first question I want to explore in my data is… whether university students in their first few years had lower levels of social connection, relatedness, loneliness.
      1. I am interested in this question because… my first year of university was when COVID 19 hit. I would envy older university students as they were already able to make those social connections and form friends. With everything being online and no face to face, and having left most of my high school friends, I found it difficult to form these social connections and I want to see if the data is reflective of this.
  • jealous of older kids
  • so as we can see, age bracket of university students tested range from 17-44
  • now ofcourse what year the students are in is unknown from the data so we will have to create some assumptions of the dataset and create and define some parameters
  • these are all undergraduate students so they should all be roughly within their first 4 years of university
  • gap years?
  • this could go in two ways
  • so based on my experience I would assume those older would better be able to have a higher level of social connectedness and lower levels of loneliness during COVID as they had time to form these quality relationships- have more life experience- that would stick during COVID whereas those younger went from school straight to uni so they did not have time to create these relationships
  • however, it could go the other way where older students may feel out of place? and the younger students are better able to adapt to the changes caused by COVID so less changes in connectedness pre to post COVID
  • lets look at data set and see how it goes
  • Questions to ask:
    • how should i operationalise first to last years:
      • double check whther your parameters are ok
  • can i also see if there is a postitive correlation between age and connectedness levels
    Output:
# A tibble: 1 × 4
  mean_age sd_age min_age max_age
     <dbl>  <dbl>   <dbl>   <dbl>
1     20.9   3.03      17      44

Effect of gender on connectedness

Effect of culture on connectedness