Title: Verification-report-final.knit

Part 1 Summary and Reaction:

Summary

COVID-19 is a recent crisis, and the psychological impact of the changes in social behaviour that arose in response to it (chiefly social distancing) is a new area of research that has not yet been thoroughly investigated.

During COVID-19, physical/social distancing of six feet from anyone outside of our household became a well-known preventative measure. The question Folk and colleagues (2020) aimed to address is whether this change in social behaviour altered people’s sense of social connection and wellbeing during the pandemic compared with before it. More specifically, the question was framed in terms of an individual’s personal characteristics: whether the change in an individual’s social connectedness and wellbeing depended on their level of extraversion.

Thus, the study aimed to determine changes in social connectedness and wellbeing in introverts versus extraverts. It was an open question whether introverts’ or extraverts’ overall wellbeing and connectedness would be more affected by COVID-related changes in social behaviour. Extraverts are known to have stronger social relationships, and therefore a stronger support network that can assist them through a crisis. However, an extravert’s wellbeing is also more dependent on their level of social connectedness. Meanwhile, introverts may be less adversely affected by changes in social behaviour as they already had fewer social interactions before the pandemic, so they may not show much of a difference in social connectedness and wellbeing before versus during the pandemic.

In Study 1, 467 undergraduate students from a Canadian university completed a survey at two timepoints, before the pandemic (T1) and during the pandemic (T2), with the same participants from T1 recruited again at T2. The survey contained demographic items, measures of social connectedness, a measure of lethargy (which acted as a proxy for subjective wellbeing) and a measure of extraversion.

Study 1 found that participants reported lower levels of social connectedness during the pandemic than before, although the difference was small in magnitude. Moreover, extraverted participants reported larger drops in social connectedness from T1 to T2 than introverted participants. However, when pre-pandemic connectedness levels were controlled for, extraverts coped better than introverts. Lethargy levels (the proxy for wellbeing) increased from T1 to T2, and changes in lethargy were significantly correlated with changes in social connectedness. Although the data provide insight into the psychological aspects of coping with the pandemic, and into how coping varies with individual differences, Study 1 had a non-representative sample. That is, results obtained from undergraduate students cannot be generalised to the wider population. Thus, Study 2 was conducted.

In Study 2, 336 community adults (primarily from the US and UK) were also surveyed before and during the pandemic. Social connectedness was measured using two scales: relatedness (using the Balanced Measure of Psychological Needs, BMPN) and loneliness. Life satisfaction (using the Satisfaction With Life Scale, SWLS) and each participant’s pre-pandemic level of extraversion were also measured. In both Study 1 and Study 2, each participant’s level of physical distancing was also measured.

Study 2 found that there were no differences in relatedness levels before (T1) versus during (T2) the pandemic. Moreover, participants reported feeling slightly, but significantly, less lonely during the pandemic. In relation to extraversion, the most introverted participants showed a significant decrease in loneliness, whereas the most extraverted showed no improvement. Finally, levels of life satisfaction did not change before versus during the pandemic.

Overall, both studies suggest that, on average, people showed relatively little change in social connectedness despite the changes in social behaviour caused by COVID. This suggests that, although the crisis limited our social interactions, we as a society were able to adapt our social behaviours to meet our need for social connection and belonging, with most of our social interactions moving online.

In relation to the overall relationship between extraversion and connectedness, Study 1 indicated that extraverts were worse off than introverts, and Study 2 indicated that introverts experienced less loneliness. However, when initial (pre-pandemic) social connectedness levels were controlled for, the effect of extraversion on connectedness reversed or disappeared. Thus, the conclusion is that extraverts appear to have fared worse because they had more social connections to lose than introverts, and so showed lower levels of social connectedness and therefore wellbeing. This again provides insight into how individual differences can lead people to cope with the crisis differently.

Reaction:

The paper reminded me of… my own experience during the pandemic. It is true that individual differences can affect the psychological impact of changes in social behaviour. I classify myself as somewhere between an introvert and an extravert, so I found myself able to relate to both. My levels of loneliness and connectedness would constantly alternate: sometimes I would be fine not socialising with anyone for weeks, and at other times I would find myself constantly going to the park just to see people outside of my household.

I wonder whether… the authors should have investigated other measures of individual differences, such as household composition. Although extraversion can be an important factor, it would also be important to compare individuals who live on their own with those who live in a family (I live with a family of six). In this case, I would expect all of the variables that were measured, particularly loneliness, to differ substantially and significantly based on an individual’s household composition. Individuals who live alone and are not in a relationship would be at a higher risk of loneliness, reduced wellbeing, and increased levels of psychological distress, such as depression (Flood 2005; Lauder et al. 2004; Relationships Australia 2011). It is important to investigate this so that we can identify higher-risk individuals and provide them with the mental health support they need, as many people may still be experiencing the aftereffects.

I was surprised that… levels of relatedness did not differ before versus during the pandemic. As relatedness was measured by items such as “I felt close and connected with people who are important to me”, I would have thought that being quarantined with your household members whilst undergoing a crisis would have made individuals feel closer and more connected. Again, this could tie into the household composition aspect, as not everyone lives with their family. Due to rules such as not being allowed to visit other households, individuals may have only been able to communicate with their family members online and therefore felt a sense of disconnection. However, there was essentially no change: the mean change in relatedness score was M = -0.01. So although relatedness reduced, it definitely was not significant.

Part 2:

Verification goals

The goal of the verification section of this report was to reproduce the demographic descriptives, the means/SDs reported in the text and all figures from the article. This was done to get a better idea of the challenges encountered when reproducing other researchers’ studies; these challenges are highlighted throughout the report.

Demographic statistics reported in the study:

Study 1:

Study 2:

Time 1:
Time 2:

Means/SD reported in the text:

Study 1:

In-text results:
- Physical/Social distancing:
- Changes in social connectedness due to COVID:
- Social connectedness x extraversion levels:
- Lethargy levels (wellbeing proxy):

Table results:
- Mean and SD reported in Table 1:

Study 2:

In-text results:
- Physical/Social distancing:
- Extraversion levels:
- Change in relatedness due to COVID:
- Change in loneliness levels due to COVID:
- Loneliness levels x extraversion levels:
- Changes in life satisfaction levels due to COVID:

Table results:

Figures reported in the text:

Study 1:

Figure 1:

Study 2:

Figure 2:

Figure 3:

Steps to replicating data:

Locate the open data

The first step was to locate the open data for both studies, which was easily done by following the OSF links under the Methods section for each study.

The OSF repositories are:
- Study 1
- Study 2

Code book:

We then found the codebook for each study so we could see how each variable was labelled:
- Study 1:
- Study 2:

Replication process:

As mentioned earlier, there were two studies in this article, and so there was a separate csv file for each study. Replication for each study therefore occurred independently, starting with Study 1.

Packages and reading data

# Read the Data STUDY 1-----------------------------------------------------------

# install the packages 
install.packages("tidyverse")  # for most of the functions needed 
install.packages("dplyr")      # for efficient data manipulation
install.packages("lsr")        # used to compute mode 
install.packages("psych")      # used to create table1 
install.packages("cowplot")    # to combine ggplots into a grid
install.packages("ggplot2")    # data visualisations

# load the packages 
library(tidyverse)
library(lsr)
library(psych)
library(dplyr)
library(cowplot)
library(ggplot2)

# read the data 
study1_raw <- read_csv(file = "Study 1 data.csv")
study2_raw <- read_csv(file = "Study 2.csv")

Note that we knew to load some of these packages by looking at the R code provided by the authors in the OSF repository, from which we picked the packages we judged necessary to reproduce the data.

Demographic statistics:

Study 1:

Reading Study 1 data

The raw Study 1 data from the OSF repository was saved as “Study 1 data.csv” in my working directory. The .csv format allows the data to be read into the Rmd file using the read_csv() function, which imports it into R as a tibble (a type of data frame that is easier to print and manipulate).

The result of read_csv() is then assigned (<-) to an object named “study1_raw”.

# read the data 
study1_raw <- read_csv(file = "Study 1 data.csv")

Exclusion criteria:

Lucky for us, there was no issue reading the data, and the exclusion criteria had already been applied to the data for both studies (there were 467 rows of data, i.e. one row for each participant who had met the inclusion criteria, as mentioned earlier). This made our job easier as, when we were reading the exclusion criteria, we had a mini panic attack when we saw the following criteria:

Although the authors didn’t make our replication journey easier, after seeing the struggles other groups encountered with applying exclusion criteria, we were lucky to have skipped this obstacle and could go straight to reproducing the demographics.

Starting the actual process:

Gender identity of participants

Another blessing is that the authors had already coded gender as “man”, “woman”, “other” or “decline to answer”, so we did not have to rename any variables and easily reproduced the gender percentages. As I am a beginner to coding, Google became my best friend from the very first step. After a quick Google search of ‘how to find percentage in R’, the recommended approach was to use the dplyr package, with this (adapted) example code:

library(dplyr)
study1_raw %>% 
  group_by(Gender) %>% 
  summarise(percent = 100 * n()/nrow(study1_raw)) #Success
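The percentage calculation above can also be sketched in base R, which may help show what the dplyr pipeline is doing. This is only an illustration: the `gender` vector below is made-up toy data, not the study data.

```r
# Toy data standing in for the Gender column (values are made up)
gender <- c("Woman", "Man", "Woman", "Other", "Woman", "Man", "Woman", "Woman")

counts  <- table(gender)              # number of rows in each category
percent <- 100 * prop.table(counts)   # each count divided by the total, as a percentage

round(percent, 1)
```

The percentages always sum to 100 because every row falls into exactly one category, which is a quick sanity check on this kind of output.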

Age descriptives (or as we call it… the ‘Age Dilemma’):

This section was our main challenge in the demographics section of the replications.

So initially we decided to filter out the “decline to answer” responses (where people did not want to give researchers their ages) and use the summarise() function to find the mean age and SD of participants:

#average age attempt 1
study1_raw %>% 
  filter(age != "[Decline to Answer]") %>%  # filter out the “decline to answer” responses 
  summarise(mean(age), sd(age))    
               #FAIL (would work if age was numeric)

So our next step was to make age a numeric variable rather than a character variable. In my first attempt, the suggested process for converting to numeric started with checking the class type:

# check variable type 
class(study1_raw)  # supposed to be getting [1] "character"????

# Output:
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

The suggestion was then to use recode() from the dplyr package, then ifelse() to convert from categorical to numerical, and then match() to turn the character variable into a numeric one. However, after seeing the output of the first line (and being heavily confused as to what I was supposed to type in the brackets based on the example code on the website), I abandoned that method pretty quickly and ran to find an easier one. What did pique my interest, though, is how variables can be converted between types using the as.numeric() and as.character() functions.
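In hindsight, the unexpected class output most likely came from asking for the class of the whole data frame rather than of the age column. A minimal sketch of the difference, using a made-up data frame `demo` (a hypothetical stand-in, not the study data):

```r
# Hypothetical toy data frame; the values are made up
demo <- data.frame(age = c("19", "22", "[Decline to Answer]"),
                   stringsAsFactors = FALSE)

class(demo)       # class of the whole data frame
class(demo$age)   # class of the single age column
```

With a tibble read in by read_csv() the first call lists several classes, as in the output above, while the second call is the one that reports "character".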

So then began the process of combining different methods of filtering the “Decline to Answer” responses and trying to convert the age variable into a numeric one:

#convert age to numeric variable attempt 2
age.subset <- age.subset %>% 
  as.numeric(as.character(age))    #FAIL

#attempt 3
demographics %>% mutate(new_age = age != "[Decline to Answer]") # use dplyr to create a new variable ('new_age')  #FAIL

#attempt 4
age.subset %>%  mutate(age_test = numeric(age.subset))  #FAIL

It was a repetitive process of trial and error until we eventually figured out the amazing $ operator, which allowed us to extract the Age column from the study1_raw data. We then converted Age to a numeric variable using the as.numeric() function. We were now able to calculate the mean and SD of age, and by using na.rm we were able to remove any "Decline to Answer" (NA) values:

```
#THE FIX
study1_raw$Age <- as.numeric(study1_raw$Age) # converts "chr" to numeric form

study1_age <- study1_raw %>%
  summarise(mean_age = mean(Age, na.rm = TRUE),
            sd_age = sd(Age, na.rm = TRUE),
            min_age = min(Age, na.rm = TRUE),
            max_age = max(Age, na.rm = TRUE))
#SUCCESS, Study reported that mean age = 20.89, SD = 3.03.
```

Although the minimum and maximum age was not reported in the document, we decided to flex our newfound abilities ;) 
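To make the conversion step concrete, here is a small sketch on made-up values (the vector `age_chr` is hypothetical, not the study data). It shows why na.rm = TRUE is needed: as.numeric() turns any text that is not a number into NA, normally with a warning.

```r
# Made-up character ages, including one non-numeric response
age_chr <- c("19", "22", "[Decline to Answer]", "25")

# Non-numeric text becomes NA; the warning is suppressed here only to keep output tidy
age_num <- suppressWarnings(as.numeric(age_chr))

mean_age <- mean(age_num, na.rm = TRUE)  # NA values are dropped before averaging
sd_age   <- sd(age_num, na.rm = TRUE)
```

Without na.rm = TRUE, both mean() and sd() would return NA as soon as a single NA is present.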

## Study 2:

We again started off by loading the relevant packages and reading the data.

# load packages
library(tidyverse) # for most of the functions needed
library(dplyr)     # tools for efficient data manipulation

# read the data
study2_raw <- read_csv(file = "Study 2.csv")

Then we began the replication process starting with Time 1. 

### Time 1:

![](https://i.imgur.com/CjQ1AJv.jpg)

As mentioned in Study 1, the exclusion criteria had already been applied, thus we only have data for the N= 336 participants who had "completed both the Time 1 and Time 2 surveys". Thus, as we were missing 50 rows of data, we were unable to reproduce the demographic statistics at Time 1, only Time 2. 

### Time 2:
![](https://i.imgur.com/64gOT8B.jpg)

As we had already reproduced the age and gender demographics in Study 1, we figured it would be easier to start with those (and we were right: no problems). For gender, we created a new object (which we called 'study2_gender') from study2_raw. Using the dplyr package, we used the group_by() function to group the data by the Gender variable. We then used the summarise() function to create a new data frame with one row per gender category ('male', 'female', 'non-binary'). We then calculated the percentage for each category and put them in descending order to identify the most common gender among participants:

#### Gender:

study2_gender <- study2_raw %>% # ensures a new table is saved to the environment
  group_by(Gender) %>%          # groups the data by the specified variable
  summarise(n = n(),            # counts the number of rows in each group
            Percent = 100 * n()/nrow(study2_raw)) %>% # percentage for each group (100 x group count / total rows)
  arrange(desc(n))              # arranges from largest to smallest

Although males did not equal exactly 55%, it would be close enough if rounded. When replicated it is not very likely that you would get the exact same responses, so this will have to do.  

#### Age:
For the age variable, we realised that there were no "Decline to Answer" responses (thank god!), so the age variable was already numeric. Therefore we did not encounter the issues faced in Study 1 and were able to create 'study2_age' from study2_raw and calculate the means as usual using the summarise() function from dplyr.

study2_age <- study2_raw %>%
  summarise(mean_age = mean(Age),
            sd_age = sd(Age),
            min_age = min(Age),
            max_age = max(Age))

Once again, the output was not exactly the same, but it's close enough when rounded. 

Next we needed to reproduce the percentages of participants' ethnicities and countries. Since these were percentages, we decided to adapt the gender method of finding percentages from Study 1. 
Here we created new objects ('study2_ethnicity' and 'study2_country') from 'study2_raw', each holding a table of the relevant percentages in the environment. The data was again arranged in descending order, using arrange(desc(n)) from the dplyr package, so that the percentages run from the most to the least common participant ethnicity/country:
#### Ethnicity:

study2_ethnicity <- study2_raw %>% # identical code used here as was used for gender
  group_by(Ethnicity) %>%
  summarise(n = n(),
            Percent = 100 * n()/nrow(study2_raw)) %>%
  arrange(desc(n))

#### Country:

study2_country <- study2_raw %>% # identical code used here as was used for gender
  group_by(Country) %>%
  summarise(n = n(),
            Percent = 100 * n()/nrow(study2_raw)) %>%
  arrange(desc(n))

Although we did not get exactly 32% for participants from the USA, it was close enough (may have been rounding errors?- we will say that a lot as we tend to see that replicated values can get very close but not exactly the same). 

#### Marriage stats:
The main issue we encountered when reproducing the Study 2 demographics is that the authors did not provide their marriage data, so we were unable to reproduce the '45% single/never married' figure (kind of random since they provided everything else). Samuel therefore decided to email the authors to ask for the marriage data. Dunigan Folk was emailed first and, very unexpectedly, he responded. However, he asked us to email his coauthor Dr Okabe-Miyamoto, as she had the data with her. 

![](https://i.imgur.com/uN1jaiN.png)

However, we never got a response and therefore could not replicate the data. So on to Stage 2: MEAN/SD!

# Mean/SD:  
## Study 1:
After having identified the means/SDs that needed to be reproduced, we decided to start off with physical distancing:

#### Physical distancing:
![](https://i.imgur.com/yGar4RH.jpg)

As before, the physical distancing percentage was produced by adapting the original gender percentage code, so we were able to reproduce it easily. We created the 'study1_socialdistancing' object from study1_raw, grouped the data by the 'SocialDistancing' variable, and then used the summarise() function from the dplyr package to determine the percentage of people who socially distanced:

# physical/social distancing
study1_socialdistancing <- study1_raw %>% # creates a new variable in the environment
  group_by(SocialDistancing) %>%          # groups data by social distancing
  summarise(n = n(),                      # counts the number of instances
            percent = 100 * n()/nrow(study1_raw)) # find the percentage

  SocialDistancing     n percent
1                1   460   98.5   # % who socially distanced #YAY
2                3     7    1.50  # did not socially distance

Woohoo! Next we reproduced the six feet distancing statistics. Here, I had to resort to Google to determine how the mode can be calculated. It was suggested to install the 'modeest' package, whose mlv() function is a generic function for estimating the mode. Here, the "mfv" (most frequent value) method was used to return the mode of the study1_sixfeet data:

# attempt 1
library(modeest)
study1_sixfeet <- study1_raw %>%
  mlv(study1_sixfeet, method = "mfv") # FAIL

However, the following error message was produced:

Error in mlv.default(., study1_sixfeet, method = "mfv") : is.numeric(x) is not TRUE

After inserting the error message into Google it was suggested to use the shapiro.test to determine whether the data is normally distributed:

#edit:
library(modeest)
study1_sixfeet <- study1_raw
shapiro.test(study1_sixfeet)
mlv(study1_sixfeet, method = "mfv")

Error in shapiro.test(study1_sixfeet) : is.numeric(x) is not TRUE
> mlv(study1_sixfeet, method = "mfv")
Error in mlv.default(study1_sixfeet, method = "mfv") : is.numeric(x) is not TRUE

I felt this method was too complicated just to calculate the mode, so I changed tactics. As R does not have a built-in function for the statistical mode (the built-in mode() reports an object's storage type instead), another website suggested creating a user-defined function to calculate it. Here a vector such as the SixFeet column is used as input and its mode is returned as output.

#attempt 2:
# Create the function
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Calculate mode using user function
six_feet <- getmode(study1_sixfeet)
print(six_feet)

# Create vector and summarise()
study1_sixfeet <- study1_raw %>%
  summarise(mode_six_feet = getmode(SixFeet)) # SUCCESS

Yes, it produced the mode! However, this code does not produce the six feet mean and SD. 
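Since getmode() packs several steps into one line, it may help to unpack it on a small made-up vector (the values in `v` are hypothetical, not the SixFeet data):

```r
v <- c(0, 0, 6, 3, 0, 6)          # toy input vector

uniqv  <- unique(v)               # distinct values, in order of first appearance: 0 6 3
idx    <- match(v, uniqv)         # position of each element within uniqv: 1 1 2 3 1 2
counts <- tabulate(idx)           # how often each distinct value occurs: 3 2 1
toy_mode <- uniqv[which.max(counts)]  # distinct value with the highest count
```

One thing to keep in mind is that which.max() returns only the first maximum, so if two values are tied for most frequent, this approach reports whichever appears first in the data.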

At a group meeting I realised that although Samuel and I had calculated the mode in a similar way, Georgia had written much more efficient code by installing the lsr package and using its modeOf() function to get the mode. She was also able to calculate the mode, mean and SD simultaneously:

install.packages("lsr") # used to compute mode
six_feet <- study1_raw %>% # creates a variable from raw data
  summarise(modeOf(SixFeet), mean(SixFeet), sd(SixFeet)) # calculates mode, mean and sd

However, when I tried to run the code on my R console, I would receive an error message:

! could not find function “modeOf”
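A likely cause of this error is that install.packages() only downloads a package; its functions are not available until the package is attached with library(lsr) or called with the lsr:: prefix. The sketch below assumes this is what happened (we did not confirm it at the time) and uses a made-up vector `x`. It falls back to a base R calculation if lsr is not installed.

```r
x <- c(0, 0, 1, 3, 0, 2)   # toy data, not the SixFeet column

if (requireNamespace("lsr", quietly = TRUE)) {
  # Calling the function with its package prefix works without library(lsr)
  m <- lsr::modeOf(x)
} else {
  # Base R fallback: the most frequent value according to table()
  m <- as.numeric(names(which.max(table(x))))
}
m
```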

Luckily for me, upon doing study 2 mode, I came upon a much more efficient code to calculate mode, mean and SD in the one code:

study1_sixfeet <- study1_raw %>%
  summarise(mode_six_feet = getmode(SixFeet),
            mean(study1_raw$SixFeet),
            sd(study1_raw$SixFeet))

Output:

# A tibble: 1 × 3
  mode_six_feet `mean(study1_raw$SixFeet)` `sd(study1_raw$SixFeet)`
1             0                      0.767                     1.39
# YAY! And it produces mean and sd (two birds with one stone)

Overall, this was a lesson on the beauty of collaborating rather than dividing and conquering. Although it is a success that we were able to reproduce the data, the ultimate goal is to reproduce it in a way that is efficient. It also shows that there is more than one correct way to reproduce the same value, and that working with your team members gives you access to diverse and new approaches. 

For the following reproduced values, we aimed to reproduce them using both the in-text and table 1 values. 

#### Table 1 and Lethargy:
Next we decided to reproduce table 1 as it included most of the intext values as well:
![](https://i.imgur.com/dNwZsio.png)

Here, we started off by creating a new data frame, study1_table1, from the study1_raw data, using the select() function in dplyr to filter out the unnecessary columns and keep only the variables reported in Table 1 (as shown below):

study1_table1 <- study1_raw %>% # creating a new data frame
  select(LETHAVERAGE.T1,        # this data frame only includes the variables that are in table 1
         LETHAVERAGE.T2,
         LethDiff,
         SCAVERAGE.T1,
         SCAVERAGE.T2,
         SCdiff,
         EXTRAVERSION)

We then added another 'table 1' variable to the environment where we used the describe() function- using the psych package- to get the selected descriptive statistics of the given 'study1_table1' dataset.

library(psych)
table1 <- describe(study1_table1)

However, we realised that it gives you *all* of the various types of descriptives. Since we only wanted means and SDs, we again used the select() function to keep only the mean and SD columns. All of the reproduced values matched those given in Table 1.

library(psych)
table1 <- describe(study1_table1) %>% # create a table that summarises table 1 variables
  select(mean, sd)                    # only include means and sd's

Objective met! 
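For readers without the psych package, roughly the same mean/SD table can be sketched in base R. The data frame `scores` below is made up for illustration and its column names are hypothetical.

```r
# Toy data frame with two numeric columns (values are made up)
scores <- data.frame(t1 = c(2, 4, 6),
                     t2 = c(1, 3, 5))

# One row per variable, with its mean and standard deviation
summary_tbl <- data.frame(mean = sapply(scores, mean),
                          sd   = sapply(scores, sd))
summary_tbl
```

This gives the same two columns we kept from describe(), although describe() also handles extras such as missing values and sample sizes that this sketch leaves out.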

#### Social connectedness x extraversion levels:
![](https://i.imgur.com/Y7RPInf.jpg)

As soon as I saw quantiles... I knew I was finished. So I decided to look at the author's R code (in OSF file) to see how they had done it. Here is their method:

##### Most introverted x social connectedness:

#Have most introverted participants experienced changes in connectedness
quantile(study1_table1$EXTRAVERSION) #finding quartiles
BottomQuarter <- subset(study1_table1, EXTRAVERSION <= 3.41667) #Making subset of data with only participants in bottom quartile of extraversion
BottomQuarter
BottomQuarter <- as_tibble(BottomQuarter)
BottomQuarter

mean(BottomQuarter$SCAVERAGE.T2)
sd(BottomQuarter$SCAVERAGE.T2)
mean(BottomQuarter$SCAVERAGE.T1)
sd(BottomQuarter$SCAVERAGE.T1)

Output:

> mean(BottomQuarter$SCAVERAGE.T2)
[1] 3.351172 # mean for most introverted at Time 2
> sd(BottomQuarter$SCAVERAGE.T2)
[1] 0.655226 # SD for most introverted at Time 2
> mean(BottomQuarter$SCAVERAGE.T1)
[1] 3.446572 # mean for most introverted at Time 1
> sd(BottomQuarter$SCAVERAGE.T1)
[1] 0.7024337 # SD for most introverted at Time 1

SUCCESS… they all match the in-text reported data (making life so much easier buttt…huh?)

Although the code worked for both most introverted and most extraverted participants, the code was definitely far from easy to read and understand. Especially for someone as helpless at coding as me. So we decided to develop our own method.

#### Our way:

The first step was to determine the quantiles for extraversion using quantile(), extracting the extraversion column using $:

# Introverts vs Extraverts -----------------------------------------------
# quantile split by score (quantile cutoff)
study1_quantile <- quantile(study1_raw$EXTRAVERSION) # find quantiles

After determining the quantiles, we created data subsets for the most introverted and most extraverted participants. Here, we used select() to keep only our variables of interest, as the authors only used extraversion and social connectedness levels at T1 and T2. We then filtered the data so that each subset included only the most introverted or the most extraverted participants. 
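As a rough illustration of this quantile-then-subset approach, here is a base R sketch on made-up scores (the data frame `toy` and its `ext` column are hypothetical). Instead of typing the cutoffs by hand, it reads them straight from the quantile() result by name.

```r
# Made-up extraversion scores for nine hypothetical participants
toy <- data.frame(ext = 1:9)

q <- quantile(toy$ext)          # named vector: "0%", "25%", "50%", "75%", "100%"

introverts <- toy[toy$ext <= q[["25%"]], , drop = FALSE]  # at or below the 25% cutoff
extraverts <- toy[toy$ext >= q[["75%"]], , drop = FALSE]  # at or above the 75% cutoff

nrow(introverts)
nrow(extraverts)
```

Reading the cutoffs from `q` keeps the code in step with the data if the file ever changes, at the cost of the numbers being less visible in the script.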

**Introverts**:
The most introverted participants were deemed to be those who fell at or below the 25% quantile (the first quartile):

study1_introverts <- study1_raw %>%                    # create new variable using raw data
  select(EXTRAVERSION, SCAVERAGE.T1, SCAVERAGE.T2) %>% # only including three variables
  filter(EXTRAVERSION <= 3.41667) %>%                  # only the most introverted (first quartile) are included in this subset
  mutate(Type = "Most Introverted")                    # add a Type column labelling this subset as "Most Introverted"
# this results in 119 people in this group ^ (as we can see in the 119 rows)

**Extraverts**:
The most extraverted participants were those who fell at or above the 75% quantile (the third quartile):

study1_extraverts <- study1_raw %>%                    # create a new variable using raw data
  select(EXTRAVERSION, SCAVERAGE.T1, SCAVERAGE.T2) %>% # only including three variables of interest
  filter(EXTRAVERSION >= 4.83333) %>%                  # only including the most extraverted (third quartile cutoff)
  mutate(Type = "Most Extraverted")                    # add a Type column labelling this subset as "Most Extraverted"
# This results in 130 people in this group ^

We then moved on to determining the mean and SD for each relevant most introverted/ extraverted group using describe() and selecting the mean and sd columns:

**Introverts**:

# Finding the means and SDs
introvert_meanSC <- describe(study1_introverts) %>% # new variable using most introverted data
  select(mean, sd)                                  # keeps only means and SDs # SUCCESS

**Extraverts**:

extravert_meanSC <- describe(study1_extraverts) %>% # new variable using most extraverted data
  select(mean, sd)                                  # keeps only means and SDs # SUCCESS

Although this was the most challenging part of the Mean/SD process, it showed me the importance of taking things step by step and dissecting the error code that R provides. 

## Study 2:
### Physical distancing:
![](https://i.imgur.com/NuTT81E.jpg)

Again, we started off with replicating the social distancing descriptives. Here, we used the same method as Study 1:

study2_socialdistancing <- study2_raw %>% # creates a new variable in the environment
  group_by(SocialDistancing) %>%          # groups data by social distancing
  summarise(n = n(),                      # counts the number of instances
            percent = 100 * n()/nrow(study2_raw)) # find the percentage

However, my value was different to what was reported in the text:

# A tibble: 2 × 3
  SocialDistancing     n percent
1                0    23    6.85
2                1   313   93.2   # not 92.9%???

After running my other group members' code, which produced the same output, we concluded that the discrepancy most likely lies in the value reported by the authors rather than in our code. 
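A quick piece of arithmetic shows how small the gap is. With 336 participants, the reported 92.9% corresponds to 312 people rather than the 313 in the open data, so the difference comes down to a single participant. This is only one possible reading and we could not confirm it with the authors.

```r
n_total <- 336                           # participants in the Study 2 open data

pct_313 <- round(100 * 313 / n_total, 1) # what the open data gives
pct_312 <- round(100 * 312 / n_total, 1) # one fewer participant distancing

pct_313
pct_312
```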

Next we replicated the six feet distancing values again using the same code as in Study 1:

# Create vector and summarise()
study2_sixfeet <- study2_raw %>%
  summarise(mode_six_feet = getmode(SixFeet),
            mean(study2_raw$SixFeet),
            sd(study2_raw$SixFeet))

Why is the SD = 1.75 rather than 0.75? Upon comparing my group's code to mine, they had received the same result (and theirs looks a lot more efficient than mine). We then decided to run the authors' R code:

# Social Distancing
install.packages("psych")
install.packages("plyr R")
count(study2_raw$SixFeet)
describeBy(study2_raw$SixFeet)

To which we see that their SD is also equal to 1.75? This was the first of many discrepancies (refer to Table 3) that we saw throughout the study, which somewhat brought its reliability into question. It really is a crisis :) 

### Table 3:
![](https://i.imgur.com/krVtBrR.png)

Next we decided to replicate table 3 as it also contained the in-text values (or so we thought). Here we started off by using the describe() function to calculate statistical data where we used the select() function to specifically produce mean/sd output. We also specified the relevant variables that were listed in table 3:

library(psych)
study2_summary <- describe(study2_raw[, c('T1SWLS',        # T1 life satisfaction
                                          'T2SWLS',        # T2 life satisfaction
                                          'SWLS_Diff',     # Life satisfaction change
                                          'T1BMPN',        # T1 Relatedness score
                                          'T2BMPN',        # T2 Relatedness score
                                          'BMPN_Diff',     # Relatedness difference score
                                          'T1Lonely',      # T1 Loneliness
                                          'T2Lonely',      # T2 Loneliness
                                          'Lonely_Diff',   # Loneliness change (T2-T1)
                                          'T1Extraversion')]) %>% # Extraversion level pre-pandemic
  select(mean, sd)

Output:

                 mean   sd
T1SWLS           3.97 1.53
T2SWLS           3.99 1.45
SWLS_Diff        0.02 0.88
T1BMPN           4.92 1.09
T2BMPN           4.91 1.14
BMPN_Diff       -0.01 1.11
T1Lonely         2.12 0.65
T2Lonely         2.06 0.62
Lonely_Diff     -0.06 0.40
T1Extraversion   3.90 0.79
# All the same as table 3 values! Yay!

Whilst my output matched the table values, we found a discrepancy between the table values and the in-text values:
- Change in relatedness due to COVID:
 ![](https://i.imgur.com/H2unbGe.png)

For example, the relatedness score reported in the text is "relatedness prior to the pandemic (Time 1: M= 4.90, SD = 1.11)". In the table, the reported mean is M = 4.92 and the reported SD is 1.09 (not 1.11). This was a consistent error, which could also be seen for relatedness at T2, the loneliness scores, etc. It caused us a lot of unnecessary frustration: Samuel was using the in-text values while Natasha and I were using the table values, and we were confused as to how he was getting them wrong. As we said in our presentation, this brought the reliability of the peer review process into question.
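A small check like the one below would have saved us some of that confusion. It is only a sketch: it uses the Time 1 relatedness values quoted above (Table 3, in-text, and our describe() output), and the two-decimal tolerance is my own assumption.

```r
# Values quoted in this report for Time 1 relatedness (T1BMPN).
reported <- data.frame(
  source = c("Table 3", "In-text"),
  mean   = c(4.92, 4.90),
  sd     = c(1.09, 1.11)
)

# Our reproduced values from the describe() output.
reproduced <- c(mean = 4.92, sd = 1.09)

# A source "matches" when both statistics agree to two decimal places (assumed tolerance).
reported$matches <- abs(reported$mean - reproduced[["mean"]]) < 0.005 &
  abs(reported$sd - reproduced[["sd"]]) < 0.005

print(reported)
```

Laying the two reported sources side by side makes it obvious which one a teammate should be comparing against.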

### Exploratory analysis: Extraversion x Loneliness:
For this exploratory analysis, I tried the authors' code to see whether it would work, but I ran into the same issue as before: I found their method too difficult to follow. So I used our method again and, once again, worked out the quantiles for Study 2 extraversion.

```r
# STUDY2: Split Quantile by scores
study2_quantile <- quantile(study2_raw$T1Extraversion)
```

Output:

```
      0%      25%      50%      75%     100%
2.083333 3.333333 3.833333 4.416667 6.000000
```

I then created the subsets for 'most introverted' versus 'most extraverted'. I picked the variables of interest with the select() function, isolating the T1Extraversion, T1Lonely and T2Lonely variables specified by the authors, and then filtered the rows by extraversion score. The most introverted needed to fall at or below the 25% quantile (<= 3.33). The most extraverted had to fall at or above the 75% quantile (>= 4.416777). Note that both typed cutoffs differ slightly from the quantile output above (3.333333 and 4.416667).

```r
library(dplyr)

# Create and filter lower quantile of extroverts (introverts)
# Note: 3.33 is just below the 25% quantile (3.333333), so a score exactly at the quantile is not kept
study2_introverts <- study2_raw %>%
  select(T1Extraversion, T1Lonely, T2Lonely) %>%
  filter(T1Extraversion <= 3.33) # There are 80 people in this group (Quantile 1)

# Create and filter upper quantile of extroverts (most extroverted)
# Note: 4.416777 is just above the 75% quantile (4.416667)
study2_extroverts <- study2_raw %>%
  select(T1Extraversion, T1Lonely, T2Lonely) %>%
  filter(T1Extraversion >= 4.416777) # This results in 83 people being categorised as the most extroverted
```
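Typing a rounded cutoff by hand can quietly drop anyone who sits exactly on the quantile. The sketch below shows the effect on made-up scores (multiples of 1/12, chosen so the toy quantiles land on 3.333333 and 4.416667); it is not the study data, and I have not checked how many real participants, if any, are affected.

```r
# Toy extraversion scores; NOT the study data.
extraversion <- c(25, 30, 36, 40, 42, 44, 46, 48, 50, 53, 56, 60, 72) / 12

# Take the cutoffs straight from quantile() instead of retyping them.
cutoffs   <- quantile(extraversion, probs = c(0.25, 0.75))
lower_cut <- cutoffs[["25%"]]  # 3.333333
upper_cut <- cutoffs[["75%"]]  # 4.416667

# Hand-typed cutoffs (as in the code above) versus the stored quantiles.
n_low_typed   <- sum(extraversion <= 3.33)
n_low_stored  <- sum(extraversion <= lower_cut)
n_high_typed  <- sum(extraversion >= 4.416777)
n_high_stored <- sum(extraversion >= upper_cut)

c(n_low_typed = n_low_typed, n_low_stored = n_low_stored,
  n_high_typed = n_high_typed, n_high_stored = n_high_stored)
```

In a dplyr pipeline the same idea would be `filter(T1Extraversion <= lower_cut)`, so the group sizes always follow the actual quantiles.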

```r
# Calculate means and SDs of each of the above subsets
library(psych)
introvert_mean <- describe(study2_introverts) %>%
  select(mean, sd)
```

Output:

```
introvert_mean
               mean   sd
T1Extraversion 2.90 0.29
T1Lonely       2.53 0.63  # T1Lonely is off by 0.02 for both mean and SD
T2Lonely       2.31 0.63
```

Most Extraverted:

```r
extrovert_meanL <- describe(study2_extroverts) %>%
  select(mean, sd)
```

```
extrovert_meanL
               mean   sd
T1Extraversion 4.95 0.34
T1Lonely       1.63 0.49  # T1Lonely is off by 0.02 for both mean and SD
T2Lonely       1.67 0.49
```

However, for Study 2 we were unable to reproduce the exact same values (off by .02), and as much as I tried, I am not sure why that is the case. Maybe another discrepancy? Once again... What a crisis :)

# Figures:
## Study 1:
### Figure 1:
![](https://i.imgur.com/2Z72rxG.png)
When replicating Figure 1, I started off by identifying what type of graph it was (a histogram), and then simply searched 'how to create a histogram in R'. Voila, I got an amazing [youtube video](https://www.youtube.com/watch?v=tp_BG5wDeVU&t=112s) that took me step-by-step through making a histogram, producing the following code using base R's hist() function:

```r
figure1 <- study1_raw
hist(study1_raw$SCdiff, # to make histogram
     breaks = 13, # number of bars in histogram
     name = "Distribution of Social Connectedness Difference Scores")
```

However, the video did not include how to label the axes or how to name the graph (which I tried to fluke by writing name = , but as you can guess that did not work, so google it is!). A quick google search informed me to use the main, xlab and ylab parameters:

```r
# attempt 2 and final solution
figure1 <- study1_raw
hist(study1_raw$SCdiff, # create a histogram for SCdiff
     breaks = 13, # number of bars in the histogram
     main = "Distribution of Social Connectedness Difference Scores", # name of graph
     xlab = "Social Connectedness Difference Score (T2-T1)", # x axis label
     ylab = "Frequency") # y axis label
# SUCCESS
```

Yes they look similar! (Never have I been more grateful to the authors that they did not try to be fancy by adding different colours)

However, when liaising with my team members, I found that they were not so lucky and had tried to use ggplot and the geom_histogram function to produce the graph:

```r
# Failed Attempts
figure1 <- ggplot(study1_raw, aes(x = SCdiff)) +
  geom_histogram(colour = "black", fill = "grey", bins = 12) +
  ggtitle("Distribution of Social Connectedness Difference Scores") + # for the main title
  xlab("Social Connectedness Difference Score (T2 - T1)") + # x axis label
  ylab("Frequency") # y axis label
# doesnt work ^^

# still doesnt work
figure1 <- ggplot(study1_raw) +
  geom_histogram(mapping = aes(x = SCdiff), colour = "black", fill = "grey", bins = 13)
```

Eventually they reached the same code I had. Keep in mind that our approach was divide and conquer. It was not the best idea, and we could have saved a lot of time if we had done the figures together and communicated more consistently. It was also hard to see what other team members were doing in their RStudio, so we had to wait until they posted their code in the shared HackMD document to compare it (you live and you learn). Not very efficient, but we got there in the end.

However, upon proofreading our code, we came across a more effective method using ggplot:

```r
figure1 <- ggplot(study1_raw, aes(x = SCdiff)) +
  geom_histogram(colour = "black", fill = "grey", breaks = seq(-4, 4, by = 0.5)) +
  scale_x_continuous(breaks = seq(-3, 3, by = 1)) +
  scale_y_continuous(expand = expansion(mult = 0, add = 0)) +
  theme(
    axis.line = element_line(colour = "black"),
    panel.background = element_blank()
  ) +
  ggtitle("Distribution of Social Connectedness Difference Scores") +
  xlab("Social Connectedness Difference Score (T2 - T1)") +
  ylab("Frequency")
# MUCH BETTER
```

## Study 2:
### Figure 2:
We then began replicating Figure 2, starting with the "Distribution of Relatedness Difference Scores". We made a bunch of mistakes for relatedness and then used the same code for loneliness (why not do what you did in Study 1? Don't ask).
![](https://i.imgur.com/KEkoViM.jpg)
### Relatedness:
Initially we loaded the packages, including 'ggplot2' to create more complex plots. We read the data in as 'study2data' and created the 'Relatedness_diff' data frame, using the dplyr select() function to extract only 'BMPN_Diff' (relatedness difference scores) for the graph:

```r
# Load packages
library(tidyverse)
library(ggplot2)

# Read the data
study2data <- read_csv(file = "Study 2.csv")

# Create data frame
Relatedness_diff <- study2data %>%
  select(BMPN_Diff)
```

So we initially made the mistake of trying to use ggplot, specifically the geom_histogram function, to create our graph. Not sure why I thought it would work this time. We tried to add features, such as colour and the number of bars:

```r
relatedness <- ggplot(Relatedness_diff, aes(x = BMPN_Diff)) +
  geom_histogram(colour = "black", fill = "grey", bins = 18) +
  ggtitle("Distribution of Relatedness Difference Scores") + # for the main title
  xlab("Relatedness Difference Score (T2 - T1)") + # x axis label
  ylab("Frequency") # y axis label
```

And yes it did work! Or so we thought. 

Upon closer inspection, we saw that the frequencies of the relatedness difference scores in our graph looked different to those in the study. This is particularly evident for difference scores between -1 and 1, where our columns show more variability. So we decided to use the Study 1 method.
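My best guess at why the bars differed is that asking for a number of bins leaves the bin edges up to the plotting function, whereas a vector of break points fixes them. The base R sketch below illustrates this with hist() on made-up scores (not the study data). With a single number, hist() treats `breaks` as a suggestion and picks "pretty" break points; with a vector, it uses exactly the edges given.

```r
set.seed(1)
# Toy difference scores, clamped so they fall inside (-4, 4); NOT the study data.
diff_scores <- round(rnorm(200, mean = 0, sd = 1), 1)
diff_scores <- pmax(pmin(diff_scores, 3.9), -3.9)

# A single number is only a suggestion for the number of cells.
h_suggested <- hist(diff_scores, breaks = 13, plot = FALSE)

# A vector of break points is used exactly as given: 16 bins of width 0.5.
h_explicit <- hist(diff_scores, breaks = seq(-4, 4, by = 0.5), plot = FALSE)

length(h_suggested$counts) # not guaranteed to be 13
length(h_explicit$counts)  # 16
```

This is why fixing the edges with seq(), in either hist() or geom_histogram(), is the more dependable way to match a published histogram.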

Here, we used the hist() function to produce the histogram of relatedness difference scores and labelled the x and y axes as in Study 1. This process was then repeated for the loneliness half of Figure 2. We then investigated ways to combine the two graphs into a grid. We originally tried the 'cowplot' package, but realised that it was only for ggplots, so we set out to find a different method. We asked our tutor Yooki, who suggested we use the par() function to combine the two histograms:

```r
par(mfrow = c(1, 2))
hist(study2_raw$Lonely_Diff,
     main = "Distribution of Loneliness Scores",
     xlab = "Loneliness Difference Score (T2-T1)",
     xlim = c(-2, 2), ylim = c(0, 50), breaks = 40)
hist(study2_raw$BMPN_Diff,
     main = "Distribution of Relatedness Difference Scores",
     xlab = "Relatedness Difference Score (T2-T1)",
     xlim = c(-6, 4), ylim = c(0, 80), breaks = 20)
# SUCCESS
```
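One thing worth knowing about par(): the two-panel layout stays in force for later plots until it is reset. A minimal sketch of saving and restoring the settings is below. It uses toy data and draws to a null PDF device purely so that it runs without opening a plot window; both are my own choices for the example.

```r
pdf(NULL) # null device, so nothing is written to disk

old_par <- par(mfrow = c(1, 2)) # par() returns the previous settings

set.seed(2)
hist(rnorm(100), main = "Left panel", xlab = "Toy scores")
hist(rnorm(100), main = "Right panel", xlab = "Toy scores")

par(old_par) # back to one plot per page
layout_after <- par("mfrow")

invisible(dev.off())
```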

However, once again upon editing our work, we decided we wanted a graph that allows for more customisation (via ggplot), so we re-adapted the Figure 1 strategy. This was mainly for smaller details, such as having the x and y axes joined and making the graph more similar to the one in the original study. I have provided a breakdown of our steps in the code:

```r
library(ggplot2)

# Figure 2 - Relatedness:
figure2_relatedness <- ggplot(study2_raw, aes(x = BMPN_Diff)) + # create a new graph based on relatedness scores
  # breaks = seq() sets the bin edges: from -6 to 4, with one column every 0.5 units on the x axis
  geom_histogram(colour = "black", fill = "grey", breaks = seq(-6, 4, by = 0.5)) +
  # x axis labels from -5 to 4, one unit apart (ie -5, -4, -3, etc)
  scale_x_continuous(breaks = seq(-5, 4, by = 1)) +
  # customise space between data and axis using expansion (0 means we want no gaps)
  scale_y_continuous(expand = expansion(mult = 0, add = 0)) +
  theme(
    axis.line = element_line(colour = "black"), # customise axes
    panel.background = element_blank() # remove graph background
  ) +
  ggtitle("Distribution of Relatedness Difference Scores") + # name of graph
  xlab("Relatedness Difference Score (T2 - T1)") + # name axes
  ylab("Frequency")

# Figure 2 - Loneliness:
figure2_loneliness <- ggplot(study2_raw, aes(x = Lonely_Diff)) + # create a new graph based on loneliness scores
  # bin edges from -2 to 2 as in the study, with one column every 0.1 units on the x axis
  geom_histogram(colour = "black", fill = "grey", breaks = seq(-2, 2, by = 0.1)) +
  scale_x_continuous(breaks = seq(-2, 2, by = 1)) + # customise x axis
  scale_y_continuous(expand = expansion(mult = 0, add = 0)) + # no gaps
  theme(
    axis.line = element_line(colour = "black"),
    panel.background = element_blank()
  ) +
  ggtitle("Distribution of Loneliness Difference Scores") +
  xlab("Loneliness Difference Score (T2 - T1)") +
  ylab("Frequency")

# Joint graph:
library(cowplot)
figure2 <- plot_grid(figure2_relatedness, figure2_loneliness) # combine two ggplots in a grid
```

PERFECT!

This process showed me again that although it is great to have been able to reproduce the original study's statistics, it is also important to go over your code and ensure that it is the *best* way to reproduce it. It is good to always come back to your original code and see if there are better (and perhaps more efficient) ways of reproducing the data. 

## Figure 3:
Before I get into our final solution, I would like to explore some of the errors we came across (not all, as there are way too many). We initially underestimated what a nightmare this graph would be, so here are a couple of our failed attempts:
1.

```r
figure3 <- ggplot(figure3_study1) +
  geom_line(mapping = aes(x = Time, y = SCAVERAGE, lty = Type, group = Type))
# FAIL
```

2.

```r
# add in stat summary
figure3 <- ggplot(figure3_study1) +
  geom_line(mapping = aes(x = Time, y = SCAVERAGE, lty = Type, group = Type)) +
  stat_summary(aes(x = Time, y = SCAVERAGE, group = Type))
# FAIL
```

3.

```r
# doesnt work
figure3 <- ggplot(figure3_study1) +
  stat_summary(aes(x = Time, y = SCAVERAGE, group = Type)) +
  geom_line(aes(x = Time, y = mean(SCAVERAGE), lty = Type, group = Type))
```

At least the points were plotted? But overall... that is definitely not what we wanted to reproduce. 

Thus, this resulted in heavy collaboration and multiple errors and lines upon lines of code to reproduce one graph (*face palm*). 

Anyway, we decided to go back and read the article's method. As soon as we saw that the sample was "subdivided" into "the top and bottom quartiles for extraversion", we knew to use the same method as in our previous exploratory analyses. This figure involved two sections: social connectedness and loneliness. We started off with social connectedness.

### Social connectedness:
Here we needed to use the study 1 data and determine the quantiles for extraversion levels. This was done using the previous methods:

```r
# quantile split by score (quantile cutoff)
study1_quantile <- quantile(study1_raw$EXTRAVERSION)
```

We then created the data subsets for most introverted and most extraverted, again using the same method as in Study 1. We also used the mutate() function to add the new variable 'Type', because once the data are combined later it would otherwise be hard to distinguish the most introverted from the most extraverted:

```r
# STEP ONE - create data subsets that filter out the most introverted and the most extraverted, adding a new variable that classifies each as either extravert/introvert
study1_extraverts <- study1_raw %>% # create a new variable using raw data
  select(EXTRAVERSION, SCAVERAGE.T1, SCAVERAGE.T2) %>% # only including three variables of interest
  filter(EXTRAVERSION >= 4.83333) %>% # only including the most extraverted (4th quartile cutoff)
  mutate(Type = "Most Extraverted") # adding a variable so it's easier to graph
```

Having worked on my exploratory analysis for a while, I knew that to visualise these separate subsets we needed to combine the two datasets (most introverted + most extraverted), as plotting from a single dataset is much simpler than plotting from two. Here, we used the bind_rows() function from the dplyr package:

```r
study1_extreme <- bind_rows(study1_introverts, study1_extraverts)
```

Now we needed to separate this 'study1_extreme' joint data into pre-COVID and during-COVID timeframes. Since this graph centred on social connectedness levels, we selected the SCAVERAGE scores and used mutate() to create a new variable called 'Time', set to 'Before Pandemic'. This was done because once we combine the timeframes again, it would otherwise be hard to tell which rows are 'before pandemic' and which are 'during pandemic'. We also renamed SCAVERAGE.T1 to just SCAVERAGE:

```r
# STEP THREE - create new variables that have separated the different time variables but include both introverts and extraverts
study1_extreme_before <- study1_extreme %>% # creating a new variable using the extreme data (most introverted and extraverted)
  select(SCAVERAGE.T1, Type) %>% # only including variables of interest (Type was added earlier using mutate())
  mutate(Time = "Before Pandemic") %>% # create a new variable that classifies the time period
  rename(SCAVERAGE = SCAVERAGE.T1) # renaming so that this can be plotted on the y axis
```

The same method was repeated for the 'during the pandemic' time frame:

```r
study1_extreme_during <- study1_extreme %>% # same but for the second time measurement
  select(SCAVERAGE.T2, Type) %>%
  mutate(Time = "During Pandemic") %>%
  rename(SCAVERAGE = SCAVERAGE.T2)
```

We then bound the data together again (so that everything can be plotted from the one dataset). This creates a dataset that categorises each row as 'most introverted' vs 'most extraverted' *and* 'before' or 'during', with the corresponding social connectedness (SCAVERAGE) score:

```r
figure3_study1 <- bind_rows(study1_extreme_before, study1_extreme_during) # combining the two variables created earlier into one data set, stacked on top of each other
```

Now that all of the data is combined and categorised, we needed to determine the descriptives, in this case the mean, for each category:

```r
# STEP FIVE - create a summary table with all the SC means of the four groups
figure3_study1_summary <- figure3_study1 %>% # new variable
  group_by(Type, Time) %>%
  summarise(SCAVERAGE = mean(SCAVERAGE))
```
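To make the stack-then-summarise logic easier to follow, here is the same idea in base R on a tiny made-up dataset (four toy participants, not the study data). The column names mirror ours, but everything else is an assumption for illustration.

```r
# Toy wide data: one row per participant; NOT the study data.
wide <- data.frame(
  Type         = c("Most Introverted", "Most Introverted",
                   "Most Extraverted", "Most Extraverted"),
  SCAVERAGE.T1 = c(3.0, 3.5, 4.5, 5.0),
  SCAVERAGE.T2 = c(3.2, 3.4, 4.2, 4.6)
)

# Stack the two time points on top of each other (the bind_rows() step).
long <- data.frame(
  Type      = rep(wide$Type, times = 2),
  Time      = rep(c("Before Pandemic", "During Pandemic"), each = nrow(wide)),
  SCAVERAGE = c(wide$SCAVERAGE.T1, wide$SCAVERAGE.T2)
)

# One mean per Type x Time cell (the group_by() + summarise() step).
cell_means <- aggregate(SCAVERAGE ~ Type + Time, data = long, FUN = mean)
print(cell_means)
```

The result has four rows, one per combination of extraversion type and time point, which is exactly the shape the line graph needs.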

Now we needed to isolate the variables again to be able to form our graph, which differentiates between before vs during the pandemic and most extraverted vs most introverted. That is, we created variables that differ by extraversion level and time point. For example, we created a new variable called 'study1_extreme_before_introverts' from the 'study1_extreme_before' dataset (which holds only pre-pandemic values), keeping only the 'most introverted' participants. This was done using filter():

```r
study1_extreme_before_introverts <- study1_extreme_before %>%
  filter(Type == "Most Introverted") # only want most introverted (and before the pandemic)
```

Thus, we have created a variable that is specific to 'before pandemic' and to the 'most introverted' participants. We then converted it into plain values, as we only need the SCAVERAGE scores (our only variable of interest). This was done using $, so that the result is a vector of numbers rather than a data frame:

```r
# Before pandemic - introverts
study1_extreme_before_introverts <- study1_extreme_before %>%
  filter(Type == "Most Introverted") # only want most introverted (and before the pandemic)

study1_extreme_before_introverts <- study1_extreme_before_introverts$SCAVERAGE
```

We can see that it is now just a set of numeric values (an atomic vector) rather than a categorised data frame.
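The difference between the two objects is easy to check. The sketch below uses a made-up three-row data frame (not the study data) and base R subsetting in place of filter(), just to show what `$` returns.

```r
# Toy data frame; NOT the study data.
before <- data.frame(
  Type      = c("Most Introverted", "Most Extraverted", "Most Introverted"),
  SCAVERAGE = c(3.1, 4.6, 3.4)
)

introverts_df  <- before[before$Type == "Most Introverted", ] # still a data frame
introverts_vec <- introverts_df$SCAVERAGE                     # plain numeric vector

is.data.frame(introverts_df)
is.numeric(introverts_vec)
```

A plain numeric vector is what t.test() expects as its first argument, which is why this conversion comes before the confidence interval step.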

This was repeated for each required variable:
- During pandemic + Most introverted:

```r
# during pandemic - introverts
study1_extreme_during_introverts <- study1_extreme_during %>%
  filter(Type == "Most Introverted")

study1_extreme_during_introverts <- study1_extreme_during_introverts$SCAVERAGE
```

- Before pandemic + Most extraverted:

```r
# before pandemic - extraverts
study1_extreme_before_extraverts <- study1_extreme_before %>%
  filter(Type == "Most Extraverted")

study1_extreme_before_extraverts <- study1_extreme_before_extraverts$SCAVERAGE
```

- During pandemic + Most extraverted:

```r
# during pandemic - extraverts
study1_extreme_during_extraverts <- study1_extreme_during %>%
  filter(Type == "Most Extraverted")

study1_extreme_during_extraverts <- study1_extreme_during_extraverts$SCAVERAGE
```

Now for the error bars! The caption stated that "95% CI error bars" were used, so we could not just use SE error bars. We experimented with several trial-and-error methods to determine the CIs. Eventually we resorted to looking at the authors' code to see how they computed their CIs, and used a slightly altered version based on our desired dataset:

```r
t.test(study1_extreme_during_introverts, alternative = "two.sided", paired = TRUE)
```

To which I got an error:

```
Error in t.test.default(study1_extreme_during_introverts, alternative = "two.sided", : 'y' is missing for paired test
```

So I decided to remove the 'paired = TRUE' and see what that produced:

```r
t.test(study1_extreme_during_introverts, alternative = "two.sided") # SUCCESS
```

So that was the method we used to determine the upper and lower limits of our CI for each variable, using a one-sample rather than a paired-samples t-test:

- Before pandemic + Most introverted:

```r
t.test(study1_extreme_before_introverts, alternative = "two.sided")
```

- During pandemic + Most introverted:

```r
t.test(study1_extreme_during_introverts, alternative = "two.sided")
```

- Before pandemic + Most extraverted:

```r
t.test(study1_extreme_before_extraverts, alternative = "two.sided")
```

- During pandemic + Most extraverted:

```r
t.test(study1_extreme_during_extraverts, alternative = "two.sided")
```

We then created vectors that explicitly defined the upper and lower CI values using c():

```r
# STEP EIGHT - create data for upper and lower limits using the results from the t tests above
# The values must follow the row order of figure3_study1_summary
figure3_study1_summary$lower <- c(4.579, 4.3, 3.319, 3.232) # creating a variable that contains the lower CI limits of the four groups

figure3_study1_summary$upper <- c(4.83, 4.59, 3.574, 3.47) # creating a variable that contains the upper CI limits of the four groups
```
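Copying CI limits out of the console by hand works, but it is easy to mistype a value or put it in the wrong row. As a sketch of an alternative (my own suggestion, not the authors' method), the limits can be read straight from the `conf.int` element that t.test() returns. The scores below are made up, not the study data, and `mean_ci()` is a hypothetical helper name.

```r
# Toy long-format scores; NOT the study data.
scores <- data.frame(
  Type = rep(c("Most Extraverted", "Most Introverted"), each = 8),
  Time = rep(rep(c("Before Pandemic", "During Pandemic"), each = 4), times = 2),
  SCAVERAGE = c(4.8, 4.6, 4.9, 4.5,  4.4, 4.3, 4.6, 4.5,
                3.4, 3.5, 3.3, 3.6,  3.3, 3.2, 3.5, 3.4)
)

# Hypothetical helper: mean and 95% CI of one group from a one-sample t-test.
mean_ci <- function(x) {
  ci <- t.test(x)$conf.int
  c(mean = mean(x), lower = ci[1], upper = ci[2])
}

# One vector of scores per Type x Time cell; the labels travel with the values.
groups <- split(scores$SCAVERAGE, list(scores$Type, scores$Time), sep = " / ")
summary_ci <- t(sapply(groups, mean_ci))

print(round(summary_ci, 3))
```

Because every row is labelled with its group, there is no need to remember which order the four groups appear in.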

Now that we're done with all the pre-work, we can finally graph it! I have provided comments to help follow along:

```r
# READY TO PLOT THE GRAPH!!!
figure3_SC <- ggplot(figure3_study1_summary) + # summary data frame holding the SCAVERAGE mean and CI limits for all extraversion levels and time points
  geom_line(aes(x = Time, y = SCAVERAGE, lty = Type, group = Type)) + # draws the lines; lty changes the line type based on extraversion type
  geom_point(aes(x = Time, y = SCAVERAGE, group = Type)) + # add the points of the SC means for each extraversion type
  ylim(3, 5.0) + # rescales the y axis so it starts from 3 and ends at 5
  theme(
    legend.title = element_blank(), # removes the title from the legend
    panel.grid.major = element_blank(), # removes grid lines
    panel.grid.minor = element_blank(), # removes grid lines
    axis.line = element_line(colour = "black"), # adds lines on the axes and makes them black
    panel.background = element_blank() # sets the background to white
  ) +
  geom_errorbar(
    aes(
      x = Time, # defining the x variable (before or during pandemic)
      group = Type, # defining how to group (most extraverted or most introverted)
      ymin = lower, # lower error bar = lower CI value as defined earlier by c()
      ymax = upper # upper error bar = upper CI value as defined earlier by c()
    ),
    width = .1 # sets the width of the CI error bars
  ) +
  ggtitle("Social Connectedness Changes Based on Extraversion") + # for the main title (put enter as the title is too long when we placed the SC graph beside loneliness, see later)
  ylab("Mean Social Connectedness") # title for the y axis

print(figure3_SC) # SUCCESSSS
```

### Loneliness:
The same process was repeated for the loneliness graph, so only a brief summary of the method is provided:

Here, Study 2 data was used and quantiles for extraversion levels were determined:

```r
# quantile split by score (quantile cutoff)
study2_quantile <- quantile(study2data$T1Extraversion)
```

Then the data subsets were created based on extraversion level:

```r
# Note: 3.833333 is the 50% value in the quantile output shown earlier (the 25% value is 3.333333);
# it is kept here because it is the cutoff we used to produce the figure
study2_introverts <- study2data %>% # create new variable using raw data
  select(T1Extraversion, T1Lonely, T2Lonely) %>% # only including three variables
  filter(T1Extraversion <= 3.833333) %>% # keep only the most introverted in this subset
  mutate(Type = "Most Introverted")

study2_extraverts <- study2data %>% # create a new variable using raw data
  select(T1Extraversion, T1Lonely, T2Lonely) %>% # only including three variables of interest
  filter(T1Extraversion >= 4.416667) %>% # only including the most extraverted (3rd quantile or larger)
  mutate(Type = "Most Extraverted") # adding a variable so it's easier to graph
```

We then repeated the same process again:

```r
# STEP ONE - combining the introvert and extravert subset data into one, stacking on top of each other
study2_figure3_Bind <- bind_rows(study2_introverts, study2_extraverts)

# STEP TWO - create new variables that have separated the different time variables but include both introverts and extraverts
study2_figure3_before <- study2_figure3_Bind %>% # creating a new variable using the extreme data (most introverted and extraverted)
  select(T1Lonely, Type) %>% # only including variables of interest
  mutate(Time = "Before Pandemic") %>% # create a new variable that classifies the time period
  rename(Loneliness = T1Lonely) # renaming so that this can be plotted on the y axis

study2_figure3_during <- study2_figure3_Bind %>% # same but for the second time measurement (during pandemic)
  select(T2Lonely, Type) %>%
  mutate(Time = "During Pandemic") %>%
  rename(Loneliness = T2Lonely)

# STEP THREE - bind the two new variables on top of each other (now categorised based on extraversion level and time point)
# this makes a data set that includes a categorisation of introvert vs extravert and before or during, with the corresponding Loneliness score
figure3_study2_data <- bind_rows(study2_figure3_before, study2_figure3_during)

# STEP FOUR - create a summary table with all the Loneliness means of the four groups
figure3_study2_summary <- figure3_study2_data %>% # new variable
  group_by(Type, Time) %>%
  summarise(Loneliness = mean(Loneliness))
```

Next we created separate variables specific to each extraversion level and time point, and turned them into values in the environment:

```r
# Before pandemic - introverts
study2_figure3_introverts_before <- study2_figure3_before %>%
  filter(Type == "Most Introverted") # only want most introverted

study2_figure3_introverts_before <- study2_figure3_introverts_before$Loneliness # only want Loneliness variable and need to convert to a value using $

# Before pandemic - extraverts
study2_figure3_extroverts_before <- study2_figure3_before %>%
  filter(Type == "Most Extraverted")

study2_figure3_extroverts_before <- study2_figure3_extroverts_before$Loneliness

# During pandemic - introverts
study2_figure3_introverts_during <- study2_figure3_during %>%
  filter(Type == "Most Introverted")

study2_figure3_introverts_during <- study2_figure3_introverts_during$Loneliness

# During pandemic - extraverts
study2_figure3_extroverts_during <- study2_figure3_during %>%
  filter(Type == "Most Extraverted")

study2_figure3_extroverts_during <- study2_figure3_extroverts_during$Loneliness
```

```r
# STEP SIX - Find the CI for each group via one-sample t tests

# Before pandemic - introverts
t.test(study2_figure3_introverts_before, alternative = "two.sided")

# During pandemic - introverts
t.test(study2_figure3_introverts_during, alternative = "two.sided")

# Before pandemic - extraverts
t.test(study2_figure3_extroverts_before, alternative = "two.sided")

# During pandemic - extraverts
t.test(study2_figure3_extroverts_during, alternative = "two.sided")
```

We then created vectors to specify the lower and upper values of the CIs:

```r
figure3_study2_summary$lower <- c(1.519254, 1.566918, 2.300463, 2.191586)

figure3_study2_summary$upper <- c(1.735026, 1.779943, 2.495222, 2.381804)
```

Now we can finally graph it (same as social connectedness)!

```r
# TIME TO GRAPH!
figure3_lonely <- ggplot(figure3_study2_summary) +
  geom_line(aes(x = Time, y = Loneliness, lty = Type, group = Type)) +
  geom_point(aes(x = Time, y = Loneliness, group = Type)) +
  ylim(1, 3.0) +
  theme(
    legend.title = element_blank(),
    legend.key = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line = element_line(colour = "black"),
    panel.background = element_blank()
  ) +
  geom_errorbar(
    aes(x = Time, group = Type, ymin = lower, ymax = upper),
    width = .05
  ) +
  ggtitle("Loneliness Changes Based on Extraversion") +
  ylab("Mean Loneliness")

plot(figure3_lonely) # SUCCESS
```

### Finalised figure 3:
Now that we have produced the two graphs, to have it look exactly like the one in the study....:
![](https://i.imgur.com/2zYcx4I.png)

....I used the following code to combine the two ggplots:

```r
library(cowplot) # allows ggplots to be arranged as a grid using the plot_grid() function
final_figure3 <- plot_grid(
  figure3_SC + theme(legend.position = "none"), # legend.position is used to remove the legend from the connectedness graph
  figure3_lonely
)
```

Yup...definitely a triumph. 

# Part 3: Exploratory:

## Exploratory 1: Relationship between age and levels of social connectedness:
I had begun my first year of university when COVID first hit, so I did not really have an opportunity to develop strong social relationships outside of my high school friends and family. During this time, I would envy students in their third or fourth years of university, as they would have had time to develop these strong social relationships and a support network at university. This caused me to feel as though I had reduced levels of social connectedness in comparison to older university students. So, I decided to explore my hypothesis that students in their first years of university (first_few) experienced lower levels of social connectedness than those in their third or fourth years (final_few).

Now in the Study 1 data, we only have the ages of the 467 university participants. So I will have to make a couple of assumptions in order to conduct this exploratory analysis. Here, I will divide the age groups by quantiles. Those at or below the 25% quantile will be deemed as being in their first few years of university. Those at or above the 75% quantile will be deemed as being in their later years of university. Of course I am aware that there can be older students in their first few years of university due to reasons such as gap years and so on, however we will just follow the aforementioned assumptions.

Now, regardless of whether older participants are in their first few years of university, the argument here is that, for example, those who were 17-18 years old most likely transitioned straight from school into university. They did not have much of an opportunity to immerse themselves in society and form strong social relationships outside of school and family. However, older students, no matter what year of university, do have experience and have engaged in social interactions prior to the pandemic and even prior to starting university. Therefore, the overall hypothesis is that, on average, the older the undergraduate student, the stronger their support network, and therefore their level of social connectedness will not change as much as that of younger students, who feel more isolated from their school friends and university classmates.

As seen in our earlier Study 1 Demographics, the maximum age was 44 whilst the minimum age was 17:

```
# A tibble: 1 × 4
  mean_age sd_age min_age max_age
1     20.9   3.03      17      44
```

Thus, there is a range of ages to work with.

(Note that the following method is a re-adaptation of the figure 3 method). 

### Descriptives:
Initially I started off by determining the quantiles and classifying undergraduate students as first few years of university (25% and below) or later years of university (75% or above):

```r
#### Has social connectedness changed more for younger or older undergraduate students?
# quantile split by score (quantile cutoff)
study1_quantile <- quantile(study1_raw$Age) # find quantiles
```

To which I got an error message:

```
Error in (1 - h) * qs[i] : non-numeric argument to binary operator
```

Oh yes, the age dilemma. So I applied our previous solution: turning the Age column of the Study 1 data into a numeric variable, which converts the 'Decline to Answer' responses to NA, and then removing those NAs when computing the quantiles:

```r
# attempt 2
# Make age numeric; 'Decline to Answer' responses become NA
study1_raw$Age <- as.numeric(study1_raw$Age)
```

Output:

```
Warning message: NAs introduced by coercion
```

```r
# quantile split by score (quantile cutoff)
study1_quantile <- quantile(study1_raw$Age, na.rm = TRUE) # find quantiles and remove "decline to answer"/NA responses
```
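The whole age dilemma can be reproduced on a handful of made-up responses (not the study data). The sketch below shows the text column, the conversion, and a quick count of how many answers turned into NA, which is a useful sanity check before trusting the quantiles.

```r
# Toy age column stored as text because one response is not a number; NOT the study data.
age_raw <- c("18", "19", "Decline to Answer", "22", "20", "44", "17")

# Non-numeric entries become NA. as.numeric() warns about this, which is expected here,
# so the warning is silenced on purpose.
age_num <- suppressWarnings(as.numeric(age_raw))

n_dropped <- sum(is.na(age_num)) # how many responses were lost in the conversion
age_quantiles <- quantile(age_num, na.rm = TRUE) # quantiles over the remaining ages

n_dropped
age_quantiles
```

If `n_dropped` is larger than the number of 'Decline to Answer' responses, something else in the column failed to convert and is worth a look.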

Next, as per usual, I created data subsets based on whether participants were classified as first few years or later years of university:

```r
# create data subsets
first_few <- study1_raw %>% # create new variable using raw data
  select(Age, SCAVERAGE.T1, SCAVERAGE.T2, SCdiff, EXTRAVERSION) %>% # only including five variables
  filter(Age <= 19) %>% # keep only the younger participants (first quantile) in this subset
  mutate(Type = "First few years") # creating new 'Type' variable to graph
# this results in 143 people in this group ^

final_few <- study1_raw %>% # create a new variable using raw data
  select(Age, SCAVERAGE.T1, SCAVERAGE.T2, SCdiff, EXTRAVERSION) %>% # only including five variables of interest
  filter(Age >= 22) %>% # only including participants at or above the 75% quantile
  mutate(Type = "Final years") # creating new 'Type' variable to graph
# This results in 118 people in this group ^
```

I then combined the two data subsets into one to be able to visualise the data later on:

combining first few years and final years data into one

study1_age_extremes <- bind_rows(first_few, final_few)


I then created new variables based on both time point (before or during the pandemic) and age (first years or final years):

```
# New variable based on time point and age
age_extreme_before <- study1_age_extremes %>% # creating a new variable using the extreme data (first years and last years)
  select(SCAVERAGE.T1, Type) %>% # only including variables of interest --> Type was added using the mutate function
  mutate(Time = "Before Pandemic") %>% # create a new variable that classifies the time period
  rename(SCAVERAGE = SCAVERAGE.T1) # renaming so that this can be plotted on the y axis

age_extreme_during <- study1_age_extremes %>% # same but for the second time measurement
  select(SCAVERAGE.T2, Type) %>%
  mutate(Time = "During Pandemic") %>%
  rename(SCAVERAGE = SCAVERAGE.T2)
```

And then combined the data frames once again, now categorised based on time and age:

```
# Combining the two variables (age and time point)
age_time <- bind_rows(age_extreme_before, age_extreme_during) # stacks the two data frames created earlier on top of each other
```

Now that all variables are categorised in the one data frame, the mean for each group can be determined:

```
# Finding the means:
age_time_summary <- age_time %>% # new variable
  group_by(Type, Time) %>%
  summarise(SCAVERAGE = mean(SCAVERAGE))
```
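One thing to watch in the summary step is that `mean()` returns NA if any score in a group is missing. The sketch below uses made-up scores (the `age_time_toy` data frame is hypothetical) and base R's `aggregate()` as a cross-check on the dplyr summary; whether the real data contain missing scores is not known from the report:

```r
# Hypothetical long-format data mimicking age_time (values are made up)
age_time_toy <- data.frame(
  Type = rep(c("First few years", "Final years"), each = 4),
  Time = rep(c("Before Pandemic", "During Pandemic"), times = 4),
  SCAVERAGE = c(4.0, 3.0, 4.2, 3.4, 4.1, 3.8, 3.9, NA)
)

# A single missing score makes the plain mean NA ...
naive_mean <- mean(c(3.8, NA))
# ... unless missing values are removed first
safe_mean <- mean(c(3.8, NA), na.rm = TRUE)

# The formula interface of aggregate() drops rows with NA by default,
# so every Type x Time cell gets a usable mean
toy_summary <- aggregate(SCAVERAGE ~ Type + Time, data = age_time_toy, FUN = mean)
print(toy_summary)
```

In the dplyr version, the equivalent safeguard would be `mean(SCAVERAGE, na.rm = TRUE)` inside `summarise()`.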

Same as before, we then create separate variables that are categorised based on age (proxy for first or last years of university) and time of measurement:

```
# Create separate variables for each group specific to age and time of measurement

# Before pandemic - first years
study1_extreme_before_first <- age_extreme_before %>%
  filter(Type == "First Years") # only want first years (and before the pandemic)

study1_extreme_before_first <- study1_extreme_before_first$SCAVERAGE # only want the SCAVERAGE variable; $ converts it to a vector of values in the environment

# before pandemic - last years
study1_extreme_before_last <- age_extreme_before %>%
  filter(Type == "Last Years")

study1_extreme_before_last <- study1_extreme_before_last$SCAVERAGE

# during pandemic - first years
study1_extreme_during_first <- age_extreme_during %>%
  filter(Type == "First Years")

study1_extreme_during_first <- study1_extreme_during_first$SCAVERAGE

# during pandemic - last years
study1_extreme_during_last <- age_extreme_during %>%
  filter(Type == "Last Years")

study1_extreme_during_last <- study1_extreme_during_last$SCAVERAGE
```

So now we are ready to plot our graph!

### Visualisation:

```
# Plot graph
age_SC <- ggplot(age_time_summary) + # uses the data frame of group means
  geom_line(aes(x = Time, y = SCAVERAGE, lty = Type, group = Type)) + # draws the lines
  geom_point(aes(x = Time, y = SCAVERAGE, group = Type)) + # adds the points on the means
  ylim(3, 5.0) + # rescales the y axis
  theme(legend.title = element_blank(), # removes the title from the legend
        panel.grid.major = element_blank(), # removes grid lines
        panel.grid.minor = element_blank(), # removes grid lines
        axis.line = element_line(colour = "black"), # adds lines on the axes and makes them black
        panel.background = element_blank()) + # sets the background to white
  ggtitle("Social Connectedness Changes Based on Age") + # for the main title
  ylab("Mean Social Connectedness") # label the y axis
```

### Results:
The graph tells us that first and last year university students initially had similar levels of social connectedness, but that during the pandemic their levels began to differ. Although social connectedness decreased in both groups, first years did in fact end up with lower levels than students in the final years of their undergraduate degree. This raised the question of whether the difference in social connectedness between the two groups during the pandemic is significant.

#### Test of significance:
To determine the CIs, I tried using the same method as Figure 3; however, it did not work. Here's an example:

```
t.test(study1_extreme_before_first, alternative = "two.sided")
```

Error code:

```
Error in t.test.default(study1_extreme_before_first, alternative = "two.sided") : not enough 'x' observations
```

I then tried other methods to test significance, such as:

```
wilcox.test(study1_extreme_during_first, study1_extreme_during_last, alternative = "g") # FAIL

wilcox.test(study1_extreme_before_first, conf.int = TRUE) # FAIL
```

It was suggested that the error was due to NA values, and given the age dilemma I was worried about that, so I tried methods to remove NA values. In hindsight, the 0 observations point to a likelier cause: the filters used to build these vectors look for the labels "First Years" and "Last Years", whereas the Type variable was created with the labels "First few years" and "Final years", so every filtered vector was empty.

```
na <- na.omit(study1_extreme_during_first, study1_extreme_during_last) # to remove NA values, did not work - would give 0 observations in environment
```
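The "not enough 'x' observations" error can be reproduced whenever a filter asks for a label that was never assigned. This is a minimal sketch on made-up data (the `toy` data frame and its scores are hypothetical), using the same Type labels as the subsets created earlier:

```r
# Hypothetical toy data using the Type labels assigned in the subsets above
toy <- data.frame(
  Type = c("First few years", "First few years", "Final years", "Final years"),
  SCAVERAGE = c(3.2, 3.6, 3.9, 4.1)
)

# Filtering on a label that does not exist returns an empty vector
wrong <- toy$SCAVERAGE[toy$Type == "First Years"]
# Filtering on the label that was actually assigned returns the scores
right <- toy$SCAVERAGE[toy$Type == "First few years"]

print(length(wrong))
print(length(right))

# unique() is a quick way to see which labels really exist
print(unique(toy$Type))

# t.test() on the empty vector stops with an error; capture its message
err <- tryCatch(t.test(wrong), error = function(e) conditionMessage(e))
print(err)
```

Checking `unique()` on the grouping column before filtering would have flagged the mismatch straight away.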

After multiple rounds of trial and error, I decided to try another data frame, as I only needed to test the significance of the 'during the pandemic' timeframe anyway:

```
# try different data frame?
SCdiff <- age_time_summary %>% # create new variable using the summary data
  select(SCAVERAGE, Type, Time) %>% # only including three variables
  filter(Time == "During Pandemic")

# Test whether difference in social connectedness during pandemic is statistically significant between first v last uni years:
t.test(SCdiff$SCAVERAGE, alternative = "two.sided") # SUCCESS
```

Thus, this is taken as support for the hypothesis that students in their first few years of university experienced a larger change in social connectedness than older students, who had more time to develop strong social connections prior to the pandemic. The test returned p < .05 and a CI that does not contain zero, rejecting the null hypothesis. Note, however, that the test was run on the two group means in the summary table rather than on participant-level scores, so strictly it shows that the average of those two means differs from zero, not that the two groups differ from each other. The conclusion should therefore be treated as tentative.
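A between-group comparison is usually run on participant-level scores rather than on summary means. The following is a minimal sketch of a Welch two-sample t-test, assuming simulated during-pandemic scores; the vectors `during_first` and `during_last` are hypothetical stand-ins, with only the group sizes (143 and 118) taken from the report:

```r
set.seed(1) # makes the made-up numbers reproducible

# Hypothetical participant-level scores during the pandemic (NOT the study data)
during_first <- rnorm(143, mean = 3.4, sd = 0.8)
during_last <- rnorm(118, mean = 3.7, sd = 0.8)

# Welch two-sample t-test: do the two groups differ on average?
group_test <- t.test(during_first, during_last, alternative = "two.sided")

print(group_test$p.value) # p-value for the between-group difference
print(group_test$conf.int) # 95% CI for the difference in means
```

With the real data, the two inputs would be the during-pandemic SCAVERAGE vectors for each group, once the Type labels in the filters match those assigned in the subsets.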

This finding could help university support services design workshops and activities that give first-year students an opportunity to build their social connectedness and develop a strong social network.

## Exploratory 2: Relationship between number of people participants encountered during social distancing and loneliness levels:

Although the researchers did compare levels of loneliness for individuals who reported undergoing social distancing, I decided to go a bit deeper and explore whether the *number* of people that individuals encountered within six feet of them (outside of their household) affected their loneliness levels.

It was hypothesised that the more people participants encountered within six feet, the lower the change in their loneliness levels as their need for belonging and social interaction is being addressed (to some extent) more than those with minimal to no contact. 

To test this, I again readapted the figure 3 method:

```
# quantile split by score (quantile cutoff)
study2_quantile <- quantile(study2_raw$SixFeet) # find quantiles for six feet physical distancing

# create data subsets
study2_no <- study2_raw %>% # create new variable using raw data
  select(SixFeet, T1Lonely, T2Lonely) %>% # only including the three variables of interest
  filter(SixFeet <= 0) %>% # keep only the people who reported seeing no people (first quantile)
  mutate(Type = "No Contact")
# this results in 204 people in this group ^

study2_min <- study2_raw %>% # create a new variable using raw data
  select(SixFeet, T1Lonely, T2Lonely) %>%
  filter(SixFeet > 0, SixFeet <= 2) %>% # keep only the people who reported seeing one or two people
  mutate(Type = "Minimal Contact")
# This results in 64 people in this group ^

study2_max <- study2_raw %>% # create a new variable using raw data
  select(SixFeet, T1Lonely, T2Lonely) %>% # only including the three variables of interest
  filter(SixFeet > 2) %>% # only including participants greater than the 75% quantile
  mutate(Type = "Max Contact")
# This results in 68 people in this group ^
```

```
# Combining the no, min and max subsets into one
# (this step is not shown in the original; it is assumed to mirror the later bind_rows() calls)
study2_sixft_Bind <- bind_rows(study2_no, study2_min, study2_max)

# Create new variables that separate the two time points but include all levels of distancing
study2_sixft_before <- study2_sixft_Bind %>% # creating a new variable using the bind data
  select(T1Lonely, Type) %>% # only including variables of interest
  mutate(Time = "Before Pandemic") %>% # create a new variable that classifies the time period
  rename(Loneliness = T1Lonely) # renaming so that this can be plotted on the y axis

study2_sixft_during <- study2_sixft_Bind %>% # same but for the second time measurement
  select(T2Lonely, Type) %>%
  mutate(Time = "During Pandemic") %>%
  rename(Loneliness = T2Lonely)

# Bind the two new variables on top of each other
# this makes a data set that includes a categorisation of contact level and before or during, with the corresponding Loneliness score
sixft_study2_data <- bind_rows(study2_sixft_before, study2_sixft_during)

# Create a summary table with the loneliness means of the six groups
sixft_study2_summary <- sixft_study2_data %>% # new variable
  group_by(Type, Time) %>%
  summarise(Loneliness = mean(Loneliness))
```

```
# Before pandemic x minimal contact
study2_min_before <- study2_sixft_before %>%
  filter(Type == "Minimal Contact") # label must match the Type value exactly, including capitals

study2_min_before <- study2_min_before$Loneliness

# Before pandemic x max contact
study2_max_before <- study2_sixft_before %>%
  filter(Type == "Max Contact")

study2_max_before <- study2_max_before$Loneliness

# During pandemic - no contact
study2_no_during <- study2_sixft_during %>%
  filter(Type == "No Contact")

study2_no_during <- study2_no_during$Loneliness

# During pandemic - min contact
study2_min_during <- study2_sixft_during %>%
  filter(Type == "Minimal Contact")

study2_min_during <- study2_min_during$Loneliness

# During pandemic - max contact
study2_max_during <- study2_sixft_during %>%
  filter(Type == "Max Contact")

study2_max_during <- study2_max_during$Loneliness
```

```
# Plot
sixft_lonely <- ggplot(sixft_study2_summary) +
  geom_line(aes(x = Time, y = Loneliness, colour = Type, group = Type)) +
  geom_point(aes(x = Time, y = Loneliness, group = Type)) +
  theme(legend.title = element_blank(),
        legend.key = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        panel.background = element_blank()) +
  ggtitle("Loneliness Changes Based on six feet social distancing levels") +
  ylab("Mean Loneliness")
```


#### Results:
The pattern of results indicates that, across all levels of social distancing, participants experienced lower levels of loneliness during the pandemic, as reported by the study. However, if we break these changes in loneliness down, we see that participants who had no contact within six feet with individuals outside of their household reported the highest levels of loneliness. What I found most surprising is that those who had maximum contact with individuals outside of their household (more than 2 people) actually experienced higher levels of loneliness than participants who had minimal contact (1-2 people). In fact, those with minimal contact experienced the greatest drop in loneliness levels during the pandemic. Looking into this result can help fill gaps in the literature and improve our understanding of the nature of social relationships, particularly during a crisis such as the pandemic, where, surprisingly, encountering fewer people was associated with reduced loneliness.
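The size of the drop within a single contact group can be checked with a paired t-test on the before and during scores. This is a minimal sketch, assuming simulated loneliness scores for a group of 64 people; the vectors `t1_lonely` and `t2_lonely` are hypothetical and only the group size comes from the report:

```r
set.seed(4) # makes the made-up numbers reproducible

# Hypothetical loneliness scores for the same 64 people (NOT the study data)
t1_lonely <- rnorm(64, mean = 2.3, sd = 0.6) # before the pandemic
t2_lonely <- t1_lonely - rnorm(64, mean = 0.2, sd = 0.4) # during the pandemic

# Paired t-test: is the mean within-person change different from zero?
paired_test <- t.test(t2_lonely, t1_lonely, paired = TRUE)

print(paired_test$estimate) # mean change (during minus before)
print(paired_test$conf.int) # 95% CI for the mean change
```

Because each participant contributes both scores, the paired test uses the within-person differences and has 63 degrees of freedom for a group of 64.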

## Why do individuals with minimal contact have the biggest drop in loneliness?
The purpose of the second part of the exploratory question is to determine the contributing factors behind the large drop in loneliness levels for those with minimal contact. My hypothesis was that individuals who have minimal contact tend to have larger drops in loneliness due to personality factors or individual differences. These differences may arise from these 64 participants having higher levels of relatedness and/or higher life satisfaction. In the study, relatedness was measured with items such as "I felt close and connected with people who are important to me", and life satisfaction with items such as "I am satisfied with my life". Being high in relatedness and life satisfaction, individuals with minimal contact may not define their social interactions by the number of people they encounter, but rather by the quality of those interactions (less is more).

##### Relatedness:
To test this theory, the same data subsets were used as in Exploratory question 2 (study2_no, study2_min, study2_max), this time selecting the relatedness difference score:

```
# create data subsets
study2_no <- study2_raw %>% # create new variable using raw data
  select(SixFeet, BMPN_Diff) %>% # only including the two variables of interest
  filter(SixFeet <= 0) %>% # keep only the people who reported seeing no people (first quantile)
  mutate(Type = "No Contact")
# this results in 204 people in this group ^

study2_min <- study2_raw %>% # create a new variable using raw data
  select(SixFeet, BMPN_Diff) %>%
  filter(SixFeet > 0, SixFeet <= 2) %>% # keep only the people who reported seeing one or two people
  mutate(Type = "Minimal Contact")
# This results in 64 people in this group ^

study2_max <- study2_raw %>% # create a new variable using raw data
  select(SixFeet, BMPN_Diff) %>% # only including the two variables of interest
  filter(SixFeet > 2) %>% # only including participants greater than the 75% quantile
  mutate(Type = "Max Contact")
# This results in 68 people in this group ^

# Combining the no, min and max six ft social distancing subset data into one - stacking on top of each other
study2_sixft_BMPN <- bind_rows(study2_no, study2_min, study2_max)
```

Then a summary table was created to determine the mean relatedness change in each category, according to the number of people encountered outside the participant's household whilst social distancing at six feet:

```
# Create a summary table with the mean relatedness change of the three groups
sixft_BMPN_summary <- study2_sixft_BMPN %>% # new variable
  group_by(Type) %>%
  summarise(Relatedness = mean(BMPN_Diff))
```

As we can already see in the output, most values are close to zero. That was the first indication that there may not be a significant result. I then tested whether these means are significantly different from zero via a one-sample t-test:

```
# CI limits
t.test(sixft_BMPN_summary$Relatedness, alternative = "two.sided")

# Specify CI limits to visualise in graph
sixft_BMPN_summary$lower <- c(-0.1003813)

sixft_BMPN_summary$upper <- c(0.1200232)
```

As we can see in the output, the change in relatedness was not significant for any level of social distancing (let alone minimal contact). We therefore fail to reject the null hypothesis that there is no difference in relatedness change across levels of social distancing. This goes against the hypothesis that minimal contact individuals would show an even larger change in relatedness. I then visualised the data to compare changes in relatedness score across types of social distancing, adding 95% CI error bars. Note that the graph is a re-adaptation of the method used in Figure 3:

```
# Plot
library(ggplot2)
sixft_BMPN <- ggplot(sixft_BMPN_summary, aes(x = Type, y = Relatedness, colour = Type)) +
  geom_col(width = .5) +
  ylim(-.5, .5) +
  theme(legend.title = element_blank(),
        legend.key = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        panel.background = element_blank()) +
  geom_errorbar(aes(x = Type, y = Relatedness, group = Type, ymin = lower, ymax = upper), width = .05) +
  ggtitle("Relatedness levels based on six feet social distancing") +
  ylab("Mean Relatedness Change")
```

##### Life satisfaction:
I then repeated the exact same method as 'Relatedness' (but used the life satisfaction data, SWLS_Diff, instead):

```
# Exploratory whether those who had minimum six feet contact vs those who had a lot
# create data subsets
study2_no <- study2_raw %>% # create new variable using raw data
  select(SixFeet, SWLS_Diff) %>% # only including the two variables of interest
  filter(SixFeet <= 0) %>% # keep only the people who reported seeing no people (first quantile)
  mutate(Type = "No")
# this results in 204 people in this group ^

study2_min <- study2_raw %>% # create a new variable using raw data
  select(SixFeet, SWLS_Diff) %>%
  filter(SixFeet > 0, SixFeet <= 2) %>% # keep only the people who reported seeing one or two people
  mutate(Type = "Min")
# This results in 64 people in this group ^

study2_max <- study2_raw %>% # create a new variable using raw data
  select(SixFeet, SWLS_Diff) %>% # only including the two variables of interest
  filter(SixFeet > 2) %>% # only including participants greater than the 75% quantile
  mutate(Type = "Max")
# This results in 68 people in this group ^
```

```
# Combining the three subsets into one
# (this step is not shown in the original; it is assumed to mirror the relatedness bind_rows() call)
study2_sixft_SWLS <- bind_rows(study2_no, study2_min, study2_max)

# Create a summary table with the mean life satisfaction change of the three groups
sixft_SWLS_summary <- study2_sixft_SWLS %>% # new variable
  group_by(Type) %>%
  summarise(Satisfaction = mean(SWLS_Diff))
# Again all values are close to zero across all levels of social distancing

# CI limits
t.test(sixft_SWLS_summary$Satisfaction, alternative = "two.sided")
```

Output:

```
p-value = 0.4288 # not significant
95 percent confidence interval:
 -0.2160440 0.3441894
# zero is contained within the CI limits, therefore we fail to reject the null hypothesis that there is no difference in life satisfaction changes across levels of social distancing.
```

```
# Specify CI limits to visualise in graph
sixft_SWLS_summary$lower <- c(-0.2160440)

sixft_SWLS_summary$upper <- c(0.3441894)
```

```
# Plot
sixft_SWLS <- ggplot(sixft_SWLS_summary, aes(x = Type, y = Satisfaction, colour = Type)) +
  geom_col(width = .5) +
  ylim(-.5, .5) +
  theme(legend.title = element_blank(),
        legend.key = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        panel.background = element_blank()) +
  geom_errorbar(aes(x = Type, y = Satisfaction, group = Type, ymin = lower, ymax = upper), width = .05) +
  ggtitle("Life Satisfaction levels based on six feet social distancing") +
  ylab("Mean Life Satisfaction Change")
```

Joining the two graphs for comparison:

```
library(cowplot)
joint_figure <- plot_grid(sixft_BMPN + theme(legend.position = "none"), sixft_SWLS)
```


#### Results:
As we can see, the relatedness and life satisfaction difference scores (from before to during the pandemic) are close to zero. These slight differences are not statistically significant, which goes against the alternative hypothesis that individual differences in relatedness and life satisfaction affect the number of social interactions a participant is satisfied with. Nor was there a significantly different effect for individuals with minimal contact: although the minimal contact group had the highest mean change in life satisfaction (M = 0.184; less so for relatedness, M = 0.0286), this was not enough to support the hypothesis.
This analysis therefore provides no evidence that changes in relatedness or life satisfaction are related to the number of people encountered outside the household during social distancing. The cause of the drastic drop in loneliness for the minimal contact group remains inconclusive, and further research is needed.
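A complementary way to test these questions is to work from participant-level change scores: a one-sample t-test within each contact group asks whether that group's mean change differs from zero, while a one-way ANOVA asks whether the groups differ from each other. This is a minimal sketch on simulated data; the `toy` data frame and its `Diff` scores are hypothetical, with only the group sizes (204, 64, 68) taken from the report:

```r
set.seed(2) # makes the made-up numbers reproducible

# Hypothetical change scores (during minus before); values are simulated
toy <- data.frame(
  Type = rep(c("No Contact", "Minimal Contact", "Max Contact"), times = c(204, 64, 68)),
  Diff = rnorm(204 + 64 + 68, mean = 0, sd = 1)
)

# One-sample t-test per group: is the mean change different from zero?
per_group <- lapply(split(toy$Diff, toy$Type), t.test)

# Collect the 95% CI of each group into a small table (one row per group)
ci_table <- t(sapply(per_group, function(res) res$conf.int))
colnames(ci_table) <- c("lower", "upper")
print(ci_table)

# One-way ANOVA: do the three groups differ from each other?
anova_fit <- summary(aov(Diff ~ Type, data = toy))
anova_p <- anova_fit[[1]][["Pr(>F)"]][1]
print(anova_p)
```

The per-group CIs produced this way could also replace the single pair of limits applied to every bar in the plots.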

## Exploratory 3: Whether younger participants are more resilient to changes in social behaviour (caused by the pandemic) than older participants

The aim of this exploratory analysis was to determine whether the younger population was more resilient, in the sense of showing higher levels of social connectedness, than the older population when undergoing changes in social behaviour via social distancing. This was left as an open-ended question: the younger population may have been better able to adapt by meeting their needs for social connection and belonging through online platforms such as social media, whereas older individuals may be more adaptable and have higher levels of social connectedness because they have stronger social relationships and support networks to rely on during a crisis (such as a pandemic). Note that the study operationalised social connectedness as a combination of relatedness and loneliness scores.

To test this hypothesis, the Figure 3 method was once again re-adapted with some slight changes. I began by referring back to the study 2 age descriptives, where I found that ages ranged from 18 to 72.

```
# Age descriptives:
print(study2_age) # Age ranges from 18-72
```

I then used quantile() to categorise participants into age groups. Those at or below the 25% quantile (age 23 or under) were deemed the younger age group, those between the 25% and 75% quantiles (ages 24 to 38) the middle age group, and those above the 75% quantile (over 38) the older age group. I started by creating data sets for relatedness levels (the process was then repeated for loneliness levels). These age subsets were bound together, split again by the time of measurement (before or during the pandemic), and bound once more. This formed categories based on the participant's age group and the time of measurement.

```
# Develop quantile
study2_quantile <- quantile(study2_raw$Age) # find quantiles

# Relatedness
# create data subsets
younger <- study2_raw %>% # create new variable using raw data
  select(Age, T1BMPN, T2BMPN) %>% # only including three variables of interest
  filter(Age <= 23) %>% # keep only the younger participants (first quantile) in this subset
  mutate(Type = "Younger") # creating new 'Type' variable to graph
# this results in 94 people in this group ^

mid <- study2_raw %>% # create a new variable using raw data
  select(Age, T1BMPN, T2BMPN) %>% # only including three variables of interest
  filter(Age > 23, Age <= 38) %>% # only including participants between the 25% and 75% quantiles
  mutate(Type = "Middle") # creating new 'Type' variable to graph
# This results in 159 people in this group ^

older <- study2_raw %>% # create a new variable using raw data
  select(Age, T1BMPN, T2BMPN) %>% # only including three variables of interest
  filter(Age > 38) %>% # only including participants above the 75% quantile
  mutate(Type = "Older") # creating new 'Type' variable to graph
# This results in 83 people in this group ^

# combining all age subset data into one
study2_all_ages <- bind_rows(younger, mid, older)
```
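The three separate filters can also be collapsed into a single call to base R's `cut()`, which assigns every participant to a group in one step. This is a minimal sketch on made-up ages (the `age` vector is hypothetical); the cut points 23 and 38 are the ones used in the filters above:

```r
# Hypothetical ages (NOT the study data)
age <- c(18, 21, 23, 24, 30, 38, 39, 55, 72)

# Intervals are closed on the right by default:
# (-Inf, 23] = Younger, (23, 38] = Middle, (38, Inf) = Older
age_group <- cut(
  age,
  breaks = c(-Inf, 23, 38, Inf),
  labels = c("Younger", "Middle", "Older")
)

# Count how many ages fall into each group
group_counts <- table(age_group)
print(group_counts)
```

In a dplyr pipeline the same call could sit inside `mutate(Type = cut(Age, ...))`, which avoids creating and re-binding three subsets.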

```
# New variable based on time point and age
age_all_before <- study2_all_ages %>% # creating a new variable using the age groups data
  select(T1BMPN, Type) %>% # only including variables of interest --> Type was added using the mutate function
  mutate(Time = "Before Pandemic") %>% # create a new variable that classifies the time period
  rename(Relatedness = T1BMPN) # renaming so that this can be plotted on the y axis

age_all_during <- study2_all_ages %>% # same but for the second time measurement
  select(T2BMPN, Type) %>%
  mutate(Time = "During Pandemic") %>%
  rename(Relatedness = T2BMPN)

# Combining the two variables (age and time point)
age_groups_time <- bind_rows(age_all_before, age_all_during) # stacks the two data frames created earlier on top of each other

# Finding the means:
age_relatedness_study2_summary <- age_groups_time %>% # new variable
  group_by(Type, Time) %>%
  summarise(Relatedness = mean(Relatedness))
```

```
# Create separate variables for each group specific to age and time of measurement

# Before pandemic - Younger
study2_age_before_younger <- age_all_before %>%
  filter(Type == "Younger") # only want younger participants (and before the pandemic)

study2_age_before_younger <- study2_age_before_younger$Relatedness # only want the Relatedness variable; $ converts it to a vector of values in the environment

# Before pandemic - Mid-Age
study2_age_before_mid <- age_all_before %>%
  filter(Type == "Middle")

study2_age_before_mid <- study2_age_before_mid$Relatedness

# Before pandemic - Older
study2_age_before_older <- age_all_before %>%
  filter(Type == "Older")

study2_age_before_older <- study2_age_before_older$Relatedness

# during pandemic - Younger
study2_age_during_younger <- age_all_during %>%
  filter(Type == "Younger")

study2_age_during_younger <- study2_age_during_younger$Relatedness

# during pandemic - Mid-Age
study2_age_during_mid <- age_all_during %>%
  filter(Type == "Middle")

study2_age_during_mid <- study2_age_during_mid$Relatedness

# during pandemic - Older
study2_age_during_older <- age_all_during %>%
  filter(Type == "Older")

study2_age_during_older <- study2_age_during_older$Relatedness
```

I then created CI limits for the 95% CI error bars. As none of the CIs contain zero, we can reject the null hypothesis that mean relatedness (or, later on, mean loneliness) is zero in any group: participants reported non-zero relatedness and loneliness both before and during the pandemic.

```
# CI limits:
# Find the CI for each group via one-sample t-tests
t.test(study2_age_before_younger, alternative = "two.sided")

t.test(study2_age_before_mid, alternative = "two.sided")

t.test(study2_age_before_older, alternative = "two.sided")

t.test(study2_age_during_younger, alternative = "two.sided")

t.test(study2_age_during_mid, alternative = "two.sided")

t.test(study2_age_during_older, alternative = "two.sided")

# Create data for upper and lower limits using the results from the t-tests above
age_relatedness_study2_summary$lower <- c(4.545670, 4.678196, 5.016795, 4.497347, 4.636523, 5.071762) # creating a variable that contains the lower CI limits of the six groups

age_relatedness_study2_summary$upper <- c(4.968515, 5.024110, 5.477181, 4.967192, 5.000793, 5.514584) # creating a variable that contains the upper CI limits of the six groups
```
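When CI limits are typed in by hand, it is worth checking that their order matches the row order of the summary table, since `group_by()` sorts character groups alphabetically (Middle, Older, Younger) rather than in the order the t-tests were run. One way to sidestep the issue is to compute the mean and CI for each cell in the same step. This is a minimal sketch on simulated data; the `toy` data frame and the `mean_ci()` helper are hypothetical names introduced for illustration:

```r
set.seed(3) # makes the made-up numbers reproducible

# Hypothetical long-format scores (NOT the study data)
toy <- data.frame(
  Type = rep(c("Younger", "Middle", "Older"), each = 40),
  Time = rep(c("Before Pandemic", "During Pandemic"), times = 60),
  Relatedness = rnorm(120, mean = 4.8, sd = 1)
)

# Hypothetical helper: mean and 95% CI from a one-sample t-test
mean_ci <- function(x) {
  res <- t.test(x)
  c(mean = unname(res$estimate), lower = res$conf.int[1], upper = res$conf.int[2])
}

# One entry per Type x Time cell, so each CI stays attached to its own group
cells <- split(toy$Relatedness, list(toy$Type, toy$Time), sep = " | ")
ci_summary <- as.data.frame(t(sapply(cells, mean_ci)))
print(ci_summary)
```

Because the limits are computed from the same cells as the means, they cannot end up on the wrong error bar, and they update automatically if the data change.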

I then plotted the relatedness graph:

```
# READY TO PLOT THE GRAPH!!!
Relatedness_ex3 <- ggplot(age_relatedness_study2_summary) + # uses the data frame of means and CI limits
  geom_line(aes(x = Time, y = Relatedness, colour = Type, group = Type)) + # draws the lines
  geom_point(aes(x = Time, y = Relatedness, group = Type)) + # adds the points on the means
  ylim(2, 6.0) + # rescales the y axis
  theme(legend.title = element_blank(), # removes the title from the legend
        panel.grid.major = element_blank(), # removes grid lines
        panel.grid.minor = element_blank(), # removes grid lines
        axis.line = element_line(colour = "black"), # adds lines on the axes and makes them black
        panel.background = element_blank()) + # sets the background to white
  geom_errorbar(aes( # defining the aesthetics of the error bars
    x = Time, # defining the x variable
    y = Relatedness, # defining the y variable
    group = Type, # defining how to group
    ymin = lower, # setting the lower error bar to the corresponding lower CI value as defined earlier
    ymax = upper), # setting the upper error bar to the corresponding upper CI value as defined earlier
    width = .1) + # sets the width of the error bars
  ggtitle("Relatedness Changes Based on Age and Time of Pandemic") + # for the main title
  ylab("Mean Relatedness") # title for the y axis
```

The same process was then repeated for loneliness, where only the selected variables (T1Lonely and T2Lonely) changed:

```
# Loneliness
# create data subsets
younger <- study2_raw %>% # create new variable using raw data
  select(Age, T1Lonely, T2Lonely) %>% # only including three variables of interest
  filter(Age <= 23) %>% # keep only the younger participants (first quantile) in this subset
  mutate(Type = "Younger") # creating new 'Type' variable to graph
# this results in 94 people in this group ^

mid <- study2_raw %>% # create a new variable using raw data
  select(Age, T1Lonely, T2Lonely) %>% # only including three variables of interest
  filter(Age > 23, Age <= 38) %>% # only including participants between the 25% and 75% quantiles
  mutate(Type = "Middle") # creating new 'Type' variable to graph
# This results in 159 people in this group ^

older <- study2_raw %>% # create a new variable using raw data
  select(Age, T1Lonely, T2Lonely) %>% # only including three variables of interest
  filter(Age > 38) %>% # only including participants above the 75% quantile
  mutate(Type = "Older") # creating new 'Type' variable to graph
# This results in 83 people in this group ^

# combining all age subset data into one
study2_all_ages <- bind_rows(younger, mid, older)
```

```
# New variable based on time point and age
age_all_before <- study2_all_ages %>% # creating a new variable using the age groups data
  select(T1Lonely, Type) %>% # only including variables of interest --> Type was added using the mutate function
  mutate(Time = "Before Pandemic") %>% # create a new variable that classifies the time period
  rename(Loneliness = T1Lonely) # renaming so that this can be plotted on the y axis

age_all_during <- study2_all_ages %>% # same but for the second time measurement
  select(T2Lonely, Type) %>%
  mutate(Time = "During Pandemic") %>%
  rename(Loneliness = T2Lonely)

# Combining the two variables (age and time point)
age_groups_time <- bind_rows(age_all_before, age_all_during) # stacks the two data frames created earlier on top of each other

# Finding the means:
age_time_study2_summary <- age_groups_time %>% # new variable
  group_by(Type, Time) %>%
  summarise(Loneliness = mean(Loneliness))
```

```
# Create separate variables for each group specific to age and time of measurement

# Before pandemic - Younger
study2_age_before_younger <- age_all_before %>%
  filter(Type == "Younger") # only want younger participants (and before the pandemic)

study2_age_before_younger <- study2_age_before_younger$Loneliness # only want the Loneliness variable; $ converts it to a vector of values in the environment

# Before pandemic - Mid-Age
study2_age_before_mid <- age_all_before %>%
  filter(Type == "Middle")

study2_age_before_mid <- study2_age_before_mid$Loneliness

# Before pandemic - Older
study2_age_before_older <- age_all_before %>%
  filter(Type == "Older")

study2_age_before_older <- study2_age_before_older$Loneliness

# during pandemic - Younger
study2_age_during_younger <- age_all_during %>%
  filter(Type == "Younger")

study2_age_during_younger <- study2_age_during_younger$Loneliness

# during pandemic - Mid-Age
study2_age_during_mid <- age_all_during %>%
  filter(Type == "Middle")

study2_age_during_mid <- study2_age_during_mid$Loneliness

# during pandemic - Older
study2_age_during_older <- age_all_during %>%
  filter(Type == "Older")

study2_age_during_older <- study2_age_during_older$Loneliness
```

CI limits:

Find the CI for each group via one-sample t-tests

t.test(study2_age_before_younger, alternative = "two.sided")

t.test(study2_age_before_mid, alternative = "two.sided")

t.test(study2_age_before_older, alternative = "two.sided")

t.test(study2_age_during_younger, alternative = "two.sided")

t.test(study2_age_during_mid, alternative = "two.sided")

t.test(study2_age_during_older, alternative = "two.sided")

Create data for upper and lower limits using the results from the t test above

age_time_study2_summary$lower <- c(2.085014,2.048396, 1.821376, 1.998295, 2.021845,1.761657) # creating a variable that contains the lower CI limits of the six groups

age_time_study2_summary$upper <- c(2.339398,2.258125,2.096189,2.244706,2.221121,2.017671) #creating a variable that contains the upper CI limits of the six groups
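The limits above were typed in by hand from six separate t.test() outputs, so they only line up with the rows of age_time_study2_summary if the typing order matches the row order that group_by(Type, Time) produces. A minimal sketch of an alternative is to compute the same one-sample, t-based 95% interval inside summarise(), so each limit stays attached to its own group. The toy_data frame and the column names n_obs, mean_lonely and se are assumptions made up for this illustration; the values are simulated and are not the study data.

```r
library(dplyr)

# Hypothetical toy data standing in for age_groups_time (NOT the study data)
set.seed(1)
toy_data <- tibble(
  Type = rep(c("Younger", "Middle", "Older"), each = 20, times = 2),
  Time = rep(c("Before Pandemic", "During Pandemic"), each = 60),
  Loneliness = runif(120, min = 1, max = 4)
)

# Mean and 95% CI per group; this is the same interval a one-sample t.test() reports
toy_summary <- toy_data %>%
  group_by(Type, Time) %>%
  summarise(
    n_obs = n(),                                   # group size
    mean_lonely = mean(Loneliness),                # group mean
    se = sd(Loneliness) / sqrt(n_obs),             # standard error of the mean
    lower = mean_lonely - qt(0.975, df = n_obs - 1) * se,
    upper = mean_lonely + qt(0.975, df = n_obs - 1) * se,
    .groups = "drop"
  )

print(toy_summary)
```

With this layout, the lower and upper columns can be passed straight to geom_errorbar() without any copying of numbers between outputs.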

READY TO PLOT THE GRAPH!!!

Loneliness_ex3 <- ggplot(age_time_study2_summary) + # used data frame for mean and SD
  geom_line(aes(x = Time, y = Loneliness, colour = Type, group = Type)) + # draws the lines
  geom_point(aes(x = Time, y = Loneliness, group = Type)) + # add the points on the means
  ylim(0, 3) + # rescales the y axis
  theme(
    legend.title = element_blank(), # removes the title from the legend
    panel.grid.major = element_blank(), # removes grid lines
    panel.grid.minor = element_blank(), # removes grid lines
    axis.line = element_line(colour = "black"), # adds lines on the axis and makes them black
    panel.background = element_blank() # sets the background to white
  ) +
  geom_errorbar(
    aes( # defining the aesthetics of the error bars
      x = Time, # defining the x variable
      y = Loneliness, # defining the y variable
      group = Type, # defining how to group
      ymin = lower, # setting the lower error bar to the corresponding lower CI value as defined earlier
      ymax = upper # setting the upper error bar to the corresponding upper CI value as defined earlier
    ),
    width = .1 # sets the width of the error bars (a fixed setting, so it sits outside aes())
  ) +
  ggtitle("Loneliness Changes Based on Age and Time of Pandemic") + # for the main title
  ylab("Mean Loneliness") # title for the Y axis

Combine graphs:

library(cowplot)

final_ex3 <- plot_grid(Relatedness_ex3 + theme(legend.position = "none"), Loneliness_ex3)

I then tested the significance of the difference in relatedness and in loneliness between younger and older participants during the pandemic:

Test significance of difference during pandemic

Relatedness_diff <- age_relatedness_study2_summary %>%
  select(Relatedness, Type, Time) %>%
  filter(Time == "During Pandemic")

Loneliness_diff <- age_time_study2_summary %>%
  select(Loneliness, Type, Time) %>%
  filter(Time == "During Pandemic")
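The two blocks above only pull the "During Pandemic" group means out of the summary tables; comparing two means by eye does not give a p-value. One possible way to test the younger versus older difference is sketched below, assuming a Welch two-sample t-test on the participant-level scores is an acceptable choice (the original authors may have used a different test). The toy_during data are simulated for illustration and are not the study data.

```r
library(dplyr)

# Hypothetical participant-level scores during the pandemic (NOT the study data)
set.seed(2)
toy_during <- tibble(
  Type = rep(c("Younger", "Older"), each = 30),
  Loneliness = c(rnorm(30, mean = 2.1, sd = 0.6), rnorm(30, mean = 1.9, sd = 0.6))
)

# Pull out one vector of scores per age group
younger <- toy_during %>% filter(Type == "Younger") %>% pull(Loneliness)
older <- toy_during %>% filter(Type == "Older") %>% pull(Loneliness)

# Welch two-sample t-test (R's default; does not assume equal variances)
welch_result <- t.test(younger, older, alternative = "two.sided")

print(welch_result)
```

The same call could be repeated with the relatedness scores to cover both outcomes.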

Part 4: Recommendations

Make access to aspects of the OSF file easier: The pre-registration PDF, measures, raw data csv files for each study, the R code and the code book were each provided in separate links spread throughout the paper. There was also a link toward the end of the paper (under the ‘Data Accessibility Statement’) to each study’s OSF files. This was nicely laid out, with subheadings showing which link was for the R code, the csv file, and so on. However, every file opened in a new tab, so I was constantly switching between tabs trying to find whichever file I needed. Sometimes I had eight tabs open at once just to see the data, not counting the tabs I had open while researching how to replicate a particular piece of code. One way to resolve this is to use column headings in the raw data that make sense. For example, I would constantly get confused as to whether SWLS stood for life satisfaction or relatedness. Something as simple as renaming the variables to ‘Satisfaction’ or ‘Relatedness’ would have saved me time switching between the raw data, R code and code book. The OSF page could also have a side tab that lets you switch between files without opening a new tab. For example, if I am on Study 1 data.csv and decide I now need the Study 2 csv, I could click that heading in the side tab rather than having to exit the OSF file as a whole, find the study’s OSF link, and open the Study 2 raw data in a new tab. Something as small as this would make reproducing the results faster and less frustrating, and would put less load on our devices, which otherwise causes everything (particularly RStudio) to run slower.
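As a small illustration of the renaming suggestion, the sketch below gives a scale-code column a descriptive name with dplyr's rename(). The toy_raw data frame and its values are made up; the exact column names in the authors' csv files may differ from the bare SWLS used here.

```r
library(dplyr)

# Hypothetical raw data with a scale-code column name (NOT the study data)
toy_raw <- tibble(
  id = 1:3,
  SWLS = c(4.2, 5.0, 3.6)
)

# rename(new_name = old_name): SWLS is the Satisfaction With Life Scale
toy_renamed <- toy_raw %>%
  rename(Satisfaction = SWLS)

names(toy_renamed)
```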

Explain code to individuals who are not from a coding background: As I mentioned throughout my report, although we are grateful the authors left us with some R code, I often could not read or understand it. For example, when I used their code to replicate the ‘most introverted’ vs ‘most extraverted’ quantile data in the Mean/SD stage, I had no idea what the code was doing. I would read through 12 lines of code where the only comment was ‘#Making subset of data with only participants in top quartile of extraversion’ and my only response was “huh?”. Although I tried to break down what the chunks of code meant, it only left me feeling more overwhelmed. In fact, it made so little sense that I decided to develop the code from scratch rather than reuse the code provided in the R doc. When sharing open data, you want individuals from any background to be able to open the R doc and follow the steps taken to produce the values, even if they only read the ‘#’ comments next to your code to understand how the study’s values were produced. Moreover, do not be hesitant to break down steps into headings using # to explain each step that was taken. The recommendations can be found on this website.

Moreover, researchers should explain WHAT each line of code is doing and WHY that line of code is needed (how does it affect the output?). We do not only want to see that the value was produced, we also want to see the R coding journey the authors went through to write this code. Among the key suggestions found, authors should make use of RStudio, write out their coding journey and explain each line of code. There are also suggestions on the layout of this documentation and tips to make reproducibility easier. Writing out the code step-by-step in detail not only makes reproducibility easier and more time efficient, it also increases transparency, so that issues such as illogical exclusion criteria used to produce significant results become easier to spot. It also pushes authors to proofread their code, to check that it is logical (as it will be scrutinised by other scientists) and that it does in fact produce the reported data. This would help avoid issues such as the discrepancies between the in-text and Table 3 values, which bring the integrity of their code into question.
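To show what WHAT/WHY comments might look like in practice, here is a sketch of a top and bottom quartile split on extraversion with each step annotated. The toy_scores data, the EXTRAVERSION column name and the use of <= and >= at the cutoffs are all assumptions for illustration; they are not taken from the authors' code.

```r
library(dplyr)

# Hypothetical toy data: one extraversion score per participant (NOT the study data)
toy_scores <- tibble(
  id = 1:8,
  EXTRAVERSION = c(1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0)
)

# WHAT: find the scores that mark the bottom 25% and the top 25% of the sample.
# WHY: the 'most introverted' and 'most extraverted' groups are defined by these cutoffs.
cutoffs <- quantile(toy_scores$EXTRAVERSION, probs = c(0.25, 0.75))

# WHAT: keep only participants scoring at or below the lower cutoff.
# WHY: these participants form the 'most introverted' group.
most_introverted <- toy_scores %>%
  filter(EXTRAVERSION <= cutoffs[["25%"]])

# WHAT: keep only participants scoring at or above the upper cutoff.
# WHY: these participants form the 'most extraverted' group.
most_extraverted <- toy_scores %>%
  filter(EXTRAVERSION >= cutoffs[["75%"]])
```

Someone with no coding background can read only the comment lines here and still follow which participants end up in each group and why.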

Writing code in the order it occurred: This again ties into the previous point of providing a detailed report of the coding journey and using # to break down steps. The main aspect I would like to focus on is to START FROM THE BEGINNING. For example, upon opening the R code for Study 1, the first line of code is titled: ‘Has social connection changed as a result of the COVID-19 Pandemic’. Okay… but what about the reported participant demographics? How did you produce the physical distancing mean/SD and inferential statistics? They skipped multiple steps ahead. Moreover, their Study 1 code is approximately 80 lines long (including installing packages and reading the csv file) and mainly focuses on how they produced inferential statistics such as t-tests, Cohen’s d and regression. There is also absolutely no mention of how they produced their tables and figures. These issues are also seen in the Study 2 R code. The point is that the authors have provided very brief code that in no way made replicating their descriptive statistics easier. If third year university students are able to write up an RStudio file detailing the steps they undertook to reproduce the values, so can the authors!

In fact, providing a detailed explanation of the authors’ coding journey should become a requirement in the peer review process before papers can be published. Although it may be time consuming for authors, it is more time consuming for the scientific community to go about reproducing these values. Implementing this practice has scientific benefits, including increasing reproducibility and encouraging multiple perspectives and results in publications. It allows varied input on how to better analyse the original data and helps avoid issues such as the falsification of data. This practice can be encouraged by incentivising data sharing, where authors who undertake it receive benefits such as increased recognition of the intellectual value of their data and more citations. However, this open data practice should be mandated rather than recommended by the peer review board (Levenstein & Lyle, 2018).

Did Social Connection Decline During the First Wave of COVID-19?: The Role of Extraversion