The goal of the verification section of this report was to reproduce the demographic descriptives, the means/ SD reported in the text and all figures from the article. This was done to help get a better idea behind the challenges encountered when reproducing other studies where these challenges will be highlighted throughout the report.
“Our final sample included 467 participants (age: M = 20.89, SD = 3.03; 77% women)who completed both out Time1 and Time2 surveys and met our inclusion criteria”.
In-text results: - Physical/Social distancing: - “(98.5%) reproted practicing physical/social distancing” - “No one outside their household came within 6 feet of them the day before (Mode= 0, M = 0.77, SD = 1.39)” - Changes in social connectedness due to COVID: - “…Participants reported lower levels of social connection during the COVID-19 pandemic (Time 2: M= 3.97, SD= 0.85) than before (Time 1: M = 4.11, SD = 0.88)”. - Social connectedness x extraversion levels: - Exploratory Analysis: “…found that extraversion was weakly associated with declines in social connectedness. To unpack this finding, we split our sample into those with an extraversion score at or below in the 25th percentile and those at or above the 75th percentile. Our most intraverted participants exhibited a small drop in social connectedness between Time 1 (M = 3.45, SD= 0.70) and Time 2 (M = 3.35, SD = 0.66)”…; whereas the most extraverted participants exhibited a larger drop… in social connectedness between Time 1 (M = 4.70, SD = 0.72) and Time 2 (M = 4.45, SD = 0.84)“. - Lethargy levels (wellbeing proxy): -”To test whether momentary lethargy levels changed from before to during the pandemic….lethargy increased from Time 1 (M= 2.60, SD = 1.16) to Time 2(M= 3.16, SD = 1.27)“.
Table results: - Mean and SD reported in table 1:
In text results: - Physical/Social distancing: - “92.9% of
participants reported practicing physical/ social distancing”.
- “No one got within 6 feet of them the day before (Mode = 0, M = 1.11,
SD= 0.75)” - Change in social connectedness due to COVID: - “…we found
no difference between participants’ reports of relatedness prior to the
pandemic (Time 1: M= 4.90, SD = 1.11) versus dueing the pandemic (Time
2: M = 4.91, SD= 1.15).” - Change in loneliness levels due to COVID: -
“Participants reported feeling a little less lonely during the pandemic
(Time 2: M = 2.07, SD = 0.63) than before (Time 1: M= 2.14, SD= 0.67)”.
- Loneliness levels x ectraverion levels: - “…higher levels of
extraversion were linked to smaller decreases in loneliness…To further
investigate this finding we again subdivided our sample to focus on
participants in the top and bottom quartiles for extraversion. Whereas
our most extraverted participants showed no change in loneliness from
Time 1 (M = 1.64, SD= 0.51) to Time 2 (M= 1.67, SD= 0.49)”…, our most
introverted participants decreased in loneliness from Time 1 (M = 2.56,
SD= 0.63) to Time 2(M= 2.31, SD= 0.63)“. - Changes in life satisfaction
levels due to COVID: -”…life satisfaction did not change from Time 1(M=
3.96, SD= 1.56) to Time 2(M= 3.98, SD= 1.46).”
Table results:
Figure 1:
Figure 2:
Figure 3:
The first step was to locate the open data for both studies which was easily located by following the OSF links under the Methods section for each study.
We then went to find the datacodes for each study so we could see
what each variable was labelled: - Study 1: - Study 2:
As mentioned earlier, there were two studies in this article. Therefore, there were two separate csv file for each study. Therefore replication for each study occurred independantly starting with Study 1.
# Read the Data STUDY 1-----------------------------------------------------------
# install the packages
install.packages("tidyverse") # for most of the functions needed
install.packages("dplyr") # for efficient data manipulation
install.packages("lsr") # used to compute mode
install.packages("psych") # used to create table1
# load the packages
library(tidyverse)
library(lsr)
library(psych)
Note that we knew to load some of these packages by looking at the R code provided by the authors in the OSF file where we used the relevant packages that we deemed we needed to reproduce the data.
Raw study 1 data from the OSF file was saved as “study1_data.csv” in my work directory whereby the .csv format allowed data to be read using the read_csv() function to get the data into the Rmd file.The read_csv allows data to be imported into R as a tibble (a package used to manipulate and print data frames).
The data is then named as “study1_raw” by assigning it (<-) to the read_csv function.
# read the data
study1_raw <- read_csv(file = "study1_data.csv")
Lucky for us there was no issue reading the data where the exclusion criteria had already been applied to the data for both studies (where there were 467 rows of data- i.e. one row for each of the participants who had met their inclusion criteria- as mentioned earlier). This was made our job easier as, when we were reading the exclusion criteria, we had a mini panic attack when we saw the following criteria:
“Immediately after rating their personality, participants were asked to indicate whether they answered the scale with regard to their “typical personality prior to the COVID-19 pandemic” or their “personality during the COVID-19 pandemic.” As pre-registered, we excluded 142 participants who answered the latter or failed to answer.”
“Excluding inattentive participants, participants who provided the same answer 12 times in a row, those who missed more than approx. 20% of items (Cut offs: i.e. 2 out of 10 from our measure of lethargy, 3 out of 12 on extraversion, 3 out of 19 on connectedness). If participants were missing data below these cut offs, we used the mean of each individual’s remaining responses to impute their missing responses.Finally, if participants completed the Time2 survey twice (which we did not foresee), we used only their first survey response”.
Although the authors didn’t make our replication journey easier, after seeing the struggles other groups encountered with applying the exclusion criteria, I felt super greatful that we were able to skip over this obstacle. Therefore we were able to dive straight into replicating the descriptive statistics (YAY!).
We started off by checking the number of participants to double check the number of participants (though it looked correct from the number of rows but I am an overthinker so bear with me).
#Number of participants
count(study1_raw)
# A tibble: 1 × 1
n
<int>
1 467
So we had the correct number of participants (shocker :) ). Now to determine the gender percentage.
Note: Study reproted that 77% of participants were women.
Again another blessing is that the authors wrote “man”, “woman”, “other” or “decline to answer” when reporting the gender data, so we did not have have to rename variables, so we did not encounter many issues when replicating the gender percentage. Or so I thought. So as wre may have established I have very poor (even that cannot describe it) coding skills. Terrible. So Google had immediately become my best friedn from the first step. So afterf a quick google search of ‘how to find percentage in R’, it was recommended that the easiest way was to use the dplyr package using this example code:
library( dplyr )
students %>%
group_by( gender ) %>%
summarise( percent = 100 * n() / nrow( students ) )
Which is sadly what I had attempted to do based on our raw data in the code below:
# Determine gender percentage
study1_raw %>%
group_by(Gender) %>%
summarise(percent = 100 * n()/nrow(study1raw)) %>%
filter(Gender == "woman") #Fail
This only produced an error stating that
Error in group_by(., Gender) : object 'study1_raw' not found
That is when I had realised that adding the filter function was me overthinking as I believed I needed the function to only get a percentage for the gender variable only. So once I had removed that line:
study1raw %>%
group_by(Gender) %>%
summarise(percent = 100 * n()/nrow(study1raw)) #Success
# A tibble: 4 × 2
Gender percent
<chr> <dbl>
1 [Decline to Answer] 0.428
2 Man 21.6
3 Other 0.857
4 Woman 77.1
It worked! (Just goes to just follow instructions and not doing your own thing). We can see that, similar to the data- I had also gotten 77% for women. Okay so now to move onto our main dilemma as a group.
Note: Study reproted that mean age = 20.89, SD = 3.03.
This section was our main challenge in the demographics section of the replications.
So initially we decided to filter out the “decline to answer” responses (where people did not want to give researchers their ages) using the ‘age != “[Decline to Answer])”’ and use the summarise() function to find the mean age and SD of participants:
#average age attempt 1
study1_raw %>%
filter(age != "[Decline to Answer]") %>%
summarise(mean(age), sd(age))
#FAIL (would work if age was numeric)
To which I kept receiving this error message:
! object 'age' not found
After searching on google and double checking why the age variable cannot be found in the raw study 1 data, I decided to relook over the code until I realised I had said age and not Age (face palm):
study1_raw %>%
filter(Age !="[Decline to Answer]") %>%
summarise(mean(Age), sd(Age))
However, the joy of not getting the same error message was short-lived as the next error message was:
# A tibble: 1 × 2
`mean(Age)` `sd(Age)`
<dbl> <dbl>
1 NA 3.03
Warning message:
In mean.default(Age) : argument is not numeric or logical: returning NA
BUTTTT…. we got the SD (YAY!). So we know the summarise() function is working. Now to make the age numeral numeric rather than a character variable () In our second attempt we tried to use the .subset (rather than filter) to filter out the age variable. Again did not work. In both cases, the error message was that Age was not a numeric variable. Therefore, our next steps was to convert it into a numerical value.
So in my first attempt, the suggested process to convert to numerical was to check the class type:
# check variable type
class(study1_raw) # supposed to be getting [1] "character"????
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
and then to use the recode() in the dplyr package then ifelse() to convert from categorical to numerical then match() function to rename the character variable into a numeric one. However, after seeing the output of the first line…:
dplyr::recode (Age_variable, "Decline to Answer" = 0)
Error in UseMethod("recode") :
no applicable method for 'recode' applied to an object of class "c('spec_tbl_df', 'tbl_df', 'tbl', 'data.frame')"
….(and being heavily confused as to what I was supposed to be typing in the brackets):
data <- c("a","b","c","b","a")
dplyr::recode(data, a = 1, b = 2, c = 3)
## [1] 1 2 3 2 1
ifelse(data=="a", 1, ifelse(data=="b", 2, 3))
## [1] 1 2 3 2 1
oldvals <- c("a", "b", "c")
newvals <- c(1, 2, 3)
newvals[match(data, oldvals)]
## [1] 1 2 3 2 1
(#example code used in [website] (https://universeofdatascience.com/how-to-recode-character-variables-in-r/)), I abandoned that method pretty quickly and ran to find an easier one. However, what did pique my interest is how recoded variables can be converted into each other using as.numeric and the as.character functions.
So then began the process of us combining different methods of filtering the “Decline to Answer” responses and trying to convert the age variable into a numeric one:
#convert age to numeric variable attempt 2
age.subset <- age.subset %>%
as.numeric(as.character(age)) #FAIL
#attempt 3
demographics %>% mutate(new_age = age != "[Decline to Answer]") #FAIL
#attempt 4
age.subset %>% mutate(age_test = numeric(age.subset)) #FAIL
Here, we used the mutate function using the dplyr package to create a new variable known as new_age, then again tried to filter out the needed responses by using the above methods. Again, fail. So it was just a repetitive process of trial and error until eventually we figured out the amazing $. This allowed us to extract the specific Age part of the study1_raw data. Then we allowed Age to become a numeric variable by using the “as.numeric” function.We were now able to calculate the age means and SD and by using na.rm we were able to skip over any “Decline to Answer” values:
#THE FIX
study1_raw <- read_csv("study1_data.csv")
study1_raw$Age <- as.numeric(study1_raw$Age)
#Converts "chr" to numeric form.
study1_age <- study1_raw %>%
summarise(mean_age = mean(Age, na.rm = TRUE),
sd_age = sd(Age, na.rm = TRUE),
min_age = min(Age, na.rm = TRUE),
max_age = max(Age, na.rm = TRUE)) #SUCCESS
Output:
# A tibble: 1 × 4
mean_age sd_age min_age max_age
<dbl> <dbl> <dbl> <dbl>
1 20.9 3.03 17 44
Although the minimum and maximum age was not reported in the document, we decided to flex our newfound abilities ;) So YAY! And thus we commence to the next stage …. STUDY 2 DEMOGRAPHICS!!!
So we started off by loading the relavent packages and reading the data.
# load packages
library(tidyverse) #for most of the functions needed
library(dplyr) # tools for efficient data manipulation
# read the data
study2_raw <- read_csv(file= "Study 2.csv")
Then we began the replication process starting with time 1 (pre-pandemic).
“Participants (N=386; age:M=31.66, SD = 11.86; 55% Male; 80% White; 46% single/ never married; 31% U.S, 27% U.K) were recruited from Prolific Academic)”
As mentioned in Study 1, the exclusion criteria had already been applied, thus we only have data for the N= 336 participants who had “completed both the Time 1 and Time 2 surveys and met (their) pre-registered inclusion criteria”. Thus, as we were missing 50 rows of data, we were unable to reproduce the demographic statistics at Time 1, only Time 2.
“Our final sample comprised 336 participants (age: M=32.03, SD= 11.94; 55% Male; 80% White; 45% single/never married; 32% U.S; 27% U.K) wjho completed both our Time 1 and Time 2 surveys and met our pre-registered inclusion criteria.”
For Time 2, I started off by double checking the number of participants who met the criteria were all reported in the data:
count(study2_raw)
# A tibble: 1 × 1
n
<int>
1 336 #YAY
As we had already reproduced the age and gender demographics in study 1, we figured it would be easier to start with those (and we were right- no problems). For gender we decided to assign a variable (which we called ‘study2_gender’) to the study2_raw environment. Using the dplyr package, we were able to use the group_by() function to group data framed, according to the specified gender variable. We then used the summarise() function to create a new data frame and creating three combinations of grouping variables- as seen in the output below (‘male’, ‘female’, ‘non-binary’). We then calculated the percentage for each grouping combination and put them in descending order in order to determine the most common gender within the participant data:
study2_gender <- study2_raw %>% #Ensures a new table is saved to the environment
group_by(Gender) %>% #Groups the data based on the specified variable.
summarise(n = n(), #Counts the number of unique instances for the variable.
Percent = 100 * n()/nrow(study2_raw)) %>% #Calculates the percentage - representative percentage for the instance (100 x number of instances/total rows)
arrange(desc(n)) #arranges from largest to smallest
0utput:
# A tibble: 3 × 3
Gender n Percent
<chr> <int> <dbl>
1 Male 184 54.8 # close enough if rounded, i guess?
2 Female 151 44.9
3 Non-Binary 1 0.298
For the age variable, we realised that there were no “Decline to Answer” responses (thank god!), so the age variable was numeric and logical. Therefore we di not encounter the issues faced in Study 1 and were able to just assign the ‘study2_age’ variable to to study2_raw and calculate the means as per usual using the summarise() function from the dplyr package. #### Age:
study2_age <- study2_raw %>%
summarise(mean_age = mean(Age),
sd_age = sd(Age),
min_age = min(Age),
max_age = max(Age))
# A tibble: 1 × 4
mean_age sd_age min_age max_age
<dbl> <dbl> <dbl> <dbl>
1 32.0 11.9 18 72 # SUCCESS (or close enough when rounded)
Next we needed to reproduce the percentages of ethnicity (80% White) and country’s (32% US, 27% UK) that participants belonged to. Here, since it was percentages we decided to adapt the gender method of finding percentages from study 1. Here we assigned the variables (study2_ethnicity or study2_country) to study2_raw to create a new table in the environement of the relevant data percentages for each assigned variable. The data was again arranged in descending order-using the arrange(desc(n)) via dplyr package- to ensure that the percentages are in order from most to least popular participant ethnicity/ country: #### Ethnicity:
study2_ethnicity <- study2_raw %>% #Identical code used here as was used for gender.
group_by(Ethnicity) %>%
summarise(n = n(), Percent = 100 * n()/nrow(study2_raw)) %>%
arrange(desc(n))
# A tibble: 8 × 3
Ethnicity n Percent
<chr> <int> <dbl>
1 White 268 79.8 # SUCCESS
2 Asian 30 8.93
3 Hispanic 14 4.17
4 More than one 10 2.98
5 Black 9 2.68
6 Middle-Eastern 3 0.893
7 American Indian/Alaskan Native 1 0.298
8 Other 1 0.298
study2_country <- study2_raw %>% #Identical code used here as was used for gender.
group_by(Country) %>%
summarise(n = n(), Percent = 100 * n()/nrow(study2_raw)) %>%
arrange(desc(n))
# A tibble: 29 × 3
Country n Percent
<chr> <int> <dbl>
1 USA 104 31.0 # not 32%???? not sure why (*cue OCD*)
2 UK 90 26.8 # close enough :)
3 Poland 29 8.63
4 Portugal 23 6.85
5 Canada 15 4.46
6 Greece 11 3.27
7 Italy 10 2.98
8 Germany 6 1.79
9 Mexico 6 1.79
10 Slovenia 5 1.49
# … with 19 more rows
Although we did not get exactly 32% for the nu mber of participants from the USA, it
was close enough and we will just move on and blame the authors and
reproducibility crisis.
The main issue we encountered when reproducing Study 2 demographics is that they did not provide their marriage data, so we were unable to reproduce the 45% single/never married data (kind of random since they provided everything else). Here, Samuel decided to email the authors asking them for their marriage data. Initially Dunigan Folk was emailed to which he responded (was very unexpected). However, he emailed us to email his coauthor Dr Okabe-Miyamoto as she had the data with her.
However, we never got a response and therefore could not replicate the data. So on to stage 2: MEAN/SD!
Fortunately for us, we were able to get through this stage with minor
challenges.
## Study 1: After having identified the Mean/SD that need to be
identified, we decided to start off with physical distancing:
**Physical/Social distancing: - “(98.5%) reproted practicing physical/social distancing” - “No one outside their household came within 6 feet of them the day before (Mode= 0, M = 0.77, SD = 1.39)”
As before, the physical distancing percentage was also produced by adapting the original gender percentage and therefore we were able to easily reproduce this. Again, we assigned the ‘study1_socialdistancing’ variable to study1_raw where we grouped the data based on the ‘SocialDistancing variable’ and then the summarise() function from the dplyr package to determine the percentage of people who social distanced:
# physical/social distancing
study1_socialdistancing <- study1_raw %>% # creates a new variable in the environment
group_by(SocialDistancing) %>% # groups data by social distancing
summarise(n = n(), # counts the number of instances
percent = 100 * n()/nrow(study1_raw)) # find the percentage
A tibble: 2 × 3
SocialDistancing n percent
<dbl> <int> <dbl>
1 1 460 98.5 # YAY
2 3 7 1.50
Off to a great start! WOOO! Next we reproduced the 6 feet distancing stat. Here, I had to resort to google to determine how mode can be calculated. Here the it was suggested to install the ‘library(modeest)’ package. After assigning variables, mlv was used to used as a generic function to compute a generic estimate of the mode of the study1_sixfeet data. Here, the “mfv” method was used to return the mode of the data:
# attempt 1
library(modeest)
study1_sixfeet <- study1_raw %>%
mlv(study1_sixfeet, method = "mfv") # FAIL
However, the following error message was produced:
Error in mlv.default(., study1_sixfeet, method = "mfv") :
is.numeric(x) is not TRUE
After inserting the error message into Google it was suggested to use the shapiro.test to determine whether the data is normally distributed:
#edit:
library(modeest)
study1_sixfeet <- study1_raw
shapiro.test(study1_sixfeet)
mlv(study1_sixfeet, method = "mfv")
Error in shapiro.test(study1_sixfeet) : is.numeric(x) is not TRUE
> mlv(study1_sixfeet, method = "mfv")
Error in mlv.default(study1_sixfeet, method = "mfv") :
is.numeric(x) is not TRUE
However, I got another error message. Although i could have continued looking into it I felt like this was way too complicated just to figure out a mode. So I decided to switch tactics. As R does not have a built in function to determine mode, another website suggested creating a user function to calculate the mode of the data. Here the vector study1_sixfeet is used as input and the mode is produced as an output.
#attempt 2:
# Create the function
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Calculate mode using user function
six_feet <- getmode(study1_sixfeet)
print(six_feet)
# Create vector and summarise()
study1_sixfeet <- study1_raw %>%
summarise(mode_six_feet = getmode(SixFeet)) # SUCCESS
Output:
# A tibble: 1 × 1
mode_six_feet
<dbl>
1 0 # SUCCESS
NOTE: - GEORGIA’S EASIER METHOD - EXPLAIN ABOVE CODE
However, at a group meeting I realised that although Samuel and I had calculated the mode in a similar way, Georgia was able to create a much more efficient code by installing the lsr package to compute mode then using the modeOf() function to get the mode. She was also able to combine her method of calculating the mode whilst simulatenougly calculating the mean and SD:
install.packages("lsr") # used to compute mode
six_feet <- study1_raw %>% # creates a variable from raw data
summarise(modeOf(SixFeet), mean(SixFeet), sd(SixFeet)) # calculates mode, mean and sd
However, when I tried to run the code on my R console, I would receive an error message:
! could not find function "modeOf"
When I tried replacing “modeOf” with just mode(), the output was:
six_feet <- study1_raw %>% # creates a variable from raw data
summarise(mode(SixFeet), mean(SixFeet), sd(SixFeet)) # calculates mode, mean and sd
# A tibble: 1 × 3
`mode(SixFeet)` `mean(SixFeet)` `sd(SixFeet)`
<chr> <dbl> <dbl>
1 numeric 0.767 1.39
The mean and SD would be calculated but not the mode? —> ask Georgia about this.
Overall, this was a lesson on the beauty of collaborating rather than dividing and conquering where although it may be a success that you were able to reproduce the data, the ultimate goal is to be able to reproduce it in a way that is efficient. It also shows that there is more than one correct way to approach the same value and that by working with your team members, you are able to see that range in diverse and new approaches.
For the follwoing reproduced values, we aimed to reproduce them using both the in-text and table 1 values.
Next we decided to reproduce table 1:
Lethargy levels (wellbeing proxy): - “To test whether momentary lethargy levels changed from before to during the pandemic….lethargy increased from Time 1 (M= 2.60, SD = 1.16) to Time 2(M= 3.16, SD = 1.27)” –> note: same values as table
(Note: APA format for table1?????)
Here, we started off by creating a data frame where we assigned study1_table1 to the study1_raw data where the study1_table1 variable had the unnecessary columns filtered out where we used the select() function in dplyr to only leave the variables that were reported in Table 1 (as shown below):
study1_table1 <- study1_raw %>% # creating a new data frame
select(LETHAVERAGE.T1, # this data frame only includes the variables that are in table 1
LETHAVERAGE.T2,
LethDiff,
SCAVERAGE.T1,
SCAVERAGE.T2,
SCdiff,
EXTRAVERSION)
We then added another ‘table 1’ variable to the environment where we used the describe() function- using the psych package- to get the selected descriptive statistics of the given study1_table1 dataset:
library(psych)
table1 <- describe(study1_table1)
However, we realised that it gives you all of the varying types of descriptives:
> table1
vars n mean sd median trimmed mad min max range skew kurtosis se
LETHAVERAGE.T1 1 467 2.60 1.16 2.30 2.50 1.19 1.00 6.00 5.00 0.70 -0.30 0.05
LETHAVERAGE.T2 2 467 3.16 1.27 3.00 3.12 1.33 1.00 6.00 5.00 0.27 -0.90 0.06
LethDiff 3 467 0.56 1.33 0.40 0.54 1.33 -3.30 4.90 8.20 0.21 0.13 0.06
SCAVERAGE.T1 4 467 4.11 0.88 4.16 4.14 0.94 1.16 5.95 4.79 -0.29 -0.34 0.04
SCAVERAGE.T2 5 467 3.97 0.85 4.00 3.99 0.94 1.63 5.95 4.32 -0.14 -0.62 0.04
SCdiff 6 467 -0.14 0.71 -0.11 -0.12 0.62 -3.05 2.16 5.21 -0.42 1.60 0.03
EXTRAVERSION 7 467 4.17 1.01 4.17 4.15 1.11 1.50 6.75 5.25 0.09 -0.53 0.05 #Woah....but YAY!
Since we only wanted means and SD, we again decided to use the select function to filter out and leave only the mean and SD columns. Here all the data replicated with the values given in Table 1.
library(psych)
table1 <- describe(study1_table1) %>% # create a table that summarises table 1 variables
select(mean, sd) # only include means and sd's
mean sd
LETHAVERAGE.T1 2.60 1.16
LETHAVERAGE.T2 3.16 1.27
LethDiff 0.56 1.33
SCAVERAGE.T1 4.11 0.88
SCAVERAGE.T2 3.97 0.85
SCdiff -0.14 0.71
EXTRAVERSION 4.17 1.01 #SUCCESS
Physical/Social distancing: - “92.9% of participants reported
practicing physical/ social distancing”.
- “No one got within 6 feet of them the day before (Mode = 0, M = 1.11,
SD= 0.75)”
Again, we started off with the replicating the social distancing descriptives. Here, we used the same method as Study 1:
study2_socialdistancing <- study2_raw %>% # creates a new variable in the environment
group_by(SocialDistancing) %>% # groups data by social distancing
summarise(n = n(), # counts the number of instances
percent = 100 * n()/nrow(study2_raw)) # find the percentage
However, when I ran the code, I got a different value to what was reported in the text (92.9%):
# A tibble: 2 × 3
SocialDistancing n percent
<int> <int> <dbl>
1 0 23 6.85
2 1 313 93.2 # not 92.9%???
After running my other grouo member’s code, they had also received the same output, so it must have been another error on the author’s part.
Next we replicated the six feet distancing values again using the same code as in Study 2:
# Create the function
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
#method 1:
# Calculate mode using user function
six_feet <- getmode(study2_sixfeet)
print(six_feet)
Output:
mode_six_feet
1 0
#SUCCESS (but that just gets us the mode)
#method 2:
# Create vector and summarise()
study2_sixfeet <- study2_raw %>%
summarise(mode_six_feet = getmode(SixFeet),mean(study2_raw$SixFeet), sd(study2_raw$SixFeet))
Output:
mode_six_feet mean(study2_raw$SixFeet) sd(study2_raw$SixFeet)
1 0 1.116071 1.753864 #SUCCESS (more efficient code that develops mean, sd and mode but.....)
Why is the SD= 1.75 rather than 0.75? Upon comparing my group’s code to mine they also received the same result (and their’s look a lot more efficient than mine):
study2_six_feet <- study2_raw %>%
summarise(mode_six_feet = mode(SixFeet), #turns out, mode isn't a real function.
mean_six_feet = mean(SixFeet), #This looks right.
sd_six_feet = sd(SixFeet)) #Big discrepancy!
Output:
mode_six_feet mean_six_feet sd_six_feet
1 numeric 1.116071 1.753864
We then decided to run the author’s R code:
# Social Distancing
install.packages("psych")
install.packages("plyr R")
count(study2_raw$SixFeet)
describeBy(study2_raw$SixFeet))
Although I tried to run the code multiple times and ensured I installed the right packages, typed the command correctly, etc, I could not seem to get rid of the following error code:
Error in UseMethod("count") :
no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
or
Error in describeBy(study2_raw$SocialDistancing) :
could not find function "describeBy"
and having searched the internet for ways to fix the error (maybe describe.by? nope), it did still produce SD=1.75 for my group members when they ran it so I will take their word for it to reduce my anxiety levels. It really is a crisis :)
Next we decided to replicate table 3. Here we started off by installing the relevant psych package to be able to use the describe() function to calculate statistical data where we used the select() function to specifically only have the relevant mean/sd columns in the output:
study2_summary <- describe(study2_raw[, c('T1SWLS',
'T2SWLS',
'SWLS_Diff',
'T1Lonely',
'T2Lonely',
'Lonely_Diff',
'T1BMPN',
'T2BMPN',
'BMPN_Diff',
'T1Extraversion')]) %>%
select(mean,sd) #FAIL
However, this produced the follwoing error message:
Error in describe(study2_raw[, c("T1SWLS", "T2SWLS", "SWLS_Diff", "T1Lonely", :
could not find function "describe"
After googling the error message, a solution suggested was to load the ‘library(’testthat’)’ package. Since there was “no package called ‘testthat’” as R beutifully stated, I had to use ‘install.packages(“testthat”)’. It was suggested to load this package before loading the ‘psych’ package. I then used the same code as mentioned above…:
install.packages("testthat")
library(psych)
study2_summary <- describe(study2_raw[, c('T1SWLS', #T1 life satisfaction
'T2SWLS', # T2 life satisfaction
'SWLS_Diff', # Life satisfaction change
'T1BMPN', # T1 Relatedness score
'T2BMPN', #T2 Relatedness score
'BMPN_Diff', # Relatedness difference score
'T1Lonely', # T1 Loneliness
'T2Lonely', # T2 Loneliness
'Lonely_Diff', # Loneliness change (T2-T1)
'T1Extraversion')]) %>% # Extraversion level pre-pandemic
select(mean,sd)
And it produced:
mean sd
T1SWLS 3.97 1.53
T2SWLS 3.99 1.45
SWLS_Diff 0.02 0.88
T1BMPN 4.92 1.09
T2BMPN 4.91 1.14
BMPN_Diff -0.01 1.11
T1Lonely 2.12 0.65
T2Lonely 2.06 0.62
Lonely_Diff -0.06 0.40
T1Extraversion 3.90 0.79 # All the same as table 3 values! Yay!
IT WORKED!
# STUDY2: Split Quantile by scores
study2_quantile <- quantile(study2_raw$T1Extraversion)
# Create and filter lower quantile of extroverts (introverts)
library(dplyr)
study2_introverts <- study2_raw %>%
select(T1Extraversion, T1Lonely, T2Lonely) %>%
filter(T1Extraversion <= 3.33)
# There is 80 people as a result in this group (Quantile 1).
# Create and filter upper quantile of extroverts (most extroverted)
study2_extroverts <- study2_raw %>%
select(T1Extraversion, T1Lonely, T2Lonely) %>%
filter(T1Extraversion >=4.416777)
# This results in 83 people as being categorised as the most extroverted
# Calculate mean and sd's of each of the above subsets
library(psych)
introvert_meanL <- describe(study2_introverts) %>%
select(mean,sd)
Output for introversion:
introvert_meanL
mean sd
T1Extraversion 2.90 0.29
T1Lonely 2.53 0.63
T2Lonely 2.31 0.63 # T1Lonely is off by 0.02 for both mean and SD
Most Extraverted:
extrovert_meanL <- describe(study2_extroverts) %>%
select(mean,sd)
> extrovert_meanL
mean sd
T1Extraversion 4.95 0.34
T1Lonely 1.63 0.49 # T1Lonely is off by 0.02 for both mean and SD
T2Lonely 1.67 0.49
When replicating figure 1, I started off by identifying what type of graph it was (a histogram), and then simply search up ‘how to create a histogram in R’ and viola, I got an amazing youtube video that took us step-by-step on how to make a histogram, producing the following code using the tidyverse package:
figure1 <- study1_raw
hist(study1_raw$SCdiff, # to make histogram
breaks = 13, # number of bars in histogram
name = "Distribution of Social Connectedness Difference Scores"
)
However, the video did not include how to label the axis or how to name the graph (which I kind of fluked by writing name = , but as you can guess that did not work so google it is!). A quick google search informed me to use the main, xlab and ylab parameters:
#attempt 2
figure1 <- study1_raw
hist(study1_raw$SCdiff, # create a histogram fro SCdiff
breaks = 13, #number of bars in the histogram
main = "Distribution of Social Connectedness Difference Scores", # name of graph
xlab = "Social Connectedness Difference Score (T2-T1)", # x axis label
ylab = "Frequency", # y axis label
) #SUCCESS
This produced the following graph:
Yes they look similar!: (Never have I been more
grateful to the authors that they did not try to be fancy by adding
different colours)
However, when liasing with my team members, I found that they were not so lucky and had tried to use ggplot and the geom_histogram functions to produce the graph:
# Failed Attempts
figure1<- ggplot(study1_raw, aes(x = SCdiff)) +
geom_histogram(colour = "black" , fill = "grey", bins=12) +
ggtitle("Distribution of Social Connectedness Difference Scores") + # for the main title
xlab("Social Connectedness Difference Score (T2 - T1)") + # x axis label
ylab("Frequency") # y axis label
# doesnt work ^^
#still doesnt work
figure1 <- ggplot(study1_raw) + geom_histogram(
mapping = aes(x = SCdiff), colour = "black", fill = "grey", bins = 13)
Until eventually they had reached the same code I had:
successful attempt:
figure1 <- hist(study1_raw$SCdiff, # create a histogram for SCdiff variable
main = "Distruibution of Social Connectedness Difference Scores", # for the main title
xlab = "Social Connectedness Difference Score (T2 - T1)") # for the x axis label
Keep in mind that our approach was a divide and conquer - so yes it was not the best idea and we could have saved a lot of time if we had just done them together and communicated more consistently. It was also hard to see what other team members were doing in their R studio, so we would have to wait until they post it in the shared HackMD document to be able to compare codes (you live and you learn). Not very efficient but we got there in the end.
We then began replicating figure 2 starting with the “Distribution of Relatedness Difference Scores” started off by making a bunch of mistakes for relatedness and then did the same code for loneliness (why not do what you did in study 1? dont ask)
We then adapted the same method we used for relatedness to the ‘Distribution of Loneliness Scores’ Graph. Again we created a data frame for only the loneliness scores:
Loneliness_diff <- study2_raw %>%
select(Lonely_Diff)
And then adapted the relatedness method:
# Replicate Loneliness figure 2 graph:
hist(Loneliness_diff$Lonely_Diff, # to create the relatedness diff scores histogram
main = "Distribution of Loneliness Scores", # for the main title
xlab = "Loneliness Difference Score (T2 - T1)", # x axis label
ylab = "Frequency", # y axis label
breaks = 40) # number of columns
And yes it produced a graph that looks like the loneliness
distribution graph produced in teh study:
Social connectedness x extraversion levels:
As soon as I saw quantiles… I knew I was finished. So I decided to look at their R code to see how they had done it.
Most intraverted x social connectedness:
Output:
The code was defintely far from easy to read and understand. Especially for someone as helpless at coding as me. So I decided to research what they had done line by line. –> used bottomquarter to detect people with the lowest extraversion score
Most extraverted x social connectedness:
Okay, so they provided the code for the ‘most introverted’ participants? What about the most extraverted?:
However, while I had taken the lazy (or smart?) way of reproducing it- thats the point of open data after all, right?-, Georgia went the extra mile and did it from scratch: