After finishing all the coding tutorials, I decided this week I should make a start on the COVID paper dataset and try to reproduce the demographic descriptives and, if I was feeling overly ambitious, I could try my hand at helping a fellow group member reproduce the scatterplot, seeing as it looked extremely difficult (to which I can now say I regret trying)
For reference, the demographic statistics and scatterplot are below:
All of these (with one notable exception) were easy enough. However, there probably is a simpler and better way of calculating percentages but this is a start for now.
# load packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# extract data
covid <- "Covid_Data.csv" %>%
read.csv()
# create relevant data set
pcp <- covid %>%
select(age, gender_bin_f, race_bin, educ, income, emp_bin, livealone)
# mean and standard deviation of participant age
pcp %>%
summarise(
mean_age = mean(age),
sd_age = sd(age)
)
## mean_age sd_age
## 1 45.15238 16.78756
# age range of participants
pcp$age %>% range()
## [1] 18 76
# Percentage of females
one <- pcp %>%
select(gender_bin_f) %>%
filter(gender_bin_f == "Female")
100 * count(one) / count(pcp)
## n
## 1 49.20635
# percent white
two <- pcp %>%
select(race_bin) %>%
filter(race_bin == "White")
100 * count(two) / count(pcp)
## n
## 1 75.44974
# working for pay
pay <- pcp %>%
select(emp_bin) %>%
filter(emp_bin == "Working")
100 * count(pay) / count(pcp)
## n
## 1 56.19048
# educated
smort <- pcp %>%
select(educ) %>%
filter(educ != "Graduated high school (or GED)", educ != "Less than high school")
100 * count(smort) / count(pcp)
## n
## 1 87.72487
# lonely
alone <- pcp %>%
select(livealone) %>%
filter(livealone == "Yes")
100 * count(alone) / count(pcp)
## n
## 1 22.96296
Now, this one gets its own section as it was one of two major sources of frustration for this week (the other being the scatterplot). It was particularly difficult to get the median for three main reasons:
Initially, I tried the simplest approach (bear in mind the correct answer is 50,000 to 60,000)
# median household income
wrong <- pcp %>%
select(income) %>%
filter(income != "Decline to answer") %>%
summarise(med = median(income))
print(wrong)
## med
## 1 $30,000 to $40,000
So that didn’t work
I figured the error came from the fact that the data set was not ordered properly so I sought to fix that in the most inefficient, time-consuming way possible because I have nothing better to do apparently.
# properly arranged
ohboy <- pcp %>%
select(income) %>%
arrange(match(income, c(
"Less than $10,000",
"$10,000 to $20,000",
"$20,000 to $30,000",
"$30,000 to $40,000",
"$40,000 to $50,000",
"$50,000 to $60,000",
"$60,000 to $80,000",
"$80,000 to $100,000",
"$100,000 to $120,000",
"$120,000 to $140,000",
"$140,000 to $160,000",
"$160,000 to $180,000",
"$180,000 to $200,000",
"$200,000 to $220,000",
"$220,000 to $250,000",
"Greater than $250,000"
))) %>%
filter(income != "Decline to answer") %>%
summarise(median = median(income))
print(ohboy)
## median
## 1 $30,000 to $40,000
Imagine wasting all that time just for the same wrong answer
Thankfully, Jenny was able to crack the code and supplied me with a solution which I would not have arrived at even after countless hours of fiddling around.
Initially, I had no idea what was going on, but as I ran the code line by line and googled some of the functions, I was able to understand how they operate to produce the intended result.
# actual median household income
dem_income <- pcp %>%
select(income) %>%
filter(income != "Decline to answer") %>%
separate(income, c("low", "to", "high"), sep = " ") %>%
select(low, high) %>%
mutate(low2 = ifelse(low == "Less", 0,
ifelse(low == "Greater", 275000, low))) %>%
select(-low)
dem_income$high <- gsub('[$]','',dem_income$high)
dem_income$low2 <- gsub('[$]','',dem_income$low2)
dem_income$high <- as.numeric(gsub(',','',dem_income$high))
dem_income$low2 <- as.numeric(gsub(',','',dem_income$low2))
dem_income %>%
mutate(midpoint = (high+low2)/2) %>%
summarise(med_income = median(midpoint))
## med_income
## 1 55000
And there it is
Reproducing this would be no easy feat and I gathered as much within the first couple of hours of trying. Some things I’ve learnt during that time, other than new ways of managing stress, was the very simple fact that one should read the entirety of the relevant material before diving in since there is a high likelihood some of the data may already be there.
I came to this stunning revelation after squandering much time experimenting with new functions such as rowMeans() to try and acquire the means of numerous variables and then bind them all together into one data set when I should have just examined the supplied codebook to realise that those averages were already calculated and inputted into separate columns.
So instead of this:
# frequency of positive emotions
it <- covid %>%
select(f_aff, f_ener, f_acc, f_int, f_calm, f_app,
f_cont, f_qui, f_rela, f_pea, f_reli, f_hap,
f_pro, f_amu, f_joy, f_exc)
# frequency of negative emotions
keeps <- covid %>%
select(f_ang, f_dis, f_sad, f_emb, f_bor, f_anx,
f_fear, f_sha, f_gui, f_irr, f_fru,
f_conc, f_lon)
# intensity of positive emotions
on <- covid %>%
select(i_aff, i_ener, i_acc, i_int, i_calm, i_app,
i_cont, i_qui, i_rela, i_pea, i_reli, i_hap,
i_pro, i_amu, i_joy, i_exc)
# intensity of negative emotions
going <- covid %>%
select(i_ang, i_dis, i_sad, i_emb, i_bor, i_anx,
i_fear, i_sha, i_gui, i_irr, i_fru,
i_conc, i_lon)
# combined average frequency and intensity of positive and negative emotions alongside age
ugh <- covid %>%
select(age) %>%
mutate(
pos_f = rowMeans(it, na.rm = TRUE),
neg_f = rowMeans(keeps, na.rm = TRUE),
pos_i = rowMeans(on, na.rm = TRUE),
neg_i = rowMeans(going, na.rm = TRUE)) %>%
na.omit(ugh)
You simply have this:
# the exact same thing without the redundancies
hmph <- "Covid_Data.csv" %>%
read_csv() %>%
select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>%
na.omit(hmph)
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## ResponseId = col_character(),
## gender_all = col_character(),
## race_all = col_character(),
## educ = col_character(),
## income = col_character(),
## emp = col_character(),
## livealone = col_character(),
## gender_bin_f = col_character(),
## race_bin = col_character(),
## emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
print(hmph)
## # A tibble: 942 x 5
## age avg_i_pos avg_i_neg avg_f_pos avg_f_neg
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18 2.25 2.33 1.81 1.33
## 2 18 1.15 2.36 1.45 2.31
## 3 18 1.81 1.62 2.44 1
## 4 18 2.13 2.17 2.19 1.67
## 5 18 2.25 1.27 2.31 1
## 6 18 1.06 1.85 1.46 2.18
## 7 18 2.38 1.62 2.44 0.923
## 8 18 0.75 2.69 0.833 2.54
## 9 18 2.07 2.23 1.62 2.23
## 10 18 1.81 1.25 2.21 0.727
## # ... with 932 more rows
From here, I assume in order to produce the plot, I need to organise this data in such a way where all the average emotion values are in one column and there is another column specifying intensity or frequency and one more column specifying valence.
To demonstrate what I am talking about here is a professional artistic depiction:
I have an idea about how I am going to convert the current table into that table above but the problem is getting it to work.
I’m thinking that I should use the pivot_longer function to get the avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg columns into one and have the values fall into the mean column. From there, I can use one of the functions that I learnt from Jenny’s code (separate) to separate the frequency/intensity from the positive/negative and with a few modifications, it should produce the table above.
However, as per usual, I encountered an error within the first step when using the pivot_longer function.
And after spending the better part of a day on google trying to troubleshoot, I decided to leave it for another time.
I successfully managed to reproduce most of the demographic descriptives on my own and, with a lot of help from Jenny, managed to finish it off with the median household income.
Some of the challenges were the aforementioned median income as well as the scatterplot, where I thought I had a good plan down but its execution was poorly done as it turns out. I spent much time trying to find solutions, doing things like:
Nothing worked
Since both google and my own brain has failed me, I have elected to submit these problems for the QnA so that some light can be shed on where I went wrong and what the possible solutions may be.