The Goal

After finishing all the coding tutorials, I decided this week I should make a start on the COVID paper dataset and try to reproduce the demographic descriptives and, if I was feeling overly ambitious, I could try my hand at helping a fellow group member reproduce the scatterplot, seeing as it looked extremely difficult (to which I can now say I regret trying)

For reference, the demographic statistics and scatterplot are below:

Progress

Demographic Descriptives

All of these (with one notable exception) were easy enough. However, there probably is a simpler and better way of calculating percentages but this is a start for now.

# load packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# extract data 

covid <- "Covid_Data.csv" %>% 
  read.csv()

# create relevant data set

pcp <- covid %>% 
  select(age, gender_bin_f, race_bin, educ, income, emp_bin, livealone)

# mean and standard deviation of participant age

pcp %>% 
  summarise(
    mean_age = mean(age),
    sd_age = sd(age)
  ) 
##   mean_age   sd_age
## 1 45.15238 16.78756
# age range of participants

pcp$age %>% range()
## [1] 18 76
# Percentage of females 

one <- pcp %>% 
  select(gender_bin_f) %>% 
  filter(gender_bin_f == "Female")

100 * count(one) / count(pcp)
##          n
## 1 49.20635
# percent white

two <- pcp %>%
  select(race_bin) %>% 
  filter(race_bin == "White")

100 * count(two) / count(pcp)
##          n
## 1 75.44974
# working for pay

pay <- pcp %>%
  select(emp_bin) %>% 
  filter(emp_bin == "Working")

100 * count(pay) / count(pcp)
##          n
## 1 56.19048
# educated

smort <- pcp %>%
  select(educ) %>% 
  filter(educ != "Graduated high school (or GED)", educ != "Less than high school")

100 * count(smort) / count(pcp)
##          n
## 1 87.72487
# lonely

alone <- pcp %>% 
  select(livealone) %>% 
  filter(livealone == "Yes")

100 * count(alone) / count(pcp)
##          n
## 1 22.96296

Median Household Income

Now, this one gets its own section as it was one of two major sources of frustration for this week (the other being the scatterplot). It was particularly difficult to get the median for three main reasons:

  • the data was NOT in order
  • it was extremely tedious to get it in order
  • I am not exactly the sharpest tool in the shed

Initially, I tried the simplest approach (bear in mind the correct answer is 50,000 to 60,000)

# median household income 

wrong <- pcp %>% 
  select(income) %>% 
  filter(income != "Decline to answer") %>% 
  summarise(med = median(income))

print(wrong)
##                  med
## 1 $30,000 to $40,000

So that didn’t work

I figured the error came from the fact that the data set was not ordered properly so I sought to fix that in the most inefficient, time-consuming way possible because I have nothing better to do apparently.

# properly arranged 

ohboy <- pcp %>% 
  select(income) %>% 
  arrange(match(income, c(
    "Less than $10,000", 
    "$10,000 to $20,000",
    "$20,000 to $30,000",
    "$30,000 to $40,000",
    "$40,000 to $50,000",
    "$50,000 to $60,000",
    "$60,000 to $80,000",
    "$80,000 to $100,000",
    "$100,000 to $120,000",
    "$120,000 to $140,000",
    "$140,000 to $160,000",
    "$160,000 to $180,000",
    "$180,000 to $200,000",
    "$200,000 to $220,000",
    "$220,000 to $250,000",
    "Greater than $250,000"
  ))) %>% 
  filter(income != "Decline to answer") %>% 
  summarise(median = median(income))

print(ohboy)
##               median
## 1 $30,000 to $40,000

Imagine wasting all that time just for the same wrong answer

Thankfully, Jenny was able to crack the code and supplied me with a solution which I would not have arrived at even after countless hours of fiddling around.

Initially, I had no idea what was going on, but as I ran the code line by line and googled some of the functions, I was able to understand how they operate to produce the intended result.

# actual median household income

dem_income <- pcp %>%
  select(income) %>% 
  filter(income != "Decline to answer") %>% 
  separate(income, c("low", "to", "high"), sep = " ") %>%
  select(low, high) %>%
  mutate(low2 = ifelse(low == "Less", 0,
                       ifelse(low == "Greater", 275000, low))) %>% 
  select(-low)

dem_income$high <- gsub('[$]','',dem_income$high) 

dem_income$low2 <- gsub('[$]','',dem_income$low2)

dem_income$high <- as.numeric(gsub(',','',dem_income$high))

dem_income$low2 <- as.numeric(gsub(',','',dem_income$low2))

dem_income %>%
  mutate(midpoint = (high+low2)/2) %>%
  summarise(med_income = median(midpoint))
##   med_income
## 1      55000

And there it is

The Scatterplot

Reproducing this would be no easy feat and I gathered as much within the first couple of hours of trying. Some things I’ve learnt during that time, other than new ways of managing stress, was the very simple fact that one should read the entirety of the relevant material before diving in since there is a high likelihood some of the data may already be there.

I came to this stunning revelation after squandering much time experimenting with new functions such as rowMeans() to try and acquire the means of numerous variables and then bind them all together into one data set when I should have just examined the supplied codebook to realise that those averages were already calculated and inputted into separate columns.

So instead of this:

# frequency of positive emotions 

it <- covid %>% 
  select(f_aff, f_ener, f_acc, f_int, f_calm, f_app,
         f_cont, f_qui, f_rela, f_pea, f_reli, f_hap,
         f_pro, f_amu, f_joy, f_exc)

# frequency of negative emotions

keeps <- covid %>% 
  select(f_ang, f_dis, f_sad, f_emb, f_bor, f_anx,
         f_fear, f_sha, f_gui, f_irr, f_fru,
         f_conc, f_lon)

# intensity of positive emotions

on <- covid %>% 
  select(i_aff, i_ener, i_acc, i_int, i_calm, i_app,
         i_cont, i_qui, i_rela, i_pea, i_reli, i_hap,
         i_pro, i_amu, i_joy, i_exc)

# intensity of negative emotions 

going <- covid %>% 
  select(i_ang, i_dis, i_sad, i_emb, i_bor, i_anx,
         i_fear, i_sha, i_gui, i_irr, i_fru,
         i_conc, i_lon)

# combined average frequency and intensity of positive and negative emotions alongside age 

ugh <- covid %>% 
  select(age) %>% 
  mutate(
    pos_f = rowMeans(it, na.rm = TRUE),
    neg_f = rowMeans(keeps, na.rm = TRUE),
    pos_i = rowMeans(on, na.rm = TRUE),
    neg_i = rowMeans(going, na.rm = TRUE)) %>% 
  na.omit(ugh)

You simply have this:

# the exact same thing without the redundancies 

hmph <- "Covid_Data.csv" %>% 
  read_csv() %>% 
  select(age, avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg) %>% 
  na.omit(hmph) 
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   ResponseId = col_character(),
##   gender_all = col_character(),
##   race_all = col_character(),
##   educ = col_character(),
##   income = col_character(),
##   emp = col_character(),
##   livealone = col_character(),
##   gender_bin_f = col_character(),
##   race_bin = col_character(),
##   emp_bin = col_character()
## )
## i Use `spec()` for the full column specifications.
print(hmph)
## # A tibble: 942 x 5
##      age avg_i_pos avg_i_neg avg_f_pos avg_f_neg
##    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1    18      2.25      2.33     1.81      1.33 
##  2    18      1.15      2.36     1.45      2.31 
##  3    18      1.81      1.62     2.44      1    
##  4    18      2.13      2.17     2.19      1.67 
##  5    18      2.25      1.27     2.31      1    
##  6    18      1.06      1.85     1.46      2.18 
##  7    18      2.38      1.62     2.44      0.923
##  8    18      0.75      2.69     0.833     2.54 
##  9    18      2.07      2.23     1.62      2.23 
## 10    18      1.81      1.25     2.21      0.727
## # ... with 932 more rows

From here, I assume in order to produce the plot, I need to organise this data in such a way where all the average emotion values are in one column and there is another column specifying intensity or frequency and one more column specifying valence.

To demonstrate what I am talking about here is a professional artistic depiction:

I have an idea about how I am going to convert the current table into that table above but the problem is getting it to work.

I’m thinking that I should use the pivot_longer function to get the avg_i_pos, avg_i_neg, avg_f_pos, avg_f_neg columns into one and have the values fall into the mean column. From there, I can use one of the functions that I learnt from Jenny’s code (separate) to separate the frequency/intensity from the positive/negative and with a few modifications, it should produce the table above.

However, as per usual, I encountered an error within the first step when using the pivot_longer function.

And after spending the better part of a day on google trying to troubleshoot, I decided to leave it for another time.

Challenges/Successes

I successfully managed to reproduce most of the demographic descriptives on my own and, with a lot of help from Jenny, managed to finish it off with the median household income.

Some of the challenges were the aforementioned median income as well as the scatterplot, where I thought I had a good plan down but its execution was poorly done as it turns out. I spent much time trying to find solutions, doing things like:

Nothing worked

Next steps

Since both google and my own brain has failed me, I have elected to submit these problems for the QnA so that some light can be shed on where I went wrong and what the possible solutions may be.