PSY460: Advanced Quantitative Methods

Week #5: Reshaping Data

Today, we’ll practice pivoting data from wide format to long format and back again, and we’ll discuss when and why this is important to do. We’ll then take time to review fundamentals of using R and clear up any remaining questions, and we’ll prepare for writing the first portion of your paper, which is due next week.

Quiz

Imagine a dataset with 10 participants (one per row) and four columns: ID (a number assigned to each participant), and variables indicating what was eaten during Breakfast, Lunch, and Dinner. Now, imagine that you pivot longer, assigning names from the last three columns to “Meal” and values from these columns to “Food” (keeping ID as an identifier column).

  • Write out what the first six rows might look like.
  • How many rows would you have in total?
  • How many columns would you have in total?
  • What is an advantage of restructuring your data like this?

Tidy Data

  • Tidying data is a standardized way of structuring data that typically makes them easier to analyze.
  • When data are tidy, they have the following attributes:
    • Every variable has its own column.
    • Every observation has its own row.
    • Every value has its own cell.

Another Look at the Penguins Data

# As always, we need to load our packages and data.
library(tidyverse) 
library(palmerpenguins)
myownpenguins <- penguins 

# This dataset doesn't label each penguin with a unique ID number,
# so we will add an ID variable ourselves (one integer per row).
ID <- seq_len(nrow(myownpenguins))

# Put ID in front of the original columns.
penguins_withIDs <- cbind(ID, myownpenguins)

Tidying the Penguins Data

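# Keep complete cases, drop columns we don't need, then pivot longer.
tidypenguins <- penguins_withIDs %>%
  # Keep only penguins with no missing values in any column.
  drop_na() %>%
  # Keep ID through sex, leaving out island (and year).
  select(ID:sex, -island) %>%
  # Stack the three measurement columns into name/value pairs.
  pivot_longer(bill_length_mm:flipper_length_mm,
               names_to = "measurement_type",
               values_to = "measurement") %>%
  # Split names like "bill_length_mm" into three columns at each underscore.
  separate(col = measurement_type,
           into = c("part", "dimension", "metric"),
           sep = "_")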
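
To see what each step does, it can help to run the same two operations on a tiny table. The sketch below uses a made-up two-row dataset (toy_wide and its values are hypothetical, chosen only for illustration) and saves the intermediate result so you can inspect it.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two animals, three wide measurement columns.
toy_wide <- tibble(
  ID = 1:2,
  bill_length_mm = c(39.1, 39.5),
  bill_depth_mm = c(18.7, 17.4),
  flipper_length_mm = c(181, 186)
)

# Step 1: stack the three measurement columns into name/value pairs.
toy_long <- toy_wide %>%
  pivot_longer(bill_length_mm:flipper_length_mm,
               names_to = "measurement_type",
               values_to = "measurement")

# Step 2: split names such as "bill_length_mm" at each underscore.
toy_tidy <- toy_long %>%
  separate(col = measurement_type,
           into = c("part", "dimension", "metric"),
           sep = "_")

nrow(toy_tidy)   # 2 animals x 3 measurements = 6 rows
names(toy_tidy)  # ID, part, dimension, metric, measurement
```

Printing toy_long and toy_tidy side by side shows that separate() changes only the columns, not the number of rows.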

Tidied Penguins Data

# A tibble: 999 × 8
      ID species body_mass_g sex    part    dimension metric measurement
   <int> <fct>         <int> <fct>  <chr>   <chr>     <chr>        <dbl>
 1     1 Adelie         3750 male   bill    length    mm            39.1
 2     1 Adelie         3750 male   bill    depth     mm            18.7
 3     1 Adelie         3750 male   flipper length    mm           181  
 4     2 Adelie         3800 female bill    length    mm            39.5
 5     2 Adelie         3800 female bill    depth     mm            17.4
 6     2 Adelie         3800 female flipper length    mm           186  
 7     3 Adelie         3250 female bill    length    mm            40.3
 8     3 Adelie         3250 female bill    depth     mm            18  
 9     3 Adelie         3250 female flipper length    mm           195  
10     5 Adelie         3450 female bill    length    mm            36.7
# ℹ 989 more rows

Pivoting Wider

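# Drop the constant "metric" column, then spread each part/dimension
# pair back into its own column (one row per penguin again).
widepenguins <- tidypenguins %>%
  select(-metric) %>%
  pivot_wider(names_from = c(part, dimension),
              values_from = measurement)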
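
The same pivot can be tried on a small long-format table. This is a minimal sketch with made-up values (toy_tidy is hypothetical); it shows how the new column names are built by joining part and dimension with an underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: three rows per animal.
toy_tidy <- tibble(
  ID = rep(1:2, each = 3),
  part = rep(c("bill", "bill", "flipper"), times = 2),
  dimension = rep(c("length", "depth", "length"), times = 2),
  metric = "mm",
  measurement = c(39.1, 18.7, 181, 39.5, 17.4, 186)
)

# Spread each part/dimension pair into its own column.
toy_wide <- toy_tidy %>%
  select(-metric) %>%
  pivot_wider(names_from = c(part, dimension),
              values_from = measurement)

dim(toy_wide)    # 2 rows (one per animal), 4 columns
names(toy_wide)  # ID, bill_length, bill_depth, flipper_length
```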

Back to the Beginning

# A tibble: 333 × 7
      ID species body_mass_g sex    bill_length bill_depth flipper_length
   <int> <fct>         <int> <fct>        <dbl>      <dbl>          <dbl>
 1     1 Adelie         3750 male          39.1       18.7            181
 2     2 Adelie         3800 female        39.5       17.4            186
 3     3 Adelie         3250 female        40.3       18              195
 4     5 Adelie         3450 female        36.7       19.3            193
 5     6 Adelie         3650 male          39.3       20.6            190
 6     7 Adelie         3625 female        38.9       17.8            181
 7     8 Adelie         4675 male          39.2       19.6            195
 8    13 Adelie         3200 female        41.1       17.6            182
 9    14 Adelie         3800 male          38.6       21.2            191
10    15 Adelie         4400 male          34.6       21.1            198
# ℹ 323 more rows

An important caveat about tidying data

  • In many cases, tidy data will not have independent observations across rows.
    • If more than one observation is collected from participants, these observations will be spread across rows, such that the same person accounts for multiple datapoints. Additionally, multiple rows may be clustered in terms of other shared variables (e.g., neighborhoods or families or schools).
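
One quick way to check whether rows share a participant is to count rows per ID. The sketch below uses a hypothetical dataset (toy_long, with made-up scores from two sessions per person); any count above 1 means that person contributes more than one row.

```r
library(dplyr)

# Hypothetical long-format data: each participant measured in two sessions.
toy_long <- tibble(
  ID = rep(1:3, each = 2),
  session = rep(c("pre", "post"), times = 3),
  score = c(10, 12, 8, 9, 15, 15)
)

# Count how many rows each participant contributes.
rows_per_person <- toy_long %>%
  count(ID)

rows_per_person  # n = 2 for every ID, so rows are not independent
```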

How to analyze tidy data

  • If observations in a dataset are not independent, this must be accounted for statistically; otherwise, the false-positive (Type I error) rate will tend to be greatly inflated.
    • Mixed effects models (also called linear mixed models, multilevel models, or hierarchical linear models) are ideal in these cases.
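
As a rough illustration, here is what a simple mixed effects model might look like in R. This is a sketch under several assumptions: it uses the lme4 package (one common choice, not named in these slides), the data are simulated, and the variable names (sim, x, y, person_effect) are hypothetical. The term (1 | ID) gives each participant their own intercept, which is one way of accounting for repeated rows from the same person.

```r
library(lme4)

set.seed(460)
n_people <- 30  # hypothetical number of participants
n_obs <- 4      # hypothetical number of rows per participant

# Simulated long-format data: several rows per participant.
id <- rep(seq_len(n_people), each = n_obs)
sim <- data.frame(
  ID = factor(id),
  x = rnorm(n_people * n_obs)
)

# Each person has a stable baseline, so their rows resemble each other.
person_effect <- rnorm(n_people, sd = 2)
sim$y <- 1 + 0.5 * sim$x + person_effect[id] + rnorm(nrow(sim))

# Random intercept for each participant: (1 | ID)
fit <- lmer(y ~ x + (1 | ID), data = sim)
summary(fit)
```

A plain lm(y ~ x, data = sim) would treat all 120 rows as independent, which is the situation the bullet above warns about.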

What questions do you still have about how to use R?

Writing an Introduction section

  • Your aim in this section is to provide ample support for your hypotheses.
    • You should demonstrate how your hypotheses are clearly supported by the relevant literature, while still filling a gap in the literature.
  • The introduction should have the shape of a funnel; it should begin with a broad research question and narrow down to the specific hypotheses.

Writing a Method section

  • Your aim in this section is to spell out all of the relevant details of your sample and study design, such that any other researcher could follow the “recipe” you describe to replicate your procedure.
  • Even though you are not collecting your own data, you should describe the participant experience as fully as possible, and indicate how your variables of interest fit into the larger study that was being conducted.

Groupwork time!