PSY460: Advanced Quantitative Methods

Week #4: Data Wrangling

Today, we’ll expand upon some of the R techniques we began to cover last week, in the context of testing a research question. Then, we’ll talk about why it’s important to plan data analysis ahead of time. At the end of class, we’ll have a brainstorming session for each team to get feedback from the rest of the class. If time allows, you’ll also work in teams to begin formulating a plan for analyzing your data.

More Opportunities for Bonus Tokens

  • February 25 @ 11:30am: “Moral snowballing”
  • March 17 @ 4:45pm: “Leveraging students’ and instructors’ beliefs about students’ abilities to improve biology undergraduates’ outcomes”

Quiz

  • Why would you want to manipulate data? List two possible reasons.
  • What is the difference between filter and select in dplyr?
  • What is the difference between summarize (for sum scores) in dplyr and rowSums in base R?

First Steps to Testing a Research Question

  • Most of the work in analyzing data comes before running inferential statistics.
    • First, the data must be “cleaned” or “manipulated” in a way that allows for the right analyses to be conducted.
    • Second, it is important to thoroughly describe your data before attempting to fit any statistical models.

Testing a Very, Very Important Research Question

  • Does the Ratio of Bill Length to Flipper Length Differ Across Penguin Species and Sexes?

Loading Packages and Data

library(tidyverse) # This gives you access to dplyr and more.
library(palmerpenguins) # We'll reuse the "penguins" dataset.
library(magrittr) # This gives you the assignment pipe, %<>%.

myownpenguins <- penguins
# This copies the penguins data into an object in your environment.

Reducing the Dataset for Analysis

  • It is very common for datasets to contain missing data.
    • Therefore, researchers must decide how to deal with NAs, ideally before data analysis begins. One option is to remove cases with NAs.
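# Count the number of missing values in each row.
myownpenguins$NA_count <- rowSums(is.na(myownpenguins))

# Keep only the rows with no missing values.
penguins_noNA <- filter(myownpenguins, NA_count == 0)

# Drop the helper column; %<>% pipes and assigns the result back.
penguins_noNA %<>% select(-NA_count)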
glimpse(penguins_noNA)
Rows: 333
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex               <fct> male, female, female, female, male, female, male, fe…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
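
A possible shortcut for the same step, offered as a sketch rather than the approach used in class: tidyr's drop_na() removes every row that contains at least one NA, and colSums(is.na(...)) shows where the missing values are before you decide what to drop. The object names na_per_column, penguins_complete, and penguins_complete_base are illustrative.

```r
library(palmerpenguins) # provides the penguins dataset
library(tidyr)          # provides drop_na(); also loaded by tidyverse

# Count missing values per column before deciding how to handle them.
na_per_column <- colSums(is.na(penguins))
print(na_per_column)

# Drop every row that has at least one NA in any column.
penguins_complete <- drop_na(penguins)

# Base R equivalent, shown for comparison.
penguins_complete_base <- penguins[complete.cases(penguins), ]

# Both approaches should leave the same number of rows.
nrow(penguins_complete)
nrow(penguins_complete_base)
```

Whichever route you choose, the key point is to make the choice before looking at the results of your analysis.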

Creating a New Variable

  • Testing our hypothesis requires creating a new variable to serve as the outcome of interest.
# Divide bill length by flipper length to get the outcome variable.
penguins_noNA %<>% mutate(bill_to_flipper_ratio = 
                            bill_length_mm / flipper_length_mm)
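
As a quick check on what this step does, the sketch below recomputes the ratio for two bill and flipper measurements that appear in the output on these slides. It also shows that %<>% is shorthand for piping and then assigning back to the same object. The toy tibble and the object names are illustrative only.

```r
library(dplyr)
library(magrittr)

# Two bill/flipper pairs copied from output shown on these slides.
toy <- tibble(bill_length_mm = c(39.1, 58),
              flipper_length_mm = c(181L, 181L))

# Explicit form: pipe, then assign the result to a name.
toy_explicit <- toy %>%
  mutate(bill_to_flipper_ratio = bill_length_mm / flipper_length_mm)

# Assignment-pipe form: %<>% pipes and overwrites toy in one step.
toy %<>% mutate(bill_to_flipper_ratio = bill_length_mm / flipper_length_mm)

# The two forms should give the same ratios.
all.equal(toy$bill_to_flipper_ratio, toy_explicit$bill_to_flipper_ratio)
round(toy$bill_to_flipper_ratio, 3)
```

Because %<>% overwrites the object on its left, rerunning a line like this on an already-modified object can change your data again, so it is worth rerunning scripts from the top when in doubt.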

Describing the Data

  • Before fitting a model, it is critical to get a sense of the shape of the data. You should also decide how to handle outliers before running any analyses.
hist(penguins_noNA$bill_length_mm)
hist(penguins_noNA$flipper_length_mm)
hist(penguins_noNA$bill_to_flipper_ratio)

Inspecting Outliers

# Show the six largest ratios to check for implausible values.
penguins_noNA %>% 
  select(bill_length_mm, flipper_length_mm, 
         bill_to_flipper_ratio) %>% 
  arrange(desc(bill_to_flipper_ratio)) %>% 
  head()
# A tibble: 6 × 3
  bill_length_mm flipper_length_mm bill_to_flipper_ratio
           <dbl>             <int>                 <dbl>
1           58                 181                 0.320
2           51.5               187                 0.275
3           54.2               201                 0.270
4           55.8               207                 0.270
5           52.7               197                 0.268
6           51.7               194                 0.266
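
One way to turn the outlier decision into an explicit rule, as a sketch: flag any value that lies more than 1.5 interquartile ranges below the first quartile or above the third quartile. The 1.5 multiplier is a common convention, not something these slides prescribe, and flag_outliers_iqr() and the example ratios are hypothetical.

```r
# Hypothetical helper: TRUE for values outside the 1.5 * IQR fences.
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - k * iqr | x > q[[2]] + k * iqr
}

# Made-up ratios; only the last one is far from the rest.
ratios <- c(0.20, 0.21, 0.22, 0.23, 0.24, 0.32)
flags <- flag_outliers_iqr(ratios)
flags
```

Flagging a value is not the same as removing it. Whether flagged cases are kept, removed, or analyzed both ways is a separate decision to make ahead of time.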

Summarizing the Data

You can generally obtain a rough assessment of your hypothesis through summary statistics.

# Multiply by 100 so the ratios are easier to read.
penguins_noNA %>% 
  group_by(species, sex) %>% 
  summarize(mean_ratio = mean(bill_to_flipper_ratio*100), 
            sd_ratio = sd(bill_to_flipper_ratio*100), .groups = "keep")
# A tibble: 6 × 4
# Groups:   species, sex [6]
  species   sex    mean_ratio sd_ratio
  <fct>     <fct>       <dbl>    <dbl>
1 Adelie    female       19.9    1.27 
2 Adelie    male         21.0    1.17 
3 Chinstrap female       24.3    1.75 
4 Chinstrap male         25.6    0.981
5 Gentoo    female       21.4    0.966
6 Gentoo    male         22.3    1.04 
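
The .groups argument controls whether the result stays grouped. A minimal sketch with made-up ratios (the toy tibble is hypothetical, not penguin data) shows that .groups = "drop" returns an ungrouped table, which can be simpler to reuse in later steps than the grouped result from "keep".

```r
library(dplyr)

# Made-up ratios for two hypothetical species.
toy <- tibble(species = c("A", "A", "B", "B"),
              ratio   = c(0.20, 0.22, 0.24, 0.26))

toy_summary <- toy %>%
  group_by(species) %>%
  summarize(mean_ratio = mean(ratio * 100),
            sd_ratio   = sd(ratio * 100),
            .groups = "drop") # return a plain, ungrouped tibble

toy_summary
is_grouped_df(toy_summary)
```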

Fitting a Model

Compared to the work that goes into data wrangling, the culminating analysis is (typically) relatively trivial to run.

# Regress the ratio on sex and species (main effects only).
summary(lm(bill_to_flipper_ratio ~ sex + species, data = penguins_noNA))

Call:
lm(formula = bill_to_flipper_ratio ~ sex + species, data = penguins_noNA)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.034988 -0.007340  0.000028  0.006569  0.076452 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.198885   0.001182 168.210  < 2e-16 ***
sexmale          0.010850   0.001306   8.310 2.54e-15 ***
speciesChinstrap 0.045105   0.001749  25.792  < 2e-16 ***
speciesGentoo    0.014440   0.001471   9.815  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01191 on 329 degrees of freedom
Multiple R-squared:  0.6906,    Adjusted R-squared:  0.6878 
F-statistic: 244.8 on 3 and 329 DF,  p-value: < 2.2e-16
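
To read the coefficients, note that female Adelie penguins are the reference group, so the intercept is their predicted ratio, and the prediction for any other group adds the relevant coefficients. The sketch below does this arithmetic for male Chinstraps using the estimates printed above; the variable names are illustrative.

```r
# Estimates copied from the model summary above.
intercept         <- 0.198885
sex_male          <- 0.010850
species_chinstrap <- 0.045105

# Predicted ratio for a male Chinstrap under the additive model.
pred_male_chinstrap <- intercept + sex_male + species_chinstrap

# Put the prediction on the same x100 scale as the summary table.
pred_male_chinstrap * 100
```

The result is about 25.5 on the x100 scale, close to the observed mean of 25.6 for male Chinstraps in the summary table. Because the model has no sex-by-species interaction, its predictions need not match the group means exactly.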

Preregistration

  • As psychologists have become more sensitive to problematic practices like p-hacking, many now commit to particular analyses before looking at any data. This commitment is called a preregistration.
    • Preregistration increases the credibility and transparency of research, and tends to improve how the work is regarded.
    • Preregistrations typically involve decisions beyond the choice of statistical model (e.g., exclusion criteria, creation of new variables).
    • See Canvas for a couple of examples.
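
One way to make such decisions concrete, as a hypothetical sketch rather than a required format: write the exclusion rule and the new-variable definition as a function before opening the real data, and try it on made-up values. apply_prereg_plan() and the toy data frame are illustrative names and values.

```r
# Hypothetical plan: exclude incomplete cases, then compute the outcome.
apply_prereg_plan <- function(data) {
  complete <- data[complete.cases(data), , drop = FALSE]
  complete$bill_to_flipper_ratio <-
    complete$bill_length_mm / complete$flipper_length_mm
  complete
}

# Made-up values; the second row has a missing bill length.
toy <- data.frame(bill_length_mm    = c(40, NA, 50),
                  flipper_length_mm = c(200, 190, 200))

planned <- apply_prereg_plan(toy)
planned
```

Writing the plan as code in advance means the same steps can later be run on the real data without further decisions.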

Take ten minutes to regroup with your teammates, and then we can share ideas as a full class!