PSY460: Advanced Quantitative Methods

Week #4: Data Wrangling

Today, we’ll expand upon some of the R techniques we began to cover last week, in the context of testing a research question. Then, we’ll talk about why it’s important to plan data analysis ahead of time. At the end of class, we’ll have a brainstorming session for each team to get feedback from the rest of the class. If time allows, you’ll also work in teams to begin formulating a plan for analyzing your data.

More Opportunities for Bonus Tokens

February 25 @ 11:30am: “Moral snowballing”
March 17 @ 4:45pm: “Leveraging students’ and instructors’ beliefs about students’ abilities to improve biology undergraduates’ outcomes”

Quiz

Why would you want to manipulate data? List two possible reasons.
What is the difference between filter and select in dplyr?
What is the difference between summarize in dplyr (for sum scores) and rowSums in base R?

First Steps to Testing a Research Question

Most of the work in analyzing data comes before running inferential statistics.
- First, the data must be “cleaned” or “manipulated” in a way that allows for the right analyses to be conducted.
- Second, it is important to thoroughly describe your data before attempting to fit any statistical models.

Testing a Very, Very Important Research Question

Does the Ratio of Bill Length to Flipper Length Differ Across Penguin Species and Sexes?

Loading Packages and Data

library(tidyverse) # This gives you access to dplyr and more.
library(palmerpenguins) # We'll reuse the "penguins" dataset.
library(magrittr) # This provides the assignment pipe, %<>%.

myownpenguins <- penguins
# This copies the penguins data into your environment as myownpenguins.

Reducing the Dataset for Analysis

It is very common for datasets to contain missing data.
- Therefore, researchers must decide how to deal with NAs, ideally before data analysis begins. One option is to remove cases with NAs.

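# Count the number of missing values in each row.
myownpenguins$NA_count <- rowSums(is.na(myownpenguins))

# Keep only the rows with no missing values.
penguins_noNA <- filter(myownpenguins, NA_count == 0)

# Drop the helper column; %<>% assigns the result back to penguins_noNA.
penguins_noNA %<>% select(-NA_count)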

glimpse(penguins_noNA)

Rows: 333
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex               <fct> male, female, female, female, male, female, male, fe…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
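
The three steps above (count NAs per row, filter, drop the helper column) can also be collapsed into a single call. The following is a minimal sketch, assuming the tidyr package is installed (it ships with the tidyverse). The names na_per_column and penguins_complete are illustrative and are not used elsewhere in these notes.

```r
library(tidyr)
library(palmerpenguins)

# Count missing values in each column, to see where the NAs are
# before deciding how to handle them.
na_per_column <- colSums(is.na(penguins))
print(na_per_column)

# Drop every row that contains at least one NA, in one step.
penguins_complete <- drop_na(penguins)

# The row count should match the glimpse() output above.
nrow(penguins_complete)
```

Checking the per-column counts first is useful because it shows which variables drive the exclusions, which is the kind of decision worth writing down before analysis begins.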

Creating a New Variable

Testing our hypothesis requires creating a new variable to serve as the outcome of interest.

penguins_noNA %<>% mutate(bill_to_flipper_ratio = 
                            bill_length_mm/flipper_length_mm)
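
The %<>% operator pipes an object forward and then assigns the result back to the same name. For comparison, here is a sketch of the same step written with an explicit assignment. The name penguins_ratio is hypothetical, chosen so that the complete-case data is not overwritten, and drop_na() from tidyr is assumed here only to rebuild the data so the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Recreate the complete-case data so this sketch is self-contained.
penguins_noNA <- drop_na(penguins)

# Same mutate() call as above, but assigned to a new object explicitly
# instead of overwriting penguins_noNA with %<>%.
penguins_ratio <- penguins_noNA %>%
  mutate(bill_to_flipper_ratio = bill_length_mm / flipper_length_mm)

# Quick sanity check: a ratio of two lengths should always be positive.
summary(penguins_ratio$bill_to_flipper_ratio)
```

Keeping the original object untouched makes it easier to rerun a single step without reloading the data, at the cost of a few more objects in the environment.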

Describing the Data

Before fitting a model, it is critical to get a sense of the shape of the data. You should also decide how to handle outliers before running any analyses.

hist(penguins_noNA$bill_length_mm)

hist(penguins_noNA$flipper_length_mm)

hist(penguins_noNA$bill_to_flipper_ratio)

Inspecting Outliers

penguins_noNA %>% 
  select(bill_length_mm, flipper_length_mm, 
         bill_to_flipper_ratio) %>% 
  arrange(desc(bill_to_flipper_ratio)) %>% head()

# A tibble: 6 × 3
  bill_length_mm flipper_length_mm bill_to_flipper_ratio
           <dbl>             <int>                 <dbl>
1           58                 181                 0.320
2           51.5               187                 0.275
3           54.2               201                 0.270
4           55.8               207                 0.270
5           52.7               197                 0.268
6           51.7               194                 0.266
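
Sorting the data shows which values are extreme, but an outlier decision also needs an explicit rule. The sketch below flags observations whose ratio lies more than 3 standard deviations from the overall mean. The 3 SD cutoff is an assumption made for illustration, not a course requirement, and the names penguins_flagged, ratio_z, and is_outlier are hypothetical.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_flagged <- penguins %>%
  drop_na() %>%
  mutate(
    bill_to_flipper_ratio = bill_length_mm / flipper_length_mm,
    # Standardize the ratio: distance from the overall mean in SD units.
    ratio_z = (bill_to_flipper_ratio - mean(bill_to_flipper_ratio)) /
      sd(bill_to_flipper_ratio),
    # Assumed rule for illustration: flag anything beyond 3 SDs.
    is_outlier = abs(ratio_z) > 3
  )

# List the flagged cases so each one can be inspected before deciding
# whether to keep, exclude, or otherwise handle it.
penguins_flagged %>%
  filter(is_outlier) %>%
  select(species, sex, bill_to_flipper_ratio, ratio_z)
```

Flagging rather than deleting keeps the decision reversible: the model can be fit with and without the flagged rows to see whether the conclusions change.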

Summarizing the Data

You can generally obtain a rough assessment of your hypothesis through summary statistics.

penguins_noNA %>%
  group_by(species, sex) %>%
  # Multiply by 100 so each ratio reads as a percentage of flipper length.
  summarize(mean_ratio = mean(bill_to_flipper_ratio * 100),
            sd_ratio = sd(bill_to_flipper_ratio * 100), .groups = "keep")

# A tibble: 6 × 4
# Groups:   species, sex [6]
  species   sex    mean_ratio sd_ratio
  <fct>     <fct>       <dbl>    <dbl>
1 Adelie    female       19.9    1.27 
2 Adelie    male         21.0    1.17 
3 Chinstrap female       24.3    1.75 
4 Chinstrap male         25.6    0.981
5 Gentoo    female       21.4    0.966
6 Gentoo    male         22.3    1.04
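
Means and standard deviations are easier to judge when you also know how many penguins are in each group. The following sketch extends the summary above with group sizes and standard errors. The names penguins_summary, n_penguins, and se_ratio are illustrative, and the data is rebuilt with drop_na() from tidyr only so that the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_summary <- penguins %>%
  drop_na() %>%
  mutate(bill_to_flipper_ratio = bill_length_mm / flipper_length_mm) %>%
  group_by(species, sex) %>%
  summarize(
    n_penguins = n(),                              # group size
    mean_ratio = mean(bill_to_flipper_ratio * 100),
    sd_ratio   = sd(bill_to_flipper_ratio * 100),
    # Standard error of the mean: SD divided by the square root of n.
    se_ratio   = sd_ratio / sqrt(n_penguins),
    .groups = "drop"
  )

penguins_summary
```

Small or very unequal group sizes would be a reason to interpret differences between group means more cautiously.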

Fitting a Model

Compared with the work that goes into data wrangling, the final analysis is typically quick to run.

summary(lm(bill_to_flipper_ratio ~ sex + species, data = penguins_noNA))


Call:
lm(formula = bill_to_flipper_ratio ~ sex + species, data = penguins_noNA)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.034988 -0.007340  0.000028  0.006569  0.076452 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.198885   0.001182 168.210  < 2e-16 ***
sexmale          0.010850   0.001306   8.310 2.54e-15 ***
speciesChinstrap 0.045105   0.001749  25.792  < 2e-16 ***
speciesGentoo    0.014440   0.001471   9.815  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01191 on 329 degrees of freedom
Multiple R-squared:  0.6906,    Adjusted R-squared:  0.6878 
F-statistic: 244.8 on 3 and 329 DF,  p-value: < 2.2e-16
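
One way to read the coefficients is to turn them into a predicted ratio for each sex-by-species combination. The intercept is the prediction for the reference group (female Adelie penguins), and each coefficient is added on top of it. For example, the predicted ratio for male Adelie penguins is 0.198885 + 0.010850 = 0.209735. The sketch below does this for all six combinations with predict(). The names penguins_ratio, ratio_model, and cells are hypothetical.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Rebuild the analysis data so this sketch is self-contained.
penguins_ratio <- penguins %>%
  drop_na() %>%
  mutate(bill_to_flipper_ratio = bill_length_mm / flipper_length_mm)

# Same model as above, stored so it can be reused.
ratio_model <- lm(bill_to_flipper_ratio ~ sex + species,
                  data = penguins_ratio)

# One row per sex-by-species combination.
cells <- expand.grid(
  sex = levels(penguins_ratio$sex),
  species = levels(penguins_ratio$species)
)

# Model-based prediction for each combination.
cells$predicted_ratio <- predict(ratio_model, newdata = cells)
cells
```

Because this model is additive (it has no sex-by-species interaction), its predictions need not match the observed group means in the summary table exactly.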

Preregistration

As psychologists have become more aware of problematic practices like p-hacking, many now commit to particular analyses before looking at any data. This commitment is called a preregistration.
- Preregistration can increase the credibility and transparency of research, as well as how favorably it is received.
- Preregistrations typically cover decisions beyond the choice of statistical model (e.g., exclusion criteria, creation of new variables).
- See Canvas for a couple of examples.