Objectives
By the end of this assignment, you should:
- understand the concept of “cumulative science”
- be able to identify the type of a variable
- understand how to use the pipe operator (%>%)
- understand how to isolate data (select, filter, arrange)
- understand how to count rows in a data frame (nrow)
This assignment is due Thursday, January 30 at noon. Please submit both your .html AND .Rmd files on Canvas. Your .Rmd file should knit without errors before you turn in the assignment.
We’ll do the first few exercises in lab. They concern a dataset called babynames, which is included in the “babynames” package.
- Alter the code to select just the n column:
select(babynames, name, prop)
- Use logical operators to modify the code below to show: [a] all of the names where prop is greater than or equal to 0.08, [b] all of the children named “Sea”, [c] all of the names that have a missing value for n (Hint: this should return an empty data set).
filter(babynames, name == "Garrett")
- Arrange babynames by n. Add prop as a second (tie-breaking) variable to arrange on. What is the smallest value of n?
- Use %>% to write a sequence of functions that: 1. Filters babynames to just the girls who were born in 2015. 2. Selects the name and n columns. 3. Arranges the results so that the most popular names are near the top.
The next few exercises will focus on data from the Lewis & Frank (2018) replication of the Xu and Tenenbaum (2007) experiment that we talked about in lecture. We’ll be working with data from the first experiment only. For reference, the published write-up of this study can be found here, and you can see the actual experiment that participants saw here.
The data are in a file called lewis_2018_exp1.csv. We can start by loading the data with the read_csv() function and saving it to a variable called lf_data:
lf_data <- read_csv("data/lewis_2018_exp1.csv")
This data frame is tidy, meaning each column is a variable and each row is an observation. In this case, each observation is a unique participant and trial combination. There are six variables in the data and each variable is described below. The first six rows of the data frame are also displayed below.
- exp - Experiment number. Lewis & Frank (2018) reported 12 experiments; the present dataset only includes the data from the first experiment.
- subids - Subject ID. This is an anonymous id that uniquely identifies every participant in the study.
- trial_num - Each participant completed 12 “trials.” In this case, a trial is a single screen where the participant sees a novel word, one or more examples, and then is asked to click on other examples of the novel word.
- category - There were three different categories of objects: vehicles, vegetables, and animals. Each participant saw some trials from each category.
- condition - This is the variable that we manipulated. It refers to the number and type of examples of the novel word participants saw at the top of the page. Participants saw either 3 subordinate examples (“three_subordinate”; e.g., 3 dalmatians), 3 basic level examples (“three_basic”; e.g., a dalmatian, a poodle, and a Bernese mountain dog), 3 superordinate examples (“three_superordinate”; e.g., a dalmatian, a rabbit, and a horse), or just a single example (“one”; e.g., 1 dalmatian).
- proportion_basic_level_responses - This is the variable that we measured. It refers to the proportion (out of 2 possible) of basic level examples that a participant selected.
| exp | subids | trial_num | category | condition | proportion_basic_level_responses |
|---|---|---|---|---|---|
| 1 | 1 | 9 | vehicles | three_subordinate | 0 |
| 1 | 2 | 9 | animals | three_basic | 1 |
| 1 | 3 | 9 | animals | three_superordinate | 1 |
| 1 | 4 | 9 | vehicles | three_superordinate | 1 |
| 1 | 5 | 9 | animals | three_superordinate | 1 |
| 1 | 6 | 9 | vegetables | three_subordinate | 0 |
- Select the columns subids, category, and proportion_basic_level_responses from the data. Print the first six rows of this data frame.
- Print the first six rows of a data frame that does NOT include the category column.
Note: the template for the remaining exercises is blank, and so you will need to add R chunks where appropriate.
- Use logical tests and Boolean operators to return only the trials (rows): [a] where the category is vegetables, [b] where the category is animals and the trial number is less than 7, [c] where the category is vegetables or animals, [d] with at least one basic level response in the “one” condition.
- The following code selects all trials (rows) where the condition was either “three_subordinate” or “one”. Rewrite this code in a way that uses the %in% operator.
filter(lf_data, condition == "three_subordinate" | condition == "one")
- How many trials are there where the category is either vegetables or animals? Use nrow().
- The following three sets of commands are written without the pipe operator (%>%). Rewrite each one to include the pipe.
[a]
var1 <- mutate(lf_data, category)
[b]
var1 <- select(lf_data, category)
var2 <- nrow(var1)
[c]
var1 <- filter(lf_data, trial_num == 1)
var2 <- filter(var1, category == "animals")
var3 <- select(var2, trial_num, category)
- The following two sets of commands are written with the pipe operator. Rewrite each one without the pipe.
[a]
lf_data %>%
  filter(trial_num < 6) %>%
  nrow()
[b]
lf_data %>%
  select(subids, category, proportion_basic_level_responses) %>%
  filter(subids == 1) %>%
  arrange(category)
- Look at the code below. Describe in full sentences what this code does.
lf_data %>%
  select(subids, category, condition) %>%
  filter(category == "vehicles" & condition != "one") %>%
  arrange(-subids)
- On the first day of class, we talked about the “Sally-Anne task,” which measures children’s understanding of theory of mind (example videos). Describe four variables that you could measure in this task to assess children’s theory of mind performance. Specifically, describe (1) one qualitative variable, (2) one quantitative - binary variable, (3) one quantitative - numeric variable, and (4) one quantitative - real variable. For each variable, give a one-sentence description of the variable AND one example value of that variable with units.
- Consider the following claim: “The scientific process is a social endeavor.” To what extent is this statement true or not true? What are the implications of your response for research methods in psychological science? Please respond with a short paragraph.
---
title: "Assignment 1: Cumulative Science and Intro to dplyr"
subtitle: "Modern Research Methods"
output:
  html_document:
    code_download: true
    css: ../lab.css
    highlight: kate
    theme: cosmo
    toc: false
    toc_float: false
---

```{r global_options, include = F}
library(tidyverse)
library(knitr)
library(babynames)
```


<br>
<br>
<div id="boxedtext">

 <font size="4"> **Objectives** </font> 
 
By the end of this assignment, you should:

- understand the concept of "cumulative science"
- be able to identify the type of a variable
- understand how to use the pipe operator (`%>%`)
- understand how to isolate data (`select`, `filter`, `arrange`)
- understand how to count rows in a data frame (`nrow`)
</div>

This assignment is due **Thursday, January 30 at noon**. Please submit both your .html AND .Rmd files on Canvas. Your .Rmd file should knit without errors before you turn in the assignment.

We'll do the first few exercises in lab. They concern a dataset called `babynames`, which is included in the "babynames" package.

<br> 

(1) Alter the code to select just the `n` column: 

```{r, echo = T, eval = F} 
select(babynames, name, prop)
```

<br> 

(1) Use logical operators to modify the code below to show: [a] all of the names where prop is greater than or equal to 0.08, [b] all of the children named “Sea”, [c] all of the names that have a missing value for n (Hint: this should return an empty data set).

```{r, echo = T, eval = F} 
filter(babynames, name == "Garrett")
```
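
*Reminder:* logical tests inside `filter()` keep only the rows where the test is `TRUE`. The sketch below illustrates this on a small made-up data frame; the `pets` data frame and its columns are invented for illustration and are not part of `babynames`. Note that missing values are tested with `is.na()` rather than with `== NA`.

```r
library(dplyr)

# Hypothetical toy data (not one of the assignment datasets)
pets <- tibble(
  name   = c("Rex", "Tom", "Ivy", "Moe"),
  weight = c(30, 4, NA, 12)
)

# Comparison operator: keep rows where weight is at least 12
heavy <- filter(pets, weight >= 12)

# Equality test on a character column
tom <- filter(pets, name == "Tom")

# Missing values: `weight == NA` evaluates to NA for every row,
# so filter() would drop everything; is.na() returns TRUE/FALSE instead
missing_weight <- filter(pets, is.na(weight))
```

Rows where the test evaluates to `NA` are dropped by `filter()`, which is why the row with a missing weight does not appear in `heavy`.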


<br> 

(1) Arrange babynames by n. Add prop as a second (tie-breaking) variable to arrange on.
What is the smallest value of n?

```{r, echo = T, eval = F} 
``` 
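
*Reminder:* `arrange()` sorts by its first column and uses any additional columns to break ties. The sketch below shows the idea on a made-up data frame; `scores` and its columns are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical toy data (not one of the assignment datasets)
scores <- tibble(
  player = c("A", "B", "C", "D"),
  points = c(10, 7, 10, 7),
  fouls  = c(2, 3, 1, 0)
)

# Sort by points (ascending); rows with equal points are ordered by fouls
sorted <- arrange(scores, points, fouls)

# desc() reverses the sort order for a single column
sorted_desc <- arrange(scores, desc(points), fouls)
```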

<br> 

(1) Use `%>%` to write a sequence of functions that: 1. Filters babynames to just the girls who were born in 2015. 2. Selects the name and n columns. 3. Arranges the results so that the most popular names are near the top.

<br> 

***

The next few exercises will focus on data from the Lewis & Frank (2018) replication of the Xu and Tenenbaum (2007) experiment that we talked about in lecture. We'll be working with data from the first experiment only. For reference, the published write-up of this study can be found [here](http://www.andrew.cmu.edu/user/mollylew/papers/LF_2018.pdf), and you can see the actual experiment that participants saw [here](https://langcog.stanford.edu/expts/MLL/XTMEM/exp1/exp1.html).

The data are in a file called `lewis_2018_exp1.csv`. We can start by loading the data with the `read_csv()` function and saving it to a variable called `lf_data`:
```{r, message = F}
lf_data <- read_csv("data/lewis_2018_exp1.csv")
```

This data frame is *tidy*, meaning each column is a variable and each row is an observation. In this case, each observation is a unique participant and trial combination. There are six variables in the data and each variable is described below. The first six rows of the data frame are also displayed below. 

* **exp** - Experiment number. Lewis & Frank (2018) reported 12 experiments; the present dataset only includes the data from the first experiment.
* **subids** - Subject ID. This is an anonymous id that uniquely identifies every participant in the study.
* **trial_num** - Each participant completed 12 "trials." In this case, a trial is a single screen where the participant sees a novel word, one or more examples, and then is asked to click on other examples of the novel word.
* **category** - There were three different categories of objects: vehicles, vegetables, and animals. Each participant saw some trials from each category. 
* **condition** - This is the variable that we manipulated. It refers to the number and type of examples of the novel word participants saw at the top of the page. Participants saw either 3 subordinate examples ("three_subordinate"; e.g., 3 dalmatians), 3 basic level examples ("three_basic"; e.g., a dalmatian, a poodle, and a Bernese mountain dog), 3 superordinate examples ("three_superordinate"; e.g., a dalmatian, a rabbit, and a horse), or just a single example ("one"; e.g., 1 dalmatian).
* **proportion_basic_level_responses** - This is the variable that we measured. It refers to the proportion (out of 2 possible) of basic level examples that a participant selected. 

```{r, echo =F}
kable(head(lf_data))
```

<br> 

(1) Select the columns `subids`, `category`, and `proportion_basic_level_responses` from the data. Print the first six rows of this data frame.



<br> 

(1) Print the first six rows of a data frame that does NOT include the `category` column.

*Note: the template for the remaining exercises is blank, and so you will need to add R chunks where appropriate.*

<br> 

(1) Use logical tests and Boolean operators to return only the trials (rows): [a] where the category is vegetables, [b] where the category is animals and the trial number is less than 7, [c] where the category is vegetables or animals, [d] with at least one basic level response in the "one" condition.

<br> 

(1) The following code selects all trials (rows) where the condition was either "three_subordinate" or "one". Rewrite this code in a way that uses the `%in%` operator.

```{r, eval = F}
filter(lf_data, condition == "three_subordinate" | condition == "one")
```
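
*Reminder:* `x %in% values` returns `TRUE` for each element of `x` that matches any element of `values`, so it can stand in for a chain of `==` tests joined by `|`. A minimal sketch on a made-up character vector (the `fruit` vector is hypothetical and unrelated to `lf_data`):

```r
# Hypothetical vector of labels
fruit <- c("apple", "pear", "plum", "apple", "kiwi")

# Two equality tests joined with | (or)
with_or <- fruit == "apple" | fruit == "plum"

# The same test written with %in%
with_in <- fruit %in% c("apple", "plum")

# Both approaches give the same logical vector
same_result <- identical(with_or, with_in)
```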

<br> 

(1) How many trials are there where the category is either vegetables or animals? Use `nrow()`.

<br> 

(1) The following three sets of commands are written without the pipe operator (`%>%`). Rewrite each one to include the pipe.

[a]
```{r}
var1 <- mutate(lf_data, category)
```

[b]
```{r}
var1 <- select(lf_data, category)
var2 <- nrow(var1)
```

[c]
```{r}
var1 <- filter(lf_data, trial_num == 1)
var2 <- filter(var1, category == "animals")
var3 <- select(var2, trial_num, category)
```
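
*Reminder:* the pipe passes whatever is on its left-hand side as the first argument of the function on its right, so `f(x, y)` and `x %>% f(y)` are equivalent. The sketch below shows both styles on a made-up data frame; `toy` and its columns are hypothetical and are not part of the assignment data.

```r
library(dplyr)

# Hypothetical toy data (not one of the assignment datasets)
toy <- tibble(id = 1:4, group = c("a", "b", "a", "b"))

# Without the pipe: each intermediate result is stored, then passed along
step1 <- filter(toy, group == "a")
n_unpiped <- nrow(step1)

# With the pipe: the left-hand side becomes the first argument of the next call
n_piped <- toy %>%
  filter(group == "a") %>%
  nrow()
```

Both versions count the same rows; the piped version simply avoids naming the intermediate data frame.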

<br> 

(1) The following two sets of commands are written with the pipe operator. Rewrite each one without the pipe.

[a]
```{r, eval = F}
lf_data %>%
  filter(trial_num < 6) %>%
  nrow()
```

[b]
```{r, eval = F}
lf_data %>%
  select(subids, category, proportion_basic_level_responses) %>%
  filter(subids == 1) %>%
  arrange(category)
```

<br> 

(1) Look at the code below. Describe in full sentences what this code does.

```{r, eval = F}
lf_data %>%
  select(subids, category, condition) %>%
  filter(category == "vehicles" & condition != "one") %>%
  arrange(-subids)
```

<br> 

(1) On the first day of class, we talked about the "Sally-Anne task," which measures children's understanding of theory of mind ([example videos](https://www.youtube.com/watch?v=oazK2fkRU1A)). Describe four variables that you could measure in this task to assess children's theory of mind performance. Specifically, describe (1) one qualitative variable, (2) one quantitative - binary variable, (3) one quantitative - numeric variable, and (4) one quantitative - real variable. For each variable, give a one-sentence description of the variable AND one example value of that variable with units.

<br> 


(1) Consider the following claim: "The scientific process is a social endeavor." To what extent is this statement true or not true? What are the implications of your response for research methods in psychological science? Please respond with a short paragraph. 

