Packages
From Spreadsheet to Analysis to Graphs
I. Loading raw data into R
II. Cleaning and manipulating your data
- 4. Generating new columns of data
  - 4a. Continuous variables: calculating overall scores for Regret and Maximization scales
  - 4b. Categorical variables: recoding decision mode and school affiliation
- 5. Checking to make sure your manipulation code worked
III. Calculating descriptive statistics
- Describing continuous data
- 7. Calculating summary statistics for specific sub-groups
  - 7a. Using filter() to choose subsets of rows of data
  - 7b. using group_by() and summarize() to generate summary statistics for all groups at once
IV. Calculating inferential statistics
- 7. Correlations
- 8. t-tests
V. Graphing data
Submitting this assignment

Packages

While R can do a lot of amazing stuff on its own, some special types of analysis or graphing can be more easily (or stylishly) done using one of a (huge) number of “packages” that people have created specially for R. Think of R as your phone, and the packages as apps that each allow the phone to do something extra.

To use a package, you have to install it (once), and then load it (every time you re-open RStudio).

In this assignment, we’ll need the assistance of two packages in particular:

ggplot2 is the package we’ll use to make clean, attractive graphs
knitr is the package we’ll use to export this homework assignment to a format you can upload to Canvas

First, you can use the code below to install these two packages if you haven’t already.

# Only run this code if you HAVEN'T already installed these packages!
# They are almost certainly already installed - that's a benefit of RStudio Cloud!
install.packages(c("tidyverse",
                   "knitr"))

Next, use the following code to load, or “turn on”, these packages for the current RStudio window so you can actually access the functions inside. You need to do this every time you open a new RStudio tab! For this code chunk, all you’ll need to do is run the code below:

# We only need to "turn on" tidyverse
# knitr actually works behind the scenes already!
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

From Spreadsheet to Analysis to Graphs

In this assignment, we will be modeling a specific order of R tasks:

I. Load your raw data II. Prepare your data for analysis III. Generate descriptive statistics IV. Generate inferential statistics V. Generate graphs

This is important because R documents execute their code in order from beginning to end, or from top to bottom. All the code at the end of the document assumes that code at the beginning of the document has already been run. For example, if you want to run a t-test on a mean of a set of other scores that you must calculate from your raw data, the code to calculate that mean score needs to come before the code that runs the t-test.

I. Loading raw data into R

The first thing you do in any data analysis document is to load your fresh raw data into R so you can later work with and calculate statistics on your data.

1. Check working directory

Check your working directory using getwd(). If you are using RStudio Cloud, or if you opened RStudio via the R project file in the same directory that contains your datafile (classdata_2020.csv) and your R scripts, you should be in that folder now.

Then check to make sure the data is there using list.files().

getwd

## function () 
## .Internal(getwd())
## <bytecode: 0x3443c38>
## <environment: namespace:base>

list.files('classdata_2020.csv')

## character(0)

2. Read in data

Read in your data file using read_csv("classdata_2020.csv"), and assign it to the dataframe name IntroSurvey using the <- left-arrow operator. If you’re not sure how to do this, check the code you used in R Assignment 1. (You can open a fresh R markdown file of Assignment 1 via the Files pane in the bottom righthand corner of the RStudio screen. It will open in another tab, so you can go back and forth to copy code from it and paste it here.)

Remember that for code to run, it must be inside a “code chunk”—e.g., the grey-background area that starts immediately below this paragraph. The functions in the paragraph above will not get run by R, because they’re in the plain-text part of this document.

IntroSurvey <- read.csv("classdata_2020.csv")

3. Visually inspect your data

Use the following exploration functions to check that the dataframe IntroSurvey read in correctly:

head()

head(IntroSurvey)

##   id maxi1 maxi2 maxi3 maxi4 maxi5 maxi6 regret1 regret2 regret3 regret4
## 1  1     2     6     3     4     4     2       6       5       5       6
## 2  2     4     4     6     6     5     5       6       6       6       3
## 3  3     4     2     6     4     7     3       7       3       6       7
## 4  4     6     4     3     7     5     6       7       7       5       3
## 5  5     5     5     7     7     6     7       2       2       2       1
## 6  6     5     7     3     6     7     6       7       7       7       7
##   regret5 courses_enrolled courses_shopped points_enrolled time_planning
## 1       1                4               6            15.0        7 to 9
## 2       5                5               6            16.0        3 to 5
## 3       7                2               3             8.0        1 to 3
## 4       6                7               7            17.0        1 to 3
## 5       1                6               6            18.5        7 to 9
## 6       6                5               6            16.0        1 to 3
##   courses_satisfaction
## 1                    5
## 2                    4
## 3                    4
## 4                    3
## 5                    3
## 6                    3
##                                                                                            dec_mode
## 1 calculation-based decision mode (e.g., weighing pros and cons of each course against one another)
## 2 calculation-based decision mode (e.g., weighing pros and cons of each course against one another)
## 3 calculation-based decision mode (e.g., weighing pros and cons of each course against one another)
## 4                     role-based decision mode (e.g., taking what a Psychology major ought to take)
## 5                     role-based decision mode (e.g., taking what a Psychology major ought to take)
## 6                     role-based decision mode (e.g., taking what a Psychology major ought to take)
##   process_regret outcome_regret regret_general maxi_general psych_courses age
## 1              1              2              4            6             5  21
## 2              2              2              1            5             7  20
## 3              1              1              4            6             6  23
## 4              1              2              2            5             6  20
## 5              1              1              1            6             4  21
## 6              2              2              5            5             4  20
##   birthyear    class               school gender handed      major
## 1      1999   Senior                   CC      M      R Psychology
## 2      2000   Junior                   CC      M      R Psychology
## 3      1996 Post-bac Continuing Ed. / SPS      F      R Psychology
## 4      2000   Junior                   CC      F      R Psychology
## 5      1999   Senior                   CC      M      R Psychology
## 6      1999   Senior                   CC      M      R Psychology
##   concentration reader programs CRT1_5 CRT2_5 CRT3_47 CRT_total
## 1          None      1        1      0      0       0         0
## 2          None      1        1      1      1       1         3
## 3    Psychology      1        1      0      0       0         0
## 4     Sociology      0        2      0      0       0         0
## 5       Pre-Med      1        2      1      0       1         2
## 6       Pre-Med      0        1      0      1       1         2

summary()

summary(IntroSurvey)

##        id         maxi1           maxi2       maxi3           maxi4      
##  Min.   : 1   Min.   :2.000   Min.   :1   Min.   :1.000   Min.   :1.000  
##  1st Qu.:18   1st Qu.:4.000   1st Qu.:2   1st Qu.:2.000   1st Qu.:5.000  
##  Median :35   Median :5.000   Median :4   Median :3.000   Median :6.000  
##  Mean   :35   Mean   :4.957   Mean   :4   Mean   :3.449   Mean   :5.333  
##  3rd Qu.:52   3rd Qu.:6.000   3rd Qu.:5   3rd Qu.:5.000   3rd Qu.:7.000  
##  Max.   :69   Max.   :7.000   Max.   :7   Max.   :7.000   Max.   :7.000  
##                                                                          
##      maxi5           maxi6          regret1         regret2     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:5.000   1st Qu.:4.000  
##  Median :5.000   Median :5.000   Median :6.000   Median :5.000  
##  Mean   :5.116   Mean   :4.565   Mean   :5.304   Mean   :4.725  
##  3rd Qu.:7.000   3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:6.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :7.000  
##                                                                 
##     regret3         regret4         regret5      courses_enrolled
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000   1st Qu.: 4.000  
##  Median :4.500   Median :4.000   Median :5.000   Median : 5.000  
##  Mean   :4.279   Mean   :3.928   Mean   :4.377   Mean   : 4.406  
##  3rd Qu.:6.000   3rd Qu.:5.000   3rd Qu.:6.000   3rd Qu.: 5.000  
##  Max.   :7.000   Max.   :7.000   Max.   :7.000   Max.   :10.000  
##  NA's   :1                                                       
##  courses_shopped  points_enrolled time_planning      courses_satisfaction
##  Min.   : 0.000   Min.   : 4.00   Length:69          Min.   :1.000       
##  1st Qu.: 4.750   1st Qu.:13.00   Class :character   1st Qu.:3.000       
##  Median : 5.000   Median :16.00   Mode  :character   Median :4.000       
##  Mean   : 5.632   Mean   :14.88                      Mean   :3.797       
##  3rd Qu.: 6.000   3rd Qu.:17.00                      3rd Qu.:4.000       
##  Max.   :25.000   Max.   :21.50                      Max.   :5.000       
##  NA's   :1                                                               
##    dec_mode         process_regret  outcome_regret  regret_general 
##  Length:69          Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Mode  :character   Median :2.000   Median :2.000   Median :3.000  
##                     Mean   :1.928   Mean   :2.145   Mean   :2.926  
##                     3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:4.000  
##                     Max.   :5.000   Max.   :5.000   Max.   :6.000  
##                                                     NA's   :1      
##   maxi_general   psych_courses        age          birthyear   
##  Min.   :2.000   Min.   :1.000   Min.   :19.00   Min.   :1987  
##  1st Qu.:4.000   1st Qu.:2.000   1st Qu.:20.00   1st Qu.:1995  
##  Median :5.000   Median :5.000   Median :21.00   Median :1998  
##  Mean   :4.754   Mean   :4.241   Mean   :22.99   Mean   :1997  
##  3rd Qu.:6.000   3rd Qu.:6.000   3rd Qu.:24.50   3rd Qu.:1999  
##  Max.   :6.000   Max.   :9.000   Max.   :33.00   Max.   :2001  
##                  NA's   :11      NA's   :2                     
##     class              school             gender             handed         
##  Length:69          Length:69          Length:69          Length:69         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     major           concentration          reader          programs    
##  Length:69          Length:69          Min.   :0.0000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:0.0000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :0.0000   Median :2.000  
##                                        Mean   :0.2174   Mean   :1.609  
##                                        3rd Qu.:0.0000   3rd Qu.:2.000  
##                                        Max.   :1.0000   Max.   :3.000  
##                                                                        
##      CRT1_5           CRT2_5          CRT3_47         CRT_total    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.000  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :2.000  
##  Mean   :0.5652   Mean   :0.5797   Mean   :0.7681   Mean   :1.913  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :3.000  
##

str()

str(IntroSurvey)

## 'data.frame':    69 obs. of  37 variables:
##  $ id                  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ maxi1               : int  2 4 4 6 5 5 2 6 7 5 ...
##  $ maxi2               : int  6 4 2 4 5 7 4 1 2 7 ...
##  $ maxi3               : int  3 6 6 3 7 3 3 2 1 5 ...
##  $ maxi4               : int  4 6 4 7 7 6 3 7 6 6 ...
##  $ maxi5               : int  4 5 7 5 6 7 4 5 7 7 ...
##  $ maxi6               : int  2 5 3 6 7 6 2 6 5 6 ...
##  $ regret1             : int  6 6 7 7 2 7 3 7 7 6 ...
##  $ regret2             : int  5 6 3 7 2 7 5 6 7 5 ...
##  $ regret3             : int  5 6 6 5 2 7 2 7 7 5 ...
##  $ regret4             : int  6 3 7 3 1 7 2 6 7 7 ...
##  $ regret5             : int  1 5 7 6 1 6 5 6 7 3 ...
##  $ courses_enrolled    : int  4 5 2 7 6 5 3 5 5 6 ...
##  $ courses_shopped     : int  6 6 3 7 6 6 4 6 6 5 ...
##  $ points_enrolled     : num  15 16 8 17 18.5 16 12 16 21.5 17 ...
##  $ time_planning       : chr  "7 to 9" "3 to 5" "1 to 3" "1 to 3" ...
##  $ courses_satisfaction: int  5 4 4 3 3 3 5 4 3 5 ...
##  $ dec_mode            : chr  "calculation-based decision mode (e.g., weighing pros and cons of each course against one another)" "calculation-based decision mode (e.g., weighing pros and cons of each course against one another)" "calculation-based decision mode (e.g., weighing pros and cons of each course against one another)" "role-based decision mode (e.g., taking what a Psychology major ought to take)" ...
##  $ process_regret      : int  1 2 1 1 1 2 3 2 5 2 ...
##  $ outcome_regret      : int  2 2 1 2 1 2 1 2 2 1 ...
##  $ regret_general      : int  4 1 4 2 1 5 2 5 6 3 ...
##  $ maxi_general        : int  6 5 6 5 6 5 5 6 6 5 ...
##  $ psych_courses       : int  5 7 6 6 4 4 6 6 2 5 ...
##  $ age                 : int  21 20 23 20 21 20 28 21 21 21 ...
##  $ birthyear           : int  1999 2000 1996 2000 1999 1999 1992 1999 1999 1998 ...
##  $ class               : chr  "Senior" "Junior" "Post-bac" "Junior" ...
##  $ school              : chr  "CC" "CC" "Continuing Ed. / SPS" "CC" ...
##  $ gender              : chr  "M" "M" "F" "F" ...
##  $ handed              : chr  "R" "R" "R" "R" ...
##  $ major               : chr  "Psychology" "Psychology" "Psychology" "Psychology" ...
##  $ concentration       : chr  "None" "None" "Psychology" "Sociology" ...
##  $ reader              : int  1 1 1 0 1 0 0 0 0 0 ...
##  $ programs            : int  1 1 1 2 2 1 2 3 2 2 ...
##  $ CRT1_5              : int  0 1 0 0 1 0 1 1 0 0 ...
##  $ CRT2_5              : int  0 1 0 0 0 1 1 1 0 0 ...
##  $ CRT3_47             : int  0 1 0 0 1 1 1 1 1 0 ...
##  $ CRT_total           : int  0 3 0 0 2 2 3 3 1 0 ...

II. Cleaning and manipulating your data

Before we calculate any statistics on our data, we may need to clean the raw data, or calculate additional values of interest in our data. It is important to do this before calculating any statistics, so that any values you may need for your statistics are already present in your data by the time you run your “analysis” code.

4. Generating new columns of data

4a. Continuous variables: calculating overall scores for Regret and Maximization scales

The code below will calculate two new columns based on existing raw data in IntroSurvey. Note that this way of calculating regret_total (the mean of the 5 individual regret questions) is different than how you did it in Assignment 1. There are often multiple ways to accomplish the same thing in R—note that this way is more transparent in that it shows exactly how regret_total is calculated, while using rowMeans() like we did last time was more elegant, since it used less code to accomplish the same objective.

regret_total, for overall Regret Score on the Regret Scale, by averaging all 5 individual regret scores, regret1 to regret5
maxi_total, for overall Maximizing score on the Maximizing Scale, by averaging all 6 individual regret scores, maxi1 to maxi6

IntroSurvey <- mutate(IntroSurvey,
                      regret_total = (regret1 + regret2 + regret3 + regret4 + regret5) / 5,
                      maxi_total = (maxi1 + maxi2 + maxi3 + maxi4 + maxi5 + maxi6) / 6)

You can use the function mutate() that you see above to add new columns to a dataframe, or overwrite existing columns in a dataframe. Anytime you want to calculate a new column that may be some combination of existing columns, like averaging the values across multiple columns, mutate() is the function you want.

You can create as many new columns as you want inside of one mutate() command, as long as you separate each row of column-creating code with a comma. To help your code look more readable, press EnterReturn after the comma at the end of each row of column-creating code, to put each row of code on its own row. (This is not necessary for R; it’s just easier formatting for us humans to parse with our eyeballs.)

Now, it’s your turn to calculate some new columns of data.

The Maximizing Scale has been shown to divide up into three subscales:

Alternative Search (maxi1 & maxi3)
Decision Difficulty (maxi2 & maxi5)
High Standards (maxi4 & maxi6)

Please use the approach seen above to assign three new variables in the IntroSurvey dataframe, each of which represents one subscale: maxi_as, maxi_dd, and maxi_hs. (Hint: you should be averaging the two maxi scores for each.)

IntroSurvey <- mutate(IntroSurvey,
                      maxi_as = (maxi1 + maxi3) / 2,
                      maxi_dd = (maxi2 + maxi5) / 2,
                      maxi_hs = (maxi4 + maxi6) / 2)

4b. Categorical variables: recoding decision mode and school affiliation

Above, we used mutate() to create new continuous variables by averaging the values of other continuous variables together. We can also create new categorical variables by recoding the values of existing categorical variables into levels that are more convenient for the statistics we want to calculate.

The question about decision mode allowed for 4 options (calculation, rule, role, and affect). To do a t-test, we need the variable re-coded as binary, with two levels: calculation vs. not-calculation. To do this, we’ll use the ifelse() function with mutate() to generate a new column in the data. The code below tells R to create a new variable (dec_mode_binary) whose value is "calc" if the original variable (dec_mode) started with the phrase “calculation-based”, and whose value is "not-calc" otherwise.

IntroSurvey <- mutate(IntroSurvey,
                      dec_mode_binary = ifelse(startsWith(dec_mode, "calculation-based"),
                                          "calc",
                                          "not-calc"))

There are various strategies you can use to recode the levels of categorical variables. Above, we used a logical test to create a binary variable based on whether the logical test returned true or false. Creating binary variables is very useful for setting up t-tests, because t-tests require you to split your data up into two groups for comparison.

Next, let’s create another new binary variable. This time, we will re-code school affiliation into a binary variable. Later, we can use this binary variable to test for differences in scores by school affiliation (Columbia College/Barnard vs GS/SPS, for example).

The theoretical reason why you might want to group students in this way is that CC and Barnard students tend to have come to college straight out of high school (or with a small gap), whereas GS and post-bac students are more likely to have had some amount of gap between high school and arriving at Columbia (e.g., coursework at another university; a dance or athletic career; military service; etc.). So it might be reasonable to look to see if there are any differences between those two groups of students.

To do this, we’ll use the ifelse() function with mutate() to generate a new column in our data, like we did earlier. The code below tells R to create a new variable (school_binary) whose value is "B_CC" if the original variable (school) was either Barnard or CC, and whose value is "GS_SPS" otherwise.

IntroSurvey <- mutate(IntroSurvey,
                      school_binary = ifelse(school %in% c("Barnard", "CC"),
                                             "B_CC",
                                             "GS_SPS"))

Now it’s your turn. Use mutate() and ifelse() as we demonstrated above to create another binary variable based on handedness. Right now, handedness may have up to 3 levels: “R”, “L”, or “A” (for ambidextrous). Create a new binary variable called handed_binary with two levels: “R” and “notR”, where anyone who is not right-handed has the value “notR”.

IntroSurvey <- mutate(IntroSurvey,
                      handed_binary = ifelse(handed %in% c("R", "L","A"),
                                             "R",
                                             "notR"))

5. Checking to make sure your manipulation code worked

Use head() to check that the following new variables are present in IntroSurvey:

regret_total
maxi_total
maxi_as, maxi_dd, & maxi_hs

head('regret_total')

## [1] "regret_total"

head('maxi_total')

## [1] "maxi_total"

head('maxi_as', 'maxi_dd','maxi_hs')

## [1] "maxi_as"

What other ways are there to check whether your new variables were created correctly? Ask your TA or a classmate to come up with at least one way, other than using head(), of checking your work.

Going through the code again and make sure the average is right.

III. Calculating descriptive statistics

Before we start doing hypothesis tests, we should take a look at what our key variables look like—how they are distributed, what their range of values are, where they’re centered, and how much variance they have.

Describing continuous data

Let’s start with some descriptive summary statistics on our new variables. Since they are means of ordinal data, these new variables are technically ordinal as well. But for our purposes in this class, we will fudge things and treat them as scale (interval/ratio).

6a. Calculate means/SDs using summarize()

To calculate multiple summary statistics on one dataframe at once, we will use summarize(). As an example, first let’s calculate both the mean and standard devation of the total regret score across all subjects.

# First, create a summary dataframe to hold the relevant info
Regret_Total_Summary <- summarize(IntroSurvey,
                                  mean_regret_total = mean(regret_total, na.rm = TRUE),
                                  sd_regret_total = sd(regret_total, na.rm = TRUE))

# Then print the dataframe to show the summary values inside
Regret_Total_Summary

##   mean_regret_total sd_regret_total
## 1          4.523529         1.41507

The syntax you use to calculate summary stats with summarize() is similar to the syntax you use to create new columns with mutate() like we did earlier. You can calculate multiple summary stats in the same summarize() command as long as you separate every new stat you want to calculate with a comma.

Use mean() and sd() inside of summarize() to find the mean and standard deviations for:

total maximizing score

# First, create a summary dataframe to hold the relevant info
Maxi_Total_Summary <- summarize(IntroSurvey,
                                  mean_maxi_total = mean(maxi_total, na.rm = TRUE),
                                  sd_maxi_total = sd(maxi_total, na.rm = TRUE))

each of the 3 maxi subscales (You can do this all inside of one summarize() call)

# Then print the dataframe to show the summary values inside
Maxi_Total_Summary

##   mean_maxi_total sd_maxi_total
## 1        4.570048      1.096416

6b. Show quick histograms using hist()

Next, we will use hist() to visualize the distribution of scores for continuous variables in our data. As a refresher, hist() works a little differently than most of the functions we will use in this assignment. In most of the functions we are using, you already specify to the function that it should look inside IntroSurvey, so you do not need to use the $ to tell R where to find the columns you’re referring to. (You hand the mail carrier a bundle of letters that’s marked as all going to Schermerhorn, so that on each letter you only need to put the room number and not also the building name.) However, hist() isn’t that fancy, so you will need to use the $ to tell it to generate a histogram of a specific column inside of a larger dataframe in your environment.

As an example, we will first generate a histogram of the distribution of total regret scores over all students in the data. If we’re going to be using it in t-tests and correlations, we’ll want to be able to assume it came from a population that is normally distributed. Does that look like a reasonable assumption?

hist(IntroSurvey$regret_total)

Now, on your own, use hist() like we did above to generate histograms for the columns that you just calculated the mean and standard deviation for in question 6a:

total maximizing score

hist(IntroSurvey$maxi_total)

each of the 3 maxi subscales:

** Alternative Search

hist(IntroSurvey$maxi_as)

** Decision Difficulty

hist(IntroSurvey$maxi_dd)

** High Standards

hist(IntroSurvey$maxi_hs)

6c. Count categorical values using count()

Histograms are a fantastic way to visualize a summary of continuous, numeric variables. However, not all data are numeric—sometimes variables of interest are categorical (nominal), where each observation is one of a few specified levels.

We can use the function count() to see how many observations occur at each level of a categorical variable. For example, let’s use count() to count how many values are in each level of the column class. This will tell us how many participants are in each class year in our IntroSurvey data.

count_class <- count(IntroSurvey,
                     class)

count_class

##         class  n
## 1      Junior 33
## 2    Post-bac  5
## 3 Second-year  4
## 4      Senior 27

How many students fall into each category of school?

count_school <- count(IntroSurvey,
                     school)

count_school

##                 school  n
## 1              Barnard  2
## 2                   CC 38
## 3 Continuing Ed. / SPS  6
## 4                   GS 23

What about handed?

count_handed <- count(IntroSurvey,
                     handed)

count_handed

##   handed  n
## 1      A  1
## 2      L  6
## 3      R 62

6d. Descriptive stats: exploration

Explore at least 3 other variables that you think might have a relationship with maximizing and/or regret. Add a line of text above each code chunk to introduce the variable you’re exploring, and describe what analyses you do in the code chunk.

If the variable is numeric or interval (i.e., if its responses are numbers), use summarize() with mean() and sd() to compute the mean and standard deviation, and use hist() to produce a histogram. R calls these kinds of data “numeric” or “double” or “interval”.

If the variable is categorical, use count() to count the values in each level of the variable. R calls these kinds of data “character” (chr) or “factor” (fac).

Hint: Use class() to check what type of variable you are working with.

#Counting categorial values for major

count_major <- count(IntroSurvey,
                     major)

count_major

##                                    major  n
## 1                              Chemistry  1
## 2                Neuroscience & Behavior  4
## 3                             Psychology 52
## 4                Psychology, Art History  1
## 5           Psychology, Computer Science  1
## 6           Psychology, creative writing  1
## 7           Psychology, Creative Writing  2
## 8               Psychology, Data Science  1
## 9                  Psychology, Economics  1
## 10         Psychology, Political Science  1
## 11                 Psychology, Sociology  1
## 12                             Undecided  2
## 13 Undecided, (pending) Computer Science  1

#Counting categorial values for gender

count_gender <- count(IntroSurvey,
                     gender)

count_gender

##                  gender  n
## 1                     F 45
## 2                     M 23
## 3 Prefer not to specify  1

#Producing histogram for CRT
                               
hist(IntroSurvey$CRT_total)

7. Calculating summary statistics for specific sub-groups

Many times, when investigating summary statistics to learn more about your data, you will want to compare values for different study groups in your participant population, in addition to reporting data across all subjects as we just learned how to do.

Here, we will learn about some functions that will help you break up your summaries by participant groups.

7a. Using filter() to choose subsets of rows of data

First, you can use the filter() function to choose a subset of rows out of your data, so you can produce summaries on one group at a time.

righties <- filter(IntroSurvey,
                   handed == "R")

Inside of filter(), you use logical statements based on columns of your data to choose which rows to subset. If you use multiple logical statements in the same filter() command, separated with commas, you will choose the intersection of all those logical statements, or all the rows where the first statement and the second statement and the third statement, etc., are true.

Below, use filter() to choose the subset of your data belonging to Columbia College-affiliated right-handed students. You can use the code above to help you, but this time, you will need two logical statements inside of filter().

righties <- filter(IntroSurvey,
                   handed == "R",school == "CC",gender == "F")

Sometimes, you may want to choose just a subset of your data to operate on, but other times, you may want to calculate all summary statistics for different groups in your data at once, so you can compare them. We will learn how to do this next.

7b. using group_by() and summarize() to generate summary statistics for all groups at once

Remember the summarize() function from before? Previously, we used it to calculate summary statistics over all observations in our data. We can also use summarize() to calculate statistics separately for different groups in our data. In order to do this, we must tell R what groups are in our data, using a new function, group_by(). Then, summarize() will automatically calculate summary statistics separately for every group it sees in the data.

Here’s an example: Let’s calculate the mean and SD of students’ total regret scores, like we did in question 6a. This time, however, we will calculate it separately for students of every class year.

# First, create a summary dataframe to hold the relevant info
regret_total_by_class_summary <- IntroSurvey %>%
  group_by(class) %>%
  summarize(mean_regret_total = mean(regret_total, na.rm = TRUE),
            sd_regret_total = sd(regret_total, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

# Then print the dataframe to show the summary values inside
regret_total_by_class_summary

## # A tibble: 4 x 3
##   class       mean_regret_total sd_regret_total
##   <chr>                   <dbl>           <dbl>
## 1 Junior                   4.42           1.35 
## 2 Post-bac                 5.28           0.522
## 3 Second-year              3.1            1.54 
## 4 Senior                   4.72           1.48

If you run the code above, you’ll see that you get columns for mean regret score and SD of regret score, but now instead of one summary value for all subjects, there’s a separate value for each level of class, corresponding to each unique class year reported in the data.

There is also a new command used in the code above: the pipe operator, %>%. The %>% operator does one simple, but key, thing: it takes the object on the left-hand side and feeds it into the first argument of the function on the right-hand side.

This means that you can put multiple functions together, one after the other, and R will understand and execute them in left to right, like natural English. In the code above, this means that we are telling R to do these things in this order:

Take the IntroSurvey data object
Use group_by() on IntroSurvey to tell R to group the rows of IntroSurvey by the class column
Use summarize() on the grouped IntroSurvey data, calculating a mean and SD of regret_total

It’s very important here that group_by() is called before summarize(), because otherwise R wouldn’t know how to group the data before summarizing it.

We will practice this a couple times.

Below, group IntroSurvey by school affiliation (CC, Barnard, GS, etc), and then calculate mean and SD of total regret score for each school group in the data using the pipe operator.

# First, create a summary dataframe to hold the relevant info
regret_total_by_school_summary <- IntroSurvey %>%
  group_by(school) %>%
  summarize(mean_regret_total = mean(regret_total, na.rm = TRUE),
            sd_regret_total = sd(regret_total, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

# Then print the dataframe to show the summary values inside
regret_total_by_school_summary

## # A tibble: 4 x 3
##   school               mean_regret_total sd_regret_total
##   <chr>                            <dbl>           <dbl>
## 1 Barnard                           4.8            3.11 
## 2 CC                                4.48           1.43 
## 3 Continuing Ed. / SPS              4.93           0.969
## 4 GS                                4.45           1.43

Below, group IntroSurvey by major affiliation, and then calculate mean and SD of total maximizing score for each major in the data.

# First, create a summary dataframe to hold the relevant info
regret_total_by_major_summary <- IntroSurvey %>%
  group_by(major) %>%
  summarize(mean_regret_total = mean(regret_total, na.rm = TRUE),
            sd_regret_total = sd(regret_total, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

# Then print the dataframe to show the summary values inside
regret_total_by_major_summary

## # A tibble: 13 x 3
##    major                                 mean_regret_total sd_regret_total
##    <chr>                                             <dbl>           <dbl>
##  1 Chemistry                                          7             NA    
##  2 Neuroscience & Behavior                            2.95           1.17 
##  3 Psychology                                         4.64           1.39 
##  4 Psychology, Art History                            5.2           NA    
##  5 Psychology, Computer Science                       2.4           NA    
##  6 Psychology, creative writing                       5.8           NA    
##  7 Psychology, Creative Writing                       4.3            2.12 
##  8 Psychology, Data Science                           3.4           NA    
##  9 Psychology, Economics                              4.2           NA    
## 10 Psychology, Political Science                      4.4           NA    
## 11 Psychology, Sociology                              3.4           NA    
## 12 Undecided                                          4.5            0.990
## 13 Undecided, (pending) Computer Science              6             NA

IV. Calculating inferential statistics

Now we can start doing some real data analysis! For the purposes of getting practice in a variety of inferential statistics in R, we’ll be running more exploratory tests than would normally be considered appropriate for a real study. We’ll talk more about why this is a problem later in the semester, but hopefully in your stats class you talked about the dangers of running too many tests—e.g., alpha-inflation, p-hacking, etc.

7. Correlations

Now that you have a sense for what the individual variables each show, let’s look at how some of them relate to each other. First, some correlations.

7a. Age vs. birth year

Let’s see if age correlates with birthyear (it should!). The function cor.test() allows us to put in two variables, see what their correlation is, and return a p-value for that correlation.

The R syntax for a correlation is cor.test(~ var1 + var2, data = DataFrame). Note that because we’re defining the dataframe (for today’s assignment, IntroSurvey) within the test, we don’t need to dollar-sign index it in the names of the variables. We can just use the variable name on its own (for example, inside this function, if age were one of our variables, we only need to call age, not IntroSurvey$age, because cor.test() already knows to look inside of IntroSurvey.)

cor.test() allows you to run a Pearson, Kendall, or Spearman correlation, depending on what you tell it to do using the argument method. This time, since we want to do a Pearson correlation, we’ll use method = "pearson".

First, we’ll save the contents of our correlation test into the variable cor_age_birthyear so we can access its contents later. Then, we’ll print cor_age_birthyear to the console, so we can visually inspect the results of the correlation test.

cor_age_birthyear <- cor.test(~ age + birthyear, method = "pearson", data = IntroSurvey)

cor_age_birthyear

## 
##  Pearson's product-moment correlation
## 
## data:  age and birthyear
## t = -67.981, df = 65, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9957309 -0.9886655
## sample estimates:
##        cor 
## -0.9930409

Above, you can see the r-value, p-value, and several other pieces of info about the correlation test. These statistics are actually saved as variables inside of the correlation test object. To extract these statistics and report them, we can use the dollar sign $ to index these statistics, similar to indexing columns in a dataframe.

To index the Pearson’s r for this test, we can look for the sub-variable called estimate:

cor_age_birthyear$estimate

##        cor 
## -0.9930409

And to index the p-value, we can call the sub-variable p.value:

cor_age_birthyear$p.value

## [1] 4.112443e-62

Now, we can use these variable names to refer to the Pearson’s r and p-value of this correlation test. You don’t have to look at the numbers and copy them over!

PS: In RStudio, if you want to know all the sub-variables inside of an object, if you type the name of the object with a $ after it, an auto-complete menu should pop up and you can use the up and down arrow keys to look through all the possible sub-variables.

b. Maximizing vs. Regret

Let’s see if maximizing correlates with regret. Use cor.test() on those two variables, and report the Pearson’s r and p-value. Remember to label what you are doing in the text above each code chunk.

cor_maxi_total_regret_total <- cor.test(~ maxi_total + regret_total, method = "pearson", data = IntroSurvey)

cor_maxi_total_regret_total

## 
##  Pearson's product-moment correlation
## 
## data:  maxi_total and regret_total
## t = 1.5852, df = 66, p-value = 0.1177
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.04915847  0.41116223
## sample estimates:
##       cor 
## 0.1915112

cor_maxi_total_regret_total$estimate

##       cor 
## 0.1915112

cor_maxi_total_regret_total$p.value

## [1] 0.1177043

Typically, maximizing correlates reasonably strongly with regret (r ~ 0.4). Does this appear to be the case in our dataset?

c. Maximizing vs. Course-selection behavior

Let’s see if maximizing correlates with various measures of course-selection behavior. Calculate the correlation and report the Pearson’s r between maxi_total and each of the following variables:

courses_enrolled

cor_maxi_total_courses_enrolled <- cor.test(~ maxi_total + courses_enrolled, method = "pearson", data = IntroSurvey)

cor_maxi_total_courses_enrolled

## 
##  Pearson's product-moment correlation
## 
## data:  maxi_total and courses_enrolled
## t = 3.6432, df = 67, p-value = 0.0005261
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1880423 0.5868297
## sample estimates:
##       cor 
## 0.4066257

courses_shopped

cor_maxi_total_courses_shopped <- cor.test(~ maxi_total + courses_shopped, method = "pearson", data = IntroSurvey)

cor_maxi_total_courses_shopped

## 
##  Pearson's product-moment correlation
## 
## data:  maxi_total and courses_shopped
## t = 0.96346, df = 66, p-value = 0.3388
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1241420  0.3464656
## sample estimates:
##       cor 
## 0.1177689

points_enrolled

cor_maxi_total_points_enrolled <- cor.test(~ maxi_total + points_enrolled, method = "pearson", data = IntroSurvey)

cor_maxi_total_points_enrolled

## 
##  Pearson's product-moment correlation
## 
## data:  maxi_total and points_enrolled
## t = 2.3237, df = 67, p-value = 0.02319
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03892622 0.47882273
## sample estimates:
##       cor 
## 0.2730909

d. Regret Scale vs. Other ways of measuring regret

Is the regret score (regret_total), as calculated by the Regret Scale, correlated with either of our students’ self-reports of regret:

process_regret?

cor_regret_total_process_regret <- cor.test(~ regret_total + process_regret, method = "pearson", data = IntroSurvey)

cor_regret_total_process_regret

## 
##  Pearson's product-moment correlation
## 
## data:  regret_total and process_regret
## t = 2.1832, df = 66, p-value = 0.03258
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02249018 0.46893262
## sample estimates:
##       cor 
## 0.2595236

outcome_regret?

cor_regret_total_outcome_regret <- cor.test(~ regret_total + outcome_regret, method = "pearson", data = IntroSurvey)

cor_regret_total_outcome_regret

## 
##  Pearson's product-moment correlation
## 
## data:  regret_total and outcome_regret
## t = 1.0319, df = 66, p-value = 0.3059
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1159037  0.3537996
## sample estimates:
##       cor 
## 0.1260034

What about regret_general?

cor_regret_total_regret_general <- cor.test(~ regret_total + regret_general, method = "pearson", data = IntroSurvey)

cor_regret_total_regret_general

## 
##  Pearson's product-moment correlation
## 
## data:  regret_total and regret_general
## t = 7.3213, df = 65, p-value = 4.682e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5152693 0.7856135
## sample estimates:
##       cor 
## 0.6722696

e. Maximizing Scale vs. Maximizing self-report

Is the score on the Maximizing Scale (maxi_total) related to the self-report of tendency to maximize (maxi_general)?

cor_maxi_total_maxi_general <- cor.test(~ maxi_total + maxi_general, method = "pearson", data = IntroSurvey)

cor_maxi_total_maxi_general

## 
##  Pearson's product-moment correlation
## 
## data:  maxi_total and maxi_general
## t = 2.8859, df = 67, p-value = 0.005247
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1040117 0.5276617
## sample estimates:
##       cor 
## 0.3325069

8. t-tests

We use t-tests to determine if two samples likely come from populations with different means. Of course, sample means will always differ somewhat, even if two samples are drawn from the same population. When the t-value is high enough, we conclude that it’s unlikely that the two samples actually share the same population mean.

The R syntax for a Student’s t-test is t.test(dv ~ iv, data = DataFrame). Note that because we’re defining the dataframe (for today’s assignment, IntroSurvey) within the test, we don’t need to dollar-sign index it in the names of the DV and IV variables. We can just use the variable name on its own (for example, inside this function, if age were our DV, we only need to call age, not IntroSurvey$age, because t.test() already knows to look inside of IntroSurvey.)

a. Handedness, based on age

Let’s start by seeing if our subjects show a difference in age based on handedness. (They shouldn’t!)

Note: t-tests can only be run on two groups. Earlier in question 4c, we created a new binary handedness variable called handed_binary that we can now use to separate two groups for our t-test.

# note that t.test syntax has your DV first, then your IV
# typically, the outcome variable comes first in R syntax, followed by whatever predicts it t_age_handed <- t.test(age ~ handed_binary, data = IntroSurvey) t_age_handed

Similarly to cor.test objects, if you save the output of t.test() into a variable, you can dollar-sign index into its sub-variables to access the estimated means, t-value, p-value, etc. from the test.

#t_age_handed$estimate

#t_age_handed$statistic

#t_age_handed$p.value

b. Regret, based on school affiliation

Next, you’ll create and inspect your own t-test objects. This time, let’s test for differences in scores by school affiliation (Columbia College/Barnard vs GS/SPS, for example). We will use the school_binary variable that we created earlier in question 4c. If you can’t remember how this variable was created, refer back to the code in that question.

Does regret score differ by school affiliation?

t_regret_total_school_binary <- t.test(regret_total ~ school_binary, data = IntroSurvey)

t_regret_total_school_binary

## 
##  Welch Two Sample t-test
## 
## data:  regret_total by school_binary
## t = -0.16547, df = 61.531, p-value = 0.8691
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7475580  0.6332722
## sample estimates:
##   mean in group B_CC mean in group GS_SPS 
##             4.500000             4.557143

c. Maximizing, based on school affiliation

Does tendency to maximize differ by school affiliation?

t_maxi_total_school_binary <- t.test(maxi_total ~ school_binary, data = IntroSurvey)

t_maxi_total_school_binary

## 
##  Welch Two Sample t-test
## 
## data:  maxi_total by school_binary
## t = 2.229, df = 48.581, p-value = 0.03047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0596064 1.1536120
## sample estimates:
##   mean in group B_CC mean in group GS_SPS 
##             4.825000             4.218391

d.

It might be reasonable to expect that people who made their course decisions using a calculation-based mode are more likely to be maximizers than those who used an affect- or rule-based mode. So we should see if there’s any difference in maximizing between those who used calculation-based modes and those who didn’t.

(Note, though, that the causality implied by this hypothesis likely runs in the opposite direction: we think that someone with a tendency to maximize would be more likely to use a calculation-based mode. The proper test for cases where a continuous variable predicts a binary outcome is a logistic regression. To keep things statistically simple, we’ll lean into the sloppy stats we’re already using in this lab, and test instead whether use of a calculation-based decision mode predicts maximizing score.)

Remember, the original decision mode variable, dec_mode has 4 options (calculation, rule, role, and affect). To do a t-test, we need to use the binary-recoded version of this variable, dec_mode_binary. We created this variable earlier in question 4c, so if you can’t remember how this variable was created, refer back to the code in that question.

Now, please run a t-test just like above, with dec_mode_binary as the IV and total maximizing score as the DV.

t_maxi_total_dec_mode_binary <- t.test(maxi_total ~ dec_mode_binary, data = IntroSurvey)

t_maxi_total_dec_mode_binary

## 
##  Welch Two Sample t-test
## 
## data:  maxi_total by dec_mode_binary
## t = -0.83703, df = 35.95, p-value = 0.4081
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8779561  0.3649932
## sample estimates:
##     mean in group calc mean in group not-calc 
##               4.402778               4.659259

V. Graphing data

9. Intro to ggplot

Time to graph some results! ggplot2 is a package that helps you make prettier figures in R, using more straightforward syntax. We’ll use ggplot2 to make a scatterplot and a bar graph, but first let’s get oriented with how it works.

At the beginning of this lab, you installed the package ggplot2. You only have to do that once ever (unless you uninstall it or update the package). When working in RStudio Cloud you basically never have to do it, becuase our assignments have already installed it before you made your own copy of the project file.

We also loaded the ggplot2 library earlier in this lab, which allows us to quickly access all of its functions, with the library(ggplot2) command. You need to do that once every time you open a new R session.

ggplot2 allows you to format your figures in many cool ways. When working on one complete project, whether it be one homework assignment or one set of research analyses, it’s easiest to just define a “theme” one time, and then apply it to every figure in that set of analyses.

We’ll create a theme called bwtheme (for black & white), which is nice and simple. We’ll use some ggplot2 functions to define this theme and save it into a variable. This way, you can create the theme variable once and then use it over and over on multiple graphs. Run the code below to save our theme into a variable:

bwtheme <- theme_bw() + 
  theme(strip.background = element_rect(fill = 'white')) +
  theme(plot.title = element_text(size = rel(1))) + 
  theme(axis.title = element_text(size = rel(1)))

When you plot data with ggplot2, you need to tell it 3 things:

what type of plot to make (e.g., scatter, bar chart)
what data to plot (e.g., which variables are your x and y values?)
what extra layers do you want to add (e.g., a regression line)

After you tell ggplot2 one thing about what you want your plot to look like, if you put a + at the end of a line it knows that you have more information to give it about the same plot. It will then look for more code to add elements to the figure. Once it finds a plotting command that isn’t finished with a +, it will stop and produce the figure.

It’s best practice to add a new line after each separate layer of your plotting code, so that each layer’s code is easily readable on its own. You can add as many layers to your plotting command as you’d like as long as you end each line with a + (except the last line).

For example, to plot a scatterplot of age and birthyear, you would use:

ggplot(IntroSurvey, aes(x = birthyear, y = age)) +
  geom_point(shape = 1, position=position_jitter(width=.05,height=.05)) +
  bwtheme

## Warning: Removed 2 rows containing missing values (geom_point).

# position_jitter randomly shifts each point vertically and/or horizontally
# this is useful if you have lots of overlapping data so you can see how
# many points are in that one position

The ggplot() command defines the dataframe (IntroSurvey) and the “aesthetics” (the x and y variables you want to use). The geom_point() command tells ggplot2 you want to graph points (aka a scatter plot). shape = 1 tells ggplot2 to use open circles for each point. ggplot2 supports several different point shapes in scatter plots, each labeled by a different number. You can experiment with different shapes by changing the number, if you’d like.

One useful thing to add to a scatterplot is a linear regression line. (A correlation between x and y is the same as a linear regression with only one predictor variable.) We’ll do this with geom_smooth().

ggplot(IntroSurvey, aes(x = birthyear, y = age)) +
  geom_point(shape = 1, position=position_jitter(width=.05,height=.05)) +
  geom_smooth(method = lm) +
  bwtheme

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

By default, geom_smooth() will graph the 95% confidence interval (CI) for your regression line. If you want to omit that line, set the se argument (stands for “standard error”) of geom_smooth() to be FALSE to turn the estimated standard error off, like this: geom_smooth(method = lm, se = FALSE)

Earlier, you found the correlation between regret and maximizing. Let’s graph that now. This time, we’ll add a title by adding an additional layer in our ggplot call, and we’ll also edit the x and y axis labels so that they’re a little more informative than our shorthand variable names.

ggplot(IntroSurvey, aes(x = maxi_total, y = regret_total)) +
  geom_point(shape = 1) +
  geom_smooth(method = lm) +
  # labs() lets you modify various labels in a ggplot
  labs(title = "Regret as a Function of Maximizing Tendency",
       x = "tendency to maximize",
       y = "tendency toward regret") +
  bwtheme

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

10. Scatterplots: your turn!

Create one more figure that shows the relationship between two continuous variables (or any variables that aren’t categorical—scatterplots for ordinal variables are okay, too!). Make sure your figure has a title, and change the labels on each axis so that someone not familiar with our variable names would understand what the figure represents. You can copy and paste code from the previous sections, above, but remember to change the variable names as needed.

ggplot(IntroSurvey, aes(x = major, y = school)) +
  geom_point(shape = 1, position=position_jitter(width=.05,height=.05)) +
  geom_smooth(method = lm) +
  bwtheme

## `geom_smooth()` using formula 'y ~ x'

11. Box Plots

Now let’s make a box plot, which is a clean way to illustrate the centers and spreads of multiple categories (e.g., the two groups you’re comparing with a t-test) . We’ll graph this in the same way as we made the scatter plot, but this time instead of using geom_point() to create a scatter plot we’ll use geom_boxplot() to generate a box plot.

ggplot(IntroSurvey, aes(x = handed, y = age)) +
  geom_boxplot() +
  bwtheme

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

geom_boxplot() shows you the median (middle line of the box), the 25th and 75th percentiles (the bottom and top edges of the box), and +- 1.5 inter-quartile ranges (IQRs) away from the median (the whiskers). Any points farther than 1.5 IQRs from the median are shown as individual points.

If you want to add the points over the boxplots (a good habit, so that your reader can see where the individual values fall, in addition to the summary statistics), you can easily do this by adding another layer on top of geom_boxplot() using something like geom_point(size = 3, alpha = .5). The alpha value is opacity (transparency), where larger fractions (closer to 1) are more opaque and lower fractions (closer to 0) are more transparent. Using transparency is another way to show how many data points are stacked right on top of each other, in addition to jittering the points’ locations like we did in the scatterplot above.

ggplot(IntroSurvey, aes(x = handed, y = age)) +
  geom_boxplot() +
  geom_point(size = 3, alpha = .3) +
  bwtheme

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

## Warning: Removed 2 rows containing missing values (geom_point).

If you want to change the order of the x-axis labels, or change their text, you can change these with various arguments within scale_x_discrete(). There are several functions you can use to modify the behavior of x-axes and y-axes. In this case, because the x variable is categorical, scale_x_discrete() is the appropriate function to use.

ggplot(IntroSurvey, aes(x=handed, y=age)) +
  geom_boxplot() +
  geom_point(size = 3, alpha = .3) +
  bwtheme   +
  scale_x_discrete(limits = c("L", "R", "A"), # limits re-orders the three bars
                   # labels changes text.
                   # Make sure these are specified in the correct order!
                   labels = c("Left", "Right", "Ambi"))

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

## Warning: Removed 2 rows containing missing values (geom_point).

As mentioned above, if making the points transparent doesn’t solve the problem of telling overlapping points apart, you can “jitter” the location of each point horizontally by using geom_jitter() in place of geom_point(). This adds a little bit of random noise to each point’s location, which makes it so that points which were on top of each other are now slightly offset.

Below, the geom_point() line has been commented out so it won’t run. R will skip over it and run the geom_jitter() line below it and keep going as usual. If you uncomment geom_point() by deleting the #, and instead comment out the geom_jitter() line by adding a # in front it, you’ll see your partially transparent points again.

Since both of these commands alter the way points are plotted by ggplot2, you should only use one at a time.

ggplot(IntroSurvey, aes(x=handed, y=age)) +
  geom_boxplot() +
  # geom_point(size = 3, alpha = .3) +
  geom_jitter(position=position_jitter(width=.2, height=0)) + 
  # jittering width only preserves correct age
  bwtheme   +
  scale_x_discrete(limits=c("L","R","A"),
                   labels=c("Left", "Right", "Ambi"))

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

## Warning: Removed 2 rows containing missing values (geom_point).

Now, please create one new box plot that shows any of our continuous DVs as a function of one of our categorical variables. Again, make sure your figure has a title, and change the labels on each axis so that someone not familiar with our variable names would understand what the figure represents.

ggplot(IntroSurvey, aes(x = school, y = major)) +
  geom_boxplot() +
  bwtheme

Submitting this assignment

When saving an R Markdown file, you can either save it as a text-based file so you can keep editing it and running the code inside later, or you can export it as a final document that anyone can read without opening RStudio. Exporting an R Markdown is called “knitting.” To knit your file, click the “Knit” button at the top of your Editor pane, and click “Knit to HTML” or “Knit to PDF.” Once the HTML/PDF document finishes exporting, it should save into the same directory as your original R Markdown file. Rename this HTML document with a new filename that includes your own name, e.g., R_Assignment_2_Jorge_Mallea.html.

To get credit for this assignment, upload your completed “knitted” HTML/PDF to the Assignments section of Canvas for UN1490.

Psych 1490, Assignment 2

Rose Kuan

09/21/2020