While R can do a lot of amazing stuff on its own, some special types of analysis or graphing can be more easily (or stylishly) done using one of a (huge) number of “packages” that people have created specially for R. Think of R as your phone, and the packages as apps that each allow the phone to do something extra.
To use a package, you have to install it (once), and then load it (every time you re-open RStudio).
In this assignment, we’ll need the assistance of two packages in particular:
ggplot2 is the package we’ll use to make clean, attractive graphsknitr is the package we’ll use to export this homework assignment to a format you can upload to CanvasFirst, you can use the code below to install these two packages if you haven’t already.
# Only run this code if you HAVEN'T already installed these packages!
# They are almost certainly already installed - that's a benefit of RStudio Cloud!
install.packages(c("tidyverse",
"knitr"))
Next, use the following code to load, or “turn on”, these packages for the current RStudio window so you can actually access the functions inside. You need to do this every time you open a new RStudio tab! For this code chunk, all you’ll need to do is run the code below:
# We only need to "turn on" tidyverse
# knitr actually works behind the scenes already!
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
In this assignment, we will be modeling a specific order of R tasks:
I. Load your raw data II. Prepare your data for analysis III. Generate descriptive statistics IV. Generate inferential statistics V. Generate graphs
This is important because R documents execute their code in order from beginning to end, or from top to bottom. All the code at the end of the document assumes that code at the beginning of the document has already been run. For example, if you want to run a t-test on a mean of a set of other scores that you must calculate from your raw data, the code to calculate that mean score needs to come before the code that runs the t-test.
The first thing you do in any data analysis document is to load your fresh raw data into R so you can later work with and calculate statistics on your data.
Check your working directory using getwd(). If you are using RStudio Cloud, or if you opened RStudio via the R project file in the same directory that contains your datafile (classdata_2020.csv) and your R scripts, you should be in that folder now.
Then check to make sure the data is there using list.files().
getwd
## function ()
## .Internal(getwd())
## <bytecode: 0x3443c38>
## <environment: namespace:base>
list.files('classdata_2020.csv')
## character(0)
Read in your data file using read_csv("classdata_2020.csv"), and assign it to the dataframe name IntroSurvey using the <- left-arrow operator. If you’re not sure how to do this, check the code you used in R Assignment 1. (You can open a fresh R markdown file of Assignment 1 via the Files pane in the bottom righthand corner of the RStudio screen. It will open in another tab, so you can go back and forth to copy code from it and paste it here.)
Remember that for code to run, it must be inside a “code chunk”—e.g., the grey-background area that starts immediately below this paragraph. The functions in the paragraph above will not get run by R, because they’re in the plain-text part of this document.
IntroSurvey <- read.csv("classdata_2020.csv")
Use the following exploration functions to check that the dataframe IntroSurvey read in correctly:
head()
head(IntroSurvey)
## id maxi1 maxi2 maxi3 maxi4 maxi5 maxi6 regret1 regret2 regret3 regret4
## 1 1 2 6 3 4 4 2 6 5 5 6
## 2 2 4 4 6 6 5 5 6 6 6 3
## 3 3 4 2 6 4 7 3 7 3 6 7
## 4 4 6 4 3 7 5 6 7 7 5 3
## 5 5 5 5 7 7 6 7 2 2 2 1
## 6 6 5 7 3 6 7 6 7 7 7 7
## regret5 courses_enrolled courses_shopped points_enrolled time_planning
## 1 1 4 6 15.0 7 to 9
## 2 5 5 6 16.0 3 to 5
## 3 7 2 3 8.0 1 to 3
## 4 6 7 7 17.0 1 to 3
## 5 1 6 6 18.5 7 to 9
## 6 6 5 6 16.0 1 to 3
## courses_satisfaction
## 1 5
## 2 4
## 3 4
## 4 3
## 5 3
## 6 3
## dec_mode
## 1 calculation-based decision mode (e.g., weighing pros and cons of each course against one another)
## 2 calculation-based decision mode (e.g., weighing pros and cons of each course against one another)
## 3 calculation-based decision mode (e.g., weighing pros and cons of each course against one another)
## 4 role-based decision mode (e.g., taking what a Psychology major ought to take)
## 5 role-based decision mode (e.g., taking what a Psychology major ought to take)
## 6 role-based decision mode (e.g., taking what a Psychology major ought to take)
## process_regret outcome_regret regret_general maxi_general psych_courses age
## 1 1 2 4 6 5 21
## 2 2 2 1 5 7 20
## 3 1 1 4 6 6 23
## 4 1 2 2 5 6 20
## 5 1 1 1 6 4 21
## 6 2 2 5 5 4 20
## birthyear class school gender handed major
## 1 1999 Senior CC M R Psychology
## 2 2000 Junior CC M R Psychology
## 3 1996 Post-bac Continuing Ed. / SPS F R Psychology
## 4 2000 Junior CC F R Psychology
## 5 1999 Senior CC M R Psychology
## 6 1999 Senior CC M R Psychology
## concentration reader programs CRT1_5 CRT2_5 CRT3_47 CRT_total
## 1 None 1 1 0 0 0 0
## 2 None 1 1 1 1 1 3
## 3 Psychology 1 1 0 0 0 0
## 4 Sociology 0 2 0 0 0 0
## 5 Pre-Med 1 2 1 0 1 2
## 6 Pre-Med 0 1 0 1 1 2
summary()
summary(IntroSurvey)
## id maxi1 maxi2 maxi3 maxi4
## Min. : 1 Min. :2.000 Min. :1 Min. :1.000 Min. :1.000
## 1st Qu.:18 1st Qu.:4.000 1st Qu.:2 1st Qu.:2.000 1st Qu.:5.000
## Median :35 Median :5.000 Median :4 Median :3.000 Median :6.000
## Mean :35 Mean :4.957 Mean :4 Mean :3.449 Mean :5.333
## 3rd Qu.:52 3rd Qu.:6.000 3rd Qu.:5 3rd Qu.:5.000 3rd Qu.:7.000
## Max. :69 Max. :7.000 Max. :7 Max. :7.000 Max. :7.000
##
## maxi5 maxi6 regret1 regret2
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:5.000 1st Qu.:4.000
## Median :5.000 Median :5.000 Median :6.000 Median :5.000
## Mean :5.116 Mean :4.565 Mean :5.304 Mean :4.725
## 3rd Qu.:7.000 3rd Qu.:6.000 3rd Qu.:7.000 3rd Qu.:6.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :7.000
##
## regret3 regret4 regret5 courses_enrolled
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000 1st Qu.: 4.000
## Median :4.500 Median :4.000 Median :5.000 Median : 5.000
## Mean :4.279 Mean :3.928 Mean :4.377 Mean : 4.406
## 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.:6.000 3rd Qu.: 5.000
## Max. :7.000 Max. :7.000 Max. :7.000 Max. :10.000
## NA's :1
## courses_shopped points_enrolled time_planning courses_satisfaction
## Min. : 0.000 Min. : 4.00 Length:69 Min. :1.000
## 1st Qu.: 4.750 1st Qu.:13.00 Class :character 1st Qu.:3.000
## Median : 5.000 Median :16.00 Mode :character Median :4.000
## Mean : 5.632 Mean :14.88 Mean :3.797
## 3rd Qu.: 6.000 3rd Qu.:17.00 3rd Qu.:4.000
## Max. :25.000 Max. :21.50 Max. :5.000
## NA's :1
## dec_mode process_regret outcome_regret regret_general
## Length:69 Min. :1.000 Min. :1.000 Min. :1.000
## Class :character 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
## Mode :character Median :2.000 Median :2.000 Median :3.000
## Mean :1.928 Mean :2.145 Mean :2.926
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :6.000
## NA's :1
## maxi_general psych_courses age birthyear
## Min. :2.000 Min. :1.000 Min. :19.00 Min. :1987
## 1st Qu.:4.000 1st Qu.:2.000 1st Qu.:20.00 1st Qu.:1995
## Median :5.000 Median :5.000 Median :21.00 Median :1998
## Mean :4.754 Mean :4.241 Mean :22.99 Mean :1997
## 3rd Qu.:6.000 3rd Qu.:6.000 3rd Qu.:24.50 3rd Qu.:1999
## Max. :6.000 Max. :9.000 Max. :33.00 Max. :2001
## NA's :11 NA's :2
## class school gender handed
## Length:69 Length:69 Length:69 Length:69
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## major concentration reader programs
## Length:69 Length:69 Min. :0.0000 Min. :1.000
## Class :character Class :character 1st Qu.:0.0000 1st Qu.:1.000
## Mode :character Mode :character Median :0.0000 Median :2.000
## Mean :0.2174 Mean :1.609
## 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :1.0000 Max. :3.000
##
## CRT1_5 CRT2_5 CRT3_47 CRT_total
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :2.000
## Mean :0.5652 Mean :0.5797 Mean :0.7681 Mean :1.913
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :3.000
##
str()
str(IntroSurvey)
## 'data.frame': 69 obs. of 37 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ maxi1 : int 2 4 4 6 5 5 2 6 7 5 ...
## $ maxi2 : int 6 4 2 4 5 7 4 1 2 7 ...
## $ maxi3 : int 3 6 6 3 7 3 3 2 1 5 ...
## $ maxi4 : int 4 6 4 7 7 6 3 7 6 6 ...
## $ maxi5 : int 4 5 7 5 6 7 4 5 7 7 ...
## $ maxi6 : int 2 5 3 6 7 6 2 6 5 6 ...
## $ regret1 : int 6 6 7 7 2 7 3 7 7 6 ...
## $ regret2 : int 5 6 3 7 2 7 5 6 7 5 ...
## $ regret3 : int 5 6 6 5 2 7 2 7 7 5 ...
## $ regret4 : int 6 3 7 3 1 7 2 6 7 7 ...
## $ regret5 : int 1 5 7 6 1 6 5 6 7 3 ...
## $ courses_enrolled : int 4 5 2 7 6 5 3 5 5 6 ...
## $ courses_shopped : int 6 6 3 7 6 6 4 6 6 5 ...
## $ points_enrolled : num 15 16 8 17 18.5 16 12 16 21.5 17 ...
## $ time_planning : chr "7 to 9" "3 to 5" "1 to 3" "1 to 3" ...
## $ courses_satisfaction: int 5 4 4 3 3 3 5 4 3 5 ...
## $ dec_mode : chr "calculation-based decision mode (e.g., weighing pros and cons of each course against one another)" "calculation-based decision mode (e.g., weighing pros and cons of each course against one another)" "calculation-based decision mode (e.g., weighing pros and cons of each course against one another)" "role-based decision mode (e.g., taking what a Psychology major ought to take)" ...
## $ process_regret : int 1 2 1 1 1 2 3 2 5 2 ...
## $ outcome_regret : int 2 2 1 2 1 2 1 2 2 1 ...
## $ regret_general : int 4 1 4 2 1 5 2 5 6 3 ...
## $ maxi_general : int 6 5 6 5 6 5 5 6 6 5 ...
## $ psych_courses : int 5 7 6 6 4 4 6 6 2 5 ...
## $ age : int 21 20 23 20 21 20 28 21 21 21 ...
## $ birthyear : int 1999 2000 1996 2000 1999 1999 1992 1999 1999 1998 ...
## $ class : chr "Senior" "Junior" "Post-bac" "Junior" ...
## $ school : chr "CC" "CC" "Continuing Ed. / SPS" "CC" ...
## $ gender : chr "M" "M" "F" "F" ...
## $ handed : chr "R" "R" "R" "R" ...
## $ major : chr "Psychology" "Psychology" "Psychology" "Psychology" ...
## $ concentration : chr "None" "None" "Psychology" "Sociology" ...
## $ reader : int 1 1 1 0 1 0 0 0 0 0 ...
## $ programs : int 1 1 1 2 2 1 2 3 2 2 ...
## $ CRT1_5 : int 0 1 0 0 1 0 1 1 0 0 ...
## $ CRT2_5 : int 0 1 0 0 0 1 1 1 0 0 ...
## $ CRT3_47 : int 0 1 0 0 1 1 1 1 1 0 ...
## $ CRT_total : int 0 3 0 0 2 2 3 3 1 0 ...
Before we calculate any statistics on our data, we may need to clean the raw data, or calculate additional values of interest in our data. It is important to do this before calculating any statistics, so that any values you may need for your statistics are already present in your data by the time you run your “analysis” code.
The code below will calculate two new columns based on existing raw data in IntroSurvey. Note that this way of calculating regret_total (the mean of the 5 individual regret questions) is different than how you did it in Assignment 1. There are often multiple ways to accomplish the same thing in R—note that this way is more transparent in that it shows exactly how regret_total is calculated, while using rowMeans() like we did last time was more elegant, since it used less code to accomplish the same objective.
regret_total, for overall Regret Score on the Regret Scale, by averaging all 5 individual regret scores, regret1 to regret5maxi_total, for overall Maximizing score on the Maximizing Scale, by averaging all 6 individual regret scores, maxi1 to maxi6IntroSurvey <- mutate(IntroSurvey,
regret_total = (regret1 + regret2 + regret3 + regret4 + regret5) / 5,
maxi_total = (maxi1 + maxi2 + maxi3 + maxi4 + maxi5 + maxi6) / 6)
You can use the function mutate() that you see above to add new columns to a dataframe, or overwrite existing columns in a dataframe. Anytime you want to calculate a new column that may be some combination of existing columns, like averaging the values across multiple columns, mutate() is the function you want.
You can create as many new columns as you want inside of one mutate() command, as long as you separate each row of column-creating code with a comma. To help your code look more readable, press EnterReturn after the comma at the end of each row of column-creating code, to put each row of code on its own row. (This is not necessary for R; it’s just easier formatting for us humans to parse with our eyeballs.)
Now, it’s your turn to calculate some new columns of data.
The Maximizing Scale has been shown to divide up into three subscales:
maxi1 & maxi3)maxi2 & maxi5)maxi4 & maxi6)Please use the approach seen above to assign three new variables in the IntroSurvey dataframe, each of which represents one subscale: maxi_as, maxi_dd, and maxi_hs. (Hint: you should be averaging the two maxi scores for each.)
IntroSurvey <- mutate(IntroSurvey,
maxi_as = (maxi1 + maxi3) / 2,
maxi_dd = (maxi2 + maxi5) / 2,
maxi_hs = (maxi4 + maxi6) / 2)
Above, we used mutate() to create new continuous variables by averaging the values of other continuous variables together. We can also create new categorical variables by recoding the values of existing categorical variables into levels that are more convenient for the statistics we want to calculate.
The question about decision mode allowed for 4 options (calculation, rule, role, and affect). To do a t-test, we need the variable re-coded as binary, with two levels: calculation vs. not-calculation. To do this, we’ll use the ifelse() function with mutate() to generate a new column in the data. The code below tells R to create a new variable (dec_mode_binary) whose value is "calc" if the original variable (dec_mode) started with the phrase “calculation-based”, and whose value is "not-calc" otherwise.
IntroSurvey <- mutate(IntroSurvey,
dec_mode_binary = ifelse(startsWith(dec_mode, "calculation-based"),
"calc",
"not-calc"))
There are various strategies you can use to recode the levels of categorical variables. Above, we used a logical test to create a binary variable based on whether the logical test returned true or false. Creating binary variables is very useful for setting up t-tests, because t-tests require you to split your data up into two groups for comparison.
Next, let’s create another new binary variable. This time, we will re-code school affiliation into a binary variable. Later, we can use this binary variable to test for differences in scores by school affiliation (Columbia College/Barnard vs GS/SPS, for example).
The theoretical reason why you might want to group students in this way is that CC and Barnard students tend to have come to college straight out of high school (or with a small gap), whereas GS and post-bac students are more likely to have had some amount of gap between high school and arriving at Columbia (e.g., coursework at another university; a dance or athletic career; military service; etc.). So it might be reasonable to look to see if there are any differences between those two groups of students.
To do this, we’ll use the ifelse() function with mutate() to generate a new column in our data, like we did earlier. The code below tells R to create a new variable (school_binary) whose value is "B_CC" if the original variable (school) was either Barnard or CC, and whose value is "GS_SPS" otherwise.
IntroSurvey <- mutate(IntroSurvey,
school_binary = ifelse(school %in% c("Barnard", "CC"),
"B_CC",
"GS_SPS"))
Now it’s your turn. Use mutate() and ifelse() as we demonstrated above to create another binary variable based on handedness. Right now, handedness may have up to 3 levels: “R”, “L”, or “A” (for ambidextrous). Create a new binary variable called handed_binary with two levels: “R” and “notR”, where anyone who is not right-handed has the value “notR”.
IntroSurvey <- mutate(IntroSurvey,
handed_binary = ifelse(handed %in% c("R", "L","A"),
"R",
"notR"))
Use head() to check that the following new variables are present in IntroSurvey:
regret_totalmaxi_totalmaxi_as, maxi_dd, & maxi_hshead('regret_total')
## [1] "regret_total"
head('maxi_total')
## [1] "maxi_total"
head('maxi_as', 'maxi_dd','maxi_hs')
## [1] "maxi_as"
What other ways are there to check whether your new variables were created correctly? Ask your TA or a classmate to come up with at least one way, other than using head(), of checking your work.
Going through the code again and make sure the average is right.
Before we start doing hypothesis tests, we should take a look at what our key variables look like—how they are distributed, what their range of values are, where they’re centered, and how much variance they have.
Let’s start with some descriptive summary statistics on our new variables. Since they are means of ordinal data, these new variables are technically ordinal as well. But for our purposes in this class, we will fudge things and treat them as scale (interval/ratio).
To calculate multiple summary statistics on one dataframe at once, we will use summarize(). As an example, first let’s calculate both the mean and standard devation of the total regret score across all subjects.
# First, create a summary dataframe to hold the relevant info
Regret_Total_Summary <- summarize(IntroSurvey,
mean_regret_total = mean(regret_total, na.rm = TRUE),
sd_regret_total = sd(regret_total, na.rm = TRUE))
# Then print the dataframe to show the summary values inside
Regret_Total_Summary
## mean_regret_total sd_regret_total
## 1 4.523529 1.41507
The syntax you use to calculate summary stats with summarize() is similar to the syntax you use to create new columns with mutate() like we did earlier. You can calculate multiple summary stats in the same summarize() command as long as you separate every new stat you want to calculate with a comma.
Use mean() and sd() inside of summarize() to find the mean and standard deviations for:
# First, create a summary dataframe to hold the relevant info
Maxi_Total_Summary <- summarize(IntroSurvey,
mean_maxi_total = mean(maxi_total, na.rm = TRUE),
sd_maxi_total = sd(maxi_total, na.rm = TRUE))
summarize() call)# Then print the dataframe to show the summary values inside
Maxi_Total_Summary
## mean_maxi_total sd_maxi_total
## 1 4.570048 1.096416
Next, we will use hist() to visualize the distribution of scores for continuous variables in our data. As a refresher, hist() works a little differently than most of the functions we will use in this assignment. In most of the functions we are using, you already specify to the function that it should look inside IntroSurvey, so you do not need to use the $ to tell R where to find the columns you’re referring to. (You hand the mail carrier a bundle of letters that’s marked as all going to Schermerhorn, so that on each letter you only need to put the room number and not also the building name.) However, hist() isn’t that fancy, so you will need to use the $ to tell it to generate a histogram of a specific column inside of a larger dataframe in your environment.
As an example, we will first generate a histogram of the distribution of total regret scores over all students in the data. If we’re going to be using it in t-tests and correlations, we’ll want to be able to assume it came from a population that is normally distributed. Does that look like a reasonable assumption?
hist(IntroSurvey$regret_total)
Now, on your own, use
hist() like we did above to generate histograms for the columns that you just calculated the mean and standard deviation for in question 6a:
hist(IntroSurvey$maxi_total)
** Alternative Search
hist(IntroSurvey$maxi_as)
** Decision Difficulty
hist(IntroSurvey$maxi_dd)
** High Standards
hist(IntroSurvey$maxi_hs)
Histograms are a fantastic way to visualize a summary of continuous, numeric variables. However, not all data are numeric—sometimes variables of interest are categorical (nominal), where each observation is one of a few specified levels.
We can use the function count() to see how many observations occur at each level of a categorical variable. For example, let’s use count() to count how many values are in each level of the column class. This will tell us how many participants are in each class year in our IntroSurvey data.
count_class <- count(IntroSurvey,
class)
count_class
## class n
## 1 Junior 33
## 2 Post-bac 5
## 3 Second-year 4
## 4 Senior 27
How many students fall into each category of school?
count_school <- count(IntroSurvey,
school)
count_school
## school n
## 1 Barnard 2
## 2 CC 38
## 3 Continuing Ed. / SPS 6
## 4 GS 23
What about handed?
count_handed <- count(IntroSurvey,
handed)
count_handed
## handed n
## 1 A 1
## 2 L 6
## 3 R 62
Explore at least 3 other variables that you think might have a relationship with maximizing and/or regret. Add a line of text above each code chunk to introduce the variable you’re exploring, and describe what analyses you do in the code chunk.
If the variable is numeric or interval (i.e., if its responses are numbers), use summarize() with mean() and sd() to compute the mean and standard deviation, and use hist() to produce a histogram. R calls these kinds of data “numeric” or “double” or “interval”.
If the variable is categorical, use count() to count the values in each level of the variable. R calls these kinds of data “character” (chr) or “factor” (fac).
Hint: Use class() to check what type of variable you are working with.
#Counting categorial values for major
count_major <- count(IntroSurvey,
major)
count_major
## major n
## 1 Chemistry 1
## 2 Neuroscience & Behavior 4
## 3 Psychology 52
## 4 Psychology, Art History 1
## 5 Psychology, Computer Science 1
## 6 Psychology, creative writing 1
## 7 Psychology, Creative Writing 2
## 8 Psychology, Data Science 1
## 9 Psychology, Economics 1
## 10 Psychology, Political Science 1
## 11 Psychology, Sociology 1
## 12 Undecided 2
## 13 Undecided, (pending) Computer Science 1
#Counting categorial values for gender
count_gender <- count(IntroSurvey,
gender)
count_gender
## gender n
## 1 F 45
## 2 M 23
## 3 Prefer not to specify 1
#Producing histogram for CRT
hist(IntroSurvey$CRT_total)
Many times, when investigating summary statistics to learn more about your data, you will want to compare values for different study groups in your participant population, in addition to reporting data across all subjects as we just learned how to do.
Here, we will learn about some functions that will help you break up your summaries by participant groups.
First, you can use the filter() function to choose a subset of rows out of your data, so you can produce summaries on one group at a time.
righties <- filter(IntroSurvey,
handed == "R")
Inside of filter(), you use logical statements based on columns of your data to choose which rows to subset. If you use multiple logical statements in the same filter() command, separated with commas, you will choose the intersection of all those logical statements, or all the rows where the first statement and the second statement and the third statement, etc., are true.
Below, use filter() to choose the subset of your data belonging to Columbia College-affiliated right-handed students. You can use the code above to help you, but this time, you will need two logical statements inside of filter().
righties <- filter(IntroSurvey,
handed == "R",school == "CC",gender == "F")
Sometimes, you may want to choose just a subset of your data to operate on, but other times, you may want to calculate all summary statistics for different groups in your data at once, so you can compare them. We will learn how to do this next.
Remember the summarize() function from before? Previously, we used it to calculate summary statistics over all observations in our data. We can also use summarize() to calculate statistics separately for different groups in our data. In order to do this, we must tell R what groups are in our data, using a new function, group_by(). Then, summarize() will automatically calculate summary statistics separately for every group it sees in the data.
Here’s an example: Let’s calculate the mean and SD of students’ total regret scores, like we did in question 6a. This time, however, we will calculate it separately for students of every class year.
# First, create a summary dataframe to hold the relevant info
regret_total_by_class_summary <- IntroSurvey %>%
group_by(class) %>%
summarize(mean_regret_total = mean(regret_total, na.rm = TRUE),
sd_regret_total = sd(regret_total, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
# Then print the dataframe to show the summary values inside
regret_total_by_class_summary
## # A tibble: 4 x 3
## class mean_regret_total sd_regret_total
## <chr> <dbl> <dbl>
## 1 Junior 4.42 1.35
## 2 Post-bac 5.28 0.522
## 3 Second-year 3.1 1.54
## 4 Senior 4.72 1.48
If you run the code above, you’ll see that you get columns for mean regret score and SD of regret score, but now instead of one summary value for all subjects, there’s a separate value for each level of class, corresponding to each unique class year reported in the data.
There is also a new command used in the code above: the pipe operator, %>%. The %>% operator does one simple, but key, thing: it takes the object on the left-hand side and feeds it into the first argument of the function on the right-hand side.
This means that you can put multiple functions together, one after the other, and R will understand and execute them in left to right, like natural English. In the code above, this means that we are telling R to do these things in this order:
IntroSurvey data objectgroup_by() on IntroSurvey to tell R to group the rows of IntroSurvey by the class columnsummarize() on the grouped IntroSurvey data, calculating a mean and SD of regret_totalIt’s very important here that group_by() is called before summarize(), because otherwise R wouldn’t know how to group the data before summarizing it.
We will practice this a couple times.
Below, group IntroSurvey by school affiliation (CC, Barnard, GS, etc), and then calculate mean and SD of total regret score for each school group in the data using the pipe operator.
# First, create a summary dataframe to hold the relevant info
regret_total_by_school_summary <- IntroSurvey %>%
group_by(school) %>%
summarize(mean_regret_total = mean(regret_total, na.rm = TRUE),
sd_regret_total = sd(regret_total, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
# Then print the dataframe to show the summary values inside
regret_total_by_school_summary
## # A tibble: 4 x 3
## school mean_regret_total sd_regret_total
## <chr> <dbl> <dbl>
## 1 Barnard 4.8 3.11
## 2 CC 4.48 1.43
## 3 Continuing Ed. / SPS 4.93 0.969
## 4 GS 4.45 1.43
Below, group IntroSurvey by major affiliation, and then calculate mean and SD of total maximizing score for each major in the data.
# First, create a summary dataframe to hold the relevant info
regret_total_by_major_summary <- IntroSurvey %>%
group_by(major) %>%
summarize(mean_regret_total = mean(regret_total, na.rm = TRUE),
sd_regret_total = sd(regret_total, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
# Then print the dataframe to show the summary values inside
regret_total_by_major_summary
## # A tibble: 13 x 3
## major mean_regret_total sd_regret_total
## <chr> <dbl> <dbl>
## 1 Chemistry 7 NA
## 2 Neuroscience & Behavior 2.95 1.17
## 3 Psychology 4.64 1.39
## 4 Psychology, Art History 5.2 NA
## 5 Psychology, Computer Science 2.4 NA
## 6 Psychology, creative writing 5.8 NA
## 7 Psychology, Creative Writing 4.3 2.12
## 8 Psychology, Data Science 3.4 NA
## 9 Psychology, Economics 4.2 NA
## 10 Psychology, Political Science 4.4 NA
## 11 Psychology, Sociology 3.4 NA
## 12 Undecided 4.5 0.990
## 13 Undecided, (pending) Computer Science 6 NA
Now we can start doing some real data analysis! For the purposes of getting practice in a variety of inferential statistics in R, we’ll be running more exploratory tests than would normally be considered appropriate for a real study. We’ll talk more about why this is a problem later in the semester, but hopefully in your stats class you talked about the dangers of running too many tests—e.g., alpha-inflation, p-hacking, etc.
Now that you have a sense for what the individual variables each show, let’s look at how some of them relate to each other. First, some correlations.
Let’s see if age correlates with birthyear (it should!). The function cor.test() allows us to put in two variables, see what their correlation is, and return a p-value for that correlation.
The R syntax for a correlation is cor.test(~ var1 + var2, data = DataFrame). Note that because we’re defining the dataframe (for today’s assignment, IntroSurvey) within the test, we don’t need to dollar-sign index it in the names of the variables. We can just use the variable name on its own (for example, inside this function, if age were one of our variables, we only need to call age, not IntroSurvey$age, because cor.test() already knows to look inside of IntroSurvey.)
cor.test() allows you to run a Pearson, Kendall, or Spearman correlation, depending on what you tell it to do using the argument method. This time, since we want to do a Pearson correlation, we’ll use method = "pearson".
First, we’ll save the contents of our correlation test into the variable cor_age_birthyear so we can access its contents later. Then, we’ll print cor_age_birthyear to the console, so we can visually inspect the results of the correlation test.
cor_age_birthyear <- cor.test(~ age + birthyear, method = "pearson", data = IntroSurvey)
cor_age_birthyear
##
## Pearson's product-moment correlation
##
## data: age and birthyear
## t = -67.981, df = 65, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9957309 -0.9886655
## sample estimates:
## cor
## -0.9930409
Above, you can see the r-value, p-value, and several other pieces of info about the correlation test. These statistics are actually saved as variables inside of the correlation test object. To extract these statistics and report them, we can use the dollar sign $ to index these statistics, similar to indexing columns in a dataframe.
To index the Pearson’s r for this test, we can look for the sub-variable called estimate:
cor_age_birthyear$estimate
## cor
## -0.9930409
And to index the p-value, we can call the sub-variable p.value:
cor_age_birthyear$p.value
## [1] 4.112443e-62
Now, we can use these variable names to refer to the Pearson’s r and p-value of this correlation test. You don’t have to look at the numbers and copy them over!
PS: In RStudio, if you want to know all the sub-variables inside of an object, if you type the name of the object with a $ after it, an auto-complete menu should pop up and you can use the up and down arrow keys to look through all the possible sub-variables.
Let’s see if maximizing correlates with regret. Use cor.test() on those two variables, and report the Pearson’s r and p-value. Remember to label what you are doing in the text above each code chunk.
cor_maxi_total_regret_total <- cor.test(~ maxi_total + regret_total, method = "pearson", data = IntroSurvey)
cor_maxi_total_regret_total
##
## Pearson's product-moment correlation
##
## data: maxi_total and regret_total
## t = 1.5852, df = 66, p-value = 0.1177
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04915847 0.41116223
## sample estimates:
## cor
## 0.1915112
cor_maxi_total_regret_total$estimate
## cor
## 0.1915112
cor_maxi_total_regret_total$p.value
## [1] 0.1177043
Typically, maximizing correlates reasonably strongly with regret (r ~ 0.4). Does this appear to be the case in our dataset?
Let’s see if maximizing correlates with various measures of course-selection behavior. Calculate the correlation and report the Pearson’s r between maxi_total and each of the following variables:
courses_enrolled
cor_maxi_total_courses_enrolled <- cor.test(~ maxi_total + courses_enrolled, method = "pearson", data = IntroSurvey)
cor_maxi_total_courses_enrolled
##
## Pearson's product-moment correlation
##
## data: maxi_total and courses_enrolled
## t = 3.6432, df = 67, p-value = 0.0005261
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1880423 0.5868297
## sample estimates:
## cor
## 0.4066257
courses_shopped
cor_maxi_total_courses_shopped <- cor.test(~ maxi_total + courses_shopped, method = "pearson", data = IntroSurvey)
cor_maxi_total_courses_shopped
##
## Pearson's product-moment correlation
##
## data: maxi_total and courses_shopped
## t = 0.96346, df = 66, p-value = 0.3388
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1241420 0.3464656
## sample estimates:
## cor
## 0.1177689
points_enrolled
cor_maxi_total_points_enrolled <- cor.test(~ maxi_total + points_enrolled, method = "pearson", data = IntroSurvey)
cor_maxi_total_points_enrolled
##
## Pearson's product-moment correlation
##
## data: maxi_total and points_enrolled
## t = 2.3237, df = 67, p-value = 0.02319
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03892622 0.47882273
## sample estimates:
## cor
## 0.2730909
Is the regret score (regret_total), as calculated by the Regret Scale, correlated with either of our students’ self-reports of regret:
process_regret?
cor_regret_total_process_regret <- cor.test(~ regret_total + process_regret, method = "pearson", data = IntroSurvey)
cor_regret_total_process_regret
##
## Pearson's product-moment correlation
##
## data: regret_total and process_regret
## t = 2.1832, df = 66, p-value = 0.03258
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02249018 0.46893262
## sample estimates:
## cor
## 0.2595236
outcome_regret?
cor_regret_total_outcome_regret <- cor.test(~ regret_total + outcome_regret, method = "pearson", data = IntroSurvey)
cor_regret_total_outcome_regret
##
## Pearson's product-moment correlation
##
## data: regret_total and outcome_regret
## t = 1.0319, df = 66, p-value = 0.3059
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1159037 0.3537996
## sample estimates:
## cor
## 0.1260034
What about regret_general?
cor_regret_total_regret_general <- cor.test(~ regret_total + regret_general, method = "pearson", data = IntroSurvey)
cor_regret_total_regret_general
##
## Pearson's product-moment correlation
##
## data: regret_total and regret_general
## t = 7.3213, df = 65, p-value = 4.682e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5152693 0.7856135
## sample estimates:
## cor
## 0.6722696
Is the score on the Maximizing Scale (maxi_total) related to the self-report of tendency to maximize (maxi_general)?
cor_maxi_total_maxi_general <- cor.test(~ maxi_total + maxi_general, method = "pearson", data = IntroSurvey)
cor_maxi_total_maxi_general
##
## Pearson's product-moment correlation
##
## data: maxi_total and maxi_general
## t = 2.8859, df = 67, p-value = 0.005247
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1040117 0.5276617
## sample estimates:
## cor
## 0.3325069
We use t-tests to determine if two samples likely come from populations with different means. Of course, sample means will always differ somewhat, even if two samples are drawn from the same population. When the t-value is high enough, we conclude that it’s unlikely that the two samples actually share the same population mean.
The R syntax for a Student’s t-test is t.test(dv ~ iv, data = DataFrame). Note that because we’re defining the dataframe (for today’s assignment, IntroSurvey) within the test, we don’t need to dollar-sign index it in the names of the DV and IV variables. We can just use the variable name on its own (for example, inside this function, if age were our DV, we only need to call age, not IntroSurvey$age, because t.test() already knows to look inside of IntroSurvey.)
Let’s start by seeing if our subjects show a difference in age based on handedness. (They shouldn’t!)
Note: t-tests can only be run on two groups. Earlier in question 4c, we created a new binary handedness variable called handed_binary that we can now use to separate two groups for our t-test.
# note that t.test syntax has your DV first, then your IV
# typically, the outcome variable comes first in R syntax, followed by whatever predicts it t_age_handed <- t.test(age ~ handed_binary, data = IntroSurvey) t_age_handed
Similarly to cor.test objects, if you save the output of t.test() into a variable, you can dollar-sign index into its sub-variables to access the estimated means, t-value, p-value, etc. from the test.
#t_age_handed$estimate
#t_age_handed$statistic
#t_age_handed$p.value
Next, you’ll create and inspect your own t-test objects. This time, let’s test for differences in scores by school affiliation (Columbia College/Barnard vs GS/SPS, for example). We will use the school_binary variable that we created earlier in question 4c. If you can’t remember how this variable was created, refer back to the code in that question.
Does regret score differ by school affiliation?
t_regret_total_school_binary <- t.test(regret_total ~ school_binary, data = IntroSurvey)
t_regret_total_school_binary
##
## Welch Two Sample t-test
##
## data: regret_total by school_binary
## t = -0.16547, df = 61.531, p-value = 0.8691
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7475580 0.6332722
## sample estimates:
## mean in group B_CC mean in group GS_SPS
## 4.500000 4.557143
Does tendency to maximize differ by school affiliation?
t_maxi_total_school_binary <- t.test(maxi_total ~ school_binary, data = IntroSurvey)
t_maxi_total_school_binary
##
## Welch Two Sample t-test
##
## data: maxi_total by school_binary
## t = 2.229, df = 48.581, p-value = 0.03047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0596064 1.1536120
## sample estimates:
## mean in group B_CC mean in group GS_SPS
## 4.825000 4.218391
It might be reasonable to expect that people who made their course decisions using a calculation-based mode are more likely to be maximizers than those who used an affect- or rule-based mode. So we should see if there’s any difference in maximizing between those who used calculation-based modes and those who didn’t.
(Note, though, that the causality implied by this hypothesis likely runs in the opposite direction: we think that someone with a tendency to maximize would be more likely to use a calculation-based mode. The proper test for cases where a continuous variable predicts a binary outcome is a logistic regression. To keep things statistically simple, we’ll lean into the sloppy stats we’re already using in this lab, and test instead whether use of a calculation-based decision mode predicts maximizing score.)
Remember, the original decision mode variable, dec_mode has 4 options (calculation, rule, role, and affect). To do a t-test, we need to use the binary-recoded version of this variable, dec_mode_binary. We created this variable earlier in question 4c, so if you can’t remember how this variable was created, refer back to the code in that question.
Now, please run a t-test just like above, with dec_mode_binary as the IV and total maximizing score as the DV.
t_maxi_total_dec_mode_binary <- t.test(maxi_total ~ dec_mode_binary, data = IntroSurvey)
t_maxi_total_dec_mode_binary
##
## Welch Two Sample t-test
##
## data: maxi_total by dec_mode_binary
## t = -0.83703, df = 35.95, p-value = 0.4081
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.8779561 0.3649932
## sample estimates:
## mean in group calc mean in group not-calc
## 4.402778 4.659259
Time to graph some results! ggplot2 is a package that helps you make prettier figures in R, using more straightforward syntax. We’ll use ggplot2 to make a scatterplot and a bar graph, but first let’s get oriented with how it works.
At the beginning of this lab, you installed the package ggplot2. You only have to do that once ever (unless you uninstall it or update the package). When working in RStudio Cloud you basically never have to do it, becuase our assignments have already installed it before you made your own copy of the project file.
We also loaded the ggplot2 library earlier in this lab, which allows us to quickly access all of its functions, with the library(ggplot2) command. You need to do that once every time you open a new R session.
ggplot2 allows you to format your figures in many cool ways. When working on one complete project, whether it be one homework assignment or one set of research analyses, it’s easiest to just define a “theme” one time, and then apply it to every figure in that set of analyses.
We’ll create a theme called bwtheme (for black & white), which is nice and simple. We’ll use some ggplot2 functions to define this theme and save it into a variable. This way, you can create the theme variable once and then use it over and over on multiple graphs. Run the code below to save our theme into a variable:
bwtheme <- theme_bw() +
theme(strip.background = element_rect(fill = 'white')) +
theme(plot.title = element_text(size = rel(1))) +
theme(axis.title = element_text(size = rel(1)))
When you plot data with ggplot2, you need to tell it 3 things:
After you tell ggplot2 one thing about what you want your plot to look like, if you put a + at the end of a line it knows that you have more information to give it about the same plot. It will then look for more code to add elements to the figure. Once it finds a plotting command that isn’t finished with a +, it will stop and produce the figure.
It’s best practice to add a new line after each separate layer of your plotting code, so that each layer’s code is easily readable on its own. You can add as many layers to your plotting command as you’d like as long as you end each line with a + (except the last line).
For example, to plot a scatterplot of age and birthyear, you would use:
ggplot(IntroSurvey, aes(x = birthyear, y = age)) +
geom_point(shape = 1, position=position_jitter(width=.05,height=.05)) +
bwtheme
## Warning: Removed 2 rows containing missing values (geom_point).
# position_jitter randomly shifts each point vertically and/or horizontally
# this is useful if you have lots of overlapping data so you can see how
# many points are in that one position
The ggplot() command defines the dataframe (IntroSurvey) and the “aesthetics” (the x and y variables you want to use). The geom_point() command tells ggplot2 you want to graph points (aka a scatter plot). shape = 1 tells ggplot2 to use open circles for each point. ggplot2 supports several different point shapes in scatter plots, each labeled by a different number. You can experiment with different shapes by changing the number, if you’d like.
One useful thing to add to a scatterplot is a linear regression line. (A correlation between x and y is the same as a linear regression with only one predictor variable.) We’ll do this with geom_smooth().
ggplot(IntroSurvey, aes(x = birthyear, y = age)) +
geom_point(shape = 1, position=position_jitter(width=.05,height=.05)) +
geom_smooth(method = lm) +
bwtheme
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
By default, geom_smooth() will graph the 95% confidence interval (CI) for your regression line. If you want to omit that line, set the se argument (stands for “standard error”) of geom_smooth() to be FALSE to turn the estimated standard error off, like this: geom_smooth(method = lm, se = FALSE)
Earlier, you found the correlation between regret and maximizing. Let’s graph that now. This time, we’ll add a title by adding an additional layer in our ggplot call, and we’ll also edit the x and y axis labels so that they’re a little more informative than our shorthand variable names.
ggplot(IntroSurvey, aes(x = maxi_total, y = regret_total)) +
geom_point(shape = 1) +
geom_smooth(method = lm) +
# labs() lets you modify various labels in a ggplot
labs(title = "Regret as a Function of Maximizing Tendency",
x = "tendency to maximize",
y = "tendency toward regret") +
bwtheme
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
Create one more figure that shows the relationship between two continuous variables (or any variables that aren’t categorical—scatterplots for ordinal variables are okay, too!). Make sure your figure has a title, and change the labels on each axis so that someone not familiar with our variable names would understand what the figure represents. You can copy and paste code from the previous sections, above, but remember to change the variable names as needed.
ggplot(IntroSurvey, aes(x = major, y = school)) +
geom_point(shape = 1, position=position_jitter(width=.05,height=.05)) +
geom_smooth(method = lm) +
bwtheme
## `geom_smooth()` using formula 'y ~ x'
Now let’s make a box plot, which is a clean way to illustrate the centers and spreads of multiple categories (e.g., the two groups you’re comparing with a t-test) . We’ll graph this in the same way as we made the scatter plot, but this time instead of using geom_point() to create a scatter plot we’ll use geom_boxplot() to generate a box plot.
ggplot(IntroSurvey, aes(x = handed, y = age)) +
geom_boxplot() +
bwtheme
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
geom_boxplot() shows you the median (middle line of the box), the 25th and 75th percentiles (the bottom and top edges of the box), and +- 1.5 inter-quartile ranges (IQRs) away from the median (the whiskers). Any points farther than 1.5 IQRs from the median are shown as individual points.
If you want to add the points over the boxplots (a good habit, so that your reader can see where the individual values fall, in addition to the summary statistics), you can easily do this by adding another layer on top of geom_boxplot() using something like geom_point(size = 3, alpha = .5). The alpha value is opacity (transparency), where larger fractions (closer to 1) are more opaque and lower fractions (closer to 0) are more transparent. Using transparency is another way to show how many data points are stacked right on top of each other, in addition to jittering the points’ locations like we did in the scatterplot above.
ggplot(IntroSurvey, aes(x = handed, y = age)) +
geom_boxplot() +
geom_point(size = 3, alpha = .3) +
bwtheme
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
If you want to change the order of the x-axis labels, or change their text, you can change these with various arguments within
scale_x_discrete(). There are several functions you can use to modify the behavior of x-axes and y-axes. In this case, because the x variable is categorical, scale_x_discrete() is the appropriate function to use.
ggplot(IntroSurvey, aes(x=handed, y=age)) +
geom_boxplot() +
geom_point(size = 3, alpha = .3) +
bwtheme +
scale_x_discrete(limits = c("L", "R", "A"), # limits re-orders the three bars
# labels changes text.
# Make sure these are specified in the correct order!
labels = c("Left", "Right", "Ambi"))
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
As mentioned above, if making the points transparent doesn’t solve the problem of telling overlapping points apart, you can “jitter” the location of each point horizontally by using
geom_jitter() in place of geom_point(). This adds a little bit of random noise to each point’s location, which makes it so that points which were on top of each other are now slightly offset.
Below, the geom_point() line has been commented out so it won’t run. R will skip over it and run the geom_jitter() line below it and keep going as usual. If you uncomment geom_point() by deleting the #, and instead comment out the geom_jitter() line by adding a # in front it, you’ll see your partially transparent points again.
Since both of these commands alter the way points are plotted by ggplot2, you should only use one at a time.
ggplot(IntroSurvey, aes(x=handed, y=age)) +
geom_boxplot() +
# geom_point(size = 3, alpha = .3) +
geom_jitter(position=position_jitter(width=.2, height=0)) +
# jittering width only preserves correct age
bwtheme +
scale_x_discrete(limits=c("L","R","A"),
labels=c("Left", "Right", "Ambi"))
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
Now, please create one new box plot that shows any of our continuous DVs as a function of one of our categorical variables. Again, make sure your figure has a title, and change the labels on each axis so that someone not familiar with our variable names would understand what the figure represents.
ggplot(IntroSurvey, aes(x = school, y = major)) +
geom_boxplot() +
bwtheme
When saving an R Markdown file, you can either save it as a text-based file so you can keep editing it and running the code inside later, or you can export it as a final document that anyone can read without opening RStudio. Exporting an R Markdown is called “knitting.” To knit your file, click the “Knit” button at the top of your Editor pane, and click “Knit to HTML” or “Knit to PDF.” Once the HTML/PDF document finishes exporting, it should save into the same directory as your original R Markdown file. Rename this HTML document with a new filename that includes your own name, e.g., R_Assignment_2_Jorge_Mallea.html.
To get credit for this assignment, upload your completed “knitted” HTML/PDF to the Assignments section of Canvas for UN1490.