This is what we call an R Markdown file. You can imagine it’s like a Word document, but we can run code inside of it, and it will update the output as we test the code. It can output as .docx or .pdf as well, which makes it an ideal file type for doing assignments.
It also looks relatively professional.
R Markdown is made up of two words – R and Markdown. R is our programming language.
Markdown is also a very basic language, but mainly used as a way to style text on websites. You might be used to this if you comment on Reddit, which also uses Markdown for its text entry.
To see what it might look like, let’s try:
Notice that I am adding enters between these. What would happen if we don’t?
White space is nice. We have room! Let’s be sure to use it.
We can also do things like bold, italics, and bolded italics.
We can also quote things. We won’t necessarily use this a lot, but it’s good to know.
Here.
See?
This can get crazy when we do things like:
Dr. C said that
Dr. K said…
Finally, we have lists, which could look like:
Now, with this in hand, you are an expert at making large posts on Reddit.
Before we really jump in, we’re going to need to install some packages. Packages are things people have voluntarily created for free. They do special things that R on its own can’t do – kind of like downloadable content for your statistics software. This is one of the benefits of things like R and Python – they’re constantly growing as people create ways to do things that they normally can’t do. For our class, we need to install a fair few, including:
install.packages("swirl")
install.packages("tidyverse")
install.packages("wesanderson")
install.packages("psych")
install.packages("RColorBrewer")
install.packages("ggdist")
install.packages("PupillometryR")
install.packages("harrypotter")
When we use these packages in class, we will call them to R by saying things like:
library(swirl)
We will also do our best to keep our code clear. To do this, we will try (with varying success) to also tell ourselves which package special functions come from, by doing things like:
psych::describeBy()
We’ll talk more about the :: much later.
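To make the package::function() pattern a little more concrete in the meantime, here is a minimal sketch. It uses sd() from the stats package, which ships with R, so nothing needs to be installed. The vector name scores is made up for illustration.

```r
# Hypothetical example vector (not from the class data)
scores <- c(2, 4, 4, 6)

# Calling the function by name alone works because stats is loaded by default
sd(scores)

# Writing package::function() says exactly which package sd() comes from
stats::sd(scores)
```

Both calls run the same function; the second one is simply more explicit about where it lives.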
It is important that we try and remain slightly organized in R. Your code wants to know where everything is. It can’t take the time to search your whole computer for a file. For this class, we will operate off of a folder we will create on our Desktop. To do this, first we want to point R to say “Hey, be sure to look at our Desktop”. Now, we can do this manually - the box on the right side has a Files section, and from there, we can click through our computers until we find our Desktop. Then, it says “More” with a dropdown list, and we can find an option that says “Set as Working Directory.” We could click that, and R will output the code that you would need to type (and then enters it) to do it yourself.
However, we are in a coding class! Coding means we do it live. To do that, we will use our first function, setwd(). We should have clicked it to see what that code looks like for you, but generally, Macs and PCs will differ on this one line of code.
setwd("~/Desktop") #For Macs
setwd("C:/Users/admin/Desktop") #For Windows; replace admin with your own username
Once we do this, we want to tell R to create a new folder that will house all of our class materials. Then we will update our Working Directory to this location, and will want to ensure we’re always pointing there from now on.
dir.create(
  file.path(getwd(), "JayTerm22"),
  showWarnings = FALSE
)
setwd(file.path(getwd(), "JayTerm22")) #Works on Macs and Windows; on a Mac this is the same as "~/Desktop/JayTerm22"
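If you want to confirm that a folder really was created, a sketch like the following can help. It uses a temporary folder from tempdir() as a stand-in for the Desktop, so it is safe to run anywhere; the name class_dir is just for illustration.

```r
# tempdir() stands in for the Desktop here (an assumption for this sketch)
class_dir <- file.path(tempdir(), "JayTerm22")

# Create the folder; showWarnings = FALSE keeps R quiet if it already exists
dir.create(class_dir, showWarnings = FALSE)

# dir.exists() returns TRUE when the folder is there
dir.exists(class_dir)
```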
Download our .csvs and put them in your folder as well.
id1 <- "1tJBy2zM8xJaAdCiPMpRXYN9-_Mu6n3Yv"
#JayTermGeomCol.csv is Becca's project, summarized in 2x2 form.
#4 obs of 15 variables.
#id2 <- "1A8vDIa2dE3FFZorg9r5gh6I_sOBJDwl_"
#JayTermGeomBox is Becca's project, already in long form.
#744 obs of 24 variables. The key variables here are TotHumanization,
#Condition, and prepost.
id3 <- "1UjQPW1BZJTCljE17MO3MKj7Am_FaGGxa"
#JayTermRaincloud is Delaney's data. Specifically, it's looking at when people are reading Democratic reasons for voting for Joe Biden. It's in the long form.
id4 <- "1vQ07DwitkWB6_lLY4x7acCXMf424fQiA"
#ReshapingData is Becca's data in wide form, prior to pivot_longer().
id5 <- "1HVHs0it2lwZO8_TiyGl2a9VDvMBRFo4e"
#WordToNum is Becca's data prior to most cleaning.
dt1 <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
write.csv(dt1, "JayTermGeomCol.csv")
#dt2 <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))
#write.csv(dt2, "JayTermGeomBox.csv")
dt3 <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id3))
write.csv(dt3, "JayTermRainCloud.csv")
dt4 <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id4))
write.csv(dt4, "ReshapingData.csv")
dt5 <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id5))
write.csv(dt5, "WordToNum.csv")
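The download lines above all repeat the same sprintf() pattern, where %s is swapped out for the file id. A small helper can make that pattern easier to see. This is only a sketch: drive_url is a made-up name, and "abc123" is a fake id used to show the result without downloading anything.

```r
# Hypothetical helper: build the download link for a given file id
drive_url <- function(id) {
  sprintf("https://docs.google.com/uc?id=%s&export=download", id)
}

# "abc123" is a made-up id used only to show the pattern
drive_url("abc123")
```

One related option: write.csv() has a row.names argument, and setting row.names = FALSE keeps R from writing the row numbers as an extra column in the saved file.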
## Inputting R Into Our Markdown File
As we learned in the Intro video, we can also quickly create the space to show our code using Control+Alt+I (Command+Option+I for our Mac users). You could also type it out: an opening line of three backticks followed by {r}, then your code, then a closing line of three backticks.
print("Hello World")
## [1] "Hello World"
The letter r says, "What comes next is going to be R code – do not read this as words, but instead, as code."
We can name coding chunks. This is helpful when we’re trying to be transparent and clear about what we want each coding chunk to tell us, and can organize our minds while we work. We will want to name our coding chunks throughout this class. You can do that by hitting space after the r and writing something else, like {r Print}.
print("Hello World")
## [1] "Hello World"
Now, in your own markdown file, nothing really should have changed. We should notice how at the bottom of our rmd that the header title (“Inputting R Into Our Markdown File”) shows up. And, if you were to click into your code above, you might notice how two things change - the orange hash becomes a green C (c for code), and the title now says Print.
There are other options we could include within our curly brackets, but for the rest of the class, we will not include them. You might have seen them in the beginning of creating a markdown file – one was called echo=FALSE and another include=FALSE. Let’s try our code above exactly again, but include the option echo=FALSE after a comma.
## [1] "Hello World"
Our first error! What does it say?
Error in parse_block(g[-1], g[1], params.src, markdown_mode) : Duplicate chunk label ‘Print’, which has been used for the chunk: print(“Hello World”) Calls:
… process_file -> split_file -> lapply -> FUN -> parse_block Execution halted
Can anyone guess what went wrong? Let’s fix it by calling it Print2.
## [1] "Hello World"
Okay, now we see how it would work. Now what did echo=FALSE do? Good. Now we had also noticed include=FALSE. What will that do? Let’s try it now, calling this block Print3.
What happened to this code this time? Great. Now, there is something else we’ve unintentionally learned by playing around with these options. In our first Print chunk, did we include echo= or include=? What happened when they were left out?
This shows us that R will make a lot of assumptions for us! That’s good - we don’t want to always be telling R what we want. Sometimes, we want R to just assume we want things the normal way, and we’ll tell it if we want to change things up.
Now, let’s transition towards thinking about what R is. R is a programming language, after all. We can always use our Console as our playground.
We’ve already learned one way to use R with the print() function. When we talk about functions, we’ll talk about them using their name (print) and the parentheses () to signify we mean a function, thus print().
print() took one input (we’ll call these arguments), and it took it in the form of quotes. Quotes signify text - characters. We can learn more about print if we type ?print or ?print() in the console. This is called a help file.
Compare this output:
print(x, …)
S3 method for class ‘factor’
print(x, quote = FALSE, max.levels = NULL,
width = getOption(“width”), …)
to the help file of, say, ?rnorm, which is a function that will draw a specified number of random numbers from a normal distribution.
Description
Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.
Usage
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
How many arguments do you think rnorm() takes? How many arguments does print() take?
Good! We also can look at rnorm() a little bit closer. A while back, we talked about how R treats things if we don’t specify them. Do you think rnorm() has default values? If so, which arguments have default values, and can you identify what those default values are?
Can anyone remind us what a mean is? Can anyone take a guess at what sd is, and what its definition is?
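If you would rather ask R directly than read the help file, the base functions args() and formals() will report a function’s arguments and their defaults. A minimal sketch:

```r
# args() prints a function's arguments along with any default values
args(rnorm)

# formals() returns the same information as a list we can index by name
formals(rnorm)$mean
formals(rnorm)$sd
```

An argument with no default, like n, is one we always have to supply ourselves.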
Why don’t we test to see if these default values are real. Let’s draw ten random numbers from the normal distribution. How might we do this?
If we ran rnorm(10) in our console right now, what would happen?
rnorm(n=10)
## [1] -0.3686503 -1.4506429 -0.8842406 -1.1916368 0.2692267 0.3090134
## [7] -0.1599275 0.8463873 0.2780706 0.4095050
rnorm(n=10)
## [1] -0.7575291 1.4926411 1.3373966 1.4959092 -0.9186650 0.4675853
## [7] -0.9256845 0.1838519 1.3093962 -1.5532860
Great. Now, you might be freaking out – your numbers are different than mine! Why do we think that is?
Well, to calm ourselves, let’s all get on the same randomness.
set.seed(131)
Okay. So, we now all should see the same numbers if we try again.
rnorm(n=10)
## [1] -0.81884660 -0.54384762 1.02089049 0.04670388 -0.45010286 -0.21150073
## [7] 1.10377374 0.05798225 -1.76945020 0.75282646
rnorm(n=10)
## [1] -0.7921836 0.3197308 0.2088026 1.4037762 0.7996981 -0.7110605
## [7] -1.5005960 -0.3298929 0.2430087 -0.4949339
While we don’t want to stray too far into math, you will notice that the second rnorm() is different than the first, even though we set a seed, and yet we all have the same twenty numbers. The easiest way to explain this is that set.seed(131) fixes one long, repeatable sequence of random numbers. The first time we called rnorm(), we took numbers 1-10 from that sequence; the second time, 11-20, and so on. If you wanted the second call to be the same as the first, you would need to set the seed again right before it (try this in your own spare time).
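As a quick check of that last point, the sketch below sets the same seed before each draw. The names first_draw and second_draw are made up for this example.

```r
# Set the seed, then draw ten numbers
set.seed(131)
first_draw <- rnorm(n = 10)

# Set the same seed again before drawing, so we start the sequence over
set.seed(131)
second_draw <- rnorm(n = 10)

# TRUE when both draws contain exactly the same numbers
identical(first_draw, second_draw)
```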
Okay, so we have this list of ten numbers. It might be tedious for us to remember all of these random numbers by heart, especially if they keep changing! Let’s store this, which we call a vector of numbers, into something, like x.
To assign variables, we can use either a single equal sign = or an arrow to the left <-. Your book points out we can also use an arrow to the right ->. We will always keep best practices, which means we will assign with an arrow to the left <-.
x <- rnorm(n=10)
Nothing happened! Uh oh! But wait - something has happened. Where did x go? Can anyone see it? Yes, x is in our environment now. We can call it out just by typing x into the Console.
x
## [1] 0.3469540 -0.7895139 0.5254704 -1.9481709 1.2787557 0.2055606
## [7] -0.8156215 -0.5689162 -1.1266178 0.6075258
You know, 10 is a little unfair. Would you think 10 draws is really comparable to the world of infinity? Let’s bump those rookie numbers up. Let’s overwrite x, and instead, make x 1,000 observations of rnorm().
x <- rnorm(n=1000)
Now, back at our help file for rnorm, does anyone remember what it said the default values were? Pull up the help file with ?rnorm if you have any questions. Well, let’s see how it worked with 1000 draws. We can use the function mean() and sd() to see the mean and standard deviation of x.
mean(x)
## [1] 0.01132934
sd(x)
## [1] 0.9912969
So, even in 1,000 draws, our mean still isn’t actually zero, but our standard deviation is definitely close to 1. It seems to work pretty well! If we were to change those default values, and save it as y, where we want 1000 more draws, a mean of 5, and a standard deviation of 2, how would you code that? After you do this, immediately show the mean and sd of y.
y <- rnorm(n=1000, mean=5, sd=2)
mean(y)
## [1] 4.836143
sd(y)
## [1] 1.931626
It is important that we are explicit in what we mean by each argument, because R cares about that stuff! If we don’t name our arguments and just throw things in various orders, things might change!
y2 <- rnorm(5, 1000, 2)
mean(y2)
## [1] 998.7306
sd(y2)
## [1] 1.230666
y3 <- rnorm(2,5,1000)
mean(y3)
## [1] 1006.172
sd(y3)
## [1] 148.0429
Great. There are a lot of good reasons to just always be as explicit as possible - first and foremost, you forget things! Things can get exceedingly complicated, and it’s important that we always name things so that others can understand our code too. Plus, if we name things, R will consider names before order. Try this now:
y4 <- rnorm(sd=2, n=1000, mean=5)
mean(y4)
## [1] 5.021291
sd(y4)
## [1] 2.01272
Perfect. Let’s clean up our environment. First, we want to know what is in our environment. We can use the ls() function to tell us what’s in our environment. We can also remove things from our environment by using the rm() function. Try rm(y) now.
ls()
## [1] "x" "y" "y2" "y3" "y4"
rm(y)
But, we can get meta in coding. What does ls() give us? Does the output remind us of anything we’ve done so far?
Yes, ls() gives us a vector of names in our environment. rm() is just looking for things to remove. What happens if we put a function ls() inside of another function, rm()? Try it now.
rm(ls())
## Error in rm(ls()): ... must contain names or character strings
Oooh, so close! But we hit an error. Let’s google to try and find a solution. What would be the first thing you would google? I’d probably want us to start with error rm(ls()), right? https://stackoverflow.com/questions/52711527/in-r-why-rmlist-ls-does-not-work-when-is-also-assignment
rm(list=ls())
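The reason list= fixes it is that this argument expects a vector of names as text, which is exactly what ls() returns. Here is a small sketch of the same idea with two throwaway objects (tmp_one and tmp_two are made-up names), so that we don’t wipe everything just to test it:

```r
# Two throwaway objects, named only for this example
tmp_one <- 1
tmp_two <- 2

# list = takes a character vector of object names, the same shape ls() gives us
rm(list = c("tmp_one", "tmp_two"))

# Both are gone now, so exists() returns FALSE for each
exists("tmp_one")
exists("tmp_two")
```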
Great. So, let’s get back to vectors. Let’s just make a vector of three numbers, and we’ll call it my_vect, of 1, 3, and 5. To have a list of numbers, we want to combine, or concatenate them, together, so we use c().
my_vect <- c(1,3,5)
Great job! We can do anything we would like to this. What do you think would happen if we added 100 to my_vect? Try it now without saving it as anything.
my_vect+100
## [1] 101 103 105
Let’s make a second shorter vector of just two observations – 2 and 4. We’ll call that my_short_vect. After you’ve made that, first think to yourself what would happen if we add my_vect and my_short_vect together. Remember my_vect should still be 1,3, and 5.
my_short_vect <- c(2,4)
my_vect+my_short_vect
## Warning in my_vect + my_short_vect: longer object length is not a multiple of
## shorter object length
## [1] 3 7 7
It threw a warning! But math wise, what happened? R still did its job, but it wanted us to know it feels like something went wrong.
What we see is that it recycled the 2 – it did (1+2) | (3+4) | (5+.. uh oh I guess 2 but I’ll let them know I’m confused). R will frequently do this.
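For comparison, here is a sketch of recycling when the longer vector’s length is an exact multiple of the shorter one. The names long_vect and short_vect are made up for this example. R recycles the shorter vector the same way, but this time it has nothing to complain about, so there is no warning.

```r
long_vect <- c(1, 3, 5, 7)
short_vect <- c(2, 4)

# R lines these up as (1+2) (3+4) (5+2) (7+4), reusing 2 and 4
long_vect + short_vect
```

This is worth keeping in mind: silent recycling is handy, but it can also hide a mistake when two vectors were supposed to be the same length.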
Now, we know what the second number in my_vect is. But what if you weren’t sure? We can ask R to tell us by asking for the second number in the vector by saying my_vect[2]. Try this now.
my_vect[2]
## [1] 3
Great! You can see that R has identified the second item in the vector. As your book points out, you can also say to remove things by using the negative operator. Try my_vect[-2], but before you hit enter, think about what you think it will show.
my_vect[-2]
## [1] 1 5
Finally, we could also call multiple things using c(). Instead of getting the first and third observation by removing the second, try to reason out how you would get the first and the third using 1, 3, and c().
my_vect[c(1,3)]
## [1] 1 5
Nice job. Let’s clean our environment again and now start to think about what data might look like. To do this, let’s create two new variables – the first, a list of names – Kevin, Robyn, Diane, and Pat, called firstnames. Remember, this is text, and what did we say about inputting text-based things into R? The second variable, we can call age, and we’ll give it four numbers, 29, 29, 68, and 64.
Now, we want to make this into a dataset. Data in R are called data.frames. You might see the term matrix - this is a similar grid of rows and columns, but every value in a matrix has to be the same type (all numbers, or all text), while each column of a data.frame can have its own type. We can make our small dataset by running data.frame(firstnames, age) and save it as data. Print out data by typing data at the end to see what you’ve created.
rm(list=ls())
firstnames <- c("Kevin", "Robyn", "Diane", "Pat")
age <- c(29, 29, 68, 64)
data <- data.frame(firstnames, age)
data
## firstnames age
## 1 Kevin 29
## 2 Robyn 29
## 3 Diane 68
## 4 Pat 64
We can also view this by using View() with a capital V. We can only do this one in the console though.
#View(data)
Okay! So, now the trick - What would happen if we did data[1]? What about data[1,]? data[,1]? What would data[2] give you? What would data[,2] give you?
Based on these, what is the first number in a [#,#] pair for data? And what is the second?
Try them out, and then figure the combination needed to get Diane’s name to appear.
data[1]
## firstnames
## 1 Kevin
## 2 Robyn
## 3 Diane
## 4 Pat
data[,1]
## [1] "Kevin" "Robyn" "Diane" "Pat"
data[1,]
## firstnames age
## 1 Kevin 29
data[2]
## age
## 1 29
## 2 29
## 3 68
## 4 64
data[2,]
## firstnames age
## 2 Robyn 29
data[,2]
## [1] 29 29 68 64
data[3,1]
## [1] "Diane"
Great. The best part about data.frames is this named portion of the dataframe. It can be really ugly and confusing to call things by where they show up in the data. But data.frames have names for a reason. Let’s look at what the names of the variables in our dataset are by using names() and passing our dataset into it.
names(data)
## [1] "firstnames" "age"
Great, as expected. You can imagine this is more important when dealing with data that has a lot of variables. Now, I said the benefit of using data.frames was the named columns feature. We can explicitly ask R about these columns by using the $ operator. The $ operator tells R to look inside of this previous object for something called this thing. For example, try data$age right now. Once you’ve done that, use R’s mean() function to take the mean of data$age.
data$age
## [1] 29 29 68 64
mean(data$age)
## [1] 47.5
Exactly! data$age outputs just like a vector – because that’s what the column is – a vector of things. And as such, we can apply what we’ve learned to take the mean of the thing.
Now, so far, we’ve been using <- to assign things, and we said we wanted to avoid using the equal sign. But we can evaluate things using math, like thinking about greater than or less than. Equal to is a double equal sign (==) which means equivalent to.
To show some basics, try 5>3, 4<2, and 131==131. We should also review AND or OR. AND means both things must be true to be evaluated as true. OR means either one or the other has to be true to be evaluated as true.
5>3
## [1] TRUE
4<2
## [1] FALSE
131==131
## [1] TRUE
((5>3) | (4<2))
## [1] TRUE
((5>3) & (4<2))
## [1] FALSE
You can see how R tells us if things are true or false in the statement. This is a critical part to thinking about data - computers are all just TRUE or FALSE machines. When [this thing] is true, do this. Else, do this. If it’s false, try this.
For example, what if we want to confirm if everyone in our data is over the age of 18? How might we type something that could take the data$age column and see if it is greater than 18?
data$age > 18
## [1] TRUE TRUE TRUE TRUE
Great! Well, my birthday is this month, and I’m pretty sure 30 is when my life is over. How many people have had their lives end (that is, they are older than 30?)
data$age > 30
## [1] FALSE FALSE TRUE TRUE
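That gives us an answer person by person. To get an actual count, we can lean on the fact that R treats TRUE as 1 and FALSE as 0 when doing math. A minimal sketch, using a stand-alone copy of the ages (example_age and over_30 are made-up names):

```r
# A stand-alone copy of the four ages from our dataset
example_age <- c(29, 29, 68, 64)

# One TRUE/FALSE per person
over_30 <- example_age > 30

# Adding up TRUEs counts them
sum(over_30)

# Averaging TRUEs gives the proportion of people over 30
mean(over_30)
```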
Great. What if we wanted to see if just the first row and the age column of data was greater than 30? Remember that age is our second column in the dataset.
data[1,2] > 30
## [1] FALSE
But again, we don’t like calling things by brackets. It can get messy. Things move.
So, let’s try to get this true evaluation again, but this time, let’s be explicit. I want to find out the age of the person whose firstnames variable is equal to “Kevin”. When we are doing equal to, remember we use ==.
data[data$firstnames=="Kevin",]$age
## [1] 29
Doing great. What if we want to know their identified gender? We don’t have that data in our dataset yet. We can make it, however.
In our data, let’s make a new column called Gender. For now, we will assign this new variable the value of "".
data$Gender <- ""
Now, above we evaluated a call to the row of Kevin and asked for his age. Imagine instead, we asked for his gender. Right now, it would come back with what? (""). Right. So, let’s tell R – give us his Gender, but overwrite it with this new thing, which we’ll call "Male" in quotes. Repeat this for all four observations, with Robyn and Diane being Female, and Pat and Kevin being Male.
data[data$firstnames=="Kevin",]$Gender <- "Male"
data[data$firstnames=="Robyn",]$Gender <- "Female"
data[data$firstnames=="Diane",]$Gender <- "Female"
data[data$firstnames=="Pat",]$Gender <- "Male"
You can also do things with AND or OR. We could make the following 2 lines of code by saying if their names are Kevin OR Pat, Male. To make sure we can see that something changed, let’s name this one all caps, so MALE. Repeat for FEMALE for Diane and Robyn.
data[(data$firstnames=="Kevin" | data$firstnames=="Pat"),]$Gender <- "MALE"
data[(data$firstnames=="Robyn" | data$firstnames=="Diane"),]$Gender <- "FEMALE"
There are other ways we could do the same thing. Way back at the beginning, we created our dataframe by providing data.frame two vectors of information. Now that we have a data.frame, we can just give it the full vector of information. You might realize why this can be a bit risky, but if not, let’s try it out first. Let’s call this one Gender2 to keep our eyes on what we’re doing.
data$Gender2 <- c("Male", "Female", "Female", "Male")
This is risky because we have to remember the exact order our names appeared! Otherwise, I might misgender a participant unintentionally.
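One way to lower that risk is to match on names instead of position. The sketch below is just one possible approach, not something we need for class: it builds a named vector (gender_lookup, a made-up name) and looks each person up by name, so the order we type things in no longer matters.

```r
# A stand-alone copy of the names, in the same order as our dataset
people <- c("Kevin", "Robyn", "Diane", "Pat")

# A named vector works like a lookup table; note the order here is scrambled on purpose
gender_lookup <- c(Pat = "Male", Diane = "Female", Kevin = "Male", Robyn = "Female")

# Indexing by name pulls each person's value; unname() drops the labels afterwards
matched <- unname(gender_lookup[people])
matched
```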
It’s also important to note we could call the age column by passing that through the [], but if we do this, the names of the columns must be indicated by quotation marks.
data[age]
#Will throw an error
#Error in `[.data.frame`(data, age) : undefined columns selected
data["age"]
## age
## 1 29
## 2 29
## 3 68
## 4 64
While this data is very small, it still might be the case that we only want certain variables. Say, for instance, we just want to have a data set of the columns firstnames and age. We can do that using the subset() function from base R. Checking the documentation, they give various examples, including things like:
subset(airquality, Day == 1, select = -Temp)
And we notice how with that - sign, it means NOT this. We can do the same thing to explicitly tell R which columns to include, namely, age and firstnames. Since there are multiple, we will want to concatenate these things together.
subset(data, select= c(age, firstnames))
## age firstnames
## 1 29 Kevin
## 2 29 Robyn
## 3 68 Diane
## 4 64 Pat
Also notice here that we aren’t passing data$age. You always want to tell R what the dataset is called. Many times, you’ll find yourself calling it explicitly like data$age. However, here, the subset() function takes data as an argument. Since you’ve already told it what dataset to look at, we don’t need to tell it again, and therefore, we can skip the data$age statement.
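For comparison, here is a sketch of the same column selection done two ways: with subset(), and with the square brackets and quoted column names we saw earlier. It uses a tiny stand-alone dataset (small_data is a made-up name) so it can be run on its own.

```r
# A small stand-alone dataset for this example
small_data <- data.frame(
  firstnames = c("Kevin", "Robyn"),
  age = c(29, 29),
  Gender = c("Male", "Female")
)

# subset() lets us write column names without quotes
by_subset <- subset(small_data, select = c(age, firstnames))
by_subset

# Square brackets need the column names in quotes
by_bracket <- small_data[, c("age", "firstnames")]
by_bracket
```

Both give the age and firstnames columns, in that order.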
What happens a lot of the time though, is that participants don’t fill out everything. Imagine that we weren’t sure of Kevin’s gender - he didn’t report anything. So, instead of "Male", our data[data$firstnames=="Kevin",]$Gender should be set to NA. You’ll notice it’s correct when, in our code block, NA turns blue.
data[data$firstnames=="Kevin",]$Gender <- NA
NA is R’s version of missing. A bunch of things will give us NA right back if a value is missing, such as…
NA>0
## [1] NA
NA<0
## [1] NA
NA==1
## [1] NA
mean(c(1,2,3,NA))
## [1] NA
So, let’s talk more about what NA does. We said it was missing. This seems to be a big problem when we’re trying to calculate something as simple as a mean. Someone pull up the help file for mean using ?mean and let’s see if there’s a way to address this problem. What is the option? What is the default setting for this option?
mean(c(1,2,3,NA), na.rm=TRUE)
## [1] 2
That’s better! Great. So, sometimes, if we face NA, we might want to first check if there’s an option to deal with it. For basic mathematical operations, we will find out that na.rm=TRUE is usually a good one to just throw in there to be safe.
But what if we’re looking at our data, and we want to know if there is a missing age in there? We can ask R whether NA exists in our vector of c(1,2,3,NA). Let’s practice googling. What might we want to google if we want to check if NA exists?
[how to check if na in r]
is.na(c(1,2,3,NA))
## [1] FALSE FALSE FALSE TRUE
Good! And what if we wanted to check whether there was an NA in the age column of our dataset?
is.na(data$age)
## [1] FALSE FALSE FALSE FALSE
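Reading a row of TRUE and FALSE is fine for four values, but it gets hard with hundreds. Here is a short sketch of three base R ways to summarize missingness, using a stand-alone vector (values is a made-up name):

```r
values <- c(1, 2, 3, NA)

# Is there at least one NA anywhere?
anyNA(values)

# How many NAs are there? (TRUE counts as 1)
sum(is.na(values))

# Which positions hold an NA?
which(is.na(values))
```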
Great! Let’s add a row to our dataset.
First, let us create a new observation vector, called newobs. It will be a concatenation of four inputs, since we have 4 variables in our dataset. The four values will be "Remove", NA, "Male", and "MALE".
We have a function called rbind(), or ROW bind. You could guess on your own what a COLUMN bind function might be called. rbind() takes two arguments - the two things we want to bind together. As they are binding on the rows, we can think we want to just slap on more data below.
We will update our variable data to bind on the rows. Our first argument wants to be data - that is, what we already have. The second argument will be that new row we created, newobs.
newobs <- c("Remove", NA, "Male", "MALE")
data <- rbind(data, newobs)
Let us check what has happened by running is.na(data$age) again.
is.na(data$age)
## [1] FALSE FALSE FALSE FALSE TRUE
Ah, something is different! R has noticed we added that missing value in and says that fifth observation is true. What if we wanted R to tell us “Tell us if it is true that these things are NOT NA?” We can do this using the exclamation mark. Re-run your is.na above, but this time, add an exclamation mark.
!is.na(data$age)
## [1] TRUE TRUE TRUE TRUE FALSE
Now what has happened? We’ve discovered that NOT NA is true for the first four, but false for the fifth. This is really great stuff.
Now, let us imagine we want to look at our data. Inside of our data, we want to look at data$age. And we want to only include rows for which NOT NA is true. How might we break down this problem?
data[!is.na(data$age),]
## firstnames age Gender Gender2
## 1 Kevin 29 <NA> Male
## 2 Robyn 29 FEMALE Female
## 3 Diane 68 FEMALE Female
## 4 Pat 64 MALE Male
Great! Don’t forget to save it to data again.
data <- data[!is.na(data$age),]
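One side effect of binding a row this way is worth knowing about. A vector made with c() can only hold one type, so a row containing text is all text, and binding it onto a data.frame can turn a numeric column like age into text as well. The sketch below shows this on a tiny stand-alone dataset (people, as_text, and as_number are made-up names), along with one possible alternative: binding a one-row data.frame instead.

```r
# A small stand-alone dataset for this example
people <- data.frame(firstnames = c("Kevin", "Robyn"), age = c(29, 29))

# Binding a character vector: the age column becomes text too
as_text <- rbind(people, c("Remove", NA))
class(as_text$age)

# Binding a one-row data.frame lets each column keep its own type
as_number <- rbind(people, data.frame(firstnames = "Remove", age = NA))
class(as_number$age)
```

If mean() ever complains that a column is not numeric after a step like this, class() is a good first thing to check.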
Let’s clean up our workspace again by rm(list=ls()).
rm(list=ls())
We are going to load some data into our environment. R has a lot of datasets available to us to play with. One of the ones you were exposed to in your homework was diamonds, a dataset that gives a lot of information about various diamonds and their carats, cuts, color, prices, and more. We will rarely use these moving forward, since it can be hard to wrap our minds around visualizing diamonds, compared to say, visualizing people. Still, there are clear generalizations we can make once we get the hang of one to translate into the other.
To do this, let us first load the library ggplot2. After that, create a new variable called data, and just write the word diamonds (no quotes!). R will understand. To ensure it does understand, we can use the function head() to give us the first six rows of the data set. We could also use tail() to give us the last six rows of the data set.
library(ggplot2)
data <- diamonds
head(data)
## # A tibble: 6 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
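Six rows is just the default. Both head() and tail() take an n argument if we want more or fewer. The sketch below uses mtcars, a small dataset that comes with base R, so it runs without loading any packages.

```r
# With no n given, head() returns six rows
nrow(head(mtcars))

# Ask for just the first three rows
head(mtcars, n = 3)

# Or the last three rows
tail(mtcars, n = 3)
```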
We said we aren’t going to do a lot of statistics, and I keep to that promise. However, it is important to think about the kinds of data we could import and play with in R. We already have talked about one type - nominal data. Does anyone know what nominal data means, or can give an example of nominal data?
Yes, nominal can be understood as names. It is a type of data that is not quantitative (that means, it’s not a number), and it has no inherent order or ranking in it. Male and Female have no inherent number or ranking to them. When we were playing with nominal data, we used "" to signify it was text.
Ordinal data is similar - it’s text based. But what does the name ordinal make you think of?
Hopefully you said order. Yes, ordinal data is text-based data that has some inherent order to it. You might have noticed when we called head() that it said some things under the column names, including <ord>.
If we want to ask R what kind of variable cut is, we can do that by asking R for the class of the variable.
class(data$cut)
## [1] "ordered" "factor"
It says it’s an ordered (ordinal) factor. Factor is another way to say groups. If you hear factor, you should think group. W&J and Allegheny are two groups. Seniors vs. Juniors. Males vs. Females. If you can break them into groups, they can be considered a factor. And in each of those examples, you can see how there are different groups one could belong to. We call these parts of a factor a level. Factors have different levels.
There are other ways we can learn more about this variable. Say, for instance, we want to know what kind of levels are in this variable. We can call levels() to tell us more, and then we can even call something like table() to tell us how many show up in each.
levels(data$cut)
## [1] "Fair" "Good" "Very Good" "Premium" "Ideal"
table(data$cut)
##
## Fair Good Very Good Premium Ideal
## 1610 4906 12082 13791 21551
Finally, we can also use factor(data$cut) to show us the list again. factor() even shows the ordering of the factors, something we didn’t get from levels() or table() (though note, table() did provide them in order). Ordering of factors is something R loves to misbehave with, and it’s something we want to learn about.
factor(data$cut)
Let’s say we actually want to order everything in the reverse order – that is, Ideal first, then Premium, then Very Good, then Good, then Fair. (Note: We actually always want to order things from least <——> greatest, as diamonds currently is. Always make sure everything you do goes from worst <—–> best!). Still, how could we do this?
We should pull up the help file of ?factor.
It looks like there’s an option called levels. It says:
an optional vector of the unique values (as character strings) that x might have taken. The default is the unique set of values taken by as.character(x), sorted into increasing order of x.
So, the help file even tells us that we want to pass through character strings! Good to know. We can scroll to the bottom to skim the examples and see that it really looks straight forward:
xf <- factor(x, levels = c("Male", "Man", "Lady", "Female"),
So, let’s try that out, and try and re-order the cut factors into the descending order. We won’t save it as data$cut, since we actually like it the way it is, but at least we can see how it would work. Do this in your Console instead of in Markdown.
factor(data$cut, levels=c("Ideal", "Premium", "Very Good", "Good", "Fair"))
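To see what re-leveling does and does not change, here is a sketch on a tiny made-up factor (sizes and reversed are hypothetical names). It also uses rev(), a base R function that reverses a vector, as a shortcut so we don’t have to type every level out by hand.

```r
# A small ordered factor, from smallest to largest
sizes <- factor(
  c("Small", "Large", "Medium", "Small"),
  levels = c("Small", "Medium", "Large"),
  ordered = TRUE
)
levels(sizes)

# Reverse the order of the levels without typing them out
reversed <- factor(sizes, levels = rev(levels(sizes)))
levels(reversed)

# The values themselves have not changed, only the order of the levels
as.character(reversed)
```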
The other ones we noticed during head() were <dbl> and <int>.
Yes! You might have noticed that
To show this, let’s create two new variables in our dataset. The first we will call halfprice. It should be assigned the value of the price column divided by two. The second we will call roundeddepth, which will be depth rounded to the 0th decimal. We will need to use round() for this one. Look up the help file if you are confused.
data$halfprice <- data$price/2
data$roundeddepth <- round(data$depth, 0)
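To see what these two lines are doing without scrolling through 53,940 rows, here is the same idea on a tiny data frame. The data frame toy and its three rows are invented for illustration.

```r
# toy is a hypothetical miniature of the diamonds data (made-up rows)
toy <- data.frame(
  price = c(326, 2757, 18823),
  depth = c(61.2, 62.8, 59.7)
)

# Arithmetic on a column is applied to every row at once
toy$halfprice <- toy$price / 2

# The second argument of round() is the number of decimal places to keep
toy$roundeddepth <- round(toy$depth, 0)

toy
round(toy$depth, 1)  # keeping one decimal place instead of zero
```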
Colors in R can be a lot of things. As a wanna-be movie buff, I really like the filmmaker Wes Anderson, even though I’ve only seen a few of his films. Still - the ones I have seen, I like a lot. Thankfully, there’s a package called wesanderson that provides color palettes for most of his movies. We can look at the palette choices by checking the help file for the package’s function with ?wes_palette.
One of my favorite films is Moonrise Kingdom. Let’s try wes_palette("Moonrise3") and save it as palette, and then just call palette on its own.
library(wesanderson)
palette <- wes_palette("Moonrise3")
palette
Cool! There are five colors in Moonrise3. You could try others and realize there are fewer than 5 in some (Moonrise1, for instance). This is good information for us to know. Now, one question we might have is how R understands what just happened. We can play around with this to find out some more general information. We learned that adding [] to the end of things we play with will have R tell us more about what’s going on underneath. We also could glance at our Environment to learn a bit about it: it is something called a palette, it says something about characters, and then it says there is a vector of 1 through 5.
Well, we learned about vectors! We have even learned how to call, say, the 2nd item in a vector. Do you remember how?
palette[2]
## [1] "#F4B5BD"
Awesome! Does anyone know what this is? This is one way computers communicate colors to us.
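If the answer is not obvious, this string is a hex color code: after the #, each pair of characters is a number written in base 16 for the amount of red, green, and blue. A quick sketch with base R shows the decoding; the variable name hex is just for illustration.

```r
hex <- "#F4B5BD"

# col2rgb() splits a color into its red, green, and blue amounts (0 to 255)
col2rgb(hex)

# The same numbers by hand: F4, B5, and BD read as base-16 numbers
strtoi(c("F4", "B5", "BD"), base = 16L)
```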
Now, there could come a time where we might want MORE than 5 colors. This is the magic of R: given a set of colors, R can blend between them to create as many in-between colors as we ask for. We can do this through the function colorRampPalette(). This is a more advanced function, because it doesn’t return colors on its own. It takes a vector of colors, and whatever it returns needs to be saved under its own name. If we were to look at the help for colorRampPalette(), it says:
These functions return functions
Talk about meta. We’ve been used to functions returning specific things. But still, let’s try it out. We have our function, colorRampPalette(), and we will pass through it our wes_palette("Moonrise3"). Let’s save this as mr3.
mr3 <- colorRampPalette(wes_palette("Moonrise3"))
Now, as the help file said, it returns a function. So now in our environment we see we have a new section - Functions, and mr3 is a function! It also says it simply takes as input a single thing - an n. Based on what our goal for this function was, what do you think will happen if we give it a number of 5? What about a number of 10? Compare the 5 with wes_palette("Moonrise3").
mr3(5)==wes_palette("Moonrise3")
## [1] TRUE TRUE TRUE TRUE TRUE
mr3(5)
## [1] "#85D4E3" "#F4B5BD" "#9C964A" "#CDC08C" "#FAD77B"
wes_palette("Moonrise3")
mr3(10)
## [1] "#85D4E3" "#B6C6D2" "#E7B8C1" "#D6AA96" "#AF9C63" "#A69F58" "#BCB276"
## [8] "#D2C28A" "#E5CC82" "#FAD77B"
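The "function that returns a function" idea can feel slippery, so here is a smaller sketch that does not need the wesanderson package. It uses two of the hex codes we just saw; the names two_colors and ramp are made up for this example.

```r
# Two of the Moonrise3 hex codes from the output above
two_colors <- c("#85D4E3", "#FAD77B")

# colorRampPalette() does not return colors; it returns a function
ramp <- colorRampPalette(two_colors)
is.function(ramp)

# Calling that function with a number gives that many colors
ramp(2)  # asking for 2 just hands back the two endpoints
ramp(7)  # asking for 7 fills in five blended colors between them
```

Notice that the first and last colors never change; only the number of steps between them does.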
There are a variety of color themes provided in R. You can learn more about them on your own, but a good starting place is the RColorBrewer package.
library(RColorBrewer)
display.brewer.all()
We’re going to move on to thinking about plotting. Plotting - how we visualize data - is an extremely important skill! R is very powerful in its plotting capabilities. Your homework reviews three main ways to plot. These are very useful to review, and can help you practice and think through visualization in many ways. However, we are going to stick to learning about ggplot2 for in-class work. This is due to two main reasons:
ggplot2 is just better.
It’s also both the trickiest and the simplest to understand.
So, let us pull up the help file for ggplot. If we scroll a bit, we see it says:
## ggplot()
ggplot(df, aes(x, y, other aesthetics))
Let’s try that. df is the data frame, which for us is data. The second argument is aes(), which stands for aesthetics. aes() looks like it wants an x and a y (and others). Let’s give it an x of carat and a y of price.
library(ggplot2)
ggplot2::ggplot(data=data, aes(x=carat, y=price))
Uh oh. Nothing showed up! Well, something did. Let’s keep scrolling down the help file. At one point it says:
The summary data frame ds is used to plot larger red points on top of the raw data. Note that we don’t need to supply data or mapping in each layer because the defaults from ggplot() are used.
ggplot(df, aes(gp, y)) +
geom_point()
This gives us two important lessons in a small example. Can anyone notice anything? What is going on?
Yes, in this next step, we’re adding something. We want to think about plotting with ggplot as just continually adding layers. The second thing it adds is something called geom_point. Can we guess what geom_point() might add? Let’s try it out.
ggplot2::ggplot(data=data, aes(x=carat, y=price)) +
geom_point()
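One way to see the "adding layers" idea directly is to save the plot under a name and count its layers before and after the +. This is only a sketch: it uses the diamonds data that ships with ggplot2, and the names base_plot and with_points are made up.

```r
library(ggplot2)

# The first line alone sets up the data and axes but draws nothing
base_plot <- ggplot(diamonds, aes(x = carat, y = price))
length(base_plot$layers)

# Adding geom_point() with + puts one layer on top of that empty canvas
with_points <- base_plot + geom_point()
length(with_points$layers)

# Typing the name of a saved plot draws it
with_points
```

This is why the first attempt showed an empty grid: the canvas existed, but it had no layers yet.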
Good! There are a lot of options even within geom_point though! For ggplot, it’s always best to just scroll to the bottom of the help file and look at the examples. One of the examples mentions something about changing the shape, can anyone see that? Where is it?
Good. We know that cut is a factor, and it seems to want factors. Let’s pass cut as a factor just like the example has it. But since we already know cut is a factor, we don’t need to explicitly tell R that.
ggplot2::ggplot(data=data, aes(x=carat, y=price)) +
geom_point(aes(shape=cut))
## Warning: Using shapes for an ordinal variable is not advised
And interestingly enough - R even gives us a graphical warning! It’s angry with us! What does it say? Why does it say that? Since we’re just learning various things about ggplot, we’re just going to continue to ignore that warning, but it’s a good lesson for us.
The great thing about ggplot is that the functions to call things are relatively straightforward. For example, say we want to change the x axis label. Let’s write down some guesses as to what that function might be called.
Great job! Yes, exactly, xlab(). Now, can we also guess then what the y-axis label might be called?
Great. With this in hand, let’s rename the x axis “Carat of diamond” and the y axis “Price in USD”
Now, as we go through this, I would highly recommend you switch to either a Script, or keep using the up arrow. It’s less for you to write.
ggplot2::ggplot(data=data, aes(x=carat, y=price)) +
geom_point(aes(shape=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")
## Warning: Using shapes for an ordinal variable is not advised
Now, the help file for ggplot said that it takes other aes() inputs besides x and y. We know our data, and we know that there were various levels of cuts of the diamonds. Right now, we can’t really visualize what that looks like. We might want to make the graph have a different color based on the cut of the diamond.
ggplot2::ggplot(data=data, aes(x=carat, y=price, color=cut)) +
geom_point(aes(shape=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")
## Warning: Using shapes for an ordinal variable is not advised
Great job! But maybe we don’t like the colors that R has provided for us. We have some scale, which has various colors, and we would like to manually tell R what colors we want. What would you suggest is the name of the function for telling a scale what colors it should be manually?
Let’s look at its help file.
ggplot2::ggplot(data=data, aes(x=carat, y=price, color=cut)) +
geom_point(aes(shape=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_manual(values=c("red", "blue", "green", "orange", "yellow"))
## Warning: Using shapes for an ordinal variable is not advised
scale_color_manual can take a lot of things. Back in the Colors section, we learned about wes_palette(). We also learned that Moonrise3 from wes_palette() had 5 distinct colors. This plot also has five distinct colors. Instead of naming all five colors, what could you do to manually tell the scale to use the colors from Moonrise3? Don’t forget to load wesanderson!
ggplot2::ggplot(data=data, aes(x=carat, y=price, color=cut)) +
geom_point(aes(shape=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_manual(values=wes_palette("Moonrise3"))
## Warning: Using shapes for an ordinal variable is not advised
ggplot2 can also handle variation based on continuous data input. Let’s delete the shape option from geom_point since ggplot is so angry with us. Let us also change the color option to be price. If we keep scale_color_manual() in, we’ll get an error. What does the error say? Let’s check the documentation for scale_color_manual(). It indeed says this discrete word! Okay, time to google again! What do you want to google to figure out how we can ask ggplot to give colors of a continuous variable?
ggplot2 colors continuous scale
https://ggplot2.tidyverse.org/reference/scale_gradient.html
Scrolling around, looks like scale_color_gradientn() will do the trick!
ggplot2::ggplot(data=data, aes(x=carat, y=price, color=price)) +
geom_point()+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_gradientn(colors=wes_palette("Moonrise3"))
geom_smooth() gets a bit more mathy than we would like, but what we can do with it is draw lines of best fit for our data. If we scroll through its help file, we can see that one way to use it is to pass no arguments. We can try that here.
ggplot2::ggplot(data=data, aes(x=carat, y=price, color=cut)) +
geom_point(aes(shape=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_manual(values=wes_palette("Moonrise3"))+
geom_smooth()
## Warning: Using shapes for an ordinal variable is not advised
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Notice how it drew five different lines. Why did it do that? What would you suggest we change if we only wanted a single line? What would we have to move, and to where?
ggplot2::ggplot(data=data, aes(x=carat, y=price)) +
geom_point(aes(shape=cut, color=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_manual(values=wes_palette("Moonrise3"))+
geom_smooth()
## Warning: Using shapes for an ordinal variable is not advised
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
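If it is not clear why moving color=cut made the difference, we can ask ggplot how many groups the smoothing layer ended up with. This is a rough sketch on the built-in diamonds data. It assumes method = "lm" instead of the default smoother, purely so it runs quickly, and the names smooth_layer, p_global, and p_local are made up.

```r
library(ggplot2)

# A straight-line smoother; "lm" is swapped in here only for speed
smooth_layer <- geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# color = cut in ggplot(): every layer inherits it, including the smoother
p_global <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  smooth_layer

# color = cut only in geom_point(): the smoother never hears about cut
p_local <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  smooth_layer

# layer_data(plot, 2) returns the data ggplot computed for the second layer
length(unique(layer_data(p_global, 2)$group))  # one group per cut
length(unique(layer_data(p_local, 2)$group))   # a single group
```

Anything placed in the aes() of ggplot() is shared by every layer, while anything placed in the aes() of one geom stays with that geom.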
But geom_smooth() also has a lot of other options. We note one that says method=lm and another that says se=FALSE. lm means linear model. Linear sounds like… lines, yes. These currently aren’t so much lines as they are squiggly lines. Let’s try both options at once.
ggplot2::ggplot(data=data, aes(x=carat, y=price)) +
geom_point(aes(shape=cut, color=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_manual(values=wes_palette("Moonrise3"))+
geom_smooth(method=lm, se=FALSE)
## Warning: Using shapes for an ordinal variable is not advised
## `geom_smooth()` using formula 'y ~ x'
## geom_hline() and geom_vline()
We could also add horizontal and vertical lines to things if we wanted to emphasize a certain cut off using geom_hline() and geom_vline(). Does anyone want to take a guess which one would be the horizontal line? Good. If we wanted to draw a horizontal line, that is, a line that goes left to right, what would we need to tell R? What if we wanted to draw a line up and down?
ggplot2::ggplot(data=data, aes(x=carat, y=price)) +
geom_point(aes(shape=cut, color=cut))+
xlab("Carat of diamond")+
ylab("Price in USD")+
scale_color_manual(values=wes_palette("Moonrise3"))+
geom_smooth(method=lm, se=FALSE)+
geom_hline(yintercept = 3)+
geom_vline(xintercept = 3)
## Warning: Using shapes for an ordinal variable is not advised
## `geom_smooth()` using formula 'y ~ x'
The bar plot is our classic plot in the social sciences. In order to create it, we need to load in some data. On Sakai, there should be a data file that we should download and save in our working directory. This data will look a lot different than our last dataset, which had 53,940 observations. Type out head(data) and see what is inside.
rm(list=ls())
data <- read.csv("JayTermGeomCol.csv")
head(data)
##   X         Condition prepost n_TotHuman mean_TotHuman sd_TotHuman se_TotHuman
## 1 1    Strongly Agree     Pre        194      3.729811   0.6704774  0.04813747
## 2 2    Strongly Agree    Post        194      3.887457   0.5945494  0.04268616
## 3 3 Strongly Disagree     Pre        178      3.889045   0.6330831  0.04745159
## 4 4 Strongly Disagree    Post        177      4.077213   0.5651957  0.04248271
##   n_HumanUni mean_HumanUni sd_HumanUni se_HumanUni n_HumanNat mean_HumanNat
## 1        194      3.776632   0.8657384  0.06215639        194      3.743127
## 2        194      4.121993   0.9460141  0.06791985        194      4.072165
## 3        178      4.089888   0.8172866  0.06125822        178      4.083333
## 4        177      4.393597   0.8869902  0.06667027        177      4.372881
##   sd_HumanNat se_HumanNat
## 1   0.9505755  0.06824735
## 2   1.0389388  0.07459146
## 3   0.9170304  0.06873433
## 4   1.0569361  0.07944418
Great! So, in this case, it looks like we have what we would call summary statistics for a larger dataset. This was a study a student did with me recently. In the study, we measured people’s humanization of the opposing political party before and after some treatment we tried to apply to them. Can you identify that variable?
Good. We also randomly assigned you to interact with a person of the opposing political party that either confirmed your beliefs about them (they strongly agreed) or they did not confirm your beliefs about them (they strongly disagreed!). Can you find that variable?
What would we call these variables? Good. We also see a bunch of means, standard deviations, and standard errors.
Let’s try and visualize what’s going on here. One variable is time-based – pre and post. One variable is group-based, Condition. And one variable – we’ll look at total humanization – is our dependent variable. Based on all of this, which variable would you put on the X axis, which would you put on the Y axis, and which would you potentially break up by color, shape, or some other differentiation? Set up our first line of ggplot.
ggplot(data, aes(x=prepost, y=mean_TotHuman, color=Condition))
Good! We might at this point notice something wrong – Post comes before Pre! Uh oh! Why do we think that is? Well, go back to thinking about how R handles factors – we said it likes to put things in order. Well, it recognized prepost is a factor, and it tried to give it an order. Can anyone figure out what order R thinks of when not told? ABC! Yes.
We can fix that now so it’s not a problem moving forward. Does anyone remember how to change the levels of a factor? We practiced by reversing the order of the cut of diamonds.
data$prepost <- factor(data$prepost, levels=c("Pre", "Post"))
ggplot(data, aes(x=prepost, y=mean_TotHuman, color=Condition))+
geom_col()
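To see the alphabetical default for ourselves, here is a tiny sketch with a made-up vector, prepost_toy, standing in for data$prepost.

```r
# prepost_toy is a hypothetical stand-in for data$prepost
prepost_toy <- c("Pre", "Post", "Pre", "Post")

# Left alone, factor() sorts the levels alphabetically: "Po" sorts before "Pr"
levels(factor(prepost_toy))

# Naming the levels ourselves puts them in the order we actually mean
levels(factor(prepost_toy, levels = c("Pre", "Post")))
```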
Okay, so a start… Clearly, this isn’t the prettiest thing we’ve seen. It seems like the bars look a little… can anyone describe them? Stacked! Yes. Stacked. Where do we see this option in the help file for ?geom_col ? position = “stack”. How do we google “geom_col stacked don’t want”?
https://stackoverflow.com/questions/52385261/ggplot-prevent-bar-chart-stacking
Let’s try their solution.
ggplot(data, aes(x=prepost, y=mean_TotHuman, color=Condition))+
geom_col(position="dodge")
Great! Now we have another problem – those colors! What has happened!? Where do we see the color? Okay, but what is going on on the inside? What would you call this? Indeed!
ggplot(data, aes(x=prepost, y=mean_TotHuman, fill=Condition))+
geom_col(position="dodge")
Great. Well, there’s other information in our dataset that we haven’t visualized yet. Does anyone know what typically also shows up in these kinds of plots? Error bars! Yes, indeed. ?geom_errorbar is our function of choice here. Now on the help file we can see it wants things called ymin and ymax or xmin and xmax. Since this isn’t stats, we’ll just accept that ymin needs to be the mean minus the standard error, and ymax needs to be the mean plus the standard error. If you try too quickly, you’ll notice we have the same problem as before!
ggplot(data, aes(x=prepost, y=mean_TotHuman, fill=Condition))+
geom_col(position="dodge")+
geom_errorbar(aes(ymin=mean_TotHuman-se_TotHuman,
ymax=mean_TotHuman+se_TotHuman),
position="dodge")
The position of the bars looks wrong. Let’s again try googling geom_errorbar position dodge to figure out what might be better.
It also still looks relatively bizarre. What if we change the width of the error bars?
ggplot(data, aes(x=prepost, y=mean_TotHuman, fill=Condition))+
geom_col(position="dodge")+
geom_errorbar(aes(ymin=mean_TotHuman-se_TotHuman,
ymax=mean_TotHuman+se_TotHuman),
position=position_dodge(.9),
width=.3)
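Why .9? As far as I can tell, it works because bars are drawn 0.9 wide by default, so dodging the error bars by the same amount lands each one in the middle of its bar, no matter how narrow we make the error bar itself. Here is a sketch with an invented four-row summary table (toy, mean_score, and se_score are made-up names and numbers) that checks the centers line up.

```r
library(ggplot2)

# toy is a hypothetical summary table shaped like our data
toy <- data.frame(
  prepost = factor(rep(c("Pre", "Post"), times = 2), levels = c("Pre", "Post")),
  Condition = rep(c("Strongly Agree", "Strongly Disagree"), each = 2),
  mean_score = c(3.7, 3.9, 3.9, 4.1),
  se_score = c(0.05, 0.04, 0.05, 0.04)
)

p <- ggplot(toy, aes(x = prepost, y = mean_score, fill = Condition)) +
  geom_col(position = position_dodge(width = 0.9)) +
  geom_errorbar(aes(ymin = mean_score - se_score,
                    ymax = mean_score + se_score),
                position = position_dodge(width = 0.9),
                width = 0.3)
p

# Compare where ggplot placed the bars (layer 1) and error bars (layer 2)
bar_x <- sort(as.numeric(layer_data(p, 1)$x))
err_x <- sort(as.numeric(layer_data(p, 2)$x))
all.equal(bar_x, err_x)
```

Try changing only the error bar's position_dodge() to .5 and rerun the last three lines to watch the centers stop matching.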
One of the largest options in ggplot2 surrounds the theme of the graph. Pull up ?theme to see the long list of options that exist in ggplot. Theme is a different kind of layer. In the past, we’ve been adding layers using the addition key. But here, theme() acts more like a function, which takes a lot of inputs that we can clarify with commas that separate each of them.
If we want something removed, the option that says this is element_blank(). We’ll add one just for example purposes, but notice how long that list is. It’s up to you to play with it and see how things can change.
ggplot(data, aes(x=prepost, y=mean_TotHuman, fill=Condition))+
geom_col(position="dodge")+
geom_errorbar(aes(ymin=mean_TotHuman-se_TotHuman,
ymax=mean_TotHuman+se_TotHuman),
position=position_dodge(.9),
width=.3)+
theme(panel.grid.major=element_blank())
Now, you’ll notice we used geom_col() when I said we were talking about bar plots. But why, Dr. C, did we not use geom_bar()?
If we go back to the help file for geom_bar, we note it says:
geom_bar() makes the height of the bar proportional to the number of cases.
That sounds a lot like a histogram - where we’re simply counting. And indeed, that is the general use of geom_bar(). For example, if we did a simple geom_bar() using data=diamonds, you’ll notice a few errors pop up if you try and specify a y= and x= aes() in the ggplot call. The error says:
Error: stat_count() can only have an x or y aesthetic.
Why? Well, because geom_bar() wants to just count things. You can count 1 thing, but you can’t count two things at once!
ggplot2::ggplot(data=diamonds, aes(x=cut))+
geom_bar()
But, if you read the documentation carefully, it gives us a hint that we could use if we wanted geom_bar() to act like geom_col(). What does geom_bar use by default, and how does it change when considering geom_col()? What if we told geom_bar() to use the stat option that geom_col() uses?
ggplot(data, aes(x=prepost, y=mean_TotHuman, fill=Condition))+
geom_bar(stat="identity",
position="dodge")+
geom_errorbar(aes(ymin=mean_TotHuman-se_TotHuman,
ymax=mean_TotHuman+se_TotHuman),
position=position_dodge(.9),
width=.3)+
theme(panel.grid.major=element_blank())
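To check that the two really are doing the same thing, here is a small sketch on a made-up three-row data frame (toy, group, and value are invented names) comparing the bar heights each version computes.

```r
library(ggplot2)

# toy is a hypothetical table of already-summarized values
toy <- data.frame(group = c("a", "b", "c"), value = c(2, 5, 3))

p_col <- ggplot(toy, aes(x = group, y = value)) + geom_col()
p_bar <- ggplot(toy, aes(x = group, y = value)) + geom_bar(stat = "identity")

# Both versions use the numbers we gave them as the bar heights
layer_data(p_col)$y
layer_data(p_bar)$y

# Left to its default, geom_bar() counts rows instead: one row per group here
p_count <- ggplot(toy, aes(x = group)) + geom_bar()
layer_data(p_count)$count
```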
In the end, you can get the same thing. I haven’t really found a true benefit to either, though I just have been recently transitioning towards geom_col(), since that is what it is /meant/ to be for.
We could display the above data also using lines instead of bars. For most kinds of visualizations you would see, if we’re using geom_line(), we probably want to feed it data in this more compressed form. You should try on your own, and really it kind of just comes down to trial and error on how we want them to look. But let’s add on geom_line() instead of geom_bar(), and let’s pass through two arguments: inside aes() we’ll say color is equal to Condition, and outside of aes() we’ll set size equal to 1.2.
ggplot(data, aes(x=prepost, y=mean_TotHuman, group=Condition))+
geom_line(size=1.2, aes(color=Condition))
Now, compare this graph to a similar one where we just switch which is the group= variable and which is the prepost variable. Don’t forget to switch the aes() target as well. How do they look different? What does a line graph suggest that might not actually be true in this second graphic?
ggplot(data, aes(x=Condition, y=mean_TotHuman, group=prepost))+
geom_line(size=1.2, aes(color=prepost))
One very valid critique of this kind of plot is that it really doesn’t tell us much information. If you think about the information we gave R, we just immediately gave it the means and standard errors. We reduced the observations it could play with to four. It gave us the mean and the standard error, but we could find out a lot more about the data. Lately, visualization has been moving towards providing people with as much detail about the data as possible. One of these options is called the Box and Whisker plot. In R, we call it geom_boxplot().
To use this plot, we need the full data set. We have downloaded this off Sakai as well, and you can read it in with read.csv() as data2. We’ll then set everything the same as before – aes(), x=prepost, y=TotHumanization, fill=Condition.
data2 <- read.csv("JayTermGeomBox.csv")
ggplot(data2, aes(x=prepost, y=TotHumanization, fill=Condition))+
geom_boxplot()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
Good! What is being shown? The dots are the outliers.
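If you are wondering how R decides which dots count as outliers, here is a rough sketch using made-up numbers. It assumes the default rule described in the geom_boxplot() help file, where anything more than 1.5 times the interquartile range beyond the edge of the box gets drawn as a dot. The scores vector is invented for illustration and is not from data2.

```r
# scores is a hypothetical set of ten ratings with one extreme value
scores <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# The box runs from the 25th to the 75th percentile, with the median inside
q <- quantile(scores, c(0.25, 0.5, 0.75))
q

# The interquartile range (IQR) is the height of the box
iqr <- q[["75%"]] - q[["25%"]]

# Anything beyond 1.5 * IQR past the box is flagged as an outlier
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
scores[scores < lower_fence | scores > upper_fence]
```

So an outlier here is not a mistake in the data. It is just a value that sits far from the middle half of the observations.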
One option that has been available to us throughout all of these graphs is what is called facet_wrap(). facet_wrap() breaks our graph into separate panels, one for each value of a variable. Let’s do a basic test of how this might look by going back to the graph we just made. After geom_boxplot(), let’s add another layer onto our ggplot, and this time call facet_wrap(). Let’s look at the help file of facet_wrap()…. We’ll use that third option, where it seems to be something like variable tilde period. Let’s facet_wrap() on Condition first.
ggplot(data2, aes(x=prepost, y=TotHumanization, fill=Condition))+
geom_boxplot()+
facet_wrap(Condition~.)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
But we also could facet_wrap() on variables we didn’t define up front. We have a variable in data2 called Female. Let’s facet_wrap() on Female.
ggplot(data2, aes(x=prepost, y=TotHumanization, fill=Condition))+
geom_boxplot()+
facet_wrap(Female~.)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
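You may run into facet_wrap() written with the tilde on either side of the variable. As a quick sketch on the built-in diamonds data (so it runs without our class files), both spellings give one panel per level of the variable; the names p_a and p_b are made up.

```r
library(ggplot2)

# Variable, tilde, period: the spelling we used above
p_a <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  facet_wrap(cut ~ .)

# Tilde, variable: another spelling you will see in examples
p_b <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  facet_wrap(~ cut)

# Count the panels each version produced: one per level of cut
length(unique(layer_data(p_a)$PANEL))
length(unique(layer_data(p_b)$PANEL))
```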
The best part of ggplot2 is how you can start to really add layers and layers and combine everything you’ve learned onto a single graph. We’ll load our third dataset in now as data3. This dataset comes from another undergraduate student project. In a previous study, we had people write down why they voted for either Biden or Trump. In this study, we selected five of those stories and asked both Democrats and Republicans to rate how persuasive each story was, in order to find the most persuasive story. Of course, Dems will say reasons for Biden are more persuasive than Reps, and Reps will say reasons for Trump are very persuasive compared to Dems. But we really care more about that cross - Dems rating Republican stories. Either way, we can graph one of these results here.
To start, let’s load in data3 by read.csv() on JayTermRaincloud.csv. This is a piece of the larger dataset. You might notice with head() that something like data3[6:10,] is missing for the dependent variable of choice. This means that this person read the Democrat stories. You will also notice that each person shows up 5 times (ResponseID is the participant’s ID). Why?
Good. We’ll talk about this more later, but this is what we call tidy data in long form. Most other statistics packages handle data in wide form, that is, a column for each story. But what makes data tidy is that each row is its own observation, and here we have 5 observations per person.
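To make the wide versus long difference concrete, here is a sketch with two invented participants and two stories. The column names Story1 and Story2 are hypothetical; only ResponseID, StoryNum, and TotPersuade mirror names from data3. The reshaping is done by hand in base R so you can see every step, though in practice people usually reach for a helper such as tidyr::pivot_longer().

```r
# wide: one row per person, one column per story (made-up ratings)
wide <- data.frame(
  ResponseID = c("p1", "p2"),
  Story1 = c(3, 5),
  Story2 = c(4, 2)
)

# long: one row per person per story, so each row is a single observation
long <- data.frame(
  ResponseID = rep(wide$ResponseID, times = 2),
  StoryNum = rep(1:2, each = nrow(wide)),
  TotPersuade = c(wide$Story1, wide$Story2)
)

wide
long
```

Two people and two stories give four rows in long form, the same way each person in data3 shows up five times.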
Anyway, back to our data. Let’s start by setting up our ggplot. For our x axis, let’s make it StoryNum. For our Y axis, we only have 1 variable in there that could be our dependent variable. For our first layer, let’s add another way to visualize data, calling geom_flat_violin() from the PupillometryR package. What information is this giving us? The distribution! Right.
data3 <- read.csv("JayTermRaincloud.csv")
ggplot(data3,aes(x=StoryNum,y=TotPersuade))+
PupillometryR::geom_flat_violin()
## Warning: Removed 265 rows containing non-finite values (stat_ydensity).
Now let’s add geom_point() to it.
ggplot(data3,aes(x=StoryNum,y=TotPersuade))+
PupillometryR::geom_flat_violin()+
geom_point()
## Warning: Removed 265 rows containing non-finite values (stat_ydensity).
## Warning: Removed 265 rows containing missing values (geom_point).
And finally, let’s add geom_boxplot.
ggplot(data3,aes(x=StoryNum,y=TotPersuade))+
PupillometryR::geom_flat_violin()+
geom_point()+
geom_boxplot()
But wait! We said the data was looking at Democrats and Republicans. Now that we have facet_wrap() down, let’s add that into this mammoth too.
ggplot(data3,aes(x=StoryNum,y=TotPersuade))+
PupillometryR::geom_flat_violin()+
geom_point()+
geom_boxplot()+
facet_wrap(Republican~.)
Okay! So, clearly this is a lot of stuff on there, and at the moment, it’s pretty meaningless. Let’s start with our most recent layer, geom_boxplot(). What can we add to it to make it better? Maybe some colors? What would be the difference between doing fill=StoryNum in ggplot() versus fill=StoryNum in geom_boxplot()? Test out both and report back. Let’s stick with it in ggplot().
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum))+
PupillometryR::geom_flat_violin()+
geom_point()+
geom_boxplot()+
facet_wrap(Republican~.)
If we go back to geom_boxplot(), we remember it gave us what for dots? Right, the outliers. Well, do we need that, if we have all of the points showing up with geom_point()? Probably not. We can check the documentation for ?geom_boxplot and see it says:
Hiding the outliers can be achieved by setting outlier.shape = NA. Importantly, this does not remove the outliers, it only hides them.
At the same time, the points are not very helpful right now because they’re all lined up. If we typed ?position into R, it would give us a bunch of things, including position_jitter(). We’ll do this one, and to just help us along, we’ll set the width to .15 and outside, also set the size of the points smaller.
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum))+
PupillometryR::geom_flat_violin()+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA)+
facet_wrap(Republican~.)
Getting there. But our boxplots are quite large. We’re really just using them to tell us more information about the median and quartiles. So we could probably shrink their width a lot, and make them a bit more transparent. If we google “how to increase transparency ggplot”, the first link from tidyverse (https://ggplot2.tidyverse.org/reference/aes_colour_fill_alpha.html) tells us, if we search for “trans” on the page:
Alpha refers to the opacity of a geom. Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors.
Let’s make the width of the boxplot = 0.1, and the alpha equal to 0.3.
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum))+
PupillometryR::geom_flat_violin()+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA, alpha=0.3, width=0.1)+
facet_wrap(Republican~.)
Great! Now, for the geom_flat_violin, it’s kind of still on top of (well, below), the boxplot. Wouldn’t it be nice if we could just shift it over a little bit? Like, nudge it over? Can someone type in ?position into the console and see if there is a way we could nudge the position?
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum))+
PupillometryR::geom_flat_violin(position=position_nudge(x=.25))+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA, alpha=0.3, width=0.1)+
facet_wrap(~Republican)
Great. Now, let’s get rid of that legend – it’s taking up room. What can someone find if we google “remove legend ggplot2”?
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum))+
PupillometryR::geom_flat_violin(position=position_nudge(x=.25))+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA, alpha=0.3, width=0.1)+
facet_wrap(~Republican)+
theme(legend.position = "none")
Great! And let’s add some better colors to this. As a Harry Potter nerd, I’m a sucker for all things Harry Potter. Good thing we installed that Harry Potter package. If we check the help file for ?harrypotter, we’ll see there’s a single function, hp, and it creates a vector of colors of length n. There are various options, which we’ll set to the character string "Ravenclaw".
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum))+
PupillometryR::geom_flat_violin(position=position_nudge(x=.25))+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA, alpha=0.3, width=0.1)+
facet_wrap(~Republican)+
theme(legend.position = "none")+
scale_fill_manual(values=harrypotter::hp(5, option="Ravenclaw"))
To round out this practice, let’s make this graph a bit uglier to just remind us how things work. Let us also specify at the top of ggplot() that the color of the graph should also be StoryNum. Does anyone remember what color will define compared to fill? Yes, the outline. But right now in our code, we haven’t really specified color anywhere. So if we ran it, R would default to its own ugly color choices for outlines, one for each group. We can add scale_color_manual() to the bottom. While we would normally probably also want to choose Ravenclaw to make it look nice, in order to emphasize what’s happening, let’s name the second one Slytherin.
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum, color=StoryNum))+
PupillometryR::geom_flat_violin(position=position_nudge(x=.25))+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA, alpha=0.3, width=0.1)+
facet_wrap(~Republican)+
theme(legend.position = "none")+
scale_fill_manual(values=harrypotter::hp(5, option="Ravenclaw"))+
scale_color_manual(values=harrypotter::hp(5, option="Slytherin"))
It also should be stated that throughout all of this, we have been operating ggplot through a single dataset: the one that gets passed into the original ggplot() call. However, this does not always have to be the case! ggplot is happy to play with multiple datasets at once, because all we’re doing is just adding additional layers onto it. This comes into play especially when we have faceted plots, but the general idea is something important that we should center on.
To demonstrate this, let’s first annotate our graph with the word “hello!”.
ggplot(data3,aes(x=StoryNum,y=TotPersuade, fill=StoryNum, color=StoryNum))+
PupillometryR::geom_flat_violin(position=position_nudge(x=.25))+
geom_point(position = position_jitter(width = .15), size = .25)+
geom_boxplot(outlier.shape=NA, alpha=0.3, width=0.1)+
facet_wrap(~Republican)+
theme(legend.position = "none")+
scale_fill_manual(values=harrypotter::hp(5, option="Ravenclaw"))+
scale_color_manual(values=harrypotter::hp(5, option="Slytherin"))+
annotate(geom="text", x = 1, y = 2.51, label = "Hello!", size=5)
This is good, but ggplot also didn’t really know how to handle this new thing, so it put it on both panels. You can see how this might not be great. We might only want to write something on one half of the plot. So, to address this, we’re going to have to create a new dataset. How would you google to solve this problem?
annotate on only one facet
https://stackoverflow.com/questions/11889625/annotating-text-on-individual-facet-in-ggplot2
dat_text <- data.frame(
label = c("Hello!", NA),
Republican = c("Democrat", "Republican"),
StoryNum = c(1,2),
x = c(1,1),
y = c(2,2)
)
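A small data frame like this is meant to be handed to a text layer as its own data. Here is a sketch of the general pattern on invented data, so it runs without data3: the names toy, party, story, rating, and label_df are all made up, and this is one common approach rather than the only one.

```r
library(ggplot2)

# toy is a hypothetical dataset with two facets' worth of points
toy <- data.frame(
  party = rep(c("Democrat", "Republican"), each = 3),
  story = rep(1:3, times = 2),
  rating = c(2, 3, 4, 4, 3, 2)
)

# One row per label, including the facet variable saying where it belongs
label_df <- data.frame(
  party = "Democrat",
  story = 1,
  rating = 4.5,
  label = "Hello!"
)

p <- ggplot(toy, aes(x = story, y = rating)) +
  geom_point() +
  facet_wrap(~ party) +
  # This layer uses label_df instead of toy, so the text only appears
  # in the facet whose party value matches
  geom_text(data = label_df, aes(label = label))
p
```

Because label_df has a party column, ggplot knows which panel the text belongs in, which is exactly what annotate() could not tell it.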
So far, we’ve done a lot. We’ve learned a lot of different kinds of ggplots and played with a bunch of data. But we potentially have already noticed a problem. For things like geom_point(), we want all of the data. But then when we wanted geom_bar(), we wanted summary data. I was a generous god and gave you that data nicely prepared. But what if it wasn’t nicely prepared?
Cleaning data is the cornerstone of coding in R. So, let’s get to cleaning! We have a dataset called CleanMe.csv. This is a semi-cleaned dataset of one of my dissertation studies. In the study, we had a 2x2 design. You read about an attack that happened on American soil, perpetrated by either a U.S. extremist group or a foreign extremist group (that variable is called Group). You were then presented with a policy that either was extremely restrictive to your human rights or was not very restrictive to your human rights (PolicyGroup). We then asked you how supportive of the policy you were, and controlled for a bunch of things about you.
Let’s check it out by typing both View(data) and names(data)
rm(list=ls())
data <- read.csv("CleanMe.csv")
View(data)
names(data)
There’s a lot of variables! 121 variables. A lot of these are useless and we’ll consider removing them, others need to be summarized. For instance, I wanted to control for what kind of mood you were in, and we do that through what’s called the Positive And Negative Affect Scale, or PANAS. There are 20 questions that make up PANAS, ten which are positive, and ten which are negative. In that list of variables, can anyone find any of the PANAS questions we asked?
Great. So, one thing we would want to do is create an average Positive PANAS score and an average Negative PANAS score. On just a theoretical level, how would we get your average Positive PANAS score?
Good. So there are two ways we could go about this. The first is to simply add the positive columns together and divide by 10, since there are ten. For positive that is PANAS_Inter, PANAS_Excited, PANAS_Strong, PANAS_Ethus, PANAS_Proud, PANAS_Alert, PANAS_Inspired, PANAS_Determined, PANAS_Attentive, and PANAS_Active. We could name this new variable PANAS_Pos1 as the first way to do it.
data$PANAS_Pos1 <- (data$PANAS_Inter + data$PANAS_Excited + data$PANAS_Strong +
data$PANAS_Ethus + data$PANAS_Proud + data$PANAS_Alert +
data$PANAS_Inspired + data$PANAS_Determined + data$PANAS_Attentive +
data$PANAS_Active)/10
A second, just as reasonable way to go about this would be to consider each row (person) and take the mean of the columns that we want. rowMeans()’s documentation seems to just accept an x, with no argument for picking out columns. That is, it would try to average across every column in our dataset. But what we could do is include a function that subsets our data to only the columns that are important. Does anyone remember what function we learned that subsets data?
Great. Check your notes on that one, and write out what you need to do to create PANAS_Pos2.
data$PANAS_Pos2 <-
rowMeans(
subset(data,
select = c(
PANAS_Inter, PANAS_Excited, PANAS_Strong,
PANAS_Ethus, PANAS_Proud, PANAS_Alert,
PANAS_Inspired, PANAS_Determined, PANAS_Attentive,
PANAS_Active
)
)
)
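To see how the two approaches compare, here is a small sketch on made-up data (the data frame toy and its item columns are hypothetical). It also adds na.rm = TRUE to rowMeans(), which the code above does not do. That is an assumption about what you might want when someone skips a question.

```r
# Toy data (hypothetical): three people, three items, one missing answer
toy <- data.frame(
  item1 = c(1, 4, 5),
  item2 = c(2, NA, 5),
  item3 = c(3, 4, 5)
)

# Way 1: add the columns and divide by the number of items
toy$avg_sum <- (toy$item1 + toy$item2 + toy$item3) / 3

# Way 2: rowMeans() on just the item columns, skipping missing answers
toy$avg_rowmeans <- rowMeans(
  subset(toy, select = c(item1, item2, item3)),
  na.rm = TRUE
)

toy
```

For people who answered everything, both ways agree. For the person with a missing answer, adding the columns gives NA, while rowMeans() with na.rm = TRUE averages the answers that are there.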
Let’s make a very quick dataset to display this problem. Create two variables. The first, person, assign it a vector of “Person1” and “Person2”. The second variable, happyatwj, assign it a vector of “Strongly Disagree” and “Strongly Agree”. Then, use data.frame() to create a new dataset called data3 with these two things.
person <- c("Person1", "Person2")
happyatwj <- c("Strongly Disagree", "Strongly Agree")
data3 <- data.frame(person, happyatwj)
Great. So, this happens a lot in data. We ask people on a scale how happy they are, but then we want to analyze the numbers (imagine it was 1=SD…. 7=SA). Well, that’s not going to work well when you have it as the words! We can use plyr’s revalue() function. Let’s pull up this documentation. The examples at the bottom seem to vary in their application, but it does at least look pretty straightforward. Does the old value go on the left or right of the equal sign here?
Great. Let’s reassign happyatwj using revalue(). While we know there’s only two values, for our own effort, we should type out all seven possible combinations. It will help with your notes and a later class. So Strongly Disagree =1, Disagree=2, Somewhat disagree = 3…. Call it after to see what it has become.
data3$happyatwj <- plyr::revalue(data3$happyatwj,
c(
"Strongly Disagree"="1",
"Disagree"="2",
"Somewhat disagree"="3",
"Neither agree nor disagree"="4",
"Somewhat agree"="5",
"Agree"="6",
"Strongly Agree"="7"
)
)
## The following `from` values were not present in `x`: Disagree, Somewhat disagree, Neither agree nor disagree, Somewhat agree, Agree
data3$happyatwj
## [1] "1" "7"
We get a warning that is nice, just letting us know what’s not present. If you were expecting there to be something and it says there wasn’t, maybe there’s a problem with some capitalization. PLUS, what if you had different anchors (maybe one of them was Very unexcited to Extremely Excited) but still 1-7!?
BUT! Does anyone notice something that is wrong here? Yes! The numbers are characters – they have quotes around them. We have a function as.numeric() that we can use. Edit your above code but this time, start with as.numeric() and as an argument, pass everything you have done before. Refer back to Meta - Functions Inside of Functions - to remind us about this.
Still, we have written out something that cleans a single variable and took up quite a lot of lines of code. It is possible to do this for all 121 variables, but you can imagine that will take up a lot of your time.
data3$happyatwj <- as.numeric(plyr::revalue(data3$happyatwj,
c(
"Strongly Disagree"="1",
"Disagree"="2",
"Somewhat disagree"="3",
"Neither agree nor disagree"="4",
"Somewhat agree"="5",
"Agree"="6",
"Strongly Agree"="7"
)
)
)
## The following `from` values were not present in `x`: Strongly Disagree, Disagree, Somewhat disagree, Neither agree nor disagree, Somewhat agree, Agree, Strongly Agree
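For comparison, here is a different technique that needs no packages: a named vector used as a lookup table. This is a sketch on made-up answers, not a replacement for revalue(), and the names likert_key, answers, and scores are invented for the example. One difference to be aware of is that labels that don’t match anything come back as NA here.

```r
# Named vector: names are the old labels, values are the new numbers
likert_key <- c(
  "Strongly Disagree" = 1, "Disagree" = 2, "Somewhat disagree" = 3,
  "Neither agree nor disagree" = 4, "Somewhat agree" = 5,
  "Agree" = 6, "Strongly Agree" = 7
)

# The third answer has different capitalization on purpose
answers <- c("Strongly Disagree", "Strongly Agree", "strongly agree")

# Index the key by label; unname() drops the labels from the result
scores <- unname(likert_key[answers])
scores
```

The result is already numeric, so there is no as.numeric() step. The mismatched "strongly agree" turns into NA, which is one way to catch capitalization problems early.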
Now, in this study, as we said, you read about one of two different policies. Sometimes, depending on how you collect the data, you might end up with multiple variables based on the condition that people were in. That is, I asked YOU “How secure is this policy?” and I asked someone else “How secure is this policy?” but they went into my excel sheet as different variables. We can see that happened where I have data$Secure1 and data$Secure2.
To see how that looks, try table(data$Secure1, data$PolicyGroup) and table(data$Secure2, data$PolicyGroup).
table(data$Secure1, data$PolicyGroup)
##
## High Restrictions Low Restrictions
## 1 19 0
## 2 15 0
## 3 21 0
## 4 10 0
## 5 16 0
## 6 4 0
table(data$Secure2, data$PolicyGroup)
##
## High Restrictions Low Restrictions
## 1 0 17
## 2 0 22
## 3 0 18
## 4 0 10
## 5 0 11
## 6 0 3
## 7 0 1
This is fine, but we can’t analyze a dependent variable like how secure you think something is if we have two of the same question! To understand what is going on, let’s go back to what we learned about selecting rows. We want to look at Secure2. Because of the second table, we’re interested in the values of Secure2 when data$PolicyGroup=="High Restrictions". How might we go about this? Confirm how this looks different if you did Secure1.
data[data$PolicyGroup=="High Restrictions",]$Secure2
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA
data[data$PolicyGroup=="High Restrictions",]$Secure1
## [1] 5 2 5 4 2 5 3 1 5 3 1 1 1 3 4 2 3 4 1 6 1 1 6 3 3 3 4 3 6 5 3 5 5 3 5 1 6 5
## [39] 2 4 3 3 4 4 4 2 5 1 2 3 3 4 2 3 5 1 5 1 1 4 1 3 2 3 1 1 3 2 2 5 3 3 2 5 1 5
## [77] 2 2 2 5 2 3 1 1 1
Yes! They’re all NA. So we have two columns – in each, you have either an NA value or a number value. How can we combine NA and VALUE to get VALUE?
data$Secure <- rowMeans(subset(data, select = c(Secure1, Secure2)), na.rm = TRUE)
Indeed, we can take an average, as long as we ensure that na.rm=TRUE!
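Here is a small sketch of why this works, using made-up numbers (the toy data frame is hypothetical). It also shows one alternative with ifelse(), and what happens in the edge case where someone is missing both versions of the question.

```r
# Toy data (hypothetical): each person answered at most one of the two versions
toy <- data.frame(
  Secure1 = c(5, NA, 2, NA),
  Secure2 = c(NA, 3, NA, NA)
)

# Averaging one real value and one NA (with na.rm = TRUE) returns the real value
toy$Secure <- rowMeans(toy[, c("Secure1", "Secure2")], na.rm = TRUE)

# Alternative: take Secure1 unless it is missing, in which case take Secure2
toy$Secure_alt <- ifelse(is.na(toy$Secure1), toy$Secure2, toy$Secure1)

toy
```

The two columns agree wherever there is an answer. For the last person, who has no answer at all, rowMeans() returns NaN while the ifelse() version returns NA, so it is worth checking for rows like that.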
Other times, maybe something goes wrong and our variables just aren’t called what we want them to be called. This happens a lot with government datasets, or datasets from other people. There are tons of ways to do this, but my chosen favorite way is using dplyr’s function (which was installed with tidyverse) called rename. Dplyr’s documentation example shows:
rename(iris, petal_length = Petal.Length)
Which, if we took the time to load the iris dataset, we would quickly learn the format seems to be: rename(data, NEWVARNAME = OLDVARNAME). Also note there are no quotes here! Well, we have 3 variables in this dataset I know need to be renamed. Q120 should be how much do you support the right to privacy (SupportPrivacy), Q121 is how much do you support human rights (SupportHR), and Q122 is how much do you support not torturing people (SupportNoTort).
data <- dplyr::rename(data, c(SupportPrivacy = Q120,
SupportHR = Q121,
SupportNoTort = Q122))
Great job. Sometimes, there are variables in our dataset that are generally kind of useless. These columns just take up space when we call things like names(). For Qualtrics studies, you get things like LocationLatitude and LocationLongitude. We are never going to analyze that. How can we get rid of those variables, or, how could we select to keep all columns BUT these two?
One way we’ve already learned - we could subset the data, and just minus out those two variables. Save this new dataset as data2.
data2 <- subset(data, select = -c(LocationLongitude, LocationLatitude))
And we can see on the right, we’ve removed those two columns. Great job.
In this dataset, we also have data for the participant’s gender. Right now it is hiding in Q72. Let’s rename that variable as Gender. Now, in our data, Dr. C knows that 1 means the participant is male and 2 means the participant is female. There were no participants who reported a third gender in this dataset. But as is, that’s not clear! Imagine a visualization where it said gender 1 or 2. What would that even mean?
After we’ve renamed the variable to Gender, let’s also create a new variable, called Male. In data$Male, we want to look for all data$Gender which is equivalent to 1. In those cases, we want to assign it as “Male”. Give this your best shot. For all observations where data$Gender does NOT equal one, we want to assign it as “Not Male”.
Finally, in order for R to be sure it understands what we want, be sure to tell R that data$Male is as.factor() using the as.factor() function.
data <- dplyr::rename(data, c(Gender = Q72))
data$Male[data$Gender==1] <- "Male"
data$Male[data$Gender!=1] <- "Not Male"
data$Male <- as.factor(data$Male)
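The same recode can be written in one step with ifelse(). This is an alternative sketch on made-up data (the toy data frame is hypothetical), not a correction of the code above.

```r
# Toy data (hypothetical): 1 = male, 2 = female, NA = no answer
toy <- data.frame(Gender = c(1, 2, 2, 1, NA))

# ifelse(test, value_if_true, value_if_false) works down the whole column at once;
# factor() then tells R these are categories
toy$Male <- factor(ifelse(toy$Gender == 1, "Male", "Not Male"))

# useNA = "ifany" shows missing values in the table if there are any
table(toy$Male, useNA = "ifany")
```

Anyone with a missing Gender stays missing in Male, rather than being counted as “Not Male”.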
We’re going to load in ReshapingData.csv now. This is another version of the same dataset we used in geom_boxplot() and facet_wrap(). This is the study that asked people both before and after reading a story how they felt about someone else.
You can check names(data) and see that instead of TotHumanization with a pre-post variable, this data is in what we call the wide form. That is, there is a column for their Pre_TotHumanization and a column for their Post_TotHumanization. This is the best way to have our columns set up in R - that is:
Variables are clearly identified by their within subjects variable at the beginning of the variable name (pre/post here).
There is a separator between the within subjects variable and the DV of choice (here, _).
The spelling and capitalization of the pre/post or within-treatment variables is the same (TotHumanization).
Whatever has been used for 1-3 is used for all variables that change across time and are measured Time number of times. For all variables that do not change (gender, age), they should NOT include 1-3.
To do this, we need to turn on our tidyverse library. We’re going to create a new dataset called data_long, which will be built from data.
What we’re going to do next is one of the most bizarre but critical parts of R. We’re going to use what we call the pipe operator, %>%. The pipe operator says “Take this thing on the left and do the thing on the right to it.”
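A tiny sketch of the pipe on made-up numbers may help before we use it on real data. The vector scores is hypothetical. The pipe comes from the magrittr package, which library(tidyverse) also makes available.

```r
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

scores <- c(4, 9, 16)

# Without the pipe: read from the inside out
mean(sqrt(scores))

# With the pipe: read left to right
scores %>% sqrt() %>% mean()
```

Both lines do the same thing: take the square root of each score, then average the results. The piped version just reads in the order the steps happen.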
The documentation for tidyr is a bit complicated, so it will take time to read through. But we can acknowledge it has this as arguments:
pivot_longer( data, cols, names_to = “name”, names_prefix = NULL, names_sep = NULL,
That’s a lot of good info, and all we need to work with. Remember, since we’re being explicit already about what our data is (using the pipe), we don’t need that first argument; it’s being passed in. The second argument asks for columns. We want to pivot only the Pre_ and Post_ columns, which is the same as pivoting everything BESIDES the columns that do not start with Pre_ or Post_. We can use -c() to remove all of the columns that we list out.
Next says names_to = . We’ll pass two arguments through c() here, “prepost” and “.value”.
Finally, it says names_sep. Can anyone tell us what our names sep is?
Yes, "_".
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data <- read.csv("ReshapingData.csv")
data_long <- data %>%
pivot_longer(cols= -c(ThoughtsCasey.c, ThoughtsCasey, WillingCasey,
Female, Condition, VoiceCondition, VoiceOrText,
Political_Party, Dehum.c, Dehum, ShareYourVote,
ShareYourReason, ResponseId, RespondToCasey, AgeP),
names_to = c("prepost", ".value"),
names_sep="_")
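If the real dataset is too big to follow by eye, here is the same pattern on a tiny made-up one (the data frames wide and long and their columns are hypothetical). It shows what the special ".value" entry in names_to does.

```r
library(tidyr)

# Toy wide data (hypothetical): one row per person
wide <- data.frame(
  id = c(1, 2),
  Pre_Score = c(3, 4),
  Post_Score = c(5, 6),
  Pre_Mood = c(1, 2),
  Post_Mood = c(2, 3)
)

long <- pivot_longer(
  wide,
  cols = -c(id),                      # pivot everything except id
  names_to = c("prepost", ".value"),  # text before "_" goes into prepost;
                                      # text after "_" becomes the column name
  names_sep = "_"
)

long
```

Two people in, four rows out: each person now has a Pre row and a Post row, with Score and Mood as ordinary columns.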
How many observations are in data_long? How does that relate to how many observations are in data? Check names(data_long). Check data_long$prepost. Tell me what you see.
#### plyr::ddply() and summarise()
Now, in geom_bar(), we had to have data summarized in a form where there was one observation for each condition. We can do this by using plyr::ddply(). If we look at the documentation for ddply(), the description sounds exactly like what we want:
Split data frame, apply function, and return results in a data frame.
We could spend the time looking up why this function has .data, and what that means, but we can also just cross down to the examples and see their example:
#Their example
ddply(dfx, .(group, sex), summarize,
mean = round(mean(age), 2),
sd = round(sd(age), 2))
So, in our case, we have a dataset called data_long. Where would we fill in that in the example? We also have two variables we want to split on– Condition and prepost. Where would we fill that in? And what is our dependent variable we’ve been analyzing? Yes, TotHumanization.
library(tidyverse)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
plyr::ddply(data_long, .(Condition, prepost), summarise,
mean = mean(TotHumanization),
sd = sd(TotHumanization)
)
## Condition prepost mean sd
## 1 StronglyAgree Post 3.887457 0.5945494
## 2 StronglyAgree Pre 3.729811 0.6704774
## 3 StronglyDisagree Post NA NA
## 4 StronglyDisagree Pre 3.889045 0.6330831
Now, in their example, they don’t include something we want to be sure to include. When we learned mean() and most math functions, what did we say we want to try and include most of the time? Yep, they don’t have na.rm=TRUE!
plyr::ddply(data_long, .(Condition, prepost), summarise,
mean = mean(TotHumanization, na.rm=TRUE),
sd = sd(TotHumanization, na.rm=TRUE)
)
## Condition prepost mean sd
## 1 StronglyAgree Post 3.887457 0.5945494
## 2 StronglyAgree Pre 3.729811 0.6704774
## 3 StronglyDisagree Post 4.077213 0.5651957
## 4 StronglyDisagree Pre 3.889045 0.6330831
Finally, for most scientific papers, we don’t use the standard deviation when graphing. We actually use something called the standard error. Since this is not a stats course, I’ll let you know that the standard error is the standard deviation divided by the square root of the number of observations for that group. So, we need to find out the number of observations in each group.
This one can be a bit tricky, and there are probably many ways to calculate N. For us, we know that !is.na() would give us a vector of TRUEs and FALSEs for whether or not the dependent variable is present. It turns out, TRUE is coded in R as 1 and FALSE is coded in R as 0. So if we sum up all of the TRUEs, we would get a total number of observations.
After that, we can take sd and divide it by sqrt() of n.
Finally, save this new dataset as twowaydata.
plyr::ddply(data_long, .(Condition, prepost), summarise,
n = sum(!is.na(TotHumanization)),
mean = mean(TotHumanization, na.rm=TRUE),
sd = sd(TotHumanization, na.rm=TRUE),
se = sd/sqrt(n)
)
## Condition prepost n mean sd se
## 1 StronglyAgree Post 194 3.887457 0.5945494 0.04268616
## 2 StronglyAgree Pre 194 3.729811 0.6704774 0.04813747
## 3 StronglyDisagree Post 177 4.077213 0.5651957 0.04248271
## 4 StronglyDisagree Pre 178 3.889045 0.6330831 0.04745159
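To check the arithmetic by hand, here is the same calculation for a single made-up group (the vector scores and the names n, sd_x, and se_x are hypothetical).

```r
# Toy scores for one group (hypothetical), with one missing value
scores <- c(2, 4, 4, 6, NA)

n    <- sum(!is.na(scores))       # TRUE counts as 1, so this counts non-missing values
sd_x <- sd(scores, na.rm = TRUE)  # standard deviation of the non-missing values
se_x <- sd_x / sqrt(n)            # standard error = sd / sqrt(n)

c(n = n, sd = sd_x, se = se_x)
```

Note that n is 4, not 5: the missing value is not counted, which matches what na.rm = TRUE does inside sd().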
Functions are probably what you first think about when you think about coding. We’ve gone quite a way in this course without even really thinking too much about them. It turns out, you just kind of plug in things and then magic comes out! And really, you can get pretty far with coding by just brute forcing your code.
But I’d like us to think about just a few times where we can save some coding space. Above, we learned a way to replace someone’s scale values of Strongly Disagree to a 1 and Strongly Agree to a 7. We noted that this worked for a single column, but it really would take up a lot of your time and script lines to do it for a whole dataset.
That is the goal of functions – to take what you’re doing a lot of, and do it less.
Take, for example, something silly. Imagine we want to amend data3’s “Person1” and “Person2” to “Person1 is a cool cat” and “Person2 is a cool cat”. Basically, given any input, we want to add onto it “is a cool cat”. We can write a function that does exactly this. To do that, we first need a building block.
paste0() is a function that takes multiple strings and slaps them together. So, we have whatever our input is, and we want to add in “is a cool cat”. For example, let’s pass through paste0() “Dr. C” and " is a cool cat".
paste0("Dr. C", " is a cool cat")
## [1] "Dr. C is a cool cat"
Great! But as stated, we want to do this for not just one time, but all of the times. That’s what’s great about functions – they can do exactly this. That is - we can create our own function instead of just relying on R and packages. Let’s name our function coolcat. Coolcat will be assigned function().
function() needs to be told what and how many things to expect. We think it’ll only be one thing (one column), so we’ll call it x.
Now that we’re writing our own functions, what the function actually does gets written inside curly brackets {}.
Our function is very simple – now instead of paste0(“Dr. C”, “is a cool cat”), we want to be able to give any input – x – and then also paste " is a cool cat".
coolcat <- function(x){
paste0(x, " is a cool cat")
}
You will see it show up on the right in a new part called Functions. To test to see if it worked, let’s run coolcat() on data3$person on its own. Then, let’s create a new column, called data3$cat, and assign it the output of coolcat().
coolcat(data3$person)
## [1] "Person1 is a cool cat" "Person2 is a cool cat"
data3$cat <- coolcat(data3$person)
Note: R has a family of functions called lapply(), sapply(), and vapply() that apply a function to each element of a list or each column of a data frame. You might stumble across these in time, and they are worth exploring in more detail next.
Slight note before starting: I created this function after reading: https://stackoverflow.com/questions/38724850/converting-likert-data-to-numeric-across-a-data-frame (specifically, the last answer).
dplyr::case_when() is a really cool function. We will use it to write our own function. Let’s call this function word_to_num. word_to_num(){} will take a single argument again, x.
Inside of our function, we’ll be strictly using case_when().
case_when()’s documentation says it “allows you to vectorise multiple if else statements…. If no cases match, NA is returned.” Let’s copy and paste that first example:
x <- 1:50
case_when(
x %% 35 == 0 ~ "fizz buzz",
x %% 5 == 0 ~ "fizz",
x %% 7 == 0 ~ "buzz",
TRUE ~ as.character(x)
)
## [1] "1" "2" "3" "4" "fizz" "6"
## [7] "buzz" "8" "9" "fizz" "11" "12"
## [13] "13" "buzz" "fizz" "16" "17" "18"
## [19] "19" "fizz" "buzz" "22" "23" "24"
## [25] "fizz" "26" "27" "buzz" "29" "fizz"
## [31] "31" "32" "33" "34" "fizz buzz" "36"
## [37] "37" "38" "39" "fizz" "41" "buzz"
## [43] "43" "44" "fizz" "46" "47" "48"
## [49] "buzz" "fizz"
This doesn’t help too much if we don’t know what %% means. We could google it. But, for shits and giggles, let’s just add one more line and try to guess the pattern:
x <- 1:50
case_when(
x %% 35 == 0 ~ "fizz buzz",
x %% 5 == 0 ~ "fizz",
x %% 7 == 0 ~ "buzz",
x %% 2 == 0 ~ "even",
TRUE ~ as.character(x)
)
## [1] "1" "even" "3" "even" "fizz" "even"
## [7] "buzz" "even" "9" "fizz" "11" "even"
## [13] "13" "buzz" "fizz" "even" "17" "even"
## [19] "19" "fizz" "buzz" "even" "23" "even"
## [25] "fizz" "even" "27" "buzz" "29" "fizz"
## [31] "31" "even" "33" "even" "fizz buzz" "even"
## [37] "37" "even" "39" "fizz" "41" "buzz"
## [43] "43" "even" "fizz" "even" "47" "even"
## [49] "buzz" "fizz"
Okay. So, now we have some idea what’s happening there. We can start to apply that by setting up case_when() just like that example: if x meets some condition, then the tilde sign (~) says what to turn it into. Now, we aren’t doing division and getting the remainder. Instead, we’re going to type the word in between the two percent signs, giving %in%. This is a specific type of operator in R.
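A quick sketch of the two operators side by side, on made-up values (the vector answers is hypothetical):

```r
# %% gives the remainder after division
10 %% 5   # 0, because 5 divides 10 evenly
10 %% 7   # 3, because 10 = 1 * 7 + 3

# %in% asks, for each element on the left, "is it found on the right?"
answers <- c("Agree", "Disagree", "agree")
answers %in% c("Agree", "Strongly agree")   # TRUE FALSE FALSE
```

Notice that "agree" with a lowercase a does not match "Agree". Matching is exact, so capitalization matters.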
##
word_to_num <- function(x) {
case_when(x %in% c("Strongly disagree") ~ 1,
x %in% c("Disagree") ~ 2,
x %in% c("Somewhat disagree") ~ 3,
x %in% c("Neither agree nor disagree") ~ 4,
x %in% c("Somewhat agree") ~ 5,
x %in% c("Agree") ~ 6,
x %in% c("Strongly agree") ~ 7
)
}
So, we have this new fancy function. But how does it work? Let’s read in our data for this activity (WordToNum.csv) and call it fundata. This is one of our datasets in its almost messiest form. We’re just going to test it out very quickly to show it off. One way to do it is we could go through and apply this function onto each variable. For example, we can see in names(fundata) the variable WillingCaseyPre_1. We could replace the column fundata$WillingCaseyPre_1 by passing that column through the function we created: word_to_num(fundata$WillingCaseyPre_1).
fundata <- read.csv("WordToNum.csv")
table(fundata$WillingCaseyPre_1)
##
## Agree
## 922 154
## Disagree Neither agree nor disagree
## 50 63
## Somewhat agree Somewhat disagree
## 124 34
## Strongly agree Strongly disagree
## 94 63
fundata$WillingCaseyPre_1 <- word_to_num(fundata$WillingCaseyPre_1)
table(fundata$WillingCaseyPre_1)
##
## 1 2 3 4 5 6 7
## 63 50 34 63 124 154 94
But again, that also is not as efficient as it could be. Instead, let’s think about WillingCaseyPre_2, 3, and 4. Create a vector that has as characters the three column names, and call that vector columns. We’re doing this because way back at the beginning, we learned that we could call out specific columns as long as those columns were in quotes.
What we’re going to end the class with is the idea of sapply(). Try ?sapply. Basically, we give it a list of things, and a function to apply to each of them. We have our list (you just made it), and we have our function (you just made it). Now we just have to be clear that what we’re saving over is also what we’re fixing.
columns <- c("WillingCaseyPre_2", "WillingCaseyPre_3","WillingCaseyPre_4")
fundata[columns] <- sapply(fundata[columns], word_to_num)
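Here is the same idea on a tiny made-up dataset, using lapply() instead of sapply(). The data frame toy, the vector cols, and the helper agree_to_num() are hypothetical stand-ins for fundata, columns, and word_to_num(). The difference between the two is that sapply() tries to simplify its result (here, into a matrix), while lapply() always returns a list with one element per column. Either can be assigned back into the data frame.

```r
# Toy data (hypothetical) and a simple converter standing in for word_to_num()
toy <- data.frame(
  q1 = c("Agree", "Disagree"),
  q2 = c("Disagree", "Agree"),
  other = c("a", "b")
)
agree_to_num <- function(x) ifelse(x == "Agree", 1, 0)

cols <- c("q1", "q2")

# lapply() runs the function on each listed column and returns a list,
# which slots straight back into those columns
toy[cols] <- lapply(toy[cols], agree_to_num)

str(toy)
```

Only q1 and q2 are converted. The column called other is left alone because it is not in cols.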