knitr::opts_knit$set(root.dir = "~/Dropbox/PACT Lab")
To install a package, run the following function with the name of the package in quotes. If it asks for permission to install other things, be sure to click Yes. Installing a package requires an internet connection, so be sure you have a stable connection before doing this.
install.packages("tidyverse")
The important libraries we generally need to install are:
Libraries that are more specific to statistics are:
Libraries that are important for learning R are:
Other libraries that show up in this Markdown are:
There are many things to try at this stage, so please be sure you have tried all of the below before asking for help. R is a puzzle that is meant to be solved: the computer is not wrong, you must have done something wrong. But R is trying to give you hints.
Did you read the error? What does it say? Did you copy and paste it into Google?
Did you take the time to check whatever it is you were working on? That is, have you run the variable in your console to see it and understand what it looks like? Have you run table(), head(), class(), and more on it to ensure it's all what you expected? (See the sketch after this list.)
Have you turned it off and on again – that is, have you tried running your script from top to bottom and reloading your data in?
Have you ensured you have loaded the libraries that you need? If you think you have, is it possible that multiple packages are calling the same function? Have you added the library name in the front to be safe with ::?
Have you tried breaking it yourself, or trying different things?
Have you checked your quotes, your brackets, your parentheses, your commas?
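For example, a minimal sketch of checking an object before asking for help (my_variable here is just a placeholder for whatever you are working with):
my_variable <- iris$Species
class(my_variable) #what type is it?
head(my_variable) #what do the first few values look like?
table(my_variable) #how are the values distributed?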
Loading data is done primarily through read.csv(). Other file formats will require different functions (see the sketch after the code below).
We are working to transition the lab to full tidy language use, which also means we want to ensure our data is loaded as a tibble using as_tibble().
library(tidyverse)
#setwd()
#data <- read.csv()
#A dataset we will use to generally practice some features.
data <- iris
data <- as_tibble(data)
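If your data is not a .csv, here is a rough sketch of some common readers (the file names are placeholders, and these packages would need to be installed first):
#readxl::read_excel("mydata.xlsx")
#haven::read_sav("mydata.sav")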
Scripts can be hard to read, especially as your code gets longer. That is why we use the Enter key. If you write a comma in your code, you should probably hit enter. If you use a plus sign in ggplot, you should probably hit enter. As you continue to read through this manual, you will notice I hit enter after most commas. You should do the same. RStudio will indent things for you and make the structure easy to see.
I also recommend you section your script by adding #### headers at various points. You will notice that a small collapse arrow appears on the left side near the line numbers. You can use this to collapse and jump around large sections of your script without needing to scroll too much.
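A minimal sketch of what a sectioned script might look like (RStudio treats a comment ending in four or more #, -, or = signs as a collapsible section):
#### Load packages ####
library(tidyverse)
#### Clean data ####
#cleaning code goes here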
R is happy to do things for you without saving its work. For example, we could ask R to calculate the mean of 2, 4, and 100 by:
mean(c(2,4,100))
## [1] 35.33333
But nothing was saved. Instead, if we want to keep that number moving forward, we need to be sure we're assigning it to something. We will always use the left-pointing arrow (<-) to denote assignment.
mean_of_three <- mean(c(2,4,100))
Nothing will happen in the console, but instead, this value is now in our Environment. We can always see that value by just typing into R:
mean_of_three
## [1] 35.33333
There are two main things we can assign. If we were to write:
value <- c("Dr. C", "Dr. K")
This would store a value into our Environment under Values. This is different from creating a variable.
data$var <- rep(c("Dr. C", "Dr. K"), 75)
This creates a variable called var. To understand this, we first state that we want things stored within the dataframe we call data, and then, we use the dollar sign to signify that this will be a new variable in the data.
see also: sapply
We can understand things in terms of lists (like above, where value was a list of two names) and data frames. Data frames are actually just lists organized in columns. Brackets are one of the many ways in which we can subset data. If we use a comma in the middle of a bracket, we can specify particular rows (left side) or columns (right side). If there is no comma, R treats it as if you're simply asking for the nth element in the list of lists, which would be the nth column. You can also call columns by passing the column's name as a character string, an important lesson that will be useful later.
head(data[,1]) #Column
## # A tibble: 6 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## 4 4.6
## 5 5
## 6 5.4
head(data[1]) #Also column
## # A tibble: 6 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## 4 4.6
## 5 5
## 6 5.4
data[1,] #Rows
## # A tibble: 1 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species var
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.1 3.5 1.4 0.2 setosa Dr. C
head(data["Sepal.Length"]) #another way to call the column.
## # A tibble: 6 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## 4 4.6
## 5 5
## 6 5.4
We can always type ? followed by a function name (like ?c) into the Console to learn more about it. The best places to look are the Arguments section and then the Examples section at the bottom.
Functions are things that have some letters on one side (many times a word, or a shortened word) and an open and closed set of parentheses on the other (). We call this thing() a function. Functions take arguments, or things we put inside of the parentheses. Different functions take different numbers of arguments, but if there are multiple arguments to pass to a function, we always separate them with commas.
Above, we used c() to create a list of two things. This is our first function! These simple lists are usually called vectors in coding. If we want to list multiple things, we probably want to assume we are using c() unless otherwise noted.
See: ! ; logical operators
is.na() returns a vector (list) of TRUE or FALSE. NA is R’s term for missingness. That is, is.na() will return a TRUE if something is missing and a FALSE if it is not-missing. We generally will use is.na() with the NOT, that is, to say !is.na(), to give us a list of TRUE if something is present, and only FALSE if something is not present.
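A minimal sketch with a made-up vector x that has one missing value:
x <- c(2, NA, 5)
is.na(x) #TRUE where missing
!is.na(x) #TRUE where present
sum(!is.na(x)) #counts the non-missing values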
table() will give us the cross-tabulation of one to two variables in a dataset and give the counts.
table(data$Species)
##
## setosa versicolor virginica
## 50 50 50
table(data$Species, data$var)
##
## Dr. C Dr. K
## setosa 25 25
## versicolor 25 25
## virginica 25 25
In what we are learning below, we can also call tables through piping.
data %>%
dplyr::select(Species, var) %>%
table()
## var
## Species Dr. C Dr. K
## setosa 25 25
## versicolor 25 25
## virginica 25 25
ls() lists everything in your environment.
rm() removes things from your environment. You can use rm(list=ls()) to remove everything from your environment.
rm(mean_of_three)
We can use <, >, ==, >=, and <= for most comparisons. If we want to test "equal to", R understands this as "equivalent to", which requires the use of a double equal sign. A single equal sign will not work; a single equal sign acts like our <- assignment operator. We can also use these operators at any point on variables in our data.
2+2==4
## [1] TRUE
data$Sepal.Length > 6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [61] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [97] FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [109] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [121] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [145] TRUE TRUE TRUE TRUE TRUE FALSE
table(data$Sepal.Length > 6)
##
## FALSE TRUE
## 89 61
An exclamation mark is used to say NOT. More specifically, it is primarily used to say NOT when we are considering statements of TRUE or FALSE. It will flip things for us. An important thing to think about with ! is the use of parentheses. That is, we have to be clear about what we're negating.
2+2==4
## [1] TRUE
!(2+2==4)
## [1] FALSE
2+3==4
## [1] FALSE
!(2+3==4)
## [1] TRUE
table(!(data$Sepal.Length > 6))
##
## FALSE TRUE
## 61 89
#Notice how, compared to above, this ! required
#that we wrap data$Sepal.Length > 6 in ().
see also: relevel; reorder factors
Factoring just tells R that a variable is categorical. We generally use this to reorder the factor levels. For example, the levels of Species are kept in alphabetical order by default.
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
Yet, using factor(), we can change that order.
iris$Species <- factor(iris$Species, levels = c("virginica",
"versicolor",
"setosa"))
levels(iris$Species)
## [1] "virginica" "versicolor" "setosa"
#Piping:
iris <- iris %>%
mutate(
Species = factor(Species,
levels = c("versicolor",
"virginica",
"setosa")))
levels(iris$Species)
## [1] "versicolor" "virginica" "setosa"
see also: piping, pull
Gets the global mean of some variable or object.
mean(iris$Sepal.Length)
## [1] 5.843333
iris %>%
pull(Sepal.Length) %>%
mean
## [1] 5.843333
see also: piping, pull()
Gets the standard deviation of some variable or object.
sd(iris$Sepal.Length)
## [1] 0.8280661
iris %>%
pull(Sepal.Length) %>%
sd
## [1] 0.8280661
Calculates the standard error of some variable or object. The se is the standard deviation divided by the square root of n.
sd(iris$Sepal.Length)/sqrt(
sum(
!is.na(iris$Sepal.Length)
)
)
## [1] 0.06761132
Since the lab is learning from dplyr, we need to immediately start with %>%, or what is known as the pipe. The pipe takes whatever is on its left and passes it to the function on its right as the first argument. You will see that all of the functions below take data as the first argument. Using the pipe, this step has been done for you, so you never need to specify which dataset you are using inside the function, since you have defined it on the left side of the pipe.
This should help with the readability of the script as you go on. We will see the pipe appear a lot, but you can try and read it as “and then”.
Finally, %>% allows us to end our "and then, and then, and then" chain without a () on the last function if that function only needs a single argument (and that argument is a dataset). Again, this makes sense, since functions like head() only need the data, and we have already supplied that at the beginning of our pipe.
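As a minimal sketch, these are equivalent ways of asking for the first rows of iris:
head(iris)
iris %>% head()
iris %>% head #the () are optional here because the data is the only argument needed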
Now, while most of the time we put datasets on the left, it does not need to be the case that it is the only thing on the right (or that functions are created to assume data comes first). For example, the lm() function (see lm()), takes a formula as the first argument, and later, data:
lm(Petal.Length ~ Species, data=iris)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Coefficients:
## (Intercept) Speciesvirginica Speciessetosa
## 4.260 1.292 -2.798
So, following pipe logic, trying to pipe in iris on the left would not work, since formula is the first argument, and formula is not a dataset.
iris %>% lm(Petal.Length ~ Species)
This would break! It would say: “Error in as.data.frame.default(data) : cannot coerce class ‘“formula”’ to a data.frame”.
However, piping allows us to use the period as a placeholder of where the left hand side will go.
iris %>% lm(Petal.Length ~ Species, .)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = .)
##
## Coefficients:
## (Intercept) Speciesvirginica Speciessetosa
## 4.260 1.292 -2.798
see also: select(), rowwise()
group_by() will be used to group our datasets by our independent variables. This is critical for other parts of dplyr and tidyr, where we will use the grouped dataset to get summary statistics.
Grouped data will force the select() function below to carry forward the grouped values.
data <- data %>%
group_by(Species, var)
head(data)
## # A tibble: 6 × 6
## # Groups: Species, var [2]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species var
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.1 3.5 1.4 0.2 setosa Dr. C
## 2 4.9 3 1.4 0.2 setosa Dr. K
## 3 4.7 3.2 1.3 0.2 setosa Dr. C
## 4 4.6 3.1 1.5 0.2 setosa Dr. K
## 5 5 3.6 1.4 0.2 setosa Dr. C
## 6 5.4 3.9 1.7 0.4 setosa Dr. K
mutate() is used to create new variables or change variables (mutate them). You can use mutate() instead of creating a new variable through the dollar sign assignment; either is acceptable. Here, I make two variables. First, I overwrite Sepal.Length by creating a new variable with the same name and adding one to it. I also create a new variable called newvar, which takes a random value of either 0 or 1.
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var"
data <- data %>%
ungroup() %>%
mutate(Sepal.Length = Sepal.Length+1,
newvar = sample(c(0:1), 150, replace=TRUE)) %>%
group_by(Species, var)
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var" "newvar"
see also: mutate()
recode() is dplyr's way to, well, recode variables. If we do not like how our factors are labeled, we can use recode() to change things around. Notice that because recode() takes a vector (a variable), we need to use it inside of mutate().
Also note here I was explicit about which recode (dplyr’s) I was calling. If errors occur in code, sometimes it’s because there are multiple packages that both came up with the same function name. For me, I have a package open called “car” (and you might too) and car has its own recode function. Since we opened dplyr and then car, car’s recode gets called first and then breaks this code. To avoid that, I call dplyr explicitly.
data <- data %>%
ungroup() %>%
mutate(var=dplyr::recode(var,
"Dr. C"="Dr. Carriere",
"Dr. K"="Dr. Kilgore"))
# Also can do:
#data$var <- recode(data$var, "Dr. C" = "Dr. Carriere",
# "Dr. K"="Dr. Kilgore")
see also: select(), mutate(), table()
filter() is used to reduce the number of rows. So far, I haven't found a great use for this, but one way in which I could envision this being useful is to create a new variable that has some value only for certain filtered rows.
Coming back to this, one good use I have found for filter() is to output only certain rows when looking at data, both in analyses and in writing things up, when you only need a few of the rows.
# My original thought.
library(tidyverse)
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.0.5
iris %>%
filter(Species=="setosa") %>%
filter(Petal.Length>1.8) %>%
mutate(newvar2="This is some new variable value") %>%
dplyr::select(newvar2, Species) %>%
table()
## Species
## newvar2 versicolor virginica setosa
## This is some new variable value 0 0 2
#Summarize only some rows if you were writing up a manuscript.
iris %>%
group_by(Species) %>%
summarise(mean=mean(Petal.Length, na.rm=TRUE)) %>%
filter(Species!="setosa")
## # A tibble: 2 × 2
## Species mean
## <fct> <dbl>
## 1 versicolor 4.26
## 2 virginica 5.55
#Summarize analyses based on stuff.
pred_mpg <- lm(displ ~ fl, data=mpg)
summary(pred_mpg)
##
## Call:
## lm(formula = displ ~ fl, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9018 -1.1018 -0.1826 1.0578 3.7365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.800 1.273 1.414 0.1588
## fld 1.240 1.395 0.889 0.3749
## fle 2.875 1.350 2.129 0.0343 *
## flp 1.463 1.285 1.139 0.2561
## flr 1.702 1.277 1.333 0.1840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.273 on 229 degrees of freedom
## Multiple R-squared: 0.04555, Adjusted R-squared: 0.02888
## F-statistic: 2.732 on 4 and 229 DF, p-value: 0.02985
summary(pred_mpg)$coefficients %>%
as_tibble() %>%
filter(`Pr(>|t|)` < .05)
## # A tibble: 1 × 4
## Estimate `Std. Error` `t value` `Pr(>|t|)`
## <dbl> <dbl> <dbl> <dbl>
## 1 2.88 1.35 2.13 0.0343
#Notice only one is significant. There's also only one significant contrast.
summary(emmeans::lsmeans(pred_mpg, pairwise ~ fl))$contrasts %>%
as_tibble() %>%
filter(p.value<.05)
## # A tibble: 1 × 6
## contrast estimate SE df t.ratio p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 e - p 1.41 0.484 229 2.92 0.0313
rename() is how we can rename many variables at once. It takes a dataset as the first argument, like all dplyr verbs, and then takes newname = oldname pairs, separated by commas, with no c() needed.
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var" "newvar"
data <- data %>% rename(petal.length=Petal.Length,
professor = newvar)
names(data)
## [1] "Sepal.Length" "Sepal.Width" "petal.length" "Petal.Width" "Species"
## [6] "var" "professor"
data <- data %>% rename(Petal.Length = petal.length,
newvar = professor)
select() is used to select certain columns that we either want to keep or want to drop. You will learn that we can say "not" in two different ways, using the - sign or using the ! sign. While it is tricky to figure out which to use, generally default to -, and use ! when you're doing some kind of logical operation (is X larger than Y?).
Select can also move the order of the columns around if you tell it to keep everything using everything().
#This will force you to also take var (a grouping variable) into data_small.
data_small <- data %>% dplyr::select(Species, Sepal.Length)
names(data_small)
## [1] "Species" "Sepal.Length"
#We can avoid that by using ungroup()
data_small2 <- data %>%
ungroup() %>%
dplyr::select(Species, Sepal.Length)
names(data_small2)
## [1] "Species" "Sepal.Length"
#Using everything() will simply move the order of the variables.
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var" "newvar"
data_move <- data %>%
dplyr::select(Species, everything())
names(data_move)
## [1] "Species" "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [6] "var" "newvar"
rm(data_move, data_small2, data_small)
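We have not yet shown dropping a column; a minimal sketch using the minus sign:
iris %>%
dplyr::select(-Petal.Width) %>%
names()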
pull() is a verb that says quite literally “pull” this one vector out from the dataset. We can use it to pull out a variable for mean and standard deviation purposes. You saw it above in mean() and sd() using piping.
iris %>%
pull(Sepal.Length) %>%
mean()
## [1] 5.843333
pluck() is similar to pull(), but it plucks values that may come out of a function. We might see this where we would generally call $ on a function's output. Sometimes these $ components are printed in the output, but other times they are not.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
chisq.test(UCB.table)$expected
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
UCB.table %>% chisq.test() %>% pluck("expected")
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
rowid_to_column() is a simple function that can be used at various points; it simply assigns a numeric id to our rows if we do not already have one, and it calls the new column it generates rowid.
data %>%
rowid_to_column() %>%
head
## # A tibble: 6 × 8
## rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species var newvar
## <int> <dbl> <dbl> <dbl> <dbl> <fct> <chr> <int>
## 1 1 6.1 3.5 1.4 0.2 setosa Dr. Ca… 0
## 2 2 5.9 3 1.4 0.2 setosa Dr. Ki… 1
## 3 3 5.7 3.2 1.3 0.2 setosa Dr. Ca… 0
## 4 4 5.6 3.1 1.5 0.2 setosa Dr. Ki… 0
## 5 5 6 3.6 1.4 0.2 setosa Dr. Ca… 0
## 6 6 6.4 3.9 1.7 0.4 setosa Dr. Ki… 1
See also: gather(), rowid_to_column()
pivot_longer() takes wide data and makes it longer.
#Using -c() allows me to say "Don't stack these columns on top of each other.
#But for everything else, stack them on top of each other in a new column called Measure
#and put their values in a new column called Value."
#https://stackoverflow.com/questions/57977470/is-pivot-longer-and-pivot-wider-transitive
data_long <- data %>% rowid_to_column() %>%
pivot_longer(cols =-c(Species, var, newvar, rowid),
names_to="Measure",
values_to="Value")
head(data)
## # A tibble: 6 × 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species var newvar
## <dbl> <dbl> <dbl> <dbl> <fct> <chr> <int>
## 1 6.1 3.5 1.4 0.2 setosa Dr. Carriere 0
## 2 5.9 3 1.4 0.2 setosa Dr. Kilgore 1
## 3 5.7 3.2 1.3 0.2 setosa Dr. Carriere 0
## 4 5.6 3.1 1.5 0.2 setosa Dr. Kilgore 0
## 5 6 3.6 1.4 0.2 setosa Dr. Carriere 0
## 6 6.4 3.9 1.7 0.4 setosa Dr. Kilgore 1
#???? <- ???? %>%
# pivot_longer(cols = starts_with(c("???","???")),
# names_to = c("????", ".value"),
# names_sep="????")
See also: spread(), rowid_to_column()
pivot_wider() takes long data and makes it wider.
data_wide <- data_long %>%
pivot_wider(names_from = "Measure", values_from = "Value")
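As a quick check, you can confirm that the widened data has the original columns back (plus rowid):
names(data_wide)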
see also: summarize()
We can use summarise() to build new datasets for us with summary data. This is especially helpful when our dataset has been grouped by group_by(). It also takes new variable names, and later lines can use the variables we have created on earlier lines within the same call. This is one of our primary ways to get visualization data, as well as to get summary statistics that might be helpful for describing means and standard errors in results.
The lab's preferred spelling is with an s. This is important, especially if you do not clarify with :: which package you mean. That is, dplyr::summarize() will function okay, but a bare summarize() can be masked by another package and behave differently than you would like it to. It is better to be safe and use summarise() or dplyr::summarise().
data_sum <- data %>%
dplyr::summarise(
meanSL=mean(Sepal.Length),
sdSL = sd(Sepal.Length),
nSL = sum(!is.na(Sepal.Length)),
seSL = sdSL/sqrt(nSL), #See here, standard error uses sdSL
#and nSL that we calculated above.
meanSW = mean(Sepal.Width),
sdSW = sd(Sepal.Width),
nSW = sum(!is.na(Sepal.Width)), #refer to "! operator" and is.na() to understand.
seSW = sdSW/sqrt(nSW)
)
head(data_sum)
## # A tibble: 1 × 8
## meanSL sdSL nSL seSL meanSW sdSW nSW seSW
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 6.84 0.828 150 0.0676 3.06 0.436 150 0.0356
gather() is the old version of pivot_longer(). While it works just as well, it will be easier for the lab to focus on pivot_longer(), and any past code may be outdated to other uses.
spread() is the old version of pivot_wider(). While it works just as well, it will be easier for the lab to focus on pivot_wider() and any past code may be outdated to other uses.
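If you run into older scripts, here is a rough, commented-out sketch of what the equivalent legacy calls might look like (the column arguments are illustrative):
#data %>% gather(key = "Measure", value = "Value", -Species, -var, -newvar)
#data_long %>% spread(key = Measure, value = Value)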
Creating averages for variables is a key part of cleaning your data. We typically ask a multitude of questions in a scale and we want to get the overall average score for the scale.
The first simple way is rowMeans(). rowMeans() takes the mean across all columns for each row. So, we need to first subset our data to only the columns we want. What is left is a dataset purely of the columns we want the average of, which can then be run through rowMeans() to get an average for each row, which we can then assign to a new variable. I also run the global mean on this variable to compare it to the next solution.
iris$Sepal.Example <- rowMeans(
subset(
iris,
select=c(
Sepal.Length,
Sepal.Width
)))
mean(iris$Sepal.Example)
## [1] 4.450333
However, rowMeans() will start to misbehave if we’re using a grouped dataset through our tibbles and our group_by() and if we are trying to pipe it in. If so, we would want to shift to using rowwise() and using the mean() function.
see also: psych::alpha() , summarystats(), group_by()
rowwise() is very similar to group_by() in that it doesn't change anything except how the dataset behaves. Before, if we were to call mean(), we would get the average across the whole dataset. But using rowwise(), we get it for each individual row. Then we can use mean() and pass the variable names as we are used to doing. Note that, as always with piping, we don't need the dollar sign to signify the variables, because R is told explicitly at the beginning of the pipe which dataset to use.
iris <- iris %>%
group_by(Species) %>%
rowwise() %>%
mutate(Sepal.Example2 = mean(c(Sepal.Length, Sepal.Width), na.rm=TRUE))
mean(iris$Sepal.Example2)
## [1] 4.450333
The lab uses ggplot2 for all data visualizations. ggplot2 can be broken down into visualizations that take summary data (geom_line; geom_bar; geom_col; geom_point) and visualizations that take the whole dataset (raincloud plots; geom_point). The lab is moving towards using raincloud plots for most visualizations.
All ggplots start with a base layer that will show nothing except the axes. The idea is that we just continually add layers on top of each other. It is important to think like this because indeed, that’s how things work, so if you want something behind/in front of something else, consider the order in which you are placing layers.
Important to note in the code below that the variables we care about are wrapped in aes(). aes() stands for aesthetics, or visually, what we want to change. If we want things to change based on a certain variable, it probably wants to go into aes. If we want something to change globally (for all points), then we probably want it outside of aes.
Both of these are possible plots - that is, all ggplot2 needs is a dataset. It doesn't care if it's the whole data or a summarized version of the data. To ensure this section works well, I will always be piping data just like this at the top of each example. However, please note that in practice, it is better to just save your data into your environment and then pipe that saved object into ggplot().
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))
iris %>% group_by(Species) %>%
dplyr::summarise(
meanSL=mean(Sepal.Length, na.rm=TRUE),
sdSL = sd(Sepal.Length, na.rm=TRUE),
nSL = sum(!is.na(Sepal.Length)),
seSL = sdSL/sqrt(nSL)) %>%
ggplot2::ggplot(aes(x=Species, y=meanSL))
see also: geom_line(), lollipop graphs, strip chart
geom_point() makes scatterplots, so we can expect geom_point() to want the full dataset to plot things well. However, this is not a requirement, and there are many graphs that use geom_point() with a summarized dataset as well, where you are highlighting the points that appear on geom_line() (see Maglio et al., 2014 for a use of geom_point() with summarized data; see also lollipop graphs).
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point()
See also: jitter plots, strip chart
Above, the points were overlapping each other. We can set the position of the points to be different so they don't overlap anymore. We call this jittering. The position= argument is important for a few of the geoms.
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2))
As stated, outside of aes() will set global colors.
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2), color="blue")
see also: theme, legend
Inside of aes() we define things by variables. Mapping something like color to a variable inside aes() will also lead to a legend appearing, as we can see.
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2), aes(color=Species))
Points can change their shapes. You can find all the shape options using show_point_shapes() from ggpubr. Though, I will note, when I look at this, the shape numbers don't always match what gets graphed. However, it can at least give you a general idea; use trial and error to find what you want.
ggpubr::show_point_shapes()
iris %>%
ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2),
aes(color=Species),
shape=21)
Similarly, you can move most things from outside of aes() into aes() if you want them to change by group. Inside of aes() is generally how you get things to appear on legends.
iris %>%
ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2),
aes(color=Species,
shape=Species))
see also: geom_col()
geom_bar() is a basic bar chart. It has a lot of possible uses, one of which makes it do exactly what geom_col() does if geom_bar() has stat="identity" set as an argument. Its general use is to map the number of observations per group, which by itself is not a very useful visualization.
iris %>%
ggplot(aes(x=Species, group=Species))+
geom_bar(position="dodge",
colour="black",
aes(fill=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Number of Obs. Per Group", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5))
#mimicking geom_col()
iris %>%
group_by(Species) %>%
summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_bar(stat="identity", position="dodge", colour="black", aes(fill=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
see: geom_bar()
geom_col() does what we want geom_bar() to do by default - that is, map the averages of various grouping variables.
iris %>% group_by(Species) %>% summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_col(position="dodge", colour="black", aes(fill=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
Boxplots show the median, the interquartile range, and potential outliers.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_boxplot(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("#FFFFFF", "#123456", "#654321"))
see also: Shapiro(), qqplot()
Histograms help show whether a certain variable is normally distributed.
iris %>% ggplot2::ggplot(aes(x=Sepal.Length))+
geom_histogram(binwidth=.1)
By using facet_wrap() (or facet_grid()), we can even graph these separately by some explanatory variable to see if each group is normally distributed.
iris %>% ggplot2::ggplot(aes(x=Sepal.Length))+
geom_histogram(binwidth=.1)+
facet_wrap(~Species) #Check nrow and ncol as options to adjust.
We could display the above data also using lines instead of bars. For most kinds of visualizations you would see, if we're using geom_line(), we probably want to feed it data in this more compressed form. You should try on your own; really it kind of just comes down to trial and error on how we want them to look. But let's add on geom_line() instead of geom_bar(), and let's pass through two arguments: inside aes() we'll say color is equal to Species, and outside of aes() we'll set size equal to 1.2.
iris %>%
group_by(Species) %>%
summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_line(size=1.2, aes(color=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3,
position=position_dodge(.9))
See: geom_bar(), geom_col(), geom_line()
geom_errorbar() is used to create the error bars however we define them. For Psychology, we generally plot the mean +/- 1 SE bars. I plot them alone here to visually show them, but you can see their true functionality in other plots. The big thing to notice about them is that they take a ymin and a ymax, and so we calculate those ahead of time and place them into our compressed dataset.
iris %>%
group_by(Species) %>%
summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3,
position=position_dodge(.9))
see: annotate
Even better would probably be to still use a geom_, and for that, we have geom_segment(). Notice how we include two dataframes in this ggplot without issue.
x <- c(1, 1.5, 1.5, 1)
xend <- c(1.5, 1.5, 1, 1)
y <- c(2.5, 2.5, 4, 4)
yend <- c(2.5, 4, 4, 2.5)
line <- c(1:4)
box <- data.frame(xend, x, yend, y, line)
rm(xend, x, yend, y, line)
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)+
annotate("text", label="I made a box!", x=1.25, y=3.25)+
geom_segment(data=box, aes(x=x, xend=xend, y=y, yend=yend))
see also: geom_boxplot()
Violin plots are great because they take all of the data (no need to summarize it) and they also show the distribution of the data.
iris %>%
ggplot(aes(x=Species, y=Petal.Length, fill=Species))+
geom_violin()+
scale_color_manual(values=c("red", "blue", "green"))
You can add various things of interest to violin plots. For example, if you want to plot the boxplot on top, we just put that boxplot next in line. Notice in this one, I move fill=Species in the aes() out of the entire ggplot() call and into only the piece I want to color in.
iris %>%
ggplot(aes(x=Species, y=Petal.Length))+
geom_violin(aes(fill=Species))+
geom_boxplot()+
scale_color_manual(values=c("red", "blue", "green"))
see also: chisq.test(), chisq_test()
Mosaic plots are great for visualizing two categorical variables and considering chi-square tests.
For example, imagine we have a plot of gender by admissions to University of California Berkeley.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
We can use mosaicplot() on this table. We can specify the colors, the axis labels, and the main title. Also notice that \n was used to create a line break, which is something we can use in a lot of places.
mosaicplot(UCB.table,
xlab="Admission Status",
ylab="Gender",
main="University of California Berkeley \n Admission by Gender",
color=c("darkred", "gold"))
We can plot predictions in a lot of ways. One way is through sjPlot’s plot_model feature.
library(sjPlot)
model <-
mpg %>%
mutate(dodge = case_when(manufacturer=="dodge" ~ "Dodge",
manufacturer!="dodge" ~ "Not Dodge"),
old = case_when(year==1999 ~ "Old",
year==2008 ~ "New")) %>%
lm(hwy ~ dodge* old + cty, .)
#plots predictions
sjPlot::plot_model(model, type="pred", terms=c("dodge","old"))+
ylab("")+
xlab("")+
ggtitle("")+
guides(color=guide_legend(title=""))
We can also plot the estimates as a forest plot.
model <-
mpg %>%
mutate(dodge = case_when(manufacturer=="dodge" ~ "Dodge",
manufacturer!="dodge" ~ "Not Dodge"),
old = case_when(year==1999 ~ "Old",
year==2008 ~ "New")) %>%
lm(hwy ~ dodge* old + cty, .)
fitted_plot<- plot_model(model,
axis.labels=c("coef1", "coef2", "coef3", "coef4"),
group.terms = c(1, 2,
3, 2),
vline.color = "Black",
colors=c("DarkGrey", "#046C9A", "#DB545E"))+
ggtitle("title")
fitted_plot[["data"]] <- fitted_plot[["data"]][1:4,]
fitted_plot
library(wesanderson)
library(cowplot)
library(PupillometryR)
library(tidyverse)
ggplot(iris,aes(x=Species,y=Petal.Length, fill = Species, colour = Species))+
geom_flat_violin(position = position_nudge(x = .25, y = 0),
adjust =2,
trim = TRUE)+
geom_point(position = position_jitter(width = .15),
size = .25)+
geom_boxplot(aes(x = (Species), y = Petal.Length),
outlier.shape = NA,
alpha = 0.3,
width = .1,
colour = "BLACK") +
theme_cowplot()+
ggtitle("Raincloud Plot")+
scale_fill_manual(values = wes_palette("Zissou1",
n = 3))+
scale_color_manual(values = wes_palette("Zissou1",
n = 3))+
theme(legend.position = "none",
axis.text.x = element_text(size=8))
Lollipop plots are more modern looking barplots, even if they don’t give as much information as a violin plot or a boxplot.
The second example is a created double lollipop, or barbell, plot.
assignment <- c("Attendance", "Participation", "Lead Discussion",
"Discussion Posts", "Reviewing Papers", "Midterm",
"Project Topic", "Method Identification", "Ethics Training",
"Annotated Bibliography", "Class Pres 1", "Submit IRB",
"Class Pres 2", "Poster Presentation", "Final Paper"
)
Management <- c(0, 10, 7.5, 7.5, 10, 15, 0,
2,2, 5, 2.5, 10, 2.5, 10, 17.5)
Union <- c(5, 10, 10, 10, 5, 20, 2,
2, 2, 5, 5, 5, 4, 5, 10)
data_lg <- data.frame(assignment, Management, Union)
data_lg$assignment <- factor(data_lg$assignment,
levels=c("Attendance", "Participation",
"Lead Discussion", "Discussion Posts",
"Reviewing Papers", "Midterm",
"Project Topic", "Method Identification",
"Ethics Training",
"Annotated Bibliography",
"Class Pres 1", "Submit IRB",
"Class Pres 2", "Poster Presentation",
"Final Paper"))
ggplot(data_lg) +
geom_segment(aes(x=assignment,
xend=assignment,
y=Management,
yend=Union),
color="grey") +
geom_point(aes(x=assignment, y=Management, color="Management"), size=3 ) +
geom_point(aes(x=assignment, y=Union, color="Union"), size=3 ) +
scale_color_manual(values=c(rgb(0.7,0.2,0.1,0.5), rgb(0.2,0.7,0.1,0.5)),
name="")+
coord_flip()+
xlab("") +
ylab("Weight of Assignment for Final Grade")+
theme(
legend.position = "right"
)
Pie charts aren’t a good way to visualize things, and ggplot2 makes that very apparent by making it not the smoothest thing to make. Here, I make a donut chart (a pie chart with a hole in the middle) by combining geom_col() with coord_polar().
Other search terms: pie chart, donut chart
df2 <- data.frame(value = c(6, 15, 78),
group = c("Decreasing", "Staying the same", "Increasing"),
hsize= 3)
df2$group <- factor(df2$group, levels=c("Increasing",
"Decreasing",
"Staying the same"))
df2$label <- paste0(df2$value, "%")
ggplot(df2, aes(x = hsize, y = value, fill = group)) +
geom_col(color = "black") +
geom_text(aes(label = label),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("#C27D38","#CCC591","#798E87" ),
name="Partisan divide is...")+
xlim(c(0.2, 3 + 0.5)) +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
legend.position = c(.5,.5))+
ggtitle("Among Democrats...")
other search terms: facet_grid()
facet_wrap() and facet_grid() are ways to add an additional variable onto your visualization. Many times, we don't want the groups to overlap as they did above, but instead want to see them side by side in separate panels.
mpg %>%
mutate(
trans = case_when(
str_detect(trans, "auto") ~ "auto",
str_detect(trans, "manual") ~ "manual"
)) %>%
ggplot(aes(x=displ, y=hwy))+
geom_point()+
facet_wrap(~trans)
There are so many options in theme() that it's hard to even start writing notes about it. If there's a feature of the plot background or the plot itself you want to change, it's probably in theme(). Importantly, theme() almost never takes the words you want displayed. Instead, you should imagine theme() as controlling the format of the plot itself. So, for instance, theme() can edit the font size of the y axis, but it can't set what the y axis label says.
Also, a key thing to use, as you can see in other examples, is using element_blank() to get rid of things.
If you have multiple things, keep them all in a single theme().
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
theme(
panel.background = element_rect(fill = "white"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text = element_text(angle=90)
#... use ?theme to find literally over fifty possible options.
)
Makes your x axis label. Put it in quotes.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")
Makes your y axis label. Put it in quotes.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
ylab("Petal Length")
Makes a horizontal line; it needs a yintercept.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_hline(yintercept=4, color="red", linetype=2)
Makes a vertical line; it needs an xintercept.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)
Also search: cld
library(datarium)
library(lmerTest)
library(emmeans)
selfestdat2 <- selfesteem2
SelfEst.Lmer <- selfestdat2 %>%
pivot_longer(cols=-c(id, treatment),
names_to="Time",
values_to="SelfEsteem") %>%
mutate(Time = factor(Time, levels=c("t1", "t2", "t3"))) %>%
lmer(SelfEsteem ~ Time*treatment + (1|id), data=.)
marginal_lmer <- emmeans(SelfEst.Lmer, ~Time)
pairs(marginal_lmer, adjust="tukey")
## contrast estimate SE df t.ratio p.value
## t1 - t2 1.96 0.802 55 2.442 0.0462
## t1 - t3 4.62 0.802 55 5.767 <.0001
## t2 - t3 2.67 0.802 55 3.325 0.0044
##
## Results are averaged over the levels of: treatment
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 3 estimates
CLD = multcomp::cld(marginal_lmer, alpha=0.05,
Letters=letters, ### lc letters
adjust="sidak")
CLD$.group=gsub(" ", "", CLD$.group)
CLD %>% ggplot(aes(x=Time, y=emmean, group=Time, label=.group))+
geom_col(position="dodge", colour="black", aes(fill=factor(Time)))+
#scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(ymin=emmean-SE, ymax=emmean+SE),
width=.3,
position=position_dodge(.9))+
geom_text(nudge_y=10)
annotate() allows the placement of text and lines. This can be done very manually by listing x, y, xend, and yend.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)+
annotate("text", label="Label!", x=1, y=2.5)+
annotate("segment", x=1, xend=1.5, y=4, yend=4)
But if you really wanted to, you can pass an additional dataframe in and use those values to map your lines for you.
x <- c(1, 1.5, 1.5, 1)
xend <- c(1.5, 1.5, 1, 1)
y <- c(2.5, 2.5, 4, 4)
yend <- c(2.5, 4, 4, 2.5)
line <- c(1:4)
box <- data.frame(xend, x, yend, y, line)
rm(xend, x, yend, y, line)
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)+
annotate("text", label="I made a box!", x=1.25, y=3.25)+
annotate("segment", x=box$x, xend=box$xend, y=box$y, yend=box$yend)
Review: shape (inside of aes())
If you map color to a variable in an aes(), you can also manually set the colors. Color generally refers to the lines and outlines of a geom_, while fill (below) refers to the colors within the shapes themselves.
See below how this will create a legend and define the colors.
iris %>% ggplot(aes(x=Species, y=Petal.Length, color=Species))+
geom_jitter()+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("blue", "red", "gray"))
This will also do that, but since the aes(color=Species) is outside of the main ggplot() call and inside geom_jitter(), the coloring will only apply to geom_jitter().
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("blue", "red", "gray"))
You don’t have to name the colors explicitly. You can also draw from various palettes. Dr. C uses library(wesanderson) and library(harrypotteR) for his palettes.
library(wesanderson)
wp3 <- wesanderson::wes_palette("Darjeeling2", 3, type="discrete")
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c(wp3))
rm(wp3)
You can also use the HEX codes.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_boxplot(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("#FFFFFF", "#123456", "#654321"))
Now repeat this, but use fill.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_boxplot(aes(fill=Species))+
xlab("Species")+
scale_fill_manual(values=c("#FFFFFF", "#123456", "#654321"))
ggpattern is its own library. Dr. C hasn't done a lot of looking into it, but it seems to be what you would expect - a way to fill ggplots with patterns instead of colors. The help file seems to have a lot of applications you could check out, but I'm not going to list each here, since it creates new functions that work just like their standard counterparts, but with a few additional arguments. I provide two examples.
#Stealing the code from geom_col()
# The only change made is that it goes to geom_col_pattern()
# and fill=Species becomes pattern=Species.
library(ggpattern)
iris %>% group_by(Species) %>% summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_col_pattern(position="dodge",
colour="black",
aes(pattern=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
And yes, you can make it cats too if you’d like.
#Stealing the code from geom_col()
# The only change made is that it goes to geom_col_pattern()
# and fill=Species becomes pattern=Species.
library(ggpattern)
iris %>% group_by(Species) %>% summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_col_pattern(position="dodge",
colour="black",
pattern_type='kitten',
pattern='placeholder',
aes(pattern_angle = Species))+
#scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5),
legend.position="none") +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
t.test() works for a bunch of simple t-tests. You can run a two-sample t-test like so, where the formula is read DV ~ IV:
t.test(extra ~ group, data = sleep)
##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
You can compare sample means to theoretical population means.
t.test(sleep$extra, mu=4)
##
## One Sample t-test
##
## data: sleep$extra
## t = -5.4519, df = 19, p-value = 2.927e-05
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
## 0.5955845 2.4844155
## sample estimates:
## mean of x
## 1.54
t.test(sleep$extra, mu=4, conf.level = .99)
##
## One Sample t-test
##
## data: sleep$extra
## t = -5.4519, df = 19, p-value = 2.927e-05
## alternative hypothesis: true mean is not equal to 4
## 99 percent confidence interval:
## 0.2490875 2.8309125
## sample estimates:
## mean of x
## 1.54
You can use the additional information exported from t.test to get the confidence intervals of a mean.
t.test(sleep$extra)$conf.int
## [1] 0.5955845 2.4844155
## attr(,"conf.level")
## [1] 0.95
sleep %>% select(extra) %>% t.test() %>% .$conf.int
## [1] 0.5955845 2.4844155
## attr(,"conf.level")
## [1] 0.95
You can use paired data to also test if the two means are different.
sleep2 <- stats::reshape(sleep, direction = "wide",
idvar = "ID", timevar = "group")
t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)
##
## Paired t-test
##
## data: Pair(extra.1, extra.2)
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
But in class we mainly learned it as:
t.test(sleep2$extra.1, sleep2$extra.2, paired=TRUE)
##
## Paired t-test
##
## data: sleep2$extra.1 and sleep2$extra.2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
see also: anova(), Anova(), TukeyHSD()
aov() stands for analysis of variance. aov() does not take a lm() as an input. However, aov() is used when we want to do our anova() test but we also want to check the post-hoc comparisons.
summary(aov(Petal.Length ~ Species, data=iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey's Honest Significant Difference is a way to get post-hoc tests. It already controls for multiple comparisons by using a distribution (the studentized range) that accounts for the number of comparisons being made. It takes an aov() object as input.
iris %>%
aov(Petal.Length ~ Species, data=.) %>%
TukeyHSD()
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Length ~ Species, data = .)
##
## $Species
## diff lwr upr p adj
## virginica-versicolor 1.292 1.08822 1.49578 0
## setosa-versicolor -2.798 -3.00178 -2.59422 0
## setosa-virginica -4.090 -4.29378 -3.88622 0
Linear regression models are what everything in Psychological Statistics is built off of, and so if we publish a paper, we're using lm(). You can notice that it gives similar information to using aov() and then TukeyHSD().
iris %>%
lm(Petal.Length ~ Species, data=.) %>%
summary()
##
## Call:
## lm(formula = Petal.Length ~ Species, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.260 -0.258 0.038 0.240 1.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.26000 0.06086 70.00 <2e-16 ***
## Speciesvirginica 1.29200 0.08607 15.01 <2e-16 ***
## Speciessetosa -2.79800 0.08607 -32.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.9406
## F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
If we're doing within-subjects stuff, we're using lmer(). lmer() stands for linear mixed-effects regression. It gives you the ability to include things as random effects. Random effects: ask your professor.
Dr. C believes that Blocking should use random effect modeling. Dr. K doesn’t agree.
library(broom.mixed)
library(lme4)
yield <- c(93.5, 66.6, 50.5, 42.4,
102.9, 53.2, 47.4, 43.8,
67.0, 54.7, 50, 40.1,
86.3, 61.3, 50.7, 46.4)
block <- c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4))
variety <- rep(1:4, 4)
Barley <- data.frame(yield, block, variety)
Barley$variety <- as.factor(Barley$variety)
Barley$block <- as.factor(Barley$block)
Barley %>%
mutate(variety = as.factor(variety)) %>%
lmer(yield ~ variety + (1|block), .) %>%
summary()
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ variety + (1 | block)
## Data: .
##
## REML criterion at convergence: 90.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.32860 -0.23296 0.00454 0.28642 1.86536
##
## Random effects:
## Groups Name Variance Std.Dev.
## block (Intercept) 5.533 2.352
## Residual 64.288 8.018
## Number of obs: 16, groups: block, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 87.425 4.178 11.778 20.925 1.11e-10 ***
## variety2 -28.475 5.670 9.000 -5.022 0.000717 ***
## variety3 -37.775 5.670 9.000 -6.663 9.24e-05 ***
## variety4 -44.250 5.670 9.000 -7.805 2.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) varty2 varty3
## variety2 -0.679
## variety3 -0.679 0.500
## variety4 -0.679 0.500 0.500
Compare the Estimates of the varieties compared to…
library(multcomp)
library(broom)
Barley %>%
mutate(variety = as.factor(variety)) %>%
lmer(yield ~ variety + (1|block), .) %>%
tidy() %>%
filter(effect=="fixed") %>%
dplyr::select(estimate)
## # A tibble: 4 × 1
## estimate
## <dbl>
## 1 87.4
## 2 -28.5
## 3 -37.8
## 4 -44.2
Barley.lmer <- lmer(yield ~ factor(variety) + (1|block), data=Barley)
marginal_lmer <- emmeans(Barley.lmer, ~variety)
pairs(marginal_lmer, adjust="tukey")
## contrast estimate SE df t.ratio p.value
## 1 - 2 28.48 5.67 9 5.022 0.0033
## 1 - 3 37.77 5.67 9 6.663 0.0004
## 1 - 4 44.25 5.67 9 7.805 0.0001
## 2 - 3 9.30 5.67 9 1.640 0.4053
## 2 - 4 15.78 5.67 9 2.782 0.0833
## 3 - 4 6.47 5.67 9 1.142 0.6747
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 4 estimates
CLD = cld(marginal_lmer, alpha=0.05,
Letters=letters, ### lc letters
adjust="sidak")
CLD
## variety emmean SE df lower.CL upper.CL .group
## 4 43.2 4.18 11.8 30.9 55.4 a
## 3 49.6 4.18 11.8 37.4 61.9 a
## 2 59.0 4.18 11.8 46.7 71.2 a
## 1 87.4 4.18 11.8 75.2 99.7 b
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 4 estimates
## P value adjustment: sidak method for 6 tests
## significance level used: alpha = 0.05
## NOTE: Compact letter displays can be misleading
## because they show NON-findings rather than findings.
## Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
Barley.lm <- lm(yield ~ factor(block) + factor(variety), data=Barley)
marginal = emmeans(Barley.lm, ~variety)
pairs(marginal, adjust="tukey")
## contrast estimate SE df t.ratio p.value
## 1 - 2 28.48 5.67 9 5.022 0.0033
## 1 - 3 37.77 5.67 9 6.663 0.0004
## 1 - 4 44.25 5.67 9 7.805 0.0001
## 2 - 3 9.30 5.67 9 1.640 0.4053
## 2 - 4 15.78 5.67 9 2.782 0.0833
## 3 - 4 6.47 5.67 9 1.142 0.6747
##
## Results are averaged over the levels of: block
## P value adjustment: tukey method for comparing a family of 4 estimates
CLD = cld(marginal, alpha=0.05,
Letters=letters, ### lc letters
adjust="sidak")
CLD
## variety emmean SE df lower.CL upper.CL .group
## 4 43.2 4.01 9 30.8 55.6 a
## 3 49.6 4.01 9 37.2 62.1 a
## 2 59.0 4.01 9 46.5 71.4 a
## 1 87.4 4.01 9 75.0 99.8 b
##
## Results are averaged over the levels of: block
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 4 estimates
## P value adjustment: sidak method for 6 tests
## significance level used: alpha = 0.05
## NOTE: Compact letter displays can be misleading
## because they show NON-findings rather than findings.
## Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
And note:
see: summarystats()
Dr. C has his own function that pulls from this to get summary stats of scales. However, alpha can be thought of as a summary of the correlations between every possible pair of items in a scale. Items = questions; many scales have more than one question.
psych::alpha wants a dataset of only the items in a scale. It also will do a decent job detecting if you forgot to reverse code something.
library(psych)
library(tidyverse)
iris %>%
ungroup() %>%
dplyr::select(Petal.Length, Petal.Width, Sepal.Length) %>%
psych::alpha(.)
##
## Reliability analysis
## Call: psych::alpha(x = .)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.88 0.96 0.96 0.88 23 0.0067 3.6 1.1 0.87
##
## lower alpha upper 95% confidence boundaries
## 0.86 0.88 0.89
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Petal.Length 0.90 0.90 0.82 0.82 9 0.0165 NA 0.82
## Petal.Width 0.80 0.93 0.87 0.87 14 0.0164 NA 0.87
## Sepal.Length 0.82 0.98 0.96 0.96 52 0.0079 NA 0.96
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Petal.Length 150 0.99 0.98 0.99 0.96 3.8 1.77
## Petal.Width 150 0.97 0.96 0.96 0.94 1.2 0.76
## Sepal.Length 150 0.92 0.93 0.87 0.86 5.8 0.83
Cohen's d is one measure of effect size: the difference between two means divided by the pooled standard deviation. Treat it like any normal equation. Note: effect sizes are calculated outside of considerations of p values. That is, you can still report effect sizes with non-significant results.
library(lsr)
starwars %>% cohensD(mass ~ gender, .)
## [1] 0.3004285
Functions are created using function(), with the arguments inside the parentheses and the body of the function inside curly brackets.
coolcat <- function(x){
paste0(x, " is a cool cat")
}
coolcat("Dr. C")
## [1] "Dr. C is a cool cat"
see also: []
Imagine we have a data frame of three columns. Two of which we think are numbers, but R is reading them as characters. We know there’s a function as.numeric(). sapply() can apply a single function across multiple columns. To do that, we can pass a list of column names to target.
example1 <- c("1", "2", "3")
example2 <- c("3", "4", "5")
notthiscolumn <- c("a", "b", "c")
exampledf <- data.frame(example1, example2, notthiscolumn)
summary(exampledf)
## example1 example2 notthiscolumn
## Length:3 Length:3 Length:3
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
listofcols <- c("example1", "example2")
exampledf[listofcols] <- sapply(exampledf[listofcols], as.numeric)
summary(exampledf)
## example1 example2 notthiscolumn
## Min. :1.0 Min. :3.0 Length:3
## 1st Qu.:1.5 1st Qu.:3.5 Class :character
## Median :2.0 Median :4.0 Mode :character
## Mean :2.0 Mean :4.0
## 3rd Qu.:2.5 3rd Qu.:4.5
## Max. :3.0 Max. :5.0
see also: sapply
factorise() is the lab’s function to quickly change most scale wordings to numeric values. It grows as we use various anchors and also tries to cover for capitalization changes based on student. It can take a single variable, but for efficiency, we can use sapply to pass it through multiple variables at once.
factorise <- function(x) {
dplyr::case_when(x %in% c("No") ~ 0,
x %in% c("Strongly disagree", "Strongly Disagree",
"Not well at all",
"Not at all",
"Much less than the average person",
"Very unconcerned", "very unconcerned",
"Very unimportant", "Yes") ~ 1,
x %in% c("Disagree",
"Slightly well",
"Not very much",
"Less than the average person",
"Slightly unconcerned", "slightly unconcerned",
"Slightly unimportant") ~ 2,
x %in% c("Somewhat disagree", "Somewhat Disagree",
"Moderately well",
"Barely at all",
"Slightly less than the average person",
"Neither concerned nor unconcerned",
"neither concerned nor unconcerned",
"3. Slightly confident ",
"Neither important nor unimportant") ~ 3,
x %in% c("Neither agree nor disagree", "Neither disagree nor agree",
"Neither Agree Nor Disagree", "Neither Disagree Nor Agree",
"Extremely well",
"Neutral",
"About the same as the average person",
"Slightly concerned", "slightly concerned",
"Slightly important") ~ 4,
x %in% c("Somewhat agree", "Somewhat Agree",
"A fair amount",
"Slightly more than the average person",
"Very concerned", "very concerned",
"Very important") ~ 5,
x %in% c("Agree",
"Very Much",
"More than the average person") ~ 6,
x %in% c("Strongly agree", "Strongly Agree",
"Extremely",
"Much more than the average person") ~ 7
)
}
data$var1 <- rep(c("Agree", "Disagree"), 75)
data$var2 <- rep(c("Strongly disagree", "Strongly agree"), 75)
data$var3 <- rep(c("Strongly Agree", "Strongly Disagree"), 75)
#List the variables that we want to run factorise on. Use quotes.
columns <- c("var1", "var2", "var3")
#Use sapply for all data where the columns can be called using quotes.
#To realize why this is working, reference "Brackets".
#run factorise on those columns.
data[columns] <- sapply(data[columns], factorise)
#see they're all numbers. Also note the difference in caps was covered in var 2-3.
table(data$var1)
##
## 2 6
## 75 75
table(data$var2)
##
## 1 7
## 75 75
table(data$var3)
##
## 1 7
## 75 75
summarystats() is the lab's quick way to get the alpha, mean, and standard deviation of a scale. It takes two arguments. The first is a dataset of only the columns that include the questions that make up the scale; prior to running summarystats(), you would need to create that dataset using select(). The second argument is the overall averaged variable in our main dataset; to make this one, you would need to use rowMeans() to create an overall average of the scale that will be used in the analysis.
summarystats <- function(input1, input2){
list <- c(psych::alpha(input1)$total[1],
mean=mean(input2, na.rm=TRUE),
sd=sd(input2, na.rm=TRUE))
return(list)
}
data$Sepal.Example <- rowMeans(
subset(
data,
select=c(
Sepal.Length,
Sepal.Width
)))
data_select <- data %>% subset(
select=c(
Sepal.Length,
Sepal.Width
)
)
summarystats(data_select, data$Sepal.Example)
## Number of categories should be increased in order to count frequencies.
## Warning in psych::alpha(input1): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( Sepal.Length ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Warning in sqrt(Vtc): NaNs produced
## $raw_alpha
## [1] -0.214637
##
## $mean
## [1] 4.950333
##
## $sd
## [1] 0.4446361
rm(data_select)
If you’re in BIO-245/MTH-245, you might need some other functions. This is also more for me to remember some important functions for teaching, but also for your own learning.
qt() will give the t-value for a given alpha level (don't forget to divide alpha by two if you are considering a two-tailed test) and a given degrees of freedom. Remember that as the degrees of freedom increase, the t distribution becomes more and more shaped like the z distribution, which means that qt(.025) will approach -1.96 (1.96 in absolute value) as N increases. Likewise, for a one-tailed alpha of .05 (a two-tailed alpha of .10), qt(.05, lower.tail=FALSE) will approach 1.64 as N increases.
qt(.025, 10)
## [1] -2.228139
qt(.025, 1000000)
## [1] -1.959966
qt(.05, 10, lower.tail=FALSE)
## [1] 1.812461
qt(.05, 1000000, lower.tail=FALSE)
## [1] 1.644855
Given a t-statistic and a degrees of freedom, we can use pt() to tell us the probability to the left or the right. Without lower.tail = FALSE, it will give us the left side. Many times, when calculating statistics by hand, it will be important to multiply this pt() result by 2, since we care about a two-tailed test.
For example, we know that at high N, t (or z) = 1.96 corresponds to a 5% p value. To get that, you need pt(1.96, 10000000, lower.tail=FALSE)*2. If you get a p value higher than 1, you are looking at the wrong side.
pt(2.228139, 10)
## [1] 0.975
pt(-1.959966, 1000000)#gets us .025
## [1] 0.02500002
pt(-1.959966, 1000000, lower.tail=FALSE) #be careful about negatives / sides.
## [1] 0.975
pt(-1.959966, 1000000, lower.tail=FALSE)*2 #wrong side again: a "p value" above 1 means you need to flip tails
## [1] 1.95
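To get the correct two-tailed value in code, a small sketch mirroring the prose above (using the positive t so the upper tail is the correct side):
pt(1.959966, 1000000, lower.tail=FALSE)*2 #area to the right of +1.96, doubled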
pnorm(), with lower.tail=FALSE, gives the area (probability) to the right under a standard normal curve. For example, knowing that z=1.96 corresponds to 5% in a two-tailed test:
pnorm(1.96, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.0249979
pnorm(1.96, mean=0, sd=1, lower.tail=FALSE)*2
## [1] 0.04999579
We can use pnorm to calculate the space in between two points.
If the WAIS (an adult intelligence test) has a mean of 100 and a standard deviation of 15, how do we solve what percentage of people fall between 80 and 120?
lb <- pnorm(80, mean=100, sd=15, lower.tail=FALSE)
ub <- pnorm(120, mean=100, sd=15, lower.tail=FALSE)
lb-ub
## [1] 0.8175776
qnorm() goes the other direction: given a probability value, it finds the z-score related to it.
qnorm(.975, mean=0, sd=1)
## [1] 1.959964
Also known as the Mann-Whitney U test, this is the non-parametric alternative to the t-test for two groups, comparing ranks rather than means. wilcox.test() can handle both unpaired data (the rank-sum test) and paired data (the signed-rank test); a paired sketch follows the unpaired example below.
wilcox.test(Ozone ~ Month, data = airquality,
subset = Month %in% c(5, 8))
##
## Wilcoxon rank sum test with continuity correction
##
## data: Ozone by Month
## W = 127.5, p-value = 0.0001208
## alternative hypothesis: true location shift is not equal to 0
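As mentioned above, the paired (signed-rank) version works too; a minimal sketch using two iris columns purely for illustration (R may warn about ties here, which is fine for a demo):
wilcox.test(iris$Petal.Length, iris$Petal.Width, paired=TRUE)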
We can run a permutation t-test. Here we do it for two samples.
library(exactRankTests)
mtcars %>%
mutate(am= as.factor(am)) %>%
perm.test(mpg ~ am, .)
##
## 2-sample Permutation Test (scores mapped into 1:(m+n) using rounded
## scores)
##
## data: mpg by am
## T = 174, p-value = 0.0003442
## alternative hypothesis: true mu is not equal to 0
Or we can do it for one sample comparing against some specific population mean.
perm.test(x=mtcars$mpg, mu=15)
##
## 1-sample Permutation Test (scores mapped into 1:m using rounded
## scores)
##
## data: mtcars$mpg
## T = 325, p-value = 4.833e-06
## alternative hypothesis: true mu is not equal to 15
Are the differences more positive or more negative? The sign test simply counts the signs of the paired differences. It is a basic non-parametric paired test.
Note: Here, we want data$variable.
library(BSDA)
BSDA::SIGN.test(iris$Petal.Length, iris$Petal.Width)
##
## Dependent-samples Sign-Test
##
## data: iris$Petal.Length and iris$Petal.Width
## S = 150, p-value < 2.2e-16
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
## 2.7 3.0
## sample estimates:
## median of x-y
## 2.9
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9400 2.7 3
## Interpolated CI 0.9500 2.7 3
## Upper Achieved CI 0.9591 2.7 3
For hand-calculation questions like building a confidence interval for the difference between two means, R is simply a glorified calculator.
ng1=50
ng2=60
mg1=80.44
mg2=78.42
sdg1 = 3.52
sdg2 = 2.87
dfg1 = 49
dfg2 = 59
meandiff <- mg1-mg2
tval = qt(.025, (dfg2+dfg1), lower.tail = FALSE)
pooledvar = ( (dfg1*(sdg1^2)) + (dfg2*(sdg2^2)) ) /(dfg1+dfg2)
sediff = sqrt(pooledvar*( (1/ng1) + (1/ng2) ) )
meandiff-(tval*sediff)
## [1] 0.8124732
#CI high
meandiff+(tval*sediff)
## [1] 3.227527
qf() gets us the F critical value. It takes the alpha, the degrees of freedom of the numerator, the degrees of freedom of the denominator, and lower.tail=FALSE.
qf(.025, 2, 26, lower.tail=FALSE)
## [1] 4.265483
pf() gets us the p value.
It takes the F statistic, the degrees of freedom between, the degrees of freedom within, and then be sure to set lower.tail=FALSE.
pf(6.1904, 2, 26, lower.tail=FALSE)
## [1] 0.006326612
Anova() with a capital A comes from the car package. It is useful for being more conservative in how sums of squares are calculated, because it allows us to use Type II or Type III sums of squares. Type II is good when the interaction is non-significant, while Type III is good if the interaction is significant.
anova(), on the other hand, uses Type I sums of squares. In class, we will talk about how this is less than ideal, as it calculates sums of squares in sequential order. However, anova() has its uses, such as testing the fits of two models against each other to see if a model was improved or not, something that Anova() cannot do.
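For example, a minimal sketch of that model-comparison use of anova(), using iris purely for illustration (this is not a class example; the model names are mine):
smaller <- lm(Sepal.Length ~ Petal.Length, data=iris)
larger <- lm(Sepal.Length ~ Petal.Length + Species, data=iris)
anova(smaller, larger) #tests whether adding Species improved the fit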
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.5
##
## Attaching package: 'carData'
## The following objects are masked from 'package:BSDA':
##
## Vocab, Wool
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
#Build a dataset, R doesn't have good interaction ones.
gender <- gl(n = 2, k = 20, labels = c("male","female"))
treat <- rep(rep(c("yes","no"), each=10),2)
set.seed(131)
resp <- c(
rnorm(n = 20, mean = rep(c(15,10), each =10)),
rnorm(n = 20, mean = rep(c(10,15), each =10))
)
example <- data.frame(gender, treat, resp)
example %>%
lm(resp ~ gender * treat, .) %>%
anova
## Analysis of Variance Table
##
## Response: resp
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 0.043 0.043 0.0390 0.8447
## treat 1 0.061 0.061 0.0555 0.8151
## gender:treat 1 258.274 258.274 236.4059 <2e-16 ***
## Residuals 36 39.330 1.093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
example %>%
lm(resp ~ gender * treat, .) %>%
Anova(type="II")
## Anova Table (Type II tests)
##
## Response: resp
## Sum Sq Df F value Pr(>F)
## gender 0.043 1 0.0390 0.8447
## treat 0.061 1 0.0555 0.8151
## gender:treat 258.274 1 236.4059 <2e-16 ***
## Residuals 39.330 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, our general anova() command reports a significant interaction with no significant main effects. The same output is received if using Type II. However, if trying Type III:
example %>%
lm(resp ~ gender * treat, .) %>%
Anova(type="III")
## Anova Table (Type III tests)
##
## Response: resp
## Sum Sq Df F value Pr(>F)
## (Intercept) 983.00 1 899.77 < 2.2e-16 ***
## gender 125.84 1 115.19 9.088e-13 ***
## treat 125.21 1 114.61 9.743e-13 ***
## gender:treat 258.27 1 236.41 < 2.2e-16 ***
## Residuals 39.33 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(example, aes(x=gender, y=resp, group=treat))+
  geom_col(position=position_dodge(.9))
see: pivot_longer()
There are a few ways to calculate repeated measures ANOVAs. The rstatix library is the only one that provides straightforward assumption tests built into the output, so it is the preferred method here. Its options are also relatively straightforward.
Remember that for repeated measures, we want our data in long form, so review pivot_longer() for more help.
For a one-way within subjects ANOVA, anova_test needs the response variable (dv=), the within subjects ID identifier (wid=), and the within variable(within=).
library(datarium)
library(rstatix)
library(tidyverse)
selfestdat <- selfesteem
selfestdat <- selfestdat %>%
pivot_longer(cols=-c(id),
names_to="Time",
values_to="SelfEsteem")
selfestdat %>%
anova_test(dv=SelfEsteem, wid=id, within=Time)
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Time 2 18 55.469 2.01e-08 * 0.829
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Time 0.551 0.092
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 Time 0.69 1.38, 12.42 2.16e-06 * 0.774 1.55, 13.94 6.03e-07
## p[HF]<.05
## 1 *
You can cut out the other information by piping the result into get_anova_table(), as follows.
selfestdat %>%
anova_test(dv=SelfEsteem, wid=id, within=Time) %>%
get_anova_table()
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Time 2 18 55.469 2.01e-08 * 0.829
Nothing too different from the one-way version; just list both variables in the within= argument.
library(datarium)
selfestdat2 <- selfesteem2
selfestdat2 <- selfestdat2 %>%
pivot_longer(cols=-c(id, treatment),
names_to="Time",
values_to="SelfEsteem")
selfestdat2 %>%
anova_test(dv=SelfEsteem, wid=id, within=c(Time, treatment))
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Time 2 22 27.369 1.08e-06 * 0.049
## 2 treatment 1 11 15.541 2.00e-03 * 0.059
## 3 Time:treatment 2 22 30.424 4.63e-07 * 0.050
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Time 0.469 0.023 *
## 2 Time:treatment 0.616 0.089
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF]
## 1 Time 0.653 1.31, 14.37 5.03e-05 * 0.705 1.41, 15.52
## 2 Time:treatment 0.723 1.45, 15.9 1.25e-05 * 0.803 1.61, 17.66
## p[HF] p[HF]<.05
## 1 2.81e-05 *
## 2 4.82e-06 *
Let's create a dataset to go through this. Imagine we have some response variable called reactionTime, a between-subjects treatment (Long vs. Short), a within-subjects effect of Time (Weeks 1-3), and six subjects observed over these three weeks.
The code is very similar to above, but this time, you just tell anova_test() which variable is the between, and which is the within.
subject <- c(rep("s1", 3), rep("s2", 3),
rep("s3", 3), rep("s4", 3),
rep("s5", 3), rep("s6", 3))
treat <- c(
rep(
c(
rep("Long", 3), rep("Short", 3)), 3)
)
time <- rep(c("w1", "w2", "w3"), 6)
reactionTime <- c( 466, 520, 502, 475,
494, 490, 516, 566,
577, 491, 544, 526,
484, 529, 539, 470,
511, 528)
mixeddata <- data.frame(subject, treat, time, reactionTime)
rm(subject, treat, time, reactionTime)
mixeddata %>%
anova_test(dv=reactionTime, wid=subject, within=time, between=treat)
## ANOVA Table (type II tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 treat 1 4 0.957 0.3830 0.172
## 2 time 2 8 29.638 0.0002 * 0.495
## 3 treat:time 2 8 0.717 0.5170 0.023
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 time 0.854 0.789
## 2 treat:time 0.854 0.789
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 time 0.872 1.74, 6.98 0.000462 * 1.49 2.98, 11.92 0.0002
## 2 treat:time 0.872 1.74, 6.98 0.503000 1.49 2.98, 11.92 0.5170
## p[HF]<.05
## 1 *
## 2
see lmer()
As stated, there are other ways to get similar results.
library(lmerTest)
selfestdat %>%
lmerTest::lmer(SelfEsteem ~ Time + (1|id), .) %>%
anova()
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Time 102.46 51.228 2 27 65.261 4.564e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
See: qqline(), base plotting
QQ plots tell us how well a given variable fits the normal distribution. If the variable comes from a normal distribution, the points should fall on the line.
qqnorm(iris$Sepal.Length)
Thus, something like this, where we are generating random numbers from a normal distribution, should fit very well.
qqnorm(rnorm(1000, mean=0, sd=1))
See: qqnorm(), base plotting
qqline() goes after the qqnorm to plot the line that the points should lie on.
qqnorm(iris$Sepal.Length)
qqline(iris$Sepal.Length)
The car package's qqPlot() gives us the QQ plot, the line, and a confidence interval around it.
library(car)
qqPlot(iris$Sepal.Length)
## [1] 132 118
hist() is base plotting’s fast way to plot a histogram.
hist(iris$Sepal.Width)
To change the x and y labels, pass xlab= and ylab= inside.
hist(iris$Sepal.Length, xlab="length of sepals")
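A sketch changing the y label and the main title as well (ylab= and main= are the base plotting arguments for those):
hist(iris$Sepal.Length, xlab="length of sepals", ylab="number of flowers", main="Histogram of sepal length")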
See also: residuals(), histogram(), lm()
These tests check whether data is normally distributed. shapiro.test() is useful for testing a single variable (or a model's residuals) as a whole, while nor.test() from library(onewaytests) allows you to test each subgroup.
For a single variable:
shapiro.test(iris$Sepal.Length)
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Length
## W = 0.97609, p-value = 0.01018
Remember that shapiro just wants a list of numbers, and residuals are also a list of numbers.
model <- iris %>% lm(Sepal.Length ~ Species, .)
shapiro.test(residuals(model))
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.9879, p-value = 0.2189
or
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.9879, p-value = 0.2189
For a variable across groups, you have two options. The first is using onewaytests; nor.test() also plots a lot of great information for us.
library(onewaytests)
iris %>%
nor.test(Sepal.Length ~ Species, .)
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : Sepal.Length and Species
##
## Level Statistic p.value Normality
## 1 versicolor 0.9778357 0.4647370 Not reject
## 2 virginica 0.9711794 0.2583147 Not reject
## 3 setosa 0.9776985 0.4595132 Not reject
## --------------------------------------------------
The second is using the rstatix package, which has converted most analyses to be very happy with piping. We are considering moving most of the course towards rstatix, but the problem is the textbook does not follow suit.
library(rstatix)
iris %>%
group_by(Species) %>%
shapiro_test(Sepal.Length)
## # A tibble: 3 × 4
## Species variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 versicolor Sepal.Length 0.978 0.465
## 2 virginica Sepal.Length 0.971 0.258
## 3 setosa Sepal.Length 0.978 0.460
pbinom(q, size, prob)
Put simply, pbinom returns the area to the left of a given value q in the binomial distribution. If you're interested in the area to the right of a given value q, you can simply add the argument lower.tail = FALSE.[^1]
[^1]: https://www.statology.org/dbinom-pbinom-qbinom-rbinom-in-r/
Ando flips a fair coin 5 times. What is the probability that the coin lands on heads more than 2 times?
pbinom(q=2, size=5, prob=.5, lower.tail=FALSE)
## [1] 0.5
Suppose Tyler scores a strike on 30% of his attempts when he bowls. If he bowls 10 times, what is the probability that he scores 4 or fewer strikes?
pbinom(q=4, size=10, prob=.3)
## [1] 0.8497317
dbinom(x, size, prob)
“Put simply, dbinom finds the probability of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed (prob).” [^2]
[^2]: ibid.
Bob makes 60% of his free-throw attempts. If he shoots 12 free throws, what is the probability that he makes exactly 10?
dbinom(10, 12, .60)
## [1] 0.06385228
What is the probability of throwing five die and getting exactly three of the same number?
dbinom(x=3, size=5, prob=(1/6))
## [1] 0.03215021
The probability of rolling “4” exactly three times in five rolls of a die can be written as:
\[ Pr[X=3] = \binom{n}{X}p^X(1-p)^{(n-X)} = \frac{n!}{X!(n-X)!}p^X(1-p)^{(n-X)} \]
\[ Pr[X=3] = \frac{5!}{3!(5-3)!}\left(\frac{1}{6}\right)^{3}\left(1-\frac{1}{6}\right)^{(5-3)} \]
binom.test() will give us the probability of rolling 3 fours or MORE EXTREME (remember your p value definitions).
binom.test(x=3, n=5, p=(1/6))
##
## Exact binomial test
##
## data: 3 and 5
## number of successes = 3, number of trials = 5, p-value = 0.03549
## alternative hypothesis: true probability of success is not equal to 0.1666667
## 95 percent confidence interval:
## 0.1466328 0.9472550
## sample estimates:
## probability of success
## 0.6
However, in a hypothesis testing scenario, we want to multiply our p value by two since it is a two-tailed test.
2*binom.test(x=531,n=648, p=0.5)$p.value
## [1] 1.41102e-63
What is the confidence interval for the proportion of people struck by lightning who are men?
library(binom)
binom.confint(x=531, n=648, method="ac")
## method x n mean lower upper
## 1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099
The default setting for binom.confint is to give all of the possible ways to calculate the confidence intervals. This is the same as setting method="all".
binom.confint(x=531, n=648)
## method x n mean lower upper
## 1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099
## 2 asymptotic 531 648 0.8194444 0.7898285 0.8490604
## 3 bayes 531 648 0.8189522 0.7891231 0.8482295
## 4 cloglog 531 648 0.8194444 0.7876056 0.8469808
## 5 exact 531 648 0.8194444 0.7876231 0.8483216
## 6 logit 531 648 0.8194444 0.7879178 0.8471935
## 7 probit 531 648 0.8194444 0.7883190 0.8475301
## 8 profile 531 648 0.8194444 0.7885988 0.8477696
## 9 lrt 531 648 0.8194444 0.7885989 0.8477662
## 10 prop.test 531 648 0.8194444 0.7871576 0.8478685
## 11 wilson 531 648 0.8194444 0.7879734 0.8471504
Tests for the homogeneity of variances between two or more groups. If it's significant, it suggests that there is not homogeneity of variances (we reject the null hypothesis that there is no difference between the variances of the groups).
library(car)
library(tidyverse)
iris %>% leveneTest(Sepal.Length ~ Species, center=mean, .)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 7.3811 0.0008818 ***
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#leveneTest(Sepal.Length ~ Species, data=iris, center=mean) #old code
See cor.test(); plain cor() on its own is not worth running.
Given two variables, you can test the correlation between them. This is just cor() with more detail, providing us the confidence intervals, the estimate, the p value, df, and the t statistic.
cor.test(iris$Petal.Length, iris$Sepal.Length)
##
## Pearson's product-moment correlation
##
## data: iris$Petal.Length and iris$Sepal.Length
## t = 21.646, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8270363 0.9055080
## sample estimates:
## cor
## 0.8717538
You can also get the estimate directly through asking for that part of cor.test() by using $estimate.
cor.test(iris$Petal.Length, iris$Sepal.Length)$estimate
## cor
## 0.8717538
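Other pieces can be pulled out the same way; a quick sketch:
cor.test(iris$Petal.Length, iris$Sepal.Length)$p.value
cor.test(iris$Petal.Length, iris$Sepal.Length)$conf.int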
Given a dataset of only the numeric columns you want correlations for, we can use corr.test() to output the correlations. Here, I ungroup iris (I think I grouped it a while back), select out the one non-numeric column, and then run corr.test() on it.
You also could select certain variables if you wanted to know just their correlations too.
library(tidyverse)
library(psych)
iris %>%
ungroup() %>%
dplyr::select(-Species) %>%
corr.test()
## Call:corr.test(x = .)
## Correlation matrix
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Example
## Sepal.Length 1.00 -0.12 0.87 0.82 0.87
## Sepal.Width -0.12 1.00 -0.43 -0.37 0.38
## Petal.Length 0.87 -0.43 1.00 0.96 0.60
## Petal.Width 0.82 -0.37 0.96 1.00 0.58
## Sepal.Example 0.87 0.38 0.60 0.58 1.00
## Sepal.Example2 0.87 0.38 0.60 0.58 1.00
## Sepal.Example2
## Sepal.Length 0.87
## Sepal.Width 0.38
## Petal.Length 0.60
## Petal.Width 0.58
## Sepal.Example 1.00
## Sepal.Example2 1.00
## Sample Size
## [1] 150
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Example
## Sepal.Length 0.00 0.15 0 0 0
## Sepal.Width 0.15 0.00 0 0 0
## Petal.Length 0.00 0.00 0 0 0
## Petal.Width 0.00 0.00 0 0 0
## Sepal.Example 0.00 0.00 0 0 0
## Sepal.Example2 0.00 0.00 0 0 0
## Sepal.Example2
## Sepal.Length 0
## Sepal.Width 0
## Petal.Length 0
## Petal.Width 0
## Sepal.Example 0
## Sepal.Example2 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
This works in console, and I don’t have the time or energy to determine why it doesn’t want to knit. However, the point is wrapping corr.test() in print(), and setting short=FALSE will also give the confidence intervals of each correlation pair.
mtcars %>%
dplyr::select(mpg, cyl, wt) %>%
psych::corr.test() %>%
print(., short=FALSE)
x <- c(6,1,3,2,5)
y <- c(6,7,3,2,14)
df <- data.frame(x,y)
Stdx <- (df$x-mean(df$x))/sd(df$x) #print fill
Stdy <- (df$y-mean(df$y))/sd(df$y) #print fill
c(sum(Stdx), sum(Stdy))
## [1] 1.942890e-16 -4.857226e-16
(Stdx)*(Stdy)
## [1] -0.1062054 -0.1470537 0.1388840 0.6290629 1.2417864
SumProd <- sum((Stdx*Stdy)) #print to fill
r <-(1/(sum(!is.na(df$x))-1))*SumProd
r
## [1] 0.4391186
To get CI’s, where:
\[ z_r = \frac{1}{2}\ln(\frac{1+r}{1-r}) \]
\[ CI = z_r \pm 1.96\cdot\frac{1}{\sqrt{n-3}} \] \[ r_{low} = \frac{e^{2\cdot CI_{low}}-1}{e^{2\cdot CI_{low}}+1} \] \[ r_{high} = \frac{e^{2\cdot CI_{high}}-1}{e^{2\cdot CI_{high}}+1} \]
#Continuing from prior code
z <- 1/2*log((1+r)/(1-r))
cilow <- z-(1.96*(1/sqrt(sum(!is.na(df$x))-3)))
cihigh <-z+(1.96*(1/sqrt(sum(!is.na(df$x))-3)))
rlow <- (exp(2*cilow)-1)/(exp(2*cilow)+1)
rhigh <-(exp(2*cihigh)-1)/(exp(2*cihigh)+1)
rlow
## [1] -0.7234239
rhigh
## [1] 0.9524071
To get the t statistic, where:
\[ t = \frac{r-0}{SE_r} = \frac{r-0}{\sqrt{(1-r^2)/(n-2)}} \]
t <- (r-0)/sqrt((1-(r*r))/(sum(!is.na(df$x))-2))
t
## [1] 0.8465617
qt(.025, 3) #this is the t-crit value we need to reach
## [1] -3.182446
pt(t, 3, lower.tail=FALSE)*2 #this is the p value
## [1] 0.4594243
cor.test(df$x, df$y, method="pearson")
##
## Pearson's product-moment correlation
##
## data: df$x and df$y
## t = 0.84656, df = 3, p-value = 0.4594
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7234117 0.9524048
## sample estimates:
## cor
## 0.4391186
chisq.test() takes a different kind of data than what we're used to - it wants things as a table.
Note: This is a really cool dataset, and for simplicity, I am not presenting the full results. The actual data shows that there isn’t a bias in Admissions at UCB - there is a bias in what genders apply to what departments, and in what numbers.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
UCB.table %>% chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 91.61, df = 1, p-value < 2.2e-16
see: chisq.test, mosaicplot, chisq_test, chisq_descriptives
The assumptions of a chi-square test are that no cell has an expected count below one and that fewer than 25% of the cells have expected counts under 5. We can check this by asking chisq.test() to print out the expected counts.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
chisq.test(UCB.table)$expected
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
#Pipe Version
UCB.table %>% chisq.test() %>% .$expected
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
Given a list of expected proportions, we can pass that list into the p= argument.
observed <- c(50, 60, 40, 47, 53)
expected <- c(.2, .2, .2, .2, .2)
chisq.test(observed, p=expected)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 4.36, df = 4, p-value = 0.3595
observed <- c(50, 60, 40, 47, 53)
expected <- c(.1, .1, .1, .1, .6)
chisq.test(observed, p=expected)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 165.09, df = 4, p-value < 2.2e-16
If we calculated this list by hand and, due to rounding, it doesn't quite add up to 1, we can add rescale.p = TRUE to fix that.
observed <- c(50, 60, 40, 47, 53)
expected <- c(.1, .1, .1, .1, .599) #adds up to .999
chisq.test(observed, p=expected, rescale.p=TRUE)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 164.7, df = 4, p-value < 2.2e-16
see: chisq.test, mosaicplot, chisq_descriptives
The same approach can be done using rstatix’s chisq_test.
library(rstatix)
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
UCB.table %>% chisq_test()
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 4526 91.6 1.06e-21 1 Chi-square test ****
see: chisq.test, mosaicplot, chisq_test
And rstatix provides other important information, such as our p-hats (the observed proportions). It prints our expected counts as well.
library(rstatix)
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
UCB.table %>% chisq_test() %>% chisq_descriptives()
## # A tibble: 4 × 9
## Admit Gender observed prop row.prop col.prop expected resid std.resid
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Admitted Male 1198 0.265 0.683 0.445 1043. 4.78 9.60
## 2 Rejected Male 1493 0.330 0.539 0.555 1648. -3.81 -9.60
## 3 Admitted Female 557 0.123 0.317 0.304 712. -5.79 -9.60
## 4 Rejected Female 1278 0.282 0.461 0.696 1123. 4.61 9.60
Notice that row.prop is the proportion of a given subset (Admitted-Male) for all of those in a given row (Admitted). This is the same as doing:
1198/(1198+557)
## [1] 0.6826211
That is to say, males make up 68.3% of those admitted, while males make up 53.9% of those rejected.
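The 53.9% comes from the same arithmetic on the Rejected row:
1493/(1493+1278)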
Given a certain \(\chi^2\) value and a certain df, what is the probability of observing a value this large or larger?
pchisq(q=4.57, df=3, lower.tail=FALSE)
## [1] 0.2061308
see: chisq.test, mosaicplot, chisq_test, chisq_descriptives
Imagine you had a table that looked like:
|            |    | <--- Patient Type ---> |          |
|------------|----|------------------------|----------|
|            |    | Ulcer Patients         | Controls |
| Blood Type | O  | 911                    | 4578     |
|            | A  | 579                    | 4219     |
|            | B  | 124                    | 890      |
|            | AB | 41                     | 313      |
We can rebuild the raw data by repeating observations for each column. I start with blood type and make a list of the observations for the first column using rep() (repeat), going down the list of all ulcer observations and then all control observations.
Then I have to repeat the other variable - that is, what group they were in. Since I went by columns, I’ll go by columns again.
I then put it in a data frame.
bloodtype <- c(
rep("O", 911), rep("A", 579), rep("B", 124), rep("AB", 41),
rep("O", 4578), rep("A", 4219), rep("B", 890), rep("AB", 313)
)
patienttype <- c(
rep("Ulcer patients", 1655),
rep("Controls", 10000)
)
ulcer.data <- data.frame(bloodtype, patienttype)
table(ulcer.data$bloodtype, ulcer.data$patienttype)
##
## Controls Ulcer patients
## A 4219 579
## AB 313 41
## B 890 124
## O 4578 911
#
ulcer.data %>%
select(bloodtype, patienttype) %>%
table()
## patienttype
## bloodtype Controls Ulcer patients
## A 4219 579
## AB 313 41
## B 890 124
## O 4578 911
However, that table doesn’t look right - R has sorted our table by alphabetical order, which isn’t what we want. So, we can overwrite the variables as themselves but with new factor levels.
Note: If you spell anything wrong here in levels=c(), you will wipe your table clean of that one and it will go to zero. If you’re finding you’re missing observations where you had them before, it’s probably here.
ulcer.data$patienttype <- factor(ulcer.data$patienttype,
levels=c("Ulcer patients", "Controls"))
ulcer.data$bloodtype <- factor(ulcer.data$bloodtype,
levels= c("O", "A", "B", "AB"))
ulcer.table <- table(ulcer.data$bloodtype, ulcer.data$patienttype)
ulcer.table
##
## Ulcer patients Controls
## O 911 4578
## A 579 4219
## B 124 890
## AB 41 313
Below, I also do a similar thing using piping. This will not save the factor orders within the dataset. However, since I don’t care about the dataset, as long as the table is okay, I’m fine with this.
bloodtype <- c(
rep("O", 911), rep("A", 579), rep("B", 124), rep("AB", 41),
rep("O", 4578), rep("A", 4219), rep("B", 890), rep("AB", 313)
)
patienttype <- c(
rep("Ulcer patients", 1655),
rep("Controls", 10000)
)
ulcer.data <- data.frame(bloodtype, patienttype)
ulcer.table <- ulcer.data %>%
mutate(patienttype = factor(patienttype,
levels=c("Ulcer patients", "Controls")),
bloodtype = factor(bloodtype,
levels= c("O", "A", "B", "AB"))) %>%
select(bloodtype, patienttype) %>%
table()
ulcer.table
## patienttype
## bloodtype Ulcer patients Controls
## O 911 4578
## A 579 4219
## B 124 890
## AB 41 313
Agresti (1990, p. 61f; 2002, p. 91) Fisher’s Tea Drinker
A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman’s guess, the alternative that there is a positive association (that the odds ratio is greater than 1).
TeaTasting <- matrix(c(3, 1, 1, 3),
nrow = 2,
dimnames = list(Guess = c("Milk", "Tea"),
Truth = c("Milk", "Tea")))
TeaTasting
## Truth
## Guess Milk Tea
## Milk 3 1
## Tea 1 3
fisher.test(TeaTasting, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: TeaTasting
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
## => p = 0.2429, association could not be established
identify_outliers() from the rstatix package flags values of a given y that fall more than 1.5 x IQR above Q3 or below Q1 as outliers, and values more than 3 x IQR beyond those quartiles as extreme values.
We are generally hesitant on this in psychology, as we always must consider what it means to be extreme (and isn’t that the beauty of psychology - that people are extreme?). However, it can be a useful tool to have in case you want to check if the data is inputted properly, and it probably does have some value there too.
library(rstatix)
mpg %>%
group_by(manufacturer) %>%
mutate(m_hwy = mean(hwy, na.rm=TRUE)) %>%
rstatix::identify_outliers(hwy) %>%
dplyr::select(manufacturer, model, year, m_hwy, hwy, is.outlier, is.extreme)
## # A tibble: 4 × 7
## manufacturer model year m_hwy hwy is.outlier is.extreme
## <chr> <chr> <int> <dbl> <int> <lgl> <lgl>
## 1 subaru forester awd 2008 25.6 23 TRUE FALSE
## 2 volkswagen jetta 1999 29.2 44 TRUE TRUE
## 3 volkswagen new beetle 1999 29.2 44 TRUE TRUE
## 4 volkswagen new beetle 1999 29.2 41 TRUE TRUE
get_summary_stats() is a shortcut for calculating the means and sds by group. You can do this yourself without adding more libraries, but this code is potentially more straightforward for you (a do-it-yourself sketch follows the output below).
library(rstatix)
iris %>%
group_by(Species) %>%
get_summary_stats(Petal.Length,
show=c("n", "mean", "sd", "se"))
## # A tibble: 3 × 6
## Species variable n mean sd se
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 versicolor Petal.Length 50 4.26 0.47 0.066
## 2 virginica Petal.Length 50 5.55 0.552 0.078
## 3 setosa Petal.Length 50 1.46 0.174 0.025
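If you prefer the do-it-yourself route without rstatix, a minimal sketch using only the tidyverse (the column names here are mine):
iris %>%
  group_by(Species) %>%
  summarise(n = sum(!is.na(Petal.Length)),
            mean = mean(Petal.Length, na.rm=TRUE),
            sd = sd(Petal.Length, na.rm=TRUE),
            se = sd/sqrt(n))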
Given an average (mean) number of events, dpois() can tell us the probability of observing a given count under a Poisson distribution.
This is the same as doing the following equation:
\[ \frac{e^{(-\mu)}\cdot(\mu^{x})} {X!} \]
where X is the number of successes and \(\mu\) is the expected (mean) number of events in a unit of time or space. lambda is the Poisson distribution's name for that expected number of events.
dpois(1, lambda=.5)
## [1] 0.3032653
A more detailed example follows:
#First, I write all of our observations out.
Count <- c(rep(0, 103),
rep(1, 72),
rep(2, 44),
rep(3, 14),
rep(4, 3),
rep(5, 1),
rep(6,1))
meanFreq <- mean(Count) #On average, there is how many? .945
meanFreq
## [1] 0.9453782
NumFish <- table(Count) #We can table this to get it to look like the slides.
NumFish
## Count
## 0 1 2 3 4 5 6
## 103 72 44 14 3 1 1
data <- data.frame(NumFish) #We turn this into a dataframe.
#Unfortunately, table() treats our numbers as factors. So we first turn them into
#characters (letters, but letters 0,1,2,3,4,5,6), and then turn those into numbers.
#Doing JUST as.numeric() will leave you with 1,2,3,4,5,6,7 because 0 is not a level.
data$Count <- as.numeric(as.character(data$Count))
# We want to calculate the poisson probabilities. Two ways.
data$PrX <- dpois(x=0:6, meanFreq)
data$PrX2 <- (exp(-(meanFreq))* (meanFreq^data$Count))/factorial(data$Count)
#We also want to calculate the expected probabilities.
data$Expected <- sum(data$Freq)*data$PrX
data
## Count Freq PrX PrX2 Expected
## 1 0 103 0.3885326190 0.3885326190 92.47076333
## 2 1 72 0.3673102491 0.3673102491 87.41983928
## 3 2 44 0.1736235421 0.1736235421 41.32240302
## 4 3 14 0.0547133011 0.0547133011 13.02176566
## 5 4 3 0.0129311899 0.0129311899 3.07762319
## 6 5 1 0.0024449729 0.0024449729 0.58190354
## 7 6 1 0.0003852373 0.0003852373 0.09168648
Notice how PrX and PrX2 are the same. Again, dpois is doing the equation above for each x.
Oh no! Some of our expected counts are too low. Let's combine everything at 4 or more. To do that, make a grouping variable that is the same for all rows from 4 onward, then summarise all columns by taking the sum within each group. The only rows that actually get added together are the ones sharing that last level.
data$Unique <- c("0","1","2","3","4 or more","4 or more","4 or more")
data <- data %>% group_by(Unique) %>% summarise_all(sum)
#Now, we're left with one final problem - 4 or more could be 7. 8. 9! 10!!!
data <- data %>% mutate(PrX =
case_when(Unique != "4 or more" ~ PrX,
Unique == "4 or more" ~ 1-sum(PrX[Unique!="4 or more"])))
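From here, a sketch of one way you might finish the goodness-of-fit test by hand (this step is my addition, not part of the original example): recompute the expected counts from the corrected PrX, then compare observed and expected with the chi-square formula, remembering that the df loses one extra degree of freedom because the mean was estimated from the data.
data$Expected <- sum(data$Freq)*data$PrX #recompute expected counts for the combined categories
chistat <- sum((data$Freq - data$Expected)^2 / data$Expected)
chistat
pchisq(chistat, df=5-1-1, lower.tail=FALSE) #5 categories, minus 1, minus 1 estimated parameter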
library(fmsb)
## Warning: package 'fmsb' was built under R version 4.0.5
riskratio(6,1,702,1376)
## Disease Nondisease Total
## Exposed 6 696 702
## Nonexposed 1 1375 1376
##
## Risk ratio estimate and its significance probability
##
## data: 6 1 702 1376
## p-value = 0.003623
## 95 percent confidence interval:
## 1.418659 97.496107
## sample estimates:
## [1] 11.76068
oddsratio(6,1,696,1375, p.calc.by.independence=TRUE)
## Disease Nondisease Total
## Exposed 6 696 702
## Nonexposed 1 1375 1376
## Total 7 2071 2078
##
## Odds ratio estimate and its significance probability
##
## data: 6 1 696 1375
## p-value = 0.003623
## 95 percent confidence interval:
## 1.424261 98.650607
## sample estimates:
## [1] 11.85345
library(pwr)
## pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))
pwr.t.test(d=(0.25/0.25), sig.level=0.05, power=0.8, type="two.sample")
##
## Two-sample t test power calculation
##
## n = 16.71472
## d = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
# pwr.t2n.test(n1 = , n2= , d = , sig.level =, power = )
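The commented-out line above is the unequal-group-size version; a sketch with made-up numbers (these values are not from class):
pwr.t2n.test(n1=20, n2=30, d=0.8, sig.level=0.05) #leaving power blank solves for it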
\[ Z=\frac{\bar Y-\mu}{\sigma_{\bar Y}} \]
\[ t = \frac{\bar Y-\mu_0}{s/\sqrt{n}} \]
\[ \bar Y\pm SE_{\bar Y}\cdot t_{\alpha(2), df} \]
\[ \frac{df\:s^2}{\chi^2_{\frac{\alpha}{2},df}}\leq\sigma^2 \leq \frac{df\:s^2}{\chi^2_{1-\frac{\alpha}{2},df}} \]
\[ r = \frac{1}{n-1}\cdot\sum\limits_{i=1}^{n}\frac{x_i-\bar x}{s_x}\cdot\frac{y_i-\bar y}{s_y} \]
or
\[ r = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum{(x_i - \bar x)^2\cdot\sum(y_i-\bar y)^2}}} \]
\[ SE_r = \sqrt\frac{1-r^2}{n-2} \]
\[ z_r = \frac{1}{2}\ln(\frac{1+r}{1-r}) \]
\[ CI = z_r \pm 1.96\cdot\frac{1}{\sqrt{n-3}} \]
\[ r_{low} = \frac{e^{2\cdot CI_{low}}-1}{e^{2\cdot CI_{low}}+1} \]
\[ r_{high} = \frac{e^{2\cdot CI_{high}}-1}{e^{2\cdot CI_{high}}+1} \]
\[ t = \frac{r-0}{SE_r} = \frac{r-0}{\sqrt{(1-r^2)/(n-2)}} \]
\[ F_s = \frac{MS(between)}{MS(within)} \]
or
\[ F_s = \frac{MS(treatment)}{MS(error)} \]
\[ SS(blocks) = \sum\limits_{j=1}^{J}m_j(\bar y_j-\bar{\bar{y}})^2 \]
\[ J-1 \]
\[ MS(blocks) = \frac{\sum\limits_{j=1}^{J}m_j(\bar y_j-\bar{\bar{y}})^2}{J-1} \]
\[ s_{pooled} = \sqrt{\frac{\sum\limits_{i=1}^I(n_i-1)s_i^2}{\sum\limits_{i=1}^I(n_i-1)}} = \sqrt{\frac{\sum\limits_{i=1}^I(n_i-1)s^2_i}{n_.-I}} \]
\[ \sum\limits_{i=1}^I(n_i-1)s^2_i = \sum\limits_{i=1}^I\sum\limits_{j=1}^{n_i}(y_{ij}-\bar y_i)^2 \]
\[ n_. -1 \]
\[ MS(within) = \frac{SS(within)}{df(within)} \]
\[ MS_{error} = \frac{\sum\limits_{i=1}^ks_i^2(n_i-1)}{N-k} \]
\[ MS_{error} = \frac{df_1s^2_1+df_2s^2_2}{df_1+df_2} = s^2_p \]
\[ \sum\limits_{i=1}^In_i(\bar y_i - \bar{\bar y})^2 \]
\[ I-1 \]
\[ \frac{\sum\limits_{i=1}^In_i(\bar y_i - \bar{\bar y})^2}{I-1}=\frac{SS_{groups}}{df_{groups}} \]
\[ \sum\limits_{i=1}^I\sum\limits_{j=1}^{n_i}(y_{ij}-\bar y_i)^2 \]
\[ Y_i-\hat Y_i \]
\[ SS_{residuals} = \sum\limits_{i=1}^n(Y_i-\hat Y_i)^2 \]
\[ b= \frac{Covariance(X,Y)}{Variance(X)} = \frac{\sum\limits_{i=1}^n(X_i-\hat X)(Y_i-\hat Y)}{\sum\limits_{i=1}^n(X_i-\hat X)^2} \]
\[ SE_b = \sqrt{\frac{MS_{residuals}}{\sum(X_i-\hat X)^2}} \]
\[ MS_{residual} = \frac{SS_{residual}}{df_{residual}} \]
\[ SS_{total} = \sum(Y_i-\hat Y)^2 \]
\[ SS_{regression} = b \sum(X_i-\hat X)(Y_i - \hat Y) \]
\[ SS_{residual} = SS_{total}-SS_{regression} \]
\[ b \pm t_{\alpha[2],df} SE_b \]
\[ n-2 \]
\[ t= \frac{b-\beta_0}{SE_b} \]
\[ \chi_s^2 = \sum\limits_{all\cdot classes}\frac{(o_i-e_i)^2}{e_i} \]
\[ df = number\:of\:categories-1 - number\:of\:parameters\:estimated\:from\:data \]
\[ df_{\chi^2}=(r-1)(k-1) \]
\[ \sum_i\sum_j(Observed_{i,j}\cdot\ln\frac{Observed_{i,j}}{Expected_{i,j}}) \]
\[ \widehat{RR} = \frac{\hat p_1}{\hat p_2}, \text{ where } \hat p_i = \frac{n_{\text{bad outcomes}_i}}{n_{\text{individuals}_i}} \]
\[ SE[\ln(\widehat{RR})] = \sqrt{\frac{1}{a}+\frac{1}{b}-\frac{1}{a+c}-\frac{1}{b+d}} \]
\[ 95\%CI = e^{\ln(\widehat{RR})\pm(1.96)(SE[\ln(\widehat{RR})])} \]
\[ odds = \hat O_1 = \frac{\hat p_1}{1-\hat p_1} \\ \hat O_2 = \frac{\hat p_2}{1-\hat p_2}\\ \widehat{OR} = \frac{\hat O_1}{\hat O_2} = \frac{a/c}{b/d} = \frac{ad}{bc} \]
Where X = number of successes, \(\mu\) is the expected (mean) number of events in a unit of time or space and ! is factorial notation.
\[ \frac{e^{(-\mu)}\cdot(\mu^{x})} {X!} \]
\[ \hat p = \frac{X}{n} \]
\[ SE_{\hat p}= \sqrt{\frac{\hat p (1-\hat p)}{n}} \]
\[ p= p' \pm(1.96)\sqrt{\frac{p'(1-p')}{n+4}} \]
\[ Pr[X] = \binom{n}{X}p^X(1-p)^{(n-X)} \]
where (Binomial Coefficient) \[ \binom{n}{X} = \frac{n!}{X!(n-X)!} \]
and where \[ n! = n \cdot (n-1) \cdot (n-2) \cdot ... 3 \cdot 2 \cdot 1 \]
\[ \mu = n\cdot p \]
\[ \sigma = \sqrt{n \cdot p \cdot (1-p)} \]
There are a lot of ways you can format a Markdown.
Surrounding words with * can make text italicized.
Surrounding words with ** can make text bolded.
You can write many letters. Ones important for this class might be:
Using hashtags to create headers is also visually important for labs and for organization purposes. You might notice how certain headers here have different sizes. The main Question should lead with a single hashtag and a space after it.
# Question 1
While a part to that question should be the next level lower, so that would be two hash tags with a space next to it.
## Question 1a
Bullets use * (a single star) with a space after it, and an enter after each item, to help Markdown tell the difference between a list and a bolded/italicized word. You saw what bullet formatting looks like above in Greek Letters; a small sketch also follows.
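A minimal sketch of what a bulleted list looks like in the raw Markdown (my own example):
* First bullet point
* Second bullet point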
Tables are funky, and you probably won't ever need to use them, but for completeness, I include the example here. It is important to note that it is rather cumbersome to merge cells together (it requires you to code with HTML), and we will not consider that option.
|           |          | <--- | Creek | ---> |
|-----------|----------|------|-------|------|
|           |          | a    | b     | c    |
| Treatment | No Fish  | 11   | 8     | 7    |
|           | Galaxias | 9    | 4     | 4    |
|           | Trout    | 6    | 4     | 0    |
This could be written as:
| | | <--- | Creek | ---> |
|-----------|----------|------|-------|------|
| | | a | b | c |
| Treatment | No Fish | 11 | 8 | 7 |
| | Galaxias | 9 | 4 | 4 |
| | Trout | 6 | 4 | 0 |