knitr::opts_knit$set(root.dir = "~/Dropbox/PACT Lab")
To install a package, run the following function with the name of the package in quotes. If it asks for permission to install other things, be sure to click Yes. Installing a package requires an internet connection, so be sure you have a stable connection before doing this.
install.packages("tidyverse")
The important libraries we generally need to install are:
Libraries that are more specific to statistics are:
Libraries that are important for learning R are:
Other libraries that show up in this Markdown are:
There are many things to try at this stage, so please be sure you have tried all of the below before asking for help. R is a puzzle that is meant to be solved: the computer is not wrong, you must have done something wrong. But R is trying to give you hints.
Did you read the error? What does it say? Did you copy and paste it into Google?
Did you take the time to check whatever it is you were working on? That is, have you run the variable in your console to see it and understand what it looks like? Have you run table(), head(), class(), and more on it to ensure it's all what you expected? (See the sketch after this list.)
Have you turned it off and on again – that is, have you tried running your script from top to bottom and reloading your data in?
Have you ensured you have loaded the libraries that you need? If you think you have, is it possible that multiple packages are calling the same function? Have you added the library name in the front to be safe with ::?
Have you tried breaking it yourself, or trying different things?
Have you checked your quotes, your brackets, your parentheses, your commas?
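For example, a minimal sketch of checking an object before asking for help (my_variable here is just a placeholder for whatever you are working with):
my_variable <- iris$Species
class(my_variable) #what type is it?
head(my_variable) #what do the first few values look like?
table(my_variable) #how are the values distributed?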
Loading data is done primarily through read.csv(). Other file formats will require different functions (see the sketch after the code below).
We are working to transition the lab to full tidy language use, which also means we want to ensure our data is loaded as a tibble using as_tibble().
library(tidyverse)
#setwd()
#data <- read.csv()
#A dataset we will use to generally practice some features.
data <- iris
data <- as_tibble(data)
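If your data is not a .csv, here is a rough sketch of some common readers (the file names are placeholders, and these packages would need to be installed first):
#readxl::read_excel("mydata.xlsx")
#haven::read_sav("mydata.sav")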
Scripts can be hard to read, especially as your code gets longer. That is why we use the Enter key. If you write a comma in your code, you should probably hit enter. If you use a plus sign in ggplot, you should probably hit enter. As you continue to read through this manual, you will notice I hit enter after most commas. You should do the same. RStudio will indent things for you and make the structure easy to see.
I also recommend you section your script by adding #### headers at various points. You will notice that a small collapse arrow appears on the left side near the line numbers. You can use this to collapse and jump around large sections of your script without needing to scroll too much.
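A minimal sketch of what a sectioned script might look like (RStudio treats a comment ending in four or more #, -, or = signs as a collapsible section):
#### Load packages ####
library(tidyverse)
#### Clean data ####
#cleaning code goes here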
R is happy to do things for you without saving its work. For example, we could ask R to calculate the mean of 2, 4, and 100 by:
mean(c(2,4,100))
## [1] 35.33333
But nothing was saved. Instead, if we want to keep that number moving forward, we need to be sure we're assigning it to something. We will always use the left-pointing arrow (<-) to denote assignment.
mean_of_three <- mean(c(2,4,100))
Nothing will happen in the console, but instead, this value is now in our Environment. We can always see that value by just typing into R:
mean_of_three
## [1] 35.33333
There are two main things we can assign. If we were to write:
value <- c("Dr. C", "Dr. K")
This would store a value into our Environment under Values. This is different from creating a variable.
data$var <- rep(c("Dr. C", "Dr. K"), 75)
This creates a variable called var. To understand this, we first state that we want things stored within the dataframe we call data, and then, we use the dollar sign to signify that this will be a new variable in the data.
see also: sapply
We can understand things in terms of lists (like above, where value was a list of two names) and data frames. Data frames are actually just lists organized in columns. Brackets are one of the many ways in which we can subset data. If we use a comma in the middle of a bracket, we can specify particular rows (left side) or columns (right side). If there is no comma, R treats it as if you're simply asking for the nth element in the list of lists, which would be the nth column. You can also call columns by passing the column's name as a character string, an important lesson that will be useful later.
head(data[,1]) #Column
## # A tibble: 6 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## 4 4.6
## 5 5
## 6 5.4
head(data[1]) #Also column
## # A tibble: 6 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## 4 4.6
## 5 5
## 6 5.4
data[1,] #Rows
## # A tibble: 1 × 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species var
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.1 3.5 1.4 0.2 setosa Dr. C
head(data["Sepal.Length"]) #another way to call the column.
## # A tibble: 6 × 1
## Sepal.Length
## <dbl>
## 1 5.1
## 2 4.9
## 3 4.7
## 4 4.6
## 5 5
## 6 5.4
We can always type ? followed by a function name (like ?c) into the Console to learn more about it. The best places to look are the Arguments section and then the Examples section at the bottom.
Functions are things that have some letters on one side (many times a word, or a shortened word) and an open and closed set of parentheses on the other (). We call this thing() a function. Functions take arguments, or things we put inside of the parentheses. Different functions take different numbers of arguments, but if there are multiple arguments to pass to a function, we always separate them with commas.
Above, we used c() to create a list of two things. This is our first function! These simple lists are usually called vectors in coding. If we want to list multiple things, we probably want to assume we are using c() unless otherwise noted.
See: ! ; logical operators
is.na() returns a vector (list) of TRUE or FALSE. NA is R’s term for missingness. That is, is.na() will return a TRUE if something is missing and a FALSE if it is not-missing. We generally will use is.na() with the NOT, that is, to say !is.na(), to give us a list of TRUE if something is present, and only FALSE if something is not present.
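A minimal sketch with a made-up vector x that has one missing value:
x <- c(2, NA, 5)
is.na(x) #TRUE where missing
!is.na(x) #TRUE where present
sum(!is.na(x)) #counts the non-missing values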
table() will give us the cross-tabulation of one to two variables in a dataset and give the counts.
table(data$Species)
##
## setosa versicolor virginica
## 50 50 50
table(data$Species, data$var)
##
## Dr. C Dr. K
## setosa 25 25
## versicolor 25 25
## virginica 25 25
In what we are learning below, we can also call tables through piping.
data %>%
dplyr::select(Species, var) %>%
table()
## var
## Species Dr. C Dr. K
## setosa 25 25
## versicolor 25 25
## virginica 25 25
ls() lists everything in your environment.
rm() removes things from your environment. You can use rm(list=ls()) to remove everything from your environment.
rm(mean_of_three)
We can use <, >, ==, >=, and <= for most comparisons. If we want to test "equal to", R understands this as "equivalent to", which requires the use of a double equal sign. A single equal sign will not work; a single equal sign acts like our <- assignment operator. We can also use these operators at any point on variables in our data.
2+2==4
## [1] TRUE
data$Sepal.Length > 6
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [61] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [97] FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE
## [109] TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [121] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [133] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE
## [145] TRUE TRUE TRUE TRUE TRUE FALSE
table(data$Sepal.Length > 6)
##
## FALSE TRUE
## 89 61
An exclamation mark is used to say NOT. More specifically, it is primarily used to say NOT when we are considering statements of TRUE or FALSE. It will flip things for us. An important thing to think about with ! is the use of parentheses. That is, we have to be clear about what we're negating.
2+2==4
## [1] TRUE
!(2+2==4)
## [1] FALSE
2+3==4
## [1] FALSE
!(2+3==4)
## [1] TRUE
table(!(data$Sepal.Length > 6))
##
## FALSE TRUE
## 61 89
#Notice how, compared to above, this ! required
#that we wrap data$Sepal.Length > 6 in ().
see also: relevel; reorder factors
Factoring just tells R that a variable is categorical. We generally use this to reorder the factor levels. For example, the levels of Species are kept in alphabetical order by default.
levels(iris$Species)
## [1] "setosa" "versicolor" "virginica"
Yet, using factor(), we can change that order.
iris$Species <- factor(iris$Species, levels = c("virginica",
"versicolor",
"setosa"))
levels(iris$Species)
## [1] "virginica" "versicolor" "setosa"
#Piping:
iris <- iris %>%
mutate(
Species = factor(Species,
levels = c("versicolor",
"virginica",
"setosa")))
levels(iris$Species)
## [1] "versicolor" "virginica" "setosa"
see also: piping, pull
Gets the global mean of some variable or object.
mean(iris$Sepal.Length)
## [1] 5.843333
iris %>%
pull(Sepal.Length) %>%
mean
## [1] 5.843333
see also: piping, pull()
Gets the standard deviation of some variable or object.
sd(iris$Sepal.Length)
## [1] 0.8280661
iris %>%
pull(Sepal.Length) %>%
sd
## [1] 0.8280661
Calculates the standard error of some variable or object. The se is the standard deviation divided by the square root of n.
sd(iris$Sepal.Length)/sqrt(
sum(
!is.na(iris$Sepal.Length)
)
)
## [1] 0.06761132
Since the lab is learning from dplyr, we need to immediately start with %>%, or what is known as the pipe. The pipe takes whatever is on its left and passes it to the function on its right as the first argument. You will see that all of the functions below take data as the first argument. Using the pipe, this step has been done for you, so you never need to specify which dataset you are using inside the function, since you have defined it on the left side of the pipe.
This should help with the readability of the script as you go on. We will see the pipe appear a lot, but you can try and read it as “and then”.
Finally, %>% allows us to end our "and then, and then, and then" chain without a () on the last function if that function only needs a single argument (and that argument is a dataset). Again, this makes sense, since functions like head() only need the data, and we have already supplied that at the beginning of our pipe.
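As a minimal sketch, these are equivalent ways of asking for the first rows of iris:
head(iris)
iris %>% head()
iris %>% head #the () are optional here because the data is the only argument needed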
Now, while most of the time we put datasets on the left, it does not need to be the case that it is the only thing on the right (or that functions are created to assume data comes first). For example, the lm() function (see lm()), takes a formula as the first argument, and later, data:
lm(Petal.Length ~ Species, data=iris)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = iris)
##
## Coefficients:
## (Intercept) Speciesvirginica Speciessetosa
## 4.260 1.292 -2.798
So, following pipe logic, trying to pipe in iris on the left would not work, since formula is the first argument, and formula is not a dataset.
iris %>% lm(Petal.Length ~ Species)
This would break! It would say: “Error in as.data.frame.default(data) : cannot coerce class ‘“formula”’ to a data.frame”.
However, piping allows us to use the period as a placeholder of where the left hand side will go.
iris %>% lm(Petal.Length ~ Species, .)
##
## Call:
## lm(formula = Petal.Length ~ Species, data = .)
##
## Coefficients:
## (Intercept) Speciesvirginica Speciessetosa
## 4.260 1.292 -2.798
see also: select(), rowwise()
group_by() will be used to group our datasets by our independent variables. This is critical for other parts of dplyr and tidyr, where we will use the grouped dataset to get summary statistics.
Grouped data will force the select() function below to carry forward the grouped values.
data <- data %>%
group_by(Species, var)
head(data)
## # A tibble: 6 × 6
## # Groups: Species, var [2]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species var
## <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 5.1 3.5 1.4 0.2 setosa Dr. C
## 2 4.9 3 1.4 0.2 setosa Dr. K
## 3 4.7 3.2 1.3 0.2 setosa Dr. C
## 4 4.6 3.1 1.5 0.2 setosa Dr. K
## 5 5 3.6 1.4 0.2 setosa Dr. C
## 6 5.4 3.9 1.7 0.4 setosa Dr. K
mutate() is used to create new variables or change variables (mutate them). You can use mutate() instead of creating a new variable through the dollar sign assignment; either is acceptable. Here, I make two variables. First, I overwrite Sepal.Length by creating a new variable with the same name and adding one to it. I also create a new variable called newvar, which takes a random value of either 0 or 1.
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var"
data <- data %>%
ungroup() %>%
mutate(Sepal.Length = Sepal.Length+1,
newvar = sample(c(0:1), 150, replace=TRUE)) %>%
group_by(Species, var)
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var" "newvar"
see also: mutate()
recode() is dplyr's way to, well, recode variables. If we do not like how our factors are labeled, we can use recode() to change things around. Notice that because recode() takes a vector (a variable), we need to use it inside of mutate().
Also note here I was explicit about which recode (dplyr’s) I was calling. If errors occur in code, sometimes it’s because there are multiple packages that both came up with the same function name. For me, I have a package open called “car” (and you might too) and car has its own recode function. Since we opened dplyr and then car, car’s recode gets called first and then breaks this code. To avoid that, I call dplyr explicitly.
data <- data %>%
ungroup() %>%
mutate(var=dplyr::recode(var,
"Dr. C"="Dr. Carriere",
"Dr. K"="Dr. Kilgore"))
# Also can do:
#data$var <- recode(data$var, "Dr. C" = "Dr. Carriere",
# "Dr. K"="Dr. Kilgore")
see also: select(), mutate(), table()
filter() is used to reduce the number of rows. So far, I haven't found a great use for this, but one way in which I could envision this being useful is to create a new variable that has some value only for certain filtered rows.
Coming back to this, one good use I have found for filter() is to output only certain rows when looking at data, both in analyses and in writing things up, when you only need a few of the rows.
# My original thought.
library(tidyverse)
library(emmeans)
## Warning: package 'emmeans' was built under R version 4.0.5
iris %>%
filter(Species=="setosa") %>%
filter(Petal.Length>1.8) %>%
mutate(newvar2="This is some new variable value") %>%
dplyr::select(newvar2, Species) %>%
table()
## Species
## newvar2 versicolor virginica setosa
## This is some new variable value 0 0 2
#Summarize only some rows if you were writing up a manuscript.
iris %>%
group_by(Species) %>%
summarise(mean=mean(Petal.Length, na.rm=TRUE)) %>%
filter(Species!="setosa")
## # A tibble: 2 × 2
## Species mean
## <fct> <dbl>
## 1 versicolor 4.26
## 2 virginica 5.55
#Summarize analyses based on stuff.
pred_mpg <- lm(displ ~ fl, data=mpg)
summary(pred_mpg)
##
## Call:
## lm(formula = displ ~ fl, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9018 -1.1018 -0.1826 1.0578 3.7365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.800 1.273 1.414 0.1588
## fld 1.240 1.395 0.889 0.3749
## fle 2.875 1.350 2.129 0.0343 *
## flp 1.463 1.285 1.139 0.2561
## flr 1.702 1.277 1.333 0.1840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.273 on 229 degrees of freedom
## Multiple R-squared: 0.04555, Adjusted R-squared: 0.02888
## F-statistic: 2.732 on 4 and 229 DF, p-value: 0.02985
summary(pred_mpg)$coefficients %>%
as_tibble() %>%
filter(`Pr(>|t|)` < .05)
## # A tibble: 1 × 4
## Estimate `Std. Error` `t value` `Pr(>|t|)`
## <dbl> <dbl> <dbl> <dbl>
## 1 2.88 1.35 2.13 0.0343
#Notice only one is significant. There's also only one significant contrast.
summary(emmeans::lsmeans(pred_mpg, pairwise ~ fl))$contrasts %>%
as_tibble() %>%
filter(p.value<.05)
## # A tibble: 1 × 6
## contrast estimate SE df t.ratio p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 e - p 1.41 0.484 229 2.92 0.0313
rename() is how we can rename many variables at once. It takes a dataset as the first argument, like all dplyr verbs, and then takes newname = oldname pairs, separated by commas, with no c() needed.
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var" "newvar"
data <- data %>% rename(petal.length=Petal.Length,
professor = newvar)
names(data)
## [1] "Sepal.Length" "Sepal.Width" "petal.length" "Petal.Width" "Species"
## [6] "var" "professor"
data <- data %>% rename(Petal.Length = petal.length,
newvar = professor)
select() is used to select certain columns that we either want to keep or want to drop. You will learn that we can say "not" in two different ways, using the - sign or using the ! sign. While it is tricky to figure out which to use, generally default to -, and use ! when you're doing some kind of logical operation (is X larger than Y?).
Select can also move the order of the columns around if you tell it to keep everything using everything().
#This will force you to also take var (a grouping variable) into data_small.
data_small <- data %>% dplyr::select(Species, Sepal.Length)
names(data_small)
## [1] "Species" "Sepal.Length"
#We can avoid that by using ungroup()
data_small2 <- data %>%
ungroup() %>%
dplyr::select(Species, Sepal.Length)
names(data_small2)
## [1] "Species" "Sepal.Length"
#Using everything() will simply move the order of the variables.
names(data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
## [6] "var" "newvar"
data_move <- data %>%
dplyr::select(Species, everything())
names(data_move)
## [1] "Species" "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [6] "var" "newvar"
rm(data_move, data_small2, data_small)
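We have not yet shown dropping a column; a minimal sketch using the minus sign:
iris %>%
dplyr::select(-Petal.Width) %>%
names()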
pull() is a verb that says quite literally “pull” this one vector out from the dataset. We can use it to pull out a variable for mean and standard deviation purposes. You saw it above in mean() and sd() using piping.
iris %>%
pull(Sepal.Length) %>%
mean()
## [1] 5.843333
pluck() is similar to pull(), but it plucks values that may come out of a function. We might see this where we would generally call $ on a function's output. Sometimes these $ components are printed in the output, but other times they are not.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
chisq.test(UCB.table)$expected
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
UCB.table %>% chisq.test() %>% pluck("expected")
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
rowid_to_column() is a simple function that can be used at various points; it simply assigns a numeric id to our rows if we do not already have one, and it calls the new column it generates rowid.
data %>%
rowid_to_column() %>%
head
## # A tibble: 6 × 8
## rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species var newvar
## <int> <dbl> <dbl> <dbl> <dbl> <fct> <chr> <int>
## 1 1 6.1 3.5 1.4 0.2 setosa Dr. Ca… 0
## 2 2 5.9 3 1.4 0.2 setosa Dr. Ki… 1
## 3 3 5.7 3.2 1.3 0.2 setosa Dr. Ca… 0
## 4 4 5.6 3.1 1.5 0.2 setosa Dr. Ki… 0
## 5 5 6 3.6 1.4 0.2 setosa Dr. Ca… 0
## 6 6 6.4 3.9 1.7 0.4 setosa Dr. Ki… 1
See also: gather(), rowid_to_column()
pivot_longer() takes wide data and makes it longer.
#Using -c() allows me to say "Don't stack these columns on top of each other.
#But for everything else, stack them on top of each other in a new column called Measure
#and put their values in a new column called Value."
#https://stackoverflow.com/questions/57977470/is-pivot-longer-and-pivot-wider-transitive
data_long <- data %>% rowid_to_column() %>%
pivot_longer(cols =-c(Species, var, newvar, rowid),
names_to="Measure",
values_to="Value")
head(data)
## # A tibble: 6 × 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species var newvar
## <dbl> <dbl> <dbl> <dbl> <fct> <chr> <int>
## 1 6.1 3.5 1.4 0.2 setosa Dr. Carriere 0
## 2 5.9 3 1.4 0.2 setosa Dr. Kilgore 1
## 3 5.7 3.2 1.3 0.2 setosa Dr. Carriere 0
## 4 5.6 3.1 1.5 0.2 setosa Dr. Kilgore 0
## 5 6 3.6 1.4 0.2 setosa Dr. Carriere 0
## 6 6.4 3.9 1.7 0.4 setosa Dr. Kilgore 1
#???? <- ???? %>%
# pivot_longer(cols = starts_with(c("???","???")),
# names_to = c("????", ".value"),
# names_sep="????")
See also: spread(), rowid_to_column()
pivot_wider() takes long data and makes it wider.
data_wide <- data_long %>%
pivot_wider(names_from = "Measure", values_from = "Value")
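As a quick check, you can confirm that the widened data has the original columns back (plus rowid):
names(data_wide)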
see also: summarize()
We can use summarise() to build new datasets for us with summary data. This is especially helpful when our dataset has been grouped by group_by(). It also takes new variable names, and later lines can use the variables we have created on earlier lines within the same call. This is one of our primary ways to get visualization data, as well as to get summary statistics that might be helpful for describing means and standard errors in results.
The lab's preferred spelling is with an s. This is important, especially if you do not clarify with :: which package you mean. That is, dplyr::summarize() will function okay, but a bare summarize() can be masked by another package and behave differently than you would like it to. It is better to be safe and use summarise() or dplyr::summarise().
data_sum <- data %>%
dplyr::summarise(
meanSL=mean(Sepal.Length),
sdSL = sd(Sepal.Length),
nSL = sum(!is.na(Sepal.Length)),
seSL = sdSL/sqrt(nSL), #See here, standard error uses sdSL
#and nSL that we calculated above.
meanSW = mean(Sepal.Width),
sdSW = sd(Sepal.Width),
nSW = sum(!is.na(Sepal.Width)), #refer to "! operator" and is.na() to understand.
seSW = sdSW/sqrt(nSW)
)
head(data_sum)
## # A tibble: 1 × 8
## meanSL sdSL nSL seSL meanSW sdSW nSW seSW
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 6.84 0.828 150 0.0676 3.06 0.436 150 0.0356
gather() is the old version of pivot_longer(). While it works just as well, it will be easier for the lab to focus on pivot_longer(), and any past code may be outdated to other uses.
spread() is the old version of pivot_wider(). While it works just as well, it will be easier for the lab to focus on pivot_wider() and any past code may be outdated to other uses.
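If you run into older scripts, here is a rough, commented-out sketch of what the equivalent legacy calls might look like (the column arguments are illustrative):
#data %>% gather(key = "Measure", value = "Value", -Species, -var, -newvar)
#data_long %>% spread(key = Measure, value = Value)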
Creating averages for variables is a key part of cleaning your data. We typically ask a multitude of questions in a scale and we want to get the overall average score for the scale.
The first simple way is rowMeans(). rowMeans() takes the mean across all columns for each row. So, we need to first subset our data to only the columns we want. What is left is a dataset purely of the columns we want the average of, which can then be run through rowMeans() to get an average for each row, which we can then assign to a new variable. I also run the global mean on this variable to compare it to the next solution.
iris$Sepal.Example <- rowMeans(
subset(
iris,
select=c(
Sepal.Length,
Sepal.Width
)))
mean(iris$Sepal.Example)
## [1] 4.450333
However, rowMeans() will start to misbehave if we’re using a grouped dataset through our tibbles and our group_by() and if we are trying to pipe it in. If so, we would want to shift to using rowwise() and using the mean() function.
see also: psych::alpha() , summarystats(), group_by()
rowwise() is very similar to group_by() in that it doesn't change anything except how the dataset behaves. Before, if we were to call mean(), we would get the average across the whole dataset. But using rowwise(), we get it for each individual row. Then we can use mean() and pass the variable names as we are used to doing. Note that, as always with piping, we don't need the dollar sign to signify the variables, because R is told explicitly at the beginning of the pipe which dataset to use.
iris <- iris %>%
group_by(Species) %>%
rowwise() %>%
mutate(Sepal.Example2 = mean(c(Sepal.Length, Sepal.Width), na.rm=TRUE))
mean(iris$Sepal.Example2)
## [1] 4.450333
The lab uses ggplot2 for all data visualizations. ggplot2 can be broken down into visualizations that take summary data (geom_line; geom_bar; geom_col; geom_point) and visualizations that take the whole dataset (raincloud plots; geom_point). The lab is moving towards using raincloud plots for most visualizations.
All ggplots start with a base layer that will show nothing except the axes. The idea is that we just continually add layers on top of each other. It is important to think like this because indeed, that’s how things work, so if you want something behind/in front of something else, consider the order in which you are placing layers.
Important to note in the code below that the variables we care about are wrapped in aes(). aes() stands for aesthetics, or visually, what we want to change. If we want things to change based on a certain variable, it probably wants to go into aes. If we want something to change globally (for all points), then we probably want it outside of aes.
Both of these are possible plots - that is, all ggplot2 needs is a dataset. It doesn't care if it's the whole data or a summarized version of the data. To ensure this section works well, I will always be piping data just like this at the top of each example. However, please note that in practice, it is better to just save your data into your environment and then pipe that saved object into ggplot().
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))
iris %>% group_by(Species) %>%
dplyr::summarise(
meanSL=mean(Sepal.Length, na.rm=TRUE),
sdSL = sd(Sepal.Length, na.rm=TRUE),
nSL = sum(!is.na(Sepal.Length)),
seSL = sdSL/sqrt(nSL)) %>%
ggplot2::ggplot(aes(x=Species, y=meanSL))
see also: geom_line(), lollipop graphs, strip chart
geom_point() makes scatterplots, so we can expect geom_point() to want the full dataset to plot things well. However, this is not a requirement, and there are many graphs that use geom_point() with a summarized dataset as well, where you are highlighting the points that appear on geom_line() (see Maglio et al., 2014 for a use of geom_point() with summarized data; see also lollipop graphs).
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point()
See also: jitter plots, strip chart
Above, the points were overlapping each other. We can set the position of the points to be different so they don't overlap anymore. We call this jittering. The position= argument is important for a few of the geoms.
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2))
As stated, outside of aes() will set global colors.
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2), color="blue")
see also: theme, legend
Inside of aes() we define things by variables. Mapping something like color to a variable inside aes() will also lead to a legend appearing, as we can see.
iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2), aes(color=Species))
Points can change their shapes. You can find all the shape options using show_point_shapes() from ggpubr. Though, I will note, when I look at this, the shape numbers don't always match what gets graphed. However, it can at least give you a general idea; use trial and error to find what you want.
ggpubr::show_point_shapes()
iris %>%
ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2),
aes(color=Species),
shape=21)
Similarly, you can move most things from outside of aes() into aes() if you want them to change by group. Inside of aes() is generally how you get things to appear on legends.
iris %>%
ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
geom_point(position=position_jitter(width=.2),
aes(color=Species,
shape=Species))
see also: geom_col()
geom_bar() is a basic bar chart. It has a lot of possible uses, one of which makes it do exactly what geom_col() does if geom_bar() has stat="identity" set as an argument. Its general use is to map the number of observations per group, which by itself is not a very useful visualization.
iris %>%
ggplot(aes(x=Species, group=Species))+
geom_bar(position="dodge",
colour="black",
aes(fill=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Number of Obs. Per Group", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5))
#mimicking geom_col()
iris %>%
group_by(Species) %>%
summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_bar(stat="identity", position="dodge", colour="black", aes(fill=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
see: geom_bar()
geom_col() does what we want geom_bar() to do by default - that is, map the averages of various grouping variables.
iris %>% group_by(Species) %>% summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_col(position="dodge", colour="black", aes(fill=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
Boxplots show the median, the interquartile range, and potential outliers.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_boxplot(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("#FFFFFF", "#123456", "#654321"))
see also: Shapiro(), qqplot()
Histograms help show whether a certain variable is normally distributed.
iris %>% ggplot2::ggplot(aes(x=Sepal.Length))+
geom_histogram(binwidth=.1)
By using facet_wrap() (or facet_grid()), we can even graph these separately by some explanatory variable to see if each group is normally distributed.
iris %>% ggplot2::ggplot(aes(x=Sepal.Length))+
geom_histogram(binwidth=.1)+
facet_wrap(~Species) #Check nrow and ncol as options to adjust.
We could display the above data also using lines instead of bars. For most kinds of visualizations you would see, if we're using geom_line(), we probably want to feed it data in this more compressed form. You should try on your own; really it kind of just comes down to trial and error on how we want them to look. But let's add on geom_line() instead of geom_bar(), and let's pass through two arguments: inside aes() we'll say color is equal to Species, and outside of aes() we'll set size equal to 1.2.
iris %>%
group_by(Species) %>%
summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_line(size=1.2, aes(color=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3,
position=position_dodge(.9))
See: geom_bar(), geom_col(), geom_line()
geom_errorbar() is used to create the error bars however we define them. For Psychology, we generally plot the mean +/- 1 SE bars. I plot them alone here to visually show them, but you can see their true functionality in other plots. The big thing to notice about them is that they take a ymin and a ymax, and so we calculate those ahead of time and place them into our compressed dataset.
iris %>%
group_by(Species) %>%
summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3,
position=position_dodge(.9))
see: annotate
Even better would probably be to still use a geom_, and for that, we have geom_segment(). Notice how we include two dataframes in this ggplot without issue.
x <- c(1, 1.5, 1.5, 1)
xend <- c(1.5, 1.5, 1, 1)
y <- c(2.5, 2.5, 4, 4)
yend <- c(2.5, 4, 4, 2.5)
line <- c(1:4)
box <- data.frame(xend, x, yend, y, line)
rm(xend, x, yend, y, line)
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)+
annotate("text", label="I made a box!", x=1.25, y=3.25)+
geom_segment(data=box, aes(x=x, xend=xend, y=y, yend=yend))
see also: geom_boxplot()
Violin plots are great because they take all of the data (no need to summarize it) and they also show the distribution of the data.
iris %>%
ggplot(aes(x=Species, y=Petal.Length, fill=Species))+
geom_violin()+
scale_color_manual(values=c("red", "blue", "green"))
You can add various things of interest to violin plots. For example, if you want to plot the boxplot on top, we just put that boxplot next in line. Notice in this one, I move fill=Species in the aes() out of the entire ggplot() call and into only the piece I want to color in.
iris %>%
ggplot(aes(x=Species, y=Petal.Length))+
geom_violin(aes(fill=Species))+
geom_boxplot()+
scale_color_manual(values=c("red", "blue", "green"))
see also: chisq.test(), chisq_test()
Mosaic plots are great for visualizing two categorical variables and considering chi-square tests.
For example, imagine we have a plot of gender by admissions to University of California Berkeley.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
We can use mosaicplot() on this table. We can specify the colors, the axis labels, and the main title. Also notice that \n was used to create a line break, which is something we can use in a lot of places.
mosaicplot(UCB.table,
xlab="Admission Status",
ylab="Gender",
main="University of California Berkeley \n Admission by Gender",
color=c("darkred", "gold"))
We can plot predictions in a lot of ways. One way is through sjPlot’s plot_model feature.
library(sjPlot)
model <-
mpg %>%
mutate(dodge = case_when(manufacturer=="dodge" ~ "Dodge",
manufacturer!="dodge" ~ "Not Dodge"),
old = case_when(year==1999 ~ "Old",
year==2008 ~ "New")) %>%
lm(hwy ~ dodge* old + cty, .)
#plots predictions
sjPlot::plot_model(model, type="pred", terms=c("dodge","old"))+
ylab("")+
xlab("")+
ggtitle("")+
guides(color=guide_legend(title=""))
We can also plot the estimates as a forest plot.
model <-
mpg %>%
mutate(dodge = case_when(manufacturer=="dodge" ~ "Dodge",
manufacturer!="dodge" ~ "Not Dodge"),
old = case_when(year==1999 ~ "Old",
year==2008 ~ "New")) %>%
lm(hwy ~ dodge* old + cty, .)
fitted_plot<- plot_model(model,
axis.labels=c("coef1", "coef2", "coef3", "coef4"),
group.terms = c(1, 2,
3, 2),
vline.color = "Black",
colors=c("DarkGrey", "#046C9A", "#DB545E"))+
ggtitle("title")
fitted_plot[["data"]] <- fitted_plot[["data"]][1:4,]
fitted_plot
library(wesanderson)
library(cowplot)
library(PupillometryR)
library(tidyverse)
ggplot(iris,aes(x=Species,y=Petal.Length, fill = Species, colour = Species))+
geom_flat_violin(position = position_nudge(x = .25, y = 0),
adjust =2,
trim = TRUE)+
geom_point(position = position_jitter(width = .15),
size = .25)+
geom_boxplot(aes(x = (Species), y = Petal.Length),
outlier.shape = NA,
alpha = 0.3,
width = .1,
colour = "BLACK") +
theme_cowplot()+
ggtitle("Raincloud Plot")+
scale_fill_manual(values = wes_palette("Zissou1",
n = 3))+
scale_color_manual(values = wes_palette("Zissou1",
n = 3))+
theme(legend.position = "none",
axis.text.x = element_text(size=8))
Lollipop plots are more modern looking barplots, even if they don’t give as much information as a violin plot or a boxplot.
The second example is a created double lollipop, or barbell, plot.
assignment <- c("Attendance", "Participation", "Lead Discussion",
"Discussion Posts", "Reviewing Papers", "Midterm",
"Project Topic", "Method Identification", "Ethics Training",
"Annotated Bibliography", "Class Pres 1", "Submit IRB",
"Class Pres 2", "Poster Presentation", "Final Paper"
)
Management <- c(0, 10, 7.5, 7.5, 10, 15, 0,
2,2, 5, 2.5, 10, 2.5, 10, 17.5)
Union <- c(5, 10, 10, 10, 5, 20, 2,
2, 2, 5, 5, 5, 4, 5, 10)
data_lg <- data.frame(assignment, Management, Union)
data_lg$assignment <- factor(data_lg$assignment,
levels=c("Attendance", "Participation",
"Lead Discussion", "Discussion Posts",
"Reviewing Papers", "Midterm",
"Project Topic", "Method Identification",
"Ethics Training",
"Annotated Bibliography",
"Class Pres 1", "Submit IRB",
"Class Pres 2", "Poster Presentation",
"Final Paper"))
ggplot(data_lg) +
geom_segment(aes(x=assignment,
xend=assignment,
y=Management,
yend=Union),
color="grey") +
geom_point(aes(x=assignment, y=Management, color="Management"), size=3 ) +
geom_point(aes(x=assignment, y=Union, color="Union"), size=3 ) +
scale_color_manual(values=c(rgb(0.7,0.2,0.1,0.5), rgb(0.2,0.7,0.1,0.5)),
name="")+
coord_flip()+
xlab("") +
ylab("Weight of Assignment for Final Grade")+
theme(
legend.position = "right"
)
Pie charts aren’t a good way to visualize things, and ggplot2 makes that very apparent by making it not the smoothest thing to make. Here, I make a donut chart (a pie chart with a hole in the middle) by combining geom_col() with coord_polar().
Other search terms: pie chart, donut chart
df2 <- data.frame(value = c(6, 15, 78),
group = c("Decreasing", "Staying the same", "Increasing"),
hsize= 3)
df2$group <- factor(df2$group, levels=c("Increasing",
"Decreasing",
"Staying the same"))
df2$label <- paste0(df2$value, "%")
ggplot(df2, aes(x = hsize, y = value, fill = group)) +
geom_col(color = "black") +
geom_text(aes(label = label),
position = position_stack(vjust = 0.5)) +
coord_polar(theta = "y") +
scale_fill_manual(values = c("#C27D38","#CCC591","#798E87" ),
name="Partisan divide is...")+
xlim(c(0.2, 3 + 0.5)) +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
legend.position = c(.5,.5))+
ggtitle("Among Democrats...")
other search terms: facet_grid()
facet_wrap() and facet_grid() are ways to add an additional variable onto your visualization. Many times, we don't want the groups to overlap as they did above, but instead want to see them side by side in separate panels.
mpg %>%
mutate(
trans = case_when(
str_detect(trans, "auto") ~ "auto",
str_detect(trans, "manual") ~ "manual"
)) %>%
ggplot(aes(x=displ, y=hwy))+
geom_point()+
facet_wrap(~trans)
There are so many options in theme() that it's hard to even start writing notes about it. If there's a feature of the plot background or the plot itself you want to change, it's probably in theme(). Importantly, theme() almost never takes the words you want displayed. Instead, you should imagine theme() as controlling the format of the plot itself. So, for instance, theme() can edit the font size of the y axis, but it can't set what the y axis label says.
Also, a key thing to use, as you can see in other examples, is using element_blank() to get rid of things.
If you have multiple things, keep them all in a single theme().
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
theme(
panel.background = element_rect(fill = "white"),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
axis.text = element_text(angle=90)
#... use ?theme to find literally over fifty possible options.
)
Makes your x axis label. Put it in quotes.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")
Makes your y axis label. Put it in quotes.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
ylab("Petal Length")
Makes a horizontal line; it needs a yintercept.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_hline(yintercept=4, color="red", linetype=2)
Makes a vertical line; it needs an xintercept.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)
Also search: cld
library(datarium)
library(lmerTest)
library(emmeans)
selfestdat2 <- selfesteem2
SelfEst.Lmer <- selfestdat2 %>%
pivot_longer(cols=-c(id, treatment),
names_to="Time",
values_to="SelfEsteem") %>%
mutate(Time = factor(Time, levels=c("t1", "t2", "t3"))) %>%
lmer(SelfEsteem ~ Time*treatment + (1|id), data=.)
marginal_lmer <- emmeans(SelfEst.Lmer, ~Time)
pairs(marginal_lmer, adjust="tukey")
## contrast estimate SE df t.ratio p.value
## t1 - t2 1.96 0.802 55 2.442 0.0462
## t1 - t3 4.62 0.802 55 5.767 <.0001
## t2 - t3 2.67 0.802 55 3.325 0.0044
##
## Results are averaged over the levels of: treatment
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 3 estimates
CLD = multcomp::cld(marginal_lmer, alpha=0.05,
Letters=letters, ### lc letters
adjust="sidak")
CLD$.group=gsub(" ", "", CLD$.group)
CLD %>% ggplot(aes(x=Time, y=emmean, group=Time, label=.group))+
geom_col(position="dodge", colour="black", aes(fill=factor(Time)))+
#scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(ymin=emmean-SE, ymax=emmean+SE),
width=.3,
position=position_dodge(.9))+
geom_text(nudge_y=10)
annotate() allows the placement of text and lines. This can be done very manually by listing x, y, xend, and yend.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)+
annotate("text", label="Label!", x=1, y=2.5)+
annotate("segment", x=1, xend=1.5, y=4, yend=4)
But if you really wanted to, you can pass an additional dataframe in and use those values to map your lines for you.
x <- c(1, 1.5, 1.5, 1)
xend <- c(1.5, 1.5, 1, 1)
y <- c(2.5, 2.5, 4, 4)
yend <- c(2.5, 4, 4, 2.5)
line <- c(1:4)
box <- data.frame(xend, x, yend, y, line)
rm(xend, x, yend, y, line)
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter()+
xlab("Species")+
geom_vline(xintercept=2, color="red", linetype=2)+
annotate("text", label="I made a box!", x=1.25, y=3.25)+
annotate("segment", x=box$x, xend=box$xend, y=box$y, yend=box$yend)
Review: shape (inside of aes())
If you map color to a variable in an aes(), you can also manually set the colors. Color generally refers to the lines and outlines of a geom_, while fill (below) refers to the colors within the shapes themselves.
See below how this will create a legend and define the colors.
iris %>% ggplot(aes(x=Species, y=Petal.Length, color=Species))+
geom_jitter()+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("blue", "red", "gray"))
This will also do that, but since the aes(color=Species) is outside of the main ggplot() call and inside geom_jitter(), the coloring will only apply to geom_jitter().
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("blue", "red", "gray"))
You don’t have to name the colors explicitly. You can also draw from various palettes. Dr. C uses library(wesanderson) and library(harrypotteR) for his palettes.
library(wesanderson)
wp3 <- wesanderson::wes_palette("Darjeeling2", 3, type="discrete")
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_jitter(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c(wp3))
rm(wp3)
You can also use the HEX codes.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_boxplot(aes(color=Species))+
geom_line()+
xlab("Species")+
scale_color_manual(values=c("#FFFFFF", "#123456", "#654321"))
Now repeat this, but use fill.
iris %>% ggplot(aes(x=Species, y=Petal.Length))+
geom_boxplot(aes(fill=Species))+
xlab("Species")+
scale_fill_manual(values=c("#FFFFFF", "#123456", "#654321"))
ggpattern is its own library. Dr. C hasn't done a lot of looking into it, but it seems to be what you would expect - a way to fill ggplots with patterns instead of colors. The help file seems to have a lot of applications you could check out, but I'm not going to list each here, since it creates new functions that work just like their standard counterparts, but with a few additional arguments. I provide two examples.
#Stealing the code from geom_col()
# The only change made is that it goes to geom_col_pattern()
# and fill=Species becomes pattern=Species.
library(ggpattern)
iris %>% group_by(Species) %>% summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_col_pattern(position="dodge",
colour="black",
aes(pattern=Species))+
scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5)) +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
And yes, you can make it cats too if you’d like.
#Stealing the code from geom_col()
# The only change made is that it goes to geom_col_pattern()
# and fill=Species becomes pattern=Species.
library(ggpattern)
iris %>% group_by(Species) %>% summarise(
n_Sepal.Width = sum(!is.na(Sepal.Width)),
mean_Sepal.Width = mean(Sepal.Width, na.rm=TRUE),
sd_Sepal.Width = sd(Sepal.Width, na.rm=TRUE),
se_Sepal.Width = sd_Sepal.Width / sqrt(n_Sepal.Width)) %>%
ggplot(aes(x=Species, y=mean_Sepal.Width, group=Species))+
geom_col_pattern(position="dodge",
colour="black",
pattern_type='kitten',
pattern='placeholder',
aes(pattern_angle = Species))+
#scale_fill_manual(values=c("grey40", "grey", "grey100"))+
labs(y="Y Axis", x="X Axis", fill="")+
theme(plot.title = element_text(hjust = 0.5),
legend.position="none") +
geom_errorbar(aes(
ymin=mean_Sepal.Width-se_Sepal.Width,
ymax=mean_Sepal.Width+se_Sepal.Width),
width=.3, # Width of the error bars
position=position_dodge(.9))
t.test() works for a bunch of simple t-tests. You can run a two-sample t-test like so, where the formula is read DV ~ IV:
t.test(extra ~ group, data = sleep)
##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
You can compare sample means to theoretical population means.
t.test(sleep$extra, mu=4)
##
## One Sample t-test
##
## data: sleep$extra
## t = -5.4519, df = 19, p-value = 2.927e-05
## alternative hypothesis: true mean is not equal to 4
## 95 percent confidence interval:
## 0.5955845 2.4844155
## sample estimates:
## mean of x
## 1.54
t.test(sleep$extra, mu=4, conf.level = .99)
##
## One Sample t-test
##
## data: sleep$extra
## t = -5.4519, df = 19, p-value = 2.927e-05
## alternative hypothesis: true mean is not equal to 4
## 99 percent confidence interval:
## 0.2490875 2.8309125
## sample estimates:
## mean of x
## 1.54
You can use the additional information exported from t.test to get the confidence intervals of a mean.
t.test(sleep$extra)$conf.int
## [1] 0.5955845 2.4844155
## attr(,"conf.level")
## [1] 0.95
sleep %>% select(extra) %>% t.test() %>% .$conf.int
## [1] 0.5955845 2.4844155
## attr(,"conf.level")
## [1] 0.95
You can use paired data to also test if the two means are different.
sleep2 <- stats::reshape(sleep, direction = "wide",
idvar = "ID", timevar = "group")
t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)
##
## Paired t-test
##
## data: Pair(extra.1, extra.2)
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
But in class we mainly learned it as:
t.test(sleep2$extra.1, sleep2$extra.2, paired=TRUE)
##
## Paired t-test
##
## data: sleep2$extra.1 and sleep2$extra.2
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
see also: anova(), Anova(), TukeyHSD()
aov() stands for analysis of variance. aov() does not take a lm() as an input. However, aov() is used when we want to do our anova() test but we also want to check the post-hoc comparisons.
summary(aov(Petal.Length ~ Species, data=iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey's Honest Significant Difference is a way to get post-hoc tests. It already controls for multiple comparisons by using a distribution (the studentized range) that accounts for the number of comparisons being made. It takes an aov() object as input.
iris %>%
aov(Petal.Length ~ Species, data=.) %>%
TukeyHSD()
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Length ~ Species, data = .)
##
## $Species
## diff lwr upr p adj
## virginica-versicolor 1.292 1.08822 1.49578 0
## setosa-versicolor -2.798 -3.00178 -2.59422 0
## setosa-virginica -4.090 -4.29378 -3.88622 0
Linear regression models are what everything in Psychological Statistics is built off of, and so if we publish a paper, we're using lm(). You can notice that it gives similar information to using aov() and then TukeyHSD().
iris %>%
lm(Petal.Length ~ Species, data=.) %>%
summary()
##
## Call:
## lm(formula = Petal.Length ~ Species, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.260 -0.258 0.038 0.240 1.348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.26000 0.06086 70.00 <2e-16 ***
## Speciesvirginica 1.29200 0.08607 15.01 <2e-16 ***
## Speciessetosa -2.79800 0.08607 -32.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4303 on 147 degrees of freedom
## Multiple R-squared: 0.9414, Adjusted R-squared: 0.9406
## F-statistic: 1180 on 2 and 147 DF, p-value: < 2.2e-16
If we're doing within-subjects stuff, we're using lmer(). lmer() stands for linear mixed-effects regression. It gives you the ability to include things as random effects. Random effects: ask your professor.
Dr. C believes that Blocking should use random effect modeling. Dr. K doesn’t agree.
library(broom.mixed)
library(lme4)
yield <- c(93.5, 66.6, 50.5, 42.4,
102.9, 53.2, 47.4, 43.8,
67.0, 54.7, 50, 40.1,
86.3, 61.3, 50.7, 46.4)
block <- c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4))
variety <- rep(1:4, 4)
Barley <- data.frame(yield, block, variety)
Barley$variety <- as.factor(Barley$variety)
Barley$block <- as.factor(Barley$block)
Barley %>%
mutate(variety = as.factor(variety)) %>%
lmer(yield ~ variety + (1|block), .) %>%
summary()
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: yield ~ variety + (1 | block)
## Data: .
##
## REML criterion at convergence: 90.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.32860 -0.23296 0.00454 0.28642 1.86536
##
## Random effects:
## Groups Name Variance Std.Dev.
## block (Intercept) 5.533 2.352
## Residual 64.288 8.018
## Number of obs: 16, groups: block, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 87.425 4.178 11.778 20.925 1.11e-10 ***
## variety2 -28.475 5.670 9.000 -5.022 0.000717 ***
## variety3 -37.775 5.670 9.000 -6.663 9.24e-05 ***
## variety4 -44.250 5.670 9.000 -7.805 2.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) varty2 varty3
## variety2 -0.679
## variety3 -0.679 0.500
## variety4 -0.679 0.500 0.500
Compare the Estimates of the varieties compared to…
library(multcomp)
library(broom)
Barley %>%
mutate(variety = as.factor(variety)) %>%
lmer(yield ~ variety + (1|block), .) %>%
tidy() %>%
filter(effect=="fixed") %>%
dplyr::select(estimate)
## # A tibble: 4 × 1
## estimate
## <dbl>
## 1 87.4
## 2 -28.5
## 3 -37.8
## 4 -44.2
Barley.lmer <- lmer(yield ~ factor(variety) + (1|block), data=Barley)
marginal_lmer <- emmeans(Barley.lmer, ~variety)
pairs(marginal_lmer, adjust="tukey")
## contrast estimate SE df t.ratio p.value
## 1 - 2 28.48 5.67 9 5.022 0.0033
## 1 - 3 37.77 5.67 9 6.663 0.0004
## 1 - 4 44.25 5.67 9 7.805 0.0001
## 2 - 3 9.30 5.67 9 1.640 0.4053
## 2 - 4 15.78 5.67 9 2.782 0.0833
## 3 - 4 6.47 5.67 9 1.142 0.6747
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 4 estimates
CLD = cld(marginal_lmer, alpha=0.05,
Letters=letters, ### lc letters
adjust="sidak")
CLD
## variety emmean SE df lower.CL upper.CL .group
## 4 43.2 4.18 11.8 30.9 55.4 a
## 3 49.6 4.18 11.8 37.4 61.9 a
## 2 59.0 4.18 11.8 46.7 71.2 a
## 1 87.4 4.18 11.8 75.2 99.7 b
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 4 estimates
## P value adjustment: sidak method for 6 tests
## significance level used: alpha = 0.05
## NOTE: Compact letter displays can be misleading
## because they show NON-findings rather than findings.
## Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
Barley.lm <- lm(yield ~ factor(block) + factor(variety), data=Barley)
marginal = emmeans(Barley.lm, ~variety)
pairs(marginal, adjust="tukey")
## contrast estimate SE df t.ratio p.value
## 1 - 2 28.48 5.67 9 5.022 0.0033
## 1 - 3 37.77 5.67 9 6.663 0.0004
## 1 - 4 44.25 5.67 9 7.805 0.0001
## 2 - 3 9.30 5.67 9 1.640 0.4053
## 2 - 4 15.78 5.67 9 2.782 0.0833
## 3 - 4 6.47 5.67 9 1.142 0.6747
##
## Results are averaged over the levels of: block
## P value adjustment: tukey method for comparing a family of 4 estimates
CLD = cld(marginal, alpha=0.05,
Letters=letters, ### lc letters
adjust="sidak")
CLD
## variety emmean SE df lower.CL upper.CL .group
## 4 43.2 4.01 9 30.8 55.6 a
## 3 49.6 4.01 9 37.2 62.1 a
## 2 59.0 4.01 9 46.5 71.4 a
## 1 87.4 4.01 9 75.0 99.8 b
##
## Results are averaged over the levels of: block
## Confidence level used: 0.95
## Conf-level adjustment: sidak method for 4 estimates
## P value adjustment: sidak method for 6 tests
## significance level used: alpha = 0.05
## NOTE: Compact letter displays can be misleading
## because they show NON-findings rather than findings.
## Consider using 'pairs()', 'pwpp()', or 'pwpm()' instead.
And note:
see: summarystats()
Dr. C has his own function that pulls from this to get summary stats of scales. However, alpha can be thought of as a summary of the correlations between every possible pair of items in a scale. Items = questions; many scales have more than one question.
psych::alpha wants a dataset of only the items in a scale. It also will do a decent job detecting if you forgot to reverse code something.
library(psych)
library(tidyverse)
iris %>%
ungroup() %>%
dplyr::select(Petal.Length, Petal.Width, Sepal.Length) %>%
psych::alpha(.)
##
## Reliability analysis
## Call: psych::alpha(x = .)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.88 0.96 0.96 0.88 23 0.0067 3.6 1.1 0.87
##
## lower alpha upper 95% confidence boundaries
## 0.86 0.88 0.89
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Petal.Length 0.90 0.90 0.82 0.82 9 0.0165 NA 0.82
## Petal.Width 0.80 0.93 0.87 0.87 14 0.0164 NA 0.87
## Sepal.Length 0.82 0.98 0.96 0.96 52 0.0079 NA 0.96
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Petal.Length 150 0.99 0.98 0.99 0.96 3.8 1.77
## Petal.Width 150 0.97 0.96 0.96 0.94 1.2 0.76
## Sepal.Length 150 0.92 0.93 0.87 0.86 5.8 0.83
Cohen's d is one measure of effect size: the difference between two means divided by the pooled standard deviation. Treat it like any normal equation. Note: effect sizes are calculated outside of considerations of p values. That is, you can still report effect sizes with non-significant results.
library(lsr)
starwars %>% cohensD(mass ~ gender, .)
## [1] 0.3004285
Functions are created using function(), with the arguments inside the parentheses and the body of the function inside curly brackets.
coolcat <- function(x){
paste0(x, " is a cool cat")
}
coolcat("Dr. C")
## [1] "Dr. C is a cool cat"
see also: []
Imagine we have a data frame of three columns. Two of which we think are numbers, but R is reading them as characters. We know there’s a function as.numeric(). sapply() can apply a single function across multiple columns. To do that, we can pass a list of column names to target.
example1 <- c("1", "2", "3")
example2 <- c("3", "4", "5")
notthiscolumn <- c("a", "b", "c")
exampledf <- data.frame(example1, example2, notthiscolumn)
summary(exampledf)
## example1 example2 notthiscolumn
## Length:3 Length:3 Length:3
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
listofcols <- c("example1", "example2")
exampledf[listofcols] <- sapply(exampledf[listofcols], as.numeric)
summary(exampledf)
## example1 example2 notthiscolumn
## Min. :1.0 Min. :3.0 Length:3
## 1st Qu.:1.5 1st Qu.:3.5 Class :character
## Median :2.0 Median :4.0 Mode :character
## Mean :2.0 Mean :4.0
## 3rd Qu.:2.5 3rd Qu.:4.5
## Max. :3.0 Max. :5.0
see also: sapply
factorise() is the lab’s function to quickly change most scale wordings to numeric values. It grows as we use various anchors and also tries to cover for capitalization changes based on student. It can take a single variable, but for efficiency, we can use sapply to pass it through multiple variables at once.
factorise <- function(x) {
dplyr::case_when(x %in% c("No") ~ 0,
x %in% c("Strongly disagree", "Strongly Disagree",
"Not well at all",
"Not at all",
"Much less than the average person",
"Very unconcerned", "very unconcerned",
"Very unimportant", "Yes") ~ 1,
x %in% c("Disagree",
"Slightly well",
"Not very much",
"Less than the average person",
"Slightly unconcerned", "slightly unconcerned",
"Slightly unimportant") ~ 2,
x %in% c("Somewhat disagree", "Somewhat Disagree",
"Moderately well",
"Barely at all",
"Slightly less than the average person",
"Neither concerned nor unconcerned",
"neither concerned nor unconcerned",
"3. Slightly confident ",
"Neither important nor unimportant") ~ 3,
x %in% c("Neither agree nor disagree", "Neither disagree nor agree",
"Neither Agree Nor Disagree", "Neither Disagree Nor Agree",
"Extremely well",
"Neutral",
"About the same as the average person",
"Slightly concerned", "slightly concerned",
"Slightly important") ~ 4,
x %in% c("Somewhat agree", "Somewhat Agree",
"A fair amount",
"Slightly more than the average person",
"Very concerned", "very concerned",
"Very important") ~ 5,
x %in% c("Agree",
"Very Much",
"More than the average person") ~ 6,
x %in% c("Strongly agree", "Strongly Agree",
"Extremely",
"Much more than the average person") ~ 7
)
}
data$var1 <- rep(c("Agree", "Disagree"), 75)
data$var2 <- rep(c("Strongly disagree", "Strongly agree"), 75)
data$var3 <- rep(c("Strongly Agree", "Strongly Disagree"), 75)
#List the variables that we want to run factorise on. Use quotes.
columns <- c("var1", "var2", "var3")
#Use sapply for all data where the columns can be called using quotes.
#To realize why this is working, reference "Brackets".
#run factorise on those columns.
data[columns] <- sapply(data[columns], factorise)
#see they're all numbers. Also note the difference in caps was covered in var 2-3.
table(data$var1)
##
## 2 6
## 75 75
table(data$var2)
##
## 1 7
## 75 75
table(data$var3)
##
## 1 7
## 75 75
summarystats() is the lab's quick way to get the alpha, mean, and standard deviation of a scale. It takes two arguments. The first is a dataset of only the columns that include the questions that make up the scale; prior to running summarystats(), you would need to create that dataset using select(). The second argument is the overall averaged variable in our main dataset; to make this one, you would need to use rowMeans() to create an overall average of the scale that will be used in the analysis.
summarystats <- function(input1, input2){
list <- c(psych::alpha(input1)$total[1],
mean=mean(input2, na.rm=TRUE),
sd=sd(input2, na.rm=TRUE))
return(list)
}
data$Sepal.Example <- rowMeans(
subset(
data,
select=c(
Sepal.Length,
Sepal.Width
)))
data_select <- data %>% subset(
select=c(
Sepal.Length,
Sepal.Width
)
)
summarystats(data_select, data$Sepal.Example)
## Number of categories should be increased in order to count frequencies.
## Warning in psych::alpha(input1): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( Sepal.Length ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Warning in sqrt(Vtc): NaNs produced
## $raw_alpha
## [1] -0.214637
##
## $mean
## [1] 4.950333
##
## $sd
## [1] 0.4446361
rm(data_select)
If you’re in BIO-245/MTH-245, you might need some other functions. This is also more for me to remember some important functions for teaching, but also for your own learning.
qt() will give the t-value for a given alpha level (don't forget to divide alpha by two if you are considering a two-tailed test) and a given degrees of freedom. Remember that as the degrees of freedom increase, the t distribution becomes more and more shaped like the z distribution, which means that qt(.025) will approach -1.96 (1.96 in absolute value) as N increases. Likewise, for a one-tailed alpha of .05 (a two-tailed alpha of .10), qt(.05, lower.tail=FALSE) will approach 1.64 as N increases.
qt(.025, 10)
## [1] -2.228139
qt(.025, 1000000)
## [1] -1.959966
qt(.05, 10, lower.tail=FALSE)
## [1] 1.812461
qt(.05, 1000000, lower.tail=FALSE)
## [1] 1.644855
Given a t-statistic and a degrees of freedom, we can use pt() to tell us the probability to the left or the right. Without lower.tail = FALSE, it will give us the left side. Many times, when calculating statistics by hand, it will be important to multiply this pt() result by 2, since we care about a two-tailed test.
For example, we know that at high N, t (or z) = 1.96 corresponds to a 5% p value. To get that, you need pt(1.96, 10000000, lower.tail=FALSE)*2. If you get a p value higher than 1, you are looking at the wrong side.
pt(2.228139, 10)
## [1] 0.975
pt(-1.959966, 1000000)#gets us .025
## [1] 0.02500002
pt(-1.959966, 1000000, lower.tail=FALSE) #be careful about negatives / sides.
## [1] 0.975
pt(-1.959966, 1000000, lower.tail=FALSE)*2 #wrong side again: a "p value" above 1 means you need to flip tails
## [1] 1.95
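To get the correct two-tailed value in code, a small sketch mirroring the prose above (using the positive t so the upper tail is the correct side):
pt(1.959966, 1000000, lower.tail=FALSE)*2 #area to the right of +1.96, doubled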
pnorm(), with lower.tail=FALSE, gives the area (probability) to the right under a standard normal curve. For example, knowing that z=1.96 corresponds to 5% in a two-tailed test:
pnorm(1.96, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.0249979
pnorm(1.96, mean=0, sd=1, lower.tail=FALSE)*2
## [1] 0.04999579
We can use pnorm to calculate the space in between two points.
If the WAIS (an adult intelligence test) has a mean of 100 and a standard deviation of 15, how do we solve what percentage of people fall between 80 and 120?
lb <- pnorm(80, mean=100, sd=15, lower.tail=FALSE)
ub <- pnorm(120, mean=100, sd=15, lower.tail=FALSE)
lb-ub
## [1] 0.8175776
qnorm() goes the other direction: given a probability value, it finds the z-score related to it.
qnorm(.975, mean=0, sd=1)
## [1] 1.959964
Also known as the Mann-Whitney U test, this is the non-parametric alternative to the t-test for two groups, comparing ranks rather than means. wilcox.test() can handle both unpaired data (the rank-sum test) and paired data (the signed-rank test); a paired sketch follows the unpaired example below.
wilcox.test(Ozone ~ Month, data = airquality,
subset = Month %in% c(5, 8))
##
## Wilcoxon rank sum test with continuity correction
##
## data: Ozone by Month
## W = 127.5, p-value = 0.0001208
## alternative hypothesis: true location shift is not equal to 0
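As mentioned above, the paired (signed-rank) version works too; a minimal sketch using two iris columns purely for illustration (R may warn about ties here, which is fine for a demo):
wilcox.test(iris$Petal.Length, iris$Petal.Width, paired=TRUE)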
We can run a permutation t-test. Here we do it for two samples.
library(exactRankTests)
mtcars %>%
mutate(am= as.factor(am)) %>%
perm.test(mpg ~ am, .)
##
## 2-sample Permutation Test (scores mapped into 1:(m+n) using rounded
## scores)
##
## data: mpg by am
## T = 174, p-value = 0.0003442
## alternative hypothesis: true mu is not equal to 0
Or we can do it for one sample comparing against some specific population mean.
perm.test(x=mtcars$mpg, mu=15)
##
## 1-sample Permutation Test (scores mapped into 1:m using rounded
## scores)
##
## data: mtcars$mpg
## T = 325, p-value = 4.833e-06
## alternative hypothesis: true mu is not equal to 15
Are the differences more positive or more negative? The sign test simply counts the signs of the paired differences. It is a basic non-parametric paired test.
Note: Here, we want data$variable.
library(BSDA)
BSDA::SIGN.test(iris$Petal.Length, iris$Petal.Width)
##
## Dependent-samples Sign-Test
##
## data: iris$Petal.Length and iris$Petal.Width
## S = 150, p-value < 2.2e-16
## alternative hypothesis: true median difference is not equal to 0
## 95 percent confidence interval:
## 2.7 3.0
## sample estimates:
## median of x-y
## 2.9
##
## Achieved and Interpolated Confidence Intervals:
##
## Conf.Level L.E.pt U.E.pt
## Lower Achieved CI 0.9400 2.7 3
## Interpolated CI 0.9500 2.7 3
## Upper Achieved CI 0.9591 2.7 3
For hand-calculation questions like building a confidence interval for the difference between two means, R is simply a glorified calculator.
ng1=50
ng2=60
mg1=80.44
mg2=78.42
sdg1 = 3.52
sdg2 = 2.87
dfg1 = 49
dfg2 = 59
meandiff <- mg1-mg2
tval = qt(.025, (dfg2+dfg1), lower.tail = FALSE)
pooledvar = ( (dfg1*(sdg1^2)) + (dfg2*(sdg2^2)) ) /(dfg1+dfg2)
sediff = sqrt(pooledvar*( (1/ng1) + (1/ng2) ) )
meandiff-(tval*sediff)
## [1] 0.8124732
#CI high
meandiff+(tval*sediff)
## [1] 3.227527
qf() gets us the F critical value. It takes the alpha, the degrees of freedom of the numerator, the degrees of freedom of the denominator, and lower.tail=FALSE.
qf(.025, 2, 26, lower.tail=FALSE)
## [1] 4.265483
pf() gets us the p value.
It takes the F statistic, the degrees of freedom between, the degrees of freedom within, and then be sure to set lower.tail=FALSE.
pf(6.1904, 2, 26, lower.tail=FALSE)
## [1] 0.006326612
Anova() with a capital A comes from the car package. It is useful for being more conservative in how sums of squares are calculated, because it allows us to use Type II or Type III sums of squares. Type II is good when the interaction is non-significant, while Type III is good if the interaction is significant.
anova(), on the other hand, uses Type I sums of squares. In class, we will talk about how this is less than ideal, as it calculates sums of squares in sequential order. However, anova() has its uses, such as testing the fits of two models against each other to see if a model was improved or not, something that Anova() cannot do.
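For example, a minimal sketch of that model-comparison use of anova(), using iris purely for illustration (this is not a class example; the model names are mine):
smaller <- lm(Sepal.Length ~ Petal.Length, data=iris)
larger <- lm(Sepal.Length ~ Petal.Length + Species, data=iris)
anova(smaller, larger) #tests whether adding Species improved the fit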
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.5
##
## Attaching package: 'carData'
## The following objects are masked from 'package:BSDA':
##
## Vocab, Wool
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
#Build a dataset, R doesn't have good interaction ones.
gender <- gl(n = 2, k = 20, labels = c("male","female"))
treat <- rep(rep(c("yes","no"), each=10),2)
set.seed(131)
resp <- c(
rnorm(n = 20, mean = rep(c(15,10), each =10)),
rnorm(n = 20, mean = rep(c(10,15), each =10))
)
example <- data.frame(gender, treat, resp)
example %>%
lm(resp ~ gender * treat, .) %>%
anova
## Analysis of Variance Table
##
## Response: resp
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 0.043 0.043 0.0390 0.8447
## treat 1 0.061 0.061 0.0555 0.8151
## gender:treat 1 258.274 258.274 236.4059 <2e-16 ***
## Residuals 36 39.330 1.093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
example %>%
lm(resp ~ gender * treat, .) %>%
Anova(type="II")
## Anova Table (Type II tests)
##
## Response: resp
## Sum Sq Df F value Pr(>F)
## gender 0.043 1 0.0390 0.8447
## treat 0.061 1 0.0555 0.8151
## gender:treat 258.274 1 236.4059 <2e-16 ***
## Residuals 39.330 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, our general anova() command reports a significant interaction with no significant main effects. The same output is received if using Type II. However, if trying Type III:
example %>%
lm(resp ~ gender * treat, .) %>%
Anova(type="III")
## Anova Table (Type III tests)
##
## Response: resp
## Sum Sq Df F value Pr(>F)
## (Intercept) 983.00 1 899.77 < 2.2e-16 ***
## gender 125.84 1 115.19 9.088e-13 ***
## treat 125.21 1 114.61 9.743e-13 ***
## gender:treat 258.27 1 236.41 < 2.2e-16 ***
## Residuals 39.33 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(example, aes(x=gender, y=resp, group=treat))+
  geom_col(position=position_dodge(.9))
see: pivot_longer()
There are a few ways to calculate repeated measures ANOVAs. The rstatix library is the only one that provides straightforward assumption tests built into the output, so it is the preferred method here. Its options are also relatively straightforward.
Remember that for repeated measures, we want our data in long form, so review pivot_longer() for more help.
For a one-way within subjects ANOVA, anova_test needs the response variable (dv=), the within subjects ID identifier (wid=), and the within variable(within=).
library(datarium)
library(rstatix)
library(tidyverse)
selfestdat <- selfesteem
selfestdat <- selfestdat %>%
pivot_longer(cols=-c(id),
names_to="Time",
values_to="SelfEsteem")
selfestdat %>%
anova_test(dv=SelfEsteem, wid=id, within=Time)
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Time 2 18 55.469 2.01e-08 * 0.829
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Time 0.551 0.092
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 Time 0.69 1.38, 12.42 2.16e-06 * 0.774 1.55, 13.94 6.03e-07
## p[HF]<.05
## 1 *
You can cut out the other information by piping the result into get_anova_table(), as follows.
selfestdat %>%
anova_test(dv=SelfEsteem, wid=id, within=Time) %>%
get_anova_table()
## ANOVA Table (type III tests)
##
## Effect DFn DFd F p p<.05 ges
## 1 Time 2 18 55.469 2.01e-08 * 0.829
Nothing too different from the one-way version; just list both variables in the within= argument.
library(datarium)
selfestdat2 <- selfesteem2
selfestdat2 <- selfestdat2 %>%
pivot_longer(cols=-c(id, treatment),
names_to="Time",
values_to="SelfEsteem")
selfestdat2 %>%
anova_test(dv=SelfEsteem, wid=id, within=c(Time, treatment))
## ANOVA Table (type III tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 Time 2 22 27.369 1.08e-06 * 0.049
## 2 treatment 1 11 15.541 2.00e-03 * 0.059
## 3 Time:treatment 2 22 30.424 4.63e-07 * 0.050
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 Time 0.469 0.023 *
## 2 Time:treatment 0.616 0.089
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF]
## 1 Time 0.653 1.31, 14.37 5.03e-05 * 0.705 1.41, 15.52
## 2 Time:treatment 0.723 1.45, 15.9 1.25e-05 * 0.803 1.61, 17.66
## p[HF] p[HF]<.05
## 1 2.81e-05 *
## 2 4.82e-06 *
Let's create a dataset to go through this. Imagine we have some response variable called reactionTime, a between-subjects treatment (Long vs. Short), a within-subjects effect of Time (Weeks 1-3), and six subjects observed over these three weeks.
The code is very similar to above, but this time, you just tell anova_test() which variable is the between, and which is the within.
subject <- c(rep("s1", 3), rep("s2", 3),
rep("s3", 3), rep("s4", 3),
rep("s5", 3), rep("s6", 3))
treat <- c(
rep(
c(
rep("Long", 3), rep("Short", 3)), 3)
)
time <- rep(c("w1", "w2", "w3"), 6)
reactionTime <- c( 466, 520, 502, 475,
494, 490, 516, 566,
577, 491, 544, 526,
484, 529, 539, 470,
511, 528)
mixeddata <- data.frame(subject, treat, time, reactionTime)
rm(subject, treat, time, reactionTime)
mixeddata %>%
anova_test(dv=reactionTime, wid=subject, within=time, between=treat)
## ANOVA Table (type II tests)
##
## $ANOVA
## Effect DFn DFd F p p<.05 ges
## 1 treat 1 4 0.957 0.3830 0.172
## 2 time 2 8 29.638 0.0002 * 0.495
## 3 treat:time 2 8 0.717 0.5170 0.023
##
## $`Mauchly's Test for Sphericity`
## Effect W p p<.05
## 1 time 0.854 0.789
## 2 treat:time 0.854 0.789
##
## $`Sphericity Corrections`
## Effect GGe DF[GG] p[GG] p[GG]<.05 HFe DF[HF] p[HF]
## 1 time 0.872 1.74, 6.98 0.000462 * 1.49 2.98, 11.92 0.0002
## 2 treat:time 0.872 1.74, 6.98 0.503000 1.49 2.98, 11.92 0.5170
## p[HF]<.05
## 1 *
## 2
see lmer()
As stated, there are other ways to get similar results.
library(lmerTest)
selfestdat %>%
lmerTest::lmer(SelfEsteem ~ Time + (1|id), .) %>%
anova()
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Time 102.46 51.228 2 27 65.261 4.564e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
See: qqline(), base plotting
QQ plots tell us how well a given variable fits the normal distribution. If the variable comes from a normal distribution, the points should fall on the line.
qqnorm(iris$Sepal.Length)
Thus, something like this, where we are generating random numbers from a normal distribution, should fit very well.
qqnorm(rnorm(1000, mean=0, sd=1))
See: qqnorm(), base plotting
qqline() goes after the qqnorm to plot the line that the points should lie on.
qqnorm(iris$Sepal.Length)
qqline(iris$Sepal.Length)
The car package's qqPlot() gives us the QQ plot, the line, and a confidence interval around it.
library(car)
qqPlot(iris$Sepal.Length)
## [1] 132 118
hist() is base plotting’s fast way to plot a histogram.
hist(iris$Sepal.Width)
To change the x and y labels, pass xlab= and ylab= inside.
hist(iris$Sepal.Length, xlab="length of sepals")
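A sketch changing the y label and the main title as well (ylab= and main= are the base plotting arguments for those):
hist(iris$Sepal.Length, xlab="length of sepals", ylab="number of flowers", main="Histogram of sepal length")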
See also: residuals(), histogram(), lm()
These tests check whether data is normally distributed. shapiro.test() is useful for testing a single variable (or a model's residuals) as a whole, while nor.test() from library(onewaytests) allows you to test each subgroup.
For a single variable:
shapiro.test(iris$Sepal.Length)
##
## Shapiro-Wilk normality test
##
## data: iris$Sepal.Length
## W = 0.97609, p-value = 0.01018
Remember that shapiro just wants a list of numbers, and residuals are also a list of numbers.
model <- iris %>% lm(Sepal.Length ~ Species, .)
shapiro.test(residuals(model))
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.9879, p-value = 0.2189
or
shapiro.test(model$residuals)
##
## Shapiro-Wilk normality test
##
## data: model$residuals
## W = 0.9879, p-value = 0.2189
For a variable across groups, you have two options. The first is using onewaytests; nor.test() also plots a lot of great information for us.
library(onewaytests)
iris %>%
nor.test(Sepal.Length ~ Species, .)
##
## Shapiro-Wilk Normality Test (alpha = 0.05)
## --------------------------------------------------
## data : Sepal.Length and Species
##
## Level Statistic p.value Normality
## 1 versicolor 0.9778357 0.4647370 Not reject
## 2 virginica 0.9711794 0.2583147 Not reject
## 3 setosa 0.9776985 0.4595132 Not reject
## --------------------------------------------------
The second is using the rstatix package, which has converted most analyses to be very happy with piping. We are considering moving most of the course towards rstatix, but the problem is the textbook does not follow suit.
library(rstatix)
iris %>%
group_by(Species) %>%
shapiro_test(Sepal.Length)
## # A tibble: 3 × 4
## Species variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 versicolor Sepal.Length 0.978 0.465
## 2 virginica Sepal.Length 0.971 0.258
## 3 setosa Sepal.Length 0.978 0.460
pbinom(q, size, prob)
Put simply, pbinom returns the area to the left of a given value q in the binomial distribution. If you're interested in the area to the right of a given value q, you can simply add the argument lower.tail = FALSE.[^1]
[^1]: https://www.statology.org/dbinom-pbinom-qbinom-rbinom-in-r/
Ando flips a fair coin 5 times. What is the probability that the coin lands on heads more than 2 times?
pbinom(q=2, size=5, prob=.5, lower.tail=FALSE)
## [1] 0.5
Suppose Tyler scores a strike on 30% of his attempts when he bowls. If he bowls 10 times, what is the probability that he scores 4 or fewer strikes?
pbinom(q=4, size=10, prob=.3)
## [1] 0.8497317
dbinom(x, size, prob)
“Put simply, dbinom finds the probability of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed (prob).” [^2]
[^2]: ibid.
Bob makes 60% of his free-throw attempts. If he shoots 12 free throws, what is the probability that he makes exactly 10?
dbinom(10, 12, .60)
## [1] 0.06385228
What is the probability of throwing five die and getting exactly three of the same number?
dbinom(x=3, size=5, prob=(1/6))
## [1] 0.03215021
The probability of rolling “4” exactly three times in five rolls of a die can be written as:
\[ Pr[X=3] = \binom{n}{X}p^X(1-p)^{(n-X)} = \frac{n!}{X!(n-X)!}p^X(1-p)^{(n-X)} \]
\[ Pr[X=3] = \frac{5!}{3!(5-3)!}\left(\frac{1}{6}\right)^{3}\left(1-\frac{1}{6}\right)^{(5-3)} \]
binom.test() will give us the probability of rolling 3 fours or MORE EXTREME (remember your p value definitions).
binom.test(x=3, n=5, p=(1/6))
##
## Exact binomial test
##
## data: 3 and 5
## number of successes = 3, number of trials = 5, p-value = 0.03549
## alternative hypothesis: true probability of success is not equal to 0.1666667
## 95 percent confidence interval:
## 0.1466328 0.9472550
## sample estimates:
## probability of success
## 0.6
However, in a hypothesis testing scenario, we want to multiply our p value by two since it is a two-tailed test.
2*binom.test(x=531,n=648, p=0.5)$p.value
## [1] 1.41102e-63
What is the confidence interval for the proportion of people struck by lightning who are men?
library(binom)
binom.confint(x=531, n=648, method="ac")
## method x n mean lower upper
## 1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099
The default setting for binom.confint is to give all of the possible ways to calculate the confidence intervals. This is the same as setting method="all".
binom.confint(x=531, n=648)
## method x n mean lower upper
## 1 agresti-coull 531 648 0.8194444 0.7879139 0.8472099
## 2 asymptotic 531 648 0.8194444 0.7898285 0.8490604
## 3 bayes 531 648 0.8189522 0.7891231 0.8482295
## 4 cloglog 531 648 0.8194444 0.7876056 0.8469808
## 5 exact 531 648 0.8194444 0.7876231 0.8483216
## 6 logit 531 648 0.8194444 0.7879178 0.8471935
## 7 probit 531 648 0.8194444 0.7883190 0.8475301
## 8 profile 531 648 0.8194444 0.7885988 0.8477696
## 9 lrt 531 648 0.8194444 0.7885989 0.8477662
## 10 prop.test 531 648 0.8194444 0.7871576 0.8478685
## 11 wilson 531 648 0.8194444 0.7879734 0.8471504
Tests for the homogeneity of variances between two or more groups. If it's significant, it suggests that there is not homogeneity of variances (we reject the null hypothesis that there is no difference between the variances of the groups).
library(car)
library(tidyverse)
iris %>% leveneTest(Sepal.Length ~ Species, center=mean, .)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 2 7.3811 0.0008818 ***
## 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#leveneTest(Sepal.Length ~ Species, data=iris, center=mean) #old code
See cor.test(); plain cor() on its own is not worth running.
Given two variables, you can test the correlation between them. This is just cor() with more detail, providing us the confidence intervals, the estimate, the p value, df, and the t statistic.
cor.test(iris$Petal.Length, iris$Sepal.Length)
##
## Pearson's product-moment correlation
##
## data: iris$Petal.Length and iris$Sepal.Length
## t = 21.646, df = 148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8270363 0.9055080
## sample estimates:
## cor
## 0.8717538
You can also get the estimate directly through asking for that part of cor.test() by using $estimate.
cor.test(iris$Petal.Length, iris$Sepal.Length)$estimate
## cor
## 0.8717538
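Other pieces can be pulled out the same way; a quick sketch:
cor.test(iris$Petal.Length, iris$Sepal.Length)$p.value
cor.test(iris$Petal.Length, iris$Sepal.Length)$conf.int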
Given a dataset of only the numeric columns you want correlations for, we can use corr.test() to output the correlations. Here, I ungroup iris (I think I grouped it a while back), select out the one non-numeric column, and then run corr.test() on it.
You also could select certain variables if you wanted to know just their correlations too.
library(tidyverse)
library(psych)
iris %>%
ungroup() %>%
dplyr::select(-Species) %>%
corr.test()
## Call:corr.test(x = .)
## Correlation matrix
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Example
## Sepal.Length 1.00 -0.12 0.87 0.82 0.87
## Sepal.Width -0.12 1.00 -0.43 -0.37 0.38
## Petal.Length 0.87 -0.43 1.00 0.96 0.60
## Petal.Width 0.82 -0.37 0.96 1.00 0.58
## Sepal.Example 0.87 0.38 0.60 0.58 1.00
## Sepal.Example2 0.87 0.38 0.60 0.58 1.00
## Sepal.Example2
## Sepal.Length 0.87
## Sepal.Width 0.38
## Petal.Length 0.60
## Petal.Width 0.58
## Sepal.Example 1.00
## Sepal.Example2 1.00
## Sample Size
## [1] 150
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Example
## Sepal.Length 0.00 0.15 0 0 0
## Sepal.Width 0.15 0.00 0 0 0
## Petal.Length 0.00 0.00 0 0 0
## Petal.Width 0.00 0.00 0 0 0
## Sepal.Example 0.00 0.00 0 0 0
## Sepal.Example2 0.00 0.00 0 0 0
## Sepal.Example2
## Sepal.Length 0
## Sepal.Width 0
## Petal.Length 0
## Petal.Width 0
## Sepal.Example 0
## Sepal.Example2 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
This works in console, and I don’t have the time or energy to determine why it doesn’t want to knit. However, the point is wrapping corr.test() in print(), and setting short=FALSE will also give the confidence intervals of each correlation pair.
mtcars %>%
dplyr::select(mpg, cyl, wt) %>%
psych::corr.test() %>%
print(., short=FALSE)
x <- c(6,1,3,2,5)
y <- c(6,7,3,2,14)
df <- data.frame(x,y)
Stdx <- (df$x-mean(df$x))/sd(df$x) #print fill
Stdy <- (df$y-mean(df$y))/sd(df$y) #print fill
c(sum(Stdx), sum(Stdy))
## [1] 1.942890e-16 -4.857226e-16
(Stdx)*(Stdy)
## [1] -0.1062054 -0.1470537 0.1388840 0.6290629 1.2417864
SumProd <- sum((Stdx*Stdy)) #print to fill
r <-(1/(sum(!is.na(df$x))-1))*SumProd
r
## [1] 0.4391186
To get CI’s, where:
\[ z_r = \frac{1}{2}\ln(\frac{1+r}{1-r}) \]
\[ CI = z_r \pm 1.96\cdot\frac{1}{\sqrt{n-3}} \] \[ r_{low} = \frac{e^{2\cdot CI_{low}}-1}{e^{2\cdot CI_{low}}+1} \] \[ r_{high} = \frac{e^{2\cdot CI_{high}}-1}{e^{2\cdot CI_{high}}+1} \]
#Continuing from prior code
z <- 1/2*log((1+r)/(1-r))
cilow <- z-(1.96*(1/sqrt(sum(!is.na(df$x))-3)))
cihigh <-z+(1.96*(1/sqrt(sum(!is.na(df$x))-3)))
rlow <- (exp(2*cilow)-1)/(exp(2*cilow)+1)
rhigh <-(exp(2*cihigh)-1)/(exp(2*cihigh)+1)
rlow
## [1] -0.7234239
rhigh
## [1] 0.9524071
To get the t statistic, where:
\[ t = \frac{r-0}{SE_r} = \frac{r-0}{\sqrt{(1-r^2)/(n-2)}} \]
t <- (r-0)/sqrt((1-(r*r))/(sum(!is.na(df$x))-2))
t
## [1] 0.8465617
qt(.025, 3) #this is the t-crit value we need to reach
## [1] -3.182446
pt(t, 3, lower.tail=FALSE)*2 #this is the p value
## [1] 0.4594243
cor.test(df$x, df$y, method="pearson")
##
## Pearson's product-moment correlation
##
## data: df$x and df$y
## t = 0.84656, df = 3, p-value = 0.4594
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7234117 0.9524048
## sample estimates:
## cor
## 0.4391186
chisq.test() takes a different kind of data than what we're used to - it wants things as a table.
Note: This is a really cool dataset, and for simplicity, I am not presenting the full results. The actual data shows that there isn’t a bias in Admissions at UCB - there is a bias in what genders apply to what departments, and in what numbers.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
UCB.table %>% chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 91.61, df = 1, p-value < 2.2e-16
see: chisq.test, mosaicplot, chisq_test, chisq_descriptives
The assumptions of a chi-square test are that no cell has an expected count below one and that fewer than 25% of the cells have expected counts under 5. We can check this by asking chisq.test() to print out the expected counts.
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
chisq.test(UCB.table)$expected
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
#Pipe Version
UCB.table %>% chisq.test() %>% .$expected
## Gender
## Admit Male Female
## Admitted 1043.461 711.5389
## Rejected 1647.539 1123.4611
Given a list of expected proportions, we can pass that list into the p= argument.
observed <- c(50, 60, 40, 47, 53)
expected <- c(.2, .2, .2, .2, .2)
chisq.test(observed, p=expected)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 4.36, df = 4, p-value = 0.3595
observed <- c(50, 60, 40, 47, 53)
expected <- c(.1, .1, .1, .1, .6)
chisq.test(observed, p=expected)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 165.09, df = 4, p-value < 2.2e-16
If we calculated this list by hand and, due to rounding, it doesn't quite add up to 1, we can add rescale.p = TRUE to fix that.
observed <- c(50, 60, 40, 47, 53)
expected <- c(.1, .1, .1, .1, .599) #adds up to .999
chisq.test(observed, p=expected, rescale.p=TRUE)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 164.7, df = 4, p-value < 2.2e-16
see: chisq.test, mosaicplot, chisq_descriptives
The same approach can be done using rstatix’s chisq_test.
library(rstatix)
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
UCB.table %>% chisq_test()
## # A tibble: 1 × 6
## n statistic p df method p.signif
## * <dbl> <dbl> <dbl> <int> <chr> <chr>
## 1 4526 91.6 1.06e-21 1 Chi-square test ****
see: chisq.test, mosaicplot, chisq_test
And rstatix provides other important information, such as our p-hats (the observed proportions). It prints our expected counts as well.
library(rstatix)
UCB.table <- margin.table(UCBAdmissions, c(1,2))
UCB.table
## Gender
## Admit Male Female
## Admitted 1198 557
## Rejected 1493 1278
UCB.table %>% chisq_test() %>% chisq_descriptives()
## # A tibble: 4 × 9
## Admit Gender observed prop row.prop col.prop expected resid std.resid
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Admitted Male 1198 0.265 0.683 0.445 1043. 4.78 9.60
## 2 Rejected Male 1493 0.330 0.539 0.555 1648. -3.81 -9.60
## 3 Admitted Female 557 0.123 0.317 0.304 712. -5.79 -9.60
## 4 Rejected Female 1278 0.282 0.461 0.696 1123. 4.61 9.60
Notice that row.prop is the proportion of a given subset (Admitted-Male) for all of those in a given row (Admitted). This is the same as doing:
1198/(1198+557)
## [1] 0.6826211
That is to say, males make up 68.3% of those admitted, while males make up 53.9% of those rejected.
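The 53.9% comes from the same arithmetic on the Rejected row:
1493/(1493+1278)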
Given a certain \(\chi^2\) value and a certain df, what is the probability of observing a value this large or larger?
pchisq(q=4.57, df=3, lower.tail=FALSE)
## [1] 0.2061308
see: chisq.test, mosaicplot, chisq_test, chisq_descriptives
Imagine you had a table that looked like:
|            |    | <--- Patient Type ---> |          |
|------------|----|------------------------|----------|
|            |    | Ulcer Patients         | Controls |
| Blood Type | O  | 911                    | 4578     |
|            | A  | 579                    | 4219     |
|            | B  | 124                    | 890      |
|            | AB | 41                     | 313      |
We can rebuild the raw data by repeating observations for each column. I start with blood type and make a list of the observations for the first column using rep() (repeat), going down the list of all ulcer observations and then all control observations.
Then I have to repeat the other variable - that is, what group they were in. Since I went by columns, I’ll go by columns again.
I then put it in a data frame.
bloodtype <- c(
rep("O", 911), rep("A", 579), rep("B", 124), rep("AB", 41),
rep("O", 4578), rep("A", 4219), rep("B", 890), rep("AB", 313)
)
patienttype <- c(
rep("Ulcer patients", 1655),
rep("Controls", 10000)
)
ulcer.data <- data.frame(bloodtype, patienttype)
table(ulcer.data$bloodtype, ulcer.data$patienttype)
##
## Controls Ulcer patients
## A 4219 579
## AB 313 41
## B 890 124
## O 4578 911
#
ulcer.data %>%
select(bloodtype, patienttype) %>%
table()
## patienttype
## bloodtype Controls Ulcer patients
## A 4219 579
## AB 313 41
## B 890 124
## O 4578 911
However, that table doesn’t look right - R has sorted our table by alphabetical order, which isn’t what we want. So, we can overwrite the variables as themselves but with new factor levels.
Note: If you spell anything wrong here in levels=c(), you will wipe your table clean of that one and it will go to zero. If you’re finding you’re missing observations where you had them before, it’s probably here.
ulcer.data$patienttype <- factor(ulcer.data$patienttype,
levels=c("Ulcer patients", "Controls"))
ulcer.data$bloodtype <- factor(ulcer.data$bloodtype,
levels= c("O", "A", "B", "AB"))
ulcer.table <- table(ulcer.data$bloodtype, ulcer.data$patienttype)
ulcer.table
##
## Ulcer patients Controls
## O 911 4578
## A 579 4219
## B 124 890
## AB 41 313
Below, I also do a similar thing using piping. This will not save the factor orders within the dataset. However, since I don’t care about the dataset, as long as the table is okay, I’m fine with this.
bloodtype <- c(
rep("O", 911), rep("A", 579), rep("B", 124), rep("AB", 41),
rep("O", 4578), rep("A", 4219), rep("B", 890), rep("AB", 313)
)
patienttype <- c(
rep("Ulcer patients", 1655),
rep("Controls", 10000)
)
ulcer.data <- data.frame(bloodtype, patienttype)
ulcer.table <- ulcer.data %>%
mutate(patienttype = factor(patienttype,
levels=c("Ulcer patients", "Controls")),
bloodtype = factor(bloodtype,
levels= c("O", "A", "B", "AB"))) %>%
select(bloodtype, patienttype) %>%
table()
ulcer.table
## patienttype
## bloodtype Ulcer patients Controls
## O 911 4578
## A 579 4219
## B 124 890
## AB 41 313
Agresti (1990, p. 61f; 2002, p. 91) Fisher’s Tea Drinker
A British woman claimed to be able to distinguish whether milk or tea was added to the cup first. To test, she was given 8 cups of tea, in four of which milk was added first. The null hypothesis is that there is no association between the true order of pouring and the woman’s guess, the alternative that there is a positive association (that the odds ratio is greater than 1).
TeaTasting <- matrix(c(3, 1, 1, 3),
nrow = 2,
dimnames = list(Guess = c("Milk", "Tea"),
Truth = c("Milk", "Tea")))
TeaTasting
## Truth
## Guess Milk Tea
## Milk 3 1
## Tea 1 3
fisher.test(TeaTasting, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: TeaTasting
## p-value = 0.2429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.3135693 Inf
## sample estimates:
## odds ratio
## 6.408309
## => p = 0.2429, association could not be established
identify_outliers() from the rstatix package flags values of a given y that fall more than 1.5 x IQR above Q3 or below Q1 as outliers, and values more than 3 x IQR beyond those quartiles as extreme values.
We are generally hesitant on this in psychology, as we always must consider what it means to be extreme (and isn’t that the beauty of psychology - that people are extreme?). However, it can be a useful tool to have in case you want to check if the data is inputted properly, and it probably does have some value there too.
library(rstatix)
mpg %>%
group_by(manufacturer) %>%
mutate(m_hwy = mean(hwy, na.rm=TRUE)) %>%
rstatix::identify_outliers(hwy) %>%
dplyr::select(manufacturer, model, year, m_hwy, hwy, is.outlier, is.extreme)
## # A tibble: 4 × 7
## manufacturer model year m_hwy hwy is.outlier is.extreme
## <chr> <chr> <int> <dbl> <int> <lgl> <lgl>
## 1 subaru forester awd 2008 25.6 23 TRUE FALSE
## 2 volkswagen jetta 1999 29.2 44 TRUE TRUE
## 3 volkswagen new beetle 1999 29.2 44 TRUE TRUE
## 4 volkswagen new beetle 1999 29.2 41 TRUE TRUE
get_summary_stats() is a shortcut for calculating the means and sds by group. You can do this yourself without adding more libraries, but this code is potentially more straightforward for you (a do-it-yourself sketch follows the output below).
library(rstatix)
iris %>%
group_by(Species) %>%
get_summary_stats(Petal.Length,
show=c("n", "mean", "sd", "se"))
## # A tibble: 3 × 6
## Species variable n mean sd se
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 versicolor Petal.Length 50 4.26 0.47 0.066
## 2 virginica Petal.Length 50 5.55 0.552 0.078
## 3 setosa Petal.Length 50 1.46 0.174 0.025
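If you prefer the do-it-yourself route without rstatix, a minimal sketch using only the tidyverse (the column names here are mine):
iris %>%
  group_by(Species) %>%
  summarise(n = sum(!is.na(Petal.Length)),
            mean = mean(Petal.Length, na.rm=TRUE),
            sd = sd(Petal.Length, na.rm=TRUE),
            se = sd/sqrt(n))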
Given an average (mean) number of events, dpois() can tell us the probability of observing a given count under a Poisson distribution.
This is the same as doing the following equation:
\[ \frac{e^{(-\mu)}\cdot(\mu^{x})} {X!} \]
where X is the number of successes and \(\mu\) is the expected (mean) number of events in a unit of time or space. lambda is the Poisson distribution's name for that expected number of events.
dpois(1, lambda=.5)
## [1] 0.3032653
A more detailed example follows:
#First, I write all of our observations out.
Count <- c(rep(0, 103),
rep(1, 72),
rep(2, 44),
rep(3, 14),
rep(4, 3),
rep(5, 1),
rep(6,1))
meanFreq <- mean(Count) #On average, there is how many? .945
meanFreq
## [1] 0.9453782
NumFish <- table(Count) #We can table this to get it to look like the slides.
NumFish
## Count
## 0 1 2 3 4 5 6
## 103 72 44 14 3 1 1
data <- data.frame(NumFish) #We turn this into a dataframe.
#Unfortunately, table() treats our numbers as factors. So we first turn them into
#characters (letters, but letters 0,1,2,3,4,5,6), and then turn those into numbers.
#Doing JUST as.numeric() will leave you with 1,2,3,4,5,6,7 because 0 is not a level.
data$Count <- as.numeric(as.character(data$Count))
# We want to calculate the poisson probabilities. Two ways.
data$PrX <- dpois(x=0:6, meanFreq)
data$PrX2 <- (exp(-(meanFreq))* (meanFreq^data$Count))/factorial(data$Count)
#We also want to calculate the expected probabilities.
data$Expected <- sum(data$Freq)*data$PrX
data
## Count Freq PrX PrX2 Expected
## 1 0 103 0.3885326190 0.3885326190 92.47076333
## 2 1 72 0.3673102491 0.3673102491 87.41983928
## 3 2 44 0.1736235421 0.1736235421 41.32240302
## 4 3 14 0.0547133011 0.0547133011 13.02176566
## 5 4 3 0.0129311899 0.0129311899 3.07762319
## 6 5 1 0.0024449729 0.0024449729 0.58190354
## 7 6 1 0.0003852373 0.0003852373 0.09168648
Notice how PrX and PrX2 are the same. Again, dpois is doing the equation above for each x.
Oh no! Some of our expected counts are too low. Let's combine everything at 4 or more. To do that, make a grouping variable that is the same for all rows from 4 onward, then summarise all columns by taking the sum within each group. The only rows that actually get added together are the ones sharing that last level.
data$Unique <- c("0","1","2","3","4 or more","4 or more","4 or more")
data <- data %>% group_by(Unique) %>% summarise_all(sum)
#Now, we're left with one final problem - 4 or more could be 7. 8. 9! 10!!!
data <- data %>% mutate(PrX =
case_when(Unique != "4 or more" ~ PrX,
Unique == "4 or more" ~ 1-sum(PrX[Unique!="4 or more"])))
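From here, a sketch of one way you might finish the goodness-of-fit test by hand (this step is my addition, not part of the original example): recompute the expected counts from the corrected PrX, then compare observed and expected with the chi-square formula, remembering that the df loses one extra degree of freedom because the mean was estimated from the data.
data$Expected <- sum(data$Freq)*data$PrX #recompute expected counts for the combined categories
chistat <- sum((data$Freq - data$Expected)^2 / data$Expected)
chistat
pchisq(chistat, df=5-1-1, lower.tail=FALSE) #5 categories, minus 1, minus 1 estimated parameter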
library(fmsb)
## Warning: package 'fmsb' was built under R version 4.0.5
riskratio(6,1,702,1376)
## Disease Nondisease Total
## Exposed 6 696 702
## Nonexposed 1 1375 1376
##
## Risk ratio estimate and its significance probability
##
## data: 6 1 702 1376
## p-value = 0.003623
## 95 percent confidence interval:
## 1.418659 97.496107
## sample estimates:
## [1] 11.76068
oddsratio(6,1,696,1375, p.calc.by.independence=TRUE)
## Disease Nondisease Total
## Exposed 6 696 702
## Nonexposed 1 1375 1376
## Total 7 2071 2078
##
## Odds ratio estimate and its significance probability
##
## data: 6 1 696 1375
## p-value = 0.003623
## 95 percent confidence interval:
## 1.424261 98.650607
## sample estimates:
## [1] 11.85345
library(pwr)
## pwr.t.test(n = , d = , sig.level = , power = , type = c("two.sample", "one.sample", "paired"))
pwr.t.test(d=(0.25/0.25), sig.level=0.05, power=0.8, type="two.sample")
##
## Two-sample t test power calculation
##
## n = 16.71472
## d = 1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
# pwr.t2n.test(n1 = , n2= , d = , sig.level =, power = )
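The commented-out line above is the unequal-group-size version; a sketch with made-up numbers (these values are not from class):
pwr.t2n.test(n1=20, n2=30, d=0.8, sig.level=0.05) #leaving power blank solves for it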
\[ Z=\frac{\bar Y-\mu}{\sigma_{\bar Y}} \]
\[ t = \frac{\bar Y-\mu_0}{s/\sqrt{n}} \]
\[ \bar Y\pm SE_{\bar Y}\cdot t_{\alpha(2), df} \]
\[ \frac{df\:s^2}{\chi^2_{\frac{\alpha}{2},df}}\leq\sigma^2 \leq \frac{df\:s^2}{\chi^2_{1-\frac{\alpha}{2},df}} \]
\[ r = \frac{1}{n-1}\cdot\sum\limits_{i=1}^{n}\frac{x_i-\bar x}{s_x}\cdot\frac{y_i-\bar y}{s_y} \]
or
\[ r = \frac{\sum(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum{(x_i - \bar x)^2\cdot\sum(y_i-\bar y)^2}}} \]
\[ SE_r = \sqrt\frac{1-r^2}{n-2} \]
\[ z_r = \frac{1}{2}\ln(\frac{1+r}{1-r}) \]
\[ CI = z_r \pm 1.96\cdot\frac{1}{\sqrt{n-3}} \]
\[ r_{low} = \frac{e^{2\cdot CI_{low}}-1}{e^{2\cdot CI_{low}}+1} \]
\[ r_{high} = \frac{e^{2\cdot CI_{high}}-1}{e^{2\cdot CI_{high}}+1} \]
\[ t = \frac{r-0}{SE_r} = \frac{r-0}{\sqrt{(1-r^2)/(n-2)}} \]
\[ F_s = \frac{MS(between)}{MS(within)} \]
or
\[ F_s = \frac{MS(treatment)}{MS(error)} \]
\[ SS(blocks) = \sum\limits_{j=1}^{J}m_j(\bar y_j-\bar{\bar{y}})^2 \]
\[ J-1 \]
\[ MS(blocks) = \frac{\sum\limits_{j=1}^{J}m_j(\bar y_j-\bar{\bar{y}})^2}{J-1} \]
\[ s_{pooled} = \sqrt{\frac{\sum\limits_{i=1}^I(n_i-1)s_i^2}{\sum\limits_{i=1}^I(n_i-1)}} = \sqrt{\frac{\sum\limits_{i=1}^I(n_i-1)s^2_i}{n_.-I}} \]
\[ \sum\limits_{i=1}^I(n_i-1)s^2_i = \sum\limits_{i=1}^I\sum\limits_{j=1}^{n_i}(y_{ij}-\bar y_i)^2 \]
\[ n_. -1 \]
\[ MS(within) = \frac{SS(within)}{df(within)} \]
\[ MS_{error} = \frac{\sum\limits_{i=1}^ks_i^2(n_i-1)}{N-k} \]
\[ MS_{error} = \frac{df_1s^2_1+df_2s^2_2}{df_1+df_2} = s^2_p \]
\[ \sum\limits_{i=1}^In_i(\bar y_i - \bar{\bar y})^2 \]
\[ I-1 \]
\[ \frac{\sum\limits_{i=1}^In_i(\bar y_i - \bar{\bar y})^2}{I-1}=\frac{SS_{groups}}{df_{groups}} \]
\[ \sum\limits_{i=1}^I\sum\limits_{j=1}^{n_i}(y_{ij}-\bar y_i)^2 \]
\[ Y_i-\hat Y_i \]
\[ SS_{residuals} = \sum\limits_{i=1}^n(Y_i-\hat Y_i)^2 \]
\[ b= \frac{Covariance(X,Y)}{Variance(X)} = \frac{\sum\limits_{i=1}^n(X_i-\hat X)(Y_i-\hat Y)}{\sum\limits_{i=1}^n(X_i-\hat X)^2} \]
\[ SE_b = \sqrt{\frac{MS_{residuals}}{\sum(X_i-\hat X)^2}} \]
\[ MS_{residual} = \frac{SS_{residual}}{df_{residual}} \]
\[ SS_{total} = \sum(Y_i-\hat Y)^2 \]
\[ SS_{regression} = b \sum(X_i-\hat X)(Y_i - \hat Y) \]
\[ SS_{residual} = SS_{total}-SS_{regression} \]
\[ b \pm t_{\alpha[2],df} SE_b \]
\[ n-2 \]
\[ t= \frac{b-\beta_0}{SE_b} \]
\[ \chi_s^2 = \sum\limits_{all\cdot classes}\frac{(o_i-e_i)^2}{e_i} \]
\[ df = number\:of\:categories-1 - number\:of\:parameters\:estimated\:from\:data \]
\[ df_{\chi^2}=(r-1)(k-1) \]
\[ \sum_i\sum_j(Observed_{i,j}\cdot\ln\frac{Observed_{i,j}}{Expected_{i,j}}) \]
\[ \widehat{RR} = \frac{\hat p_1}{\hat p_2}, \text{ where } \hat p_i = \frac{n_{\text{bad outcomes}_i}}{n_{\text{individuals}_i}} \]
\[ SE[\ln(\widehat{RR})] = \sqrt{\frac{1}{a}+\frac{1}{b}-\frac{1}{a+c}-\frac{1}{b+d}} \]
\[ 95\%CI = e^{\ln(\widehat{RR})\pm(1.96)(SE[\ln(\widehat{RR})])} \]
\[ odds = \hat O_1 = \frac{\hat p_1}{1-\hat p_1} \\ \hat O_2 = \frac{\hat p_2}{1-\hat p_2}\\ \widehat{OR} = \frac{\hat O_1}{\hat O_2} = \frac{a/c}{b/d} = \frac{ad}{bc} \]
Where X = number of successes, \(\mu\) is the expected (mean) number of events in a unit of time or space and ! is factorial notation.
\[ \frac{e^{(-\mu)}\cdot(\mu^{x})} {X!} \]
\[ \hat p = \frac{X}{n} \]
\[ SE_{\hat p}= \sqrt{\frac{\hat p (1-\hat p)}{n}} \]
\[ p= p' \pm(1.96)\sqrt{\frac{p'(1-p')}{n+4}} \]
\[ Pr[X] = \binom{n}{X}p^X(1-p)^{(n-X)} \]
where (Binomial Coefficient) \[ \binom{n}{X} = \frac{n!}{X!(n-X)!} \]
and where \[ n! = n \cdot (n-1) \cdot (n-2) \cdot ... 3 \cdot 2 \cdot 1 \]
\[ \mu = n\cdot p \]
\[ \sigma = \sqrt{n \cdot p \cdot (1-p)} \]
There are a lot of ways you can format a Markdown.
Surrounding words with * can make text italicized.
Surrounding words with ** can make text bolded.
You can write many letters. Ones important for this class might be:
Using hashtags to create headers is also visually important for labs and for organization purposes. You might notice how certain headers here have different sizes. The main Question should lead with a single hashtag and a space after it.
# Question 1
While a part to that question should be the next level lower, so that would be two hash tags with a space next to it.
## Question 1a
Bullets use * (a single star) with a space after it, and an enter after each item, to help Markdown tell the difference between a list and a bolded/italicized word. You saw what bullet formatting looks like above in Greek Letters; a small sketch also follows.
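A minimal sketch of what a bulleted list looks like in the raw Markdown (my own example):
* First bullet point
* Second bullet point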
Tables are funky, and you probably won't ever need to use them, but for completeness, I include the example here. It is important to note that it is rather cumbersome to merge cells together (it requires you to code with HTML), and we will not consider that option.
|           |          | <--- | Creek | ---> |
|-----------|----------|------|-------|------|
|           |          | a    | b     | c    |
| Treatment | No Fish  | 11   | 8     | 7    |
|           | Galaxias | 9    | 4     | 4    |
|           | Trout    | 6    | 4     | 0    |
This could be written as:
| | | <--- | Creek | ---> |
|-----------|----------|------|-------|------|
| | | a | b | c |
| Treatment | No Fish | 11 | 8 | 7 |
| | Galaxias | 9 | 4 | 4 |
| | Trout | 6 | 4 | 0 |