Lab Libraries

To install a package, run the following function with the name of the package in quotes. If R asks for permission to install other things, be sure to click Yes. Installing a package requires an Internet connection, so make sure yours is stable before you start.

install.packages("tidyverse")

Important libraries we generally need to install are:

Libraries that are more specific for statistics are:

Libraries that are important for learning R are:

AH Something Broke!

There are many things to try at this stage, so please be sure you have tried all of the steps below before asking for help. R is a puzzle that is meant to be solved: the computer is almost never wrong, so the problem is most likely something in your code. The good news is that R is trying to give you hints.

  1. Did you read the error? What does it say? Did you copy and paste it into Google?

  2. Did you take the time to check whatever it is you were working on? That is, have you run the variable in your console to see it and understand what it looks like? Have you run table(), head(), class(), and more on it to ensure it’s all what you expected?

  3. Have you turned it off and on again – that is, have you tried running your script from top to bottom and reloading your data in?

  4. Have you ensured you have loaded the libraries that you need? If you think you have, is it possible that multiple packages are calling the same function? Have you added the library name in the front to be safe with ::?

  5. Have you tried breaking it yourself, or trying different things?

  6. Have you checked your quotes, your brackets, your parentheses, your commas?
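
As a rough sketch of steps 2 and 4, the checks might look like the following. The vector scores is made up for illustration and is not part of the lab's data.

```r
# A made-up vector standing in for "whatever you were working on".
scores <- c(2, 4, NA, 100)

class(scores)         # what kind of object is it?
head(scores)          # what do the first few values look like?
table(is.na(scores))  # how many values are missing?

# Writing the package name with :: removes any doubt about which sd() runs.
stats::sd(scores, na.rm = TRUE)
```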

R Basics

Loading Data [read.csv() and as_tibble()]

Loading data is primarily done through read.csv(). Other file formats will require different functions.

We are working to transition the lab to full tidy language use, which also means we want to ensure our data is loaded as a tibble using as_tibble().

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#setwd()
#data <- read.csv()
#A dataset we will use to generally practice some features.
data <- iris
data <- as_tibble(data)
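
For a file on disk, the same two steps might look like the sketch below. The file and its columns (id, score) are invented, and the file is written to a temporary location only so that the example runs on its own. In practice you would pass the path to your own file.

```r
library(tibble)

# Write a tiny made-up CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2, 4, 100)), path, row.names = FALSE)

# Read the file in, then convert the data frame to a tibble.
practice <- read.csv(path)
practice <- as_tibble(practice)
practice
```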

Formatting Your R Code

Scripts can be hard to read, especially as your code gets longer. That is why we use the Enter key. If you write a comma in your code, you should probably hit Enter. If you use a plus sign in ggplot, you should probably hit Enter. As you continue to read through this manual, you will note I hit Enter on most commas. You should do the same. R will indent the continued lines for you, which makes the structure easy to see.

I also recommend you divide your script into sections by adding a comment line that ends in at least four dashes, pound signs, or equals signs (for example, # Load data ----) at various points. In RStudio, you will notice that this creates a small fold arrow on the left side near the line numbers. You can use it to collapse sections and jump around a large script without needing to scroll too much.
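
A short sketch of both habits together. The section names and the scores vector are invented for illustration; RStudio treats a comment line ending in four or more dashes as a foldable section.

```r
# Load data ----
scores <- c(2,
            4,
            100)

# Describe data ----
mean(scores)
```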

Assignment

R is happy to do things for you without saving its work. For example, we could ask R to calculate the mean of 2, 4, and 100 by:

mean(c(2,4,100))
## [1] 35.33333

But nothing was saved. If we want to keep that number moving forward, we need to assign it to a name. We will always use a left-pointing arrow, <-, to assign.

mean_of_three <- mean(c(2,4,100))

Nothing will happen in the console, but instead, this value is now in our Environment. We can always see that value by just typing into R:

mean_of_three
## [1] 35.33333

Values vs. variables

There are two main things we can assign. If we were to write:

value <- c("Dr. C", "Dr. K")

This would store a value into our Environment under Values. This is different than creating a variable.

data$var <- rep(c("Dr. C", "Dr. K"), 75)

This creates a variable called var. To understand this, we first state that we want things stored within the data frame we call data, and then we use the dollar sign to signify that this will be a new variable in the data. Because data has 150 rows, rep() repeats the two names 75 times so that every row gets a value.
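
The same contrast on a smaller scale, as a sketch. The names professors and practice are made up for illustration.

```r
# A stand-alone value: it lives in the Environment on its own.
professors <- c("Dr. C", "Dr. K")

# A small made-up data frame with four rows.
practice <- data.frame(id = 1:4)

# A variable: a new column stored inside the data frame.
# rep() repeats the two names twice so there is one entry per row.
practice$professor <- rep(professors, 2)

length(professors)  # 2
names(practice)     # "id" "professor"
```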

Brackets

see also: sapply

We can understand things in terms of vectors (like above, where value held two names) and data frames. Data frames are actually just lists of equal-length columns. Brackets are one of the many ways in which we can subset data. If we put a comma inside the brackets, we can specify rows (left of the comma) or columns (right of the comma). If there is no comma, R treats it as if you’re simply asking for the nth element in the list of columns, which is the nth column. You can also call a column by passing its name as a character string, an important lesson that will be useful later.

head(data[,1]) #Column
## # A tibble: 6 × 1
##   Sepal.Length
##          <dbl>
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5  
## 6          5.4
head(data[1]) #Also column
## # A tibble: 6 × 1
##   Sepal.Length
##          <dbl>
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5  
## 6          5.4
data[1,] #Rows
## # A tibble: 1 × 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species var  
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>
## 1          5.1         3.5          1.4         0.2 setosa  Dr. C
head(data["Sepal.Length"]) #another way to call the column.
## # A tibble: 6 × 1
##   Sepal.Length
##          <dbl>
## 1          5.1
## 2          4.9
## 3          4.7
## 4          4.6
## 5          5  
## 6          5.4
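
Rows and columns can also be requested at the same time. A minimal sketch using the built-in iris data frame (not the lab's data object):

```r
# Rows 1 to 3 of two named columns: rows go left of the comma, columns right.
iris[1:3, c("Sepal.Length", "Species")]

# Double brackets (or $) pull a single column out as a plain vector.
head(iris[["Sepal.Length"]])
```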

Getting Help

We can always type ? followed by a function name (like ?c) into the Console to learn more about it. The best places to look are the Arguments section and then the Examples section at the bottom.

Basic Functions ()

Functions have a name on one side (often a word or a shortened word) and a set of open and closed parentheses on the other, (). We call this thing() a function. Functions take arguments, which are the things we put inside of the parentheses. Different functions take different numbers of arguments, and when we pass multiple arguments to a function, we always separate them with commas.
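
For instance, round() takes the number to round and, optionally, how many digits to keep. A small sketch:

```r
round(3.14159, 2)           # arguments matched by position
round(3.14159, digits = 2)  # the same call with the second argument named
```

Both calls return 3.14. Naming an argument makes the code easier to read later.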

c()

Above, we used c() to create a list of two things. This is our first function! What this manual loosely calls a list is properly called a vector in R (R also has a separate list() type, which we are not using here). If we want to combine multiple things, we can assume we are using c() unless otherwise noted.

-

- is one way in R we say not. Typically, - is used to say NOT this variable or NOT this list of things.
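
A minimal sketch of the idea with brackets, using a made-up vector x:

```r
x <- c(10, 20, 30)

x[-1]        # everything except the first element: 20 30
x[-c(1, 3)]  # everything except the first and third elements: 20
```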

is.na()

See: ! ; logical operators

is.na() returns a vector (list) of TRUE or FALSE. NA is R’s term for missingness. That is, is.na() will return a TRUE if something is missing and a FALSE if it is not-missing. We generally will use is.na() with the NOT, that is, to say !is.na(), to give us a list of TRUE if something is present, and only FALSE if something is not present.
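
A short sketch on a made-up vector with one missing value:

```r
x <- c(1, NA, 3)

is.na(x)        # FALSE  TRUE FALSE
!is.na(x)       #  TRUE FALSE  TRUE
sum(!is.na(x))  # 2 values are present
x[!is.na(x)]    # keep only the present values: 1 3
```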

names()

names() will tell us the names of all columns in a dataset.

tail()

tail() will give us the last six rows of all columns in a dataset.
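
Both functions in action, as a quick sketch on the built-in iris data frame:

```r
names(iris)    # the column names
tail(iris)     # the last six rows
tail(iris, 3)  # a second argument changes how many rows come back
```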

table()

table() will give us the counts for one variable, or the cross-tabulation of two variables, in a dataset.

table(data$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50
table(data$Species, data$var)
##             
##              Dr. C Dr. K
##   setosa        25    25
##   versicolor    25    25
##   virginica     25    25

ls()

ls() lists everything in your environment.

rm()

rm() removes things from your environment. You can use rm(list=ls()) to remove everything from your environment.

rm(mean_of_three)

Logical Operators

We can use <, >, ==, >=, and <= for most comparisons. To test whether two things are equal, R requires a double equal sign. A single equal sign will not work because it acts like our <- assignment operator. We can also use these operators on variables in our data.

2+2==4
## [1] TRUE
data$Sepal.Length > 6
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
##  [61] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [97] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## [121]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## [145]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
table(data$Sepal.Length > 6)
## 
## FALSE  TRUE 
##    89    61
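
The TRUE/FALSE vector can also go inside brackets to keep only the rows where the statement is TRUE. A sketch using the built-in iris data frame; long_sepals is a made-up name.

```r
# Keep the rows where Sepal.Length is larger than 6, and all columns.
long_sepals <- iris[iris$Sepal.Length > 6, ]

nrow(long_sepals)  # 61, the same as the TRUE count in the table above
```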

!

An exclamation mark is used to say NOT. More specifically, it is primarily used to say NOT when we are considering statements of TRUE or FALSE. It will flip things for us. An important thing to think about for ! is the use of parentheses. That is, we have to be clear about what we’re NOT’ing.

2+2==4
## [1] TRUE
!(2+2==4)
## [1] FALSE
2+3==4
## [1] FALSE
!(2+3==4)
## [1] TRUE
table(!(data$Sepal.Length > 6))
## 
## FALSE  TRUE 
##    61    89
#Notice that, compared to above, using ! required
#that we wrap data$Sepal.Length > 6 in ().

Basic Math Functions

mean()

Gets the global mean of some variable or object.

mean(iris$Sepal.Length)
## [1] 5.843333

sd()

Gets the standard deviation of some variable or object.

sd(iris$Sepal.Length)
## [1] 0.8280661

se

Calculates the standard error of some variable or object. The se is the standard deviation divided by the square root of n.

sd(iris$Sepal.Length)/sqrt(
  sum(
    !is.na(iris$Sepal.Length)
    )
  )
## [1] 0.06761132
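
If you compute this often, the calculation can be wrapped in a small helper. This is only a sketch: se() is a hypothetical name, not a base R function, and unlike the code above it drops missing values before taking the standard deviation.

```r
# Hypothetical helper, named se() for illustration only.
se <- function(x) {
  x <- x[!is.na(x)]        # drop missing values first
  sd(x) / sqrt(length(x))  # standard deviation over the square root of n
}

se(iris$Sepal.Length)
## [1] 0.06761132
```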

Tidyverse Functions

%>%

Since the lab is learning dplyr, we need to start with %>%, known as the pipe. The pipe takes the data on its left and passes it to the function on its right as that function’s first argument. You will see that all of the functions below take data as the first argument. With the pipe, this step is done for you, so you never need to repeat which dataset you are using once you have named it on the left side of the pipe.

This should help with the readability of the script as you go on. We will see the pipe appear a lot, but you can try and read it as “and then”.

Finally, %>% allows us to end our “and then, and then, and then” chain without a () on the last function if it only needs a single argument (and that argument is a dataset). This makes sense, since a function like head() needs only one argument, the data, and we have already supplied it at the beginning of our pipe.
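
To see what the pipe buys us, here is the same request written both ways. This is a sketch on the built-in iris data frame, using select() (described below) and head().

```r
library(dplyr)

# Without the pipe: read from the inside out.
head(select(iris, Species, Sepal.Length), 3)

# With the pipe: read from top to bottom as "and then".
iris %>%
  select(Species, Sepal.Length) %>%
  head(3)
```

Both versions return the same three rows.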

group_by()

group_by() will be used to group our datasets by our independent variables. This is critical for other parts of dplyr and tidyr, where we will use the grouped dataset to get summary statistics.

On grouped data, the select() function (see select) always carries the grouping variables forward, even if you did not ask for them.

data <- data %>% group_by(Species, var)

head(data)
## # A tibble: 6 × 6
## # Groups:   Species, var [2]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species var  
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>
## 1          5.1         3.5          1.4         0.2 setosa  Dr. C
## 2          4.9         3            1.4         0.2 setosa  Dr. K
## 3          4.7         3.2          1.3         0.2 setosa  Dr. C
## 4          4.6         3.1          1.5         0.2 setosa  Dr. K
## 5          5           3.6          1.4         0.2 setosa  Dr. C
## 6          5.4         3.9          1.7         0.4 setosa  Dr. K

mutate()

mutate() is used to create new variables or change variables (mutate them). You can use mutate() instead of creating a new variable through the dollar sign assignment, but either is acceptable. Here, I make two variables. First, I overwrite Sepal.Length by creating a new variable with the same name, and add one to it. I also create a new variable called newvar, which randomly takes a value of either 0 or 1 for each row.

names(data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## [6] "var"
data <- data %>% 
  ungroup() %>% 
  mutate(Sepal.Length = Sepal.Length+1,
         newvar  = sample(c(0:1), 150,  replace=TRUE)) %>% 
  group_by(Species, var)

names(data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## [6] "var"          "newvar"

recode()

see also: mutate()

recode() is dplyr’s way to, well, recode variables. If we do not like how the values of a variable are labeled, we can use recode() to change them. Each change is written as "old value" = "new value".

data <- data %>% mutate(var=recode(var,
                                   "Dr. C"="Dr. Carriere", 
                                   "Dr. K" = "Dr. Kilgore"))

# Also can do:
#data$var <- recode(data$var, "Dr. C" = "Dr. Carriere",
#                   "Dr. K"="Dr. Kilgore")

filter()
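
filter() keeps only the rows where a logical statement (see Logical Operators) is TRUE. A minimal sketch on the built-in iris data frame; setosa_long is a made-up name, and conditions separated by commas must all be TRUE for a row to be kept.

```r
library(dplyr)

# Keep only the setosa rows with a sepal longer than 5.
setosa_long <- iris %>%
  filter(Species == "setosa",
         Sepal.Length > 5)

table(setosa_long$Species)
```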

rename()

rename() is how we can rename many variables at once. Like all dplyr verbs, it takes a dataset as the first argument, and then takes as many newname = oldname pairs as you like, with no c() needed.

names(data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## [6] "var"          "newvar"
data <- data %>% rename(petal.length=Petal.Length,
                        professor = newvar)
names(data)
## [1] "Sepal.Length" "Sepal.Width"  "petal.length" "Petal.Width"  "Species"     
## [6] "var"          "professor"
data <- data %>% rename(Petal.Length = petal.length,
                        newvar = professor)

select()

select() is used to select certain columns that we either want to keep or want to not keep. You will learn that we can say “not” in two different ways, using the - sign or using the ! sign. While it can be tricky to figure out which to use, generally default to - and use ! when you’re working with some kind of logical statement (is X larger than Y?).

Select can also move the order of the columns around if you tell it to keep everything using everything().

#Because data is grouped, this also keeps the grouping variable var in data_small.
data_small <- data %>% select(Species, Sepal.Length)
## Adding missing grouping variables: `var`
names(data_small)
## [1] "var"          "Species"      "Sepal.Length"
#We can avoid that by using ungroup()
data_small2 <- data %>% ungroup() %>% select(Species, Sepal.Length)
names(data_small2)
## [1] "Species"      "Sepal.Length"
#Using everything() will simply move the order of the variables.
names(data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
## [6] "var"          "newvar"
data_move <- data %>% select(Species, everything()) 
names(data_move)
## [1] "Species"      "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [6] "var"          "newvar"
rm(data_move, data_small2, data_small)
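
Dropping columns with - works the same way. A sketch on the built-in iris data frame; no_species and two_gone are made-up names.

```r
library(dplyr)

# Drop one column.
no_species <- iris %>% select(-Species)
names(no_species)

# Drop several columns at once by wrapping them in c().
two_gone <- iris %>% select(-c(Species, Petal.Width))
names(two_gone)
```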

rowid_to_column()

This is a simple function that can be passed through at various points. It assigns a numeric id to each row, if we do not already have one, and calls the new column it generates rowid.

data %>% rowid_to_column() %>% head
## # A tibble: 6 × 8
## # Groups:   Species, var [2]
##   rowid Sepal.Length Sepal.Width Petal.Length Petal.Width Species var     newvar
##   <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>    <int>
## 1     1          6.1         3.5          1.4         0.2 setosa  Dr. Ca…      0
## 2     2          5.9         3            1.4         0.2 setosa  Dr. Ki…      1
## 3     3          5.7         3.2          1.3         0.2 setosa  Dr. Ca…      1
## 4     4          5.6         3.1          1.5         0.2 setosa  Dr. Ki…      1
## 5     5          6           3.6          1.4         0.2 setosa  Dr. Ca…      1
## 6     6          6.4         3.9          1.7         0.4 setosa  Dr. Ki…      0

pivot_longer()

See also: gather(), rowid_to_column()

pivot_longer() takes wide data and makes it longer.

#Using -c() allows me to say "Don't stack these columns on top of each other."
#But for everything else, stack their names in a new column called Measure
#and put their values in a new column called Value.
#Note: despite its name, data_wide holds the long version of the data.
#https://stackoverflow.com/questions/57977470/is-pivot-longer-and-pivot-wider-transitive
data_wide <- data %>% rowid_to_column() %>% 
  pivot_longer(cols =-c(Species, var, newvar, rowid),
                      names_to="Measure",
                      values_to="Value")

head(data)
## # A tibble: 6 × 7
## # Groups:   Species, var [2]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species var          newvar
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>   <chr>         <int>
## 1          6.1         3.5          1.4         0.2 setosa  Dr. Carriere      0
## 2          5.9         3            1.4         0.2 setosa  Dr. Kilgore       1
## 3          5.7         3.2          1.3         0.2 setosa  Dr. Carriere      1
## 4          5.6         3.1          1.5         0.2 setosa  Dr. Kilgore       1
## 5          6           3.6          1.4         0.2 setosa  Dr. Carriere      1
## 6          6.4         3.9          1.7         0.4 setosa  Dr. Kilgore       0

pivot_wider()

See also: spread(), rowid_to_column()

pivot_wider() takes long data and makes it wider.

data_long <- data_wide %>% 
  pivot_wider(names_from = "Measure", values_from = "Value")
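
On a tiny example the round trip is easier to see. This is a sketch with an invented dataset (id, pre, post), not the lab's data.

```r
library(dplyr)
library(tidyr)

# Two people measured twice: one row per person (wide).
wide <- tibble(id = 1:2,
               pre = c(10, 20),
               post = c(15, 25))

# Stack pre and post: one row per person per time point (long).
long <- wide %>%
  pivot_longer(cols = -id,
               names_to = "time",
               values_to = "score")

# Spread them back out into one row per person.
back <- long %>%
  pivot_wider(names_from = "time",
              values_from = "score")
```

Here long has four rows (two people times two time points), and back has the same columns and rows as wide.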

summarise()

see also: summarize()

We can use summarise() to build new datasets of summary data. This is especially helpful when our dataset has been grouped by group_by(). It also takes new variable names, and variables created earlier in the same call can be used by the ones below them. This is one of our primary ways to get visualization data, as well as summary statistics such as the means and standard errors we report in results.

The lab’s preferred spelling is with an s. This matters most if you do not clarify with :: which package you mean. That is, dplyr::summarize() will function okay, but a bare summarize() can behave differently than you would like if another loaded package has its own summarize() that masks dplyr’s (Hmisc is one example). It is better to be safe and use summarise() or dplyr::summarise().

data_sum <- data %>% 
  dplyr::summarise(
    
    meanSL=mean(Sepal.Length),
    sdSL = sd(Sepal.Length),
    nSL = sum(!is.na(Sepal.Length)), 
    seSL = sdSL/sqrt(nSL), #See here, standard error uses sdSL 
                            #and nSL that we calculated above.
    
    meanSW = mean(Sepal.Width),
    sdSW = sd(Sepal.Width),
    nSW = sum(!is.na(Sepal.Width)), #refer to "! operator" and is.na() to understand.
    seSW = sdSW/sqrt(nSW)
    )
## `summarise()` has grouped output by 'Species'. You can override using the `.groups` argument.
head(data_sum)
## # A tibble: 6 × 10
## # Groups:   Species [3]
##   Species    var          meanSL  sdSL   nSL   seSL meanSW  sdSW   nSW   seSW
##   <fct>      <chr>         <dbl> <dbl> <int>  <dbl>  <dbl> <dbl> <int>  <dbl>
## 1 setosa     Dr. Carriere   6.02 0.391    25 0.0782   3.48 0.325    25 0.0651
## 2 setosa     Dr. Kilgore    5.99 0.317    25 0.0633   3.38 0.426    25 0.0853
## 3 versicolor Dr. Carriere   6.99 0.556    25 0.111    2.78 0.336    25 0.0672
## 4 versicolor Dr. Kilgore    6.88 0.478    25 0.0956   2.76 0.297    25 0.0594
## 5 virginica  Dr. Carriere   7.50 0.603    25 0.121    2.94 0.287    25 0.0574
## 6 virginica  Dr. Kilgore    7.67 0.669    25 0.134    3.01 0.356    25 0.0713
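
The message above mentions the .groups argument. As a sketch on the built-in iris data frame, setting .groups = "drop" returns an ungrouped result; species_sum is a made-up name.

```r
library(dplyr)

species_sum <- iris %>%
  group_by(Species) %>%
  summarise(
    meanSL = mean(Sepal.Length, na.rm = TRUE),
    nSL = sum(!is.na(Sepal.Length)),
    .groups = "drop"  # return a plain, ungrouped tibble
  )

species_sum
```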

gather()

gather() is the older version of pivot_longer(). While it still works, it will be easier for the lab to focus on pivot_longer(). Be aware that past code written with gather() may be outdated.

spread()

spread() is the older version of pivot_wider(). While it still works, it will be easier for the lab to focus on pivot_wider(). Be aware that past code written with spread() may be outdated.

Creating Averages for New Variables:

Creating averages for variables is a key part of cleaning your data. We typically ask a multitude of questions in a scale and we want to get the overall average score for the scale.

rowMeans()

The first simple way is rowMeans(). rowMeans() takes the mean of all columns in each row, so we first need to subset our data to only the columns we want. What is left is a dataset purely of the columns we want the average of, which we run through rowMeans() to get an average for each row and assign to a new variable. I also run the global mean on this variable to compare it to the next solution.

iris$Sepal.Example <- rowMeans(
    subset(
      iris,
      select=c(
        Sepal.Length,
        Sepal.Width
        )))
mean(iris$Sepal.Example)
## [1] 4.450333

However, rowMeans() will start to misbehave if we’re using a grouped dataset through our tibbles and our group_by() and if we are trying to pipe it in. If so, we would want to shift to using rowwise() and using the mean() function.

rowwise() and mean()

see also: psych::alpha() , summarystats(), group_by()

rowwise() is very similar to group_by() in that it doesn’t change anything in the dataset except how later functions treat it. Before, if we were to call mean(), we would get the average across the whole data. With rowwise(), each row is treated as its own group, so mean() gives a result for each individual row. We wrap the columns we want in c() and pass that to mean(). Note that, like always, since we are piping, we don’t need to use the dollar sign to signify the variable because R has been told explicitly at the beginning which dataset to use.

iris <- iris %>% 
  group_by(Species) %>% 
  rowwise() %>% 
  mutate(Sepal.Example2 = mean(c(Sepal.Length, Sepal.Width), na.rm=TRUE))
mean(iris$Sepal.Example2)
## [1] 4.450333

ggplot2

The lab uses ggplot2 for all data visualizations. ggplot2 plots can be broken down into visualizations that take summary data (geom_line; geom_bar; geom_col; geom_point) and visualizations that take the whole dataset (raincloud plots; geom_point). The lab is moving towards using raincloud plots for most visualizations.

All ggplots start with a base layer that will show nothing except the axes. The idea is that we just continually add layers on top of each other. It is important to think like this because indeed, that’s how things work, so if you want something behind/in front of something else, consider the order in which you are placing layers.

Important to note in the code below that the variables we care about are wrapped in aes(). aes() stands for aesthetics, or visually, what we want to change. If we want things to change based on a certain variable, it probably wants to go into aes. If we want something to change globally (for all points), then we probably want it outside of aes.

Both of these are possible plots. That is, all ggplot2 needs is a dataset. It doesn’t care if it’s the whole data or a summarized version of the data. To ensure this section works well, I will always be piping data just like this at the top of each function. However, please note that in practice, it is better to just save your data into your environment and then pass the saved object to ggplot().

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))

iris %>% group_by(Species) %>%
  dplyr::summarise(
    meanSL=mean(Sepal.Length, na.rm=TRUE),
    sdSL = sd(Sepal.Length, na.rm=TRUE),
    nSL = sum(!is.na(Sepal.Length)), 
    seSL = sdSL/sqrt(nSL)) %>%
  ggplot2::ggplot(aes(x=Species, y=meanSL))

geom_point()

see also: geom_line(), lollipop graphs

geom_point() makes scatterplots, so we can expect geom_point() to like to have the full dataset to plot things well. However, this is not necessarily the case, and there are many graphs that can use geom_point() with a summarized dataset as well where you are highlighting the points that appear on geom_line() (see Maglio et al., 2014 for a use of geom_point() with summarized data, also lollipop graphs).

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
  geom_point()

geom_point(position=position_jitter)

Above, the points were overlapping each other. We can set the position of the points to be different so they don’t overlap anymore. We call this jittering. position= is important for a few of the geom_s.

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
  geom_point(position=position_jitter(width=.2))

geom_point(color=) outside of aes

As stated, outside of aes() will set global colors.

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
  geom_point(position=position_jitter(width=.2), color="blue")

geom_point(aes(color=))

see also: theme, legend

Inside of aes() we define things by variables. Having an aes() beyond ggplot() will also lead to a legend appearing, as we can see.

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
  geom_point(position=position_jitter(width=.2), aes(color=Species))

geom_point(shape=) outside of aes

Points can change their shapes. You can see the shape options using show_point_shapes() from ggpubr. I will note that, when I look at this chart, the point shapes don’t always match what ends up on the graph. However, it can at least give you a general idea, and with some trial and error you can find what you want.

ggpubr::show_point_shapes()
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
  geom_point(position=position_jitter(width=.2), aes(color=Species), shape=21)

geom_point(shape=) inside aes()

Similarly, you can move most things from outside of aes() into aes() if you want them to change by group.

iris %>% ggplot2::ggplot(aes(x=Species, y=Sepal.Length))+
  geom_point(position=position_jitter(width=.2), aes(color=Species, shape=Species))

geom_bar()

#

geom_col()

#

geom_boxplot()

#

geom_histogram()

You shouldn’t use this. It’s lame. Plus, we can do this with geom_bar() just as well. But the important note about histograms is that they do not have a y aesthetic, since the y that we care about is the count.

iris %>% ggplot2::ggplot(aes(x=Sepal.Length))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
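
The message above asks for a better binwidth. A minimal sketch on the built-in iris data frame, with 0.25 chosen arbitrarily for illustration:

```r
library(ggplot2)

# Each bar now covers 0.25 units of Sepal.Length instead of using 30 default bins.
p <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25)

p
```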

geom_line()

#

geom_errorbar()

#

geom_segment()

see also: annotate

#

geom_violin()

raincloud plots

lollipop plots

Lollipop plots are more modern looking barplots, even if they don’t give as much information as a violin plot or a boxplot.

The example below is a double lollipop, or barbell, plot: each assignment gets two points (Management and Union) joined by a segment.

assignment <- c("Attendance", "Participation",
                "Lead Discussion", "Discussion Posts",
                "Reviewing Papers", "Midterm",
                "Project Topic", "Method Identification",
                "Ethics Training", "Annotated Bibliography",
                "Class Pres 1", "Submit IRB",
                "Class Pres 2", "Poster Presentation",
                "Final Paper"
                )
Management <- c(0, 10, 
           7.5, 7.5,
           10, 15,
           0, 2,
           2,
           5, 2.5,
           10, 2.5,
           10, 17.5)
Union <- c(5, 10,
              10, 10,
              5, 20,
              2,2, 2,
              5, 5,
              5, 4,
              5, 10)
data_lg <- data.frame(assignment, Management, Union)
#Keep the assignments in the order they were typed (not alphabetical) when plotting.
data_lg$assignment <- factor(data_lg$assignment, levels = assignment)

ggplot(data_lg) +
  geom_segment( aes(x=assignment, xend=assignment, y=Management, yend=Union), color="grey") +
  geom_point( aes(x=assignment, y=Management, color="Management"), size=3 ) +
  geom_point( aes(x=assignment, y=Union, color="Union"), size=3 ) +
  scale_color_manual(values=c(rgb(0.7,0.2,0.1,0.5), rgb(0.2,0.7,0.1,0.5)), 
                     name="")+
  coord_flip()+
  xlab("") +
  ylab("Weight of Assignment for Final Grade")+
  theme(
    legend.position = "right"
  )

facet_wrap()

themes()

xlab()

ylab()

geom_hline()

geom_vline()

annotate()

#annotate("text", label="Label!", x=1, y=1)
#annotate("segment", x=1, xend=1.5, y=1, yend=1)

scale_color_manual()

scale_fill_manual()

Data Analysis

t.test()

aov()

lm()

glm()

psych::alpha()

lsr::cohensD()

Advanced R

Writing Your Own Functions

Functions are created using the function keyword, a set of parentheses that holds the arguments, and a pair of curly brackets that holds the code to run.

coolcat <- function(x){
  paste0(x, " is a cool cat")
}

coolcat("Dr. C")
## [1] "Dr. C is a cool cat"
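
Functions can take more than one argument, and an argument can have a default value. A sketch building on coolcat(); coolcat2 and its animal argument are made up for illustration.

```r
# animal defaults to "cat" unless the caller supplies something else.
coolcat2 <- function(x, animal = "cat"){
  paste0(x, " is a cool ", animal)
}

coolcat2("Dr. C")                  # "Dr. C is a cool cat"
coolcat2("Dr. K", animal = "dog")  # "Dr. K is a cool dog"
```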

sapply()
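
sapply() runs one function over every element of a list or vector. Since a data frame is a list of columns (see Brackets), that means running the function once per column. A minimal sketch on the built-in iris data frame; cols is a made-up name.

```r
# The columns we want, called by their names in quotes.
cols <- c("Sepal.Length", "Sepal.Width")

# Run mean() on each of those columns; the result is one number per column.
sapply(iris[cols], mean)
```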

Lab Functions

factorise()

see also: sapply

factorise() is the lab’s function to quickly change most scale wordings to numeric values. It grows as we use various anchors and also tries to cover for capitalization changes based on student. It can take a single variable, but for efficiency, we can use sapply to pass it through multiple variables at once.

factorise <- function(x) {
  dplyr::case_when(x %in% c("No") ~ 0,
                   x %in% c("Strongly disagree", "Strongly Disagree",
                            "Not well at all",
                            "Not at all",
                            "Much less than the average person",
                            "Very unconcerned", "very unconcerned",
                            "Very unimportant", "Yes") ~ 1,
                   x %in% c("Disagree",
                            "Slightly well",
                            "Not very much",
                            "Less than the average person",
                            "Slightly unconcerned", "slightly unconcerned",
                            "Slightly unimportant") ~ 2,
                   x %in% c("Somewhat disagree", "Somewhat Disagree",
                            "Moderately well",
                            "Barely at all",
                            "Slightly less than the average person",
                            "Neither concerned nor unconcerned",
                            "neither concerned nor unconcerned",
                            "Neither important nor unimportant") ~ 3,
                   x %in% c("Neither agree nor disagree", "Neither disagree nor agree",
                            "Neither Agree Nor Disagree", "Neither Disagree Nor Agree",
                            "Extremely well",
                            "Neutral",
                            "About the same as the average person",
                            "Slightly concerned", "slightly concerned",
                            "Slightly important") ~ 4,
                   x %in% c("Somewhat agree", "Somewhat Agree", 
                            "A fair amount",
                            "Slightly more than the average person",
                            "Very concerned", "very concerned",
                            "Very important") ~ 5,
                   x %in% c("Agree",
                            "Very Much",
                            "More than the average person") ~ 6,
                   x %in% c("Strongly agree", "Strongly Agree",
                            "Extremely",
                            "Much more than the average person") ~ 7
  )
}

data$var1 <- rep(c("Agree", "Disagree"), 75)
data$var2 <- rep(c("Strongly disagree", "Strongly agree"), 75)
data$var3 <- rep(c("Strongly Agree", "Strongly Disagree"), 75)

#List the variables that we want to run factorise on. Use quotes.
columns <- c("var1", "var2", "var3")
#Use sapply for all data where the columns can be called using quotes.
#To realize why this is working, reference "Brackets".
#run factorise on those columns.
data[columns] <- sapply(data[columns], factorise)

#see they're all numbers. Also note the difference in caps was covered in var 2-3.
table(data$var1)
## 
##  2  6 
## 75 75
table(data$var2)
## 
##  1  7 
## 75 75
table(data$var3)
## 
##  1  7 
## 75 75

summarystats()

summarystats() is the lab’s quick way to get the alpha, mean, and standard deviation of scales. It takes two arguments. The first is a dataset of only the columns that include the questions that make up the scale. Thus, prior to running summarystats(), you would need to create a new dataset of just those variables using select(). The second argument is the overall averaged variable in our main dataset. Thus, to make this one, you would need to use rowMeans() to create an overall average variable of the scale that will be used in the analysis.

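summarystats <- function(input1, input2){
  #input1: dataset of only the scale's items; input2: the averaged scale variable.
  #The result is named stats so that it does not hide base R's list().
  stats <- c(psych::alpha(input1)$total[1], #raw alpha
             mean=mean(input2, na.rm=TRUE),
             sd=sd(input2, na.rm=TRUE))
  return(stats)
}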

data$Sepal.Example <- rowMeans(
    subset(
      data,
      select=c(
        Sepal.Length,
        Sepal.Width
        )))

data_select <-  data %>% subset(
    select=c(
      Sepal.Length,
      Sepal.Width
      )
    )

summarystats(data_select, data$Sepal.Example)
## Number of categories should be increased  in order to count frequencies.
## Warning in psych::alpha(input1): Some items were negatively correlated with the total scale and probably 
## should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( Sepal.Length ) were negatively correlated with the total scale and 
## probably should be reversed.  
## To do this, run the function again with the 'check.keys=TRUE' option
## Warning in sqrt(Vtc): NaNs produced
## $raw_alpha
## [1] -0.214637
## 
## $mean
## [1] 4.950333
## 
## $sd
## [1] 0.4446361
rm(data_select)
data <- data %>% select(-Sepal.Example)