Good morning! Ready to learn how to use R? Let’s get started.
Setting up R & RStudio before you start working
First, let’s get a few things set up in RStudio:
Click File > New File > R Script. This will open a new panel in the top left of your RStudio layout.
Click File > Save As. You can name the file “practice”. Make a new folder on the desktop called “workshop” and save it there.
We need to tell R where all our files are. We do this by setting the working directory. In the bottom-right panel in RStudio, click on the Files tab. Click the little box with three dots “…”, then navigate to your “workshop” folder. Last, click on the blue gear with “More” written next to it and select “Set as working directory”.
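The same step can also be done in code, which is handy once you start reusing scripts. The sketch below is only an illustration, not part of the workshop steps: it uses a temporary folder as a stand-in for the “workshop” folder so that it runs on any machine, and `workshop_dir` and `old_dir` are just names chosen for this example.

```r
# A temporary folder stands in for your real "workshop" folder
# (an assumption for this example so the code runs anywhere)
workshop_dir <- file.path(tempdir(), "workshop")
dir.create(workshop_dir, showWarnings = FALSE)

# setwd() changes the working directory and returns the previous one
old_dir <- setwd(workshop_dir)

# getwd() shows the folder R is currently working in
getwd()

# Switch back to where we started
setwd(old_dir)
```

On your own computer you would pass the real path to setwd() instead; the exact path depends on where you saved the folder.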
Okay, now we are ready to start doing things. We are going to write our first lines of code to install and load a package. Packages are add-ons to R that provide specialized functions.
Type the following line of code in your Console (bottom-left window) and then press Enter.
install.packages("tidyverse")
You just downloaded and installed a package. You also just wrote and executed your first line of code. Congratulations! You are now a coder.
Now, let’s load the package so we can use it. In the top-left panel, type the following line of code into your R Script. To run this code, you should press CTRL+Enter (Command+Enter on a Mac).
library(tidyverse)
package ‘tidyverse’ was built under R version 3.5.3
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.3.0
v tibble  2.0.1     v dplyr   0.8.0.1
v tidyr   0.8.2     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
We’re going to download some data that we will use today. Copy and paste the following line of code in your Console:
download.file("https://raw.githubusercontent.com/gtlaflair/ltrc-2019/gh-pages/data/placement_1.csv",
"placement_1.csv", mode = "wb")
trying URL 'https://raw.githubusercontent.com/gtlaflair/ltrc-2019/gh-pages/data/placement_1.csv'
Content type 'text/plain; charset=utf-8' length 17869 bytes (17 KB)
downloaded 17 KB
Great, now we should be all set to go. We will use that package and that data a little bit later.
Some key basics of how R works
R is a calculator!
You can use R like a calculator. Try it out:
2+2
[1] 4
3*8
[1] 24
((2+2)*(3*8))/10
[1] 9.6
9^2
[1] 81
This isn’t the main reason to use R, but you will occasionally use mathematical operations in R. What if you want to take the square root of a number?
You do things in R with functions
Functions are the powerhouse of R: These are commands you use to do things. Base R has many functions, and if you load additional packages, you can use other functions. install.packages() and library() are functions, so you are already an experienced function user. To calculate a square root, you can use a function called sqrt(). Let’s try:
sqrt(49)
[1] 7
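Many functions take more than one argument, and functions can be placed inside each other. Here is a small sketch using two base R functions, sqrt() and round(); the numbers are made up for illustration.

```r
# sqrt() takes one argument: the number to take the root of
sqrt(50)

# round() takes two: the number, and how many decimal places to keep
round(7.0710678, digits = 2)

# Functions can be nested; the inner function runs first
round(sqrt(50), digits = 2)
```

If you are not sure which arguments a function accepts, typing a question mark before its name (for example ?round) opens its help page.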
But to really do work in R, you’ll need to use objects
Objects, like the name suggests, are things you can do stuff to in R. You use functions to create and change objects, and you can use functions and objects together to create new objects. Objects are good ways of saving something you want for later. Let’s start making a few simple objects.
First, let’s save the number 7 as an object:
seven <- 7
And now let’s do some math with it:
seven + 7
[1] 14
seven^2
[1] 49
seven/seven
[1] 1
seven
[1] 7
Yes, none of this is very interesting yet. But as you can see, you can use and manipulate objects. Importantly, if you just type the object name, R will simply show you the object - in whatever form makes the most sense.
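An object does not have to be a single number. As a quick sketch, the base R function c() combines several values into one object; the name `scores` and the values are made up for this example.

```r
# c() combines several values into one object (a vector)
scores <- c(12, 15, 9)

# Functions work on the whole object at once
mean(scores)
length(scores)

# Math is applied to every value in the vector
scores + 1
```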
I won’t spend too much more time on these kinds of simple objects, but feel free to ask questions. Instead, let’s make an object that has some real data we might actually be interested in working with.
data <- read_csv("placement_1.csv")
Parsed with column specification:
cols(
.default = col_double(),
names = col_character(),
country = col_character(),
admin_date = col_datetime(format = "")
)
See spec(...) for full column specifications.
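After reading in a file, it is usually worth taking a quick look at what you got. The sketch below uses a small made-up data frame called `toy` as a stand-in for the real data (so it runs without the downloaded file); the same base R functions should work on the object created by read_csv().

```r
# A small made-up data frame standing in for the real data
# (an assumption for this example)
toy <- data.frame(
  names   = c("A", "B", "C"),
  country = c("china", "russia", "china"),
  item_1  = c(1, 0, 1)
)

dim(toy)    # number of rows and columns
names(toy)  # column names
head(toy)   # the first few rows
str(toy)    # the type of each column
```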
Creating new variables
This dataset is missing total scores. Let’s create some new variables for these using dplyr::mutate. First, we’ll make a reading total score.
data <- data %>% mutate(read_total = rowSums(.[40:74], na.rm = TRUE))
Now we’ll do the same thing for listening:
data <- data %>% mutate(list_total = rowSums(.[5:39], na.rm = TRUE))
Finally, let’s create a whole-test total score. This will be easy!
data <- mutate(data, total = read_total + list_total)
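The totals above pick out the item columns by position (for example, columns 40 to 74). Here is a hedged sketch of the same idea on a tiny made-up data frame, selecting columns by name instead; `toy` and the `item_` column names are invented for this example and are not the names in the real dataset.

```r
library(dplyr)

# Made-up item scores; NA marks a skipped item (an assumption for this example)
toy <- data.frame(
  name   = c("A", "B"),
  item_1 = c(1, 0),
  item_2 = c(1, 0),
  item_3 = c(NA, 1)
)

# Sum across the item columns for each row, ignoring missing values
toy <- toy %>%
  mutate(item_total = rowSums(select(., item_1:item_3), na.rm = TRUE))

toy
```

Selecting by name can be a little safer than selecting by position, because the code keeps working if a column is added or moved.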
Summarizing data
We can also use dplyr to summarize data in R. Let’s start simple by just looking at total scores.
data %>% summarise(total_mean = mean(total))
We can do more than one summary stat in the same function. We’ll add in SD, median, min, and max.
data %>% summarise(total_mean = mean(total),
total_sd = sd(total),
total_median = median(total),
total_min = min(total),
total_max = max(total))
What if we want to save those summary stats so that we can look at them later? Let’s create a new object that contains our calculations of summary stats.
total_summary <- data %>% summarise(total_mean = mean(total),
total_sd = sd(total),
total_median = median(total),
total_min = min(total),
total_max = max(total))
Now if we just type “total_summary” and run it (in a script or in the console), we get to see our values.
total_summary
One very powerful feature of dplyr is the ability to do split-apply-combine. Basically, you split up your data into groups, apply some function or calculation, and then combine the results. We’ll try this out by using dplyr::group_by to produce summary statistics for learners’ country of origin.
data %>% group_by(country) %>% summarise(total_mean = mean(total),
total_sd = sd(total),
total_median = median(total),
total_min = min(total),
total_max = max(total))
Of course, we will probably want to save these summary stats. We can save to an object, and we can write that object to a .csv that we can send to our colleagues.
country_summary <- data %>% group_by(country) %>% summarise(total_mean = mean(total),
total_sd = sd(total),
total_median = median(total),
total_min = min(total),
total_max = max(total))
write_csv(country_summary, "country_summary.csv")
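To see split-apply-combine at a glance, here is a minimal sketch on a made-up data frame with only four rows, small enough to check the results by hand. The object names `toy` and `toy_summary` and the values are assumptions for this example.

```r
library(dplyr)

# Four made-up test-takers from two made-up groups
toy <- data.frame(
  country = c("a", "a", "b", "b"),
  total   = c(10, 20, 30, 50)
)

# Split by country, apply n() and mean(), combine into one row per group
toy_summary <- toy %>%
  group_by(country) %>%
  summarise(n = n(), total_mean = mean(total))

toy_summary
```

Adding n() to a grouped summary is a handy habit, since it shows how many rows each group's statistics are based on.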
Basic Stats
Very briefly, here are ways to run some basic stats in R.
Correlation
R is great for correlations. There are packages and functions that can run correlations for a whole bunch of variables at once and even automatically generate really cool graphics to illustrate correlations. We’ll start with a simple but very nice function, cor.test(), that provides a p value AND a confidence interval for a simple bivariate correlation.
cor.test(data$read_total, data$list_total, method = "pearson")
Pearson's product-moment correlation
data: data$read_total and data$list_total
t = 8.8627, df = 86, p-value = 9.368e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5629244 0.7865349
sample estimates:
cor
0.6909085
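The result of cor.test() can also be saved as an object, which lets you pull out individual values instead of copying them from the printout. A small sketch with made-up numbers (`x`, `y`, and `result` are names chosen for this example):

```r
# Made-up scores for five test-takers
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Save the test result as an object
result <- cor.test(x, y, method = "pearson")

result$estimate  # the correlation coefficient
result$p.value   # the p value
result$conf.int  # the 95 percent confidence interval
```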
t-test
R, of course, can also do a simple t-test. Let’s compare the total scores of Russian and Chinese students.
First, we will use a tidyverse package called dplyr to filter the dataset.
china_russia <- data %>% filter(country %in% c("china", "russia"))
t-test time!
t.test(total ~ country, data = china_russia)
Welch Two Sample t-test
data: total by country
t = 0.59227, df = 54.286, p-value = 0.5561
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.048987 7.444821
sample estimates:
mean in group china mean in group russia
41.91667 40.21875
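Notice that the output says “Welch Two Sample t-test”: by default, t.test() does not assume equal variances in the two groups. If you want the classic Student’s t-test instead, you can set var.equal = TRUE. The sketch below compares the two on made-up data; `toy`, `welch`, and `student` are names invented for this example.

```r
# Two made-up groups of four scores each
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  score = c(10, 12, 14, 16, 11, 15, 19, 23)
)

# Default: Welch's t-test (variances not assumed equal)
welch <- t.test(score ~ group, data = toy)

# Student's t-test (pooled variance)
student <- t.test(score ~ group, data = toy, var.equal = TRUE)

# Compare the degrees of freedom used by each test
welch$parameter
student$parameter
```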
ANOVA
An ANOVA would let us compare the total scores of students from all 3 countries.
country_total_anova <- aov(total ~ country, data = data)
country_total_anova
Call:
aov(formula = total ~ country, data = data)
Terms:
                  country Residuals
Sum of Squares    269.936 10540.019
Deg. of Freedom         2        85
Residual standard error: 11.13554
Estimated effects may be unbalanced
summary(country_total_anova)
            Df Sum Sq Mean Sq F value Pr(>F)
country      2    270     135   1.088  0.341
Residuals   85  10540     124
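When an ANOVA does come out significant, a common next step is to compare the groups pairwise. One option in base R is TukeyHSD(), which works directly on an aov object. This is only a sketch on made-up data (`toy`, `toy_anova`, and `toy_tukey` are invented names), not an analysis of the placement data.

```r
# Three made-up groups of four scores each
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 4)),
  score = c(10, 12, 14, 16, 11, 15, 19, 23, 20, 22, 24, 26)
)

# Fit the one-way ANOVA
toy_anova <- aov(score ~ group, data = toy)
summary(toy_anova)

# Tukey's HSD: all pairwise comparisons with adjusted p values
toy_tukey <- TukeyHSD(toy_anova)
toy_tukey
```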
R can do many, many more statistical analyses: everything from chi-square tests to mixed-effects regression to structural equation modeling to network analysis. This was just a taste of the basics!
Graphics with R
Graphics might be the biggest reason to start learning and using R. Compared to SPSS or Excel, graphics made in R generally look better and are actually easier to customize once you get familiar with the basics. I strongly recommend using the ggplot2 package (part of tidyverse) for creating graphics.
Histograms: an easy first plot
We’ll start by making a histogram of total scores.
ggplot(data = data, aes(x = total))+
geom_histogram()

That is a very basic, but very readable histogram. Let’s customize it a bit to make it prettier:
total_hist <- ggplot(data = data, aes(x = total))+
geom_histogram(binwidth = 5)+
labs(x = "Total Score", y = "Number of Test-Takers")+
theme_bw()
total_hist

Now this looks pretty nice! We can save a high-resolution version of this that will look great in a PowerPoint or a paper:
ggsave("total_histogram.png", total_hist, width = 6, height = 4, units = "in", dpi = 300)
Go to your folder and open that - looks pretty good, right? We can also do some really neat stuff looking at subgroups with plots. Let’s try:
ggplot(data = data, aes(x = total))+
geom_histogram(binwidth = 5)+
labs(x = "Total Score", y = "Number of Test-Takers")+
theme_bw()+
facet_wrap(~country, nrow = 1)

Looks like Russian students might have a bimodal distribution! You can save this graph if you want - look at the previous example for hints.
Boxplots and variations
Boxplots are nice ways of summarizing different groups or different variables graphically. Here’s a simple one:
ggplot(data = data, aes(x = country, y = total))+
geom_boxplot()+
theme_bw()

That’s okay, but we could make it more informative. Violin plots are a variation that shows the distributions in more detail. We’ll also add some color just to make things pop more.
ggplot(data = data, aes(x = country, y = total, fill = country))+
geom_violin()+
theme_bw()

We can also add individual points to boxplots (or violin plots):
ggplot(data = data, aes(x = country, y = total, fill = country))+
geom_boxplot()+
geom_dotplot(binaxis='y', stackdir='center', dotsize=1, binwidth = 1, fill = "black")+
theme_bw()

Lots of flexibility here!
Scatterplots
The last plot type we’ll take a look at today is the scatterplot. These are often the most informative plots, in my opinion, because they can show the relationship between variables in really interesting ways. Let’s start simple:
ggplot(data = data, aes(x = read_total, y = list_total))+
geom_point()

This shows a pretty clear relationship between reading and listening scores, just like our correlation from earlier indicated. In this plot, it doesn’t look like we have too much overplotting (i.e., when you have points that overlap each other). But if we did have some overplotting issues, we could use geom_jitter instead of geom_point, like this:
ggplot(data = data, aes(x = read_total, y = list_total))+
geom_jitter()

Looking at this, I think we should probably use geom_jitter. geom_jitter adds a little bit of random noise to the location of points, nudging them away from each other so you can see them better.
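You can also control how much noise geom_jitter adds, using its width and height arguments. Here is a small sketch on made-up scores with lots of identical points; the data frame `toy` and the jitter amounts are assumptions chosen for illustration.

```r
library(ggplot2)

# Made-up integer scores with many identical pairs
toy <- data.frame(
  read_total = c(10, 10, 10, 12, 12, 15, 15, 15),
  list_total = c(8, 8, 8, 11, 11, 14, 14, 14)
)

# The noise is random and is drawn when the plot is printed,
# so setting a seed first makes the picture reproducible
set.seed(1)

# width and height set the maximum nudge in each direction
p <- ggplot(data = toy, aes(x = read_total, y = list_total)) +
  geom_jitter(width = 0.2, height = 0.2)

p
```

Small values keep points close to their true location; larger values separate them more but make exact scores harder to read off the plot.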
Let’s clean up this plot and add a regression line to show the relationship more clearly.
ggplot(data = data, aes(x = read_total, y = list_total))+
geom_jitter()+
geom_smooth(method = lm, se = FALSE)+
labs(y = "Listening Score", x = "Reading Score")+
theme_bw()

This is looking nice! We can also break down this relationship by groups according to country:
ggplot(data = data, aes(x = read_total, y = list_total, group = country, color = country))+
geom_jitter()+
geom_smooth(method = lm, se = FALSE)+
labs(y = "Listening Score", x = "Reading Score")+
theme_bw()
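The straight lines drawn by geom_smooth(method = lm) are ordinary least-squares regression lines. If you want the numbers behind such a line, you can fit the same kind of model yourself with lm(). A minimal sketch on made-up scores (the data frame `toy` and the object `fit` are invented for this example):

```r
# Made-up reading and listening scores for five test-takers
toy <- data.frame(
  read_total = c(1, 2, 3, 4, 5),
  list_total = c(2, 4, 5, 4, 5)
)

# Fit a straight line predicting listening from reading
fit <- lm(list_total ~ read_total, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# Fuller output, including standard errors and R-squared
summary(fit)
```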

I encourage you to try using R to create graphics for your next presentation, paper, or thesis, even if you still want to use SPSS for your stats.
---
title: "Intro to R & RStudio for HUFS TESOL"
output: html_notebook
---
