Intro to R and Test Score Data
Intro to R
R has many useful functions for analyzing data. There are also add-on functions, bundled into packages, that you can download and use. To install a package, click the Packages tab in the bottom right, then click Install to bring up a window where you can search for and install a specific package. For today, you will need to install the tidyverse and DescTools packages.
Alternatively, you can use the following code:
install.packages(c("tidyverse", "DescTools"))
Now you have two packages installed. You won’t need to install packages every time you want to use them, but you will need to load them every time you want to use them. Here’s the code to do that:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DescTools)
R has many functions, and it can take a while to get comfortable with basic tasks, but we will start simple today.
First, as a warm-up, we will try using R as a calculator:
1+1
[1] 2
Addition and subtraction are very easy in R, and so are multiplication (*) and division (/). As in ordinary math, we need to use parentheses carefully so that operations happen in the order we intend.
((1+1)/2)*30
[1] 30
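To see why the parentheses matter, here is a small sketch comparing the same numbers with and without them. R follows the usual order of operations, so division and multiplication happen before addition unless parentheses say otherwise. The variable names are only for illustration.

```r
# Without parentheses: 1/2*30 is evaluated first, then 1 is added
without_parens <- 1 + 1 / 2 * 30

# With parentheses: 1+1 is evaluated first, then divided by 2 and multiplied by 30
with_parens <- ((1 + 1) / 2) * 30

without_parens  # 16
with_parens     # 30
```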
Although we can use symbols like + - * / to do math, for most things we do in R we will need to use functions. A common function for math is sqrt(), which allows us to calculate a square root.
sqrt(16)
[1] 4
Test Score Data
Now we will take a look at some test score data. This data comes from Loewen et al. (2020), where we administered three tests to measure growth in Spanish ability after studying via the language learning app Babbel:
ACTFL OPIc
Spanish Vocabulary Yes/No test (modeled after LexTale)
Spanish Grammar Test (involving error identification and correction)
First, make sure you have the Spanish_Tests_for_Babbel.csv file downloaded and in your current working directory (click on the Files tab in the bottom right and see if it is showing up). Then, we will use the following code to load the data into R:
babbel <- read_csv("Spanish_Tests_for_Babbel.csv")
Rows: 54 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ID
dbl (3): Speaking, Grammar, Vocabulary
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To look at this data, you can click the small grid icon in the top-right panel (in the Environment tab). We can also look at the first few rows with the head() function:
head(babbel)
# A tibble: 6 × 4
  ID    Speaking Grammar Vocabulary
  <chr>    <dbl>   <dbl>      <dbl>
1 P001         1      10         14
2 P002         4      32         31
3 P003         3      22         27
4 P004         3      34         33
5 P007         2      19          6
6 P008         4      25         24
As you can see, there are four variables, corresponding to the four columns. ID is a nominal variable, Speaking is ordinal, and Grammar and Vocabulary are continuous.
In R, we often need to ‘isolate’ or refer to a specific variable. We can do that with the data$variable format. Let’s try it:
babbel$Grammar
 [1] 10 32 22 34 19 25 25 16  4 11 28 12 28  6 11 36 42 14 34 36  4 44 33 39 37
[26] 48  7 11 31  6 30 48 11 19 13 43  9 23  3 28 16 17 11  8 26  3 12 20 32  1
[51]  0  3  1 11
This brings up all the values in the Grammar variable. These are the individual grammar test scores. We calculate measures of central tendency and dispersion based on all of these scores.
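To see the data$variable format in isolation, here is a minimal sketch using a small made-up data frame. The name `toy` and its values are hypothetical and are not part of the Babbel data.

```r
# Hypothetical toy data, not the Babbel scores
toy <- data.frame(
  ID = c("A", "B", "C"),
  Score = c(12, 7, 20)
)

# Pull out one column as a plain vector of values
toy$Score

# Double brackets with the column name reach the same column
toy[["Score"]]

# Functions can take the isolated column as their input
max(toy$Score)
```

The same pattern applies to any data frame: the name before the `$` is the data set, and the name after it is the column.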
Central Tendency
For all three tests, we can start by computing a very simple measure of central tendency, the mode. We will use the Mode() function from DescTools:
OPIc: The mode is 1, which occurred 17 times.
Mode(babbel$Speaking)
[1] 1
attr(,"freq")
[1] 17
Grammar: The mode is 11, which occurred 6 times.
Mode(babbel$Grammar)
[1] 11
attr(,"freq")
[1] 6
Vocabulary: The mode is 21, which occurred 6 times.
Mode(babbel$Vocabulary)
[1] 21
attr(,"freq")
[1] 6
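If you are curious how a mode could be found without DescTools, the sketch below counts values with the base R table() function. The helper name `simple_mode` and the example scores are made up for illustration; this is not how Mode() is implemented internally.

```r
# Hypothetical helper: find the most frequent value(s) using base R only
simple_mode <- function(x) {
  counts <- table(x)                     # how often each distinct value occurs
  top <- counts[counts == max(counts)]   # keep the value(s) tied for most frequent
  list(
    mode = as.numeric(names(top)),       # the value(s) themselves
    freq = unname(max(counts))           # how many times they occurred
  )
}

# Made-up scores: the value 1 occurs three times
example_scores <- c(1, 1, 2, 3, 1, 2)
simple_mode(example_scores)
```

If two or more values are tied for most frequent, this sketch returns all of them in `mode`.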
The OPIc scores are ordinal, so the median will be our best measure of central tendency:
median(babbel$Speaking)
[1] 2
The value of 2 corresponds to Novice Mid on the ACTFL scale. We can also calculate medians for the grammar and vocabulary test scores:
median(babbel$Grammar)
[1] 18
median(babbel$Vocabulary)
[1] 21
Because the Grammar and Vocabulary test scores are continuous (interval), we can also calculate the mean for each.
mean(babbel$Grammar)
[1] 20.24074
mean(babbel$Vocabulary)
[1] 19.62963
In the previous bit of code, we got means that have many digits after the decimal place. Generally, we only need two digits. We can make that happen with the round() function.
round(mean(babbel$Grammar), 2)
[1] 20.24
This code example also shows how you can put functions within functions in R. This is not so different from spreadsheet software like Excel or Sheets!
Dispersion
For now, we’ll focus on the Grammar scores. The first measure of dispersion we will look at is the range. There is a simple R function that calculates it:
range(babbel$Grammar)
[1]  0 48
This shows that the lowest score is 0, and the highest is 48, so we can say the range is 48. You can also use the min() and max() functions to get each value separately:
min(babbel$Grammar)
[1] 0
max(babbel$Grammar)
[1] 48
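Because range() returns two numbers (the minimum and the maximum), getting the range as a single value takes one more step. The sketch below shows two equivalent ways on a small made-up vector; the name `x` and its values are hypothetical.

```r
# Made-up scores for illustration
x <- c(0, 11, 25, 48)

# diff() subtracts the first element of range() from the second
diff(range(x))   # 48

# The same thing written out with max() and min()
max(x) - min(x)  # 48
```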
The next measure of dispersion to calculate is the standard deviation:
sd(babbel$Grammar)
[1] 13.62979
And if we want a nicer-looking number, we can round it:
round(sd(babbel$Grammar), 2)
[1] 13.63
So, putting it all together, we could make a table with all of the descriptive statistics for the Grammar test scores. But first, very quickly, it’s important to note the number of people who took the test:
length(babbel$Grammar)
[1] 54
On to a summary table:
| Test | N | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| Grammar | 54 | 20.24 | 13.63 | 18 | 0 | 48 |
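The same summary can be assembled in code rather than by hand. The sketch below wraps the functions used above in a hypothetical helper called `describe_scores`; it is run here on only the first six Grammar scores shown by head(babbel), so its numbers will differ from the full table.

```r
# Hypothetical helper: collect descriptive statistics into a one-row data frame
describe_scores <- function(x, label) {
  data.frame(
    Test = label,
    N = length(x),
    Mean = round(mean(x), 2),
    SD = round(sd(x), 2),
    Median = median(x),
    Min = min(x),
    Max = max(x)
  )
}

# First six Grammar scores only, for illustration
first_six <- c(10, 32, 22, 34, 19, 25)
describe_scores(first_six, "Grammar (first six)")
```

With the data loaded, calling describe_scores(babbel$Grammar, "Grammar") should give the values in the table above.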
Visualizing Test Score Data
To see what the distribution of test scores looks like, it is very useful to create a plot called a histogram. There are two ways to do this in R - a quick way, and a pretty way.
First, the quick way, using the hist() function:
hist(babbel$Grammar)
This plot was very easy to make, and it doesn’t look too bad, but the labels are messy. We can make one that looks a bit nicer using the ggplot() function from the tidyverse package.
ggplot(data = babbel, aes(x = Grammar)) + # this defines the data for the plot
  geom_histogram(binwidth = 5) +
  labs(x = "Grammar Test Score", y = "Frequency") +
  theme_bw()
What do you notice about this distribution of Grammar test scores?
Extra: Score Transformations
It is also very easy for us to do score transformations in R. We will just need to do a little math and create a new variable:
babbel$Grammar_z <- (babbel$Grammar - mean(babbel$Grammar))/sd(babbel$Grammar)
Let’s take a look at these z-scores for Grammar:
head(babbel$Grammar_z)
[1] -0.75134993  0.86276167  0.12907458  1.00949909 -0.09103155  0.34918071
The mean of a set of z-scores should be about 0:
mean(babbel$Grammar_z)
[1] 1.937007e-17
round(mean(babbel$Grammar_z),2)
[1] 0
And the SD should be about 1:
sd(babbel$Grammar_z)
[1] 1
How about the range?
range(babbel$Grammar_z)
[1] -1.485037  2.036661
So there is one person about 1.5 SD units below the mean, and one person just over 2 SD units above the mean.
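Base R also has a scale() function that, with its default settings, subtracts the mean and divides by the standard deviation. The sketch below checks that it matches the manual formula on a small vector (the first six Grammar scores from head(babbel)); the variable names are only for illustration.

```r
# First six Grammar scores only, for illustration
x <- c(10, 32, 22, 34, 19, 25)

# Manual z-scores, using the same formula as above
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so as.numeric() turns it back into a plain vector
z_scale <- as.numeric(scale(x))

# TRUE if the two approaches agree (allowing for tiny rounding differences)
all.equal(z_manual, z_scale)
```

Note that these z-scores are relative to the six values used here, so they will not match the z-scores computed from all 54 scores.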
With the z-scores, we can compute t and CEEB scores very easily:
babbel$Grammar_t <- (babbel$Grammar_z*10) + 50
babbel$Grammar_CEEB <- (babbel$Grammar_z*100) + 500
And check out the t scores:
head(babbel$Grammar_t)
[1] 42.48650 58.62762 51.29075 60.09499 49.08968 53.49181
And CEEB scores:
head(babbel$Grammar_CEEB)
[1] 424.8650 586.2762 512.9075 600.9499 490.8968 534.9181
Of course, you can round these scores, too.
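Since t and CEEB scores use the same formula with a different mean and SD, the transformation and the rounding can be combined in one step. The sketch below uses a hypothetical helper called `to_standard`; the example vector is the first six Grammar scores from head(babbel), so the results are relative to those six values only.

```r
# Hypothetical helper: convert raw scores to a standard scale and round the result
to_standard <- function(x, new_mean, new_sd, digits = 0) {
  z <- (x - mean(x)) / sd(x)             # z-scores first
  round(z * new_sd + new_mean, digits)   # rescale, then round
}

# First six Grammar scores only, for illustration
x <- c(10, 32, 22, 34, 19, 25)

to_standard(x, new_mean = 50, new_sd = 10, digits = 1)   # t scores
to_standard(x, new_mean = 500, new_sd = 100)             # CEEB scores
```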