Intro to R and Test Score Data

Author

Dan Isbell

Intro to R

R has many useful built-in functions for analyzing data. You can also download add-ons, called packages, that provide additional functions. To install a package, click the Packages tab in the bottom-right panel, then click Install to bring up a window where you can search for and install it. For today, you will need to install the tidyverse and DescTools packages.

Alternatively, you can use the following code:

install.packages(c("tidyverse", "DescTools"))

Now you have two packages installed. You won’t need to install packages every time you want to use them, but you will need to load them in each new R session. Here’s the code to do that:

library(tidyverse)
library(DescTools)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R has many functions, and it can take a while to get comfortable with basic tasks, but we will start simple today.

First, as a warm-up, we will try using R as a calculator:

1+1

[1] 2

Addition and subtraction are very easy in R, and the same goes for multiplication (*) and division (/). We also need to use parentheses carefully when doing math or writing code in R.

((1+1)/2)*30

[1] 30
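To see why the parentheses matter, compare the same numbers with and without them. R follows the usual order of operations, so multiplication and division are carried out before addition:

```r
# With parentheses: add first, then divide, then multiply
((1 + 1) / 2) * 30
# [1] 30

# Without parentheses: 1 / 2 * 30 is evaluated first (giving 15), then 1 is added
1 + 1 / 2 * 30
# [1] 16
```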

Although we can use symbols like + - * / to do math, for most things we do in R we will need to use functions. A common function for math is sqrt(), which allows us to calculate a square root.

sqrt(16)

[1] 4

Test Score Data

Now we will take a look at some test score data. This data comes from Loewen et al. (2020), where we administered three tests to measure growth in Spanish ability after studying via the language learning app Babbel:

ACTFL OPIc
Spanish Vocabulary Yes/No test (modeled after LexTale)
Spanish Grammar Test (involving error identification and correction)

First, make sure you have the Spanish_Tests_for_Babbel.csv file downloaded and in your current working directory (click on the Files tab in the bottom right and see if it is showing up). Then, we will use the following code to load the data into R:

babbel <- read_csv("Spanish_Tests_for_Babbel.csv")

Rows: 54 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ID
dbl (3): Speaking, Grammar, Vocabulary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

To look at this data, you can click the small grid icon in the top-right panel (in the Environment tab). We can also look at the first few rows with the head() function:

head(babbel)

# A tibble: 6 × 4
  ID    Speaking Grammar Vocabulary
  <chr>    <dbl>   <dbl>      <dbl>
1 P001         1      10         14
2 P002         4      32         31
3 P003         3      22         27
4 P004         3      34         33
5 P007         2      19          6
6 P008         4      25         24

As you can see, there are four variables, corresponding to the four columns. ID is a nominal variable, Speaking is ordinal, and Grammar and Vocabulary are continuous.

In R, we often need to ‘isolate’ or refer to a specific variable. We can do that with the data$variable format. Let’s try it:

babbel$Grammar

 [1] 10 32 22 34 19 25 25 16  4 11 28 12 28  6 11 36 42 14 34 36  4 44 33 39 37
[26] 48  7 11 31  6 30 48 11 19 13 43  9 23  3 28 16 17 11  8 26  3 12 20 32  1
[51]  0  3  1 11

This brings up all the values in the Grammar variable. These are the individual grammar test scores. We calculate measures of central tendency and dispersion based on all of these scores.
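If you want to experiment with the data$variable format without the CSV file, the sketch below rebuilds just the first six rows shown by head(babbel). The object name babbel_head is made up for this example. It also shows double square brackets, which are another way to pull out a column.

```r
# The first six rows of the data, typed in by hand for illustration
babbel_head <- data.frame(
  ID = c("P001", "P002", "P003", "P004", "P007", "P008"),
  Speaking = c(1, 4, 3, 3, 2, 4),
  Grammar = c(10, 32, 22, 34, 19, 25),
  Vocabulary = c(14, 31, 27, 33, 6, 24)
)

# Two equivalent ways to pull out one column as a vector
babbel_head$Grammar
babbel_head[["Grammar"]]

# Single square brackets pick out individual values, e.g. the second score
babbel_head$Grammar[2]
# [1] 32
```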

Central Tendency

For all three tests, we can start by computing a very simple measure of central tendency, the mode. We will use the Mode() function from DescTools:

OPIc: The mode is 1, which occurred 17 times.

Mode(babbel$Speaking)

[1] 1
attr(,"freq")
[1] 17

Grammar: The mode is 11, which occurred 6 times.

Mode(babbel$Grammar)

[1] 11
attr(,"freq")
[1] 6

Vocabulary: The mode is 21, which occurred 6 times.

Mode(babbel$Vocabulary)

[1] 21
attr(,"freq")
[1] 6
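To see what a mode calculation involves, here is a rough base R sketch that counts how often each score occurs. It is only an illustration of the idea, not a description of how Mode() works internally. The grammar vector is copied from the output of babbel$Grammar above.

```r
# Grammar scores copied from the output of babbel$Grammar
grammar <- c(
  10, 32, 22, 34, 19, 25, 25, 16,  4, 11, 28, 12, 28,  6, 11, 36, 42, 14,
  34, 36,  4, 44, 33, 39, 37, 48,  7, 11, 31,  6, 30, 48, 11, 19, 13, 43,
   9, 23,  3, 28, 16, 17, 11,  8, 26,  3, 12, 20, 32,  1,  0,  3,  1, 11
)

counts <- table(grammar)   # how many times each score occurs

max(counts)                # the highest frequency
# [1] 6

# The score(s) with that frequency; names() returns them as text
names(counts)[counts == max(counts)]
# [1] "11"
```

If two scores were tied for the highest frequency, the last line would return both of them.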

The OPIc scores are ordinal, so the median will be our best measure of central tendency:

median(babbel$Speaking)

[1] 2

The value of 2 corresponds to Novice Mid on the ACTFL scale. We can also calculate medians for the grammar and vocabulary test scores:

median(babbel$Grammar)

[1] 18

median(babbel$Vocabulary)

[1] 21
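The median is the middle score once the scores are sorted. With an even number of scores (54 here) there is no single middle score, so the median is the average of the two middle ones. A small sketch with the Grammar scores, copied from the output of babbel$Grammar above:

```r
# Grammar scores copied from the output of babbel$Grammar
grammar <- c(
  10, 32, 22, 34, 19, 25, 25, 16,  4, 11, 28, 12, 28,  6, 11, 36, 42, 14,
  34, 36,  4, 44, 33, 39, 37, 48,  7, 11, 31,  6, 30, 48, 11, 19, 13, 43,
   9, 23,  3, 28, 16, 17, 11,  8, 26,  3, 12, 20, 32,  1,  0,  3,  1, 11
)

sorted <- sort(grammar)    # lowest to highest
n <- length(sorted)        # 54, an even number

# The two middle scores (the 27th and 28th)
middle <- sorted[c(n / 2, n / 2 + 1)]
middle
# [1] 17 19

# Their average matches median(grammar)
mean(middle)
# [1] 18
```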

Because the Grammar and Vocabulary test scores are continuous (interval), we can also calculate the mean for each.

mean(babbel$Grammar)

[1] 20.24074

mean(babbel$Vocabulary)

[1] 19.62963

Cleaning Up Numbers - Rounding

In the previous bit of code, we got means that have many digits after the decimal place. Generally, we only need two digits. We can make that happen with the round() function.

round(mean(babbel$Grammar), 2)

[1] 20.24

This code example also shows how you can put functions within functions in R. This is not so different from spreadsheet software like Excel or Sheets!
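If nested functions are hard to read at first, you can get the same result in two steps by storing the inner result in an object. The sketch below uses only the first three Grammar scores to keep things short; the object names scores and m are made up for this example.

```r
# The first three Grammar scores, used here as a small example
scores <- c(10, 32, 22)

# Nested: mean() runs first, and its result is passed to round()
round(mean(scores), 2)
# [1] 21.33

# Step by step: store the mean in an object, then round it
m <- mean(scores)
round(m, 2)
# [1] 21.33
```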

Dispersion

For now, we’ll focus on the Grammar scores. The first measure of dispersion we will look at is the range. There is a simple R function that calculates it:

range(babbel$Grammar)

[1]  0 48

This shows that the lowest score is 0, and the highest is 48, so we can say the range is 48. You can also use the min() and max() functions to get each value separately:

min(babbel$Grammar)

[1] 0

max(babbel$Grammar)

[1] 48

The next measure of dispersion to calculate is the standard deviation:

sd(babbel$Grammar)

[1] 13.62979

And if we want a nicer-looking number, we can round it:

round(sd(babbel$Grammar), 2)

[1] 13.63
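To see what sd() is doing, the sketch below works through the sample standard deviation step by step: find each score's distance from the mean, square those distances, add them up, divide by n - 1, and take the square root. The grammar vector is copied from the output of babbel$Grammar above.

```r
# Grammar scores copied from the output of babbel$Grammar
grammar <- c(
  10, 32, 22, 34, 19, 25, 25, 16,  4, 11, 28, 12, 28,  6, 11, 36, 42, 14,
  34, 36,  4, 44, 33, 39, 37, 48,  7, 11, 31,  6, 30, 48, 11, 19, 13, 43,
   9, 23,  3, 28, 16, 17, 11,  8, 26,  3, 12, 20, 32,  1,  0,  3,  1, 11
)

n <- length(grammar)
deviations <- grammar - mean(grammar)      # distance of each score from the mean
variance <- sum(deviations^2) / (n - 1)    # sample variance
sd_by_hand <- sqrt(variance)               # sample standard deviation

round(sd_by_hand, 2)
# [1] 13.63
```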

So, putting it all together, we could make a table with all of the descriptive statistics for the Grammar test scores. But first, very quickly, it’s important to note the number of people who took the test:

length(babbel$Grammar)

[1] 54

On to a summary table:

Summary of Grammar Test Scores
Test	N	Mean	SD	Median	Min	Max
Grammar	54	20.24	13.63	18	0	48
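You can also have R assemble these values for you. The following is one possible sketch using a base R data frame; the object name grammar_summary is made up for this example, and the grammar vector is copied from the output of babbel$Grammar above.

```r
# Grammar scores copied from the output of babbel$Grammar
grammar <- c(
  10, 32, 22, 34, 19, 25, 25, 16,  4, 11, 28, 12, 28,  6, 11, 36, 42, 14,
  34, 36,  4, 44, 33, 39, 37, 48,  7, 11, 31,  6, 30, 48, 11, 19, 13, 43,
   9, 23,  3, 28, 16, 17, 11,  8, 26,  3, 12, 20, 32,  1,  0,  3,  1, 11
)

# One row, with one column per descriptive statistic
grammar_summary <- data.frame(
  Test = "Grammar",
  N = length(grammar),
  Mean = round(mean(grammar), 2),
  SD = round(sd(grammar), 2),
  Median = median(grammar),
  Min = min(grammar),
  Max = max(grammar)
)

grammar_summary
```

With the real data loaded, you could replace grammar with babbel$Grammar and get the same row.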

Visualizing Test Score Data

To see what the distribution of test scores looks like, it is very useful to create a plot called a histogram. There are two ways to do this in R: a quick way and a pretty way.

First, the quick way, using the hist() function:

hist(babbel$Grammar)

This plot was very easy to make, and it doesn’t look too bad, but the labels are messy. We can make one that looks a bit nicer using the ggplot() function from the ggplot2 package, which is loaded as part of the tidyverse.

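ggplot(data = babbel, aes(x = Grammar)) + # define the data and the variable on the x-axis
  geom_histogram(binwidth = 5) + # draw bars, each covering 5 score points
  labs(x = "Grammar Test Score", y = "Frequency") + # label the axes
  theme_bw() # use a plain black-and-white theme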

What do you notice about this distribution of Grammar test scores?
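One way to check your impression against the numbers is to count how many scores fall in each bin. The sketch below assumes bins of width 5 starting at 0. hist() and geom_histogram() may place their bin edges differently, so the bars in your plots need not match these counts exactly. The grammar vector is copied from the output of babbel$Grammar above.

```r
# Grammar scores copied from the output of babbel$Grammar
grammar <- c(
  10, 32, 22, 34, 19, 25, 25, 16,  4, 11, 28, 12, 28,  6, 11, 36, 42, 14,
  34, 36,  4, 44, 33, 39, 37, 48,  7, 11, 31,  6, 30, 48, 11, 19, 13, 43,
   9, 23,  3, 28, 16, 17, 11,  8, 26,  3, 12, 20, 32,  1,  0,  3,  1, 11
)

# Sort each score into a bin: 0 to 5, above 5 to 10, above 10 to 15, and so on
bins <- cut(grammar, breaks = seq(0, 50, by = 5), include.lowest = TRUE)

# Count the scores in each bin
bin_counts <- table(bins)
bin_counts

# Just the counts, from the lowest bin to the highest
as.vector(bin_counts)
# [1]  8  6 10  6  4  5  6  4  3  2
```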

References

Loewen, S., Isbell, D. R., & Sporn, Z. (2020). The effectiveness of app-based language instruction for developing receptive linguistic knowledge and oral communicative ability. Foreign Language Annals, flan.12454. https://doi.org/10.1111/flan.12454