If you haven’t already, please read the Introduction to this set of lab activities, which includes instructions on where to find the .csv files referenced in each lab.

Workflow and Tidying

I have created each lab in R Markdown, which exports a file as an .html. You will see text, along with R “code chunks” embedded (as below), and sometimes output including plots. You can copy and paste the code into an R script for this lab, along with any notes you find helpful using #. You could use R Markdown to create your own lab file (with text, code, and output), but only if you want to spend the time learning it: see http://rmarkdown.rstudio.com if you’re curious.

Run this code in R yourself:

data("PlantGrowth")    #brings in the built-in data set "PlantGrowth"
summary(PlantGrowth)
##      weight       group   
##  Min.   :3.590   ctrl:10  
##  1st Qu.:4.550   trt1:10  
##  Median :5.155   trt2:10  
##  Mean   :5.073            
##  3rd Qu.:5.530            
##  Max.   :6.310

You should now see “PlantGrowth” as a data.frame in your ‘Environment’ window (30 obs. of 2 variables) and see the summary() of the data.frame in your console, as above.
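As a small sketch of how summary() behaves on a single column rather than the whole data.frame, the $ operator can pull out one variable at a time. This uses only the built-in PlantGrowth data loaded above:

```r
data("PlantGrowth")

# The $ operator extracts one column from a data.frame as a vector
summary(PlantGrowth$weight)   # numeric column: min, quartiles, mean, max
summary(PlantGrowth$group)    # factor column: number of rows per level
```

The same pattern (dataset, then $, then the column name) works for any column in any data.frame or tibble.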

For the first part of the lab, the Whole Game from R for Data Science (Wickham et al. 2023) is a good reference.

Exercise 1

NOTE: in all labs, an Exercise heading followed by numbered questions (as below) marks the material you should work through for practice. I recommend keeping the code in a script, or creating a .Rmd file (R Markdown, see above).

This exercise will be some basic practice with bringing in new data and taking a look at it. First make sure you’ve installed the tidyverse package and run

library(tidyverse)

Installing tidyverse installs all of the associated packages (ggplot2, dplyr, tidyr, readr and others) that you may have installed separately before. Running library(tidyverse) loads only the core packages; other packages installed as part of the tidyverse need to be loaded separately using library().

We’re going to import a dataset from Fogarty (2023). Remember, these files should be provided wherever you’ve accessed this html file.

simd <- read_csv("simd2020.csv")

You should see a warning message, along with some other messages, in your console. You don’t necessarily need to deal with these right away (as long as the data reads in to R) but in this case one of the issues is easy to ‘fix’.

1.1

Follow R’s suggestion to “call problems() on your data frame for details…”

The key information to pay attention to here is in the ‘actual’ column. Go ahead and open the original simd2020.csv file, e.g. in Excel. Over in the ‘Attendance’ column there are some * characters. We’re going to re-import the data but tell R to treat the * as “NA”.

simd <- read_csv("simd2020.csv", na = "*")
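To see what the na argument does without touching the real file, here is a minimal sketch using a made-up three-row CSV supplied as literal text through I(). The column names and values are hypothetical, not taken from simd2020.csv. In a file this small, readr is expected to read the column as text rather than raise a warning, but the effect of na is the same:

```r
library(readr)

# Hypothetical three-row file, supplied as literal text via I()
toy_csv <- I("Data_Zone,Attendance\nA,0.85\nB,*\nC,0.91\n")

# Default: "*" is read as text, so the whole column becomes character
raw <- read_csv(toy_csv, show_col_types = FALSE)
class(raw$Attendance)

# List "*" alongside readr's default missing-value strings ("" and "NA")
clean <- read_csv(toy_csv, na = c("", "NA", "*"), show_col_types = FALSE)
class(clean$Attendance)
sum(is.na(clean$Attendance))   # number of missing values after import
```

Note that na replaces the default set of missing-value strings rather than adding to it, so listing "" and "NA" as well as "*" is one way to keep the defaults if your file might contain them.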

1.2

To make sure this imported the way you wanted, you can always use view() or click on the object in the ‘Environment’ window. For this exercise I want you to run summary() but just on that column (Attendance) and paste in the output (it will show how many NA’s are in that column). If you’re not sure how to do this, please ask (or peek ahead to 1.5, below).

1.3

Run spec(simd). How many variables are considered strings (characters)? What is the other kind of variable in this dataset according to R?

1.4

You can take a look at the values of the variables using the glimpse() function. NOTE, the glimpse() function is the tibble-friendly version of the str() function from base R. Run glimpse(simd) (paste at least 7 rows of output.)

1.5

Now we’re going to use some other useful functions. Sometimes you want to double-check the variable type or values, just from one column (variable) in your dataset. Use the following functions. class(simd$Council_area) returns the variable type, head(simd$Council_area) returns the values of the first few rows, and tail(simd$Council_area) returns the values of the last few rows.
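As an illustration on a different dataset, the same three functions can be tried on the built-in PlantGrowth data. The n argument to head() and tail() is optional and controls how many values are returned:

```r
data("PlantGrowth")

class(PlantGrowth$group)          # variable type of the column
head(PlantGrowth$weight)          # first 6 values by default
tail(PlantGrowth$weight, n = 3)   # last 3 values
```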

1.6

Just as a reminder of how particular R is about exact syntax, run class(simd$Council_Area). Why do you get the Warning message that you do?

1.7

Pick a NUMERIC variable (column) from the simd tibble and use class(), head() and tail().

Now let’s use the count() function from the dplyr package. Along with it we’ll use pipes, a coding concept common in the tidyverse that allows us to link multiple functions together in one block of code. In the code chunk below, note that the first line is just the tibble, simd, that we want R to use with count(). You can think of it as “with this data, do the following…”. We can then use the column name ‘Council_area’ directly, since R knows we’re referring to simd.

simd %>%
  count(Council_area)
## # A tibble: 32 × 2
##    Council_area              n
##    <chr>                 <int>
##  1 Aberdeen City           283
##  2 Aberdeenshire           340
##  3 Angus                   155
##  4 Argyll and Bute         125
##  5 City of Edinburgh       597
##  6 Clackmannanshire         72
##  7 Dumfries and Galloway   201
##  8 Dundee City             188
##  9 East Ayrshire           163
## 10 East Dunbartonshire     130
## # ℹ 22 more rows

As you can see the default output is the first 10 rows - in this case, ordered alphabetically. However, we might want to see it sorted by the value of n, the count.
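Before sorting, it may help to see that a pipe is just another way of writing an ordinary function call: the object on the left of %>% is passed in as the first argument of the function on the right. A minimal sketch using the built-in PlantGrowth data:

```r
library(dplyr)
data("PlantGrowth")

# Piped form: "with PlantGrowth, count the rows in each group"
piped <- PlantGrowth %>% count(group)

# Equivalent direct call: the data is the first argument
direct <- count(PlantGrowth, group)

identical(piped, direct)   # checks that both forms give the same result
```

R for Data Science (2nd edition) writes the pipe as |>, which is built in to R from version 4.1 onward; for code like that in this lab it can be read the same way as %>%.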

1.8

Run the code above but add sort = TRUE within the count() function. If you can’t get this to run, run help(count) and look at one of the starwars %>% examples with sort = TRUE.

As a final activity in this exercise, let’s explore how to subset our data. In base R this is done using brackets [ , ] but with pipes the coding is (for a lot of people) more intuitive. I’m going to run through some ‘simple’ examples and then for practice you can code an original example.

If you want a subset of your data with just certain variables, you can use select():

simd2 <- simd %>% select(Council_area:Income_rate)

This makes a new data.frame (or ‘tibble’), simd2, with the 4 variables (columns) from Council_area to Income_rate.
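select() accepts more than a range of columns. The sketch below shows three common forms using starwars, a small example tibble that ships with dplyr (the same one used in the help(count) examples), so it does not depend on the simd file:

```r
library(dplyr)

# A range of adjacent columns, from name through mass
by_range <- starwars %>% select(name:mass)

# Specific columns listed by name
by_name <- starwars %>% select(name, homeworld)

# Everything except the columns marked with a minus sign
dropped <- starwars %>% select(-films, -vehicles, -starships)

names(by_range)
```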

More often, you might want to create a subset based on certain variable values (i.e. selecting rows from within your data). We do this by using the filter() function (either by itself or with the select() function).

To subset the data to only the rows from Glasgow City, use the following. Note, we need to specify == and not = in the code.

simd3 <- simd %>% filter(Council_area=="Glasgow City")

If we want to use select() first and then filter(), the variable we are filtering on needs to be included in the selected variables. If filter() comes first, the filtering variable can be left out of the later select().
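The order of the two steps is what matters here. As a sketch with the starwars tibble from dplyr (not the simd data), both of the following return the same rows, but the second has to carry the filtering variable through select():

```r
library(dplyr)

# Filter first, then select: homeworld can be dropped afterwards
tatooine <- starwars %>%
  filter(homeworld == "Tatooine") %>%
  select(name, height)

# Select first, then filter: homeworld must be kept by select()
tatooine2 <- starwars %>%
  select(name, height, homeworld) %>%
  filter(homeworld == "Tatooine")

names(tatooine)
names(tatooine2)
```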

You can combine criteria in the filter() function and filter by numeric variables as well (the following code is not complete since it just has the filter() lines):

filter(Council_area == "Glasgow City" & Intermediate_Zone == "Hillhead")

OR

filter(Employment_rate >= .25)

OR

filter(Employment_rate >= .25 & Council_area == "Glasgow City")
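For reference, here is what complete, runnable versions of combined filters can look like. This is a sketch on the starwars tibble from dplyr rather than simd, so the column names differ from those in the lab data:

```r
library(dplyr)

# & means both conditions must hold
tall_tatooine <- starwars %>%
  filter(homeworld == "Tatooine" & height >= 180)

# Separating conditions with a comma is equivalent to &
same_rows <- starwars %>%
  filter(homeworld == "Tatooine", height >= 180)

# | means either condition may hold
either <- starwars %>%
  filter(homeworld == "Tatooine" | height >= 180)

nrow(tall_tatooine)
nrow(either)
```

Rows where a condition evaluates to NA (for example, a missing height) are dropped by filter(), which is worth remembering when a numeric column contains missing values.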

1.9

Come up with your own subsetting code using both select() and filter().

More tidying, and using ggplot2

Exercise 2

If you’re not already familiar with the ‘tidyverse’ in R, this will be a further introduction to it. Go to Chapter 5, Data tidying, of R for Data Science (Wickham et al. 2023) and read 5.1 (Introduction) through 5.5 (Summary). Again, you don’t need to run the code from that section unless you want to for practice.

We’re going to bring in a different dataset, the number of Covid-19 cases by NHS Scotland health board, to demonstrate data pivoting in the tidyverse (Fogarty 2023).

covid <- read_csv("covid_totals.csv")
## Rows: 14 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): health_board
## dbl (2): 2020, 2021
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(As this output shows, there are no warning messages for this data, just the default column specification information about the variables imported.)

2.1

Following the example of using pivot_longer() from R4DS Section 5.3.1 (Data in column names), try to tidy this data and create a new tibble named ‘covid1’ (the code is below if you can’t get it to work). How many rows (observations) did the original ‘covid’ tibble have? How many does ‘covid1’ have?

There is a LOT more you can do with pivot_longer() as described in the rest of Section 5.3 (Lengthening data), but we’re going to move on for now.

covid1 <- covid %>%
  pivot_longer(c(`2020`,`2021`), names_to = "year", values_to = "cases")
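To make the reshaping easier to follow, here is a sketch on a tiny made-up table. The board names and counts are hypothetical, not real data. It also shows the optional names_transform argument, which converts the new year column from text to integer; without it, year is stored as character because it comes from column names:

```r
library(dplyr)
library(tidyr)

# Hypothetical counts for two made-up boards, not real data
toy <- tibble(
  health_board = c("Board A", "Board B"),
  `2020` = c(10, 20),
  `2021` = c(30, 40)
)

toy_long <- toy %>%
  pivot_longer(
    c(`2020`, `2021`),
    names_to = "year",
    values_to = "cases",
    names_transform = list(year = as.integer)  # optional: store year as integer
  )

dim(toy)        # rows and columns before pivoting
dim(toy_long)   # rows and columns after pivoting
```

Each original row becomes one row per year column, so the values that were spread across two columns end up stacked in a single cases column.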

2.2

We’ll use the ‘longer’ covid1 data you just created, to do some visualizations. Following the model of ‘a really basic boxplot’ in the R Graph Gallery, create this plot yourself with the following code:

covid1 %>%    #using the pipes coding again!
  ggplot(aes(x=as.factor(year), y=cases)) + 
  geom_boxplot(fill="slateblue", alpha=0.2) + 
  ylab("COVID cases") +
  xlab("year")

2.3

Yes, this is a boxplot of the cases of COVID-19 in these Scottish health board areas, but it’s not particularly revealing. You may have noticed the Violin Plot example too, so we can try that here:

covid1 %>%
  ggplot(aes(x=as.factor(year), y=cases)) + 
  geom_violin(fill="slateblue", alpha=0.2) + 
  ylab("COVID cases") +
  xlab("year")

Another alternative visualization: a line plot to show change, which links each data point by the ‘health board’.

# Visualize changes over time
covid1 %>%
  ggplot(aes(x = as.factor(year), y = cases)) +
  geom_line(aes(group = health_board), color = "grey50") +
  ylab("COVID cases") +
  xlab("YEAR")
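The group aesthetic is what makes this plot work. As a sketch on made-up data (the board names and counts are hypothetical), group = health_board tells geom_line() to draw one line per board across the two years:

```r
library(ggplot2)

# Hypothetical long-format data for two made-up boards, not real data
toy_long <- data.frame(
  health_board = rep(c("Board A", "Board B"), each = 2),
  year = rep(c("2020", "2021"), times = 2),
  cases = c(10, 30, 20, 40)
)

# group = health_board connects each board's 2020 and 2021 values
p <- ggplot(toy_long, aes(x = as.factor(year), y = cases)) +
  geom_line(aes(group = health_board), color = "grey50")

p
```

Without the group aesthetic, ggplot2 groups the data by the discrete x variable, so each group contains a single point per line and nothing is connected.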

2.4

Much more than the other plots, this shows that the 2021 COVID cases, by Scottish health board, tend to be higher than the 2020 cases (no surprise!). However it’s not all that exciting to look at. Go to the line plot page on the data-to-viz site and pick some way to add to/revise this plot. It could be symbols for the data, different titles, changes to the values on the y-axis, more interesting colors… just experiment a little!

Congratulations, you’ve completed Lab 1!

Citations

Fogarty, B. J. 2023. Quantitative social science data with R: An introduction. 2nd edition. Sage Publications.
Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund. 2023. R for data science. 2nd edition. O’Reilly Media, Inc.