If you haven’t already, please read the Introduction to this
set of lab activities, which includes instructions on where to find the
.csv files referenced in each lab.
I have created each lab in R Markdown, which exports a file as an
.html. You will see text, along with R “code chunks”
embedded (as below), and sometimes output including plots. You can copy
and paste the code into an R script for this lab, along with any notes
you find helpful using #. You could use R Markdown
to create your own lab file (with text, code, and output), but only if
you want to spend the time learning it: see http://rmarkdown.rstudio.com if you’re curious.
Run this code in R yourself:
data("PlantGrowth") #brings in the built-in data set "PlantGrowth"
summary(PlantGrowth)
##      weight       group   
##  Min.   :3.590   ctrl:10  
##  1st Qu.:4.550   trt1:10  
##  Median :5.155   trt2:10  
##  Mean   :5.073            
##  3rd Qu.:5.530            
##  Max.   :6.310            
You should now see “PlantGrowth” as a data.frame in your
‘Environment’ window (30 obs. of 2 variables) and see the
summary() of the data.frame in your console, as above.
For the first part of the lab, the Whole Game from R for Data Science (Wickham et al. 2023) is a good reference.
NOTE, for all labs the Exercise heading with
numbered questions (below) indicates the material in the lab you should
work on for practice. I recommend keeping the code in a script, or
creating a .Rmd file (R Markdown, see above).
This exercise will be some basic practice with bringing in new data and taking a look at it. First make sure you’ve installed the tidyverse package and run
library(tidyverse)
Installing tidyverse installs all of the associated
packages ggplot2, dplyr, tidyr,
readr and others that you may have installed separately
before. In addition there are other packages used as part of the
tidyverse that would need to be loaded separately using
library().
We’re going to import a dataset from (Fogarty 2023) - remember these files should be provided wherever you’ve accessed this html file.
simd <- read_csv("simd2020.csv")
You should see a Warning message, along with some other messages, in your console. You don’t necessarily need to deal with these right away (as long as the data reads into R), but in this case one of the issues is easy to ‘fix’.
Follow R’s suggestion to “call problems() on your data
frame for details…”
The key information to pay attention to here is in the ‘actual’ column. Go ahead and open the original simd2020.csv file, e.g. in Excel. Over in the ‘Attendance’ column there are some * characters. We’re going to re-import the data but tell R to treat the * as “NA”.
simd <- read_csv("simd2020.csv", na = "*")
To make sure this imported the way you wanted, you can always use
view() or click on the object in the ‘Environment’ window.
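As a minimal sketch of what the na argument does, here is the same idea on a tiny made-up data set rather than the real SIMD file. The values and the names csv_text, raw_demo and clean_demo are invented for illustration, and I() for literal data assumes readr 2.0 or later. Unlike the real file, this toy example is too short to trigger a parsing warning; without na = "*" the column is simply read as text.

```r
library(readr)

# Made-up miniature of the kind of file described above; not the real SIMD data
csv_text <- "Council_area,Attendance\nArea A,0.81\nArea B,*\nArea C,0.77\n"

# I() tells read_csv() to treat the string as literal data rather than a file path
raw_demo   <- read_csv(I(csv_text), show_col_types = FALSE)
clean_demo <- read_csv(I(csv_text), na = "*", show_col_types = FALSE)

class(raw_demo$Attendance)         # the "*" means the column is read as text
class(clean_demo$Attendance)       # with na = "*" the column is numeric
sum(is.na(clean_demo$Attendance))  # how many values became NA
```

Note that na = "*" replaces readr's default set of missing-value strings (empty strings and "NA"); if a file used both, na = c("", "NA", "*") would keep the defaults as well.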
For this exercise I want you to run summary() but just on
that column (Attendance) and paste in the output (it will show how many
NA’s are in that column). If you’re not sure how to do this, please ask
(or peek ahead to #5, below).
Run spec(simd). How many variables are considered
strings (characters)? What is the other kind of variable in this dataset
according to R?
You can take a look at the values of the variables using the
glimpse() function. NOTE, the glimpse()
function is the tibble-friendly version of the str()
function from base R. Run glimpse(simd) (paste at least 7
rows of output.)
Now we’re going to use some other useful functions. Sometimes you
want to double-check the variable type or values, just from one column
(variable) in your dataset. Use the following functions.
class(simd$Council_area) returns the variable type,
head(simd$Council_area) returns the values of the first few
rows, and tail(simd$Council_area) returns the values of the
last few rows.
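If it helps to see these three functions on data you already have loaded, here is a small sketch using the built-in PlantGrowth data from the start of the lab instead of simd. The n argument in the last line is an optional extra, not something the exercise requires.

```r
data("PlantGrowth")

class(PlantGrowth$weight)       # variable type of one column
head(PlantGrowth$weight)        # values from the first few rows
tail(PlantGrowth$weight)        # values from the last few rows
head(PlantGrowth$group, n = 3)  # n changes how many values are shown
```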
Just as a reminder of how particular R is to exact syntax, run
class(simd$Council_Area) - why do you get the Warning message that you do?
Pick a NUMERIC variable (column) from the simd tibble
and use class(), head() and
tail().
Now let’s use the count() function from the
dplyr package. As part of the tidyverse, we’ll use a coding
concept known as pipes, which allows us to link
multiple functions together in one block of code. In the code chunk
below, note that the first line is just the tibble, simd,
we want R to use with count(). You can think of it as “with
this data, do the following…”. And then we just use the column name
‘Council_area’ directly, since R knows we’re referring to
simd.
simd %>%
count(Council_area)
## # A tibble: 32 × 2
##    Council_area              n
##    <chr>                 <int>
##  1 Aberdeen City           283
##  2 Aberdeenshire           340
##  3 Angus                   155
##  4 Argyll and Bute         125
##  5 City of Edinburgh       597
##  6 Clackmannanshire         72
##  7 Dumfries and Galloway   201
##  8 Dundee City             188
##  9 East Ayrshire           163
## 10 East Dunbartonshire     130
## # ℹ 22 more rows
As you can see the default output is the first 10 rows - in this case, ordered alphabetically. However, we might want to see it sorted by the value of n, the count.
Run the code above but add sort = TRUE within the
count() function. If you can’t get this to run, run
help(count) and look at one of the
starwars %>% examples with sort = TRUE.
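As a side note on the pipe used with count() above: the pipe just passes whatever is on its left in as the first argument of the function on its right. A quick sketch with the built-in PlantGrowth data (the object names piped and nested are only for illustration) shows the two ways of writing the same call:

```r
library(dplyr)

data("PlantGrowth")

piped  <- PlantGrowth %>% count(group)  # "with this data, count by group"
nested <- count(PlantGrowth, group)     # the same call written without the pipe

identical(piped, nested)  # checks whether the two results match exactly
```

The piped form starts to pay off once several steps are chained, since the code then reads top to bottom in the order the steps happen.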
As a final activity in this exercise, let’s explore how to subset our data. In base R this is done using brackets [ , ] but with pipes the coding is (for a lot of people) more intuitive. I’m going to run through some ‘simple’ examples and then for practice you can code an original example.
If you want a subset of your data with just certain variables, you
can use select():
simd2 <- simd %>% select(Council_area:Income_rate)
Which makes a new data.frame (or ‘tibble’), simd2, with
the 4 variables (columns) from Council_area to Income_rate.
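To see the first:last shorthand on data that ships with dplyr, here is a sketch using the starwars tibble (the same one used in the help(count) examples). The object names sw_range and sw_named are made up for this illustration.

```r
library(dplyr)

# The colon selects every column from 'name' through 'mass', in file order
sw_range <- starwars %>% select(name:mass)
names(sw_range)

# Naming the columns one by one gives the same three columns
sw_named <- starwars %>% select(name, height, mass)
names(sw_named)
```

The range form is handy when the columns you want sit next to each other; naming them individually is safer if the column order might change.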
More often, you might want to create a subset based on certain
variable values (i.e. selecting rows from within your data). We do this
by using the filter() function (either by itself or with
the select() function).
To subset the data to only the rows from Glasgow City, use the following. Note that we need to specify == and not = in the code.
simd3 <- simd %>% filter(Council_area=="Glasgow City")
If we want to filter AND use the select() function, the order
matters: if select() runs first, the variable we are filtering on
needs to be included in the selected variables. If filter() runs
first, it does not.
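The point about combining filter() and select() can be sketched with the built-in PlantGrowth data. This is an illustration only; the names ok and bad are invented, and tryCatch() is used just so the failing version does not stop the script.

```r
library(dplyr)

data("PlantGrowth")

# filter() first: 'group' is still available, then we keep only 'weight'
ok <- PlantGrowth %>%
  filter(group == "trt1") %>%
  select(weight)

# select() first: 'group' has been dropped, so filter() cannot find it
bad <- tryCatch(
  PlantGrowth %>% select(weight) %>% filter(group == "trt1"),
  error = function(e) e
)

nrow(ok)                # rows kept by the working version
inherits(bad, "error")  # whether the second version raised an error
```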
You can combine criteria in the filter() function and
filter by numeric variables as well (the following code is not complete
since it just has the filter() lines):
filter(Council_area=="Glasgow City" & Intermediate_Zone=="Hillhead")
OR
filter(Employment_rate >= .25)
OR
filter(Employment_rate >= .25 & Council_area=="Glasgow City")
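For a complete, runnable version of this pattern, here is a sketch on the built-in PlantGrowth data rather than simd. The cut-off of 5.5 and the object names are arbitrary choices for illustration. It also shows | ("or"), which keeps a row when either condition is true.

```r
library(dplyr)

data("PlantGrowth")

# Both conditions must hold (&)
heavy_trt2 <- PlantGrowth %>%
  filter(weight >= 5.5 & group == "trt2")

# Either condition may hold (|)
heavy_or_ctrl <- PlantGrowth %>%
  filter(weight >= 5.5 | group == "ctrl")

nrow(heavy_trt2)
nrow(heavy_or_ctrl)
```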
Come up with your own subsetting code using both
select() and filter().
If you’re not already familiar with the ‘tidyverse’ in R, this will be a further introduction to it. Go to 5 Data tidying and read 5.1 (Introduction) through 5.5 (Summary). Again, you don’t need to run the code from that section unless you want to for practice.
We’re going to bring in a different dataset, the number of Covid-19 cases by NHS Scotland health board, to demonstrate data pivoting in the tidyverse (Fogarty 2023).
covid <- read_csv("covid_totals.csv")
## Rows: 14 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): health_board
## dbl (2): 2020, 2021
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(As this output shows, there are no warning messages for this data,
just the default spec() information about the variables
imported.)
Following the example of using pivot_longer() from R4DS
5.3.1 (Data in column names), try to tidy this data and create a
new tibble named ‘covid1’ (the code is below if you can’t get it to
work). How many rows (observations) did the original ‘covid’ tibble
have? How many does ‘covid1’ have?
There is a LOT more you can do with pivot_longer(), as
described in the rest of Section 5.3 (Lengthening Data),
but we’re going to move on for now.
covid1 <- covid %>%
pivot_longer(c(`2020`,`2021`), names_to = "year", values_to = "cases")
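To see what the pivot does to the shape of a table, here is a sketch on a tiny made-up tibble. The board names and case numbers are invented and are not taken from covid_totals.csv; toy and toy_long are illustrative names.

```r
library(dplyr)
library(tidyr)

# Made-up numbers for three hypothetical boards; not the real covid data
toy <- tibble(
  health_board = c("Board A", "Board B", "Board C"),
  `2020` = c(10, 20, 30),
  `2021` = c(15, 25, 35)
)

toy_long <- toy %>%
  pivot_longer(c(`2020`, `2021`), names_to = "year", values_to = "cases")

dim(toy)              # rows and columns before pivoting
dim(toy_long)         # rows and columns after pivoting
class(toy_long$year)  # the old column names are stored as text
```

Each original row is repeated once per pivoted column, and because the years started life as column names they arrive as character values.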
We’ll use the ‘longer’ covid1 data you just created, to do some visualizations. Following the model of ‘a really basic boxplot’ in the R Graph Gallery, create this plot yourself with the following code:
covid1 %>% #using the pipes coding again!
ggplot(aes(x=as.factor(year), y=cases)) +
geom_boxplot(fill="slateblue", alpha=0.2) +
ylab("COVID cases") +
xlab("year")
Yes this is a boxplot of the cases of COVID-19 in these Scottish health board areas, but it’s not particularly revealing. You may have noticed the Violin Plot example too so we can try that here:
covid1 %>%
ggplot(aes(x=as.factor(year), y=cases)) +
geom_violin(fill="slateblue", alpha=0.2) +
ylab("COVID cases") +
xlab("year")
Another alternative visualization: a line plot to show change, which links each data point by the ‘health board’.
# Visualize changes over time
covid1 %>%
ggplot(aes(x = as.factor(year), y = cases)) +
geom_line(aes(group = health_board), color = "grey50") +
ylab("COVID cases") +
xlab("year")
Much more than the other plots, this shows that the 2021 COVID cases, by Scottish health board, tend to be higher than the 2020 cases (no surprise!). However it’s not all that exciting to look at. Go to the line plot page on the data-to-viz site and pick some way to add to/revise this plot. It could be symbols for the data, different titles, changes to the values on the y-axis, more interesting colors… just experiment a little!
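If you want a starting point, here is one possible sketch, using made-up data in the same long format as covid1 so that it runs on its own. The numbers, board names, title and theme are all arbitrary choices, and this is only one of many ways to revise the plot.

```r
library(ggplot2)

# Made-up long-format data standing in for covid1
toy_long <- data.frame(
  health_board = rep(c("Board A", "Board B", "Board C"), each = 2),
  year  = rep(c("2020", "2021"), times = 3),
  cases = c(10, 15, 20, 25, 30, 35)
)

p <- ggplot(toy_long, aes(x = as.factor(year), y = cases, group = health_board)) +
  geom_line(color = "grey50") +                      # one line per health board
  geom_point(aes(color = health_board), size = 3) +  # a coloured symbol at each data point
  labs(title = "Cases by health board (toy data)",
       x = "Year", y = "COVID cases", color = "Health board") +
  theme_minimal()

p  # printing the object draws the plot
```

To try it on the real data, swap toy_long for covid1; with 14 health boards the colour legend gets long, so you may prefer to drop the color mapping or the legend.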
Congratulations, you’ve completed Lab 1!