Quick refresher on the most important parts from the last session:
To access RStudio on LJMU computers, go to the Application Player and search “RStudio”. This course is designed for the RStudio 64 with the circular logo.
On your own computer, you can download RStudio from its downloads page; remember that it is completely free for anyone to use.
Often, RStudio will keep your previous session saved, which means that you will probably open the script we created last week.
Code can be stored in the Script pane and run in the Console pane. If we want to run code from the Script pane, we highlight it and press Ctrl+Enter (Cmd+Enter on Mac). This will grab the code from the Script and put it in the Console.
The Workspace pane displays all of our variables and the Viewer pane displays plots and help files.
Help files can be opened by typing ? followed by the function name, e.g. ?mean, into the console, and the help file will pop up in the Viewer pane. The Maths, Stats & IT Team also run frequent drop-in sessions and one-to-ones on the Library Calendar.
Vectors are created with the c() function and can be stored in a variable using the assignment (<-) operator, as in x <- c(1,2,3). We can also chain together the c() function, as in y <- c(x,4,5).
x <- c("Hello","World") will create a vector of characters, y <- x == "Hello" will create a vector of logicals. If different data types are passed into the vector-creating function, R will turn them into the same type: x <- c("Hello", "World", 1,2,3) will be a character vector.
We can use [ and ] to pull out subsets of our data, e.g. x[3]. Subsets can be extracted using numeric vectors x[2:3], logical ones x[c(T,T,F,T,F)], or by telling R which ones we don’t want x[-3].
Subsetting and assignment can be combined to change single elements in a vector: x[3] <- "To"
We can use the data() function to load up one of the built-in R datasets, e.g. data(iris)
We can describe our data using functions such as mean(), sd() and table()
We can plot data in a basic way using plot()
Data is distributed according to a probability density function (pdf) and a cumulative distribution function (cdf)
We can use functions related to these distributions with the dxxx(), pxxx(), qxxx() and rxxx() family of functions
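As a quick reminder of how these pieces fit together, here is a minimal sketch using only base R. The values are made up for illustration, and the normal distribution is just one example of the dxxx()/pxxx()/qxxx()/rxxx() family.

```r
# Create a vector and extend it by chaining c()
x <- c(1, 2, 3)
y <- c(x, 4, 5)

# Subset by position, by range, by logical vector, and by exclusion
y[3]                                   # third element
y[2:3]                                 # second and third elements
y[c(TRUE, TRUE, FALSE, TRUE, FALSE)]   # keep the elements marked TRUE
y[-3]                                  # everything except the third element

# Combine subsetting and assignment to change a single element
y[3] <- 30

# Mixing data types coerces everything to character
z <- c("Hello", "World", 1, 2, 3)
class(z)

# The d/p/q/r functions for the normal distribution
dnorm(0)     # density at 0
pnorm(0)     # cumulative probability up to 0
qnorm(0.5)   # quantile for probability 0.5
rnorm(3)     # three random draws
```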
R takes a modular approach to statistical analysis: it is open-source and allows the addition of packages. This means that the version of R you download is the basic version, and we can download packages which contain more advanced functions. Think of it like DLC, but for statistics (and, much like R, it is still free!). These packages are built by other R users and can be downloaded from a repository called CRAN (the Comprehensive R Archive Network).
The first package we’re going to download and install is the tidyverse package (which actually imports/installs a bunch of other packages).
The command to install a new package is:
install.packages("tidyverse",lock=F)
This function will search CRAN for the package called "tidyverse" and install it into our own package library (located in your Documents folder). Note that the name needs to be in quotes, as it is a string that R will search for. It also downloads and installs all the other packages that tidyverse uses. As it’s installing, you’ll see it run through all these packages (e.g. broom, ggplot2, tidyr); we’ll get to these later.
If you’re running this on your home computer and have full admin rights, then you can omit the lock=F argument. However, on University-run computers, we can hit annoying errors without this argument.
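Since a package only needs installing once, it can be handy to check whether it is already there before reinstalling. The sketch below is one possible way to do that; needs_install() is a hypothetical helper written for this example (it is not part of R), built on the base function requireNamespace().

```r
# Hypothetical helper: TRUE if the package is NOT installed yet
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Tell us what to do, without installing anything automatically
if (needs_install("tidyverse")) {
  message("tidyverse is not installed yet: run install.packages(\"tidyverse\")")
} else {
  message("tidyverse is already installed")
}
```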
You can see the currently installed packages in the Viewer pane under the Packages tab. If you click on the name of a package in here, it will bring up a Help file (still in the Viewer pane) for that package, including a list of functions contained in that package (which link to their Help files). Some packages include links to vignettes/documentation regarding the package.
As well as storing the actual package files, CRAN also stores documentation about the package. This includes the Help files that get installed along with the packages. CRAN homepages for packages are standardised and all look the same with (roughly) the same information.
Let’s have a look at the CRAN page for the tidyverse.
Most of what we see here is for fairly advanced R users, so we’re going to ignore a lot of it. At the top, you can see the name of the package and a quick blurb about what it’s for. There are also a bunch of links to other packages that the current package uses. Under the Downloads heading, you can see the Reference manual and the Vignettes.
The Reference manual for a package contains a list of all of the functions that are in that package, and information about their use. This is the same information that is contained in the ? Help files that are downloaded into R. This can be useful if you can’t remember the name of a function, but you know it’s in a certain package.
The Vignettes are more of a user guide. They’re usually written by the package author and contain information on how to use the package, kind of like an instruction manual. Packages don’t always have vignettes, but when they do, I highly recommend reading them.
So, we’ve installed a package, but we’re not actually using it. Once a package has been installed it is stored on our system and we only need to do this once. But, we also need to load the package into our current R session. (Note that this time, we don’t need quotation marks)
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
We now have access to all of the functions contained within the core bundle of tidyverse packages (listed in the console under Attaching packages). This means that by installing and loading the tidyverse package, we’ve actually installed and loaded these other packages (e.g. ggplot2 and dplyr). Let’s get started!
One of the most important changes that comes with the tidyverse is the addition of the Pipe: %>%. It’s a new operator that originates in the magrittr package and is made available when we load dplyr, and it allows us to completely change how we use functions. It can make our code much easier to read, particularly when writing complicated series of functions. For example:
#Written as operators:
x <- (((2+1)*4)/6)
#Written as functions (these come with the magrittr package, so run library(magrittr) first):
x <- divide_by(multiply_by(add(2,1),4),6)
#Spread over multiple lines
x <- 2
x <- add(x,1)
x <- multiply_by(x,4)
x <- divide_by(x,6)
#Written with pipes:
x <- 2 %>%
add(1) %>%
multiply_by(4) %>%
divide_by(6)
The functions in the second version perform the same actions as the operators in the first (e.g. multiply_by(a,b) does the same as a*b). If this is all we’re doing, using these functions would seem silly (just use the operators). But if we’re using more complicated functions (which we will learn about soon), the pipe version is much neater and easier to read.
In the single-line version, we are working from the inside out. This gets easier to read in the version where we spread it across multiple lines, but it’s prone to mistakes: what if we type z instead of x in a function call or an assignment? The piped version gets rid of this risk and of the duplicated code (x <- on every line). I’ve chosen to spread my command over multiple lines and to indent each new line (RStudio actually does this automatically for you), which makes my code look neater and much easier to read.
What the pipe does is allow us to pipe our answer into the next function and chain them together. This changes our syntax a little bit:
f(a,b,c,d)
a %>% f(b,c,d)
The shortcut to quickly enter this in RStudio is Ctrl/Cmd + Shift + M.
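To see that piping really does give the same answer as nesting, here is a small sketch using two ordinary base R functions, mean() and round(). The scores vector is made up for this example.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# Made-up numbers, just for illustration
scores <- c(2.5, 3.5, 7)

# Nested version: read from the inside out
nested <- round(mean(scores), 1)

# Piped version: read from top to bottom
piped <- scores %>%
  mean() %>%
  round(1)

# Both approaches give exactly the same result
identical(nested, piped)
```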
I’ve uploaded a messy dataset online for us to use today. You can either download it and save it somewhere in your system, or you can load it straight from the URL. It’s in a csv format, so we use the following command:
dat <- read_csv("https://raw.githubusercontent.com/MyKo101/RJunk/master/data_anthro.csv")
## Parsed with column specification:
## cols(
## ID = col_character(),
## DOB = col_character(),
## Gender = col_character(),
## Ht = col_character(),
## Wgt = col_character()
## )
This read_csv() function is contained in the readr package. There is a built in function in R that allows us to bring in csv files, but it’s messier. We’re in the tidyverse now!
This has given a bit of an output. read_csv() tries to guess what data types the columns/variables within this dataset are. We can suppress this message with the argument col_types = cols(), or we can define what we want them to be by passing a compact string with one letter per column (d = double, c = character, l = logical), e.g. col_types = "ccccc".
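Here is a minimal sketch of setting the column types ourselves. To keep it self-contained, it reads a tiny made-up CSV from a string rather than from a file (read_csv() treats a string containing a new line as the data itself); csv_text and demo are names invented for this example.

```r
library(readr)

# A tiny made-up CSV with two columns, stored as text
csv_text <- "ID,Score\n001,4.5\n002,3.0\n"

# One letter per column: c = character, d = double
demo <- read_csv(csv_text, col_types = "cd")
demo
```

Because ID was read as a character, the leading zeros in "001" are kept, just like in our anthropometric data.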
Let’s have a look at this data. Previously, we needed to use the head() function to just look at the top few rows. But that was when we were using the data.frame structure from base R. read_csv() loads the data and stores it in a tibble format. This is essentially the same as a data.frame (and can be used in the same way), but provides a bit more information and consistency and only prints the first few rows:
dat
## # A tibble: 150 x 5
## ID DOB Gender Ht Wgt
## <chr> <chr> <chr> <chr> <chr>
## 1 001 22/03/1965 m 5 6 7 1
## 2 002 17/04/1977 f 5 9 11 0
## 3 003 24/07/1966 f 5 4 9 8
## 4 004 22/02/1985 Female 5 7 11 1
## 5 005 30/07/1985 M 6 0 10 10
## 6 006 09/07/1987 male 6 2 11 12
## 7 007 23/01/1995 female 5 10 10 2
## 8 008 31/12/1992 M 6 3 12 1
## 9 009 24/01/1952 Male 5 4 7 10
## 10 010 11/06/1953 F 5 2 8 6
## # … with 140 more rows
R now tells us that the tibble has 150 rows and 5 columns. As well as giving us the names of the variables in the data, it also tells us their types (e.g. <dbl> or <chr>), and at the bottom we’re told that there are 140 more rows. If we had a lot of columns in this data, R would only show us the first few columns and tell us at the bottom what the extra columns are (names and data types).
This data is quite messy. There are some things we need to fix. First thing to do would be to list out everything we need to fix.
We could edit this data in Excel and go through one cell at a time, but if we have a big data set, this is going to be difficult. Let’s get R to do it for us.
The dplyr package provides us with the mutate() function, which is incredibly useful for editing data as we go, and is designed to be used with the pipe. Let’s use it, along with the as.numeric() function to convert the ID variable into a number.
dat %>% mutate(ID = as.numeric(ID))
## # A tibble: 150 x 5
## ID DOB Gender Ht Wgt
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 22/03/1965 m 5 6 7 1
## 2 2 17/04/1977 f 5 9 11 0
## 3 3 24/07/1966 f 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 M 6 0 10 10
## 6 6 09/07/1987 male 6 2 11 12
## 7 7 23/01/1995 female 5 10 10 2
## 8 8 31/12/1992 M 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 F 5 2 8 6
## # … with 140 more rows
Exactly as we needed, ID is now a number (double). The mutate() function takes in a tibble as its first argument (remember, this is what the pipe is doing) and changes it depending on whatever other information is passed to it. We can edit more than one variable at a time this way and create new variables. It’s also worth mentioning that R performs these mutations in order, so any changes we make at the start of the mutate() command can be used later.
dat %>%
mutate(x = runif(150), #Remember this function?
y = runif(150),
x = x + y)
## # A tibble: 150 x 7
## ID DOB Gender Ht Wgt x y
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 001 22/03/1965 m 5 6 7 1 1.18 0.253
## 2 002 17/04/1977 f 5 9 11 0 0.610 0.0839
## 3 003 24/07/1966 f 5 4 9 8 0.870 0.0759
## 4 004 22/02/1985 Female 5 7 11 1 1.54 0.756
## 5 005 30/07/1985 M 6 0 10 10 1.39 0.616
## 6 006 09/07/1987 male 6 2 11 12 1.30 0.868
## 7 007 23/01/1995 female 5 10 10 2 1.22 0.943
## 8 008 31/12/1992 M 6 3 12 1 0.323 0.127
## 9 009 24/01/1952 Male 5 4 7 10 1.14 0.639
## 10 010 11/06/1953 F 5 2 8 6 1.09 0.279
## # … with 140 more rows
Note that this is the same as running mutate(dat,x=runif(150),y=runif(150),x=x+y).
Right now, we’ve not stored the output from the previous mutate() call, dat is still as it was when we loaded it with the ID variable stored as a character.
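This point catches a lot of people out, so here is a small sketch of it. The toy tibble is made up for this example and stands in for dat.

```r
library(dplyr)

# Made-up toy data, standing in for dat
toy <- tibble(ID = c("001", "002", "003"))

# This prints a mutated copy, but toy itself is left unchanged
toy %>% mutate(ID = as.numeric(ID))
class(toy$ID)

# Assigning the result stores it; later arguments can use earlier ones
toy_clean <- toy %>%
  mutate(ID = as.numeric(ID),
         ID_doubled = ID * 2)
toy_clean
```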
For now, let’s clear out our Script pane to keep ourselves organised, and eliminate the code that we don’t need. Copy this into your Script pane and run it all (Ctrl/Cmd + A will select all the code in the Script pane and Ctrl/Cmd + Enter will run it all)
rm(list=ls())
#Load up my libraries
library(tidyverse)
#Load up my data
dat <- read_csv("https://raw.githubusercontent.com/MyKo101/RJunk/master/data_anthro.csv")
#Edit the dat file
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) #Turn ID into numeric
dat_clean
I’ve added a function as the first line, rm(list=ls()), which will clear out our workspace and get rid of any variables that have been created. I’ve also added annotations with the # symbol. R won’t run anything on a line that comes after the # symbol (and it’ll even colour code it as well). This allows you to make notes (either to yourself or others) explaining what your code is actually doing. This isn’t always necessary if it’s obvious, but it can be useful to keep track of things.
Annotations are voluntary. We don’t have to put them in there, but they really help when you’re reading over code that you’ve not used in a while, or you’re sending it to someone else. It’s much easier to annotate and make your code as readable as possible than it is to stand there explaining your code to someone who has never seen it before.
The dplyr package also provides us with the rename function, which allows us to rename some of the variables. This can make it easier to interpret what the variables are. Coming up with clear and concise names is very good practice for budding coders.
We’re going to add this to our dat editing call
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) #Rename some variables
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 22/03/1965 m 5 6 7 1
## 2 2 17/04/1977 f 5 9 11 0
## 3 3 24/07/1966 f 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 M 6 0 10 10
## 6 6 09/07/1987 male 6 2 11 12
## 7 7 23/01/1995 female 5 10 10 2
## 8 8 31/12/1992 M 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 F 5 2 8 6
## # … with 140 more rows
The syntax here is just new-name = old-name. Now the variables Ht and Wgt are completely gone from dat_clean, and we have new variables Height and Weight (the original dat is unchanged).
That’s two out of our five problems solved.
The Gender variable is still a bit of a mess. Let’s take a look. This time, we’re just going to put this in the console.
table(dat$Gender) #Remember the table() function from last week?
##
## f F female Female m M male
## 13 26 18 18 12 19 22
## Male Non-Binary
## 18 4
We as humans can tell what each of these entries means. Some of them mean Male and some mean Female. But R can’t discern that. We’re gonna lump them together by converting all the relevant values into “Male” or “Female”. We can do this using the recode() function inside the mutate() function.
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = Gender %>% recode(m = "Male", male = "Male", M = "Male", #Tidy up Gender
f = "Female", female = "Female", F = "Female"))
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 22/03/1965 Male 5 6 7 1
## 2 2 17/04/1977 Female 5 9 11 0
## 3 3 24/07/1966 Female 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 Male 6 0 10 10
## 6 6 09/07/1987 Male 6 2 11 12
## 7 7 23/01/1995 Female 5 10 10 2
## 8 8 31/12/1992 Male 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 Female 5 2 8 6
## # … with 140 more rows
The first argument to recode() is actually the vector that we are recoding, Gender, which here we’ve piped in (it’s the same as writing recode(Gender, m = "Male", ...)). It’s still pretty neat as it is, thanks to the indentation. Confusingly, this is written backwards compared to the rename() function, so we need old-value = new-value, and the new value has to be in quotes (because it is a string), but other than that, it’s pretty straightforward.
If we’re recoding a string that has spaces or that starts with a number, we need to put them in between backticks. You’ll be used to the single quote or apostrophe (’) and double quote ("), but in R, we sometimes need to use the backtick as a quotation mark (`). The backtick is usually located to the left of the 1 button on your keyboard (yes, that button is finally useful!)
x <- c("One","Two","3","Fo ur")
recode(x,`3` = "Three",`Fo ur` = "Four")
## [1] "One" "Two" "Three" "Four"
Technically, the recode() function can be used with a numeric vector rather than a character vector, but it is much easier not to (and there are other alternatives).
The recode() function is pretty useful for a lot of circumstances, but it still looks a little clunky in our code. Previously, we learned about creating branches of code with the if and else statements. We can do a similar thing using a function.
x <- c("One","Two","3","Four")
if_else(x == "3","Three",x)
## [1] "One" "Two" "Three" "Four"
For our vector, it evaluates the first argument (x == "3", which gives a logical vector); wherever it’s TRUE, it uses the second argument, and wherever it’s FALSE, it uses the third argument. A lot of the time, the third argument will just be the original vector (so we only change it if the logical statement is TRUE).
We’re also going to introduce a new operator, the %in% operator. Remember, operators have something on the left and something on the right. The %in% operator checks if the things in the vector on the left are in the vector on the right.
x <- c(1,2,3,4,5,6)
x %in% c(2,4,6)
## [1] FALSE TRUE FALSE TRUE FALSE TRUE
For every element in x, we check whether it is in the vector on the right. If it is, it returns TRUE; otherwise, it returns FALSE. We can use this instead of the previous recode:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = if_else(Gender %in% c("m","M","male"),"Male",Gender), #Tidy up Gender
Gender = if_else(Gender %in% c("f","F","female"),"Female",Gender))
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 22/03/1965 Male 5 6 7 1
## 2 2 17/04/1977 Female 5 9 11 0
## 3 3 24/07/1966 Female 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 Male 6 0 10 10
## 6 6 09/07/1987 Male 6 2 11 12
## 7 7 23/01/1995 Female 5 10 10 2
## 8 8 31/12/1992 Male 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 Female 5 2 8 6
## # … with 140 more rows
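To see what %in% is doing inside the if_else() calls above, here is a small base R sketch on a made-up character vector. The names gender and is_male are invented for this example.

```r
# A made-up vector of messy codes
gender <- c("m", "Female", "male", "Non-Binary", "F")

# TRUE wherever the entry is one of the "male" spellings
is_male <- gender %in% c("m", "M", "male")
is_male

# Use the logical vector to replace only the matching entries
gender[is_male] <- "Male"
gender
```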
There are plenty of situations where recode() is better than if_else(). Since we’re wanting to turn a lot of values into a few simpler ones, the if_else() function is better. If we’re just wanting to rename the values in our variable, then recode() works better. For example, if all the values were either “M”, “F” or “N” and we want them to be “Male”, “Female” or “Non-Binary”, then recode() would be the better option.
But remember that “Better” is often subjective. These commands do the same thing, but whichever you understand better should be the one you use!
We’ve converted the recode() function into a pair of if_else() functions, but, we’re still having to perform the if_else() twice. We could neaten it up a little using pipes:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = if_else(Gender %in% c("m","M","male"),"Male",Gender) %>% #Tidy up Gender
if_else(. %in% c("f","F","female"),"Female",.))
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 22/03/1965 Male 5 6 7 1
## 2 2 17/04/1977 Female 5 9 11 0
## 3 3 24/07/1966 Female 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 Male 6 0 10 10
## 6 6 09/07/1987 Male 6 2 11 12
## 7 7 23/01/1995 Female 5 10 10 2
## 8 8 31/12/1992 Male 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 Female 5 2 8 6
## # … with 140 more rows
Here, we’ve used a . to represent the data being piped in. Ordinarily, the pipe will put the data into the first argument of the function, but we can override this by using the . to say exactly where the data should go.
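A tiny sketch of the difference, using the base R paste() function so that we can see where the piped value ends up. The names first_arg and last_arg are made up for this example.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

# Default: the left-hand side becomes the first argument,
# so this is the same as paste("Hello", "World")
first_arg <- "Hello" %>% paste("World")

# With a dot: the left-hand side goes wherever the dot is,
# so this is the same as paste("World", "Hello")
last_arg <- "Hello" %>% paste("World", .)

first_arg
last_arg
```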
In this example, we’ve chained together a pair of if_else() functions. Not too bad when we’re just doing two, but if we needed to look at more cases, this could get complicated. We can use the case_when() function for this:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = case_when(Gender %in% c("m","M","male") ~ "Male", #Tidy up Gender
Gender %in% c("f","F","female") ~ "Female",
T ~ Gender))
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 22/03/1965 Male 5 6 7 1
## 2 2 17/04/1977 Female 5 9 11 0
## 3 3 24/07/1966 Female 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 Male 6 0 10 10
## 6 6 09/07/1987 Male 6 2 11 12
## 7 7 23/01/1995 Female 5 10 10 2
## 8 8 31/12/1992 Male 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 Female 5 2 8 6
## # … with 140 more rows
Each argument in the case_when() function is a logical condition, just like in the if_else() function, followed by the ~ symbol, and then the result that we want to use in that case. R evaluates each of the conditional statements in order, which means that if two statements are true (for your given data), R will return the result for the first one. This is also why we finish with T ~ Gender to give the case_when() function a default option: if none of the other statements are TRUE, then this last one is definitely (and literally) TRUE.
x <- 1:10
case_when(x < 3 ~ "small", #This is true for 1 & 2
x < 7 ~ "medium", #This is true for 1, 2, 3, 4, 5 and 6
T ~ "large") #Everything else (7, 8, 9 & 10) gets this result
## [1] "small" "small" "medium" "medium" "medium" "medium" "large" "large"
## [9] "large" "large"
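Because the conditions are checked in order, swapping them around changes the answer. The sketch below deliberately puts the conditions in the wrong order to show what goes wrong; wrong_order is a name made up for this example.

```r
library(dplyr)

x <- 1:10

# The broader condition comes first, so 1 and 2 are caught by it
# and the "small" condition is never reached
wrong_order <- case_when(x < 7 ~ "medium",
                         x < 3 ~ "small",
                         TRUE  ~ "large")
table(wrong_order)
```

Nothing ends up labelled "small", so the rule of thumb is to put the most specific conditions first.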
For our purposes, we can use any of the previous options: recode(), if_else() or case_when(). It’s up to personal choice and style. Which do you find the easiest to understand? Which makes the most sense and is clearest for you? Whichever one it is, keep that one. We’re going to look at one final way to do this soon.
The last few bits involved evaluating a conditional statement. Sometimes, we need to check a few different statements at once. We might need them both to be true, or just one of them, or neither of them. We can use operators to combine logical vectors (just like we use operators for numeric vectors, e.g. c(5,4) + c(3,2)).
x <- c(T,T,F,F)
y <- c(T,F,T,F)
x & y #AND operator, both the left and right must be TRUE
## [1] TRUE FALSE FALSE FALSE
x | y #OR operator, either the left or the right needs to be TRUE
## [1] TRUE TRUE TRUE FALSE
xor(x,y) #Exclusive OR, exactly one must be TRUE
## [1] FALSE TRUE TRUE FALSE
Obviously the last one, xor() is a function and not an operator, but it can still come in handy for our if_else(), case_when() and even just the plain old if() statements.
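As a sketch of how these fit inside if_else() and case_when(), here is a made-up example. The age and member vectors and the labels are all invented for illustration.

```r
library(dplyr)

# Made-up data: ages and whether each person is a member
age    <- c(15, 30, 70, 45)
member <- c(TRUE, FALSE, TRUE, TRUE)

# AND: both conditions must hold
status <- if_else(age >= 18 & member, "adult member", "other")
status

# OR: either condition is enough
ticket <- case_when(age < 18 | age >= 65 ~ "concession",
                    TRUE                 ~ "standard")
ticket
```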
In data terms, we’d say that the Gender variable is categorical. Since we have categorical data, it would make sense to store it in our tibble as such. R’s data type for categorical data is the factor. Let’s use it!
Underneath, R stores a factor as a number with an associated vector of levels, which is (usually) a character vector.
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = case_when(Gender %in% c("m","M","male") ~ "Male",
Gender %in% c("f","F","female") ~ "Female",
T ~ Gender), #Tidy up Gender
Gender = factor(Gender))
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <fct> <chr> <chr>
## 1 1 22/03/1965 Male 5 6 7 1
## 2 2 17/04/1977 Female 5 9 11 0
## 3 3 24/07/1966 Female 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 Male 6 0 10 10
## 6 6 09/07/1987 Male 6 2 11 12
## 7 7 23/01/1995 Female 5 10 10 2
## 8 8 31/12/1992 Male 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 Female 5 2 8 6
## # … with 140 more rows
The data type of Gender has switched to <fct>, which means it’s a factor. If we have a look at it, it’ll tell us the levels of that factor (i.e. what are all the possible values)
head(dat_clean$Gender)
## [1] Male Female Female Female Male Male
## Levels: Female Male Non-Binary
head(as.numeric(dat_clean$Gender)) #This is what R sees
## [1] 2 1 1 1 2 2
levels(dat_clean$Gender) #This is what we see
## [1] "Female" "Male" "Non-Binary"
In this version, “Female” has been chosen to be the first level. This means if we run any analysis, “Female” will be used as the reference category. This is because factor() sorts the levels alphabetically by default, and R just uses the first one. A lot of functions that can take a factor as an argument will automatically convert a character into a factor and, again, just use the first level as the reference.
This might not always be what we want, so it is safer to set the reference category explicitly. We’re going to make sure the “Female” category is our reference category using the relevel() function (to use a different reference, pass its name instead):
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = case_when(Gender %in% c("m","M","male") ~ "Male",
Gender %in% c("f","F","female") ~ "Female",
T ~ Gender) %>% #Tidy up Gender
factor %>%
relevel("Female"))
levels(dat_clean$Gender)
## [1] "Female" "Male" "Non-Binary"
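To see relevel() actually move a level, here is a base R sketch on a small made-up factor where we pick "Male" as the reference instead. The names g and g_male_ref are invented for this example.

```r
# A small made-up factor
g <- factor(c("Male", "Female", "Non-Binary", "Male"))

# By default, factor() puts the levels in alphabetical order
levels(g)

# relevel() moves the chosen level to the front; the data are unchanged
g_male_ref <- relevel(g, ref = "Male")
levels(g_male_ref)
as.character(g_male_ref)
```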
The forcats package in the tidyverse provides a bunch of ways to re-order factors depending on the other data in your dataset. For example, you can use fct_infreq() to set the levels to be in order depending on how many of each category there are (i.e. the biggest category becomes the reference category), or you can set them manually using fct_relevel().
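Here is a short sketch of two forcats re-ordering functions, fct_infreq() and fct_relevel(), on a made-up factor f.

```r
library(forcats)

# Made-up factor: one "a", two "b"s and three "c"s
f <- factor(c("a", "b", "b", "c", "c", "c"))
levels(f)

# Order the levels by how often they appear (most frequent first)
by_freq <- fct_infreq(f)
levels(by_freq)

# Move a chosen level to the front by hand
b_first <- fct_relevel(f, "b")
levels(b_first)
```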
Above, we used the recode(), if_else() and case_when() methods of re-organising the Gender variable, but forcats has its own way. We can restructure Gender and turn it into a factor in one go:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F")))
levels(dat_clean$Gender)
## [1] "Female" "Male" "Non-Binary"
The only thing to be wary of with this method is that the levels we are collapsing must exist in the factor. For example, if we had put Male = c("m","male","M","Mal"), R would complain about unknown levels, because there is no Mal value in the Gender variable.
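It’s also worth seeing what happens to values that we don’t mention at all. In this sketch (with a made-up codes vector), "Non-Binary" isn’t listed in fct_collapse(), and it is simply left as it is.

```r
library(forcats)

# Made-up messy codes
codes <- factor(c("m", "M", "f", "Non-Binary"))

# Collapse the spellings we know about; anything else is untouched
collapsed <- fct_collapse(codes,
                          Male   = c("m", "M"),
                          Female = "f")
as.character(collapsed)
```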
The Height & Weight variables are still in an unusual form. From inspection, we can see that they’re written in imperial measures: feet & inches and stone & lbs. For Height, the feet & inches are separated by a space, so we can use this to split the vector into two:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") #Split Height up
dat_clean
## # A tibble: 150 x 6
## ID DOB Gender Feet Inches Weight
## <dbl> <chr> <fct> <chr> <chr> <chr>
## 1 1 22/03/1965 Male 5 6 7 1
## 2 2 17/04/1977 Female 5 9 11 0
## 3 3 24/07/1966 Female 5 4 9 8
## 4 4 22/02/1985 Female 5 7 11 1
## 5 5 30/07/1985 Male 6 0 10 10
## 6 6 09/07/1987 Male 6 2 11 12
## 7 7 23/01/1995 Female 5 10 10 2
## 8 8 31/12/1992 Male 6 3 12 1
## 9 9 24/01/1952 Male 5 4 7 10
## 10 10 11/06/1953 Female 5 2 8 6
## # … with 140 more rows
This removes the Height variable and replaces it with two new ones, the Feet and Inches variables. We can now combine these back together by converting them into metres using the following conversion: 1 foot = 0.3048 m and 1 inch = 0.0254 m.
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
mutate(Height = 0.3048*Feet + 0.0254*Inches) #Convert into metres
## Error in 0.3048 * Feet: non-numeric argument to binary operator
We’ve hit our first error. Feet and Inches are still stored as characters. We need to convert them into numbers before we can multiply and add them. We could do it this way:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
mutate(Feet = as.numeric(Feet), #Convert to Numbers
Inches = as.numeric(Inches))
Which is easy enough, but what if we were wanting to do it with a lot of variables at once?
We can use the mutate_at() function to apply a function to a bunch of variables all at once. This is pretty advanced tidyverse stuff!
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
mutate_at(c("Feet","Inches"),as.numeric) #Convert to Numbers
dat_clean
This is a scoped version of the mutate() function, meaning that it mutates the tibble on a specific set of variables. We told the mutate_at() function that we want to apply the as.numeric() function to Feet and Inches. We passed the names of the variables as strings to mutate_at() and then passed the function without the brackets at the end.
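As an aside, separate() has a convert argument that can do the type conversion for us as it splits. This is an alternative to the mutate_at() step, not what we use in the rest of this session, and the toy tibble below is made up for the example.

```r
library(dplyr)
library(tidyr)

# Made-up heights in "feet inches" form
toy <- tibble(Height = c("5 6", "6 0", "5 10"))

# convert = TRUE asks separate() to turn the new columns into numbers
toy_split <- toy %>%
  separate(Height, c("Feet", "Inches"), sep = " ", convert = TRUE)
toy_split
```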
We can now go back to the formula for converting feet/inches to metres and re-run it with Feet and Inches as numbers
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
mutate_at(c("Feet","Inches"),as.numeric) %>% #Convert to Numbers
mutate(Height = 0.3048*Feet + 0.0254*Inches) #Convert into metres
dat_clean
## # A tibble: 150 x 7
## ID DOB Gender Feet Inches Weight Height
## <dbl> <chr> <fct> <dbl> <dbl> <chr> <dbl>
## 1 1 22/03/1965 Male 5 6 7 1 1.68
## 2 2 17/04/1977 Female 5 9 11 0 1.75
## 3 3 24/07/1966 Female 5 4 9 8 1.63
## 4 4 22/02/1985 Female 5 7 11 1 1.70
## 5 5 30/07/1985 Male 6 0 10 10 1.83
## 6 6 09/07/1987 Male 6 2 11 12 1.88
## 7 7 23/01/1995 Female 5 10 10 2 1.78
## 8 8 31/12/1992 Male 6 3 12 1 1.91
## 9 9 24/01/1952 Male 5 4 7 10 1.63
## 10 10 11/06/1953 Female 5 2 8 6 1.57
## # … with 140 more rows
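It’s always worth checking a conversion by hand on a value or two. The sketch below wraps the same formula in a small function, to_metres(), which is a name made up for this example, and checks it against the first row (5 feet 6 inches) and the fifth row (6 feet 0 inches).

```r
# Same formula as in the mutate() call, wrapped in a function
to_metres <- function(feet, inches) {
  0.3048 * feet + 0.0254 * inches
}

# 5 feet 6 inches: 1.524 + 0.1524 = 1.6764 metres
to_metres(5, 6)

# Works on whole vectors too, rounded like the tibble print-out
round(to_metres(c(5, 6), c(6, 0)), 2)
```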
Almost done. We’ve now got Height into metres, but do we really need the Feet and Inches variables anymore? We can pick which variables we want to keep in our dataset using the select() function
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
mutate_at(c("Feet","Inches"),as.numeric) %>% #Convert to Numbers
mutate(Height = 0.3048*Feet + 0.0254*Inches) %>% #Convert into metres
select(ID,DOB,Gender,Height,Weight) #Pick just what we want to keep
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Height Weight
## <dbl> <chr> <fct> <dbl> <chr>
## 1 1 22/03/1965 Male 1.68 7 1
## 2 2 17/04/1977 Female 1.75 11 0
## 3 3 24/07/1966 Female 1.63 9 8
## 4 4 22/02/1985 Female 1.70 11 1
## 5 5 30/07/1985 Male 1.83 10 10
## 6 6 09/07/1987 Male 1.88 11 12
## 7 7 23/01/1995 Female 1.78 10 2
## 8 8 31/12/1992 Male 1.91 12 1
## 9 9 24/01/1952 Male 1.63 7 10
## 10 10 11/06/1953 Female 1.57 8 6
## # … with 140 more rows
Similar to working with vectors, we can also tell R which variables we don’t want. This might actually be easier here.
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
mutate_at(c("Feet","Inches"),as.numeric) %>% #Convert to Numbers
mutate(Height = 0.3048*Feet + 0.0254*Inches) %>% #Convert into metres
select(-Feet,-Inches) #Drop what we don't want
dat_clean
## # A tibble: 150 x 5
## ID DOB Gender Weight Height
## <dbl> <chr> <fct> <chr> <dbl>
## 1 1 22/03/1965 Male 7 1 1.68
## 2 2 17/04/1977 Female 11 0 1.75
## 3 3 24/07/1966 Female 9 8 1.63
## 4 4 22/02/1985 Female 11 1 1.70
## 5 5 30/07/1985 Male 10 10 1.83
## 6 6 09/07/1987 Male 11 12 1.88
## 7 7 23/01/1995 Female 10 2 1.78
## 8 8 31/12/1992 Male 12 1 1.91
## 9 9 24/01/1952 Male 7 10 1.63
## 10 10 11/06/1953 Female 8 6 1.57
## # … with 140 more rows
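The two outputs above hold the same columns but in a different order. A small sketch on made-up data (the toy tibble is an assumption for illustration) shows why: naming the columns to keep puts them in the order we typed them, while dropping columns leaves the rest in their existing order.

```r
library(dplyr)

# Toy data (made up) with the same column layout as our pipeline at this point
toy <- tibble(ID = 1:2,
              Feet = c(5, 6),
              Inches = c(6, 0),
              Weight = c("7 1", "10 10"),
              Height = c(1.68, 1.83))

kept    <- toy %>% select(ID, Height, Weight)  # keep, in the order we list them
dropped <- toy %>% select(-Feet, -Inches)      # drop, everything else stays put

names(kept)     # "ID" "Height" "Weight"
names(dropped)  # "ID" "Weight" "Height"
```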
We’ve converted Height from feet and inches into metres. We still need to do the same for Weight, converting it into kg. The conversion factors are 1 stone = 6.35029 kg and 1 lb = 0.453592 kg (the solution below truncates these to 6.3502 and 0.4535). Can you replicate what I did here for the Weight variable? Take five minutes and have a think. Some of the functions you need can be combined with the ones we already ran for Height. We also want to create a new variable, BMI, which is calculated as BMI = Weight/Height^2 (with Weight in kg and Height in metres). Let’s add this too!
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
mutate(Height = 0.3048*Feet + 0.0254*Inches,
Weight = 6.3502*Stone + 0.4535*Pounds,
BMI = Weight/Height^2) %>% #Convert to metric, add BMI
select(-Feet,-Inches,-Stone,-Pounds) #Drop what we don't want
dat_clean
## # A tibble: 150 x 6
## ID DOB Gender Height Weight BMI
## <dbl> <chr> <fct> <dbl> <dbl> <dbl>
## 1 1 22/03/1965 Male 1.68 44.9 16.0
## 2 2 17/04/1977 Female 1.75 69.9 22.7
## 3 3 24/07/1966 Female 1.63 60.8 23.0
## 4 4 22/02/1985 Female 1.70 70.3 24.3
## 5 5 30/07/1985 Male 1.83 68.0 20.3
## 6 6 09/07/1987 Male 1.88 75.3 21.3
## 7 7 23/01/1995 Female 1.78 64.4 20.4
## 8 8 31/12/1992 Male 1.91 76.7 21.1
## 9 9 24/01/1952 Male 1.63 49.0 18.5
## 10 10 11/06/1953 Female 1.57 53.5 21.6
## # … with 140 more rows
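It is worth checking a result like this by hand. Here is a short sketch that redoes the calculation for the first person in the table (5 ft 6 in, 7 st 1 lb), using the same conversion factors as the code above; the variable names are just for illustration.

```r
# Raw values for ID 1, read off the earlier output
feet <- 5; inches <- 6
stone <- 7; pounds <- 1

height_m  <- 0.3048 * feet + 0.0254 * inches   # 1.524 + 0.1524 = 1.6764
weight_kg <- 6.3502 * stone + 0.4535 * pounds  # 44.4514 + 0.4535 = 44.9049
bmi       <- weight_kg / height_m^2            # roughly 15.98

round(c(height_m, weight_kg, bmi), 2)
```

These agree with the first row of the tibble (1.68, 44.9 and 16.0) once rounded to the precision that the tibble prints.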
Dates are hard! This is just a fact. Leap days, time zones, Daylight Saving Time and leap seconds all make dates difficult to deal with. We’re going to need a specialised package for dates: lubridate. This package is designed with the tidyverse in mind and is normally installed alongside it, but in older versions of the tidyverse (before 2.0.0) it is not loaded by library(tidyverse). If R tells you the package is missing, install it from the console with install.packages("lubridate").
As is convention (and to keep our code organised), we’re going to load the package using the library() function at the start of our code, rather than in the middle.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
This little message just tells us that there is a function called date() in both the base package and the lubridate package. Since we’ve loaded lubridate, its version masks (takes priority over) the date() function from base. Which is fine, since lubridate has a better version. We can still use the original date() function by writing base::date() (but we don’t need it).
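To make the masking concrete, here is a small sketch with a made-up timestamp. After loading lubridate, a plain date() call uses lubridate's version, while the package::function form still reaches the original.

```r
library(lubridate)

# A made-up date-time, purely for illustration
stamp <- ymd_hms("2020-01-15 13:45:00")

date(stamp)          # lubridate's date(): pulls out the Date part, 2020-01-15
class(base::date())  # base's date(): the current date and time as a character string
```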
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
mutate(Height = 0.3048*Feet + 0.0254*Inches,
Weight = 6.3502*Stone + 0.4535*Pounds,
BMI = Weight/Height^2) %>% #Convert to metric, add BMI
select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
mutate(DOB = dmy(DOB)) # Convert Date
dat_clean
## # A tibble: 150 x 6
## ID DOB Gender Height Weight BMI
## <dbl> <date> <fct> <dbl> <dbl> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0
## 2 2 1977-04-17 Female 1.75 69.9 22.7
## 3 3 1966-07-24 Female 1.63 60.8 23.0
## 4 4 1985-02-22 Female 1.70 70.3 24.3
## 5 5 1985-07-30 Male 1.83 68.0 20.3
## 6 6 1987-07-09 Male 1.88 75.3 21.3
## 7 7 1995-01-23 Female 1.78 64.4 20.4
## 8 8 1992-12-31 Male 1.91 76.7 21.1
## 9 9 1952-01-24 Male 1.63 49.0 18.5
## 10 10 1953-06-11 Female 1.57 53.5 21.6
## # … with 140 more rows
The dmy() function takes in dates of the form “Day-Month-Year” and converts them into the Date type. Pretty intuitive. There are similar functions, such as mdy() and ymd(), for different formats. These are all documented in the same ?dmy help file. The Date type behaves much like a number, since it is treated as a continuous variable rather than a character, which makes it much easier to work with. For example, we can now sort dates and find the median DOB using the median() function from base R’s stats package:
median(dat_clean$DOB)
## [1] NA
Oh, it’s returned NA, which means that there are NA values in our data. NA is R’s marker for missing values. Let’s see how many we have. The is.na() function returns TRUE if an element of a vector is NA and FALSE if it isn’t (i.e. it contains an actual value). Summing the result counts the TRUEs:
dat_clean$DOB %>%
is.na %>%
sum
## [1] 5
A lot of functions also contain the option to ignore NA values, so let’s get the real median()
median(dat_clean$DOB, na.rm=T) #na.rm=T tells median to remove the NAs
## [1] "1978-03-11"
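The steps above can be sketched on a tiny made-up vector. The assumption here is that a missing date arrives as text that dmy() cannot read; we have not checked that this is what happened in our dataset, but it is one common way NAs end up in a date column.

```r
library(lubridate)

# Made-up dates of birth; "unknown" cannot be parsed as a date
dob_raw <- c("22/03/1965", "17/04/1977", "unknown", "24/07/1966")

# dmy() returns NA (with a warning) for anything it cannot parse
dob <- suppressWarnings(dmy(dob_raw))

sum(is.na(dob))            # how many are missing
median(dob)                # NA, because one value is missing
median(dob, na.rm = TRUE)  # the middle of the three remaining dates

# The sibling functions read the same day from differently ordered text
dmy("22/03/1965") == mdy("03/22/1965")
dmy("22/03/1965") == ymd("1965-03-22")
```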
In our DOB variable, we have some NAs and they can really mess up our work. Let’s get rid of those entries using the filter() function. Remember that the ! (not) operator turns TRUE into FALSE and vice versa:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
mutate(Height = 0.3048*Feet + 0.0254*Inches,
Weight = 6.3502*Stone + 0.4535*Pounds,
BMI = Weight/Height^2) %>% #Convert to metric, add BMI
select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
mutate(DOB = dmy(DOB)) %>% # Convert Date
filter(!is.na(DOB)) #Get rid of the NAs
dat_clean
## # A tibble: 145 x 6
## ID DOB Gender Height Weight BMI
## <dbl> <date> <fct> <dbl> <dbl> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0
## 2 2 1977-04-17 Female 1.75 69.9 22.7
## 3 3 1966-07-24 Female 1.63 60.8 23.0
## 4 4 1985-02-22 Female 1.70 70.3 24.3
## 5 5 1985-07-30 Male 1.83 68.0 20.3
## 6 6 1987-07-09 Male 1.88 75.3 21.3
## 7 7 1995-01-23 Female 1.78 64.4 20.4
## 8 8 1992-12-31 Male 1.91 76.7 21.1
## 9 9 1952-01-24 Male 1.63 49.0 18.5
## 10 10 1953-06-11 Female 1.57 53.5 21.6
## # … with 135 more rows
Now we’ve got rid of every row with a missing DOB (145 of the original 150 rows remain), so there are no NAs left in that variable.
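To see what filter(!is.na(DOB)) is doing step by step, here is a sketch on a made-up four-row tibble. filter() keeps the rows where the condition is TRUE, so we flip the result of is.na() with ! to keep the rows that are not missing.

```r
library(dplyr)

# Toy data (made up): two of the four dates of birth are missing
toy <- tibble(ID = 1:4,
              DOB = as.Date(c("1965-03-22", NA, "1966-07-24", NA)))

is.na(toy$DOB)   # FALSE  TRUE FALSE  TRUE
!is.na(toy$DOB)  #  TRUE FALSE  TRUE FALSE

toy_complete <- toy %>%
  filter(!is.na(DOB))  # keeps only the rows for ID 1 and ID 3

toy_complete
```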
Now we have a bunch of tidy data, what can we do with it? Well firstly, we can grab out some important descriptive statistics. Previously, we have used things like the mean() and sd() functions on vectors and we can do that here within our tibble. What happens if we do it in a mutate() call?
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
mutate(Height = 0.3048*Feet + 0.0254*Inches,
Weight = 6.3502*Stone + 0.4535*Pounds,
BMI = Weight/Height^2) %>% #Convert to metric, add BMI
select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
mutate(DOB = dmy(DOB)) %>% # Convert Date
filter(!is.na(DOB)) %>% #Get rid of the NAs
mutate(mean.BMI = mean(BMI))
dat_clean
## # A tibble: 145 x 7
## ID DOB Gender Height Weight BMI mean.BMI
## <dbl> <date> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0 20.5
## 2 2 1977-04-17 Female 1.75 69.9 22.7 20.5
## 3 3 1966-07-24 Female 1.63 60.8 23.0 20.5
## 4 4 1985-02-22 Female 1.70 70.3 24.3 20.5
## 5 5 1985-07-30 Male 1.83 68.0 20.3 20.5
## 6 6 1987-07-09 Male 1.88 75.3 21.3 20.5
## 7 7 1995-01-23 Female 1.78 64.4 20.4 20.5
## 8 8 1992-12-31 Male 1.91 76.7 21.1 20.5
## 9 9 1952-01-24 Male 1.63 49.0 18.5 20.5
## 10 10 1953-06-11 Female 1.57 53.5 21.6 20.5
## # … with 135 more rows
We now have the average BMI repeated in every row. Maybe that’s what we wanted. But it’s not what I wanted. Let’s put it in a summarise() call instead. Since we’ve neatened up our dat_clean dataset as much as possible, we’re going to save it and work from there:
dat_clean <- dat %>%
mutate(ID = as.numeric(ID)) %>% #Turn ID into numeric
rename(Height = Ht, Weight = Wgt) %>% #Rename some variables
mutate(Gender = fct_collapse(Gender, #Tidy up Gender
Male = c("m","male","M"),
Female = c("f","female","F"))) %>%
separate(Height,c("Feet","Inches"),sep=" ") %>% #Split Height up
separate(Weight,c("Stone","Pounds"),sep=" ") %>% #Split Weight up
mutate_at(c("Feet","Inches","Stone","Pounds"),as.numeric) %>% #Convert
mutate(Height = 0.3048*Feet + 0.0254*Inches,
Weight = 6.3502*Stone + 0.4535*Pounds,
BMI = Weight/Height^2) %>% #Convert to metric, add BMI
select(-Feet,-Inches,-Stone,-Pounds) %>% #Drop what we don't want
mutate(DOB = dmy(DOB)) %>% # Convert Date
filter(!is.na(DOB)) #Get rid of the NAs
dat_clean %>%
summarise(mean.BMI = mean(BMI))
## # A tibble: 1 x 1
## mean.BMI
## <dbl>
## 1 20.5
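The difference between the two calls is easiest to see side by side. In this sketch (toy BMI values, made up for illustration), mutate() keeps every row and repeats the mean, while summarise() collapses the tibble down to a single row.

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(BMI = c(16, 22, 25))

with_mutate    <- toy %>% mutate(mean.BMI = mean(BMI))     # 3 rows, mean repeated
with_summarise <- toy %>% summarise(mean.BMI = mean(BMI))  # 1 row

nrow(with_mutate)
nrow(with_summarise)
```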
Well, that’s a bit better, but what if we want more than just the mean of the BMI? Let’s over-complicate this summarise() function:
dat_clean %>%
summarise_at(c("Height","Weight","BMI"),
list(mean=mean,sd=sd,min=min,max=max))
## # A tibble: 1 x 12
## Height_mean Weight_mean BMI_mean Height_sd Weight_sd BMI_sd Height_min
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.70 59.5 20.5 0.0943 14.6 4.24 1.47
## # … with 5 more variables: Weight_min <dbl>, BMI_min <dbl>, Height_max <dbl>,
## # Weight_max <dbl>, BMI_max <dbl>
Within the summarise_at() function, we’ve created a list() of functions and told R to apply each of them to all of the variables we gave it. Each result is named variable_function, e.g. Height_mean. This is very similar to the way we used mutate_at() earlier.
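Here is a cut-down sketch of the same call on made-up data, with only two variables and two functions, so that the whole result fits on screen. The names we give inside list() become the suffixes of the new column names.

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(Height = c(1.60, 1.70, 1.80),
              Weight = c(50, 60, 70))

toy_summary <- toy %>%
  summarise_at(c("Height", "Weight"),
               list(mean = mean, sd = sd))  # names here become column suffixes

names(toy_summary)  # one column per variable/function pair
toy_summary
```

Two variables and two functions give four columns; our real call had three variables and four functions, which is where the 12 columns came from.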
It’s all very well having a summary of the entire dataset, but what if I want to stratify? I want the same statistics, but split by the value in Gender.
Easy. Peasy.
dat_clean %>%
group_by(Gender) #Group the data
## # A tibble: 145 x 6
## # Groups: Gender [3]
## ID DOB Gender Height Weight BMI
## <dbl> <date> <fct> <dbl> <dbl> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0
## 2 2 1977-04-17 Female 1.75 69.9 22.7
## 3 3 1966-07-24 Female 1.63 60.8 23.0
## 4 4 1985-02-22 Female 1.70 70.3 24.3
## 5 5 1985-07-30 Male 1.83 68.0 20.3
## 6 6 1987-07-09 Male 1.88 75.3 21.3
## 7 7 1995-01-23 Female 1.78 64.4 20.4
## 8 8 1992-12-31 Male 1.91 76.7 21.1
## 9 9 1952-01-24 Male 1.63 49.0 18.5
## 10 10 1953-06-11 Female 1.57 53.5 21.6
## # … with 135 more rows
We can tell R that we want the rows in our tibble to be grouped together based on the value in the Gender variable. Now, the summarise() function will respect this grouping.
dat_clean %>%
group_by(Gender) %>% #Group the data
summarise_at(c("Height","Weight","BMI"),
list(mean=mean,sd=sd,min=min,max=max))
## # A tibble: 3 x 13
## Gender Height_mean Weight_mean BMI_mean Height_sd Weight_sd BMI_sd Height_min
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Female 1.65 55.2 20.3 0.0764 12.8 4.16 1.47
## 2 Male 1.76 64.5 20.8 0.0785 15.2 4.40 1.55
## 3 Non-B… 1.62 54.1 20.6 0.0381 9.40 3.14 1.57
## # … with 5 more variables: Weight_min <dbl>, BMI_min <dbl>, Height_max <dbl>,
## # Weight_max <dbl>, BMI_max <dbl>
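A smaller sketch of grouping, on made-up data, makes it easier to check the numbers by eye. The n() function (from dplyr) counts the rows in each group; it is not used in our main code and is included here only as an illustration.

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(Gender = c("Male", "Female", "Female", "Male"),
              Height = c(1.80, 1.60, 1.70, 1.90))

by_gender <- toy %>%
  group_by(Gender) %>%                    # one group per value of Gender
  summarise(mean_height = mean(Height),   # computed separately within each group
            n = n())                      # number of rows in each group

by_gender  # one row per group: Female (mean 1.65) and Male (mean 1.85)
```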
Quite often in our data analysis, we will need to combine data from multiple sources into a single dataset. For example, we might have taken scoring measurements separately and want to combine them back in with our original anthropometric data above.
We can do that with the family of *_join() functions. These functions take two tibbles (or data.frames), x and y, and join them together based on an identifier for each row. However, sometimes we may have an ID that is in one dataset but not in the other. The *_join() family allows us to decide which entries to keep.
inner_join() will keep only the IDs that appear in both x and y.
full_join() will keep all IDs from both, and fill in any missing values with NA.
left_join() and right_join() will keep all of the IDs in the left (x) or the right (y) dataset, regardless of whether they’re in the other or not.
Let’s bring in a new dataset and try it out!
dat_score <- read_csv("https://raw.githubusercontent.com/MyKo101/RJunk/master/data_score.csv")
## Parsed with column specification:
## cols(
## ID = col_double(),
## Score_1 = col_double(),
## Score_2 = col_double(),
## Score_3 = col_double()
## )
dat_score
## # A tibble: 150 x 4
## ID Score_1 Score_2 Score_3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 68.2 81.0 88.8
## 2 2 42.8 56.2 72.1
## 3 3 45.1 56.6 63.7
## 4 4 59.3 64.9 63.5
## 5 5 52.3 49.8 54.4
## 6 6 63.5 68.2 79.9
## 7 7 32.5 40.3 52.3
## 8 8 32.2 42.2 56.5
## 9 9 59.2 62.6 75.7
## 10 10 38.9 41.5 50.8
## # … with 140 more rows
This dataset has the same ID variable as our previous one, but don’t forget we previously deleted some rows from our dataset (based on whether DOB was missing or not), so the two don’t contain exactly the same IDs. Since we only want people who appear in both, we’re going to use inner_join(). We need to pass the datasets as arguments, along with a vector telling R which variables are our identifiers:
dat_joined <- inner_join(dat_clean,dat_score,by="ID")
dat_joined
## # A tibble: 145 x 9
## ID DOB Gender Height Weight BMI Score_1 Score_2 Score_3
## <dbl> <date> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0 68.2 81.0 88.8
## 2 2 1977-04-17 Female 1.75 69.9 22.7 42.8 56.2 72.1
## 3 3 1966-07-24 Female 1.63 60.8 23.0 45.1 56.6 63.7
## 4 4 1985-02-22 Female 1.70 70.3 24.3 59.3 64.9 63.5
## 5 5 1985-07-30 Male 1.83 68.0 20.3 52.3 49.8 54.4
## 6 6 1987-07-09 Male 1.88 75.3 21.3 63.5 68.2 79.9
## 7 7 1995-01-23 Female 1.78 64.4 20.4 32.5 40.3 52.3
## 8 8 1992-12-31 Male 1.91 76.7 21.1 32.2 42.2 56.5
## 9 9 1952-01-24 Male 1.63 49.0 18.5 59.2 62.6 75.7
## 10 10 1953-06-11 Female 1.57 53.5 21.6 38.9 41.5 50.8
## # … with 135 more rows
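To compare the four joins directly, here is a sketch with two tiny made-up tibbles whose IDs only partly overlap (IDs 2 and 3 are in both, 1 is only in x, 4 is only in y).

```r
library(dplyr)

# Toy data (made up)
x <- tibble(ID = c(1, 2, 3), Height = c(1.68, 1.75, 1.63))
y <- tibble(ID = c(2, 3, 4), Score  = c(56.2, 56.6, 64.9))

j_inner <- inner_join(x, y, by = "ID")  # IDs 2, 3: only those in both
j_full  <- full_join(x, y, by = "ID")   # IDs 1, 2, 3, 4: gaps filled with NA
j_left  <- left_join(x, y, by = "ID")   # IDs 1, 2, 3: everything in x
j_right <- right_join(x, y, by = "ID")  # IDs 2, 3, 4: everything in y

j_full  # Score is NA for ID 1, Height is NA for ID 4
```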
So we have the scores in what is known as a wide format: the table has a lot of columns and each score (1-3) has its own column. We can change this into a long format, where each ID x Score combination has its own row:
dat_joined %>%
gather(Score_1,Score_2,Score_3,key="Score_Num",value="Score")
## # A tibble: 435 x 8
## ID DOB Gender Height Weight BMI Score_Num Score
## <dbl> <date> <fct> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0 Score_1 68.2
## 2 2 1977-04-17 Female 1.75 69.9 22.7 Score_1 42.8
## 3 3 1966-07-24 Female 1.63 60.8 23.0 Score_1 45.1
## 4 4 1985-02-22 Female 1.70 70.3 24.3 Score_1 59.3
## 5 5 1985-07-30 Male 1.83 68.0 20.3 Score_1 52.3
## 6 6 1987-07-09 Male 1.88 75.3 21.3 Score_1 63.5
## 7 7 1995-01-23 Female 1.78 64.4 20.4 Score_1 32.5
## 8 8 1992-12-31 Male 1.91 76.7 21.1 Score_1 32.2
## 9 9 1952-01-24 Male 1.63 49.0 18.5 Score_1 59.2
## 10 10 1953-06-11 Female 1.57 53.5 21.6 Score_1 38.9
## # … with 425 more rows
The effects of gather() can be undone with the spread() function:
dat_joined %>%
gather(Score_1,Score_2,Score_3,key="Score_Num",value="Score") %>%
spread(key="Score_Num",value="Score")
## # A tibble: 145 x 9
## ID DOB Gender Height Weight BMI Score_1 Score_2 Score_3
## <dbl> <date> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1965-03-22 Male 1.68 44.9 16.0 68.2 81.0 88.8
## 2 2 1977-04-17 Female 1.75 69.9 22.7 42.8 56.2 72.1
## 3 3 1966-07-24 Female 1.63 60.8 23.0 45.1 56.6 63.7
## 4 4 1985-02-22 Female 1.70 70.3 24.3 59.3 64.9 63.5
## 5 5 1985-07-30 Male 1.83 68.0 20.3 52.3 49.8 54.4
## 6 6 1987-07-09 Male 1.88 75.3 21.3 63.5 68.2 79.9
## 7 7 1995-01-23 Female 1.78 64.4 20.4 32.5 40.3 52.3
## 8 8 1992-12-31 Male 1.91 76.7 21.1 32.2 42.2 56.5
## 9 9 1952-01-24 Male 1.63 49.0 18.5 59.2 62.6 75.7
## 10 10 1953-06-11 Female 1.57 53.5 21.6 38.9 41.5 50.8
## # … with 135 more rows
These two functions work really well together depending on how we want our data to look. How we want our data to look will usually depend on what analysis we are doing.
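If you are using a newer version of tidyr (1.0.0 or later), you may see pivot_longer() and pivot_wider() used for the same job; the tidyr documentation recommends them over gather() and spread() for new code. The sketch below is not part of the original lesson and uses a small made-up tibble, but it mirrors the round trip we just did.

```r
library(dplyr)
library(tidyr)

# Toy data (made up) in wide format
wide <- tibble(ID = c(1, 2),
               Score_1 = c(68.2, 42.8),
               Score_2 = c(81.0, 56.2))

# Wide to long: one row per ID x Score combination
long <- wide %>%
  pivot_longer(c(Score_1, Score_2),
               names_to = "Score_Num", values_to = "Score")

# Long back to wide
back <- long %>%
  pivot_wider(names_from = Score_Num, values_from = Score)

long
back
```

One visible difference is row order: pivot_longer() lists all of the scores for ID 1 first and then those for ID 2, whereas gather() lists every Score_1 before any Score_2.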
We’ve introduced a bunch of new functions throughout this lesson and mentioned quite a few different packages. The Reference manuals and vignettes from a package can be incredibly useful for a thorough description of how the package works. But, a more intuitive resource is the RStudio Cheatsheets. These are much more visual and are really useful as a lookup if you can’t quite remember what function you need to use (rather than trawling through the Reference manual). Not all Cheatsheets are on the RStudio page, but remember, Google is your friend and searching specifically for Cheatsheets is easy! For example: “R forcats Cheatsheet”
There is also a shortcut to some of these Cheatsheets within RStudio, in the menu bar at the top of the window: Help > Cheatsheets.
Let’s take a look at the dplyr Cheatsheet! We can find this in the Help shortcut in RStudio.