Coding can save you a lot of time doing some of the more tiresome tasks with regards to tidying and transforming data as well as enabling you to create some fantastic visuals.
Loading packages
Firstly you need load the key packages you are going to need to use. They might not be installed yet so you may need to run the install.packages() code.
In this case you will need to use “tidyverse” which is a suite of packages that enable you to tidy and graph your data (these include ggplot2 which is can be used to create excellent visualizations. You will also need the “readxl” package to read data into R from excel.
install.packages("tidyverse")
install.packages("readxl")
To load your packages use the code below library()
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(readxl)
Now you are ready to go - we will collect some data using a Microsoft Form but to get you going the code below simulates normally distributed data, 200 data points with a mean height of 175 cm and a standard deviation of 10 cm. rnorm(200, 175, 10)
I’ve called this “height”. height<- rnorm(200, 175, 10NOTE the <- sign can usually be replaced by an equals.
We’ve then added this into a data frame so it is usable and called the data frame (data) If you write head(data) you will see the data frame and the column header named height/
We can then graph the distribution of this data, work out averages (mean / median) etc.
mean(data$height)
[1] 174.9143
median(data$height)
[1] 174.4126
sd(data$height)
[1] 9.431599
Simple plot with ggplot, you can copy the final code below but it is good to make a plot step by step to understand what the code is doing. A ggplot is made up of layers, starting with selecting the data frame you want to graph and the “aesthetics”. Here we want to use data from our data frame called “data” and graph the column height on the x-axis.
ggplot(data, aes(x = height))
Next we need to add the type of graph - a histogram. Here I’ve asked it to fill the bars with a white colour. Try changing white to blue.
ggplot(data, aes(x = height)) +geom_histogram(colour =1, fill ="white")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram’s are nice but I do like “density plots” to show the distribution of the data, for this the code is slightly more complicated (but you can just copy mine).
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Finally, we can change the background by adding a theme and we can give our chart and axies titles
ggplot(data, aes(x = height)) +geom_histogram(aes(y = ..density..), colour =1, fill ="white") +geom_density(lwd =1, colour =4,fill =4, alpha =0.25) +labs(title="Distribution of height", y ="Freaquency density" , x="Height (cm)")+theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here is a similar example of a boxplot with individual data points added and “jittered”
ggplot(data) +aes(x = height, y ="") +geom_boxplot(fill ="#67A7DD") +geom_jitter(alpha=0.25) +theme_classic()
Reading in data
The first thing we should do is set up our workspace, I would set a folder in your OneDrive for this and then set your working directory to this folder. You will then need to set your working directory - the code below is mine so will need to be changed or set manually Files –> More –> Set as working directly.
We then need to read in our Excel sheet. Download the sheet Likert.xlsx from blacboard and save it in your folder. Then we can read it in using the function read_excel() and in this case I’ve called this data.
The data is in wide format, so let’s change it to long using the function pivot_longer(). Note below we are going to use a pipe function (%>%) which is useful when building code. We are selecting columns 2 to 4 and moving the column names (names of our Likert questions) to “Item” and our Likert values to “Frequency”. This code transforms your data in a flash and you have it in wide format (data) and long format (data_long) ready for whichever analysis you want to do.
We’ll do a really simple plot now - there is lots you can do to make this look pretty but you can Google this or Google ggplot cheat sheet. You can also use the code from above to add a theme or titles.
ggplot(data_long, aes(x = Item, y = Frequency, fill = Item)) +geom_boxplot()
Wisdom of the crowd
So we’ll now have a go at bring our data on jelly beans into R and graphing it to see what the distribution looks like and see if our average is anywhere near the answer. I’ve used the code below to read in only column 7 “Beans” and turn the text into a number (numeric variable). We’ll then run a ggplot and take the average.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[1] 5228.46
[1] 1798.5
When data has a distribution with a long tail as in these data we might want to either take the median value or perhaps log transform our data. Here we will take the Log10. Here if someone guesses 100 that is converted to 2 because 10 to the power 2 (10^2) = 10 x 10 = 100. or 1000 would be 3 because 10^3 = 10 x 10 x 10 = 1000.
The code below adds a new column called log with the Log10 of Beans. We then run the same histogram and take our averages. I have then converted these back to raw units.
beans <- beans %>%mutate(log =log10(Beans))ggplot(beans, aes(x = log)) +geom_histogram(aes(y = ..density..),colour =1, fill ="white") +geom_density(lwd =1, colour =4,fill =4, alpha =0.25) +labs(title="Distribution of the log of our guesses", y ="Freaquency" , x="Log of guesses")+theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The code below requries and additional packages ggpubr - this packages allows us to arrange two or more plots either side by side or stacked on top of each other. Here I’ve named the plots a for the raw data for our jelly bean guesses) and b for the log of our guesses.
So what about our Likert scale data can we quickly and easily graph these? Here I have read in the column’s for our Likert data and named these “likert”.
likert <-read_excel("Scary data.xlsx", range =cell_cols(7:14), col_names =TRUE)data <-read_excel("Scary data.xlsx", range =cell_cols(c(7, 15, 16)), col_names =TRUE) head(likert)
Notice how these are currently words, we might want to convert to numbers for analysis - to do this we will use a simple piece of code using the function mutate()
# A tibble: 6 × 3
ID2 Item Frequency
<chr> <chr> <dbl>
1 P20 Data collection 2
2 P20 Using excel 3
3 P20 Statistics (averages) 1
4 P20 Types of data 1
5 P20 Data visualisation 2
6 P20 Statistics (general) 3
We can now graph this data
ggplot(data_long, aes(x = Item, y = Frequency, fill = Item)) +geom_boxplot()
What about our height and age data
We also have two numeric values for age and height (columns 15 & 16) so let’s read these in too but remember we want to ensure we read in our unique ID too (column 7). To do this we will read the whole file and the cut it up using the code below:
data <-read_excel("Scary data.xlsx")# Subset the columns you needdata<-subset(data, select =c(ID2, Height, Age))data$Age <-as.numeric(data$Age)data$Height <-as.numeric(data$Height)head(data)
# A tibble: 1 × 3
ID2 Height Age
<chr> <dbl> <dbl>
1 P20 179 42
Can you now run histogram and density plots for these data?
Combining data
What if we have data from two forms or in two spreadsheets and we want to merge them together. As long as we have a common denominator a column that is consistent between the spreadsheets this is easy to do in R. In this case we have two forms both with an ID that we selected from the drop down menu on our form and these are the same for both forms so we can combine. The second asks us to rate our pain on a VAS (Visual analogue scale) from 1-10. We will name this pain.
The code below simply selects the data for ID (ID2) and Pain. We then need to convert pain to a number.
pain <-read_excel("Scary data 2.xlsx")pain<-subset(pain, select =c(ID2, Pain))pain$Pain <-as.numeric(pain$Pain)
We’ll now use a simple “left join” to join pain with our original data sheet. We are joining data and pain together with the common variable ID2 - Bingo!
data <-left_join(data, pain,by =c("ID2"))head(data)
# A tibble: 1 × 4
ID2 Height Age Pain
<chr> <dbl> <dbl> <dbl>
1 P20 179 42 3
Try doing this in excel - it is doable but it is not easy and take time!