Scary statistics

Introduction to dealing with data through R

Coding can save you a lot of time doing some of the more tiresome tasks with regards to tidying and transforming data as well as enabling you to create some fantastic visuals.

Loading packages

Firstly you need load the key packages you are going to need to use. They might not be installed yet so you may need to run the install.packages() code.

In this case you will need to use “tidyverse” which is a suite of packages that enable you to tidy and graph your data (these include ggplot2 which is can be used to create excellent visualizations. You will also need the “readxl” package to read data into R from excel.

install.packages("tidyverse")

install.packages("readxl")

To load your packages use the code below library()

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(readxl)

Now you are ready to go - we will collect some data using a Microsoft Form but to get you going the code below simulates normally distributed data, 200 data points with a mean height of 175 cm and a standard deviation of 10 cm. rnorm(200, 175, 10)

I’ve called this “height”. height<- rnorm(200, 175, 10 NOTE the <- sign can usually be replaced by an equals.

We’ve then added this into a data frame so it is usable and called the data frame (data) If you write head(data) you will see the data frame and the column header named height/

set.seed(123)
height<-rnorm(200, 175, 10)
data<-data.frame(height = height)
head(data)

We can then graph the distribution of this data, work out averages (mean / median) etc.

mean(data$height)

[1] 174.9143

median(data$height)

[1] 174.4126

sd(data$height)

[1] 9.431599

Simple plot with ggplot, you can copy the final code below but it is good to make a plot step by step to understand what the code is doing. A ggplot is made up of layers, starting with selecting the data frame you want to graph and the “aesthetics”. Here we want to use data from our data frame called “data” and graph the column height on the x-axis.

ggplot(data, aes(x = height))

Next we need to add the type of graph - a histogram. Here I’ve asked it to fill the bars with a white colour. Try changing white to blue.

ggplot(data, aes(x = height)) + 
  geom_histogram(colour = 1, fill = "white")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram’s are nice but I do like “density plots” to show the distribution of the data, for this the code is slightly more complicated (but you can just copy mine).

ggplot(data, aes(x = height)) + 
  geom_histogram(aes(y = ..density..), colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4,
               fill = 4, alpha = 0.25)

Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Finally, we can change the background by adding a theme and we can give our chart and axies titles

ggplot(data, aes(x = height)) + 
  geom_histogram(aes(y = ..density..), colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4,
               fill = 4, alpha = 0.25) +
  labs(title="Distribution of height", y = "Freaquency density" , x="Height (cm)")+ 
  theme_classic()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here is a similar example of a boxplot with individual data points added and “jittered”

ggplot(data) +
  aes(x = height, y = "") +
  geom_boxplot(fill = "#67A7DD") +
  geom_jitter(alpha=0.25) +
  theme_classic()

Reading in data

The first thing we should do is set up our workspace, I would set a folder in your OneDrive for this and then set your working directory to this folder. You will then need to set your working directory - the code below is mine so will need to be changed or set manually Files –> More –> Set as working directly.

setwd("~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/Stats/Data/data_L4")

We then need to read in our Excel sheet. Download the sheet Likert.xlsx from blacboard and save it in your folder. Then we can read it in using the function read_excel() and in this case I’ve called this data.

data <- read_excel("Likert.xlsx")
head(data)

# A tibble: 6 × 4
  ID    Collection Excel Averages
  <chr>      <dbl> <dbl>    <dbl>
1 P1             2     3        5
2 P2             3     2        1
3 P3             1     5        4
4 P4             2     1        5
5 P5             1     3        2
6 P6             1     5        3

The data is in wide format, so let’s change it to long using the function pivot_longer(). Note below we are going to use a pipe function (%>%) which is useful when building code. We are selecting columns 2 to 4 and moving the column names (names of our Likert questions) to “Item” and our Likert values to “Frequency”. This code transforms your data in a flash and you have it in wide format (data) and long format (data_long) ready for whichever analysis you want to do.

data_long=data %>%
  pivot_longer(cols = c(2:4), names_to = "Item", values_to = "Frequency")
head(data_long)

# A tibble: 6 × 3
  ID    Item       Frequency
  <chr> <chr>          <dbl>
1 P1    Collection         2
2 P1    Excel              3
3 P1    Averages           5
4 P2    Collection         3
5 P2    Excel              2
6 P2    Averages           1

We’ll do a really simple plot now - there is lots you can do to make this look pretty but you can Google this or Google ggplot cheat sheet. You can also use the code from above to add a theme or titles.

ggplot(data_long, aes(x = Item, y = Frequency, fill = Item)) +
  geom_boxplot()

Wisdom of the crowd

So we’ll now have a go at bring our data on jelly beans into R and graphing it to see what the distribution looks like and see if our average is anywhere near the answer. I’ve used the code below to read in only column 7 “Beans” and turn the text into a number (numeric variable). We’ll then run a ggplot and take the average.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[1] 5228.46

[1] 1798.5

When data has a distribution with a long tail as in these data we might want to either take the median value or perhaps log transform our data. Here we will take the Log10. Here if someone guesses 100 that is converted to 2 because 10 to the power 2 (10^2) = 10 x 10 = 100. or 1000 would be 3 because 10^3 = 10 x 10 x 10 = 1000.

The code below adds a new column called log with the Log10 of Beans. We then run the same histogram and take our averages. I have then converted these back to raw units.

beans <- beans %>% 
  mutate(log = log10(Beans))

ggplot(beans, aes(x = log)) +
  geom_histogram(aes(y = ..density..),
                 colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4,
               fill = 4, alpha = 0.25) +
  labs(title="Distribution of the log of our guesses", y = "Freaquency" , x="Log of guesses")+ 
  theme_classic()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

lm<-mean(beans$log)
gm<-10^lm
lmed<-10^median(beans$log)
m

[1] 5228.46

gm

[1] 2262.346

med

[1] 1798.5

lmed

[1] 1787.512

The code below requries and additional packages ggpubr - this packages allows us to arrange two or more plots either side by side or stacked on top of each other. Here I’ve named the plots a for the raw data for our jelly bean guesses) and b for the log of our guesses.

library(ggpubr)

a<-ggplot(beans) +
  aes(x = "", y = Beans) +
  geom_boxplot(fill = "#67A7DD") +
  geom_jitter(alpha=0.25) +
  coord_flip() +
  theme_classic()

b<-ggplot(beans) +
  aes(x = "", y = log) +
  geom_boxplot(fill = "#67A7DD") +
  geom_jitter(alpha=0.25) +
  coord_flip() +
  theme_classic()

ggarrange(a,b, ncol = 1)

Likert scale data

So what about our Likert scale data can we quickly and easily graph these? Here I have read in the column’s for our Likert data and named these “likert”.

likert <- read_excel("Scary data.xlsx", range = cell_cols(7:14), col_names = TRUE)
data <- read_excel("Scary data.xlsx", range = cell_cols(c(7, 15, 16)), col_names = TRUE) 

head(likert)

# A tibble: 1 × 8
  ID2   `Data collection` `Using excel` Statis…¹ Types…² Data …³ Stati…⁴ Data …⁵
  <chr> <chr>             <chr>         <chr>    <chr>   <chr>   <chr>   <chr>  
1 P20   Disagree          Neutral       Strongl… Strong… Disagr… Neutral Disagr…
# … with abbreviated variable names ¹`Statistics (averages)`, ²`Types of data`,
#   ³`Data visualisation`, ⁴`Statistics (general)`, ⁵`Data distribution`

Notice how these are currently words, we might want to convert to numbers for analysis - to do this we will use a simple piece of code using the function mutate()

likert <- likert %>%
mutate_at(vars(-1), ~recode(., 
    "Strongly disagree" = 1,
    "Disagree" = 2,
    "Neutral" = 3,
    "Agree" = 4,
    "Strongly agree" = 5
  ))

head(likert)

# A tibble: 1 × 8
  ID2   `Data collection` `Using excel` Statis…¹ Types…² Data …³ Stati…⁴ Data …⁵
  <chr>             <dbl>         <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 P20                   2             3        1       1       2       3       2
# … with abbreviated variable names ¹`Statistics (averages)`, ²`Types of data`,
#   ³`Data visualisation`, ⁴`Statistics (general)`, ⁵`Data distribution`

With a slight tweak to the code above we can now pivot this to long format (there are now 8 columns.

data_long= likert %>%
  pivot_longer(cols = c(2:8), names_to = "Item", values_to = "Frequency")

head(data_long)

# A tibble: 6 × 3
  ID2   Item                  Frequency
  <chr> <chr>                     <dbl>
1 P20   Data collection               2
2 P20   Using excel                   3
3 P20   Statistics (averages)         1
4 P20   Types of data                 1
5 P20   Data visualisation            2
6 P20   Statistics (general)          3

We can now graph this data

ggplot(data_long, aes(x = Item, y = Frequency, fill = Item)) +
  geom_boxplot()

What about our height and age data

We also have two numeric values for age and height (columns 15 & 16) so let’s read these in too but remember we want to ensure we read in our unique ID too (column 7). To do this we will read the whole file and the cut it up using the code below:

data <- read_excel("Scary data.xlsx")

# Subset the columns you need
data<- subset(data, select = c(ID2, Height, Age))

data$Age <- as.numeric(data$Age)
data$Height <- as.numeric(data$Height)
head(data)

# A tibble: 1 × 3
  ID2   Height   Age
  <chr>  <dbl> <dbl>
1 P20      179    42

Can you now run histogram and density plots for these data?

Combining data

What if we have data from two forms or in two spreadsheets and we want to merge them together. As long as we have a common denominator a column that is consistent between the spreadsheets this is easy to do in R. In this case we have two forms both with an ID that we selected from the drop down menu on our form and these are the same for both forms so we can combine. The second asks us to rate our pain on a VAS (Visual analogue scale) from 1-10. We will name this pain.

The code below simply selects the data for ID (ID2) and Pain. We then need to convert pain to a number.

pain <- read_excel("Scary data 2.xlsx")

pain<-subset(pain, select = c(ID2, Pain))
pain$Pain <- as.numeric(pain$Pain)

We’ll now use a simple “left join” to join pain with our original data sheet. We are joining data and pain together with the common variable ID2 - Bingo!

data <- left_join(data, pain,
                          by = c("ID2"))
head(data)

# A tibble: 1 × 4
  ID2   Height   Age  Pain
  <chr>  <dbl> <dbl> <dbl>
1 P20      179    42     3

Try doing this in excel - it is doable but it is not easy and take time!