RStudio is a powerful tool when working with data. You can create and document your workflow; clean, tidy and visualise your data; wrangle it; and run anything from simple to highly complex analyses.
At this point you should have already:
Either downloaded R and RStudio or be accessing them through the app store on a university PC.
Set up a folder for your workflow.
Created an RStudio project saved in your folder.
The aim of this workbook is to get you comfortable with some of the basics for using R so you can create or read in data sets, produce some summary statistics, plot data and run a simple statistical test or two.
You can use R as a calculator and ask it to do some simple things, e.g.,
1 + 1
[1] 2
You can also give data names using equals (=) or, most commonly, leftward assignment (<-), or the less common rightward assignment (->):
a = 2 * 2
b <- 1 + 1
a * b -> c
c
[1] 8
Whilst there are some things that can be done in “base R”, often we need to install and load packages that extend the capacity of base R. We need to do this if we want to read data in from an Excel or .csv file. First, it is important that the file we want to import is visible in our “working directory”. You will need to set this by clicking on “Files” -> “More” -> “Set As Working Directory”.
Then you will need to make sure you have the correct packages installed and loaded. For this we will need three packages: “tidyverse” (more on this later), “janitor” and “readxl” (we will also load “car” for Q-Q plots later).
We use the code install.packages("readxl") to install a package and then library() to load it. Once installed you shouldn’t need to install it again, BUT you always need to load your packages every session.
N.B.: An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
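If you don’t have these packages yet, a single one-off call installs them all (a sketch; the names match the packages loaded below):
install.packages(c("readxl", "janitor", "tidyverse", "car"))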
library(readxl)
library(janitor)
library(tidyverse)
library(car) # for qq-plots
We want to load “Human Movement Participant Details(1-18).xlsx” and data.csv into R today. These files are on Blackboard.
First set up a folder you can always access (U-drive or TU OneDrive), then download these files and save them in your folder.
Back in R, you should now see your Excel file in “Files” (the bottom right-hand pane of your RStudio window).
We need to set our working directory before reading our data in.
How do we get this into R?
setwd("~/Library/CloudStorage/OneDrive-TeessideUniversity/Work/Teaching/Human Movement/2025/Data")
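Your path will differ; you can confirm where R is looking at any time with getwd():
getwd() # prints the current working directory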
# read in my data.csv file here
data <- read_csv("data.csv")
Rows: 30 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (3): Participant, Internal, External
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
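That message is only informative; as it suggests, you can quiet it by passing show_col_types = FALSE to read_csv():
data <- read_csv("data.csv", show_col_types = FALSE) # same data, no message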
head(data)
# A tibble: 6 × 3
Participant Internal External
<dbl> <dbl> <dbl>
1 1 32 36
2 2 6 7
3 3 12 12
4 4 7 9
5 5 23 24
6 6 9 19
# read in my participant information from the Excel file here
p_info <- read_excel("Human Movement Participant Details(1-18).xlsx")
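Not needed for this file, but worth knowing: read_excel() also takes optional arguments such as sheet = to pick a sheet by name or position and skip = to skip rows above the header (the values and object name below are hypothetical):
# hypothetical: read the second sheet, skipping one title row
other_sheet <- read_excel("Human Movement Participant Details(1-18).xlsx",
                          sheet = 2, skip = 1)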
head(p_info)
The data file “data” looks pretty simple to work with, but the participant info sheet needs tidying up. First let’s get better column headers. We’ll use the clean_names() function from the janitor package for this:
p_info <- clean_names(p_info)
Now let’s select the columns we need. We’ll use colnames() to show us our column names and make life easier, then we’ll select the id2 (which we will call id), sex, age, stature and mass columns.
colnames(p_info)
[1] "id" "start_time" "completion_time" "email"
[5] "name" "id2" "sex" "age"
[9] "stature" "mass"
p_info_clean <- p_info %>%
  select(id = id2,
         sex,
         age,
         stature,
         mass)
head(p_info_clean)
# A tibble: 6 × 5
id sex age stature mass
<chr> <chr> <chr> <chr> <chr>
1 P18 Male 23 152.4 99.2
2 P05 Male 22 178.6 77.6
3 P03 Male 19 178.3 94.3
4 P02 Male 20 175.6 67.7
5 P07 Male 43 179.8 91.6
6 P15 Male 22 188.3 79.099999999999994
Notice that age, stature and mass should be numbers but are currently set as characters (<chr>).
# let's set these as numbers
p_info_clean$age <- as.numeric(p_info_clean$age)
p_info_clean$stature <- as.numeric(p_info_clean$stature)
p_info_clean$mass <- as.numeric(p_info_clean$mass)
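As an aside, the same conversion can be written in one step with dplyr’s mutate() and across() (equivalent to the three lines above):
# convert all three columns in one call (same result as above)
p_info_clean <- p_info_clean %>%
  mutate(across(c(age, stature, mass), as.numeric))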
head(p_info_clean)
# A tibble: 6 × 5
id sex age stature mass
<chr> <chr> <dbl> <dbl> <dbl>
1 P18 Male 23 152. 99.2
2 P05 Male 22 179. 77.6
3 P03 Male 19 178. 94.3
4 P02 Male 20 176. 67.7
5 P07 Male 43 180. 91.6
6 P15 Male 22 188. 79.1
Now our data looks good.
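You can double-check the column types at any point with glimpse() from dplyr (loaded with the tidyverse):
glimpse(p_info_clean) # one row per column, showing its type and first few values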
Let’s get some averages now for the age, stature and mass of our participants.
mean(p_info_clean$age)
[1] 23
sd(p_info_clean$age)
[1] 7.132671
We can do it individually as above, or we can summarise across the columns we want; in this case columns 3, 4 and 5, i.e. c(3:5).
p_info_clean %>% summarize(across(c(3:5), mean))
# A tibble: 1 × 3
age stature mass
<dbl> <dbl> <dbl>
1 23 173. 81.6
p_info_clean %>% summarize(across(c(3:5), sd))
# A tibble: 1 × 3
age stature mass
<dbl> <dbl> <dbl>
1 7.13 11.3 12.6
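Selecting columns by position is fragile if the column order ever changes; you can name the columns instead (equivalent to the calls above):
p_info_clean %>% summarize(across(c(age, stature, mass), mean))
p_info_clean %>% summarize(across(c(age, stature, mass), sd))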
Let’s take our data.csv file. We can visualise this simply with base R, e.g. with a barchart or histogram:
barplot(height = data$Internal, names.arg = data$Participant)
hist(data$Internal)
The histogram gives us a visual representation of the distribution of our data - we need to check it meets the assumption of normality. We can also use a Q-Q plot to do this.
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential distribution. For example, if we run a statistical analysis that assumes our dependent variable is normally distributed, we can use a normal Q-Q plot to check that assumption. It’s just a visual check, not an air-tight proof, so it is somewhat subjective. But it allows us to see at a glance if our assumption is plausible and, if not, how the assumption is violated and which data points contribute to the violation.
A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a line that’s roughly straight.
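To build your intuition, you can simulate data that really is normal and Q-Q plot it for comparison (a quick sketch; the sample size, mean and SD are arbitrary):
set.seed(1) # make the simulation reproducible
sim <- rnorm(30, mean = 20, sd = 8) # 30 values from a normal distribution
qqnorm(sim) # these points should fall close to the line
qqline(sim)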
We can use base R or the car package, so you might need to install that package if you don’t have it already:
qqPlot(data$Internal)
[1] 19 29
#or in base R
qqnorm(data$Internal)
qqline(data$Internal)
In this case it looks like our data might not be normally distributed, although most of the points are within the blue shaded area, so we might want a formal test to check.
The Shapiro–Wilk test is a test of normality and simply tells us whether the data are significantly different from normal. Here the null hypothesis is that the data are normal, and the test gives us a p-value for the likelihood that we would see the distribution we do if the null hypothesis is true. Remember that p-values are crude: in reality, why is p = 0.051 meaningfully different from p = 0.049? As such, this is also not “air-tight”.
The code below runs the Shapiro–Wilk test for Internal and we get W = 0.975, p = 0.678:
shapiro.test(data$Internal)
Shapiro-Wilk normality test
data: data$Internal
W = 0.97485, p-value = 0.6783
ggplot2 is a powerful plotting tool that is part of the “tidyverse”.
The tidyverse is a collection of packages that share an underlying design philosophy, grammar and data structure. It allows you to import, wrangle (tidy & transform), visualise and model your data. Below we will primarily use a package called dplyr, but we will also use some functions from other packages.
Here we are going to cover the following data manipulation functions, which will serve you well over time (a combined sketch follows the list):
filter() picks cases based on their values.
arrange() changes the ordering of the rows.
select() picks variables based on their names.
mutate() adds new variables that are functions of existing variables.
summarise() reduces multiple values down to a single summary.
group_by() takes an existing table and converts it into a grouped table.
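Here is what several of these look like chained together on p_info_clean from earlier (a sketch: the age cut-off and the BMI calculation are illustrative choices, not part of today’s analysis):
p_info_clean %>%
  filter(age < 40) %>% # keep participants under 40
  mutate(bmi = mass / (stature / 100)^2) %>% # new column from existing ones
  group_by(sex) %>% # group the table by sex
  summarise(mean_bmi = mean(bmi)) %>% # one summary row per group
  arrange(mean_bmi) # order the rows by the summary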
Take a look at the tidyverse website where you can go into each of the different packages and functions.
# here is an example of a histogram and density plot in ggplot
ggplot(data, aes(x = Internal)) +
  geom_histogram(aes(y = after_stat(density)),
                 colour = 1, fill = "lightgreen") +
  geom_density() +
  theme_classic()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
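As the stat_bin() message suggests, you can set the bin width yourself rather than accepting the default 30 bins (a bin width of 5 score units is an arbitrary choice here):
ggplot(data, aes(x = Internal)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 5, colour = 1, fill = "lightgreen") +
  geom_density() +
  theme_classic()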
Okay, so how does it work? Well, first we want to convert our data from “wide” to “long” format to help us plot it.
# we are using pivot_longer and selecting the columns 2 and 3 with names (Internal and External) going to the column Condition, and the values going to the column Score
data_long <- data %>%
  pivot_longer(cols = c(2:3), names_to = "Condition", values_to = "Score")
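As an aside, pivot_wider() is the inverse operation and will take you back to wide format if you ever need it:
# back to one column per condition (undoes the pivot_longer above)
data_wide <- data_long %>%
  pivot_wider(names_from = "Condition", values_from = "Score")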
Now let’s plot these:
# let's start by setting a canvas from our data (data_long) and telling R we want Condition (Internal and External) on the x axis and Score on the y
ggplot(data = data_long, aes(x = Condition, y = Score))
Now let’s add a plot to it - I’m going for a box plot:
ggplot(data = data_long, aes(x = Condition, y = Score)) + # I've added + here
  geom_boxplot() # and geom_boxplot here
What about some points?
ggplot(data = data_long, aes(x = Condition, y = Score)) +
  geom_boxplot() +
  geom_point()
So, External looks more effective than Internal, but we need some statistics to find that out. We can take the mean and SD for each, but also look to see if there is a statistically “significant difference” using a t-test (because we have two conditions and because the data are “paired”, we will run a Paired Samples t-test):
# for means we can do these individually
mean(data$Internal)
[1] 19.46667
# or use the summarize function:
data %>% summarize(across(c(2:3), mean))
# A tibble: 1 × 2
Internal External
<dbl> <dbl>
1 19.5 25.3
And for our t-test:
t.test(data$Internal, data$External, paired = TRUE)
Paired t-test
data: data$Internal and data$External
t = -7.774, df = 29, p-value = 1.424e-08
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-7.410106 -4.323227
sample estimates:
mean difference
-5.866667
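You can confirm the reported mean difference directly from the raw columns; t.test() computes Internal minus External, hence the negative sign:
mean(data$Internal - data$External) # matches the sample estimate above
[1] -5.866667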
So there is a significant difference, with a higher score in the External condition. I would write this up something like this:
We observed an improved outcome in the external condition compared to the internal (mean difference 5.87, 95% confidence interval 4.32 to 7.41, p < 0.0001).