# Install packages, only the first time using R
install.packages("tidyverse")
install.packages("janitor")

Applied Criminological Research Methods: Quantitative data analysis
Welcome
In this practical you will analyse a simulated (not real) survey dataset on public attitudes to policing and the criminal justice system. The dataset contains responses from 300 participants and covers a range of topics such as: trust in police, perceptions of fairness, victimisation experience, and demographic characteristics.
The structure of the data is similar to what you will get when you export your own Qualtrics survey: rows are respondents and columns are variables (questions). By the end of this session you will be able to:
- Load data into R and inspect its structure
- Produce frequency tables and summary statistics
- Check for and handle missing values
- Visualise the distribution of variables
- Explore associations between variables using crosstabs and charts
1. Before We Start
1.1. Posit Cloud
We are using Posit Cloud (formerly RStudio Cloud), which means you do not need to install anything. Everything runs in your browser. Follow the instructions in the pre-session tasks document to get the data in your session.
1.2. Open a new Script
First, open a new script in RStudio and save it in your session, so that you can come back to it later if you want to revise or modify your code.
In RStudio:
Go to File… New File… R Script
In R we use the symbol # on a line to add comments. These are useful to add quick notes about the code or to provide more explanations of what you are doing.
1.3. Load Packages
Packages are collections of extra functions that extend what R can do. Think of base R as a basic kitchen, and packages as specialist equipment. We load them at the start of every session.
We are using two packages today:
- tidyverse: a collection of tools for data manipulation and visualisation (includes ggplot2 for charts and dplyr for data manipulation/wrangling)
- janitor: makes frequency tables and crosstabs very easy to produce
# Load packages: run this first, every session
library(tidyverse)
library(janitor)

2. Import the Data
Import the dataset called policing_attitudes.csv. It is a comma-separated values (CSV) file, one of the most common formats for storing data, including exports from Qualtrics. The read_csv() function used below comes from the tidyverse, so load that package before importing.
# Import the dataset
survey <- read_csv("policing_attitudes.csv")

We have given our dataset the name survey. You will use this name every time you refer to the data in your code.
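To see what an import produces without depending on the course file, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The three rows and the name toy are hypothetical, not part of policing_attitudes.csv; show_col_types = FALSE only silences the message about column types.

```r
library(readr)

# Hypothetical three-row survey, written to a temporary file for illustration
toy_path <- tempfile(fileext = ".csv")
writeLines(c("age,gender,trust_police",
             "51,Man,5",
             "35,Woman,3",
             "36,Woman,NA"),
           toy_path)

# Read it back the same way the real dataset is imported
toy <- read_csv(toy_path, show_col_types = FALSE)

dim(toy)    # number of rows, then number of columns
names(toy)  # variable names taken from the first line of the file
```

Running dim() and names() on your own import is a quick way to confirm that the file loaded with the rows and columns you expect.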
3. Getting to Know the Data
Before any analysis, we need to understand what we have. This is not optional: it is the first and most important step of any data analysis project.
3.1. First Look
To inspect the dataset, use the function View():
View(survey)

You can also use the function head(), which shows the first 6 rows of the dataset. It is useful for looking at the data within the console, especially when you have a large dataset.
# See the first 6 rows
head(survey)

# A tibble: 6 × 12
age gender ethnicity education area_type victim_of_crime had_police_contact
<dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 51 Man White Bri… Undergra… Suburban Yes Yes
2 35 Woman Asian or … GCSEs or… Urban Yes Yes
3 36 Woman Black or … Undergra… Rural No No
4 24 Man White Bri… Postgrad… Suburban No Yes
5 62 Man White Bri… GCSEs or… Rural Yes Yes
6 22 Man Other GCSEs or… Suburban No No
# ℹ 5 more variables: trust_police <dbl>, perceived_fairness <dbl>,
# police_effectiveness <dbl>, support_community_policing <dbl>,
# contact_satisfaction <dbl>
Another way to get a quick overview of your data is the function glimpse():
glimpse(survey)

Rows: 300
Columns: 12
$ age <dbl> 51, 35, 36, 24, 62, 22, 52, 36, 62, 41, 45,…
$ gender <chr> "Man", "Woman", "Woman", "Man", "Man", "Man…
$ ethnicity <chr> "White British", "Asian or Asian British", …
$ education <chr> "Undergraduate degree", "GCSEs or equivalen…
$ area_type <chr> "Suburban", "Urban", "Rural", "Suburban", "…
$ victim_of_crime <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No"…
$ had_police_contact <chr> "Yes", "Yes", "No", "Yes", "Yes", "No", "Ye…
$ trust_police <dbl> 5, 3, 3, 3, 3, 3, 4, 4, 2, 2, 3, 2, 3, 2, 1…
$ perceived_fairness <dbl> 2, 1, 4, 4, 3, 2, 3, 4, 3, 3, 3, 3, 3, 4, 2…
$ police_effectiveness <dbl> 4, 2, 4, 2, 3, 2, 4, 3, 3, 4, 1, 4, 1, 2, 2…
$ support_community_policing <dbl> 3, 1, 3, 4, 3, 3, 3, 3, 2, 4, 2, 4, 4, 4, 4…
$ contact_satisfaction <dbl> 2, 1, NA, 1, 5, NA, 1, NA, NA, 1, NA, 5, 3,…
Task 1: Look at the output of glimpse() and answer the following:
- How many observations (rows) does the dataset have? ____________
- How many variables (columns)? ____________
- What data type is age? ____________
- What data type is gender? ____________
A note on data types:
<dbl> means numeric (a numeric variable). <chr> means character (text), which usually indicates a categorical variable. Most of your Qualtrics variables will come in as <chr>. R does not automatically know that “1 = Strongly disagree” is one level of an ordered category rather than just a word or a number. We will deal with this below.
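The difference between the types can be checked directly with class(). The sketch below uses made-up vectors (age, gender and agree are hypothetical values, not columns read from the survey file) and shows how factor() records both the categories and their order.

```r
# Hypothetical responses, not taken from the survey file
age    <- c(51, 35, 36)
gender <- c("Man", "Woman", "Woman")
agree  <- c("Agree", "Strongly disagree", "Neutral")

class(age)     # "numeric": printed as <dbl> in a tibble
class(gender)  # "character": printed as <chr>

# Tell R the categories and their order explicitly
agree_f <- factor(agree,
                  levels = c("Strongly disagree", "Disagree", "Neutral",
                             "Agree", "Strongly agree"),
                  ordered = TRUE)

levels(agree_f)          # the five categories, in the order given above
agree_f[1] > agree_f[2]  # comparisons now respect that order
```

Until a variable is converted like this, R treats "Agree" and "Strongly disagree" as unrelated pieces of text.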
4. Univariate Analysis: One Variable at a Time
Univariate analysis means looking at each variable on its own. The goal is to understand its distribution, what values appear, how often, and whether anything looks unusual.
We can use a series of descriptive statistics and graphs to help us to understand and make sense of the data.
4.1. Categorical Variables: Frequency Tables
For categorical variables (like gender, ethnicity, area type) we use frequency tables.
The simplest way of getting a frequency table is using the function table()
table(survey$gender) # You need to include the name of the dataset plus the symbol $ before the variable
Man Non-binary Prefer not to say Woman
134 7 7 152
The tabyl() function from janitor is a better choice: its output is cleaner than base R’s table() and it also gives you percentages.
# Frequency table for gender
survey %>%
tabyl(gender) %>%
adorn_pct_formatting(digits = 1) # show percentages rounded to 1 decimal place

 gender n percent
Man 134 44.7%
Non-binary 7 2.3%
Prefer not to say 7 2.3%
Woman 152 50.7%
Read this as: for each category of gender, we see the count (n) and the percentage of the total (percent).
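The percent column is simply each count divided by the total. As a small check in base R, the sketch below re-derives the percentages from the counts printed in the gender table above (the vector n is typed in by hand for illustration).

```r
# Counts copied from the gender table above
n <- c(Man = 134, `Non-binary` = 7, `Prefer not to say` = 7, Woman = 152)

sum(n)  # total number of respondents

# Each percentage is the category count divided by the total count
percent <- round(100 * n / sum(n), 1)
percent
```

The results match the percent column that tabyl() and adorn_pct_formatting() produce.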
Task 2: Produce frequency tables for the following variables. Make a note of the most common category in each.
In the code below, replace the ‘_________’ with the variable requested. Remember R is case sensitive so make sure the name of the variable in the code matches the name of the variable in the dataset.
# Frequency table for ethnicity
survey %>%
tabyl(___) %>%
adorn_pct_formatting(digits = 1)
# Frequency table for education
survey %>%
tabyl(___) %>%
adorn_pct_formatting(digits = 1)
# Frequency table for victim_of_crime
survey %>%
tabyl(___) %>%
adorn_pct_formatting(digits = 1)

- Most common ethnicity: ____________
- Most common education level: ____________
- Proportion who have been a victim of crime: ____________
4.2. Checking for Missing Values
Missing values in R are recorded as NA. It is essential to know where they are before you start analysing. The function is.na() is a logical function which looks at each observation and evaluates whether it is a valid case or a missing case. When used with the function sum() the result will be the number of NA (missing cases) for a particular variable.
sum(is.na(survey$gender))

A more efficient approach is to combine functions so that all variables in the dataset are checked at once:
# Count missing values in each variable
survey %>%
summarise(across(everything(), ~ sum(is.na(.))))

The across(everything(), ...) part tells R to apply the function to every column. The ~ and . are how you write “apply this to each column” in modern dplyr.
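To see each piece of this check at work, here is a minimal sketch on a made-up four-row dataset (toy and its values are hypothetical). It uses .x inside the formula, which dplyr accepts as an alternative spelling of the . placeholder.

```r
library(dplyr)

# Hypothetical mini-dataset with a few missing values
toy <- tibble(
  age    = c(51, NA, 36, 24),
  gender = c("Man", "Woman", NA, NA)
)

is.na(toy$age)       # one TRUE/FALSE value per row
sum(is.na(toy$age))  # TRUE counts as 1, so the sum is the number of NAs

# The same check applied to every column at once
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts

# Base R equivalent
colSums(is.na(toy))
```

Both approaches return one count per column; the dplyr version returns a one-row tibble, the base R version a named vector.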
You will notice that contact_satisfaction has a lot of NA values. But look at the raw data: participants who answered “No” to had_police_contact have NA because the question did not apply to them, not because they skipped it. This is a good example of why you always need to understand why data is missing, not just that it is.
# Check the values in contact_satisfaction
survey %>%
tabyl(contact_satisfaction) %>%
adorn_pct_formatting(digits = 1)

 contact_satisfaction n percent valid_percent
1 41 13.7% 24.3%
2 21 7.0% 12.4%
3 34 11.3% 20.1%
4 35 11.7% 20.7%
5 38 12.7% 22.5%
NA 131 43.7% -
This is structural missingness: the question simply did not apply to those respondents. It is not a data quality problem, but it means any analysis of contact_satisfaction should be restricted to those who had contact.
Values identified as missing are left out of the valid_percent column, so that column gives you the true split of the valid responses for this variable.
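The two percentage columns differ only in their denominator. The sketch below re-derives both from the counts printed in the contact_satisfaction table above (the counts are typed in by hand for illustration).

```r
# Counts copied from the contact_satisfaction table above
n_valid   <- c(`1` = 41, `2` = 21, `3` = 34, `4` = 35, `5` = 38)
n_missing <- 131

# percent: the denominator includes the NA rows
percent <- round(100 * n_valid / (sum(n_valid) + n_missing), 1)

# valid_percent: the denominator counts only the non-missing responses
valid_percent <- round(100 * n_valid / sum(n_valid), 1)

rbind(percent, valid_percent)
```

With 131 of 300 values missing, the choice of denominator nearly doubles each percentage, which is why it matters which column you report.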
Check again, this time excluding those who have not had contact with the police:
# Obtain a frequency table for contact satisfaction for those who have contacted the police
survey %>%
filter(had_police_contact == "Yes") %>% # keep only those who had contact with the police
tabyl(contact_satisfaction) %>%
adorn_pct_formatting(digits = 1)

 contact_satisfaction n percent
1 41 24.4%
2 21 12.5%
3 33 19.6%
4 35 20.8%
5 38 22.6%
The function filter() keeps only the rows that meet a condition, here those who had contact with the police. It is especially useful when your survey contains filter questions!
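As a minimal sketch of how filter() behaves, the made-up dataset below (toy is hypothetical) has four respondents, one of whom has a missing answer to the filter question. Rows where the condition is FALSE or NA are both dropped.

```r
library(dplyr)

# Hypothetical respondents: two with contact, one without, one unknown
toy <- tibble(
  had_police_contact   = c("Yes", "No", "Yes", NA),
  contact_satisfaction = c(4, NA, 2, NA)
)

# Keep only rows where the condition is TRUE
contacted <- toy %>%
  filter(had_police_contact == "Yes")

nrow(toy)        # rows before filtering
nrow(contacted)  # rows after filtering
contacted
```

Comparing the number of rows before and after is a simple way to confirm that the filter did what you intended.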
survey %>%
filter(had_police_contact == "Yes") %>% # keep only those who had contact with the police
tabyl(contact_satisfaction) %>%
adorn_pct_formatting(digits = 1)

4.3. Saving your results
If you want to save the results you can save them as an object (table) in R, by assigning the resulting table to an object of the name of your choice as in the example below. After you run the code the table is stored in the ‘Environment’ panel.
satisfaction_freq <- survey %>% # creates an object called 'satisfaction_freq'
filter(had_police_contact =="Yes") %>%
tabyl(contact_satisfaction) %>%
adorn_pct_formatting(digits = 1)
# You can also save it (export it) as csv file (to be seen in excel)
write_csv(satisfaction_freq, "satisfaction_freq.csv")

Task 3: Are there missing values in trust_police, perceived_fairness, or police_effectiveness? Run the missing value check and note your finding.
4.4. Recoding categorical variables
Now let’s inspect our attitudinal variables. These are measured on a 1–5 scale where higher values indicate more positive attitudes (e.g., 5 = very high trust, 1 = very low trust).
Hint: Because trust_police is stored as a numerical variable, you can convert it to a categorical variable using the following code:
survey$trust_police_f <- factor(survey$trust_police, labels = c("Very low", "Low", "Neutral", "High", "Very high"))
This creates a new variable called survey$trust_police_f. The <- symbol tells R to take the result of the operation on its right, factor(survey$trust_police, labels = c("Very low", "Low", "Neutral", "High", "Very high")), and store it in the variable named on its left: survey$trust_police_f.
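One caution: when only labels is given, factor() attaches the labels to whichever distinct values appear in the data, in sorted order, so it fails if a code never occurs. A slightly more defensive sketch, using a made-up vector (scores is hypothetical) in which nobody chose 5, adds levels = 1:5 so that every code is stated explicitly.

```r
# Hypothetical scores in which nobody chose 5
scores <- c(3, 1, 4, 2, 3, NA)

scale_labels <- c("Very low", "Low", "Neutral", "High", "Very high")

# levels = 1:5 states every possible code, so each label stays attached
# to the right number even when a code is absent from the data
scores_f <- factor(scores, levels = 1:5, labels = scale_labels)

table(scores_f, useNA = "ifany")
```

The unused category still appears in the table with a count of zero, which is usually what you want in a frequency table for a 1–5 scale.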
Now let’s do it with the data:
survey$trust_police_f <- factor(survey$trust_police,
labels = c("Very low", "Low", "Neutral", "High", "Very high"))

# Let's start with 'trust_police'
survey %>%
tabyl(trust_police_f) %>%
adorn_pct_formatting(digits = 1)

 trust_police_f n percent
Very low 17 5.7%
Low 90 30.0%
Neutral 117 39.0%
High 66 22.0%
Very high 10 3.3%
Task 4: What can you say about the level of trust in the police? Look at the frequencies for the perceived_fairness and support_community_policing variables (convert them to factors first). Which attitude tends to be rated highest? Which lowest? Write a short interpretation (2–3 sentences) of what this tells us about public attitudes in this sample.
# Convert numeric variables to factor
## perceived_fairness
survey$perceived_fairness_f <- factor(survey$perceived_fairness,
labels = c("Very low", "Low", "Neutral", "High", "Very high"))
## support_community_policing
survey$support_community_policing_f <- factor(survey$__________,
labels = c("Very low", "Low", "Neutral", "High", "Very high"))
# Frequency tables (pipe the dataset straight into tabyl(), as in the earlier examples)
survey %>%
tabyl(perceived_fairness_f) %>%
adorn_pct_formatting(digits = 1)
survey %>%
tabyl(_______) %>%
adorn_pct_formatting(digits = 1)

4.5. Continuous Variables: Summary Statistics
For numeric variables we want measures of central tendency (mean, median) and spread (standard deviation, range).
# Summary statistics for age
summary(survey$age)

 Min. 1st Qu. Median Mean 3rd Qu. Max.
 18.00 28.00 37.00 38.51 48.00 75.00
# Standard deviation: not included in summary()
sd(survey$age)

[1] 13.35481
The %>% pipe lets us chain operations together in a readable sequence. Here we use it to get a richer set of statistics in one block:
survey %>%
summarise(
mean_age = mean(age), # mean is the function for mean or average
sd_age = sd(age), # sd is for standard deviation (spread of the data)
median_age = median(age), # median is the central value if ordered from lower to higher
min_age = min(age), # min is the minimum value of the variable
max_age = max(age) # max is the maximum value of the variable
)

# A tibble: 1 × 5
mean_age sd_age median_age min_age max_age
<dbl> <dbl> <dbl> <dbl> <dbl>
1 38.5 13.4 37 18 75
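The summaries above return numbers, which implies that age has no missing values. For a variable that does contain NA, these functions return NA unless told to drop the missing cases first. A minimal sketch with a made-up vector (ages is hypothetical):

```r
# Hypothetical ages with one missing value
ages <- c(18, 25, 40, NA)

mean(ages)                  # NA: a single missing value makes the result missing
mean(ages, na.rm = TRUE)    # drops the NA before averaging
sd(ages, na.rm = TRUE)      # same argument for the standard deviation
median(ages, na.rm = TRUE)  # and for the median
```

The same na.rm = TRUE argument can be added inside summarise(), for example mean(age, na.rm = TRUE).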
Task 5: What can we say about age?
5. Visual Exploration
Numbers tell us a lot, but visualisations help us see patterns that statistics can miss. We use ggplot2 (part of tidyverse) for all our charts.
5.1. Bar Charts for Categorical Variables
plot1<- ggplot(data = survey) + # tell it which dataset to use
geom_bar(mapping = aes(x = gender)) + # type of graph and variables used
labs(title = "Distribution of gender in the sample",
x = "Gender", y = "Count") +
theme_minimal()
# Save the plot
ggsave(plot1, filename = "gender.png")

You can also save the plot by clicking Export in the Plots pane (bottom right of the window in the default layout).
Task 6: Produce a bar chart for ethnicity and another for area_type. Do the distributions match what you found in the frequency tables?
# Bar chart for ethnicity
ggplot(data = survey) +
geom_bar(mapping = aes(x = ___)) +
labs(title = "___", x = "___", y = "Count") +
theme_minimal()

5.2. Histograms for Continuous Variables
ggplot(data = survey) +
geom_histogram(mapping = aes(x = age), binwidth = 5, fill = "steelblue", na.rm = TRUE, colour = "white") +
labs(title = "Age distribution of respondents",
x = "Age", y = "Count") +
theme_minimal()

Task 7: Produce a bar graph for trust_police_f.
ggplot(data = survey) +
geom_bar(mapping = aes(x = _________), fill = "steelblue", na.rm =TRUE) +
labs(title = "Distribution of trust in police", x = "Trust", y = "Count") +
theme_minimal()

6. Bivariate Analysis: Exploring Relationships
Once we understand individual variables, we look at how they relate to each other. This is where criminological insights start to emerge.
6.1. Crosstabs for Two Categorical Variables
A crosstab (contingency table) shows the joint distribution of two categorical variables. We use janitor’s tabyl() with some additional formatting functions.
In the first example we are interested in the relationship between victimisation and gender.
survey %>%
tabyl(gender, victim_of_crime, show_na = FALSE) %>% # the two variables
adorn_percentages("row") %>% # row percentages (% within each gender)
adorn_pct_formatting(digits = 1) %>% # round to 1 decimal
adorn_ns() # add raw counts in brackets

 gender No Yes
Man 64.9% (87) 35.1% (47)
Non-binary 85.7% (6) 14.3% (1)
Prefer not to say 71.4% (5) 28.6% (2)
Woman 77.0% (117) 23.0% (35)
Read this table row by row: of all the men in the sample, what percentage were victims of crime? Of all the women?
male victimisation: ________________________________
female victimisation:_______________________________
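Row percentages divide each cell by its row total. As a check in base R, the sketch below re-derives them for two of the rows from the counts in brackets in the table above (the matrix is typed in by hand for illustration).

```r
# Counts copied from the gender by victim_of_crime table above
counts <- matrix(c(87, 47,
                   117, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(gender = c("Man", "Woman"),
                                 victim = c("No", "Yes")))

rowSums(counts)  # row totals: the denominators for row percentages

# margin = 1 divides each cell by its row total
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct
```

Each row adds up to 100%, so the figures are comparable across genders even though the groups differ in size.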
Task 8: Explore the relationship between victim_of_crime and ethnicity.
survey %>%
tabyl(___, ___, show_na = FALSE) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns()

Does victimisation vary by ethnic group? Make a note of any patterns you find.
What happens if you change "row" to "col" in adorn_percentages("row")? ____________________________________________________________
Task 9: Now explore the relationship between trust_police_f (the factor version of trust_police) and victim_of_crime. Does being a victim of crime appear to relate to levels of trust? We want to understand whether victimisation influences attitudes towards the police.
Victimisation is an event that can shape attitudes towards the police, so it can be seen as an independent variable or predictor.
Trust in the police can change depending on other factors, such as victimisation, so it can be seen as a dependent variable or outcome.
survey %>%
tabyl(victim_of_crime, trust_police_f, show_na = FALSE) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns()
# A more readable table (we change the order of the variables and used column percentages)
survey %>%
tabyl(trust_police_f, victim_of_crime, show_na = FALSE) %>%
adorn_percentages("col") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns()

6.2. Visualising Group Differences
Bar charts with a fill colour are a good way to visualise the relationship between two categorical variables.
ggplot(data = survey, mapping = aes(x = area_type, fill = victim_of_crime)) +
geom_bar(position = "fill") + # "fill" makes each bar sum to 100%
scale_y_continuous(labels = scales::percent) +
labs(title = "Victimisation by area type",
x = "Area type", y = "Proportion",
fill = "Victim of crime?") +
theme_minimal()

Try comparing victim of crime by gender:
ggplot(data = survey, mapping = aes(x = _____, fill = victim_of_crime)) +
geom_bar(position = "fill") + # "fill" makes each bar sum to 100%
scale_y_continuous(labels = scales::percent) +
labs(title = "Victimisation by ____",
x = "____", y = "Proportion",
fill = "Victim of crime?") +
theme_minimal()

Task 10: Based on all the analysis you have done, write 3–5 sentences summarising the main patterns you have found. Think about: who trusts the police most and least? Does victimisation experience seem to matter? Are there gender differences in trust in the police?
7. A Note on Your Own Qualtrics Data
When you export your own survey from Qualtrics and load it into R, you will encounter a few things that look different from this cleaned dataset:
- Extra rows at the top: Qualtrics exports include two header rows: the variable name row and a question text row. You will need to skip or remove the second row after import.
- Numeric codes for Likert items: Qualtrics often stores responses as numbers (1, 2, 3…) without the labels. You will need to add labels manually using factor() with a labels = argument, exactly as shown in section 4.4 above.
- Timing and metadata columns: Qualtrics adds columns like StartDate, EndDate, Duration, IPAddress, LocationLatitude. You can remove these in the comma-separated (“.csv”) file or in R using the following code: your_survey_name <- your_survey_name %>% select(-column_name).
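These clean-up steps can be sketched on a made-up export. Everything in raw below is hypothetical (the column Q1, the dates and the addresses are invented), and the sketch assumes exactly one extra question-text row as described above; check your own export, as it may contain more.

```r
library(dplyr)

# Hypothetical export: the first data row holds question text, so
# every column arrives as text
raw <- tibble(
  StartDate = c("Start Date", "2024-01-10", "2024-01-11"),
  IPAddress = c("IP Address", "0.0.0.0", "0.0.0.0"),
  Q1        = c("How much do you trust the police?", "4", "2")
)

cleaned <- raw %>%
  slice(-1) %>%                                      # drop the question-text row
  select(-any_of(c("StartDate", "IPAddress"))) %>%   # drop metadata columns
  mutate(Q1 = as.numeric(Q1))                        # turn text codes back into numbers

cleaned
```

any_of() is used so that the code still runs if one of the listed metadata columns is not present in your file.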
8. Saving Your Work
In Posit Cloud your script is saved automatically. If you want to save the cleaned version of the data (with any new variables you created) you can use:
# Save as an R data file — preserves all your variable types
save(survey, file = "survey_cleaned.RData")
# Or save as CSV if you want to open it in Excel or share it
write_csv(survey, "survey_cleaned.csv")

9. Going Further: Real Crime Survey Data
The dataset you have worked with today was designed to be clean, well-structured, and immediately usable, which made it ideal for learning the fundamentals. But it is also simulated. Real survey data, especially large-scale national surveys, tends to be messier, richer, and considerably more interesting.
The Crime Survey for England and Wales (CSEW) is archived at the UK Data Service (UKDS), the national repository for social science data in the UK. Registering takes a few minutes and gives you access to hundreds of datasets across criminology, health, economics, and social policy.
Why didn’t we use real survey data today? Working with real survey data such as the CSEW or the Scottish Crime and Justice Survey (SCJS) requires a few extra steps: the files come in the formats of other software (Stata or SPSS) rather than CSV, which requires a different import package, and variables are stored as numeric codes with labels rather than plain text. None of this is difficult, but learning those steps alongside everything else in two hours would have been too much.
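For a sense of what those extra steps look like, here is a minimal sketch using the haven package. This is an assumption: the text above does not name the import package, and haven is only one commonly used option. The data are made up and written to a temporary SPSS file so the example does not depend on a real download.

```r
library(haven)

# Hypothetical data with a labelled numeric variable, written to a
# temporary SPSS file so the example is self-contained
toy <- tibble::tibble(
  victim = labelled(c(1, 2, 2), labels = c(Yes = 1, No = 2))
)
toy_path <- tempfile(fileext = ".sav")
write_sav(toy, toy_path)

imported <- read_sav(toy_path)  # read_dta() is the Stata counterpart
imported$victim                 # numeric codes with value labels attached

# as_factor() swaps the codes for their labels
imported$victim_f <- as_factor(imported$victim)
levels(imported$victim_f)
```

After this conversion the variable behaves like the factors used earlier in the practical, so functions such as tabyl() can be applied in the same way.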
Further Learning: Free Online Training from the UK Data Service
The UK Data Service has developed a free, self-paced online training module specifically on analysing crime data using R. It covers the CSEW directly and builds on exactly the skills you have practised today.
Introduction to Crime Data Analysis using R
UK Data Service free and self-paced resources
This is an excellent resource to consolidate what you have learned today and extend it to working with real, nationally representative data. If you found today’s session useful, this is your obvious next step.
Summary of Functions Used Today
| Function | Package | What it does |
|---|---|---|
| read_csv() | readr (tidyverse) | Import a CSV file |
| head() | base R | Show first 6 rows |
| glimpse() | dplyr (tidyverse) | Structured variable overview |
| tabyl() | janitor | Frequency tables and crosstabs |
| adorn_percentages() | janitor | Add percentages to a tabyl |
| summary() | base R | Min, max, mean, median, quartiles |
| sd() | base R | Standard deviation |
| summarise() | dplyr | Custom summary statistics |
| across() | dplyr | Apply a function across multiple columns |
| sum(is.na()) | base R | Count missing values |
| factor() | base R | Convert to categorical with labels |
| ggplot() + geom_bar() | ggplot2 | Bar chart |
| ggplot() + geom_histogram() | ggplot2 | Histogram |
| ggplot() + geom_boxplot() | ggplot2 | Boxplot |
This practical was developed for a master’s-level introduction to data analysis in criminology. It draws on materials originally created for the UK Data Service Introduction to Crime Data workshop (University of Manchester, February 2020). The simulated dataset on public attitudes to policing was designed to reflect the structure of Qualtrics survey exports and the thematic content of the Crime Survey for England and Wales.
AI Statement
I used ELM (University of Edinburgh’s generative AI gateway) for a generic R example of simulating mixed‑type survey data. I adapted the approach, and the dataset used here was specified and generated independently to meet the requirements for this workshop.