Applied Criminological Research Methods: Quantitative data analysis

Author

Ana Morales-Gomez

Published

March 5, 2026

Welcome

In this practical you will analyse a simulated (not real) survey dataset on public attitudes to policing and the criminal justice system. The dataset contains responses from 300 participants and covers a range of topics such as: trust in police, perceptions of fairness, victimisation experience, and demographic characteristics.

The structure of the data is similar to what you will get when you export your own Qualtrics survey: rows are respondents and columns are variables (questions). By the end of this session you will be able to:

Load data into R and inspect its structure
Produce frequency tables and summary statistics
Check for and handle missing values
Visualise the distribution of variables
Explore associations between variables using crosstabs and charts

1. Before We Start

1.1. Posit Cloud

We are using Posit Cloud (formerly RStudio Cloud), which means you do not need to install anything. Everything runs in your browser. Follow the instructions in the pre-session tasks document to get the data in your session.

1.2. Open a new Script

First, open a new script in RStudio and save it in your session, so that you can come back to it later if you want to revise or modify your code.

In RStudio:

Go to File… New File… R script

In R we use the symbol # on a line to add comments. These are useful to add quick notes about the code or to provide more explanations of what you are doing.

1.3. Load Packages

Packages are collections of extra functions that extend what R can do. Think of base R as a basic kitchen, and packages as specialist equipment. We load them at the start of every session.

We are using two packages today:

tidyverse: a collection of tools for data manipulation and visualisation (includes ggplot2 for charts and dplyr for data manipulation/wrangling)
janitor: makes frequency tables and crosstabs very easy to produce

# Install packages, only the first time using R
install.packages("tidyverse")
install.packages("janitor")

# Load packages: run this first, every session
library(tidyverse)
library(janitor)

2. Import the Data

Import the dataset called policing_attitudes.csv. It is a comma-separated values (CSV) file, one of the most common formats for storing data, including exports from Qualtrics. The read_csv() function used below comes from the readr package, which is loaded as part of tidyverse, so run library(tidyverse) first.

# Import the dataset
survey <- read_csv("policing_attitudes.csv")

We have given our dataset the name survey. You will use this name every time you refer to the data in your code.
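To see what read_csv() does without needing the workshop file, the sketch below writes a tiny, made-up CSV to a temporary file and reads it back. The file contents and the object name toy are invented for illustration; they are not part of the workshop data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,gender", "51,Man", "35,Woman"), tmp)

# Read it back; show_col_types = FALSE hides the column-type message
toy <- read_csv(tmp, show_col_types = FALSE)

# The first line of the file became the column names
names(toy)
nrow(toy)
```

Each later line of the file becomes one row, which is the same rows-are-respondents layout as the survey object.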

3. Getting to Know the Data

Before any analysis, we need to understand what we have. This is not optional: it is the first and most important step of any data analysis project.

3.1. First Look

To inspect the dataset, use the function View():

View(survey)

You can also use the function head(), which shows you the first 6 rows of the dataset. It is useful for a quick look at the data in the console, especially when you have large datasets.

# See the first 6 rows
head(survey)

# A tibble: 6 × 12
    age gender ethnicity  education area_type victim_of_crime had_police_contact
  <dbl> <chr>  <chr>      <chr>     <chr>     <chr>           <chr>             
1    51 Man    White Bri… Undergra… Suburban  Yes             Yes               
2    35 Woman  Asian or … GCSEs or… Urban     Yes             Yes               
3    36 Woman  Black or … Undergra… Rural     No              No                
4    24 Man    White Bri… Postgrad… Suburban  No              Yes               
5    62 Man    White Bri… GCSEs or… Rural     Yes             Yes               
6    22 Man    Other      GCSEs or… Suburban  No              No                
# ℹ 5 more variables: trust_police <dbl>, perceived_fairness <dbl>,
#   police_effectiveness <dbl>, support_community_policing <dbl>,
#   contact_satisfaction <dbl>

Another way to get a quick overview of your data is the function glimpse():

glimpse(survey)

Rows: 300
Columns: 12
$ age                        <dbl> 51, 35, 36, 24, 62, 22, 52, 36, 62, 41, 45,…
$ gender                     <chr> "Man", "Woman", "Woman", "Man", "Man", "Man…
$ ethnicity                  <chr> "White British", "Asian or Asian British", …
$ education                  <chr> "Undergraduate degree", "GCSEs or equivalen…
$ area_type                  <chr> "Suburban", "Urban", "Rural", "Suburban", "…
$ victim_of_crime            <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No"…
$ had_police_contact         <chr> "Yes", "Yes", "No", "Yes", "Yes", "No", "Ye…
$ trust_police               <dbl> 5, 3, 3, 3, 3, 3, 4, 4, 2, 2, 3, 2, 3, 2, 1…
$ perceived_fairness         <dbl> 2, 1, 4, 4, 3, 2, 3, 4, 3, 3, 3, 3, 3, 4, 2…
$ police_effectiveness       <dbl> 4, 2, 4, 2, 3, 2, 4, 3, 3, 4, 1, 4, 1, 2, 2…
$ support_community_policing <dbl> 3, 1, 3, 4, 3, 3, 3, 3, 2, 4, 2, 4, 4, 4, 4…
$ contact_satisfaction       <dbl> 2, 1, NA, 1, 5, NA, 1, NA, NA, 1, NA, 5, 3,…

Task 1: Look at the output of glimpse() and answer the following:

How many observations (rows) does the dataset have? ____________
How many variables (columns)? ____________
What data type is age? ____________
What data type is gender? ____________

A note on data types: <dbl> means numeric (a number, such as age). <chr> means character (text), which is how categorical variables are stored here. Most of your Qualtrics variables will come in as <chr>. R does not automatically know that a response such as “1 = Strongly disagree” is an ordered category rather than just a number or a word. We will deal with this below.
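If you want to check data types yourself, a minimal sketch along these lines may help. The toy data frame is hypothetical; class() and sapply() are base R functions.

```r
# Hypothetical mini-dataset: one numeric and one text column
toy <- data.frame(
  age    = c(51, 35, 36),
  gender = c("Man", "Woman", "Woman"),
  stringsAsFactors = FALSE
)

# class() reports how R stores each column
col_classes <- sapply(toy, class)
col_classes

# Arithmetic only makes sense for the numeric column
mean(toy$age)
```

The same sapply(survey, class) call should work on the workshop data once it has been imported.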

4. Univariate Analysis: One Variable at a Time

Univariate analysis means looking at each variable on its own. The goal is to understand its distribution: what values appear, how often, and whether anything looks unusual.

We can use a series of descriptive statistics and graphs to help us to understand and make sense of the data.

4.1. Categorical Variables: Frequency Tables

For categorical variables (like gender, ethnicity, area type) we use frequency tables.

The simplest way of getting a frequency table is using the function table()

table(survey$gender) # You need to include the name of the dataset plus the symbol $ before the variable


              Man        Non-binary Prefer not to say             Woman 
              134                 7                 7               152

The tabyl() function from janitor is a better choice: its output is cleaner than base R’s table() and it gives you percentages.

# Frequency table for gender
survey %>%
  tabyl(gender) %>%
  adorn_pct_formatting(digits = 1)  # show percentages rounded to 1 decimal place

            gender   n percent
               Man 134   44.7%
        Non-binary   7    2.3%
 Prefer not to say   7    2.3%
             Woman 152   50.7%

Read this as: for each category of gender, we see the count (n) and the percentage of the total (percent).
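To make the percent column less of a black box, here is a sketch of the same arithmetic in base R. It rebuilds the gender column from the counts printed above rather than reading the real file, so the object names are illustrative only.

```r
# Rebuild the gender column from the counts shown in the table above
gender <- rep(
  c("Man", "Non-binary", "Prefer not to say", "Woman"),
  times = c(134, 7, 7, 152)
)

counts <- table(gender)                     # n: respondents per category
pct <- round(100 * prop.table(counts), 1)   # percent: n divided by the total

counts
pct
```

Each percentage is simply the category count divided by the 300 respondents, which is what tabyl() reports in a tidier layout.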

Task 2: Produce frequency tables for the following variables. Make a note of the most common category in each.

In the code below, replace the ‘_________’ with the variable requested. Remember R is case sensitive so make sure the name of the variable in the code matches the name of the variable in the dataset.

# Frequency table for ethnicity
survey %>%
  tabyl(___) %>%
  adorn_pct_formatting(digits = 1)

# Frequency table for education
survey %>%
  tabyl(___) %>%
  adorn_pct_formatting(digits = 1)

# Frequency table for victim_of_crime
survey %>%
  tabyl(___) %>%
  adorn_pct_formatting(digits = 1)

Most common ethnicity: ____________
Most common education level: ____________
Proportion who have been a victim of crime: ____________

4.2. Checking for Missing Values

Missing values in R are recorded as NA. It is essential to know where they are before you start analysing. The function is.na() is a logical function which looks at each observation and evaluates whether it is a valid case or a missing case. When used with the function sum() the result will be the number of NA (missing cases) for a particular variable.

sum(is.na(survey$gender))
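The logic of combining is.na() with sum() can be seen on a short, made-up vector (the scores below are hypothetical):

```r
# Hypothetical satisfaction scores with two missing answers
scores <- c(2, 1, NA, 5, NA, 3)

is.na(scores)                      # one TRUE/FALSE per observation
n_missing <- sum(is.na(scores))    # TRUE counts as 1, FALSE as 0
n_missing

mean(is.na(scores))                # proportion of observations that are missing
```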

Another more efficient way of looking at it is using a combination of different functions that look at all variables in the dataset at once:

# Count missing values in each variable
survey %>%
  summarise(across(everything(), ~ sum(is.na(.))))

The across(everything(), ...) part tells R to apply the function to every column. The ~ and . are how you write “apply this to each column” in modern dplyr.
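As a rough illustration of what the ~ and . shorthand expands to, the sketch below applies the same check to a hypothetical two-column tibble and compares it with a base R alternative, colSums().

```r
library(dplyr)

# Hypothetical data with missing values in both columns
toy <- tibble(
  gender               = c("Man", "Woman", NA, "Woman"),
  contact_satisfaction = c(2, NA, NA, 5)
)

# ~ sum(is.na(.)) is shorthand for function(x) sum(is.na(x)),
# applied to every column in turn
missing_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.))))

missing_counts

# Base R gives the same counts as a named vector
colSums(is.na(toy))
```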

You will notice that contact_satisfaction has a lot of NA values. But look at the raw data: participants who answered “No” to had_police_contact have NA because they were never asked the question, not because they failed to answer it. This is a good example of why you always need to understand why data is missing, not just that it is.

# Check the values in contact_satisfaction
survey %>%
  tabyl(contact_satisfaction) %>%
  adorn_pct_formatting(digits = 1)

 contact_satisfaction   n percent valid_percent
                    1  41   13.7%         24.3%
                    2  21    7.0%         12.4%
                    3  34   11.3%         20.1%
                    4  35   11.7%         20.7%
                    5  38   12.7%         22.5%
                   NA 131   43.7%             -

This is structural missingness: the question simply did not apply to those respondents. It is not a data quality problem, but it means any analysis of contact_satisfaction should be restricted to those who had contact.

Missing values are excluded from the valid_percent column, so it gives you the true split of the valid responses to this variable.
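The difference between the two percentage columns comes down to the denominator. A small worked sketch, using the counts copied from the table above (the object names are illustrative):

```r
# Counts copied from the contact_satisfaction table above
n_valid   <- c(`1` = 41, `2` = 21, `3` = 34, `4` = 35, `5` = 38)
n_missing <- 131

total <- sum(n_valid) + n_missing   # all respondents, including NA

percent       <- round(100 * n_valid / total, 1)         # denominator includes NA
valid_percent <- round(100 * n_valid / sum(n_valid), 1)  # denominator excludes NA

total
percent
valid_percent
```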

Check again, this time excluding those who have not had contact with the police:

# Obtain a frequency table for contact satisfaction for those who have contacted the police
survey %>%
  filter(had_police_contact =="Yes") %>% # only select those who had contacted the police 
  tabyl(contact_satisfaction) %>%
  adorn_pct_formatting(digits = 1)

 contact_satisfaction  n percent
                    1 41   24.4%
                    2 21   12.5%
                    3 33   19.6%
                    4 35   20.8%
                    5 38   22.6%

The function filter() keeps only the rows that meet a condition, here those who had contact with the police. It is especially useful when your survey has filter (routing) questions!

survey %>% # start from the full dataset
  filter(had_police_contact =="Yes") %>% # to select those who had contacted the police only
  tabyl(contact_satisfaction) %>%
  adorn_pct_formatting(digits = 1)
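The two tables above differ by one respondent for score 3 (34 versus 33), which may mean that someone without recorded police contact still has a satisfaction score. A sketch of how such a routing check could look, on hypothetical data where the last row has exactly that problem:

```r
library(dplyr)

# Hypothetical responses: the last row has a score despite answering "No"
toy <- tibble(
  had_police_contact   = c("Yes", "Yes", "No", "No", "No"),
  contact_satisfaction = c(4, 2, NA, NA, 3)
)

# filter() keeps rows where the condition is TRUE; note the double ==
contacted <- toy %>%
  filter(had_police_contact == "Yes")

# Cross-check the routing: who gave an answer, by contact status?
routing_check <- toy %>%
  count(had_police_contact, answered = !is.na(contact_satisfaction))

contacted
routing_check
```

A row with had_police_contact equal to "No" and answered equal to TRUE would flag a respondent worth inspecting before analysis.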

4.3. Saving your results

If you want to save the results you can save them as an object (table) in R, by assigning the resulting table to an object of the name of your choice as in the example below. After you run the code the table is stored in the ‘Environment’ panel.

satisfaction_freq <- survey %>% # creates an object called 'satisfaction_freq'
  filter(had_police_contact =="Yes") %>% 
  tabyl(contact_satisfaction) %>%
  adorn_pct_formatting(digits = 1)


# You can also export it as a CSV file (to open in Excel)

write_csv(satisfaction_freq, "satisfaction_freq.csv")

Task 3: Are there missing values in trust_police, perceived_fairness, or police_effectiveness? Run the missing value check and note your finding.

4.4. Recoding categorical variables

Now let’s inspect our attitudinal variables. These are measured on a 1–5 scale where higher values indicate more positive attitudes (e.g., 5 = very high trust, 1 = very low trust).

Hint: Because trust_police is stored as a numerical variable, you can convert it to a categorical variable using the following code:

survey$trust_police_f <- factor(survey$trust_police, labels = c("Very low", "Low", "Neutral", "High", "Very high"))

First we create a new variable called survey$trust_police_f. The <- symbol tells R to take the result of the operation on its right, factor(survey$trust_police, labels = c("Very low", "Low", "Neutral", "High", "Very high")), and store it in the variable named on its left: survey$trust_police_f.

Now let’s do it with the data:

survey$trust_police_f <- factor(survey$trust_police, 
                                labels = c("Very low", "Low", "Neutral", "High", "Very high"))

# Let's start with 'trust_police'
survey %>%
  tabyl(trust_police_f) %>%
  adorn_pct_formatting(digits = 1)

 trust_police_f   n percent
       Very low  17    5.7%
            Low  90   30.0%
        Neutral 117   39.0%
           High  66   22.0%
      Very high  10    3.3%
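One caveat worth knowing: without a levels = argument, factor() matches labels to the values that actually occur in the data, in sorted order. If a response option was never chosen, five labels no longer match and the call stops with an error. The sketch below, on hypothetical scores, shows one way to guard against this by stating the levels explicitly; it is a suggestion rather than part of the workshop code.

```r
# Hypothetical scores in which nobody chose 1 ("Very low")
scores <- c(2, 3, 3, 4, 5, 2)

trust_labels <- c("Very low", "Low", "Neutral", "High", "Very high")

# Naming the levels explicitly ties each label to its code,
# even when a code never appears in the data
scores_f <- factor(scores, levels = 1:5, labels = trust_labels)

table(scores_f)   # "Very low" is kept, with a count of 0
```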

Task 4: What can you say about the level of trust in the police? Look at the frequencies for the perceived_fairness and support_community_policing variables (convert them to factors first). Which attitude tends to be rated highest? Which lowest? Write a short interpretation (2–3 sentences) of what this tells us about public attitudes in this sample.

# Convert numeric variables to factor

## perceived_fairness

survey$perceived_fairness_f <- factor(survey$perceived_fairness, 
                                labels = c("Very low", "Low", "Neutral", "High", "Very high"))

## support_community_policing

survey$support_community_policing_f <- factor(survey$__________, 
                                labels = c("Very low", "Low", "Neutral", "High", "Very high"))


# Frequency tables
survey %>%
  tabyl(perceived_fairness_f) %>%
  adorn_pct_formatting(digits = 1)


survey %>%
  tabyl(_______) %>%
  adorn_pct_formatting(digits = 1)

4.5. Continuous Variables: Summary Statistics

For numeric variables we want measures of central tendency (mean, median) and spread (standard deviation, range).

# Summary statistics for age
summary(survey$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   28.00   37.00   38.51   48.00   75.00

# Standard deviation — not included in summary()
sd(survey$age)

[1] 13.35481
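The age variable in this dataset appears to be complete, but in your own data a single NA makes mean(), median() and sd() return NA unless you add na.rm = TRUE. The sketch below uses hypothetical ages to show this, and also computes the standard deviation by hand to show what sd() measures.

```r
# Hypothetical ages, one of them missing
ages <- c(18, 22, 37, 48, 75, NA)

mean(ages)                                # NA: a missing value makes the result unknown
mean_age   <- mean(ages, na.rm = TRUE)    # drop NA before averaging
median_age <- median(ages, na.rm = TRUE)

# Standard deviation by hand: square root of the summed squared
# distances from the mean, divided by n - 1
valid     <- ages[!is.na(ages)]
sd_manual <- sqrt(sum((valid - mean(valid))^2) / (length(valid) - 1))

mean_age
median_age
all.equal(sd_manual, sd(ages, na.rm = TRUE))
```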

The %>% pipe lets us chain operations together in a readable sequence. Here we use it to get a richer set of statistics in one block:

survey %>%
  summarise(
    mean_age    = mean(age), # mean is the function for mean or average
    sd_age      = sd(age),   # sd is for standard deviation (spread of the data)
    median_age  = median(age), # median is the central value if ordered from lower to higher
    min_age     = min(age),    # min is the minimum value of the variable
    max_age     = max(age)     # max is the maximum value of the variable
  )

# A tibble: 1 × 5
  mean_age sd_age median_age min_age max_age
     <dbl>  <dbl>      <dbl>   <dbl>   <dbl>
1     38.5   13.4         37      18      75
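To see what the pipe is doing, it can help to write the same calculation twice: once nested, once piped. The ages below are hypothetical, and the age >= 20 condition is only there to give the chain a second step.

```r
library(dplyr)

# Hypothetical ages
ages <- tibble(age = c(18, 22, 37, 48, 75))

# Nested: read from the inside out
nested <- summarise(filter(ages, age >= 20), mean_age = mean(age))

# Piped: read from top to bottom, one step per line
piped <- ages %>%
  filter(age >= 20) %>%
  summarise(mean_age = mean(age))

identical(nested, piped)
```

Both versions give the same result; the pipe simply passes the output of one step in as the first argument of the next.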

Task 5: What can we say about age?

5. Visual Exploration

Numbers tell us a lot, but visualisations help us see patterns that statistics can miss. We use ggplot2 (part of tidyverse) for all our charts.

5.1. Bar Charts for Categorical Variables

plot1 <- ggplot(data = survey) +         # tell it which dataset to use
  geom_bar(mapping = aes(x = gender)) +  # type of graph and variables used
  labs(title = "Distribution of gender in the sample",
       x = "Gender", y = "Count") +
  theme_minimal()

# Display the plot (assigning it to plot1 stores it without drawing it)
plot1

# Save the plot
ggsave(filename = "gender.png", plot = plot1)

You can also save the plot by clicking Export in the Plots tab (in the bottom-right pane of RStudio by default).

Task 6: Produce a bar chart for ethnicity and another for area_type. Do the distributions match what you found in the frequency tables?

# Bar chart for ethnicity
ggplot(data = survey) +
  geom_bar(mapping = aes(x = ___)) +
  labs(title = "___", x = "___", y = "Count") +
  theme_minimal()

5.2. Histograms for Continuous Variables

ggplot(data = survey) +
  geom_histogram(mapping = aes(x = age), binwidth = 5, fill = "steelblue", na.rm = TRUE, colour = "white") +
  labs(title = "Age distribution of respondents",
       x = "Age", y = "Count") +
  theme_minimal()

Task 7: Produce a bar graph for trust_police_f.

ggplot(data = survey) +
  geom_bar(mapping = aes(x = _________),  fill = "steelblue", na.rm =TRUE) +
  labs(title = "Distribution of trust in police", x = "Trust", y = "Count") +
  theme_minimal()

6. Bivariate Analysis: Exploring Relationships

Once we understand individual variables, we look at how they relate to each other. This is where criminological insights start to emerge.

6.1. Crosstabs for Two Categorical Variables

A crosstab (contingency table) shows the joint distribution of two categorical variables. We use janitor’s tabyl() with some additional formatting functions.

In the first example we are interested in the relationship between victimisation and gender.

survey %>%
  tabyl(gender, victim_of_crime, show_na = FALSE) %>%    # the two variables
  adorn_percentages("row") %>%         # row percentages (% within each gender)
  adorn_pct_formatting(digits = 1) %>% # round to 1 decimal
  adorn_ns()                           # add raw counts in brackets

            gender          No        Yes
               Man 64.9%  (87) 35.1% (47)
        Non-binary 85.7%   (6) 14.3%  (1)
 Prefer not to say 71.4%   (5) 28.6%  (2)
             Woman 77.0% (117) 23.0% (35)

Read this table row by row: of all the men in the sample, what percentage were victims of crime? Of all the women?

male victimisation: ________________________________

female victimisation:_______________________________
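For reference, the row percentages in the crosstab can be reproduced in base R from the counts alone. The sketch below types in the counts from the table above; prop.table() with margin = 1 divides each cell by its row total.

```r
# Counts copied from the crosstab above
counts <- matrix(
  c(87, 47,
     6,  1,
     5,  2,
   117, 35),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Man", "Non-binary", "Prefer not to say", "Woman"),
    victim_of_crime = c("No", "Yes")
  )
)

# margin = 1 divides each cell by its row total, so every row sums to 100
row_pct <- round(100 * prop.table(counts, margin = 1), 1)
row_pct

rowSums(counts)   # the denominators: 134 men, 152 women, and so on
```

Note how small the denominators are for the two smallest groups: a single respondent shifts their percentages considerably.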

Task 8: Explore the relationship between victim_of_crime and ethnicity.

survey %>%
  tabyl(___, ___, show_na = FALSE) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

Does victimisation vary by ethnic group? Make a note of any patterns you find.

What happens if you change "row" to "col" in adorn_percentages("row")? ____________________________________________________________

Task 9: Now explore the relationship between trust_police_f (the factor version of trust_police) and victim_of_crime. Does being a victim of crime appear to relate to levels of trust? We want to understand whether victimisation influences attitudes towards the police.

Victimisation is an event that can shape attitudes towards the police, so it can be seen as an independent variable or predictor.

Trust in the police is something that can change depending on other factors, such as victimisation, so it can be seen as a dependent variable or outcome.

survey %>%
  tabyl(victim_of_crime, trust_police_f, show_na = FALSE) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns() 

# A more readable table (we change the order of the variables and used column percentages)
survey %>%
  tabyl(trust_police_f, victim_of_crime, show_na = FALSE) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

6.2. Visualising Group Differences

Bar charts with a fill colour are a good way to visualise the relationship between two categorical variables.

ggplot(data = survey, mapping = aes(x = area_type, fill = victim_of_crime)) +
  geom_bar(position = "fill") +          # "fill" makes each bar sum to 100%
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Victimisation by area type",
       x = "Area type", y = "Proportion",
       fill = "Victim of crime?") +
  theme_minimal()
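The heights drawn by position = "fill" are within-group proportions. If you want the numbers behind such a chart, one possible approach is sketched below on hypothetical data (the six respondents are invented for illustration).

```r
library(dplyr)

# Hypothetical respondents
toy <- tibble(
  area_type       = c("Urban", "Urban", "Urban", "Urban", "Rural", "Rural"),
  victim_of_crime = c("Yes", "No", "No", "No", "Yes", "No")
)

# Count each combination, then divide by the total for that area type
props <- toy %>%
  count(area_type, victim_of_crime) %>%
  group_by(area_type) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

Within each area type the proportions add up to 1, which is why every bar in the chart reaches 100%.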

Try comparing victimisation by gender:

ggplot(data = survey, mapping = aes(x = _____, fill = victim_of_crime)) +
  geom_bar(position = "fill") +          # "fill" makes each bar sum to 100%
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Victimisation by ____",
       x = "____", y = "Proportion",
       fill = "Victim of crime?") +
  theme_minimal()

Task 10: Based on all the analysis you have done, write 3–5 sentences summarising the main patterns you have found. Think about: who trusts the police most and least? Does victimisation experience seem to matter? Are there gender differences in trust in the police?

7. A Note on Your Own Qualtrics Data

When you export your own survey from Qualtrics and load it into R, you will encounter a few things that look different from this cleaned dataset:

Extra rows at the top: Qualtrics CSV exports include extra header rows beneath the variable names: a question text row and, in many exports, a further row of import IDs. You will need to skip or remove these extra rows on import so that they are not read as responses.

Numeric codes for Likert items: Qualtrics often stores responses as numbers (1, 2, 3…) without the labels. You will need to add labels manually using factor() with a labels = argument, exactly as shown in section 4.4 above.

Timing and metadata columns: Qualtrics adds columns like StartDate, EndDate, Duration, IPAddress, LocationLatitude. You can remove these in the comma-separated (“.csv”) file or in R using the following code: your_survey_name <- your_survey_name %>% select(-column_name).
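A minimal sketch of how these steps might fit together is shown below. It builds a small, made-up file in the style of a Qualtrics export, and assumes a single question-text row beneath the variable names; if your export has further header rows, increase skip accordingly. The column names Q1 and Q2 and the file contents are hypothetical.

```r
library(readr)
library(dplyr)

# Hypothetical Qualtrics-style export: names, question text, then responses
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "StartDate,Q1,Q2",
  "Start Date,How much do you trust the police?,What is your age?",
  "2026-01-10 09:00:00,4,31",
  "2026-01-10 09:05:00,2,45"
), tmp)

# Step 1: read only the variable names from the first row
col_names <- names(read_csv(tmp, n_max = 0, show_col_types = FALSE))

# Step 2: skip both header rows and reuse the names from step 1
raw <- read_csv(tmp, skip = 2, col_names = col_names, show_col_types = FALSE)

# Step 3: drop a metadata column that is not needed for analysis
clean <- raw %>%
  select(-StartDate)

clean
```

Skipping the extra rows at import, rather than deleting them afterwards, lets read_csv() detect the numeric columns correctly.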

8. Saving Your Work

In Posit Cloud your script is saved automatically. If you want to save the cleaned version of the data (with any new variables you created) you can use:

# Save as an R data file — preserves all your variable types
save(survey, file = "survey_cleaned.RData")

# Or save as CSV if you want to open it in Excel or share it
write_csv(survey, "survey_cleaned.csv")
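What "preserves all your variable types" means in practice can be checked with a quick round trip. The sketch below uses a hypothetical one-column data frame and temporary files, and it uses base R's write.csv() and read.csv() in place of the readr functions so that it runs without any packages.

```r
# Hypothetical data with a factor column
toy <- data.frame(
  trust = factor(c("Low", "High", "Low"), levels = c("Low", "High"))
)

rdata_path <- tempfile(fileext = ".RData")
csv_path   <- tempfile(fileext = ".csv")

save(toy, file = rdata_path)
write.csv(toy, csv_path, row.names = FALSE)

rm(toy)
load(rdata_path)   # restores the object under its original name: toy
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

class(toy$trust)        # still a factor, with levels in the order we set
class(from_csv$trust)   # plain text: the factor information was not stored
```

After reloading a CSV you would need to run your factor() code again, whereas the .RData file brings the factors back as they were.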

9. Going Further: Real Crime Survey Data

The dataset you have worked with today was designed to be clean, well-structured, and immediately usable, which made it ideal for learning the fundamentals. But it is also simulated. Real survey data, especially large-scale national surveys, tends to be messier, richer, and considerably more interesting.

The Crime Survey for England and Wales (CSEW) is one such survey. It is archived at the UK Data Service (UKDS), which is the national repository for social science data in the UK. Registering takes a few minutes and gives you access to hundreds of datasets across criminology, health, economics, and social policy.

Why didn’t we use real survey data today? Working with real survey data such as the CSEW or the Scottish Crime and Justice Survey (SCJS) requires a few extra steps: the files come in formats from other software (Stata or SPSS) rather than CSV, which requires a different import package, and variables are stored as numeric codes with labels rather than plain text. None of this is difficult, but learning those steps alongside everything else in two hours would have been too much.
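As a taste of what those extra steps look like, here is a hedged sketch using the haven package, which is installed alongside tidyverse. It does not use CSEW data: it creates a tiny hypothetical labelled variable, writes it to a temporary SPSS file, and reads it back, to mimic opening a downloaded .sav file.

```r
library(haven)

# Hypothetical SPSS-style variable: numeric codes with value labels attached
toy <- tibble::tibble(
  trust = labelled(c(1, 2, 1), labels = c("Low" = 1, "High" = 2))
)

# Round trip through a temporary .sav file to mimic a downloaded SPSS file
sav_path <- tempfile(fileext = ".sav")
write_sav(toy, sav_path)
imported <- read_sav(sav_path)

# as_factor() swaps the numeric codes for their labels
imported$trust_f <- as_factor(imported$trust)
imported
```

With a real file you would only need the read_sav() and as_factor() steps; Stata files have an equivalent function, read_dta().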

Further Learning: Free Online Training from the UK Data Service

The UK Data Service has developed a free, self-paced online training module specifically on analysing crime data using R. It covers the CSEW directly and builds on exactly the skills you have practised today.

Introduction to Crime Data Analysis using R
UK Data Service free and self-paced resources

This is an excellent resource to consolidate what you have learned today and extend it to working with real, nationally representative data. If you found today’s session useful, this is your obvious next step.

Summary of Functions Used Today

Function	Package	What it does
`read_csv()`	readr (tidyverse)	Import a CSV file
`head()`	base R	Show first 6 rows
`glimpse()`	dplyr (tidyverse)	Structured variable overview
`tabyl()`	janitor	Frequency tables and crosstabs
`adorn_percentages()`	janitor	Add percentages to a tabyl
`summary()`	base R	Min, max, mean, median, quartiles
`sd()`	base R	Standard deviation
`summarise()`	dplyr	Custom summary statistics
`across()`	dplyr	Apply a function across multiple columns
`sum(is.na())`	base R	Count missing values
`factor()`	base R	Convert to categorical with labels
`ggplot()` + `geom_bar()`	ggplot2	Bar chart
`ggplot()` + `geom_histogram()`	ggplot2	Histogram
`ggplot()` + `geom_boxplot()`	ggplot2	Boxplot

This practical was developed for a master’s-level introduction to data analysis in criminology. It draws on materials originally created for the UK Data Service Introduction to Crime Data workshop (University of Manchester, February 2020). The simulated dataset on public attitudes to policing was designed to reflect the structure of Qualtrics survey exports and the thematic content of the Crime Survey for England and Wales.

AI Statement

I used ELM (University of Edinburgh’s generative AI gateway) for a generic R example of simulating mixed‑type survey data. I adapted the approach, and the dataset used here was specified and generated independently to meet the requirements for this workshop.