Data preparation in R

Please note that the objective of this document is to present different stages in the preliminary data analysis. The included visualizations need more work and refinement. Some labels were removed due to limited space.

Getting the data

The datasource is SSB.

Importing Excel file into R

library(readxl)
crime_norway <- read_excel("C:/Users/Beata/Downloads/08631_20240919-153249_.xlsx")
New names:
• `` -> `...1`
• `` -> `...2`
• `` -> `...3`
View(crime_norway)

Importing/ renaming can also be done directly from RStudio (Import Dataset option).

Data cleaning

The table view after importing the data:

The main problem with this dataset is that it is not in the “tidy” format. based on Wickham (2014:5)1:

Tidy data is a standard way of mapping the meaning of a dataset to its structure. … In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table.

The key step in data cleaning is therefore pivoting data into a longer form.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)

crime_norway_longer <- crime_norway |> 
  pivot_longer(
    cols = starts_with("2"), names_to = "year", 
    values_to = "number",
    values_drop_na = TRUE
    )

This is the result:

The column 1 is redundant, for further work we need columns 2-5.

clean <- crime_norway_longer |> 
  select(-1)

In the next step I rename column “2” to “Victim”, “3” to “Type”, and capitalize the names of the other two columns.

colnames(clean)[1]<-"Victim"
colnames(clean)[2]<-"Type"
colnames(clean)[3]<-"Year"
colnames(clean)[4]<-"Number"

As a next step I divide the dataset into four subsets:

  • one subset presenting total numbers

  • 3 subsets based on victim type (person, other judical unit/enterprise, unknown)

totals <- clean |> 
  slice(1:200)

person <- clean |> 
  slice(201:400)

enterprise <- clean |> 
  slice(401:600)

unknown <- clean |> 
  slice(601:800)

Each of these subsets can be saved as a table/ csv file, to be used e.g. in Tableau.

write.csv(totals, "totals.csv")

write.csv(person, "person.csv")

write.csv(enterprise, "enterprise.csv")

write.csv(unknown, "unknown.csv")

Visualization

I usually use ggplot 2 + additional themes (ggthemes package).

In the following I filter the subset and visualize yearly totals (combined).

totals_combined <-  totals|>
  filter(Type == "All groups of offences")
totals_combined |> 
  ggplot(aes(Year, Number))+
  geom_point()+
  ggtitle("Total yearly reported crimes in Norway")+
  ylab("Cases")+
  theme_tufte()

Next, I intend to visualize the total yearly number by crime type.

To do so, I filter out the category “All groups of offences”.

totals_divided <-  totals|>
  filter(!(Type == "All groups of offences"))

Then, I visualize

totals_divided |> 
  ggplot(aes(Year, Number, colour = Type))+
  geom_point()+
  ggtitle("Yearly reported crimes in Norway, categories")+
  ylab("Cases")+
  theme_tufte()+
   theme(axis.text.x = element_blank())

Alternative theme:

Data can be further filtered to limit the number of categories, or divided into separate views with function facet_wrap. I removed the axis x label for clarity.

totals_divided |> 
  ggplot(aes(Year, Number, colour = Type))+
  geom_point()+
  facet_wrap(~ Type) +
  ggtitle("Yearly reported crimes in Norway, categories")+
  ylab("Cases")+
  theme_tufte()+
  theme(axis.text.x = element_blank())

Preliminary findings

From this comparison, we can see that the biggest decline in crime rates in the period 2014 -2023 took place in the category “Property theft”. Other categories remained relatively stable. It’s worth noting that, the lowest figures in this category (and in total crime count) were registered during the peak of COVID (2021).

A preliminary Tableau visualization

The visualization is based on the subset totals.csv and is available here: https://public.tableau.com/app/profile/beata.sirowy/viz/AnnualcrimeratesinNorway2004-2023bycategory/Dashboard1#1

Footnotes

  1. Wickham, H. . (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎