library(readxl)
<- read_excel("C:/Users/Beata/Downloads/08631_20240919-153249_.xlsx") crime_norway
New names:
• `` -> `...1`
• `` -> `...2`
• `` -> `...3`
View(crime_norway)
Please note that the objective of this document is to present different stages in the preliminary data analysis. The included visualizations need more work and refinement. Some labels were removed due to limited space.
The datasource is SSB.
library(readxl)
<- read_excel("C:/Users/Beata/Downloads/08631_20240919-153249_.xlsx") crime_norway
New names:
• `` -> `...1`
• `` -> `...2`
• `` -> `...3`
View(crime_norway)
Importing/ renaming can also be done directly from RStudio (Import Dataset option).
The table view after importing the data:
The main problem with this dataset is that it is not in the “tidy” format. based on Wickham (2014:5)1:
Tidy data is a standard way of mapping the meaning of a dataset to its structure. … In tidy data: 1. Each variable forms a column. 2. Each observation forms a row. 3. Each type of observational unit forms a table.
The key step in data cleaning is therefore pivoting data into a longer form.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
<- crime_norway |>
crime_norway_longer pivot_longer(
cols = starts_with("2"), names_to = "year",
values_to = "number",
values_drop_na = TRUE
)
This is the result:
The column 1 is redundant, for further work we need columns 2-5.
<- crime_norway_longer |>
clean select(-1)
In the next step I rename column “2” to “Victim”, “3” to “Type”, and capitalize the names of the other two columns.
colnames(clean)[1]<-"Victim"
colnames(clean)[2]<-"Type"
colnames(clean)[3]<-"Year"
colnames(clean)[4]<-"Number"
As a next step I divide the dataset into four subsets:
one subset presenting total numbers
3 subsets based on victim type (person, other judical unit/enterprise, unknown)
<- clean |>
totals slice(1:200)
<- clean |>
person slice(201:400)
<- clean |>
enterprise slice(401:600)
<- clean |>
unknown slice(601:800)
Each of these subsets can be saved as a table/ csv file, to be used e.g. in Tableau.
write.csv(totals, "totals.csv")
write.csv(person, "person.csv")
write.csv(enterprise, "enterprise.csv")
write.csv(unknown, "unknown.csv")
I usually use ggplot 2 + additional themes (ggthemes package).
In the following I filter the subset and visualize yearly totals (combined).
<- totals|>
totals_combined filter(Type == "All groups of offences")
|>
totals_combined ggplot(aes(Year, Number))+
geom_point()+
ggtitle("Total yearly reported crimes in Norway")+
ylab("Cases")+
theme_tufte()
Next, I intend to visualize the total yearly number by crime type.
To do so, I filter out the category “All groups of offences”.
<- totals|>
totals_divided filter(!(Type == "All groups of offences"))
Then, I visualize
|>
totals_divided ggplot(aes(Year, Number, colour = Type))+
geom_point()+
ggtitle("Yearly reported crimes in Norway, categories")+
ylab("Cases")+
theme_tufte()+
theme(axis.text.x = element_blank())
Alternative theme:
Data can be further filtered to limit the number of categories, or divided into separate views with function facet_wrap. I removed the axis x label for clarity.
|>
totals_divided ggplot(aes(Year, Number, colour = Type))+
geom_point()+
facet_wrap(~ Type) +
ggtitle("Yearly reported crimes in Norway, categories")+
ylab("Cases")+
theme_tufte()+
theme(axis.text.x = element_blank())
From this comparison, we can see that the biggest decline in crime rates in the period 2014 -2023 took place in the category “Property theft”. Other categories remained relatively stable. It’s worth noting that, the lowest figures in this category (and in total crime count) were registered during the peak of COVID (2021).
The visualization is based on the subset totals.csv and is available here: https://public.tableau.com/app/profile/beata.sirowy/viz/AnnualcrimeratesinNorway2004-2023bycategory/Dashboard1#1
Wickham, H. . (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎