This report presents an analysis of data on motor vehicle thefts from the New Zealand Police Department’s vehicle of interest database. The dataset covers a period of six months and each record represents a single stolen vehicle.
The dataset includes information about stolen vehicles, such as the type of vehicle, make, model year, color, the date it was stolen, and the region from which it was stolen.
The primary dataset stolen_vehicles.csv contains the
following columns:
vehicle_id: A unique identifier for each vehicle.vehicle_type: The type of the vehicle.make_id: An identifier for the make of the
vehicle.model_year: The year of the vehicle model.vehicle_desc: A description of the vehicle.color: The color of the vehicle.date_stolen: The date when the vehicle was stolen.location_id: An identifier for the location where the
vehicle was stolen.Additional datasets include locations.csv and
make_details.csv, which provide further details about the
regions and vehicle makes, respectively.
The analysis aims to answer several key questions:
We will begin by loading the data into RStudio. For this analysis,
we’ll utilize data.table, a powerful package in R designed
for efficient data manipulation and analysis. Let’s start by loading the
data.table library.
# import packages
library(data.table)
library(ggplot2)
# import data, we use fread to get a data.table
locations = fread('https://raw.githubusercontent.com/artyomashigov/CEU-R-intro-2024/main/Project/Data/locations.csv')
locations
## location_id region country population density
## 1: 101 Northland New Zealand 201,500 16.11
## 2: 102 Auckland New Zealand 1,695,200 343.09
## 3: 103 Waikato New Zealand 513,800 21.50
## 4: 104 Bay of Plenty New Zealand 347,700 28.80
## 5: 105 Gisborne New Zealand 52,100 6.21
## 6: 106 Hawke's Bay New Zealand 182,700 12.92
## 7: 107 Taranaki New Zealand 127,300 17.55
## 8: 108 Manawatū-Whanganui New Zealand 258,200 11.62
## 9: 109 Wellington New Zealand 543,500 67.52
## 10: 110 Tasman New Zealand 58,700 6.10
## 11: 111 Nelson New Zealand 54,500 129.15
## 12: 112 Marlborough New Zealand 51,900 4.94
## 13: 113 West Coast New Zealand 32,700 1.41
## 14: 114 Canterbury New Zealand 655,000 14.72
## 15: 115 Otago New Zealand 246,000 7.89
## 16: 116 Southland New Zealand 102,400 3.28
str(locations)
## Classes 'data.table' and 'data.frame': 16 obs. of 5 variables:
## $ location_id: int 101 102 103 104 105 106 107 108 109 110 ...
## $ region : chr "Northland" "Auckland" "Waikato" "Bay of Plenty" ...
## $ country : chr "New Zealand" "New Zealand" "New Zealand" "New Zealand" ...
## $ population : chr "201,500" "1,695,200" "513,800" "347,700" ...
## $ density : num 16.11 343.09 21.5 28.8 6.21 ...
## - attr(*, ".internal.selfref")=<externalptr>
In our dataset, there are 16 distinct locations represented,
encompassing a variety of data across 5 columns. These include three
text-type columns, one numerical, and one integer column.
Notably,
the population field is currently classified as text. To
facilitate more accurate analysis, we will proceed to convert this field
to a numerical data type in the next section.
#import make details
make_details = fread('https://raw.githubusercontent.com/artyomashigov/CEU-R-intro-2024/main/Project/Data/make_details.csv')
make_details
## make_id make_name make_type
## 1: 501 Aakron Xpress Standard
## 2: 502 ADLY Standard
## 3: 503 Alpha Standard
## 4: 504 Anglo Standard
## 5: 505 Aprilia Standard
## ---
## 134: 634 Volvo Luxury
## 135: 635 Voyager Standard
## 136: 636 Yamaha Standard
## 137: 637 Zephyr Standard
## 138: 638 Znen Standard
str(make_details)
## Classes 'data.table' and 'data.frame': 138 obs. of 3 variables:
## $ make_id : int 501 502 503 504 505 506 507 508 509 510 ...
## $ make_name: chr "Aakron Xpress" "ADLY" "Alpha" "Anglo" ...
## $ make_type: chr "Standard" "Standard" "Standard" "Standard" ...
## - attr(*, ".internal.selfref")=<externalptr>
In the make_details dataset we’ve loaded, there’s
information on 138 unique car manufacturers. When we check the structure
of this dataset, it seems that the data types of each field are already
in the format we need. This means we won’t have to make any additional
changes to these data types for our analysis.
# import stolen vehicles
stolen_vehicles = fread('https://raw.githubusercontent.com/artyomashigov/CEU-R-intro-2024/main/Project/Data/stolen_vehicles.csv')
stolen_vehicles
## vehicle_id vehicle_type make_id model_year vehicle_desc color
## 1: 1 Trailer 623 2021 BST2021D Silver
## 2: 2 Boat Trailer 623 2021 OUTBACK BOATS FT470 Silver
## 3: 3 Boat Trailer 623 2021 ASD JETSKI Silver
## 4: 4 Trailer 623 2021 MSC 7X4 Silver
## 5: 5 Trailer 623 2018 D-MAX 8X5 Silver
## ---
## 4549: 4549 NA NA
## 4550: 4550 NA NA
## 4551: 4551 NA NA
## 4552: 4552 NA NA
## 4553: 4553 NA NA
## date_stolen location_id
## 1: 11/5/21 102
## 2: 12/13/21 105
## 3: 2/13/22 102
## 4: 11/13/21 106
## 5: 1/10/22 102
## ---
## 4549: 2/18/22 102
## 4550: 2/14/22 109
## 4551: 3/9/22 102
## 4552: 3/7/22 109
## 4553: 3/14/22 102
str(stolen_vehicles)
## Classes 'data.table' and 'data.frame': 4553 obs. of 8 variables:
## $ vehicle_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ vehicle_type: chr "Trailer" "Boat Trailer" "Boat Trailer" "Trailer" ...
## $ make_id : int 623 623 623 623 623 636 623 623 514 514 ...
## $ model_year : int 2021 2021 2021 2021 2018 2005 2021 2001 2021 2020 ...
## $ vehicle_desc: chr "BST2021D" "OUTBACK BOATS FT470" "ASD JETSKI" "MSC 7X4" ...
## $ color : chr "Silver" "Silver" "Silver" "Silver" ...
## $ date_stolen : chr "11/5/21" "12/13/21" "2/13/22" "11/13/21" ...
## $ location_id : int 102 105 102 106 102 102 114 109 115 114 ...
## - attr(*, ".internal.selfref")=<externalptr>
We’ve successfully loaded the stolen_vehicles dataset,
which contains a total of 4553 records. A quick look at the data reveals
that everything is in order, but there’s one small adjustment we need to
make: the date_stolen field is currently a character type.
We’ll convert this to a date type in our upcoming analysis steps, to
make sure our data is just right for the analysis we plan to do.
Now moving to Step 3, it’s time for some data transformation and
analysis. Our first task is to tweak the types of two variables in our
dataset. We’ll transform the population field into an
integer type for easier numerical operations. Additionally, we’ll change
date_stolen from a character type to a date type. For ease
of use and consistency, we’ll format date_stolen into the
European date style. This way, our data will be neatly organized and
more straightforward to work with in our analysis.
# Convert 'population' from character to integer
locations$population <- as.integer(gsub(",", "", locations$population))
# Convert 'date_stolen' from character to Date in the format 'dd-mm-yyyy'
stolen_vehicles$date_stolen <- as.Date(stolen_vehicles$date_stolen, format="%m/%d/%y")
stolen_vehicles$date_stolen <- format(stolen_vehicles$date_stolen, "%d-%m-%Y")
We’re now going to check if there are any missing values in our three
datasets: locations, make_details, and
stolen_vehicles. This is an important step to ensure the
quality and completeness of our data before diving deeper into analysis.
Let’s run a check on each dataset to count how many missing values (if
any) exist in each column. This way, we can identify if there are any
gaps in our data that need addressing.
# Column-wise count of missing values
# In 'locations'
sapply(locations, function(x) sum(is.na(x)))
## location_id region country population density
## 0 0 0 0 0
# In 'make_details'
sapply(make_details, function(x) sum(is.na(x)))
## make_id make_name make_type
## 0 0 0
# In 'stolen_vehicles'
sapply(stolen_vehicles, function(x) sum(is.na(x)))
## vehicle_id vehicle_type make_id model_year vehicle_desc color
## 0 0 15 15 0 0
## date_stolen location_id
## 0 0
Now, let’s bring all our three datasets –
stolen_vehicles, make_details, and
locations – together into one comprehensive table. By using
the all.x = TRUE option in our merge function, we’ll make
sure to keep every record from the stolen_vehicles dataset, even those
that don’t have corresponding entries in the other two datasets. This
way, we won’t lose any valuable information from our primary dataset
during the merge process.
# Merge datasets
df <- merge(stolen_vehicles, make_details, by = "make_id", all.x = TRUE)
df <- merge(df, locations, by = "location_id", all.x = TRUE)
Next up, we’re going to tidy up our merged dataset a bit. First,
we’ll check for any missing values and then remove them to ensure our
data is clean and complete. We don’t want any gaps messing up our
analysis! After that, we’ll take a quick peek at our dataset to get some
basic descriptive stats, and for this, we’ll use the handy
summary method. It’s a great way to get an overview of our
data in just a glance.
# check missing values
sapply(df, function(x) sum(is.na(x)))
## location_id make_id vehicle_id vehicle_type model_year vehicle_desc
## 0 15 0 0 15 0
## color date_stolen make_name make_type region country
## 0 0 15 15 0 0
## population density
## 0 0
# drop missing values
df <- na.omit(df)
df <- df[vehicle_type != ""]
# use summary
summary(df)
## location_id make_id vehicle_id vehicle_type
## Min. :101.0 Min. :501.0 Min. : 1 Length:4527
## 1st Qu.:102.0 1st Qu.:550.0 1st Qu.:1132 Class :character
## Median :104.0 Median :587.0 Median :2264 Mode :character
## Mean :105.8 Mean :584.2 Mean :2264
## 3rd Qu.:109.0 3rd Qu.:619.0 3rd Qu.:3396
## Max. :116.0 Max. :638.0 Max. :4527
## model_year vehicle_desc color date_stolen
## Min. :1940 Length:4527 Length:4527 Length:4527
## 1st Qu.:2000 Class :character Class :character Class :character
## Median :2005 Mode :character Mode :character Mode :character
## Mean :2005
## 3rd Qu.:2011
## Max. :2022
## make_name make_type region country
## Length:4527 Length:4527 Length:4527 Length:4527
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## population density
## Min. : 52100 Min. : 3.28
## 1st Qu.: 347700 1st Qu.: 14.72
## Median : 655000 Median : 28.80
## Mean : 866847 Mean :141.19
## 3rd Qu.:1695200 3rd Qu.:343.09
## Max. :1695200 Max. :343.09
The dataset provides insight into stolen vehicles in New Zealand, with the average model year of vehicles being 2005, indicating that the average age of stolen vehicles is approximately 19 years. It covers a variety of vehicle types and colors, with a comprehensive description for each vehicle.
Population data for the regions where thefts occurred ranges from 52,100 to 1,695,200, with density varying significantly from 3.28 to 343.09. The dataset also includes diverse vehicle makes and types, suggesting a wide range of targeted vehicles.
Now, let’s dive into the visual exploration of our data. We’ll use various visualizations to answer some intriguing questions about vehicle thefts:
date_stolen field.
This will allow us to analyze the frequency of thefts across different
days and identify any patterns or trends.# get weekday names
df$week_day <- weekdays(as.Date(df$date_stolen))
# Calculate the number of thefts per day of the week
theft_counts <- df[, .(Count = .N), by = .(week_day)]
# Order the days of the week
week_days_ordered <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
theft_counts$week_day <- factor(theft_counts$week_day, levels = week_days_ordered)
# Create a bar plot with ColorBrewer palette
ggplot(theft_counts, aes(x = week_day, y = Count, fill = week_day)) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "BuGn") + # Apply ColorBrewer palette
geom_text(aes(label = Count), vjust = -0.3, color = "black") + # Adding data labels
labs(title = "Vehicle Thefts by Day of the Week", x = "", y = "") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
The bar plot vividly illustrates the weekly pattern in vehicle
thefts. It’s quite noticeable that the highest number of thefts occur on
Saturdays (723 thefts) and Sundays (690 thefts), suggesting a weekend
trend in vehicle theft incidents. On the other hand, Fridays appear to
be the safest, with the lowest count of thefts at 585. This insight
could be pivotal for law enforcement and public awareness
strategies.
# Aggregate data: count of stolen vehicles by vehicle type
vehicle_type_counts <- df[, .(Count = .N), by = .(vehicle_type)]
# Order the vehicle types by count
vehicle_type_counts <- vehicle_type_counts[order(-Count)]
# Create a bar chart with data labels
ggplot(vehicle_type_counts, aes(x = vehicle_type, y = Count, fill = vehicle_type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Count), vjust = -0.3, size = 2.5) + # Adding data labels
labs(title = "Count of Thefts by Vehicle Type",
x = "",
y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
Our second bar chart provides a clear depiction of theft frequency by
vehicle type. From the chart, it’s evident that
station wagons lead in theft frequency with a total of 945
incidents, highlighting their vulnerability. On the other end of the
spectrum, articulated trucks seem to be the least targeted,
with only a single reported theft. This visualization underscores the
varying risks associated with different vehicle types.”
# Calculate the count of luxury vs. non-luxury vehicles
luxury_counts <- df[, .(Count = .N), by = .(make_type)]
# Calculate the percentages
luxury_counts[, Percentage := Count / sum(Count) * 100]
# Create a pie chart
ggplot(luxury_counts, aes(x = 2, y = Percentage, fill = make_type)) + # Set x to a constant value
geom_bar(stat = "identity", width = 0.8) +
coord_polar(theta = "y") +
labs(title = "Percentage of Stolen Luxury Cars", x = "", y = "") +
theme_void() +
theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = sprintf("%.1f%%", Percentage)),
position = position_stack(vjust = 0.5)) # Add labels
The pie chart we’ve created offers a clear view of the stolen car
make-up. Interestingly, luxury cars constitute only a small fraction of
the total thefts, precisely 4.2%. This could suggest that thieves might
be targeting more common vehicle types, possibly due to their prevalence
or lesser security features compared to luxury cars.
# Calculate the age of the vehicles
df[, vehicle_age := 2024 - model_year]
# Create a box plot of vehicle age by vehicle type
ggplot(df, aes(x = vehicle_type, y = vehicle_age, fill = vehicle_type)) +
geom_boxplot() +
labs(title = "Average Age of Stolen Vehicles by Vehicle Type",
x = "",
y = "Age") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
theme(plot.title = element_text(hjust = 0.5)) # Center the title
The box plot we’ve generated reveals significant variations in the
age range of stolen vehicles across different types. For instance,
caravans and trail bikes show a broad age
range, with a tendency towards older models being stolen more
frequently. Conversely, mopeds and tractors
have a narrower age range, indicating that newer models in these
categories are more commonly targeted. This diversity in age range
across vehicle types provides valuable insights into the preferences of
vehicle thieves.
# Aggregate data: count of stolen vehicles by region
theft_counts_by_region <- df[, .(Count = .N), by = .(region)]
# Order regions by count
theft_counts_by_region <- theft_counts_by_region[order(-Count)]
# Create a bar chart of theft counts by region with data labels
ggplot(theft_counts_by_region, aes(x = reorder(region, Count), y = Count, fill = region)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Count), vjust = -0.3, color = "black", size = 3) + # Adding data labels
labs(title = "Number of Vehicle Thefts by Region",
x = "",
y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none",
plot.title = element_text(hjust = 0.5)) # Center the title
The bar chart paints a clear picture of the regional distribution of
vehicle thefts. It’s quite striking to see that Auckland
tops the list with a substantial count of 1626 reported thefts.
In
stark contrast, Southland reports the fewest incidents,
with only 26 cases. This stark difference highlights the varying levels
of vehicle theft risk across different regions.
# Aggregate data: count of stolen vehicles by color
theft_counts_by_color <- df[, .(Count = .N), by = .(color)]
# Order colors by count
theft_counts_by_color <- theft_counts_by_color[order(-Count)]
# Create a bar chart of theft counts by color
ggplot(theft_counts_by_color, aes(x = reorder(color, Count), y = Count)) +
geom_bar(stat = "identity", fill = "darkgreen") +
geom_text(aes(label = Count), vjust = -0.3, color = "black", size = 3) +
labs(title = "Vehicle Theft Counts by Color",
x = "",
y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none", #
plot.title = element_text(hjust = 0.5)) # Center the title
The color ‘Silver’ emerges as the most targeted, with as many as 1272
vehicles stolen, closely followed by ‘White’ vehicles at 932 thefts.
# Aggregate data: count of stolen vehicles by region's population density
theft_counts_by_density <- df[, .(Count = .N), by = .(density)]
# Create a scatter plot with a linear regression trend line
ggplot(theft_counts_by_density, aes(x = density, y = Count)) +
geom_point() + # Scatter plot
geom_smooth(method = "lm", se = FALSE, color = "blue") + # Linear regression trend line
labs(title = "Vehicle Theft Counts by Population Density with Trend Line",
x = "Population Density",
y = "Count of Thefts") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot with an overlaid linear regression trend line reveals an intriguing pattern: there seems to be a positive correlation between a region’s population density and the number of vehicle thefts. Essentially, as the population density increases, so does the frequency of stolen vehicles. This finding suggests that denser regions might be more susceptible to vehicle theft, possibly due to factors like urbanization and the sheer number of vehicles present.
In conclusion, this comprehensive analysis of the stolen vehicle dataset from New Zealand has provided valuable insights into the patterns and characteristics of vehicle thefts. Our investigation spanned various aspects, including the days of the week when thefts are most prevalent, the types of vehicles most at risk, the incidence of luxury vehicle thefts, the average age of stolen vehicles, and regional variations in theft frequencies.
Key findings include the higher frequency of thefts during weekends, with Saturdays experiencing the highest number of incidents. Interestingly, luxury vehicles constituted a minor proportion of thefts, indicating a possible preference for more common or less secure vehicle types among thieves. Additionally, the analysis revealed significant variations in vehicle age depending on the type, with certain types like caravans and trail bikes being older on average at the time of theft.
Regional analysis showed a correlation between population density and theft frequency, suggesting that denser areas are more prone to vehicle thefts. This finding has implications for urban planning and law enforcement strategies.
Overall, the insights gained from this analysis could be instrumental
in shaping effective policies and preventive measures to combat vehicle
theft. It also underscores the importance of data-driven approaches in
understanding and addressing societal challenges. The versatility of R
and its packages, particularly data.table and
ggplot2, played a crucial role in facilitating this
detailed analysis.