Introduction

This report presents an analysis of data on motor vehicle thefts from the New Zealand Police Department’s vehicle of interest database. The dataset covers a period of six months and each record represents a single stolen vehicle.

Step 1: Data Description

The dataset includes information about stolen vehicles, such as the type of vehicle, make, model year, color, the date it was stolen, and the region from which it was stolen.

The primary dataset stolen_vehicles.csv contains the following columns:

Additional datasets include locations.csv and make_details.csv, which provide further details about the regions and vehicle makes, respectively.

The analysis aims to answer several key questions:

Step 2: Data Loading

We will begin by loading the data into RStudio. For this analysis, we’ll utilize data.table, a powerful package in R designed for efficient data manipulation and analysis. Let’s start by loading the data.table library.

# import packages
library(data.table)
library(ggplot2)



# import data, we use fread to get a data.table
locations = fread('https://raw.githubusercontent.com/artyomashigov/CEU-R-intro-2024/main/Project/Data/locations.csv')
locations
##     location_id             region     country population density
##  1:         101          Northland New Zealand    201,500   16.11
##  2:         102           Auckland New Zealand  1,695,200  343.09
##  3:         103            Waikato New Zealand    513,800   21.50
##  4:         104      Bay of Plenty New Zealand    347,700   28.80
##  5:         105           Gisborne New Zealand     52,100    6.21
##  6:         106        Hawke's Bay New Zealand    182,700   12.92
##  7:         107           Taranaki New Zealand    127,300   17.55
##  8:         108 Manawatū-Whanganui New Zealand    258,200   11.62
##  9:         109         Wellington New Zealand    543,500   67.52
## 10:         110             Tasman New Zealand     58,700    6.10
## 11:         111             Nelson New Zealand     54,500  129.15
## 12:         112        Marlborough New Zealand     51,900    4.94
## 13:         113         West Coast New Zealand     32,700    1.41
## 14:         114         Canterbury New Zealand    655,000   14.72
## 15:         115              Otago New Zealand    246,000    7.89
## 16:         116          Southland New Zealand    102,400    3.28
str(locations)
## Classes 'data.table' and 'data.frame':   16 obs. of  5 variables:
##  $ location_id: int  101 102 103 104 105 106 107 108 109 110 ...
##  $ region     : chr  "Northland" "Auckland" "Waikato" "Bay of Plenty" ...
##  $ country    : chr  "New Zealand" "New Zealand" "New Zealand" "New Zealand" ...
##  $ population : chr  "201,500" "1,695,200" "513,800" "347,700" ...
##  $ density    : num  16.11 343.09 21.5 28.8 6.21 ...
##  - attr(*, ".internal.selfref")=<externalptr>

In our dataset, there are 16 distinct locations represented, encompassing a variety of data across 5 columns. These include three text-type columns, one numerical, and one integer column.
Notably, the population field is currently classified as text. To facilitate more accurate analysis, we will proceed to convert this field to a numerical data type in the next section.

#import make details
make_details = fread('https://raw.githubusercontent.com/artyomashigov/CEU-R-intro-2024/main/Project/Data/make_details.csv')
make_details
##      make_id     make_name make_type
##   1:     501 Aakron Xpress  Standard
##   2:     502          ADLY  Standard
##   3:     503         Alpha  Standard
##   4:     504         Anglo  Standard
##   5:     505       Aprilia  Standard
##  ---                                
## 134:     634         Volvo    Luxury
## 135:     635       Voyager  Standard
## 136:     636        Yamaha  Standard
## 137:     637        Zephyr  Standard
## 138:     638          Znen  Standard
str(make_details)
## Classes 'data.table' and 'data.frame':   138 obs. of  3 variables:
##  $ make_id  : int  501 502 503 504 505 506 507 508 509 510 ...
##  $ make_name: chr  "Aakron Xpress" "ADLY" "Alpha" "Anglo" ...
##  $ make_type: chr  "Standard" "Standard" "Standard" "Standard" ...
##  - attr(*, ".internal.selfref")=<externalptr>

In the make_details dataset we’ve loaded, there’s information on 138 unique car manufacturers. When we check the structure of this dataset, it seems that the data types of each field are already in the format we need. This means we won’t have to make any additional changes to these data types for our analysis.

# import stolen vehicles
stolen_vehicles = fread('https://raw.githubusercontent.com/artyomashigov/CEU-R-intro-2024/main/Project/Data/stolen_vehicles.csv')
stolen_vehicles
##       vehicle_id vehicle_type make_id model_year        vehicle_desc  color
##    1:          1      Trailer     623       2021            BST2021D Silver
##    2:          2 Boat Trailer     623       2021 OUTBACK BOATS FT470 Silver
##    3:          3 Boat Trailer     623       2021          ASD JETSKI Silver
##    4:          4      Trailer     623       2021             MSC 7X4 Silver
##    5:          5      Trailer     623       2018           D-MAX 8X5 Silver
##   ---                                                                      
## 4549:       4549                   NA         NA                           
## 4550:       4550                   NA         NA                           
## 4551:       4551                   NA         NA                           
## 4552:       4552                   NA         NA                           
## 4553:       4553                   NA         NA                           
##       date_stolen location_id
##    1:     11/5/21         102
##    2:    12/13/21         105
##    3:     2/13/22         102
##    4:    11/13/21         106
##    5:     1/10/22         102
##   ---                        
## 4549:     2/18/22         102
## 4550:     2/14/22         109
## 4551:      3/9/22         102
## 4552:      3/7/22         109
## 4553:     3/14/22         102
str(stolen_vehicles)
## Classes 'data.table' and 'data.frame':   4553 obs. of  8 variables:
##  $ vehicle_id  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ vehicle_type: chr  "Trailer" "Boat Trailer" "Boat Trailer" "Trailer" ...
##  $ make_id     : int  623 623 623 623 623 636 623 623 514 514 ...
##  $ model_year  : int  2021 2021 2021 2021 2018 2005 2021 2001 2021 2020 ...
##  $ vehicle_desc: chr  "BST2021D" "OUTBACK BOATS FT470" "ASD JETSKI" "MSC 7X4" ...
##  $ color       : chr  "Silver" "Silver" "Silver" "Silver" ...
##  $ date_stolen : chr  "11/5/21" "12/13/21" "2/13/22" "11/13/21" ...
##  $ location_id : int  102 105 102 106 102 102 114 109 115 114 ...
##  - attr(*, ".internal.selfref")=<externalptr>

We’ve successfully loaded the stolen_vehicles dataset, which contains a total of 4553 records. A quick look at the data reveals that everything is in order, but there’s one small adjustment we need to make: the date_stolen field is currently a character type. We’ll convert this to a date type in our upcoming analysis steps, to make sure our data is just right for the analysis we plan to do.

Step 3: Data transformation and analysis

Now moving to Step 3, it’s time for some data transformation and analysis. Our first task is to tweak the types of two variables in our dataset. We’ll transform the population field into an integer type for easier numerical operations. Additionally, we’ll change date_stolen from a character type to a date type. For ease of use and consistency, we’ll format date_stolen into the European date style. This way, our data will be neatly organized and more straightforward to work with in our analysis.

# Convert 'population' from character to integer
locations$population <- as.integer(gsub(",", "", locations$population))

# Convert 'date_stolen' from character to Date in the format 'dd-mm-yyyy'
stolen_vehicles$date_stolen <- as.Date(stolen_vehicles$date_stolen, format="%m/%d/%y")
stolen_vehicles$date_stolen <- format(stolen_vehicles$date_stolen, "%d-%m-%Y")

We’re now going to check if there are any missing values in our three datasets: locations, make_details, and stolen_vehicles. This is an important step to ensure the quality and completeness of our data before diving deeper into analysis. Let’s run a check on each dataset to count how many missing values (if any) exist in each column. This way, we can identify if there are any gaps in our data that need addressing.

# Column-wise count of missing values
# In 'locations'
sapply(locations, function(x) sum(is.na(x)))
## location_id      region     country  population     density 
##           0           0           0           0           0
# In 'make_details'
sapply(make_details, function(x) sum(is.na(x)))
##   make_id make_name make_type 
##         0         0         0
# In 'stolen_vehicles'
sapply(stolen_vehicles, function(x) sum(is.na(x)))
##   vehicle_id vehicle_type      make_id   model_year vehicle_desc        color 
##            0            0           15           15            0            0 
##  date_stolen  location_id 
##            0            0

Now, let’s bring all our three datasets – stolen_vehicles, make_details, and locations – together into one comprehensive table. By using the all.x = TRUE option in our merge function, we’ll make sure to keep every record from the stolen_vehicles dataset, even those that don’t have corresponding entries in the other two datasets. This way, we won’t lose any valuable information from our primary dataset during the merge process.

# Merge datasets
df <- merge(stolen_vehicles, make_details, by = "make_id", all.x = TRUE)
df <- merge(df, locations, by = "location_id", all.x = TRUE)

Next up, we’re going to tidy up our merged dataset a bit. First, we’ll check for any missing values and then remove them to ensure our data is clean and complete. We don’t want any gaps messing up our analysis! After that, we’ll take a quick peek at our dataset to get some basic descriptive stats, and for this, we’ll use the handy summary method. It’s a great way to get an overview of our data in just a glance.

# check missing values
sapply(df, function(x) sum(is.na(x)))
##  location_id      make_id   vehicle_id vehicle_type   model_year vehicle_desc 
##            0           15            0            0           15            0 
##        color  date_stolen    make_name    make_type       region      country 
##            0            0           15           15            0            0 
##   population      density 
##            0            0
# drop missing values
df <- na.omit(df)
df <- df[vehicle_type != ""]
# use summary
summary(df)
##   location_id       make_id        vehicle_id   vehicle_type      
##  Min.   :101.0   Min.   :501.0   Min.   :   1   Length:4527       
##  1st Qu.:102.0   1st Qu.:550.0   1st Qu.:1132   Class :character  
##  Median :104.0   Median :587.0   Median :2264   Mode  :character  
##  Mean   :105.8   Mean   :584.2   Mean   :2264                     
##  3rd Qu.:109.0   3rd Qu.:619.0   3rd Qu.:3396                     
##  Max.   :116.0   Max.   :638.0   Max.   :4527                     
##    model_year   vehicle_desc          color           date_stolen       
##  Min.   :1940   Length:4527        Length:4527        Length:4527       
##  1st Qu.:2000   Class :character   Class :character   Class :character  
##  Median :2005   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2005                                                           
##  3rd Qu.:2011                                                           
##  Max.   :2022                                                           
##   make_name          make_type            region            country         
##  Length:4527        Length:4527        Length:4527        Length:4527       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    population         density      
##  Min.   :  52100   Min.   :  3.28  
##  1st Qu.: 347700   1st Qu.: 14.72  
##  Median : 655000   Median : 28.80  
##  Mean   : 866847   Mean   :141.19  
##  3rd Qu.:1695200   3rd Qu.:343.09  
##  Max.   :1695200   Max.   :343.09

The dataset provides insight into stolen vehicles in New Zealand, with the average model year of vehicles being 2005, indicating that the average age of stolen vehicles is approximately 19 years. It covers a variety of vehicle types and colors, with a comprehensive description for each vehicle.

Population data for the regions where thefts occurred ranges from 52,100 to 1,695,200, with density varying significantly from 3.28 to 343.09. The dataset also includes diverse vehicle makes and types, suggesting a wide range of targeted vehicles.

Step 4: Visualizations

Now, let’s dive into the visual exploration of our data. We’ll use various visualizations to answer some intriguing questions about vehicle thefts:

  1. What days of the week see the highest and lowest rates of vehicle theft?
    To tackle this question, our first step is to extract the day of the week from the date_stolen field. This will allow us to analyze the frequency of thefts across different days and identify any patterns or trends.
# get weekday names
df$week_day <- weekdays(as.Date(df$date_stolen))

# Calculate the number of thefts per day of the week
theft_counts <- df[, .(Count = .N), by = .(week_day)]

# Order the days of the week
week_days_ordered <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
theft_counts$week_day <- factor(theft_counts$week_day, levels = week_days_ordered)

# Create a bar plot with ColorBrewer palette
ggplot(theft_counts, aes(x = week_day, y = Count, fill = week_day)) + 
    geom_bar(stat = "identity") +
    scale_fill_brewer(palette = "BuGn") +  # Apply ColorBrewer palette
    geom_text(aes(label = Count), vjust = -0.3, color = "black") +  # Adding data labels
    labs(title = "Vehicle Thefts by Day of the Week", x = "", y = "") +
    theme_minimal() + 
    theme(plot.title = element_text(hjust = 0.5))  # Center the title


The bar plot vividly illustrates the weekly pattern in vehicle thefts. It’s quite noticeable that the highest number of thefts occur on Saturdays (723 thefts) and Sundays (690 thefts), suggesting a weekend trend in vehicle theft incidents. On the other hand, Fridays appear to be the safest, with the lowest count of thefts at 585. This insight could be pivotal for law enforcement and public awareness strategies.

  1. Which vehicle types are most and least frequently stolen? Does this vary by region?
# Aggregate data: count of stolen vehicles by vehicle type
vehicle_type_counts <- df[, .(Count = .N), by = .(vehicle_type)]

# Order the vehicle types by count
vehicle_type_counts <- vehicle_type_counts[order(-Count)]

# Create a bar chart with data labels
ggplot(vehicle_type_counts, aes(x = vehicle_type, y = Count, fill = vehicle_type)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = Count), vjust = -0.3, size = 2.5) +  # Adding data labels
    labs(title = "Count of Thefts by Vehicle Type",
         x = "",
         y = "") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "none") +
    theme(plot.title = element_text(hjust = 0.5))  # Center the title


Our second bar chart provides a clear depiction of theft frequency by vehicle type. From the chart, it’s evident that station wagons lead in theft frequency with a total of 945 incidents, highlighting their vulnerability. On the other end of the spectrum, articulated trucks seem to be the least targeted, with only a single reported theft. This visualization underscores the varying risks associated with different vehicle types.”

  1. What is the percentage of stolen Luxury cars?
# Calculate the count of luxury vs. non-luxury vehicles
luxury_counts <- df[, .(Count = .N), by = .(make_type)]

# Calculate the percentages
luxury_counts[, Percentage := Count / sum(Count) * 100]

# Create a pie chart
ggplot(luxury_counts, aes(x = 2, y = Percentage, fill = make_type)) +  # Set x to a constant value
    geom_bar(stat = "identity", width = 0.8) +
    coord_polar(theta = "y") +
    labs(title = "Percentage of Stolen Luxury Cars", x = "", y = "") +
    theme_void() +
    theme(legend.title = element_blank(), plot.title = element_text(hjust = 0.5)) +  
    geom_text(aes(label = sprintf("%.1f%%", Percentage)), 
              position = position_stack(vjust = 0.5))  # Add labels


The pie chart we’ve created offers a clear view of the stolen car make-up. Interestingly, luxury cars constitute only a small fraction of the total thefts, precisely 4.2%. This could suggest that thieves might be targeting more common vehicle types, possibly due to their prevalence or lesser security features compared to luxury cars.

  1. What is the average age of the stolen vehicles, and does it vary based on the vehicle type?
# Calculate the age of the vehicles
df[, vehicle_age := 2024 - model_year]

# Create a box plot of vehicle age by vehicle type
ggplot(df, aes(x = vehicle_type, y = vehicle_age, fill = vehicle_type)) +
    geom_boxplot() +
    labs(title = "Average Age of Stolen Vehicles by Vehicle Type",
         x = "",
         y = "Age") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),  
          legend.position = "none") + 
    theme(plot.title = element_text(hjust = 0.5))  # Center the title


The box plot we’ve generated reveals significant variations in the age range of stolen vehicles across different types. For instance, caravans and trail bikes show a broad age range, with a tendency towards older models being stolen more frequently. Conversely, mopeds and tractors have a narrower age range, indicating that newer models in these categories are more commonly targeted. This diversity in age range across vehicle types provides valuable insights into the preferences of vehicle thieves.

  1. Which regions experience the most and least number of stolen vehicles?
# Aggregate data: count of stolen vehicles by region
theft_counts_by_region <- df[, .(Count = .N), by = .(region)]

# Order regions by count
theft_counts_by_region <- theft_counts_by_region[order(-Count)]

# Create a bar chart of theft counts by region with data labels
ggplot(theft_counts_by_region, aes(x = reorder(region, Count), y = Count, fill = region)) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = Count), vjust = -0.3, color = "black", size = 3) +  # Adding data labels
    labs(title = "Number of Vehicle Thefts by Region",
         x = "",
         y = "") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none", 
          plot.title = element_text(hjust = 0.5))  # Center the title


The bar chart paints a clear picture of the regional distribution of vehicle thefts. It’s quite striking to see that Auckland tops the list with a substantial count of 1626 reported thefts.
In stark contrast, Southland reports the fewest incidents, with only 26 cases. This stark difference highlights the varying levels of vehicle theft risk across different regions.

  1. How does the color of vehicles affect theft rates?
# Aggregate data: count of stolen vehicles by color
theft_counts_by_color <- df[, .(Count = .N), by = .(color)]

# Order colors by count
theft_counts_by_color <- theft_counts_by_color[order(-Count)]

# Create a bar chart of theft counts by color
ggplot(theft_counts_by_color, aes(x = reorder(color, Count), y = Count)) +
    geom_bar(stat = "identity", fill = "darkgreen") +  
    geom_text(aes(label = Count), vjust = -0.3, color = "black", size = 3) +  
    labs(title = "Vehicle Theft Counts by Color",
         x = "",
         y = "") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),  
          legend.position = "none",  #
          plot.title = element_text(hjust = 0.5))  # Center the title


The color ‘Silver’ emerges as the most targeted, with as many as 1272 vehicles stolen, closely followed by ‘White’ vehicles at 932 thefts.

  1. How does the population density of a region correlate with the number of stolen vehicles?
# Aggregate data: count of stolen vehicles by region's population density
theft_counts_by_density <- df[, .(Count = .N), by = .(density)]

# Create a scatter plot with a linear regression trend line
ggplot(theft_counts_by_density, aes(x = density, y = Count)) +
    geom_point() +  # Scatter plot
    geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear regression trend line
    labs(title = "Vehicle Theft Counts by Population Density with Trend Line",
         x = "Population Density",
         y = "Count of Thefts") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5)) 
## `geom_smooth()` using formula = 'y ~ x'

The scatter plot with an overlaid linear regression trend line reveals an intriguing pattern: there seems to be a positive correlation between a region’s population density and the number of vehicle thefts. Essentially, as the population density increases, so does the frequency of stolen vehicles. This finding suggests that denser regions might be more susceptible to vehicle theft, possibly due to factors like urbanization and the sheer number of vehicles present.

Step 5: Conclusion

In conclusion, this comprehensive analysis of the stolen vehicle dataset from New Zealand has provided valuable insights into the patterns and characteristics of vehicle thefts. Our investigation spanned various aspects, including the days of the week when thefts are most prevalent, the types of vehicles most at risk, the incidence of luxury vehicle thefts, the average age of stolen vehicles, and regional variations in theft frequencies.

Key findings include the higher frequency of thefts during weekends, with Saturdays experiencing the highest number of incidents. Interestingly, luxury vehicles constituted a minor proportion of thefts, indicating a possible preference for more common or less secure vehicle types among thieves. Additionally, the analysis revealed significant variations in vehicle age depending on the type, with certain types like caravans and trail bikes being older on average at the time of theft.

Regional analysis showed a correlation between population density and theft frequency, suggesting that denser areas are more prone to vehicle thefts. This finding has implications for urban planning and law enforcement strategies.

Overall, the insights gained from this analysis could be instrumental in shaping effective policies and preventive measures to combat vehicle theft. It also underscores the importance of data-driven approaches in understanding and addressing societal challenges. The versatility of R and its packages, particularly data.table and ggplot2, played a crucial role in facilitating this detailed analysis.