# Crime: Load in packages & data
library(tidyverse)
library(lubridate)
library(dplyr)
crime <-
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQCG4s-0_v9aRoWLjFO6FGkwAW2rYB4KrHmceAMvM9a3I0Y?download=1",
           guess_max = 100000)
K Johann - BAIS Final Project
Recorded Crime Analysis
Part 1: Introduction, Data Dictionary, & Summary Statistics
Introduction
My name is Kaitlyn Johann, and I am a senior Business Analytics and Marketing double major with a finance minor. Although I plan to pursue a business-related field after graduation, I also have a strong interest in law and criminal justice. I may look into attending law school in a few years as well!
For my Business Analytics Capstone at Xavier University, we were tasked with analyzing a dataset of our choosing. I chose to look at recorded crime occurrences from 2020 up until recently in Los Angeles, California. Later, I will also be looking at a dataset depicting cost of living and associated factors in different cities throughout the United States.
For my data analysis, I want to first explore the different variables in the crime dataset. Are crime occurrences higher during certain times of the day? Among certain ages as opposed to others? What are the most common weapons used in these crimes? My main goal here is to gain a better understanding of the crime dataset itself.
For my analysis of the crime dataset paired with the cost of living dataset, I want to try to address this question:
“How does Los Angeles compare to other U.S. cities in cost of living, and how might its cost of living characteristics help explain observed crime patterns within the city?”
For your reference:
Here is the Crime dataset: https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQCG4s-0_v9aRoWLjFO6FGkwAW2rYB4KrHmceAMvM9a3I0Y?download=1
Data Dictionary
Crime Dataset
The Crime dataset has 1,005,198 rows, each a recorded crime occurrence in Los Angeles, CA, and 28 variables.
Description of each variable:
- DR_NO: Unique identifier for the crime report.
- Date Rptd: The date the crime was reported.
- DATE OCC: The actual date the crime occurred.
- TIME OCC: The time the crime occurred, in 24-hour HHMM format (e.g., 1420 for 2:20 PM).
- AREA: Numeric code representing the geographical area where the crime occurred.
- AREA NAME: Name of the geographical area.
- Rpt Dist No: Reporting district number for the incident.
- Part 1-2: Crime classification (e.g., Part 1 for serious crimes, Part 2 for less serious crimes).
- Crm Cd: Numeric code representing the type of crime.
- Crm Cd Desc: Description of the crime type.
- Mocodes: Modus operandi codes, describing the method used in the crime.
- Vict Age: Age of the victim.
- Vict Sex: Gender of the victim (e.g., Male, Female, Unknown).
- Vict Descent: Ethnicity or descent of the victim.
- Premis Cd: Numeric code for the type of premises where the crime occurred.
- Premis Desc: Description of the type of premises (e.g., residence, vehicle, commercial).
- Weapon Used Cd: Numeric code for the weapon used in the crime, if applicable.
- Weapon Desc: Description of the weapon used.
- Status: Status code of the crime case (e.g., Open, Solved).
- Status Desc: Description of the case status.
- Crm Cd 1, 2, 3, & 4: Additional crime codes, if the incident involved multiple offenses.
- LOCATION: Text description of the crime location.
- Cross Street: Nearby cross street for the crime location.
- LAT: Latitude of the crime location.
- LON: Longitude of the crime location.
Summary Statistics
crime %>%
  summary()
     DR_NO             Date Rptd           DATE OCC            TIME OCC
 Min.   :      817   Length:1005198      Length:1005198      Min.   :   1
 1st Qu.:210616927   Class :character    Class :character    1st Qu.: 900
 Median :220916044   Mode  :character    Mode  :character    Median :1420
 Mean   :220227678                                           Mean   :1340
 3rd Qu.:231110516                                           3rd Qu.:1900
 Max.   :252104146                                           Max.   :2359
      AREA            AREA NAME          Rpt Dist No           Part 1-2
 Min.   : 1.00       Length:1005198      Min.   : 101        Min.   :1.0
 1st Qu.: 5.00       Class :character    1st Qu.: 587        1st Qu.:1.0
 Median :11.00       Mode  :character    Median :1139        Median :1.0
 Mean   :10.69                           Mean   :1116        Mean   :1.4
 3rd Qu.:16.00                           3rd Qu.:1613        3rd Qu.:2.0
 Max.   :21.00                           Max.   :2199        Max.   :2.0
     Crm Cd          Crm Cd Desc           Mocodes             Vict Age
 Min.   :110.0       Length:1005198      Length:1005198      Min.   : -4.00
 1st Qu.:331.0       Class :character    Class :character    1st Qu.:  0.00
 Median :442.0       Mode  :character    Mode  :character    Median : 30.00
 Mean   :500.1                                               Mean   : 28.91
 3rd Qu.:626.0                                               3rd Qu.: 44.00
 Max.   :956.0                                               Max.   :120.00
   Vict Sex          Vict Descent         Premis Cd          Premis Desc
 Length:1005198      Length:1005198      Min.   :101.0       Length:1005198
 Class :character    Class :character    1st Qu.:101.0       Class :character
 Mode  :character    Mode  :character    Median :203.0       Mode  :character
                                         Mean   :305.6
                                         3rd Qu.:501.0
                                         Max.   :976.0
                                         NA's   :16
 Weapon Used Cd      Weapon Desc           Status            Status Desc
 Min.   :101.0       Length:1005198      Length:1005198      Length:1005198
 1st Qu.:311.0       Class :character    Class :character    Class :character
 Median :400.0       Mode  :character    Mode  :character    Mode  :character
 Mean   :363.9
 3rd Qu.:400.0
 Max.   :516.0
 NA's   :677918
    Crm Cd 1            Crm Cd 2            Crm Cd 3            Crm Cd 4
 Min.   :110.0       Min.   :210.0       Min.   :310         Min.   :821.0
 1st Qu.:331.0       1st Qu.:998.0       1st Qu.:998         1st Qu.:998.0
 Median :442.0       Median :998.0       Median :998         Median :998.0
 Mean   :499.9       Mean   :958.1       Mean   :984         Mean   :991.2
 3rd Qu.:626.0       3rd Qu.:998.0       3rd Qu.:998         3rd Qu.:998.0
 Max.   :956.0       Max.   :999.0       Max.   :999         Max.   :999.0
 NA's   :11          NA's   :936039      NA's   :1002884     NA's   :1005134
   LOCATION          Cross Street            LAT                 LON
 Length:1005198      Length:1005198      Min.   : 0.00       Min.   :-118.7
 Class :character    Class :character    1st Qu.:34.01       1st Qu.:-118.4
 Mode  :character    Mode  :character    Median :34.06       Median :-118.3
                                         Mean   :34.00       Mean   :-118.1
                                         3rd Qu.:34.16       3rd Qu.:-118.3
                                         Max.   :34.33       Max.   :   0.0
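The summary shows a minimum LAT of 0 and a maximum LON of 0, which cannot be locations in Los Angeles. A reasonable guess (an assumption on my part, not something the data dictionary states) is that 0 is used as a placeholder for a missing coordinate. Here is a minimal sketch, on made-up rows rather than the real file, of how those zeros could be recoded as NA before any mapping or averaging:

```r
library(dplyr)

# Toy rows standing in for the crime data; the (0, 0) pair mimics the
# suspected placeholder coordinates (an assumption, not a documented rule).
coords <- tibble(
  LAT = c(34.05, 0, 34.16),
  LON = c(-118.25, 0, -118.40)
)

# Recode exact zeros as missing so they do not distort maps or averages.
coords_clean <- coords %>%
  mutate(LAT = na_if(LAT, 0),
         LON = na_if(LON, 0))

# Average latitude over rows with a recorded coordinate only.
mean(coords_clean$LAT, na.rm = TRUE)
```

The same mutate() step could be applied to the crime data itself, since its coordinate columns are also named LAT and LON.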
Part 2: Descriptive Analysis
# Keep an untouched copy of the raw data before recoding.
crime_bak <- crime
# Ages of 0 or below are treated as unrecorded (the summary shows a minimum of -4).
crime$New_Vict_Ages <- ifelse(crime$`Vict Age` <= 0, NA, crime$`Vict Age`)
Avg_Vict_Age <- mean(crime$New_Vict_Ages, na.rm = TRUE)
The next five visuals explore variables in the crime dataset that I thought would be fun and interesting to look at!
Visual 1: Victim Ages within the Crime Dataset
ggplot(crime, aes(x = New_Vict_Ages)) +
  geom_histogram(fill = "lightskyblue1", color = "black") +
  labs(title = "Victim Ages within the Crime Dataset",
       x = "Victim Age",
       y = "Frequency")
The distribution is right-skewed: most of the data sits toward the left side of the plot, and most victims with recorded ages fall within the 20-45 age range. This is consistent with the calculated average victim age of 39.
This pattern may reflect factors such as greater exposure to public environments, a higher likelihood of being away from home throughout the day, or broader demographic and geographic factors in the populations where crime is more frequently reported.
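The summary statistics report a mean Vict Age of 28.91 with the zeros included, while the average over recorded ages is about 39. The sketch below uses a small made-up vector (not values from the dataset) to illustrate why placeholder ages pull the mean down, and how comparing the mean and median gives a quick check on skew:

```r
# Hypothetical ages; 0 stands in for "age not recorded" and -1 for an entry error.
ages <- c(0, 0, -1, 25, 31, 40, 62)

# Recode non-positive values as missing before summarising.
ages_clean <- ifelse(ages <= 0, NA, ages)

mean(ages)                        # pulled down by the placeholder values
mean(ages_clean, na.rm = TRUE)    # average over recorded ages only
median(ages_clean, na.rm = TRUE)  # a mean above the median suggests a right skew
```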
Visual 2: Top 5 Most Common Crime Types
crime %>%
  group_by(`Crm Cd Desc`) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5) %>%
  ggplot(aes(x = total, y = `Crm Cd Desc`)) +
  geom_col(fill = "lightskyblue1") +
  labs(title = "Top 5 Most Common Crime Types",
       x = "Frequency",
       y = "Crime Type")
By far, the most common crime type is stolen vehicles. This is somewhat surprising, but it may reflect differences in neighborhood security, accessibility, or opportunity in areas where vehicles are easier to target.
The second most common crime type is battery (simple assault), which shows that person-on-person crime is still a major component of reported crime occurrences. Overall, the results suggest that both property-related and violent offenses are prevalent in the dataset.
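One thing to watch in bar charts like these: ggplot2 sorts a character axis alphabetically, so arrange() alone does not control the order of the bars. A minimal sketch of one way to order bars by count, using base R's reorder() on made-up counts (the names and numbers below are placeholders, not results from the crime data):

```r
library(dplyr)
library(ggplot2)

# Made-up counts for illustration only; not values from the crime dataset.
top_crimes <- tibble(
  crime_type = c("Type A", "Type B", "Type C"),
  total = c(120, 300, 45)
)

# reorder() turns crime_type into a factor whose levels follow total,
# so the longest bar is drawn at the top of the horizontal chart.
p <- ggplot(top_crimes, aes(x = total, y = reorder(crime_type, total))) +
  geom_col(fill = "lightskyblue1") +
  labs(x = "Frequency", y = "Crime Type")

p
```

The same idea would apply to the area chart by wrapping the y variable in reorder() with the count column.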
Visual 3: Top 10 Areas by Crime Frequency
crime %>%
  group_by(`AREA NAME`) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(10) %>%
  ggplot(aes(x = total, y = `AREA NAME`)) +
  geom_col(fill = "lightskyblue1") +
  labs(title = "Top 10 Areas by Crime Frequency",
       x = "Number of Crimes",
       y = "Area")
Out of the top 10 Los Angeles, CA areas in the dataset, Central, 77th Street, Pacific, and Southwest have the highest crime frequencies.
The Central area includes downtown Los Angeles and nearby urban neighborhoods. Its higher crime levels likely reflect dense population, heavy transportation use, and a high concentration of commercial activity, all of which tend to increase opportunities for crime.
Visual 4: Top 10 Weapon Descriptions
crime %>%
  filter(!is.na(`Weapon Desc`)) %>%
  count(`Weapon Desc`, sort = TRUE) %>%
  head(10)
# A tibble: 10 × 2
   `Weapon Desc`                                       n
   <chr>                                           <int>
 1 STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) 174777
 2 UNKNOWN WEAPON/OTHER WEAPON                     36394
 3 VERBAL THREAT                                   23848
 4 HAND GUN                                        20186
 5 SEMI-AUTOMATIC PISTOL                            7267
 6 KNIFE WITH BLADE 6INCHES OR LESS                 6841
 7 UNKNOWN FIREARM                                  6581
 8 OTHER KNIFE                                      5880
 9 MACE/PEPPER SPRAY                                3730
10 VEHICLE                                          3260
As depicted in the table, the most common weapon category is strong-arm (hands, fist, feet, or bodily force), at 174,777, indicating that many incidents involve physical violence without weapons.
Mace/pepper spray ranks ninth in the top 10, with 3,730 incidents. It was initially interesting to see this ranked so high, but after further reflection, it does make sense given that it is a common form of personal protection in a large urban area like LA.
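To put these counts in context, the summary statistics show 677,918 missing values for Weapon Used Cd out of 1,005,198 rows. The short calculation below uses only those reported figures, and assumes that rows missing a weapon code are the same rows missing a weapon description:

```r
# Figures taken from the summary output and the weapon table above.
total_rows     <- 1005198
weapon_missing <- 677918
strong_arm     <- 174777

# Incidents with a weapon code recorded.
weapon_recorded <- total_rows - weapon_missing

# Percent of all incidents with a recorded weapon.
pct_recorded <- round(100 * weapon_recorded / total_rows, 1)

# Strong-arm incidents as a percent of incidents with a recorded weapon.
pct_strong_arm <- round(100 * strong_arm / weapon_recorded, 1)

pct_recorded
pct_strong_arm
```

Under that assumption, roughly a third of incidents have a weapon recorded, and strong-arm accounts for a little over half of those.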
Visual 5: Crime Occurrences by Time of Day
ggplot(crime, aes(x = `TIME OCC`)) +
  geom_histogram(fill = "lightskyblue1", color = "black") +
  labs(title = "Crime Occurrences by Time of Day",
       x = "Time of Crime (24 Hour Scale)",
       y = "Frequency")
The histogram shows that crime occurs throughout the day, but is most frequent from midday into the evening and nighttime. This pattern likely reflects increased activity during these hours, such as commuting, work, and social or commercial activity, which, by nature, creates more opportunities for crime. It may also relate to differences in visibility and routines later in the day.
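Because TIME OCC is stored as a number (1 to 2359 in the summary statistics), values such as 1260 through 1299 never occur, so default histogram bins do not line up with clock hours. A minimal sketch, on toy values rather than the real column, of converting the time to an hour of the day with integer division before counting:

```r
library(dplyr)

# Toy time values standing in for `TIME OCC` (hour and minute packed into one number).
times <- tibble(time_occ = c(1, 45, 900, 1420, 1430, 2359))

# Integer division by 100 keeps the hour and drops the minutes.
by_hour <- times %>%
  mutate(hour = time_occ %/% 100) %>%
  count(hour)

by_hour
```

The resulting hour and n columns could then be plotted with geom_col() to get one bar per hour.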
Part 3: Secondary Data Source- Cost of Living in the United States
For your reference:
Here is the Cost of Living dataset: https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQC057mmAhcjR6XFc9S6oBsRAZcGpQ1Wd-jQpyBcbsRzVBY?download=1
# Cost of Living: Load in packages & data
Cost_of_Living_US <-
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQC057mmAhcjR6XFc9S6oBsRAZcGpQ1Wd-jQpyBcbsRzVBY?download=1")
Cost of Living Dataset
This dataset consists of 8 variables and 74 observations, each a U.S. city.
Description of each variable:
- Rank: The city’s ranking based on overall cost of living. Lower rank numbers correspond to more expensive cities.
- City: The U.S. city being analyzed.
- Cost of Living Index: A measure of the overall cost of living in the city compared to New York, which is set to 100. This includes various expenses like groceries, transportation, utilities, and restaurants.
- Rent Index: Measures how expensive housing/rent prices are compared to New York. Higher values indicate more expensive prices.
- Cost of Living Plus Rent Index: Combines both general living costs and housing costs into one single measure. This is the “total cost” measure.
- Groceries Index: Compares grocery prices to New York prices. Higher values indicate more expensive prices.
- Restaurant Price Index: Measures the average cost of dining out and restaurant meals compared to New York.
- Local Purchasing Power Index: Measures how much residents can afford with their average salaries in that city. Higher values indicate greater buying power after accounting for wages and prices.
Analysis of Crime and Cost of Living Datasets
To reiterate my question:
“How does Los Angeles compare to other U.S. cities in cost of living, and how might its cost of living characteristics help explain observed crime patterns within the city?”
Visual 1: Top 11- Cost of Living Index Across U.S. Cities
Cost_of_Living_US %>%
  arrange(desc(`Cost.of.Living.Index`)) %>%
  head(11) %>%
  ggplot(aes(x = `Cost.of.Living.Index`, y = City)) +
  geom_col(fill = "lightskyblue1") +
  labs(title = "Top 11- Cost of Living Index Across U.S. Cities",
       x = "Cost of Living Index",
       y = "City")
Conclusion
As depicted in the visual, Los Angeles ranks 11th among U.S. cities in cost of living, placing it in the upper tier of expensive cities and indicating relatively high financial pressure compared to most other cities contained in the dataset. Within Los Angeles, crime is not evenly distributed, with higher concentrations in areas such as Central (downtown).
Overall, the city’s high cost of living may contribute to financial strain for some residents, particularly in lower-income neighborhoods where economic opportunities are more limited. This disparity may help to explain why crime is more concentrated in certain areas, especially Downtown LA. The fact that vehicle theft is the most common crime is consistent with a possible link between economic pressure and property-related offenses in urban areas, although this analysis does not test that relationship directly.
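As a side note on the ranking used above, a city's position can also be computed directly instead of being read off the chart. The sketch below uses made-up cities and index values (the column name index and all numbers are hypothetical, not taken from the Cost of Living dataset):

```r
library(dplyr)

# Made-up cities and index values for illustration; not the real dataset.
col_toy <- tibble(
  City = c("City A", "City B", "City C", "City D"),
  index = c(100, 72.5, 88.1, 64.0)
)

# min_rank(desc(...)) gives rank 1 to the highest index value.
ranked <- col_toy %>%
  mutate(rank = min_rank(desc(index))) %>%
  arrange(rank)

# Look up one city's rank.
city_c_rank <- filter(ranked, City == "City C")$rank
city_c_rank
```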