K Johann- BAIS Final Project

Author

Kaitlyn Johann

Recorded Crime Analysis

Part 1: Introduction, Data Dictionary, & Summary Statistics

Introduction

My name is Kaitlyn Johann, and I am a senior Business Analytics and Marketing double major with a finance minor. Although I plan to pursue a business-related field after graduation, I also have a strong interest in law and criminal justice. I may look into attending law school in a few years as well!

For my Business Analytics Capstone at Xavier University, I was tasked with analyzing a dataset of my choosing. I chose to look at recorded crime occurrences in Los Angeles, California, from 2020 up until recently. Later, I will also look at a dataset describing cost of living and associated factors in different cities throughout the United States.

For my data analysis, I want to first explore the different variables in the crime dataset. Are crime occurrences higher during certain times of the day? Among certain ages as opposed to others? What are the most common weapons used in these crimes? My main goal here is to gain a better understanding of the crime dataset itself.

For my analysis of the crime dataset paired with the cost of living dataset, I want to try to address this question:

“How does Los Angeles compare to other U.S. cities in cost of living, and how might its cost of living characteristics help explain observed crime patterns within the city?”

For your reference:

Here is the Crime dataset: https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQCG4s-0_v9aRoWLjFO6FGkwAW2rYB4KrHmceAMvM9a3I0Y?download=1

# Crime: Load in packages & data
library(tidyverse)
library(lubridate)
library(dplyr)

crime <-
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQCG4s-0_v9aRoWLjFO6FGkwAW2rYB4KrHmceAMvM9a3I0Y?download=1",
           guess_max = 100000)

Data Dictionary

Crime Dataset

The Crime dataset has 1,005,198 distinct rows and 28 variables of crime occurrences in Los Angeles, CA.

Description of each variable:

  1. DR_NO: Unique identifier for the crime report.
  2. Date Rptd: The date the crime was reported.
  3. DATE OCC: The actual date the crime occurred.
  4. TIME OCC: The time the crime occurred, stored as a number in 24-hour format (e.g., 1420 is 2:20 PM).
  5. AREA: Numeric code representing the geographical area where the crime occurred.
  6. AREA NAME: Name of the geographical area.
  7. Rpt Dist No: Reporting district number for the incident.
  8. Part 1-2: Crime classification (e.g., Part 1 for serious crimes, Part 2 for less serious crimes).
  9. Crm Cd: Numeric code representing the type of crime.
  10. Crm Cd Desc: Description of the crime type.
  11. Mocodes: Modus operandi codes, describing the method used in the crime.
  12. Vict Age: Age of the victim.
  13. Vict Sex: Sex of the victim (e.g., Male, Female, Unknown).
  14. Vict Descent: Ethnicity or descent of the victim.
  15. Premis Cd: Numeric code for the type of premises where the crime occurred.
  16. Premis Desc: Description of the type of premises (e.g., residence, vehicle, commercial).
  17. Weapon Used Cd: Numeric code for the weapon used in the crime, if applicable.
  18. Weapon Desc: Description of the weapon used.
  19. Status: Status code of the crime case (e.g., Open, Solved).
  20. Status Desc: Description of the case status.
  21. Crm Cd 1, 2, 3, & 4: Additional crime codes, if the incident involved multiple offenses.
  22. LOCATION: Text description of the crime location.
  23. Cross Street: Nearby cross street for the crime location.
  24. LAT: Latitude of the crime location.
  25. LON: Longitude of the crime location.

Summary Statistics

crime %>%
  summary()
     DR_NO            Date Rptd           DATE OCC            TIME OCC   
 Min.   :      817   Length:1005198     Length:1005198     Min.   :   1  
 1st Qu.:210616927   Class :character   Class :character   1st Qu.: 900  
 Median :220916044   Mode  :character   Mode  :character   Median :1420  
 Mean   :220227678                                         Mean   :1340  
 3rd Qu.:231110516                                         3rd Qu.:1900  
 Max.   :252104146                                         Max.   :2359  
                                                                         
      AREA        AREA NAME          Rpt Dist No      Part 1-2  
 Min.   : 1.00   Length:1005198     Min.   : 101   Min.   :1.0  
 1st Qu.: 5.00   Class :character   1st Qu.: 587   1st Qu.:1.0  
 Median :11.00   Mode  :character   Median :1139   Median :1.0  
 Mean   :10.69                      Mean   :1116   Mean   :1.4  
 3rd Qu.:16.00                      3rd Qu.:1613   3rd Qu.:2.0  
 Max.   :21.00                      Max.   :2199   Max.   :2.0  
                                                                
     Crm Cd      Crm Cd Desc          Mocodes             Vict Age     
 Min.   :110.0   Length:1005198     Length:1005198     Min.   : -4.00  
 1st Qu.:331.0   Class :character   Class :character   1st Qu.:  0.00  
 Median :442.0   Mode  :character   Mode  :character   Median : 30.00  
 Mean   :500.1                                         Mean   : 28.91  
 3rd Qu.:626.0                                         3rd Qu.: 44.00  
 Max.   :956.0                                         Max.   :120.00  
                                                                       
   Vict Sex         Vict Descent         Premis Cd     Premis Desc       
 Length:1005198     Length:1005198     Min.   :101.0   Length:1005198    
 Class :character   Class :character   1st Qu.:101.0   Class :character  
 Mode  :character   Mode  :character   Median :203.0   Mode  :character  
                                       Mean   :305.6                     
                                       3rd Qu.:501.0                     
                                       Max.   :976.0                     
                                       NA's   :16                        
 Weapon Used Cd   Weapon Desc           Status          Status Desc       
 Min.   :101.0    Length:1005198     Length:1005198     Length:1005198    
 1st Qu.:311.0    Class :character   Class :character   Class :character  
 Median :400.0    Mode  :character   Mode  :character   Mode  :character  
 Mean   :363.9                                                            
 3rd Qu.:400.0                                                            
 Max.   :516.0                                                            
 NA's   :677918                                                           
    Crm Cd 1        Crm Cd 2         Crm Cd 3          Crm Cd 4      
 Min.   :110.0   Min.   :210.0    Min.   :310       Min.   :821.0    
 1st Qu.:331.0   1st Qu.:998.0    1st Qu.:998       1st Qu.:998.0    
 Median :442.0   Median :998.0    Median :998       Median :998.0    
 Mean   :499.9   Mean   :958.1    Mean   :984       Mean   :991.2    
 3rd Qu.:626.0   3rd Qu.:998.0    3rd Qu.:998       3rd Qu.:998.0    
 Max.   :956.0   Max.   :999.0    Max.   :999       Max.   :999.0    
 NA's   :11      NA's   :936039   NA's   :1002884   NA's   :1005134  
   LOCATION         Cross Street            LAT             LON        
 Length:1005198     Length:1005198     Min.   : 0.00   Min.   :-118.7  
 Class :character   Class :character   1st Qu.:34.01   1st Qu.:-118.4  
 Mode  :character   Mode  :character   Median :34.06   Median :-118.3  
                                       Mean   :34.00   Mean   :-118.1  
                                       3rd Qu.:34.16   3rd Qu.:-118.3  
                                       Max.   :34.33   Max.   :   0.0  
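
The summary shows `Date Rptd` and `DATE OCC` stored as character, and `LAT`/`LON` reaching 0, which is not a plausible Los Angeles coordinate. Below is a minimal cleaning sketch on a small made-up data frame. The date layout (month/day/year followed by a time) and the reading of 0 as "unknown" are assumptions, not something the dataset documents here.

```r
# Toy rows standing in for the crime data (values are made up for illustration)
toy <- data.frame(
  `DATE OCC` = c("03/01/2020 12:00:00 AM", "11/15/2021 12:00:00 AM"),
  LAT = c(34.05, 0),
  LON = c(-118.25, 0),
  check.names = FALSE
)

# Assumption: dates are month/day/year text; as.Date() ignores the trailing time part
toy$date_occ <- as.Date(toy$`DATE OCC`, format = "%m/%d/%Y")

# Assumption: a coordinate of exactly 0 is a placeholder for an unknown location
toy$LAT[toy$LAT == 0] <- NA
toy$LON[toy$LON == 0] <- NA

toy
```

With real dates in place, the data could be grouped by year or month, and the placeholder coordinates would no longer pull the `LAT` and `LON` means toward 0.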
                                                                       

Part 2: Descriptive Analysis

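# Keep an untouched copy of the raw data
crime_bak <- crime

# Treat non-positive ages (0 and the negative values seen in the summary) as missing
crime$New_Vict_Ages <- ifelse(crime$`Vict Age` <= 0, NA, crime$`Vict Age`)

# Average victim age among records with a usable age
Avg_Vict_Age <- mean(crime$New_Vict_Ages, na.rm = TRUE)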

The next five visuals are depictions of different variables in the crime dataset that I thought would be fun and interesting to look at!

Visual 1: Victim Ages within the Crime Dataset

ggplot(crime, aes(x = New_Vict_Ages)) +
  geom_histogram(fill = "lightskyblue1", color = "black") +
  labs(title = "Victim Ages within the Crime Dataset", 
       x = "Victim Age", 
       y = "Frequency")

The distribution is right-skewed: most victims with recorded ages fall within the 20-45 age range, with a long tail toward older ages. This is consistent with the calculated average victim age of 39.

This pattern may reflect factors such as greater exposure to public environments, a higher likelihood of being away from home throughout the day, or demographic and geographic factors tied to the populations where crime is more frequently reported.
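
A quick way to check which way a distribution leans is to compare its mean and median: when a long tail of high values pulls the mean above the median, the distribution is right-skewed. The ages below are made up for illustration and are not drawn from the crime dataset.

```r
# Made-up ages: most values are clustered low, with a few much older ones
ages <- c(22, 25, 27, 30, 31, 34, 38, 45, 62, 88)

mean_age   <- mean(ages)    # pulled upward by the few large values
median_age <- median(ages)  # stays near the bulk of the data

# TRUE suggests a right-skewed (positively skewed) distribution
mean_age > median_age
```

The same comparison could be run on `New_Vict_Ages` with `na.rm = TRUE`.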

Visual 2: Top 5 Most Common Crime Types

crime %>%
  group_by(`Crm Cd Desc`) %>%
  summarise(total = n()) %>% 
  arrange(desc(total)) %>%
  head(5) %>%
  ggplot(aes(x = total, y = reorder(`Crm Cd Desc`, total))) +  # order bars by frequency
  geom_col(fill = "lightskyblue1") +
  labs(title = "Top 5 Most Common Crime Types",
       x = "Frequency", 
       y = "Crime Type")

By far, the most common crime type is vehicle theft. This is somewhat surprising, but it may reflect differences in neighborhood security, accessibility, or opportunity, with vehicles being easier to target in certain areas.

The second most common crime type is battery (simple assault), which shows that person-on-person crime is still a major component of reported crime occurrences. Overall, the results suggest that both property-related and violent offenses are prevalent in the dataset.

Visual 3: Top 10 Areas by Crime Frequency

crime %>%
  group_by(`AREA NAME`) %>%
  summarise(total = n()) %>% 
  arrange(desc(total)) %>%
  head(10) %>%
  ggplot(aes(x = total, y = reorder(`AREA NAME`, total))) +  # order bars by frequency
  geom_col(fill = "lightskyblue1") +
  labs(title = "Top 10 Areas by Crime Frequency",
       x = "Number of Crimes", 
       y = "Area")

Among the 10 Los Angeles areas with the most recorded crimes, Central, 77th Street, Pacific, and Southwest have the highest crime frequencies.

The Central area includes downtown Los Angeles and nearby urban neighborhoods. Its higher crime levels are likely tied to dense population, heavy transportation use, and a high concentration of commercial activity, all of which tend to increase opportunities for crime.

Visual 4: Top 10 Weapon Descriptions

crime %>% 
  filter(!is.na(`Weapon Desc`)) %>%
  count(`Weapon Desc` , sort = TRUE) %>%
  head(10)
# A tibble: 10 × 2
   `Weapon Desc`                                       n
   <chr>                                           <int>
 1 STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE) 174777
 2 UNKNOWN WEAPON/OTHER WEAPON                     36394
 3 VERBAL THREAT                                   23848
 4 HAND GUN                                        20186
 5 SEMI-AUTOMATIC PISTOL                            7267
 6 KNIFE WITH BLADE 6INCHES OR LESS                 6841
 7 UNKNOWN FIREARM                                  6581
 8 OTHER KNIFE                                      5880
 9 MACE/PEPPER SPRAY                                3730
10 VEHICLE                                          3260

As depicted in the table, the most common weapon category is strong-arm (hands, fist, feet, or bodily force), with 174,777 incidents, indicating that many incidents involve physical force rather than a separate weapon.

Mace/pepper spray ranks 9th in the top 10, with 3,730 incidents. It was initially interesting to see this ranked so high, but after further reflection, it does make sense given that it is a common form of personal protection in a large urban area like LA.
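
To put these counts in context, the numbers already reported in this document can be turned into shares. This is a rough sketch: it assumes `Weapon Desc` is missing for the same rows as `Weapon Used Cd` (677,918 NAs in the summary statistics), which the output does not confirm.

```r
total_rows     <- 1005198  # rows in the crime dataset
no_weapon_code <- 677918   # NA count for `Weapon Used Cd` in the summary
strong_arm     <- 174777   # strong-arm count from the table above

# Reports that carry any weapon code at all
with_weapon_code <- total_rows - no_weapon_code

# Share of all reports with a weapon code, in percent
share_with_weapon <- round(100 * with_weapon_code / total_rows, 1)

# Share of weapon-coded reports that are strong-arm, in percent
share_strong_arm <- round(100 * strong_arm / with_weapon_code, 1)

c(share_with_weapon = share_with_weapon, share_strong_arm = share_strong_arm)
```

Under that assumption, roughly a third of reports record a weapon, and a little over half of those are strong-arm.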

Visual 5: Crime Occurrences by Time of Day

ggplot(crime, aes(x = `TIME OCC`)) +
  geom_histogram(fill = "lightskyblue1", color = "black") +
  labs(title = "Crime Occurrences by Time of Day",
       x = "Time of Crime (24 Hour Scale)",
       y = "Frequency")

The histogram shows that crime occurs throughout the day, but is most frequent from midday into the evening and nighttime. This pattern likely reflects increased activity during these hours, such as commuting, work, and social or commercial activity, which, by nature, creates more opportunities for crime. It may also relate to differences in visibility and routines later in the day.
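
Because `TIME OCC` is stored as a number in HHMM form (the summary statistics range from 1 to 2359), minute values of 60 to 99 never occur, so the default histogram bins do not line up with clock hours. A minimal sketch of one alternative is to extract the hour first. It assumes the HHMM encoding holds for every row, and the values are made up for illustration.

```r
# Made-up HHMM values, e.g. 1420 means 2:20 PM
time_occ <- c(1, 45, 900, 1420, 1900, 2359)

# Integer division by 100 drops the minutes and keeps the hour (1420 -> 14)
hour_occ <- time_occ %/% 100

# Count per hour, keeping all 24 hours even when a count is zero
table(factor(hour_occ, levels = 0:23))
```

Plotting the hour with one bar per value would give a cleaner picture of the time-of-day pattern than binning the raw HHMM numbers.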

Part 3: Secondary Data Source- Cost of Living in the United States

For your reference:

Here is the Cost of Living dataset: https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQC057mmAhcjR6XFc9S6oBsRAZcGpQ1Wd-jQpyBcbsRzVBY?download=1

# Cost of Living: Load in data (packages were loaded above)

Cost_of_Living_US <- 
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/johannk_xavier_edu/IQC057mmAhcjR6XFc9S6oBsRAZcGpQ1Wd-jQpyBcbsRzVBY?download=1")

Cost of Living Dataset

This dataset consists of 8 variables and 74 unique observations, each a U.S. city.

Description of each variable:

  1. Rank: The city’s ranking based on overall cost of living. Lower rank numbers correspond to more expensive cities.
  2. City: The U.S. city being analyzed.
  3. Cost of Living Index: A measure of the overall cost of living in the city compared to New York, which is set to 100. This includes various expenses like groceries, transportation, utilities, and restaurants.
  4. Rent Index: Measures how expensive housing/rent prices are compared to New York. Higher values indicate more expensive prices.
  5. Cost of Living Plus Rent Index: Combines both general living costs and housing costs into one single measure. This is the “total cost” measure.
  6. Groceries Index: Compares grocery prices to New York prices. Higher values indicate more expensive prices.
  7. Restaurant Price Index: Measures the average cost of dining out and restaurant meals compared to New York.
  8. Local Purchasing Power Index: Measures how much residents can afford with their average salaries in that city. Higher values indicate greater buying power once wages and prices are taken into account.
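
Because each index is expressed relative to New York at 100, subtracting 100 gives the percentage difference from New York. The city names and values below are hypothetical and are not taken from the dataset.

```r
# Hypothetical index values, for illustration only
index <- c(`City A` = 100, `City B` = 80, `City C` = 112.5)

# Percent above (+) or below (-) New York, since New York is fixed at 100
pct_vs_new_york <- index - 100

pct_vs_new_york
```

Here the hypothetical City B would be 20% cheaper than New York, and City C 12.5% more expensive.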

Analysis of Crime and Cost of Living Datasets

To reiterate my question:

“How does Los Angeles compare to other U.S. cities in cost of living, and how might its cost of living characteristics help explain observed crime patterns within the city?”

Visual 1: Top 11- Cost of Living Index Across U.S. Cities

Cost_of_Living_US %>%
  arrange(desc(`Cost.of.Living.Index`)) %>%
  head(11) %>%
  ggplot(aes(x = `Cost.of.Living.Index`, y = reorder(City, `Cost.of.Living.Index`))) +  # order bars by index
  geom_col(fill = "lightskyblue1") +
  labs(title ="Top 11- Cost of Living Index Across U.S. Cities", 
       x = "Cost of Living Index", 
       y = "City")
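
A city's position in this ordering can also be computed directly rather than read off the chart. The sketch below uses a made-up four-row data frame. The city labels and index values are hypothetical, and the exact spelling of the Los Angeles entry in the real `City` column is an assumption.

```r
# Hypothetical stand-in for the cost of living data
cities <- data.frame(
  City = c("City A", "City B", "Los Angeles", "City D"),
  Cost.of.Living.Index = c(100, 95, 80, 70)
)

# Rank 1 is the most expensive city; ties share the lowest rank
cities$rank_by_index <- rank(-cities$Cost.of.Living.Index, ties.method = "min")

# Look up a single city's rank
la_rank <- cities$rank_by_index[cities$City == "Los Angeles"]
la_rank
```

On the real data, the same lookup would be a quick check against the dataset's own `Rank` column.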

Conclusion

As depicted in the visual, Los Angeles ranks 11th among U.S. cities in cost of living, placing it in the upper tier of expensive cities and indicating relatively high financial pressure compared to most other cities contained in the dataset. Within Los Angeles, crime is not evenly distributed, with higher concentrations in areas such as Central (downtown).

Overall, the city’s high cost of living may contribute to financial strain for some residents, particularly in lower-income neighborhoods where economic opportunities are more limited. This disparity may help to explain why crime is more concentrated in certain areas, especially Downtown LA. The fact that vehicle theft is the most common crime is consistent with a link between economic pressure and property-related offenses in urban areas, although the two datasets used here cannot establish that link directly.