For this analysis, I chose to review Philly’s crime data to determine which sections of the city are the most dangerous, and also to answer questions such as when during the year, at what time of day, and on which day of the week most crimes occur.
Pulling Data from Azure Blob Storage
This data was originally downloaded from the City of Philadelphia’s public data. As part of this data pull, I stored the data on Azure Blob Storage instead of GitHub because of GitHub’s 25 MB limit. I also filtered the data to a single year of crime data (2018). The original file is roughly 300 MB in size, so it was necessary to reduce the total number of records.
The data covered 3 years. For this assignment, we kept only the crimes recorded in 2018.
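The year filter described above can be sketched as follows. The original analysis was done in R; this is a minimal Python stand-in that uses only the standard library. The sample rows and the column names `dispatch_date` and `text_general_code` are assumptions for illustration, not taken from the report.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for the downloaded CSV file.
# The column names are assumptions, not confirmed by the report.
SAMPLE_CSV = """dispatch_date,text_general_code
2017-12-31,Thefts
2018-01-01,Vandalism
2018-07-04,Thefts
2019-01-01,Fraud
"""


def filter_year(rows, year, date_field="dispatch_date"):
    """Keep only rows dated in the given year; skip rows with unusable dates."""
    kept = []
    for row in rows:
        try:
            when = datetime.strptime(row[date_field], "%Y-%m-%d")
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed date, similar to dropping NA values
        if when.year == year:
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
crimes_2018 = filter_year(rows, 2018)
print(len(crimes_2018), "of", len(rows), "rows kept")
```

Filtering row by row like this keeps memory use small, which may matter for a file of roughly 300 MB.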
The first question we want to answer is: at what time of the year do most crimes occur? Based on the graph above, crime appears to be most prevalent during the summer months. There also appears to be a peak during the holiday months, but it is not as high as the summer peak.
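The monthly tally behind a chart like this can be sketched as follows. This is a Python illustration under stated assumptions rather than the R code used for the report, and the incident dates are made-up sample values.

```python
from collections import Counter
from datetime import date

# Hypothetical incident dates; the real data has one date per reported crime.
incident_dates = [
    date(2018, 1, 15),
    date(2018, 7, 4),
    date(2018, 7, 20),
    date(2018, 8, 2),
    date(2018, 12, 24),
]

# Count incidents per calendar month (1 = January, 12 = December).
monthly_counts = Counter(d.month for d in incident_dates)

# Text bar chart, one row per month, so empty months stay visible.
for month in range(1, 13):
    print(f"{month:2d} {'#' * monthly_counts[month]}")

peak_month, peak_count = monthly_counts.most_common(1)[0]
```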
The second question we can answer here is: at what time of day do most crimes occur? Based on the histogram, which uses a 0-24 scale for the time of day, most crime occurs during the 11-12 a.m. and 5-6 p.m. hours.
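The hourly binning behind such a histogram can be sketched as follows, again in Python rather than the R used for the report. The `HH:MM:SS` time format and the sample values are assumptions for illustration.

```python
from collections import Counter

# Hypothetical dispatch times in HH:MM:SS form (24-hour clock).
dispatch_times = [
    "11:05:00", "11:40:00",
    "17:15:00", "17:55:00", "17:59:59",
    "03:20:00",
]


def hour_of(time_text):
    """Return the hour bucket (0-23) for an HH:MM:SS string."""
    return int(time_text.split(":")[0])


hourly_counts = Counter(hour_of(t) for t in dispatch_times)

# One bin per hour of the day, including hours with no incidents.
histogram = [hourly_counts.get(hour, 0) for hour in range(24)]

# The two busiest hour buckets, highest count first.
busiest_hours = sorted(range(24), key=lambda h: histogram[h], reverse=True)[:2]
```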
Another of our questions, on what day of the week does most crime occur, can be answered by the graph above. Surprisingly, Monday is when the most criminal activity appears to occur, with crime slowly tapering off towards the weekend.
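A day-of-week count of this kind can be sketched as follows. This is a Python illustration with made-up dates, not the R code behind the graph. Day names come from a fixed list instead of `strftime("%A")` so the result does not depend on the system locale.

```python
from collections import Counter
from datetime import date

DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Hypothetical incident dates (1 January 2018 was a Monday).
incident_dates = [
    date(2018, 1, 1),
    date(2018, 1, 8),
    date(2018, 1, 2),
    date(2018, 1, 6),
]

# date.weekday() returns 0 for Monday through 6 for Sunday.
weekday_counts = Counter(DAY_NAMES[d.weekday()] for d in incident_dates)

# Keep calendar order for plotting instead of sorting by count.
ordered_counts = [(name, weekday_counts[name]) for name in DAY_NAMES]
```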
Here we can see visually how the crime categories compare to one another in relative size.
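The relative sizes shown in such a chart are each category's share of all incidents. A minimal Python sketch of that calculation follows. The category labels are hypothetical sample values, and the report's own figures were produced in R.

```python
from collections import Counter

# Hypothetical category labels, one per incident.
categories = ["Thefts", "Thefts", "Vandalism", "Fraud", "Thefts", "Vandalism"]

counts = Counter(categories)
total = sum(counts.values())

# Share of all incidents per category, largest category first.
shares = {name: count / total for name, count in counts.most_common()}

for name, share in shares.items():
    print(f"{name:<10} {share:.1%}")
```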
Philly’s crime data holds some interesting surprises. I did not expect total crime to be highest during the early morning or early evening. I did expect the summer months to have the highest levels of crime. As a potential future project, I would like to join this data with housing and economic data to determine whether any correlation exists between the variables.
Other Observations: The large data sample was challenging to parse in R, as certain calculations would cause R to hang. I am not certain why this was happening, but it is worth noting that I was able to use almost all of my observations in this report's analysis, sans NA values.