The dataset I chose for this project covers crime rates in the US from 1960 to 2019. The data was taken from the Disaster Center database. The question I aim to answer with this dataset is “How do crime trends change by year?” The dataset is composed of the following variables.
year: The year that the row’s data describes
population: The total US population in that year
total: The total number of crimes committed
violent: The total number of violent crimes committed
property: The total number of property crimes committed
murder: The total number of murders committed
forcible_rape: The total number of rape cases
robbery: The total number of robberies
aggravated_assault: The total number of aggravated assaults
burglary: The total number of burglaries
larceny_theft: The total number of larceny thefts
vehicle_theft: The total number of vehicle thefts
I will be using every variable available in the dataset for my analysis. The plots I will be creating include barplots, filled barplots, line plots, and scatterplots to show how trends change each year.
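As a quick sanity check on the variable definitions, the sketch below tests whether the category columns add up. It assumes (the data source does not state this here) that violent is the sum of murder, forcible_rape, robbery, and aggravated_assault, and that property is the sum of burglary, larceny_theft, and vehicle_theft. To stay self-contained it uses only the first three rows printed by head(crime); `crime_head` is a name made up for this example.

```r
# First three rows of the dataset, copied from the head(crime) output.
crime_head <- read.table(header = TRUE, text = "
year population total violent property murder forcible_rape robbery aggravated_assault burglary larceny_theft vehicle_theft
1960 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
")

# Assumed composition of the two category columns.
violent_sum  <- with(crime_head, murder + forcible_rape + robbery + aggravated_assault)
property_sum <- with(crime_head, burglary + larceny_theft + vehicle_theft)

# TRUE where the components add up exactly to the category column.
violent_sum == crime_head$violent
property_sum == crime_head$property

# Gap between the reported total and violent + property.
crime_head$total - (crime_head$violent + crime_head$property)
```

In these three rows the components match their category columns exactly, while total differs from violent + property by a few tens, which suggests the totals are rounded.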
options(scipen=999)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(colorspace)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
crime <- read.csv("C:/Users/ronan/OneDrive/School/Data 101/Final Project/us_crime_rates.csv")
head(crime)
## year population total violent property murder forcible_rape robbery
## 1 1960 179323175 3384200 288460 3095700 9110 17190 107840
## 2 1961 182992000 3488000 289390 3198600 8740 17220 106670
## 3 1962 185771000 3752200 301510 3450700 8530 17550 110860
## 4 1963 188483000 4109500 316970 3792500 8640 17650 116470
## 5 1964 191141000 4564600 364220 4200400 9360 21420 130390
## 6 1965 193526000 4739400 387390 4352000 9960 23410 138690
## aggravated_assault burglary larceny_theft vehicle_theft
## 1 154320 912100 1855400 328200
## 2 156760 949600 1913000 336000
## 3 164570 994300 2089600 366800
## 4 174210 1086400 2297800 408300
## 5 203050 1213200 2514400 472800
## 6 215330 1282500 2572600 496900
# Reshape to long format: one row per year and crime type.
crime2 <- crime |>
  pivot_longer(!c(year, population, total, violent, property),
               names_to = "type", values_to = "number")
head(crime2)
## # A tibble: 6 × 7
## year population total violent property type number
## <int> <int> <int> <int> <int> <chr> <int>
## 1 1960 179323175 3384200 288460 3095700 murder 9110
## 2 1960 179323175 3384200 288460 3095700 forcible_rape 17190
## 3 1960 179323175 3384200 288460 3095700 robbery 107840
## 4 1960 179323175 3384200 288460 3095700 aggravated_assault 154320
## 5 1960 179323175 3384200 288460 3095700 burglary 912100
## 6 1960 179323175 3384200 288460 3095700 larceny_theft 1855400
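Linear regression felt appropriate because every variable was numerical. Note that the model is fit on crime2, where each year’s total and population appear once per crime type (seven rows per year), so the 417 residual degrees of freedom overstate the amount of independent data and the standard errors are smaller than they would be on the 60 yearly rows.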
model <- lm(total ~ year + population, data = crime2)
summary(model)
##
## Call:
## lm(formula = total ~ year + population, data = crime2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5313015 -881290 150097 1222646 2644794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3402517315.26042 121215444.96522 -28.07 <0.0000000000000002 ***
## year 1799688.68090 64040.80371 28.10 <0.0000000000000002 ***
## population -0.66284 0.02458 -26.97 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1665000 on 417 degrees of freedom
## Multiple R-squared: 0.7054, Adjusted R-squared: 0.704
## F-statistic: 499.2 on 2 and 417 DF, p-value: < 0.00000000000000022
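The effect of fitting on the long table can be illustrated with a small sketch. It uses only the six rows printed by head(crime), so the numbers are illustrative and not results for the full dataset; `crime_head`, `crime_long`, `fit_wide`, and `fit_long` are names made up for this example, and the long table is mimicked by repeating each row seven times (once per crime type).

```r
# Six rows copied from the head(crime) output (only the columns the model uses).
crime_head <- data.frame(
  year       = 1960:1965,
  population = c(179323175, 182992000, 185771000, 188483000, 191141000, 193526000),
  total      = c(3384200, 3488000, 3752200, 4109500, 4564600, 4739400)
)

# Mimic the long table: each year's row repeated once per crime type (7 types).
crime_long <- crime_head[rep(seq_len(nrow(crime_head)), each = 7), ]

fit_wide <- lm(total ~ year + population, data = crime_head)
fit_long <- lm(total ~ year + population, data = crime_long)

# Same coefficients, but the long fit reports far more residual degrees of freedom.
coef(fit_wide)
coef(fit_long)
c(wide = df.residual(fit_wide), long = df.residual(fit_long))

# Standard errors shrink purely because rows were duplicated.
se_wide <- summary(fit_wide)$coefficients[, "Std. Error"]
se_long <- summary(fit_long)$coefficients[, "Std. Error"]
se_long / se_wide
```

On the full data the same comparison would be 57 versus 417 residual degrees of freedom, so fitting on the wide `crime` table (one row per year) would be the more conservative choice.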
crime2 |>
  ggplot(aes(x = year, y = number, fill = type)) +
  geom_col(position = "stack") +
  ylab("# of crimes") +
  xlab("Year") +
  labs(title = "Number of Crimes Committed Each Year")
crime2 |>
  ggplot(aes(x = year, y = number, color = type)) +
  geom_line() +
  ylab("# of crimes") +
  xlab("Year") +
  labs(title = "Number of Crimes Committed Each Year")
crime2 |>
  ggplot(aes(x = year, y = number, fill = type)) +
  # position = "fill" rescales each bar to 1, so the y-axis is a proportion.
  geom_col(position = "fill") +
  ylab("Proportion of crimes") +
  xlab("Year") +
  labs(title = "Proportion of Crimes Committed Each Year by Type")
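The filled barplot rescales each year's bar to a height of 1, so it shows each crime type's share of that year's crimes rather than a count. A minimal sketch of that calculation for a single year, using the 1960 counts printed by head(crime) (`counts_1960` and `shares_1960` are names made up for this example):

```r
# Counts for 1960, copied from the head(crime) output.
counts_1960 <- c(
  murder = 9110, forcible_rape = 17190, robbery = 107840,
  aggravated_assault = 154320, burglary = 912100,
  larceny_theft = 1855400, vehicle_theft = 328200
)

# Each type's share of the year's crimes; this is what position = "fill" plots.
shares_1960 <- counts_1960 / sum(counts_1960)
round(sort(shares_1960, decreasing = TRUE), 3)
```

The same idea could be applied to the whole long table, for example by grouping crime2 by year and dividing number by its yearly sum.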
crime2 |>
  ggplot(aes(x = year, y = number, color = type)) +
  geom_point() +
  # One straight-line fit per crime type (color is inherited from ggplot()).
  geom_smooth(method = lm, se = FALSE, fullrange = TRUE) +
  ylab("# of crimes") +
  xlab("Year") +
  labs(title = "Number of Crimes Committed Each Year")
## `geom_smooth()` using formula = 'y ~ x'
highchart() |>
  hc_yAxis_multiples(
    list(lineWidth = 3, title = list(text = "Population")),
    list(showLastLabel = FALSE, opposite = TRUE, title = list(text = "# of Crimes"))
  ) |>
  # Use the wide table (one row per year) so each year is plotted once.
  hc_add_series(
    data = crime,
    type = "line",
    hcaes(x = year, y = population),
    name = "Population",
    yAxis = 0
  ) |>
  hc_add_series(
    data = crime,
    type = "line",
    hcaes(x = year, y = total),
    name = "Number of Crimes",
    yAxis = 1
  )
The visualizations above depict how overall crime trends change over the years, and a few patterns stand out. The total number of crimes rose steadily along with the population until it peaked in 1991 at 14,872,900 total crimes. After 1991, the number of crimes declined steadily even as the population continued to grow. I’m not sure why this is, but I certainly hope the trend continues. The share of each crime type stays roughly the same from year to year, with larceny theft consistently the most common crime and murder consistently the least common. For further research, it would be appropriate to include more types of crime and, of course, data from more recent years.
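Because the raw counts are compared against a growing population, one possible follow-up is to express total as a rate per 100,000 people. This is only a sketch and not part of the analysis above: `rate_per_100k` is a column name invented here, and only the six rows printed by head(crime) are used.

```r
# Six rows copied from the head(crime) output.
crime_head <- data.frame(
  year       = 1960:1965,
  population = c(179323175, 182992000, 185771000, 188483000, 191141000, 193526000),
  total      = c(3384200, 3488000, 3752200, 4109500, 4564600, 4739400)
)

# Crimes per 100,000 residents in each year.
crime_head$rate_per_100k <- crime_head$total / crime_head$population * 1e5
crime_head[, c("year", "rate_per_100k")]
```

Plotting a rate like this against year would separate changes in crime from changes in population size.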