Data 101 Final

Introduction

The dataset I chose for this project covers crime rates in the US from 1960 to 2019. The data was taken from the Disaster Center database. The question I aim to answer with this dataset is “How do crime trends change by year?” The dataset is made up of the following variables.

year: The year that the row’s data describes

population: The total population of that year

total: The total number of crimes committed

violent: The total number of violent crimes committed

property: The total number of property crimes committed

murder: The total number of murders committed

forcible_rape: The total number of rape cases

robbery: The total number of robberies

aggravated_assault: The total number of aggravated assaults

burglary: The total number of burglaries

larceny_theft: The total number of larceny thefts

vehicle_theft: The total number of vehicle thefts

I will be using every variable available in the dataset for my analysis. The plots I will be creating include barplots, filled barplots, line plots, and scatterplots to show how trends change each year.

Load libraries and data

options(scipen=999)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(colorspace)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
crime <- read.csv("C:/Users/ronan/OneDrive/School/Data 101/Final Project/us_crime_rates.csv")
head(crime)
##   year population   total violent property murder forcible_rape robbery
## 1 1960  179323175 3384200  288460  3095700   9110         17190  107840
## 2 1961  182992000 3488000  289390  3198600   8740         17220  106670
## 3 1962  185771000 3752200  301510  3450700   8530         17550  110860
## 4 1963  188483000 4109500  316970  3792500   8640         17650  116470
## 5 1964  191141000 4564600  364220  4200400   9360         21420  130390
## 6 1965  193526000 4739400  387390  4352000   9960         23410  138690
##   aggravated_assault burglary larceny_theft vehicle_theft
## 1             154320   912100       1855400        328200
## 2             156760   949600       1913000        336000
## 3             164570   994300       2089600        366800
## 4             174210  1086400       2297800        408300
## 5             203050  1213200       2514400        472800
## 6             215330  1282500       2572600        496900

Data Cleaning

# Reshape to long format: one row per year and crime type.
crime2 <- crime |> 
  pivot_longer(!c(year, population, total, violent, property),
               names_to = "type", values_to = "number")
head(crime2)
## # A tibble: 6 × 7
##    year population   total violent property type                number
##   <int>      <int>   <int>   <int>    <int> <chr>                <int>
## 1  1960  179323175 3384200  288460  3095700 murder                9110
## 2  1960  179323175 3384200  288460  3095700 forcible_rape        17190
## 3  1960  179323175 3384200  288460  3095700 robbery             107840
## 4  1960  179323175 3384200  288460  3095700 aggravated_assault  154320
## 5  1960  179323175 3384200  288460  3095700 burglary            912100
## 6  1960  179323175 3384200  288460  3095700 larceny_theft      1855400
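
To make the reshaping step concrete, here is a minimal sketch on a two-row sample copied from the head(crime) output, keeping only two of the crime columns. The names crime_small and crime_small_long are made up for this illustration and are not part of the project code.

```r
library(tidyr)

# Two years and two crime types copied from the head(crime) output.
# crime_small is a hypothetical name used only for this sketch.
crime_small <- data.frame(
  year    = c(1960, 1961),
  murder  = c(9110, 8740),
  robbery = c(107840, 106670)
)

# Every column except year becomes a (type, number) pair,
# so 2 years x 2 crime types gives 4 rows.
crime_small_long <- crime_small |>
  pivot_longer(!year, names_to = "type", values_to = "number")

print(crime_small_long)
```

The same logic applies to the full dataset: each year ends up with one row per crime type, and the columns that were excluded from the pivot are repeated on every one of those rows.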

Linear Regression

Linear regression felt appropriate because every variable is numerical. The model below predicts the total number of crimes from the year and the population.

model <- lm(total ~ year + population, data = crime2)
summary(model)
## 
## Call:
## lm(formula = total ~ year + population, data = crime2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5313015  -881290   150097  1222646  2644794 
## 
## Coefficients:
##                      Estimate        Std. Error t value            Pr(>|t|)    
## (Intercept) -3402517315.26042   121215444.96522  -28.07 <0.0000000000000002 ***
## year            1799688.68090       64040.80371   28.10 <0.0000000000000002 ***
## population           -0.66284           0.02458  -26.97 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1665000 on 417 degrees of freedom
## Multiple R-squared:  0.7054, Adjusted R-squared:  0.704 
## F-statistic: 499.2 on 2 and 417 DF,  p-value: < 0.00000000000000022
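
One thing worth checking in the output above is the 417 residual degrees of freedom. Because crime2 is in long format, each year's total appears once per crime type (60 years x 7 types = 420 rows), so the model treats repeated rows as separate observations, which makes the standard errors look smaller than yearly data alone would justify. The sketch below illustrates the effect on the six years shown in head(crime); crime_small, crime_repeated, fit_yearly, and fit_repeated are hypothetical names, and fitting on the original crime data frame instead of crime2 is a suggestion rather than what the analysis above did.

```r
# First six years copied from the head(crime) output.
# crime_small is a hypothetical name used only for this sketch.
crime_small <- data.frame(
  year       = 1960:1965,
  population = c(179323175, 182992000, 185771000,
                 188483000, 191141000, 193526000),
  total      = c(3384200, 3488000, 3752200, 4109500, 4564600, 4739400)
)

# Repeat every row seven times, mimicking what the pivot does to
# year, population, and total (one copy per crime type).
crime_repeated <- crime_small[rep(seq_len(nrow(crime_small)), each = 7), ]

fit_yearly   <- lm(total ~ year + population, data = crime_small)
fit_repeated <- lm(total ~ year + population, data = crime_repeated)

# Residual degrees of freedom = number of rows - 3 estimated coefficients.
print(df.residual(fit_yearly))    # 6 rows  - 3 = 3
print(df.residual(fit_repeated))  # 42 rows - 3 = 39
```

On the full data, the equivalent call would be lm(total ~ year + population, data = crime), which uses one row per year.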

Visualizations

crime2 |> 
  ggplot(aes(x = year, y = number, fill = type)) +
  geom_col(position = "stack") +
  ylab("# of Crimes") +
  xlab("Year") +
  labs(title = "Number of Crimes Committed each Year")

crime2 |> 
  ggplot(aes(x = year, y = number, color = type)) +
  geom_line() +
  ylab("# of Crimes") +
  xlab("Year") +
  labs(title = "Number of Crimes Committed each Year")

crime2 |> 
  ggplot(aes(x = year, y = number, fill = type)) +
  geom_col(position = "fill") +
  ylab("Proportion of Crimes") +
  xlab("Year") +
  labs(title = "Proportion of Crimes Committed each Year by Type")

crime2 |>
  ggplot(aes(x = year, y = number, color = type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  ylab("# of Crimes") +
  xlab("Year") +
  labs(title = "Number of Crimes Committed each Year")
## `geom_smooth()` using formula = 'y ~ x'

Total vs. Population

# Use the un-pivoted data so each year is plotted once.
highchart() |>
  hc_yAxis_multiples(
    list(lineWidth = 3, title = list(text = "Population")),
    list(showLastLabel = FALSE, opposite = TRUE, title = list(text = "# of Crimes"))
  ) |>
  hc_add_series(
    data = crime,
    type = "line",
    hcaes(x = year, y = population),
    name = "Population",
    yAxis = 0
  ) |>
  hc_add_series(
    data = crime,
    type = "line",
    hcaes(x = year, y = total),
    name = "Number of Crimes",
    yAxis = 1
  )

The visualizations above show how overall crime trends change over time, and a few patterns stand out. The total number of crimes rose steadily along with the population until it peaked in 1991 at 14,872,900. After 1991, the number of crimes declined steadily even as the population continued to grow. I’m not sure what explains the decline, but I certainly hope the trend continues. The share of each type of crime stays roughly constant from year to year, with larceny theft consistently the most common crime and murder consistently the least common. For further research, it would be appropriate to include more types of crime and, of course, data from more recent years.
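
Since the population grew over the whole period, raw counts and per-capita figures can tell slightly different stories. One possible follow-up, sketched here on the six years shown in head(crime), is to express the total as crimes per 100,000 people. The column name total_per_100k and the data frame names are assumptions made for this example and do not appear in the dataset.

```r
library(dplyr)

# First six years copied from the head(crime) output.
# crime_small is a hypothetical name used only for this sketch.
crime_small <- data.frame(
  year       = 1960:1965,
  population = c(179323175, 182992000, 185771000,
                 188483000, 191141000, 193526000),
  total      = c(3384200, 3488000, 3752200, 4109500, 4564600, 4739400)
)

# Crimes per 100,000 people: divide the count by the population,
# then scale up so the numbers are easier to read.
crime_rates <- crime_small |>
  mutate(total_per_100k = total / population * 100000)

print(crime_rates)
```

Plotting total_per_100k against year for the full dataset would show whether the decline after 1991 looks even steeper once population growth is taken into account.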