Project 1 for DATA 110

Author

Kenneth

Introduction

This data set is called “Violent Crime & Property Crime by County: 1975 to Present” and can be found at https://opendata.maryland.gov. It was created on July 27, 2015 and was last updated on May 23, 2022. It has 1,104 rows and 38 columns. I want to explore how many murders happened in each county, so I will focus on the murder variable and the jurisdiction variable. I will also fit a linear regression of murder on year to see whether a straight-line model describes the data well.

Load the libraries and dataset

library(tidyverse) # Load the core tidyverse packages, including readr (for read_csv), dplyr, and ggplot2
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
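library(dplyr) # Already attached by tidyverse; loaded explicitly for clarity
library(ggplot2) # Already attached by tidyverse; loaded explicitly for clarity
setwd("~/Data 110") # Set the working directory to the folder that holds the csv
crimes <- read_csv("violent_crime.csv") # Read the csv into a tibble named crimes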

Clean dataset

Is there any cleaning I need to do on the data set?

names(crimes) <- tolower(names(crimes)) # Make every column name lowercase
names(crimes) <- gsub(" ", "_", names(crimes)) # Replace each space in the column names with _
names(crimes) <- gsub("b_&_e", "breaking_and_entering", names(crimes)) # Spell out "b & e"; & is an operator in R, so a name containing it can only be used inside backticks
head(crimes) # Show the first 6 rows of the dataset
# A tibble: 6 × 38
  jurisdiction     year population murder  rape robbery agg._assault
  <chr>           <dbl>      <dbl>  <dbl> <dbl>   <dbl>        <dbl>
1 Allegany County  1975      79655      3     5      20          114
2 Allegany County  1976      83923      2     2      24           59
3 Allegany County  1977      82102      3     7      32           85
4 Allegany County  1978      79966      1     2      18           81
5 Allegany County  1979      79721      1     7      18           84
6 Allegany County  1980      80461      2    12      26           79
# ℹ 31 more variables: breaking_and_entering <dbl>, larceny_theft <dbl>,
#   `m/v_theft` <dbl>, grand_total <dbl>, percent_change <chr>,
#   violent_crime_total <dbl>, violent_crime_percent <chr>,
#   violent_crime_percent_change <chr>, property_crime_totals <dbl>,
#   property_crime_percent <chr>, property_crime_percent_change <chr>,
#   `overall_crime_rate_per_100,000_people` <dbl>,
#   `overall_percent_change_per_100,000_people` <chr>, …
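The printed output still shows names such as `agg._assault` and `m/v_theft` that need backticks. As a minimal sketch, assuming the original column names looked roughly like the ones below (they are reconstructed from the printed output, not copied from the csv), a single helper could clean every name at once. The function name clean_names_sketch is hypothetical and is not part of any package.

```r
# Assumed examples of raw column names, reconstructed from the head() output
raw_names <- c("JURISDICTION", "AGG. ASSAULT", "B & E", "M/V THEFT",
               "OVERALL CRIME RATE PER 100,000 PEOPLE")

# Hypothetical helper: lowercase, spell out &, and collapse everything
# that is not a letter or digit into a single underscore
clean_names_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("&", "and", x, fixed = TRUE)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x) # Trim any leading or trailing underscores
}

cleaned <- clean_names_sketch(raw_names)
print(cleaned)
```

With names like these, every column can be typed directly (for example `crimes$agg_assault`) without backticks. Note that this sketch turns "B & E" into b_and_e rather than breaking_and_entering.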

Counting

How many murder cases are there?

crime_summary <- crimes |>
  count(jurisdiction, wt = murder, name = "total_murders") # Add up the murder column within each jurisdiction; count(murder) on its own would only tally how many rows share each murder value
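The difference between counting rows and adding up murders is easy to miss, so here is a minimal sketch on a made-up data frame. The county names and numbers are invented for illustration and are not from the Maryland dataset.

```r
library(dplyr)

# Made-up data: two counties, two years each
toy <- data.frame(
  jurisdiction = c("County A", "County A", "County B", "County B"),
  murder = c(2, 2, 5, 1)
)

# count(murder) tallies how many ROWS have each murder value
value_counts <- toy |> count(murder)

# count(..., wt = murder) sums the murder column within each county
county_totals <- toy |>
  count(jurisdiction, wt = murder, name = "total_murders")

print(value_counts)
print(county_totals)
```

In the toy data, the value 2 appears in two rows, so `value_counts` reports n = 2 for it, while `county_totals` reports 4 murders for County A and 6 for County B.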

Plot the Graph

What would a graph of murders look like?

ggplot(crimes, aes(x = jurisdiction, y = murder, fill = jurisdiction)) +
  geom_col() + # Stacks the yearly murder counts, so each bar shows the county's total
  labs(
    title = "Number of Murders by County",
    x = "Counties",
    y = "Number of Incidents",
    caption = "Source: Maryland Open Data - Crime Dataset"
  ) +
  coord_flip() + # Flip the x and y axes so the county names are readable
  theme_dark() # Dark theme so the fill colors stand out
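One possible variation, shown here only as a sketch on made-up data, is to total the murders per county first and then sort the bars with `reorder()` so the highest county is easy to spot. The toy values below are invented and the object names toy, totals, and p are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up data standing in for the real crimes tibble
toy <- data.frame(
  jurisdiction = c("County A", "County A", "County B", "County B",
                   "County C", "County C"),
  year = c(1975, 1976, 1975, 1976, 1975, 1976),
  murder = c(3, 2, 40, 35, 10, 12)
)

# Total murders per county across all years
totals <- toy |>
  count(jurisdiction, wt = murder, name = "total_murders")

# reorder() sorts the counties by their totals before plotting
p <- ggplot(totals, aes(x = reorder(jurisdiction, total_murders),
                        y = total_murders)) +
  geom_col() +
  coord_flip() +
  labs(x = "Counties", y = "Number of Incidents")

print(p)
```

Plotting the summarized table draws one bar segment per county instead of one stacked segment per year, which also makes the totals available as numbers.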

Linear regression

What is the linear regression of this dataset?

model <- lm(murder ~ year, data = crimes) # Fit a linear regression of murder on year

summary(model) # Print the coefficients, R-squared, and p-value

Call:
lm(formula = murder ~ year, data = crimes)

Residuals:
   Min     1Q Median     3Q    Max 
-22.08 -18.67 -16.53 -12.22 333.66 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -182.9807   254.1746  -0.720    0.472
year           0.1015     0.1272   0.798    0.425

Residual standard error: 56.13 on 1102 degrees of freedom
Multiple R-squared:  0.0005772, Adjusted R-squared:  -0.0003297 
F-statistic: 0.6365 on 1 and 1102 DF,  p-value: 0.4252

Linear Regression Plots/model

What do the diagnostic plots of the model look like?

par(mfrow = c(2, 2)) # Arrange the four diagnostic plots in a 2 x 2 grid
plot(model) # Draw the diagnostic plots for the fitted model

The equation

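The equation is murder = 0.1015(year) - 182.9807. The adjusted R-squared is -0.0003297 and the p-value is 0.4252, so year explains almost none of the variation in murder and the slope is not statistically significant. Looking at the diagnostic plots, I can see that the model is not a good fit. In the Residuals vs Fitted plot, the points should be scattered randomly around zero, but there is a pattern in the plot. In the Q-Q Residuals plot, the points should follow the straight reference line, and they do not. The residual summary agrees: the median residual is -16.53 while the maximum is 333.66, so the residuals are strongly right-skewed.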
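To see what the equation implies, here is a small sketch that plugs two years into it using the rounded coefficients from the summary output. The function name predict_murders is hypothetical, and 2020 is just an illustrative end year.

```r
# Rounded coefficients copied from the summary(model) output
intercept <- -182.9807
slope <- 0.1015

# Hypothetical helper: predicted murders for one jurisdiction in one year
predict_murders <- function(year) intercept + slope * year

pred_1975 <- predict_murders(1975) # about 17.48
pred_2020 <- predict_murders(2020) # about 22.05

# Predicted change over the 45 years between the two
change <- slope * (2020 - 1975) # about 4.57

print(c(pred_1975, pred_2020, change))
```

The predicted change of about 4.57 murders over 45 years is tiny compared with the residual standard error of 56.13, which is another way of seeing that year tells us very little about the murder counts.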

Conclusion

When I first tried to work with the variables, my code would not run because one of the column names contained &, which R treats as an operator, so I replaced it with the word and. I also replaced the spaces with _ so that my code would run. Then I used the head function to check that the gsub calls had fixed the variable names. I wanted to know how many murder cases there were, so I used the count function to count them. When I first made the graph, the x-axis looked horrible because the county names overlapped each other, so I flipped the axes so I could see everything clearly. Looking at the graph, I found that Baltimore City has the highest number of cases, with Prince George’s County second. I am not surprised that Prince George’s County is second because I have seen it in the news so often over the years. I wish I had been able to make another graph of the other crimes in all the counties, and then find the overall total of all crimes in each county. However, that might not be possible to graph because it might involve too many variables (this comes from past experience).