This data set is called “Violent Crime & Property Crime by County: 1975 to Present” and can be found at https://opendata.maryland.gov. It was created on July 27, 2015 and was last updated on May 23, 2022. It has 1,104 rows and 38 columns. I want to explore how many murders happened in each county, so I will focus on the murder variable and the jurisdiction variable. I will also fit a linear regression of murder on year to see whether a straight line describes the data well.
Load the libraries and dataset
library(tidyverse) # Loads the tidyverse packages (readr, dplyr, ggplot2, ...) that the code below needs
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
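library(dplyr) # Already attached by tidyverse; kept so the dependency is explicit
library(ggplot2) # Already attached by tidyverse; kept so the dependency is explicit
setwd("~/Data 110") # This sets the working directory
crimes <- read_csv("violent_crime.csv") # This reads the csv and stores it in a variable so I can use it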
Clean dataset
Is there any cleaning I need to do on the data set?
names(crimes) <- tolower(names(crimes)) # Makes every variable name lowercase
names(crimes) <- gsub(" ", "_", names(crimes)) # Replaces the spaces in the variable names with _
names(crimes) <- gsub("b_&_e", "breaking_and_entering", names(crimes)) # Replaces the & because it is not valid in a plain R variable name (it would need backticks)
head(crimes) # Shows me the first 6 rows of the dataset
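The renaming steps can be tried on their own before touching the real data. This is a minimal sketch: the vector `raw_names` is a hypothetical stand-in written in the style of the CSV header, not something read from the file.

```r
# Hypothetical column names in the style of the raw CSV header (assumed, not read from the file)
raw_names <- c("JURISDICTION", "YEAR", "MURDER", "B & E", "LARCENY THEFT")

clean_names <- tolower(raw_names)           # "b & e", "larceny theft", ...
clean_names <- gsub(" ", "_", clean_names)  # "b_&_e", "larceny_theft", ...
# fixed = TRUE matches the text literally instead of treating it as a regular expression
clean_names <- gsub("b_&_e", "breaking_and_entering", clean_names, fixed = TRUE)

print(clean_names)
# "jurisdiction" "year" "murder" "breaking_and_entering" "larceny_theft"
```

The order matters: the spaces have to be replaced first, because the last step looks for `b_&_e`, which only exists after the spaces have become underscores.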
crime_summary <- crimes |>
  count(murder) # This tabulates the murder variable: for each distinct value it counts how many rows (county-years) have it. It does not add up the murders.
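To see the difference between tabulating a variable and totalling it, here is a small sketch. The data frame `toy` and its numbers are made up for illustration and are not taken from the Maryland data; `total_murders` is a name I chose.

```r
library(dplyr)

# Made-up numbers for illustration only; not taken from the Maryland data
toy <- data.frame(
  jurisdiction = c("County A", "County A", "County B", "County B"),
  year         = c(1975, 1976, 1975, 1976),
  murder       = c(3, 5, 3, 0)
)

# count(murder) tabulates how many rows share each murder value
value_counts <- toy |> count(murder)
print(value_counts)

# group_by() + summarise() adds up the murders within each jurisdiction
totals <- toy |>
  group_by(jurisdiction) |>
  summarise(total_murders = sum(murder, na.rm = TRUE))
print(totals)
```

In this toy data the value 3 appears in two rows, so `count(murder)` reports n = 2 for it, while the grouped sum gives 8 murders for County A and 3 for County B.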
Plot the Graph
What would a graph of murders look like?
ggplot(crimes, aes(x = jurisdiction, y = murder, fill = jurisdiction)) +
  geom_col() + # Stacks every year's count, so each bar is that county's total across all years
  labs(
    title = "Number of Murders by County",
    x = "Counties",
    y = "Number of Incidents",
    caption = "Source: Maryland Open Data - Crime Dataset"
  ) +
  coord_flip() + # Flip the x and y so the county names do not overlap
  theme_dark() # Made the theme dark to show the colors more
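Because the bars are totals, it can be easier to read the ranking if the counties are sorted by their total. The sketch below shows one way to do that with `reorder()`; it uses a made-up data frame `toy` rather than the real data, and the column name `total_murders` is my own.

```r
library(dplyr)
library(ggplot2)

# Made-up numbers for illustration only; not taken from the Maryland data
toy <- data.frame(
  jurisdiction = c("County A", "County A", "County B", "County B", "County C"),
  year         = c(1975, 1976, 1975, 1976, 1975),
  murder       = c(3, 5, 3, 0, 12)
)

# Total the murders per jurisdiction first, so each bar is one row
totals <- toy |>
  group_by(jurisdiction) |>
  summarise(total_murders = sum(murder, na.rm = TRUE))

# reorder() sorts the jurisdictions by their total before plotting
p <- ggplot(totals, aes(x = reorder(jurisdiction, total_murders), y = total_murders)) +
  geom_col() +
  labs(x = "Counties", y = "Total murders") +
  coord_flip()
print(p)
```

After `coord_flip()`, the county with the largest total ends up at the top of the chart.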
Linear regression
What is the linear regression of this dataset?
model <- lm(murder ~ year, data = crimes) # Using this code I am able to fit the linear regression of murder on year
summary(model) # This just gives me the summary
Call:
lm(formula = murder ~ year, data = crimes)
Residuals:
   Min     1Q Median     3Q    Max
-22.08 -18.67 -16.53 -12.22 333.66
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -182.9807   254.1746  -0.720    0.472
year           0.1015     0.1272   0.798    0.425
Residual standard error: 56.13 on 1102 degrees of freedom
Multiple R-squared: 0.0005772, Adjusted R-squared: -0.0003297
F-statistic: 0.6365 on 1 and 1102 DF, p-value: 0.4252
Linear Regression Plots/model
What do the diagnostic plots of the linear regression look like?
par(mfrow = c(2, 2)) # This arranges the four diagnostic plots in a 2 x 2 grid
plot(model)
The equation
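The equation is murder = 0.1015(year) - 182.9807. The adjusted R-squared is -0.0003297, which means year explains essentially none of the variation in murder, and the p-value of 0.4252 is far above 0.05, so the slope is not statistically significant. Looking at the diagnostic plots, I can also see that the model is not good. In the Residuals vs Fitted plot, the dots should be scattered randomly around zero for a good model, but there is a pattern in the plot. In the Q-Q Residuals plot, the points should follow a straight line for a good model, and they do not. This matches the residual summary above: the median residual is -16.53 while the maximum is 333.66, so the residuals are strongly skewed to the right.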
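As a quick check of the numbers, the equation can be evaluated by hand. This is only a sketch that copies the estimates from the summary output; the year 2000 is an arbitrary example, and the variable names are my own.

```r
# Estimates copied from the summary(model) output
intercept <- -182.9807
slope     <- 0.1015
se_slope  <- 0.1272

# Fitted number of murders for one county in the (arbitrarily chosen) year 2000
predicted_2000 <- intercept + slope * 2000
print(round(predicted_2000, 4))  # 20.0193

# The t value in the output is the estimate divided by its standard error
t_slope <- slope / se_slope
print(round(t_slope, 3))         # 0.798
```

The model predicts about 20 murders for every county in 2000, which shows why one line fitted to all counties at once cannot describe both small counties and Baltimore City.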
Conclusion
When I first ran the code to take a look at the variables, I couldn’t use them as they were, because names containing & or spaces are not valid R variable names. So I replaced & with "and" and replaced the spaces with _ so that my code would run. Then I used the head function to check that gsub had fixed the variable names. I wanted to know how many murder cases there were, so I used the count function on the murder variable. When making the graph, the x-axis looked horrible because the county names overlapped each other, so I flipped the axes so I could clearly see everything. Looking at the graph, I found out that Baltimore City has the highest number of cases, with Prince George’s County being second. I am not surprised that Prince George’s County is second because I have seen it in the news so often over the years. I wish I had been able to make another graph of the other crimes in all the counties, and then find the overall total of all crimes in each county. However, that might not be possible to graph, as it might be too many variables for one graph (this comes from past experience).