This project analyzes state-level crime data using linear regression models. The dataset comes from: https://ucrdatatool.gov/Search/Crime/State/StatebyState.cfm?NoVariables=Y. It describes US violent crime at the state level (20 states) from 1960 to 2014, with rates per 100,000 population for crime types such as murder, legacy rape, and robbery, as reported by the Uniform Crime Report. I am interested in the relationship between overall US violent crime and rape. In other words, I want to know whether violent crime has a weak or a strong correlation with rape.
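The "weak or strong" question can be quantified with Pearson's correlation coefficient. The sketch below is only an illustration: the `toy` data frame holds made-up counts (an assumption, not values from the UCR data), and it shows how base R's `cor()` and `cor.test()` report the coefficient and a significance test. The same calls should apply to the real columns once the data are loaded.

```r
# Toy data: made-up counts, NOT taken from the UCR dataset
toy <- data.frame(
  rape          = c(200, 350, 500, 800, 1200, 1500),
  violent_crime = c(4000, 6500, 9800, 15000, 21000, 28500)
)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(toy$rape, toy$violent_crime)

# Test of the null hypothesis that the true correlation is 0
ct <- cor.test(toy$rape, toy$violent_crime)

print(r)
print(ct$p.value)
```

Values of `r` close to 1 or -1 indicate a strong linear association, and values near 0 indicate a weak one.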
install.packages("readxl")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("caret")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("rmarkdown")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(readxl) # For reading Excel files
library(dplyr) # For data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
# Read the CSV file
df <- read.csv("CrimeStatebyState.csv")
#print(df)
str(df)
## 'data.frame': 2800 obs. of 11 variables:
## $ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Year : int 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
## $ Population : int 3266740 3302000 3358000 3347000 3407000 3462000 3517000 3540000 3566000 3531000 ...
## $ Violent_crime_total : int 6097 5564 5283 6115 7260 6916 8098 8448 8288 8842 ...
## $ Murder_and_nonnegligent_Manslaughter: int 406 427 316 340 316 395 384 415 421 485 ...
## $ Murder_per100000 : num 12.43 12.93 9.41 10.16 9.28 ...
## $ Legacy_rape_.1 : int 281 252 218 192 397 367 341 371 396 494 ...
## $ Revised_rape_.2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ Robbery : int 898 630 754 828 992 992 1124 1167 1462 1448 ...
## $ Aggravated_assault : int 4512 4255 3995 4755 5555 5162 6249 6495 6009 6415 ...
## $ State_code : chr "AL" "AL" "AL" "AL" ...
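The dataset description mentions rates per 100,000 population, while most of the columns printed above are raw counts. A minimal sketch of the conversion, using the first Alabama row from the `str()` output; `per_100k` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: convert a raw count to a rate per 100,000 residents
per_100k <- function(count, population) {
  count / population * 1e5
}

# Alabama, 1960: 406 murders, population 3,266,740 (values from the output above)
murder_rate <- per_100k(406, 3266740)
print(murder_rate)

# The same helper applied to the legacy rape count for that row
rape_rate <- per_100k(281, 3266740)
print(rape_rate)
```

The first result should agree with the `Murder_per100000` value printed for that row, which suggests the column was derived this way.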
# View current column names
names(df)
## [1] "State"
## [2] "Year"
## [3] "Population"
## [4] "Violent_crime_total"
## [5] "Murder_and_nonnegligent_Manslaughter"
## [6] "Murder_per100000"
## [7] "Legacy_rape_.1"
## [8] "Revised_rape_.2"
## [9] "Robbery"
## [10] "Aggravated_assault"
## [11] "State_code"
# View the first few rows of the data
head(df)
## State Year Population Violent_crime_total
## 1 Alabama 1960 3266740 6097
## 2 Alabama 1961 3302000 5564
## 3 Alabama 1962 3358000 5283
## 4 Alabama 1963 3347000 6115
## 5 Alabama 1964 3407000 7260
## 6 Alabama 1965 3462000 6916
## Murder_and_nonnegligent_Manslaughter Murder_per100000 Legacy_rape_.1
## 1 406 12.428292 281
## 2 427 12.931557 252
## 3 316 9.410363 218
## 4 340 10.158351 192
## 5 316 9.275022 397
## 6 395 11.409590 367
## Revised_rape_.2 Robbery Aggravated_assault State_code
## 1 NA 898 4512 AL
## 2 NA 630 4255 AL
## 3 NA 754 3995 AL
## 4 NA 828 4755 AL
## 5 NA 992 5555 AL
## 6 NA 992 5162 AL
# View the last few rows of the data
tail(df)
## State Year Population Violent_crime_total
## 2795 Wyoming 2009 544270 1196
## 2796 Wyoming 2010 564554 1117
## 2797 Wyoming 2011 567356 1245
## 2798 Wyoming 2012 576626 1161
## 2799 Wyoming 2013 583223 1212
## 2800 Wyoming 2014 584153 1142
## Murder_and_nonnegligent_Manslaughter Murder_per100000 Legacy_rape_.1
## 2795 11 2.021056 172
## 2796 8 1.417048 162
## 2797 18 3.172611 146
## 2798 14 2.427917 154
## 2799 17 2.914837 144
## 2800 16 2.739008 126
## Revised_rape_.2 Robbery Aggravated_assault State_code
## 2795 NA 78 935 WY
## 2796 NA 77 870 WY
## 2797 NA 71 1010 WY
## 2798 NA 61 932 WY
## 2799 204 74 917 WY
## 2800 174 53 899 WY
# Summary and overview
summary(df)
## State Year Population Violent_crime_total
## Length:2800 Min. :1960 Min. : 226167 Min. : 37
## Class :character 1st Qu.:1973 1st Qu.: 1189750 1st Qu.: 3065
## Mode :character Median :1987 Median : 3235500 Median : 10652
## Mean :1987 Mean : 4805671 Mean : 23270
## 3rd Qu.:2001 3rd Qu.: 5750048 3rd Qu.: 27192
## Max. :2014 Max. :38802500 Max. :345624
##
## Murder_and_nonnegligent_Manslaughter Murder_per100000 Legacy_rape_.1
## Min. : 1.0 Min. : 0.1565 Min. : 6.0
## 1st Qu.: 46.0 1st Qu.: 3.1041 1st Qu.: 293.8
## Median : 178.5 Median : 5.4413 Median : 822.0
## Mean : 339.2 Mean : 6.5949 Mean : 1426.3
## 3rd Qu.: 434.0 3rd Qu.: 8.6132 3rd Qu.: 1712.5
## Max. :4096.0 Max. :80.6020 Max. :13693.0
##
## Revised_rape_.2 Robbery Aggravated_assault State_code
## Min. : 110.0 Min. : 8 Min. : 14 Length:2800
## 1st Qu.: 772.8 1st Qu.: 782 1st Qu.: 1879 Class :character
## Median : 1592.0 Median : 3126 Median : 6288 Mode :character
## Mean : 2258.2 Mean : 8141 Mean : 13340
## 3rd Qu.: 2518.0 3rd Qu.: 7778 3rd Qu.: 16183
## Max. :11527.0 Max. :130897 Max. :198045
## NA's :2698
# Check for missing values
colSums(is.na(df))
## State Year
## 0 0
## Population Violent_crime_total
## 0 0
## Murder_and_nonnegligent_Manslaughter Murder_per100000
## 0 0
## Legacy_rape_.1 Revised_rape_.2
## 0 2698
## Robbery Aggravated_assault
## 0 0
## State_code
## 0
# Remove columns with any missing values
df <- df[, colSums(is.na(df)) == 0]
# Confirm that no missing values remain
colSums(is.na(df))
## State Year
## 0 0
## Population Violent_crime_total
## 0 0
## Murder_and_nonnegligent_Manslaughter Murder_per100000
## 0 0
## Legacy_rape_.1 Robbery
## 0 0
## Aggravated_assault State_code
## 0 0
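The column filter used above can be tried on a small example. The sketch below uses a made-up `toy` data frame (an assumption for illustration only) and adds `drop = FALSE`, which keeps the result a data frame even if only one column survives the filter.

```r
# Toy data frame (made-up values) with one column containing NA
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c("x", "y", "z"))

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only columns with zero missing values; drop = FALSE preserves the
# data frame structure even when a single column remains
clean <- toy[, na_counts == 0, drop = FALSE]
print(names(clean))
```

Dropping the whole column is a reasonable choice when most of it is missing, as with `Revised_rape_.2` (2698 of 2800 values are `NA`); for a column with only a few gaps, removing the affected rows might lose less information.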
# Set new column names
names(df) <- c("State", "Year", "Population", "Violent_crime", " Murder", "Murder_rate", "Legacy_rape", "Robbery", " Assault", " State_code")
# Summary and overview after removing missing values
summary(df)
## State Year Population Violent_crime
## Length:2800 Min. :1960 Min. : 226167 Min. : 37
## Class :character 1st Qu.:1973 1st Qu.: 1189750 1st Qu.: 3065
## Mode :character Median :1987 Median : 3235500 Median : 10652
## Mean :1987 Mean : 4805671 Mean : 23270
## 3rd Qu.:2001 3rd Qu.: 5750048 3rd Qu.: 27192
## Max. :2014 Max. :38802500 Max. :345624
## Murder Murder_rate Legacy_rape Robbery
## Min. : 1.0 Min. : 0.1565 Min. : 6.0 Min. : 8
## 1st Qu.: 46.0 1st Qu.: 3.1041 1st Qu.: 293.8 1st Qu.: 782
## Median : 178.5 Median : 5.4413 Median : 822.0 Median : 3126
## Mean : 339.2 Mean : 6.5949 Mean : 1426.3 Mean : 8141
## 3rd Qu.: 434.0 3rd Qu.: 8.6132 3rd Qu.: 1712.5 3rd Qu.: 7778
## Max. :4096.0 Max. :80.6020 Max. :13693.0 Max. :130897
## Assault State_code
## Min. : 14 Length:2800
## 1st Qu.: 1879 Class :character
## Median : 6288 Mode :character
## Mean : 13340
## 3rd Qu.: 16183
## Max. :198045
str(df)
## 'data.frame': 2800 obs. of 10 variables:
## $ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ Year : int 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
## $ Population : int 3266740 3302000 3358000 3347000 3407000 3462000 3517000 3540000 3566000 3531000 ...
## $ Violent_crime: int 6097 5564 5283 6115 7260 6916 8098 8448 8288 8842 ...
## $ Murder : int 406 427 316 340 316 395 384 415 421 485 ...
## $ Murder_rate : num 12.43 12.93 9.41 10.16 9.28 ...
## $ Legacy_rape : int 281 252 218 192 397 367 341 371 396 494 ...
## $ Robbery : int 898 630 754 828 992 992 1124 1167 1462 1448 ...
## $ Assault : int 4512 4255 3995 4755 5555 5162 6249 6495 6009 6415 ...
## $ State_code : chr "AL" "AL" "AL" "AL" ...
head(df)
## State Year Population Violent_crime Murder Murder_rate Legacy_rape Robbery
## 1 Alabama 1960 3266740 6097 406 12.428292 281 898
## 2 Alabama 1961 3302000 5564 427 12.931557 252 630
## 3 Alabama 1962 3358000 5283 316 9.410363 218 754
## 4 Alabama 1963 3347000 6115 340 10.158351 192 828
## 5 Alabama 1964 3407000 7260 316 9.275022 397 992
## 6 Alabama 1965 3462000 6916 395 11.409590 367 992
## Assault State_code
## 1 4512 AL
## 2 4255 AL
## 3 3995 AL
## 4 4755 AL
## 5 5555 AL
## 6 5162 AL
tail(df)
## State Year Population Violent_crime Murder Murder_rate Legacy_rape
## 2795 Wyoming 2009 544270 1196 11 2.021056 172
## 2796 Wyoming 2010 564554 1117 8 1.417048 162
## 2797 Wyoming 2011 567356 1245 18 3.172611 146
## 2798 Wyoming 2012 576626 1161 14 2.427917 154
## 2799 Wyoming 2013 583223 1212 17 2.914837 144
## 2800 Wyoming 2014 584153 1142 16 2.739008 126
## Robbery Assault State_code
## 2795 78 935 WY
## 2796 77 870 WY
## 2797 71 1010 WY
## 2798 61 932 WY
## 2799 74 917 WY
## 2800 53 899 WY
# Convert data types
# Strip stray whitespace from the column names first: with a leading space in
# " Murder", df$Murder would partially match Murder_rate and append a spurious column
names(df) <- trimws(names(df))
df$State <- as.factor(df$State)
df$Year <- as.integer(df$Year)
df$Population <- as.numeric(df$Population)
df$Violent_crime <- as.integer(df$Violent_crime)
df$Murder <- as.integer(df$Murder)
df$Legacy_rape <- as.integer(df$Legacy_rape)
df$Robbery <- as.integer(df$Robbery)
# Check column data types
sapply(df, class)
## State Year Population Violent_crime Murder
## "factor" "integer" "numeric" "integer" "integer"
## Murder_rate Legacy_rape Robbery Assault State_code
## "numeric" "integer" "integer" "integer" "character"
# Simple linear regression: Violent_crime is the dependent variable
# and Legacy_rape is the independent variable
model <- lm(Violent_crime ~ Legacy_rape, data = df)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = Violent_crime ~ Legacy_rape, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62719 -3479 795 2610 123798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2845.5711 346.0299 -8.223 2.99e-16 ***
## Legacy_rape 18.3094 0.1457 125.679 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14640 on 2798 degrees of freedom
## Multiple R-squared: 0.8495, Adjusted R-squared: 0.8495
## F-statistic: 1.58e+04 on 1 and 2798 DF, p-value: < 2.2e-16
# Multiple regression: Violent_crime is the dependent variable
# and Population, Legacy_rape, and Robbery are the independent variables
model <- lm(Violent_crime ~ Population + Legacy_rape + Robbery, data = df)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = Violent_crime ~ Population + Legacy_rape + Robbery,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37135 -1753 392 1505 60941
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.226e+03 1.844e+02 -6.645 3.63e-11 ***
## Population 5.059e-04 5.908e-05 8.562 < 2e-16 ***
## Legacy_rape 6.954e+00 1.658e-01 41.931 < 2e-16 ***
## Robbery 1.492e+00 1.721e-02 86.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6969 on 2796 degrees of freedom
## Multiple R-squared: 0.9659, Adjusted R-squared: 0.9659
## F-statistic: 2.643e+04 on 3 and 2796 DF, p-value: < 2.2e-16
The simple linear regression equation is: Violent_crime = 18.31(Legacy_rape) - 2845.57. Overall model fit: F-statistic: 1.58e+04 on 1 and 2798 DF, p-value: < 2.2e-16. The model is therefore highly significant, suggesting that rape contributes meaningfully to predicting violent crime. The adjusted R² of 0.8495 means the model explains 84.95% of the variation in violent crime, so the relationship is strong and my independent variable, Legacy_rape, is a good predictor of violent crime.
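As a worked example, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the `summary(model)` output of the simple regression; the helper name `predict_violent` and the input of 1000 legacy rapes are illustrative assumptions, not part of the original analysis. It also recovers the correlation coefficient, since in a one-predictor model the magnitude of Pearson's r is the square root of R².

```r
# Coefficients copied from the summary(model) output of the simple regression
intercept <- -2845.5711
slope     <- 18.3094

# Hypothetical helper: predicted violent crimes for a given legacy rape count
predict_violent <- function(legacy_rape) {
  intercept + slope * legacy_rape
}

# Illustrative input (an assumption, not a row from the dataset)
pred <- predict_violent(1000)
print(pred)

# For a one-predictor model, |r| is the square root of R-squared
r <- sqrt(0.8495)
print(round(r, 3))
```

Because the slope is positive, r is positive as well, at roughly 0.92, which points to a strong linear association between the two counts.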
The multiple regression equation, by contrast, is: Violent_crime = 5.059e-04(Population) + 6.954e+00(Legacy_rape) + 1.492e+00(Robbery) - 1.226e+03. Overall model fit: F-statistic: 2.643e+04 on 3 and 2796 DF, p-value: < 2.2e-16, which is statistically significant, meaning that there is a strong relationship between violent crime and population, rape, and robbery. The model explains 96.59% of the variation in the observations.
The slope can be interpreted as follows: for each additional reported legacy rape, the model predicts an increase of about 6.95 violent crimes, holding population and robbery constant.
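The "holding the other predictors constant" reading of a slope can be checked numerically. The sketch below copies the coefficients from the printed summary of the three-predictor model (Population, Legacy_rape, Robbery); the helper `predict_violent` and the input values are illustrative assumptions rather than rows from the dataset.

```r
# Coefficients copied from the printed summary of the three-predictor model
b <- c(intercept = -1.226e+03, Population = 5.059e-04,
       Legacy_rape = 6.954, Robbery = 1.492)

# Hypothetical helper: linear prediction for one state-year
predict_violent <- function(population, legacy_rape, robbery) {
  unname(b["intercept"] + b["Population"] * population +
           b["Legacy_rape"] * legacy_rape + b["Robbery"] * robbery)
}

# Illustrative inputs (assumptions, not rows from the dataset)
base     <- predict_violent(3e6, 500, 1000)
plus_one <- predict_violent(3e6, 501, 1000)

# With population and robbery fixed, one more legacy rape shifts the
# prediction by exactly the Legacy_rape coefficient
print(plus_one - base)
```

The same check works for any predictor: changing one input by a single unit while fixing the rest moves the prediction by that predictor's coefficient.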