Introduction

This project analyzes state-level crime data using linear regression models. The dataset comes from the UCR Data Tool (https://ucrdatatool.gov/Search/Crime/State/StatebyState.cfm?NoVariables=Y). It describes US violent crime at the state level from 1960 to 2014 (2,800 state-year observations, Alabama through Wyoming), with rates per 100,000 population for crime types such as murder, legacy rape, and robbery, as reported by the Uniform Crime Report. I am interested in the relationship between total US violent crime and rape. In other words, I want to know whether US violent crime has a weak or a strong correlation with rape.

Setup and Load Libraries

install.packages("readxl")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("caret")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("rmarkdown")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(readxl) # For reading Excel files
library(dplyr)  # For data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(caret)
## Loading required package: lattice
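
The install.packages() calls above download every package again each time the script runs. A minimal sketch of one way to avoid that is to install only what is missing. The helper name missing_packages is hypothetical (it is not part of any package), and the install step is left commented out so the sketch has no side effects beyond loading namespaces.

```r
# Packages attached in this analysis (ggplot2 is attached above as well).
required_pkgs <- c("readxl", "dplyr", "ggplot2", "caret", "rmarkdown")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_pkgs)

# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```

With this pattern a second run of the script skips the download step entirely once to_install is empty.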

Data Import, Cleaning, and Exploratory Data Analysis

# Read the CSV file
df <- read.csv("CrimeStatebyState.csv")
#print(df)
str(df)
## 'data.frame':    2800 obs. of  11 variables:
##  $ State                               : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Year                                : int  1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
##  $ Population                          : int  3266740 3302000 3358000 3347000 3407000 3462000 3517000 3540000 3566000 3531000 ...
##  $ Violent_crime_total                 : int  6097 5564 5283 6115 7260 6916 8098 8448 8288 8842 ...
##  $ Murder_and_nonnegligent_Manslaughter: int  406 427 316 340 316 395 384 415 421 485 ...
##  $ Murder_per100000                    : num  12.43 12.93 9.41 10.16 9.28 ...
##  $ Legacy_rape_.1                      : int  281 252 218 192 397 367 341 371 396 494 ...
##  $ Revised_rape_.2                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Robbery                             : int  898 630 754 828 992 992 1124 1167 1462 1448 ...
##  $ Aggravated_assault                  : int  4512 4255 3995 4755 5555 5162 6249 6495 6009 6415 ...
##  $ State_code                          : chr  "AL" "AL" "AL" "AL" ...
# View current column names
names(df)
##  [1] "State"                               
##  [2] "Year"                                
##  [3] "Population"                          
##  [4] "Violent_crime_total"                 
##  [5] "Murder_and_nonnegligent_Manslaughter"
##  [6] "Murder_per100000"                    
##  [7] "Legacy_rape_.1"                      
##  [8] "Revised_rape_.2"                     
##  [9] "Robbery"                             
## [10] "Aggravated_assault"                  
## [11] "State_code"
# View the first few rows of the data
head(df)
##     State Year Population Violent_crime_total
## 1 Alabama 1960    3266740                6097
## 2 Alabama 1961    3302000                5564
## 3 Alabama 1962    3358000                5283
## 4 Alabama 1963    3347000                6115
## 5 Alabama 1964    3407000                7260
## 6 Alabama 1965    3462000                6916
##   Murder_and_nonnegligent_Manslaughter Murder_per100000 Legacy_rape_.1
## 1                                  406        12.428292            281
## 2                                  427        12.931557            252
## 3                                  316         9.410363            218
## 4                                  340        10.158351            192
## 5                                  316         9.275022            397
## 6                                  395        11.409590            367
##   Revised_rape_.2 Robbery Aggravated_assault State_code
## 1              NA     898               4512         AL
## 2              NA     630               4255         AL
## 3              NA     754               3995         AL
## 4              NA     828               4755         AL
## 5              NA     992               5555         AL
## 6              NA     992               5162         AL
# View the last few rows of the data
tail(df)
##        State Year Population Violent_crime_total
## 2795 Wyoming 2009     544270                1196
## 2796 Wyoming 2010     564554                1117
## 2797 Wyoming 2011     567356                1245
## 2798 Wyoming 2012     576626                1161
## 2799 Wyoming 2013     583223                1212
## 2800 Wyoming 2014     584153                1142
##      Murder_and_nonnegligent_Manslaughter Murder_per100000 Legacy_rape_.1
## 2795                                   11         2.021056            172
## 2796                                    8         1.417048            162
## 2797                                   18         3.172611            146
## 2798                                   14         2.427917            154
## 2799                                   17         2.914837            144
## 2800                                   16         2.739008            126
##      Revised_rape_.2 Robbery Aggravated_assault State_code
## 2795              NA      78                935         WY
## 2796              NA      77                870         WY
## 2797              NA      71               1010         WY
## 2798              NA      61                932         WY
## 2799             204      74                917         WY
## 2800             174      53                899         WY
# Summary and overview
summary(df)
##     State                Year        Population       Violent_crime_total
##  Length:2800        Min.   :1960   Min.   :  226167   Min.   :    37     
##  Class :character   1st Qu.:1973   1st Qu.: 1189750   1st Qu.:  3065     
##  Mode  :character   Median :1987   Median : 3235500   Median : 10652     
##                     Mean   :1987   Mean   : 4805671   Mean   : 23270     
##                     3rd Qu.:2001   3rd Qu.: 5750048   3rd Qu.: 27192     
##                     Max.   :2014   Max.   :38802500   Max.   :345624     
##                                                                          
##  Murder_and_nonnegligent_Manslaughter Murder_per100000  Legacy_rape_.1   
##  Min.   :   1.0                       Min.   : 0.1565   Min.   :    6.0  
##  1st Qu.:  46.0                       1st Qu.: 3.1041   1st Qu.:  293.8  
##  Median : 178.5                       Median : 5.4413   Median :  822.0  
##  Mean   : 339.2                       Mean   : 6.5949   Mean   : 1426.3  
##  3rd Qu.: 434.0                       3rd Qu.: 8.6132   3rd Qu.: 1712.5  
##  Max.   :4096.0                       Max.   :80.6020   Max.   :13693.0  
##                                                                          
##  Revised_rape_.2      Robbery       Aggravated_assault  State_code       
##  Min.   :  110.0   Min.   :     8   Min.   :    14     Length:2800       
##  1st Qu.:  772.8   1st Qu.:   782   1st Qu.:  1879     Class :character  
##  Median : 1592.0   Median :  3126   Median :  6288     Mode  :character  
##  Mean   : 2258.2   Mean   :  8141   Mean   : 13340                       
##  3rd Qu.: 2518.0   3rd Qu.:  7778   3rd Qu.: 16183                       
##  Max.   :11527.0   Max.   :130897   Max.   :198045                       
##  NA's   :2698
# Check for missing values
colSums(is.na(df))
##                                State                                 Year 
##                                    0                                    0 
##                           Population                  Violent_crime_total 
##                                    0                                    0 
## Murder_and_nonnegligent_Manslaughter                     Murder_per100000 
##                                    0                                    0 
##                       Legacy_rape_.1                      Revised_rape_.2 
##                                    0                                 2698 
##                              Robbery                   Aggravated_assault 
##                                    0                                    0 
##                           State_code 
##                                    0
# Remove columns with any missing values
df <- df[, colSums(is.na(df)) == 0]

# Check for missing values
colSums(is.na(df))
##                                State                                 Year 
##                                    0                                    0 
##                           Population                  Violent_crime_total 
##                                    0                                    0 
## Murder_and_nonnegligent_Manslaughter                     Murder_per100000 
##                                    0                                    0 
##                       Legacy_rape_.1                              Robbery 
##                                    0                                    0 
##                   Aggravated_assault                           State_code 
##                                    0                                    0
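
Dropping every column that contains any NA works here because only Revised_rape_.2 is affected (2,698 of 2,800 values missing, about 96%). A slightly more explicit variant, sketched below on a toy data frame, computes the missing share per column and drops columns above a threshold. The toy values and the 0.5 cut-off are illustrative assumptions, not part of the original analysis.

```r
# Toy data frame standing in for the crime data (illustrative values only).
toy <- data.frame(
  State        = c("A", "A", "B", "B"),
  Legacy_rape  = c(281L, 252L, 172L, 162L),
  Revised_rape = c(NA, NA, NA, 204L)
)

# Share of missing values in each column.
na_share <- colMeans(is.na(toy))

# Keep only columns whose missing share is at or below the threshold.
# The 0.5 cut-off is an arbitrary, hypothetical choice.
max_na_share <- 0.5
toy_clean <- toy[, na_share <= max_na_share, drop = FALSE]

names(toy_clean)
```

A threshold keeps a column that has only a handful of missing values, which the all-or-nothing rule would discard.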
# Set new column names
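# No leading spaces in the names, so df$Murder, df$Assault, and df$State_code resolve to the intended columns
names(df) <- c("State", "Year", "Population", "Violent_crime", "Murder", "Murder_rate", "Legacy_rape", "Robbery", "Assault", "State_code")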

# Summary and overview after removing missing values
summary(df)
##     State                Year        Population       Violent_crime   
##  Length:2800        Min.   :1960   Min.   :  226167   Min.   :    37  
##  Class :character   1st Qu.:1973   1st Qu.: 1189750   1st Qu.:  3065  
##  Mode  :character   Median :1987   Median : 3235500   Median : 10652  
##                     Mean   :1987   Mean   : 4805671   Mean   : 23270  
##                     3rd Qu.:2001   3rd Qu.: 5750048   3rd Qu.: 27192  
##                     Max.   :2014   Max.   :38802500   Max.   :345624  
##      Murder        Murder_rate       Legacy_rape         Robbery      
##  Min.   :   1.0   Min.   : 0.1565   Min.   :    6.0   Min.   :     8  
##  1st Qu.:  46.0   1st Qu.: 3.1041   1st Qu.:  293.8   1st Qu.:   782  
##  Median : 178.5   Median : 5.4413   Median :  822.0   Median :  3126  
##  Mean   : 339.2   Mean   : 6.5949   Mean   : 1426.3   Mean   :  8141  
##  3rd Qu.: 434.0   3rd Qu.: 8.6132   3rd Qu.: 1712.5   3rd Qu.:  7778  
##  Max.   :4096.0   Max.   :80.6020   Max.   :13693.0   Max.   :130897  
##      Assault       State_code       
##  Min.   :    14   Length:2800       
##  1st Qu.:  1879   Class :character  
##  Median :  6288   Mode  :character  
##  Mean   : 13340                     
##  3rd Qu.: 16183                     
##  Max.   :198045
str(df)
## 'data.frame':    2800 obs. of  10 variables:
##  $ State        : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Year         : int  1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
##  $ Population   : int  3266740 3302000 3358000 3347000 3407000 3462000 3517000 3540000 3566000 3531000 ...
##  $ Violent_crime: int  6097 5564 5283 6115 7260 6916 8098 8448 8288 8842 ...
##  $  Murder      : int  406 427 316 340 316 395 384 415 421 485 ...
##  $ Murder_rate  : num  12.43 12.93 9.41 10.16 9.28 ...
##  $ Legacy_rape  : int  281 252 218 192 397 367 341 371 396 494 ...
##  $ Robbery      : int  898 630 754 828 992 992 1124 1167 1462 1448 ...
##  $  Assault     : int  4512 4255 3995 4755 5555 5162 6249 6495 6009 6415 ...
##  $  State_code  : chr  "AL" "AL" "AL" "AL" ...
head(df)
##     State Year Population Violent_crime  Murder Murder_rate Legacy_rape Robbery
## 1 Alabama 1960    3266740          6097     406   12.428292         281     898
## 2 Alabama 1961    3302000          5564     427   12.931557         252     630
## 3 Alabama 1962    3358000          5283     316    9.410363         218     754
## 4 Alabama 1963    3347000          6115     340   10.158351         192     828
## 5 Alabama 1964    3407000          7260     316    9.275022         397     992
## 6 Alabama 1965    3462000          6916     395   11.409590         367     992
##    Assault  State_code
## 1     4512          AL
## 2     4255          AL
## 3     3995          AL
## 4     4755          AL
## 5     5555          AL
## 6     5162          AL
tail(df)
##        State Year Population Violent_crime  Murder Murder_rate Legacy_rape
## 2795 Wyoming 2009     544270          1196      11    2.021056         172
## 2796 Wyoming 2010     564554          1117       8    1.417048         162
## 2797 Wyoming 2011     567356          1245      18    3.172611         146
## 2798 Wyoming 2012     576626          1161      14    2.427917         154
## 2799 Wyoming 2013     583223          1212      17    2.914837         144
## 2800 Wyoming 2014     584153          1142      16    2.739008         126
##      Robbery  Assault  State_code
## 2795      78      935          WY
## 2796      77      870          WY
## 2797      71     1010          WY
## 2798      61      932          WY
## 2799      74      917          WY
## 2800      53      899          WY
# Convert data types
df$State <- as.factor(df$State)
df$Year <- as.integer(df$Year)
df$Population <- as.numeric(df$Population)
df$Violent_crime <- as.integer(df$Violent_crime)
df$Murder <- as.integer(df$Murder)
df$Legacy_rape <- as.integer(df$Legacy_rape)
df$Robbery <- as.integer(df$Robbery)

# Check column data types
sapply(df, class)
##         State          Year    Population Violent_crime        Murder 
##      "factor"     "integer"     "numeric"     "integer"     "integer" 
##   Murder_rate   Legacy_rape       Robbery       Assault    State_code 
##     "numeric"     "integer"     "integer"     "integer"   "character" 
##        Murder 
##     "integer"
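
Most crime columns in this file are raw counts, while Murder_rate appears to be the murder count per 100,000 residents. The sketch below checks that reading against the first two Alabama rows shown earlier and applies the same formula to rape counts. The helper per_100k and the column Rape_rate are hypothetical names introduced only for illustration.

```r
# First two Alabama rows copied from the head(df) output above.
toy <- data.frame(
  Year        = c(1960L, 1961L),
  Population  = c(3266740, 3302000),
  Murder      = c(406L, 427L),
  Legacy_rape = c(281L, 252L)
)

# Hypothetical helper: convert a count to a rate per 100,000 population.
per_100k <- function(count, population) count / population * 1e5

toy$Murder_rate <- per_100k(toy$Murder, toy$Population)
toy$Rape_rate   <- per_100k(toy$Legacy_rape, toy$Population)

# Compare with the Murder_per100000 values printed by head(df).
round(toy$Murder_rate, 2)
```

Working with rates rather than counts would remove much of the effect of state size, since populous states report more of every offense.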

Visualization

[Figures: two ggplot2 plots, each with a fitted geom_smooth() line. The images and the plotting code are not reproduced here.]
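
The plotting code is not included in this document, so the following is only a sketch of what a scatter plot with a fitted line might look like, assuming Legacy_rape on the x-axis and Violent_crime on the y-axis. It uses the first six Alabama rows as toy data. ggplot2 3.4.0 and later expect linewidth rather than size for line thickness, and passing formula = y ~ x explicitly avoids the informational message from geom_smooth().

```r
library(ggplot2)

# Toy data: the first six Alabama rows from the head(df) output.
toy <- data.frame(
  Legacy_rape   = c(281, 252, 218, 192, 397, 367),
  Violent_crime = c(6097, 5564, 5283, 6115, 7260, 6916)
)

# Scatter plot with a least-squares line; axis choice is an assumption.
p <- ggplot(toy, aes(x = Legacy_rape, y = Violent_crime)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 1) +
  labs(
    x = "Legacy rape (reported count)",
    y = "Violent crime (reported count)"
  )

print(p)
```

On the full data frame, replacing toy with df would plot all 2,800 state-year observations.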

Linear Regression Model

# Simple linear regression: Violent_crime is the dependent variable
# and Legacy_rape is the independent variable
model <- lm(Violent_crime ~ Legacy_rape, data = df)

# Summarize the model
summary(model)
## 
## Call:
## lm(formula = Violent_crime ~ Legacy_rape, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62719  -3479    795   2610 123798 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2845.5711   346.0299  -8.223 2.99e-16 ***
## Legacy_rape    18.3094     0.1457 125.679  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14640 on 2798 degrees of freedom
## Multiple R-squared:  0.8495, Adjusted R-squared:  0.8495 
## F-statistic: 1.58e+04 on 1 and 2798 DF,  p-value: < 2.2e-16
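
In a regression with a single predictor, R-squared equals the squared Pearson correlation between the two variables, which links this output to the question of how strongly violent crime and rape are correlated. The sketch below verifies the identity on toy data (the first six Alabama rows) and then applies it to the reported R-squared of 0.8495. The object names are assumptions.

```r
# Toy data: the first six Alabama rows from the head(df) output.
toy <- data.frame(
  Legacy_rape   = c(281, 252, 218, 192, 397, 367),
  Violent_crime = c(6097, 5564, 5283, 6115, 7260, 6916)
)

fit <- lm(Violent_crime ~ Legacy_rape, data = toy)

# Pearson correlation and the model's R-squared.
r         <- cor(toy$Violent_crime, toy$Legacy_rape)
r_squared <- summary(fit)$r.squared

# With one predictor, r^2 and R-squared agree.
all.equal(r^2, r_squared)

# Applied to the reported R-squared; the slope is positive, so r is positive.
r_reported <- sqrt(0.8495)
round(r_reported, 2)
```

On the full data, cor(df$Violent_crime, df$Legacy_rape) should give the same figure directly.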
# Multiple regression: Violent_crime is the dependent variable;
# Population, Legacy_rape, and Robbery are the independent variables
model <- lm(Violent_crime ~ Population + Legacy_rape + Robbery, data = df)

# Summarize the model
summary(model)
## 
## Call:
## lm(formula = Violent_crime ~ Population + Legacy_rape + Robbery, 
##     data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37135  -1753    392   1505  60941 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.226e+03  1.844e+02  -6.645 3.63e-11 ***
## Population   5.059e-04  5.908e-05   8.562  < 2e-16 ***
## Legacy_rape  6.954e+00  1.658e-01  41.931  < 2e-16 ***
## Robbery      1.492e+00  1.721e-02  86.696  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6969 on 2796 degrees of freedom
## Multiple R-squared:  0.9659, Adjusted R-squared:  0.9659 
## F-statistic: 2.643e+04 on 3 and 2796 DF,  p-value: < 2.2e-16
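
Because both fits are stored in the same object, model, the simple regression is overwritten by the multiple regression. A minimal sketch of keeping both under separate names and comparing them as nested models follows. It uses simulated data, so the numbers it prints are not results from the crime dataset, and the names fit_simple and fit_multi are assumptions.

```r
set.seed(1)

# Simulated stand-in for the crime data (not real observations).
n <- 200
toy <- data.frame(
  Population  = runif(n, 5e5, 1e7),
  Legacy_rape = rpois(n, 800),
  Robbery     = rpois(n, 3000)
)
toy$Violent_crime <- 5e-4 * toy$Population + 7 * toy$Legacy_rape +
  1.5 * toy$Robbery + rnorm(n, sd = 500)

# Keep both models under distinct names.
fit_simple <- lm(Violent_crime ~ Legacy_rape, data = toy)
fit_multi  <- lm(Violent_crime ~ Population + Legacy_rape + Robbery, data = toy)

# F-test for whether the two extra predictors improve the fit.
comparison <- anova(fit_simple, fit_multi)
comparison

# Adjusted R-squared side by side.
c(
  simple = summary(fit_simple)$adj.r.squared,
  multi  = summary(fit_multi)$adj.r.squared
)
```

The same two calls on df would show whether adding Population and Robbery is a statistically meaningful improvement over Legacy_rape alone.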

Conclusion

The simple linear regression equation is: Violent_crime = 18.31(Legacy_rape) - 2845.57. The overall F-test (F-statistic: 1.58e+04 on 1 and 2798 DF, p-value: < 2.2e-16) shows that the model is highly significant, suggesting that rape contributes meaningfully to predicting violent crime. The adjusted R² of 0.8495 means the model explains 84.95% of the variation in violent crime, so Legacy_rape is a good predictor of violent crime. Because this model has a single predictor with a positive slope, an R² of 0.8495 corresponds to a correlation of about 0.92, so the correlation between violent crime and rape is strong.

By contrast, the multiple regression equation is: Violent_crime = 5.059e-04(Population) + 6.954(Legacy_rape) + 1.492(Robbery) - 1226. The overall F-test (F-statistic: 2.643e+04 on 3 and 2796 DF, p-value: < 2.2e-16) is statistically significant, meaning that there is a strong relationship between violent crime and population, rape, and robbery. The model explains 96.59% of the variation in violent crime.

The Legacy_rape slope can be interpreted as follows: for each additional reported rape, predicted violent crime increases by about 6.95 offenses, holding population and robbery constant.
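
The slope interpretation can be checked with a short worked example that uses the rounded coefficients printed in the multiple regression summary (intercept -1226, Population 5.059e-04, Legacy_rape 6.954, Robbery 1.492) and the Alabama 1960 row as input. The function name predict_violent is hypothetical, and because the coefficients are rounded, the prediction is approximate.

```r
# Rounded coefficients copied from the multiple regression summary.
coefs <- c(
  intercept   = -1226,
  Population  = 5.059e-04,
  Legacy_rape = 6.954,
  Robbery     = 1.492
)

# Hypothetical helper: predicted violent crime count for one state-year.
predict_violent <- function(population, legacy_rape, robbery) {
  unname(
    coefs["intercept"] +
      coefs["Population"] * population +
      coefs["Legacy_rape"] * legacy_rape +
      coefs["Robbery"] * robbery
  )
}

# Alabama 1960: population 3,266,740, 281 rapes, 898 robberies.
base     <- predict_violent(3266740, 281, 898)
plus_one <- predict_violent(3266740, 282, 898)

base             # roughly 3,720 predicted violent crimes
plus_one - base  # equals the Legacy_rape coefficient, 6.954
```

The observed value for Alabama in 1960 is 6,097, so this single prediction is well off even though the model fits well overall. The difference between the two predictions shows what "holding population and robbery constant" means in practice.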