1 Business Problem

This report uses regression models to predict the crime rate. It analyzes a crime-rate dataset that contains a range of related variables. Given that goal, the underlying business problem is to identify which factors are associated with the crime rate and how those variables relate to one another.

Target: the crime rate, crime_rate.
Predictors: all variables except crime_rate.

2 Data Wrangling & EDA

2.1 Data Description

Read the data from crime.csv

data = read.csv("data_input/crime.csv")
nrows = nrow(data)
ncols = ncol(data)

The dataset has 47 rows and 17 columns.

Because column X carries the same information as the row index, we drop it from the data frame.

# drop column X
data <- subset(data, select = -X)

Rename the columns so that they are more informative.

names(data) <- c("percent_m", "is_south", "mean_education", "police_exp60", 
                  "police_exp59", "labour_participation", "m_per1000f", 
                  "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", 
                  "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(data)

data <- transform(data, percent_m = as.numeric(percent_m),
                  is_south = as.numeric(is_south),
                  mean_education = as.numeric(mean_education),
                  police_exp60 = as.numeric(police_exp60),
                  police_exp59 = as.numeric(police_exp59),
                  labour_participation = as.numeric(labour_participation),
                  m_per1000f = as.numeric(m_per1000f),
                  state_pop = as.numeric(state_pop),
                  nonwhites_per1000 = as.numeric(nonwhites_per1000),
                  unemploy_m24 = as.numeric(unemploy_m24),
                  unemploy_m39 = as.numeric(unemploy_m39),
                  gdp = as.numeric(gdp),
                  inequality = as.numeric(inequality),
                  prob_prison = as.numeric(prob_prison),
                  time_prison = as.numeric(time_prison),
                  crime_rate = as.numeric(crime_rate))
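
The column-by-column conversion above can be written more compactly. The following is a minimal sketch on a hypothetical two-column data frame (`toy` and its values are illustrative, not part of the crime data); it assumes every column is meant to become numeric, as is the case in the call above.

```r
# Hypothetical stand-in for the crime data: two integer columns.
toy <- data.frame(percent_m = c(151L, 143L), is_south = c(1L, 0L))

# Convert every column to numeric in one step.
# Assigning into `toy[]` keeps the object a data frame.
toy[] <- lapply(toy, as.numeric)

str(toy)
```

Applied to the report's data frame, the same idea would read `data[] <- lapply(data, as.numeric)`.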

Column descriptions:

percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-whites resident per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: avg time served in prisons
crime_rate: crime rate in an unspecified category

2.2 Check Data Structure

library(dplyr)
glimpse(data)

#> Rows: 47
#> Columns: 16
#> $ percent_m            <dbl> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
#> $ is_south             <dbl> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
#> $ mean_education       <dbl> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
#> $ police_exp60         <dbl> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
#> $ police_exp59         <dbl> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
#> $ labour_participation <dbl> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
#> $ m_per1000f           <dbl> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
#> $ state_pop            <dbl> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
#> $ nonwhites_per1000    <dbl> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
#> $ unemploy_m24         <dbl> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
#> $ unemploy_m39         <dbl> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
#> $ gdp                  <dbl> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
#> $ inequality           <dbl> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
#> $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
#> $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
#> $ crime_rate           <dbl> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

2.3 Check Missing Values

anyNA(data)

#> [1] FALSE

The data contain no missing values.
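
`anyNA()` only answers yes or no for the whole data frame. When it returns TRUE, a per-column count shows where the gaps are. A small sketch on hypothetical data (`toy` is illustrative only):

```r
# Hypothetical data frame with one missing value in column `a`.
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))

# is.na() gives a logical matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```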

3 Basic Visualization

3.1 Correlation

library(ggcorrplot)

# Compute the correlation matrix
corr_matrix <- cor(data)

# Display the correlation matrix with ggcorrplot
ggcorrplot(corr_matrix, method = "square") +
  ggtitle("Correlation Matrix Plot") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
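
Reading individual values off a correlation plot is imprecise. As a sketch, the column of the correlation matrix that belongs to the target can be sorted directly. The data below are simulated and the names `toy`, `x1`, `x2`, `y` are hypothetical; the 0.5 cutoff is the threshold this report uses later for a "strong" correlation.

```r
set.seed(1)

# Simulated data: y depends strongly on x1 and not on x2.
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 2 * toy$x1 + rnorm(30, sd = 0.5)

corr <- cor(toy)

# Correlations of every predictor with the target, target itself removed.
target_corr <- corr[, "y"]
target_corr <- target_corr[names(target_corr) != "y"]

# Rank by absolute correlation and keep those above the cutoff.
ranked <- sort(abs(target_corr), decreasing = TRUE)
strong <- names(ranked)[ranked > 0.5]
print(ranked)
print(strong)
```

For the crime data the equivalent would use `corr_matrix[, "crime_rate"]`.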

3.2 Histogram

hist(data$crime_rate, breaks= 10, col = "lightblue", xlab = "Crime Rate", ylab = "Frequency", main ="Distribution of Crime Rate")

3.3 Boxplot

boxplot(data$crime_rate, col = "lightblue", ylab = "Crime Rate", main = "Boxplot of Crime Rate")

3.4 Insight

The boxplot shows a few outliers in crime_rate. In the histogram the data cluster on the left, so the distribution is positively skewed (right-skewed).
The data are fairly spread out, as the wide box in the boxplot shows. In the histogram, the tallest bar has a frequency of roughly 12 and covers crime rates of about 800-1000 (an estimate read off the plot).
The median is estimated at around 800.
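
The values estimated from the plots can be computed instead. The sketch below uses a hypothetical vector `x`, not the crime data; `boxplot.stats()` applies the same 1.5 x IQR rule that `boxplot()` uses to draw outlier points.

```r
# Hypothetical right-skewed sample with one large value.
x <- c(500, 600, 700, 800, 850, 900, 1000, 1200, 2000)

med <- median(x)

# Points beyond 1.5 * IQR from the hinges, i.e. what boxplot() draws as dots.
outliers <- boxplot.stats(x)$out

# A mean above the median is a quick sign of positive skew.
right_skewed <- mean(x) > med

print(med)
print(outliers)
print(right_skewed)
```

On the report's data the same calls would take `data$crime_rate` as input.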

4 Modeling

4.1 Model with all predictors

model_all <- lm(crime_rate ~ ., data = data)

summary(model_all)

#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -395.74  -98.09   -6.69  112.99  512.67 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
#> percent_m                8.7830     4.1714   2.106 0.043443 *  
#> is_south                -3.8035   148.7551  -0.026 0.979765    
#> mean_education          18.8324     6.2088   3.033 0.004861 ** 
#> police_exp60            19.2804    10.6110   1.817 0.078892 .  
#> police_exp59           -10.9422    11.7478  -0.931 0.358830    
#> labour_participation    -0.6638     1.4697  -0.452 0.654654    
#> m_per1000f               1.7407     2.0354   0.855 0.398995    
#> state_pop               -0.7330     1.2896  -0.568 0.573845    
#> nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
#> unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
#> unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
#> gdp                      0.9617     1.0367   0.928 0.360754    
#> inequality               7.0672     2.2717   3.111 0.003983 ** 
#> prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
#> time_prison             -3.4790     7.1653  -0.486 0.630708    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 209.1 on 31 degrees of freedom
#> Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
#> F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

Insight:

The summary of the regression model above gives several insights:

mean_education has a significant positive coefficient (Estimate = 18.8324, p-value = 0.004861): holding the other predictors fixed, higher mean years of schooling is associated with a higher crime rate. This is an association within this dataset, not a causal effect, and it runs against the intuition that a lack of education goes with more crime.
percent_m also has a significant positive coefficient (Estimate = 8.7830, p-value = 0.043443): a larger share of males aged 14-24 is associated with a higher crime rate, which suggests that the age structure of the population matters.
inequality has a significant positive coefficient (Estimate = 7.0672, p-value = 0.003983): greater income inequality is associated with a higher crime rate, which suggests that social and economic gaps may contribute to crime.
prob_prison has a significant negative coefficient (Estimate = -4855.2658, p-value = 0.040627): a higher probability of imprisonment is associated with a lower crime rate.
police_exp60 (p-value = 0.078892) and unemploy_m39 (p-value = 0.050161) fall just short of the 5% level, and the remaining predictors are clearly not significant (p-value > 0.05). With the other predictors in the model, this dataset gives no clear evidence of a relation between those variables and the crime rate.
The model as a whole has an R-squared of 0.8031, meaning the predictors explain about 80.31% of the variability in the target crime_rate. The adjusted R-squared is lower, 0.7078, because the model uses 15 predictors on only 47 observations.
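
Picking significant predictors by eye from a long coefficient table is error-prone. As a sketch, the p-values can be filtered programmatically. The data here are simulated and the names `toy`, `fit`, `x1`, `x2` are hypothetical; the 0.05 cutoff is the level used throughout this report.

```r
set.seed(42)

# Simulated data: y depends on x1 only.
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 3 * toy$x1 + rnorm(50)

fit <- lm(y ~ ., data = toy)

# coef(summary(fit)) is the coefficient matrix printed by summary().
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]

# Names of predictors (intercept excluded) significant at the 5% level.
is_predictor <- names(p_values) != "(Intercept)"
significant <- names(p_values)[p_values < 0.05 & is_predictor]
print(significant)
```

Replacing `fit` with `model_all` would list the predictors of the full model that pass the 5% threshold.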

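4.2 Model with feature selection based on correlation

The predictors that are strongly correlated with the target crime_rate (according to the correlation matrix) are police_exp60 and police_exp59 (correlation > 0.5).

selected_features <- c("police_exp60", "police_exp59")
# `formula` must be spelled out in full: a misspelling such as `formua` is not
# matched to the formula argument, and lm() then regresses the first column
# (percent_m) on every other column instead of fitting the model written here.
model_selection <- lm(formula = crime_rate ~ police_exp60 + police_exp59, data = data)
summary(model_selection)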

#> 
#> Call:
#> lm(data = data, formua = crime_rate ~ police_exp60 + police_exp59)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -14.6248  -4.0043   0.1428   5.0494  17.9402 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)          145.519610  74.100868   1.964   0.0586 .
#> is_south               3.018453   5.966296   0.506   0.6165  
#> mean_education        -0.360076   0.277304  -1.298   0.2037  
#> police_exp60          -0.204126   0.448018  -0.456   0.6518  
#> police_exp59           0.027098   0.479667   0.056   0.9553  
#> labour_participation  -0.036377   0.059024  -0.616   0.5422  
#> m_per1000f             0.110194   0.080536   1.368   0.1811  
#> state_pop             -0.011843   0.052161  -0.227   0.8219  
#> nonwhites_per1000      0.033203   0.025591   1.297   0.2041  
#> unemploy_m24           0.012435   0.174707   0.071   0.9437  
#> unemploy_m39          -0.541029   0.339473  -1.594   0.1211  
#> gdp                   -0.047646   0.041451  -1.149   0.2592  
#> inequality            -0.130627   0.102139  -1.279   0.2104  
#> prob_prison           30.853787  97.865765   0.315   0.7547  
#> time_prison            0.287017   0.285038   1.007   0.3218  
#> crime_rate             0.014245   0.006766   2.106   0.0434 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.42 on 31 degrees of freedom
#> Multiple R-squared:  0.6975, Adjusted R-squared:  0.5512 
#> F-statistic: 4.766 on 15 and 31 DF,  p-value: 0.0001189

Insight:

The output above does not describe the intended two-predictor model. Its Call line reads formua rather than formula, so lm() never received the formula and instead fitted percent_m, the first column, on every other column, including crime_rate. That is why the table lists 15 coefficients and the residuals are on the scale of percent_m rather than crime_rate.

Consequently:

The R-squared of 0.6975 (adjusted 0.5512) belongs to that unintended fit of percent_m, not to a model of crime_rate.
The significance of mean_education, inequality, and percent_m, and the R-squared of 0.8031, are results of the full model in section 4.1; they cannot be read from this output.
Nothing can be concluded here about the effect of police_exp60 and police_exp59 (police expenditure in 1960 and 1959) or of unemploy_m39 on the crime rate. The model has to be refit with formula spelled correctly before it is interpreted.

4.3 Model with feature selection using stepwise forward

model_forward <-  step(model_all, direction = "forward")

#> Start:  AIC=514.65
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     police_exp59 + labour_participation + m_per1000f + state_pop + 
#>     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison + time_prison

summary(model_forward)

#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + is_south + mean_education + 
#>     police_exp60 + police_exp59 + labour_participation + m_per1000f + 
#>     state_pop + nonwhites_per1000 + unemploy_m24 + unemploy_m39 + 
#>     gdp + inequality + prob_prison + time_prison, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -395.74  -98.09   -6.69  112.99  512.67 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
#> percent_m                8.7830     4.1714   2.106 0.043443 *  
#> is_south                -3.8035   148.7551  -0.026 0.979765    
#> mean_education          18.8324     6.2088   3.033 0.004861 ** 
#> police_exp60            19.2804    10.6110   1.817 0.078892 .  
#> police_exp59           -10.9422    11.7478  -0.931 0.358830    
#> labour_participation    -0.6638     1.4697  -0.452 0.654654    
#> m_per1000f               1.7407     2.0354   0.855 0.398995    
#> state_pop               -0.7330     1.2896  -0.568 0.573845    
#> nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
#> unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
#> unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
#> gdp                      0.9617     1.0367   0.928 0.360754    
#> inequality               7.0672     2.2717   3.111 0.003983 ** 
#> prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
#> time_prison             -3.4790     7.1653  -0.486 0.630708    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 209.1 on 31 degrees of freedom
#> Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
#> F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

Insight:

No selection took place: step() was started from model_all, which already contains every predictor, and no scope was supplied, so the forward search had nothing to add. The trace stops at the starting AIC of 514.65, and model_forward is identical to model_all.
Statistical significance: the asterisks (*) mark the significance level of each coefficient. A significant coefficient means the data give evidence that it differs from zero given the other predictors; it does not by itself mean the effect is large. A non-significant coefficient means that such evidence is lacking.
R-squared: Multiple R-squared (0.8031) indicates how well the model explains the variation in the crime rate. It ranges from 0 to 1, and higher values mean the model explains more of the variation in the data. Adjusted R-squared (0.7078) corrects this value for the number of independent variables used in the model.
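
A forward search only has work to do when it starts from a small model and is told which terms it may add. The following is a minimal sketch under stated assumptions: the data are simulated, and `toy`, `null_model`, `full_model`, and `forward_fit` are hypothetical names, not objects from this report.

```r
set.seed(7)

# Simulated data: y depends on x1 and x2, while x3 is pure noise.
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60), x3 = rnorm(60))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(60)

# Start from the intercept-only model; the full model defines the upper scope.
null_model <- lm(y ~ 1, data = toy)
full_model <- lm(y ~ ., data = toy)

forward_fit <- step(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "forward",
  trace = 0  # suppress the step-by-step AIC tables
)

# Predictors that the forward search decided to keep.
print(attr(terms(forward_fit), "term.labels"))
```

For the crime data, the analogous starting point would be `lm(crime_rate ~ 1, data = data)` with `model_all` supplying the upper scope. A backward search needs no scope: `step(model_all, direction = "backward")` removes terms from the full model.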

4.4 Model with feature selection using stepwise both

model_both <- step(model_all, direction = "both")

#> Start:  AIC=514.65
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     police_exp59 + labour_participation + m_per1000f + state_pop + 
#>     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - is_south              1        29 1354974 512.65
#> - labour_participation  1      8917 1363862 512.96
#> - time_prison           1     10304 1365250 513.00
#> - state_pop             1     14122 1369068 513.14
#> - nonwhites_per1000     1     18395 1373341 513.28
#> - m_per1000f            1     31967 1386913 513.74
#> - gdp                   1     37613 1392558 513.94
#> - police_exp59          1     37919 1392865 513.95
#> <none>                              1354946 514.65
#> - unemploy_m24          1     83722 1438668 515.47
#> - police_exp60          1    144306 1499252 517.41
#> - unemploy_m39          1    181536 1536482 518.56
#> - percent_m             1    193770 1548716 518.93
#> - prob_prison           1    199538 1554484 519.11
#> - mean_education        1    402117 1757063 524.86
#> - inequality            1    423031 1777977 525.42
#> 
#> Step:  AIC=512.65
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - time_prison           1     10341 1365315 511.01
#> - labour_participation  1     10878 1365852 511.03
#> - state_pop             1     14127 1369101 511.14
#> - nonwhites_per1000     1     21626 1376600 511.39
#> - m_per1000f            1     32449 1387423 511.76
#> - police_exp59          1     37954 1392929 511.95
#> - gdp                   1     39223 1394197 511.99
#> <none>                              1354974 512.65
#> - unemploy_m24          1     96420 1451395 513.88
#> + is_south              1        29 1354946 514.65
#> - police_exp60          1    144302 1499277 515.41
#> - unemploy_m39          1    189859 1544834 516.81
#> - percent_m             1    195084 1550059 516.97
#> - prob_prison           1    204463 1559437 517.26
#> - mean_education        1    403140 1758114 522.89
#> - inequality            1    488834 1843808 525.13
#> 
#> Step:  AIC=511.01
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - labour_participation  1     10533 1375848 509.37
#> - nonwhites_per1000     1     15482 1380797 509.54
#> - state_pop             1     21846 1387161 509.75
#> - police_exp59          1     28932 1394247 509.99
#> - gdp                   1     36070 1401385 510.23
#> - m_per1000f            1     41784 1407099 510.42
#> <none>                              1365315 511.01
#> - unemploy_m24          1     91420 1456735 512.05
#> + time_prison           1     10341 1354974 512.65
#> + is_south              1        65 1365250 513.00
#> - police_exp60          1    134137 1499452 513.41
#> - unemploy_m39          1    184143 1549458 514.95
#> - percent_m             1    186110 1551425 515.01
#> - prob_prison           1    237493 1602808 516.54
#> - mean_education        1    409448 1774763 521.33
#> - inequality            1    502909 1868224 523.75
#> 
#> Step:  AIC=509.37
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - nonwhites_per1000     1     11675 1387523 507.77
#> - police_exp59          1     21418 1397266 508.09
#> - state_pop             1     27803 1403651 508.31
#> - m_per1000f            1     31252 1407100 508.42
#> - gdp                   1     35035 1410883 508.55
#> <none>                              1375848 509.37
#> - unemploy_m24          1     80954 1456802 510.06
#> + labour_participation  1     10533 1365315 511.01
#> + time_prison           1      9996 1365852 511.03
#> + is_south              1      3046 1372802 511.26
#> - police_exp60          1    123896 1499744 511.42
#> - unemploy_m39          1    190746 1566594 513.47
#> - percent_m             1    217716 1593564 514.27
#> - prob_prison           1    226971 1602819 514.54
#> - mean_education        1    413254 1789103 519.71
#> - inequality            1    500944 1876792 521.96
#> 
#> Step:  AIC=507.77
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
#>     inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - police_exp59          1     16706 1404229 506.33
#> - state_pop             1     25793 1413315 506.63
#> - m_per1000f            1     26785 1414308 506.66
#> - gdp                   1     31551 1419073 506.82
#> <none>                              1387523 507.77
#> - unemploy_m24          1     83881 1471404 508.52
#> + nonwhites_per1000     1     11675 1375848 509.37
#> + is_south              1      7207 1380316 509.52
#> + labour_participation  1      6726 1380797 509.54
#> + time_prison           1      4534 1382989 509.61
#> - police_exp60          1    118348 1505871 509.61
#> - unemploy_m39          1    201453 1588976 512.14
#> - prob_prison           1    216760 1604282 512.59
#> - percent_m             1    309214 1696737 515.22
#> - mean_education        1    402754 1790276 517.74
#> - inequality            1    589736 1977259 522.41
#> 
#> Step:  AIC=506.33
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - state_pop             1     22345 1426575 505.07
#> - gdp                   1     32142 1436371 505.39
#> - m_per1000f            1     36808 1441037 505.54
#> <none>                              1404229 506.33
#> - unemploy_m24          1     86373 1490602 507.13
#> + police_exp59          1     16706 1387523 507.77
#> + nonwhites_per1000     1      6963 1397266 508.09
#> + is_south              1      3807 1400422 508.20
#> + labour_participation  1      1986 1402243 508.26
#> + time_prison           1       575 1403654 508.31
#> - unemploy_m39          1    205814 1610043 510.76
#> - prob_prison           1    218607 1622836 511.13
#> - percent_m             1    307001 1711230 513.62
#> - mean_education        1    389502 1793731 515.83
#> - inequality            1    608627 2012856 521.25
#> - police_exp60          1   1050202 2454432 530.57
#> 
#> Step:  AIC=505.07
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - gdp                   1     26493 1453068 503.93
#> <none>                              1426575 505.07
#> - m_per1000f            1     84491 1511065 505.77
#> - unemploy_m24          1     99463 1526037 506.24
#> + state_pop             1     22345 1404229 506.33
#> + police_exp59          1     13259 1413315 506.63
#> + nonwhites_per1000     1      5927 1420648 506.87
#> + is_south              1      5724 1420851 506.88
#> + labour_participation  1      5176 1421398 506.90
#> + time_prison           1      3913 1422661 506.94
#> - prob_prison           1    198571 1625145 509.20
#> - unemploy_m39          1    208880 1635455 509.49
#> - percent_m             1    320926 1747501 512.61
#> - mean_education        1    386773 1813348 514.35
#> - inequality            1    594779 2021354 519.45
#> - police_exp60          1   1127277 2553852 530.44
#> 
#> Step:  AIC=503.93
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> <none>                              1453068 503.93
#> + gdp                   1     26493 1426575 505.07
#> - m_per1000f            1    103159 1556227 505.16
#> + state_pop             1     16697 1436371 505.39
#> + police_exp59          1     14148 1438919 505.47
#> + is_south              1      9329 1443739 505.63
#> + labour_participation  1      4374 1448694 505.79
#> + nonwhites_per1000     1      3799 1449269 505.81
#> + time_prison           1      2293 1450775 505.86
#> - unemploy_m24          1    127044 1580112 505.87
#> - prob_prison           1    247978 1701046 509.34
#> - unemploy_m39          1    255443 1708511 509.55
#> - percent_m             1    296790 1749858 510.67
#> - mean_education        1    445788 1898855 514.51
#> - inequality            1    738244 2191312 521.24
#> - police_exp60          1   1672038 3125105 537.93

summary(model_both)

#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -444.70 -111.07    3.03  122.15  483.30 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
#> percent_m          9.332      3.350   2.786  0.00828 ** 
#> mean_education    18.012      5.275   3.414  0.00153 ** 
#> police_exp60      10.265      1.552   6.613 8.26e-08 ***
#> m_per1000f         2.234      1.360   1.642  0.10874    
#> unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
#> unemploy_m39      18.735      7.248   2.585  0.01371 *  
#> inequality         6.133      1.396   4.394 8.63e-05 ***
#> prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 195.5 on 38 degrees of freedom
#> Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
#> F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10
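
The AIC values in the trace above can be reproduced from the RSS column. For linear models, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * (number of estimated coefficients). The helper `step_aic` below is a hypothetical name introduced for this check; the inputs are taken from the output above (n = 47 observations).

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * number of coefficients.
step_aic <- function(rss, n, n_coef) {
  n * log(rss / n) + 2 * n_coef
}

# Full model: RSS 1354946, 15 predictors plus intercept = 16 coefficients.
aic_full <- step_aic(rss = 1354946, n = 47, n_coef = 16)

# Final model: RSS 1453068, 8 predictors plus intercept = 9 coefficients.
aic_final <- step_aic(rss = 1453068, n = 47, n_coef = 9)

print(round(c(full = aic_full, final = aic_final), 2))
```

The two results agree with the trace (514.65 at the start, 503.93 at the final step): the final model has a larger RSS, but the penalty saved by dropping seven coefficients outweighs it.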

5 Model Evaluation

5.1 Save the predictions to new columns

data$pred_all <-  predict(object = model_all, newdata = data)
data$pred_selection <-  predict(object = model_selection, newdata = data)
data$pred_forward <- predict(object = model_forward, newdata = data)
data$pred_both <-  predict(object = model_both, newdata = data)
head(data[, (ncol(data)-3):ncol(data)])

5.2 Check the adjusted R-squared of each model

summary(model_all)$adj.r.squared

#> [1] 0.7078062

summary(model_selection)$adj.r.squared

#> [1] 0.5511712

summary(model_forward)$adj.r.squared

#> [1] 0.7078062

summary(model_both)$adj.r.squared

#> [1] 0.7443692

5.3 Which regression model is best by RMSE?

library(MLmetrics)

#RMSE model_all
RMSE(y_pred = data$pred_all,
     y_true = data$crime_rate)

#> [1] 169.79

#RMSE model_selection
RMSE(y_pred = data$pred_selection,
     y_true = data$crime_rate)

#> [1] 857.2633

#RMSE model_forward
RMSE(y_pred = data$pred_forward,
     y_true = data$crime_rate)

#> [1] 169.79

#RMSE model_both
RMSE(y_pred = data$pred_both,
     y_true = data$crime_rate)

#> [1] 175.8304
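
Because these predictions are made on the same 47 rows used for fitting, each RMSE equals sqrt(RSS / n) and can be cross-checked against the RSS values printed by step() in section 4.4. The helper names below are hypothetical and introduced only for this check.

```r
# In-sample RMSE from the residual sum of squares.
rmse_from_rss <- function(rss, n) sqrt(rss / n)

# Residual standard error divides by the residual degrees of freedom instead.
rse_from_rss <- function(rss, df_residual) sqrt(rss / df_residual)

rmse_all  <- rmse_from_rss(1354946, n = 47)  # full model
rmse_both <- rmse_from_rss(1453068, n = 47)  # stepwise "both" model

rse_all  <- rse_from_rss(1354946, df_residual = 31)
rse_both <- rse_from_rss(1453068, df_residual = 38)

print(round(c(rmse_all = rmse_all, rmse_both = rmse_both), 2))
print(round(c(rse_all = rse_all, rse_both = rse_both), 1))
```

The RMSE values match the results above (169.79 and 175.83). The residual standard errors match the summaries (209.1 and 195.5) and rank the two models the other way round, because that measure penalizes the number of coefficients.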

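Conclusion

By adjusted R-squared, model_both explains the target best: 0.7443692, or about 74.4%.
By RMSE, model_all and model_forward (the same fit) give the smallest error in predicting crime_rate, 169.79; model_both follows closely at 175.8304. These RMSE values are computed on the same data used to fit the models, which favors the model with more predictors.
The figures for model_selection (adjusted R-squared 0.5511712, RMSE 857.2633) come from the fit whose Call line in section 4.2 reads formua, with percent_m as the response, so they cannot be compared with the other models.

# Get the R-squared values for each model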

r_squared_all <- summary(model_all)$r.squared
adj_r_squared_all <- summary(model_all)$adj.r.squared

r_squared_selection <- summary(model_selection)$r.squared
adj_r_squared_selection <- summary(model_selection)$adj.r.squared


r_squared_forward <- summary(model_forward)$r.squared
adj_r_squared_forward <- summary(model_forward)$adj.r.squared

r_squared_both <- summary(model_both)$r.squared
adj_r_squared_both <- summary(model_both)$adj.r.squared

# Print the R-squared values for each model

cat("R-squared for All Variables Model:", r_squared_all, "\n")

#> R-squared for All Variables Model: 0.8030868

cat("Adjusted R-squared for All Variables Model:", adj_r_squared_all, "\n")

#> Adjusted R-squared for All Variables Model: 0.7078062

cat("R-squared for Selection Model:", r_squared_selection, "\n")

#> R-squared for Selection Model: 0.6975284

cat("Adjusted R-squared for Selection Model:", adj_r_squared_selection, "\n")

#> Adjusted R-squared for Selection Model: 0.5511712

cat("R-squared for Forward Model:", r_squared_forward, "\n")

#> R-squared for Forward Model: 0.8030868

cat("Adjusted R-squared for Forward Model:", adj_r_squared_forward, "\n")

#> Adjusted R-squared for Forward Model: 0.7078062

cat("R-squared for Both Model:", r_squared_both, "\n")

#> R-squared for Both Model: 0.7888268

cat("Adjusted R-squared for Both Model:", adj_r_squared_both, "\n")

#> Adjusted R-squared for Both Model: 0.7443692

6 Interpretation of the Best Model

summary(model_all)

#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -395.74  -98.09   -6.69  112.99  512.67 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -5984.2876  1628.3184  -3.675 0.000893 ***
#> percent_m                8.7830     4.1714   2.106 0.043443 *  
#> is_south                -3.8035   148.7551  -0.026 0.979765    
#> mean_education          18.8324     6.2088   3.033 0.004861 ** 
#> police_exp60            19.2804    10.6110   1.817 0.078892 .  
#> police_exp59           -10.9422    11.7478  -0.931 0.358830    
#> labour_participation    -0.6638     1.4697  -0.452 0.654654    
#> m_per1000f               1.7407     2.0354   0.855 0.398995    
#> state_pop               -0.7330     1.2896  -0.568 0.573845    
#> nonwhites_per1000        0.4204     0.6481   0.649 0.521279    
#> unemploy_m24            -5.8271     4.2103  -1.384 0.176238    
#> unemploy_m39            16.7800     8.2336   2.038 0.050161 .  
#> gdp                      0.9617     1.0367   0.928 0.360754    
#> inequality               7.0672     2.2717   3.111 0.003983 ** 
#> prob_prison          -4855.2658  2272.3746  -2.137 0.040627 *  
#> time_prison             -3.4790     7.1653  -0.486 0.630708    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 209.1 on 31 degrees of freedom
#> Multiple R-squared:  0.8031, Adjusted R-squared:  0.7078 
#> F-statistic: 8.429 on 15 and 31 DF,  p-value: 3.539e-07

Insights:

Some interpretations of this model:

Intercept: the estimated intercept is -5984.2876. This is the predicted value of the target when every independent variable in the model equals zero.
percent_m: the coefficient is 8.7830 with a standard error of 4.1714. The t-value is 2.106 with a p-value of 0.043443, so this variable has a significant relation with the target (crime_rate) at the 5% level.
is_south: the coefficient is -3.8035 with a standard error of 148.7551. The t-value is -0.026 with a p-value of 0.979765, so this variable is not significant for the target.
mean_education: the coefficient is 18.8324 with a standard error of 6.2088. The t-value is 3.033 with a p-value of 0.004861, so this variable has a significant relation with the target.
Multiple R-squared: the R-squared is 0.8031, meaning 80.31% of the variation in the target (crime_rate) is explained by the independent variables in the model.
Adjusted R-squared: the adjusted R-squared is 0.7078. This measure accounts for the number of independent variables in the model and penalizes model complexity.
F-statistic: the F-statistic is 8.429 with a p-value of 3.539e-07. This indicates that the predictors are jointly significant: at least one independent variable has a significant relation with the target.
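
The complexity penalty in the adjusted R-squared can be made concrete with its formula, 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The function name `adjusted_r2` is hypothetical; the inputs are the R-squared values reported in section 5.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# model_all: 15 predictors on 47 observations.
adj_all <- adjusted_r2(r2 = 0.8030868, n = 47, p = 15)

# model_both: 8 predictors on 47 observations.
adj_both <- adjusted_r2(r2 = 0.7888268, n = 47, p = 8)

print(round(c(model_all = adj_all, model_both = adj_both), 4))
```

The results reproduce the reported 0.7078 and 0.7444: model_both has the lower raw R-squared but the higher adjusted value, because it reaches nearly the same fit with seven fewer predictors.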

Regression Analysis for House Price Prediction

Hairul Yasin

2023-06-29
