Analysis on Average Year of Education

Library
1. Read the Data
2. EDA
3. Variable/Feature Selection
4. Model Evaluation
5. Assumption Test
Summary
References

What Factors Can Increase the Average Year of Education?

We will begin the analysis using crime dataset by several steps:
1. Read & Tidy Data
2. Exploratory Data Analysis (correlation matrix with GGAlly, correlation with target, multicol)
3. Variable/Feature Selection (business-wise, stepwise with backward)
4. Model Evaluation (Rsquared, Error with MSE/RMSE/MAE/MAPE)
5. Assumption Test (testing with Best Linear Unbiased Estimator)
6. Linear Model Prediction

Library

library(tidyverse)
library(GGally)
library(MLmetrics)
library(lmtest)
library(manipulate)

1. Read the Data

# Read the data and remove the index column "X"
crime <- read.csv("data_input/crime.csv") %>% 
  select(-X)

# Rename the columns
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

# Remove "police_exp60" since it is similar to the other variable ("police_exp59")
crime <- crime %>% 
  select(-police_exp60)

# Glance into the data
glimpse(crime)

Observations: 47
Variables: 15
$ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140…
$ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
$ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 1…
$ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116,…
$ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632…
$ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 10…
$ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, …
$ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 10…
$ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 8…
$ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, …
$ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526…
$ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174…
$ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399…
$ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.…
$ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856,…

percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
mean_education: mean years of schooling
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-whites resident per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: avg time served in prisons
crime_rate: crime rate in an unspecified category

2. EDA

After we read and prepare the crime dataset, now we can start to look into correlation among each variables using library(GGAlly) and create a Pearson correlation:

# plot correlation matrix
ggcorr(crime,label = T, label_size = 3, hjust = 1, layout.exp = 2)

** My question based on this correlation heat map: How do we increase the average years of schooling among citizens?**

Target variable: mean_education

Based on this map, I will pick 1 variables that have strong correlation: gdp (positive strong correlation, 0.8)

Plot

# to look into correlation using scatter plot:
plot(crime$mean_education, crime$gdp)

Linear Model for GDP

model_edu <- lm(mean_education ~ gdp, crime)
summary(model_edu)


Call:
lm(formula = mean_education ~ gdp, data = crime)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.8692  -6.6220   0.0665   6.9303  13.7223 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 60.80723    6.24782   9.733 1.21e-12 ***
gdp          0.08533    0.01170   7.293 3.75e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.657 on 45 degrees of freedom
Multiple R-squared:  0.5417,    Adjusted R-squared:  0.5315 
F-statistic: 53.19 on 1 and 45 DF,  p-value: 3.755e-09

Based on the summary of this model, it confirms that the gdp variable is statistically significant (based on its p-value) to mean_education (our target variable); and from its positive slope (‘Estimate’) value, the higher the gdp in a city, the lower the higher the average education year in that city, which makes sense.

However, the model itself shows an adjusted r-squared value of 59%, which is not good, as means the model has poor fit. If so, what are the variables that actually affect the mean_education?

3. Variable/Feature Selection

Because the previous manual method to make model(s) of mean_education, we will cross-check how R can provide a new model using backward elimination (evaluating model by decreasing predicting variables until it provides the smallest AIC values with the highest adjusted R-squared).

model.all <- lm(mean_education ~ ., crime)

model_back <- step(
  object = model.all,
  direction = "backward"
)

Start:  AIC=171.34
mean_education ~ percent_m + is_south + police_exp59 + labour_participation + 
    m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
    unemploy_m39 + gdp + inequality + prob_prison + time_prison + 
    crime_rate

                       Df Sum of Sq     RSS    AIC
- m_per1000f            1     0.092  951.02 169.35
- state_pop             1     0.166  951.09 169.35
- is_south              1     2.258  953.18 169.45
- gdp                   1     2.980  953.90 169.49
- time_prison           1     6.931  957.85 169.68
- police_exp59          1    13.368  964.29 170.00
- prob_prison           1    14.351  965.27 170.05
- nonwhites_per1000     1    19.886  970.81 170.31
<none>                               950.92 171.34
- percent_m             1    45.385  996.31 171.53
- labour_participation  1    77.857 1028.78 173.04
- unemploy_m24          1    87.763 1038.69 173.49
- inequality            1   165.736 1116.66 176.89
- unemploy_m39          1   210.067 1160.99 178.72
- crime_rate            1   213.513 1164.44 178.86

Step:  AIC=169.35
mean_education ~ percent_m + is_south + police_exp59 + labour_participation + 
    state_pop + nonwhites_per1000 + unemploy_m24 + unemploy_m39 + 
    gdp + inequality + prob_prison + time_prison + crime_rate

                       Df Sum of Sq     RSS    AIC
- state_pop             1     0.308  951.32 167.36
- is_south              1     2.171  953.19 167.45
- gdp                   1     2.915  953.93 167.49
- time_prison           1     6.846  957.86 167.68
- police_exp59          1    14.061  965.08 168.04
- prob_prison           1    14.300  965.32 168.05
- nonwhites_per1000     1    20.299  971.31 168.34
<none>                               951.02 169.35
- percent_m             1    49.286 1000.30 169.72
- labour_participation  1   108.351 1059.37 172.42
- unemploy_m24          1   125.523 1076.54 173.17
- inequality            1   168.566 1119.58 175.02
- crime_rate            1   220.066 1171.08 177.13
- unemploy_m39          1   220.397 1171.41 177.14

Step:  AIC=167.36
mean_education ~ percent_m + is_south + police_exp59 + labour_participation + 
    nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
    prob_prison + time_prison + crime_rate

                       Df Sum of Sq     RSS    AIC
- is_south              1     2.089  953.41 165.47
- gdp                   1     3.085  954.41 165.51
- time_prison           1     6.685  958.01 165.69
- prob_prison           1    14.227  965.55 166.06
- police_exp59          1    14.557  965.88 166.08
- nonwhites_per1000     1    20.130  971.45 166.35
<none>                               951.32 167.36
- percent_m             1    51.731 1003.05 167.85
- labour_participation  1   108.057 1059.38 170.42
- unemploy_m24          1   125.216 1076.54 171.17
- inequality            1   176.553 1127.88 173.36
- unemploy_m39          1   220.591 1171.91 175.16
- crime_rate            1   223.557 1174.88 175.28

Step:  AIC=165.47
mean_education ~ percent_m + police_exp59 + labour_participation + 
    nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
    prob_prison + time_prison + crime_rate

                       Df Sum of Sq     RSS    AIC
- gdp                   1     4.734  958.15 163.70
- time_prison           1     8.513  961.93 163.88
- prob_prison           1    16.508  969.92 164.27
- police_exp59          1    17.127  970.54 164.30
- nonwhites_per1000     1    18.185  971.60 164.35
<none>                               953.41 165.47
- percent_m             1    49.884 1003.30 165.86
- labour_participation  1   122.663 1076.08 169.15
- unemploy_m24          1   129.799 1083.21 169.46
- inequality            1   186.935 1140.35 171.88
- unemploy_m39          1   219.876 1173.29 173.22
- crime_rate            1   227.690 1181.10 173.53

Step:  AIC=163.7
mean_education ~ percent_m + police_exp59 + labour_participation + 
    nonwhites_per1000 + unemploy_m24 + unemploy_m39 + inequality + 
    prob_prison + time_prison + crime_rate

                       Df Sum of Sq     RSS    AIC
- time_prison           1      7.27  965.41 162.05
- police_exp59          1     13.47  971.61 162.35
- prob_prison           1     16.61  974.76 162.51
- nonwhites_per1000     1     22.22  980.36 162.78
<none>                               958.15 163.70
- percent_m             1     58.38 1016.52 164.48
- unemploy_m24          1    126.90 1085.05 167.54
- labour_participation  1    133.82 1091.96 167.84
- unemploy_m39          1    215.41 1173.56 171.23
- crime_rate            1    260.45 1218.60 173.00
- inequality            1    386.68 1344.83 177.63

Step:  AIC=162.05
mean_education ~ percent_m + police_exp59 + labour_participation + 
    nonwhites_per1000 + unemploy_m24 + unemploy_m39 + inequality + 
    prob_prison + crime_rate

                       Df Sum of Sq     RSS    AIC
- police_exp59          1     11.87  977.28 160.63
- nonwhites_per1000     1     33.07  998.48 161.64
<none>                               965.41 162.05
- prob_prison           1     55.61 1021.03 162.69
- percent_m             1     61.38 1026.79 162.95
- labour_participation  1    139.28 1104.70 166.39
- unemploy_m24          1    162.85 1128.27 167.38
- unemploy_m39          1    253.26 1218.67 171.00
- crime_rate            1    274.93 1240.35 171.83
- inequality            1    384.36 1349.77 175.81

Step:  AIC=160.63
mean_education ~ percent_m + labour_participation + nonwhites_per1000 + 
    unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
    crime_rate

                       Df Sum of Sq     RSS    AIC
<none>                               977.28 160.63
- percent_m             1     49.55 1026.83 160.95
- prob_prison           1     56.94 1034.22 161.29
- nonwhites_per1000     1     57.02 1034.30 161.29
- labour_participation  1    166.99 1144.27 166.04
- unemploy_m24          1    191.73 1169.01 167.05
- unemploy_m39          1    263.00 1240.28 169.83
- crime_rate            1    405.95 1383.23 174.96
- inequality            1    554.24 1531.52 179.74

model_forw <- step(
  object = model.all,
  direction = "forward"
)

Start:  AIC=171.34
mean_education ~ percent_m + is_south + police_exp59 + labour_participation + 
    m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
    unemploy_m39 + gdp + inequality + prob_prison + time_prison + 
    crime_rate

For now, we’ll have 2 new model from forward and backward analysis; which we are going to evaluate later on below.

4. Model Evaluation

We will use the library(MLmetrics) to evaluate the model.

RMSE (Root Mean Squared Error)

# RMSE for forward model
RMSE(
  y_pred = model_forw$fitted.values,
  y_true = crime$mean_education
)

[1] 4.498045

# RMSE for backward model
RMSE(
  y_pred = model_back$fitted.values,
  y_true = crime$mean_education
)

[1] 4.559959

In general, the smaller the RMSE, the better the model is. Again, both of the model produce similar (small, error) value.

Adjusted R-squared comparison:

summary(model_forw)$adj.r.squared

[1] 0.7625521

summary(model_back)$adj.r.squared

[1] 0.7945014

Backward stepwise model provides better model to measure mean_education.

5. Assumption Test

Normality residual

We’ll be using several methods to check whether our residual data is normally distributed (H0: Residual is normally distributed and H1: Residual is not normally distributed), using: - Residual Check Using Histogram Plot

hist(model_back$residuals)

Shapiro Test: we will check whether if our model has big/enough p-value (which means the residual value is distributed normally)

shapiro.test(model_back$residuals)


    Shapiro-Wilk normality test

data:  model_back$residuals
W = 0.98866, p-value = 0.925

Because our p-value from Shapiro test is higher than alpha (0.925 > 0.05), then our decision will be to accept H0, which means that our residual data from the model is normally distributed.

No-Multicollinearity

And for the last assumption test, we are going to test multicollinearity with vif (variance inflation factor). The goal is to see whether our variables have vif score < 10 each.

library(car)
vif(model_back)

           percent_m labour_participation    nonwhites_per1000 
            2.238560             1.592336             2.708349 
        unemploy_m24         unemploy_m39           inequality 
            2.952959             3.929435             2.436282 
         prob_prison           crime_rate 
            1.671663             1.585600

As we can see from the result above, the vif value is way under 10 each, which means there is no multicollinearity from each variables (they are not affecting each other).

Summary

Based on our backward model (which has higher Adjusted R-Squared Values of 80%), some of the variables that can significantly affect the average year of education in the city are crime_rate, inequality, unemploy_m39, unemploy_m24, and labour_participation. Peculiarly enough, the variabe gdp does not even included in the backward model provided by R; however as a user, we can always include gdp into our model, which still makes sense since the higher gdp is, the more education funds the city will have.

References

1: Linear Regression on Car Price Prediction, via https://rpubs.com/Argaadya/531140
2: Algoritma Academy: Regression Models