Analysis on Average Year of Education
What Factors Can Increase the Average Year of Education?
We will begin the analysis using crime dataset by several steps:
1. Read & Tidy Data
2. Exploratory Data Analysis (correlation matrix with GGAlly, correlation with target, multicol)
3. Variable/Feature Selection (business-wise, stepwise with backward)
4. Model Evaluation (Rsquared, Error with MSE/RMSE/MAE/MAPE)
5. Assumption Test (testing with Best Linear Unbiased Estimator)
6. Linear Model Prediction
1. Read the Data
# Read the data and remove the index column "X"
crime <- read.csv("data_input/crime.csv") %>%
select(-X)
# Rename the columns
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
# Remove "police_exp60" since it is similar to the other variable ("police_exp59")
crime <- crime %>%
select(-police_exp60)
# Glance into the data
glimpse(crime)Observations: 47
Variables: 15
$ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140…
$ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, …
$ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 1…
$ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116,…
$ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632…
$ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 10…
$ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, …
$ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 10…
$ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 8…
$ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, …
$ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526…
$ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174…
$ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399…
$ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.…
$ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856,…
percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.mean_education: mean years of schoolingpolice_exp59: police expenditure in 1959labour_participation: labour force participation ratem_per1000f: number of males per 1000 femalesstate_pop: state populationnonwhites_per1000: number of non-whites resident per 1000 peopleunemploy_m24: unemployment rate of urban males aged 14-24unemploy_m39: unemployment rate of urban males aged 35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: avg time served in prisonscrime_rate: crime rate in an unspecified category
2. EDA
After we read and prepare the crime dataset, now we can start to look into correlation among each variables using library(GGAlly) and create a Pearson correlation:
** My question based on this correlation heat map: How do we increase the average years of schooling among citizens?**
Target variable: mean_education
Based on this map, I will pick 1 variables that have strong correlation: gdp (positive strong correlation, 0.8)
Linear Model for GDP
Call:
lm(formula = mean_education ~ gdp, data = crime)
Residuals:
Min 1Q Median 3Q Max
-11.8692 -6.6220 0.0665 6.9303 13.7223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.80723 6.24782 9.733 1.21e-12 ***
gdp 0.08533 0.01170 7.293 3.75e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.657 on 45 degrees of freedom
Multiple R-squared: 0.5417, Adjusted R-squared: 0.5315
F-statistic: 53.19 on 1 and 45 DF, p-value: 3.755e-09
Based on the summary of this model, it confirms that the gdp variable is statistically significant (based on its p-value) to mean_education (our target variable); and from its positive slope (‘Estimate’) value, the higher the gdp in a city, the lower the higher the average education year in that city, which makes sense.
However, the model itself shows an adjusted r-squared value of 59%, which is not good, as means the model has poor fit. If so, what are the variables that actually affect the mean_education?
3. Variable/Feature Selection
Because the previous manual method to make model(s) of mean_education, we will cross-check how R can provide a new model using backward elimination (evaluating model by decreasing predicting variables until it provides the smallest AIC values with the highest adjusted R-squared).
model.all <- lm(mean_education ~ ., crime)
model_back <- step(
object = model.all,
direction = "backward"
)Start: AIC=171.34
mean_education ~ percent_m + is_south + police_exp59 + labour_participation +
m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 +
unemploy_m39 + gdp + inequality + prob_prison + time_prison +
crime_rate
Df Sum of Sq RSS AIC
- m_per1000f 1 0.092 951.02 169.35
- state_pop 1 0.166 951.09 169.35
- is_south 1 2.258 953.18 169.45
- gdp 1 2.980 953.90 169.49
- time_prison 1 6.931 957.85 169.68
- police_exp59 1 13.368 964.29 170.00
- prob_prison 1 14.351 965.27 170.05
- nonwhites_per1000 1 19.886 970.81 170.31
<none> 950.92 171.34
- percent_m 1 45.385 996.31 171.53
- labour_participation 1 77.857 1028.78 173.04
- unemploy_m24 1 87.763 1038.69 173.49
- inequality 1 165.736 1116.66 176.89
- unemploy_m39 1 210.067 1160.99 178.72
- crime_rate 1 213.513 1164.44 178.86
Step: AIC=169.35
mean_education ~ percent_m + is_south + police_exp59 + labour_participation +
state_pop + nonwhites_per1000 + unemploy_m24 + unemploy_m39 +
gdp + inequality + prob_prison + time_prison + crime_rate
Df Sum of Sq RSS AIC
- state_pop 1 0.308 951.32 167.36
- is_south 1 2.171 953.19 167.45
- gdp 1 2.915 953.93 167.49
- time_prison 1 6.846 957.86 167.68
- police_exp59 1 14.061 965.08 168.04
- prob_prison 1 14.300 965.32 168.05
- nonwhites_per1000 1 20.299 971.31 168.34
<none> 951.02 169.35
- percent_m 1 49.286 1000.30 169.72
- labour_participation 1 108.351 1059.37 172.42
- unemploy_m24 1 125.523 1076.54 173.17
- inequality 1 168.566 1119.58 175.02
- crime_rate 1 220.066 1171.08 177.13
- unemploy_m39 1 220.397 1171.41 177.14
Step: AIC=167.36
mean_education ~ percent_m + is_south + police_exp59 + labour_participation +
nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
prob_prison + time_prison + crime_rate
Df Sum of Sq RSS AIC
- is_south 1 2.089 953.41 165.47
- gdp 1 3.085 954.41 165.51
- time_prison 1 6.685 958.01 165.69
- prob_prison 1 14.227 965.55 166.06
- police_exp59 1 14.557 965.88 166.08
- nonwhites_per1000 1 20.130 971.45 166.35
<none> 951.32 167.36
- percent_m 1 51.731 1003.05 167.85
- labour_participation 1 108.057 1059.38 170.42
- unemploy_m24 1 125.216 1076.54 171.17
- inequality 1 176.553 1127.88 173.36
- unemploy_m39 1 220.591 1171.91 175.16
- crime_rate 1 223.557 1174.88 175.28
Step: AIC=165.47
mean_education ~ percent_m + police_exp59 + labour_participation +
nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
prob_prison + time_prison + crime_rate
Df Sum of Sq RSS AIC
- gdp 1 4.734 958.15 163.70
- time_prison 1 8.513 961.93 163.88
- prob_prison 1 16.508 969.92 164.27
- police_exp59 1 17.127 970.54 164.30
- nonwhites_per1000 1 18.185 971.60 164.35
<none> 953.41 165.47
- percent_m 1 49.884 1003.30 165.86
- labour_participation 1 122.663 1076.08 169.15
- unemploy_m24 1 129.799 1083.21 169.46
- inequality 1 186.935 1140.35 171.88
- unemploy_m39 1 219.876 1173.29 173.22
- crime_rate 1 227.690 1181.10 173.53
Step: AIC=163.7
mean_education ~ percent_m + police_exp59 + labour_participation +
nonwhites_per1000 + unemploy_m24 + unemploy_m39 + inequality +
prob_prison + time_prison + crime_rate
Df Sum of Sq RSS AIC
- time_prison 1 7.27 965.41 162.05
- police_exp59 1 13.47 971.61 162.35
- prob_prison 1 16.61 974.76 162.51
- nonwhites_per1000 1 22.22 980.36 162.78
<none> 958.15 163.70
- percent_m 1 58.38 1016.52 164.48
- unemploy_m24 1 126.90 1085.05 167.54
- labour_participation 1 133.82 1091.96 167.84
- unemploy_m39 1 215.41 1173.56 171.23
- crime_rate 1 260.45 1218.60 173.00
- inequality 1 386.68 1344.83 177.63
Step: AIC=162.05
mean_education ~ percent_m + police_exp59 + labour_participation +
nonwhites_per1000 + unemploy_m24 + unemploy_m39 + inequality +
prob_prison + crime_rate
Df Sum of Sq RSS AIC
- police_exp59 1 11.87 977.28 160.63
- nonwhites_per1000 1 33.07 998.48 161.64
<none> 965.41 162.05
- prob_prison 1 55.61 1021.03 162.69
- percent_m 1 61.38 1026.79 162.95
- labour_participation 1 139.28 1104.70 166.39
- unemploy_m24 1 162.85 1128.27 167.38
- unemploy_m39 1 253.26 1218.67 171.00
- crime_rate 1 274.93 1240.35 171.83
- inequality 1 384.36 1349.77 175.81
Step: AIC=160.63
mean_education ~ percent_m + labour_participation + nonwhites_per1000 +
unemploy_m24 + unemploy_m39 + inequality + prob_prison +
crime_rate
Df Sum of Sq RSS AIC
<none> 977.28 160.63
- percent_m 1 49.55 1026.83 160.95
- prob_prison 1 56.94 1034.22 161.29
- nonwhites_per1000 1 57.02 1034.30 161.29
- labour_participation 1 166.99 1144.27 166.04
- unemploy_m24 1 191.73 1169.01 167.05
- unemploy_m39 1 263.00 1240.28 169.83
- crime_rate 1 405.95 1383.23 174.96
- inequality 1 554.24 1531.52 179.74
Start: AIC=171.34
mean_education ~ percent_m + is_south + police_exp59 + labour_participation +
m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 +
unemploy_m39 + gdp + inequality + prob_prison + time_prison +
crime_rate
For now, we’ll have 2 new model from forward and backward analysis; which we are going to evaluate later on below.
4. Model Evaluation
We will use the library(MLmetrics) to evaluate the model.
RMSE (Root Mean Squared Error)
[1] 4.498045
[1] 4.559959
In general, the smaller the RMSE, the better the model is. Again, both of the model produce similar (small, error) value.
Adjusted R-squared comparison:
[1] 0.7625521
[1] 0.7945014
Backward stepwise model provides better model to measure mean_education.
5. Assumption Test
Normality residual
We’ll be using several methods to check whether our residual data is normally distributed (H0: Residual is normally distributed and H1: Residual is not normally distributed), using: - Residual Check Using Histogram Plot
- Shapiro Test: we will check whether if our model has big/enough p-value (which means the residual value is distributed normally)
Shapiro-Wilk normality test
data: model_back$residuals
W = 0.98866, p-value = 0.925
Because our p-value from Shapiro test is higher than alpha (0.925 > 0.05), then our decision will be to accept H0, which means that our residual data from the model is normally distributed.
No-Multicollinearity
And for the last assumption test, we are going to test multicollinearity with vif (variance inflation factor). The goal is to see whether our variables have vif score < 10 each.
percent_m labour_participation nonwhites_per1000
2.238560 1.592336 2.708349
unemploy_m24 unemploy_m39 inequality
2.952959 3.929435 2.436282
prob_prison crime_rate
1.671663 1.585600
As we can see from the result above, the vif value is way under 10 each, which means there is no multicollinearity from each variables (they are not affecting each other).
Summary
Based on our backward model (which has higher Adjusted R-Squared Values of 80%), some of the variables that can significantly affect the average year of education in the city are crime_rate, inequality, unemploy_m39, unemploy_m24, and labour_participation. Peculiarly enough, the variabe gdp does not even included in the backward model provided by R; however as a user, we can always include gdp into our model, which still makes sense since the higher gdp is, the more education funds the city will have.
References
1: Linear Regression on Car Price Prediction, via https://rpubs.com/Argaadya/531140
2: Algoritma Academy: Regression Models