Bullying is a common problem affecting children all over the world. It is often driven by aggressive behaviour in a child who feels the urge to be superior to others. According to Wikipedia:
Bullying is the use of force, coercion, or threat, to abuse, aggressively dominate or intimidate. The behavior is often repeated and habitual.
Bullying violates human rights and needs to be prevented as early as possible. In this article I will try to identify which factors help determine whether a child is aggressive or not.
The dataset was collected from Kaggle. The underlying study was designed to evaluate and make inferences about potential relationships between a child’s aggression and several potential predictors, for a little over 650 children in home settings. Half of the data covers households in the U.S. and the other half covers households in England. The data includes 6 variables, which I will explain further in this article.
For this analysis I’ll use 4 libraries: dplyr for data wrangling, gtools to help interpret the summary of the Logistic Regression model, class for the k-Nearest Neighbor function, and caret to evaluate our models.
We will now work with our data in the following order:
1. Read Data
2. Data Wrangling
3. Exploratory Data Analysis
4. Cross Validation
* Data Pre-Processing (for k-NN)
5. Build Model (for Logistic Regression)
6. Predict
7. Model Evaluation
8. Model Tuning
9. Final Model
There are 6 variables :
1. Aggression : defines how aggressive a child is based on other variables
2. Videos : higher means more time spent watching videos on various platforms
3. Electronic.Games : higher means more time spent playing electronic games
4. sibling.agg : higher means more aggression seen from elder sibling
5. Nutrition : higher means the child had a healthy diet
6. Parental.Approach : higher means worse parenting
1. Data Structure and Transformation
Observations: 666
Variables: 6
$ Aggression <dbl> 0.37, 0.77, -0.10, 0.02, -0.28, 0.16, 0.13, 0.01,...
$ Videos <dbl> 0.17, -0.03, -0.07, 0.00, -0.68, 0.20, 0.19, 0.05...
$ Electronic.Games <dbl> 0.14, 0.71, -0.39, -0.41, -0.28, 0.32, 0.31, -0.1...
$ sibling.agg <dbl> -0.33, 0.58, -0.22, 0.05, -0.89, -0.15, 0.51, 0.1...
$ Nutrition <dbl> -0.11, -0.02, 0.28, -0.26, 0.23, -0.37, -0.66, 0....
$ Parental.Approach <dbl> -0.28, -1.25, -0.33, -1.01, 0.49, -1.74, 0.09, -0...
In order to do classification we need a binary target that answers yes or no to a statement. Thus we’ll transform the Aggression variable into 0s and 1s, where 0 means that the child is not aggressive and 1 means the child is aggressive. We’ll use 0 as the threshold: when Aggression is lower than 0, the child is labeled 1 (aggressive); otherwise the child is labeled 0.
Our target variable must also be a factor, so let’s convert it.
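The transformation described above could look like the following minimal sketch. It uses only the first eight Aggression scores printed in the structure output, stored in a hypothetical vector aggression_score; the same two steps would be applied to the Aggression column of the full data frame.

```r
# First eight Aggression scores from the structure output (illustration only)
aggression_score <- c(0.37, 0.77, -0.10, 0.02, -0.28, 0.16, 0.13, 0.01)

# Label 1 when the score is below the threshold of 0, otherwise label 0
aggression_label <- ifelse(aggression_score < 0, 1, 0)

# The target has to be a factor for classification
aggression_label <- as.factor(aggression_label)

print(aggression_label)
```

The resulting labels agree with the first eight values of the converted Aggression column shown in the next structure output.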
Check the data structure once again
Observations: 666
Variables: 6
$ Aggression <fct> 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1...
$ Videos <dbl> 0.17, -0.03, -0.07, 0.00, -0.68, 0.20, 0.19, 0.05...
$ Electronic.Games <dbl> 0.14, 0.71, -0.39, -0.41, -0.28, 0.32, 0.31, -0.1...
$ sibling.agg <dbl> -0.33, 0.58, -0.22, 0.05, -0.89, -0.15, 0.51, 0.1...
$ Nutrition <dbl> -0.11, -0.02, 0.28, -0.26, 0.23, -0.37, -0.66, 0....
$ Parental.Approach <dbl> -0.28, -1.25, -0.33, -1.01, 0.49, -1.74, 0.09, -0...
2. Missing Values
Missing values would be a problem when building a machine learning model, so let’s eliminate them if there are any.
[1] FALSE
Fortunately there are no missing values
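A check of this kind can be sketched with base R as below. The data frame toy is hypothetical and has one NA injected on purpose, so that both the detection and the removal steps do something visible; on the real data the check simply returns FALSE.

```r
# Hypothetical toy data with one missing value injected for illustration
toy <- data.frame(
  Videos    = c(0.17, -0.03, NA),
  Nutrition = c(-0.11, -0.02, 0.28)
)

# TRUE if at least one value in the whole data frame is missing
has_missing <- anyNA(toy)

# Number of missing values per column, useful to locate the problem
missing_per_column <- colSums(is.na(toy))

# Drop every row that contains a missing value
toy_clean <- na.omit(toy)

print(has_missing)
print(missing_per_column)
print(nrow(toy_clean))
```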
1. Summary
Let’s check the summary of our data
Aggression Videos Electronic.Games sibling.agg
0:331 Min. :-1.06000 Min. :-1.15000 Min. :-1.430000
1:335 1st Qu.:-0.18000 1st Qu.:-0.17000 1st Qu.:-0.157500
Median :-0.01000 Median : 0.00000 Median : 0.010000
Mean :-0.02471 Mean : 0.01314 Mean : 0.008664
3rd Qu.: 0.15000 3rd Qu.: 0.19000 3rd Qu.: 0.187500
Max. : 0.98000 Max. : 1.62000 Max. : 1.100000
Nutrition Parental.Approach
Min. :-1.28000 Min. :-4.460000
1st Qu.:-0.16000 1st Qu.:-0.577500
Median : 0.01000 Median : 0.030000
Mean : 0.01189 Mean : 0.009279
3rd Qu.: 0.19000 3rd Qu.: 0.520000
Max. : 1.22000 Max. : 3.990000
The main thing to focus on is our target variable: is it balanced or not? Here is the share of each class, in percent:
0 1
49.6997 50.3003
The target variable is balanced.
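The percentages above follow directly from the class counts in the summary (331 zeros and 335 ones). A minimal sketch, rebuilding a hypothetical target vector from those counts rather than loading the data:

```r
# Rebuild a target with the class counts reported in the summary
target <- factor(rep(c(0, 1), times = c(331, 335)))

# Share of each class, in percent
class_share <- prop.table(table(target)) * 100

print(round(class_share, 4))
```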
2. To rule out perfect separation, let’s check the relationship between each predictor and our target using boxplots
par(mfrow=c(3,2))
plot(aggression$Aggression,aggression$Videos,main = "Aggression vs Videos",xlab = "Aggression",ylab = "Videos")
plot(aggression$Aggression,aggression$Electronic.Games,main = "Aggression vs Electronic Games",xlab = "Aggression",ylab = "Electronic.Games")
plot(aggression$Aggression,aggression$sibling.agg,main = "Aggression vs Sibling Aggression",xlab = "Aggression",ylab = "sibling.agg")
plot(aggression$Aggression,aggression$Nutrition,main = "Aggression vs Nutrition",xlab = "Aggression",ylab = "Nutrition")
plot(aggression$Aggression,aggression$Parental.Approach,main = "Aggression vs Parental Approach",xlab = "Aggression",ylab = "Parental.Approach")
No predictor separates the two classes perfectly, so the data is suitable for logistic regression modelling.
Both the Logistic Regression and k-Nearest Neighbor methods require some preparation before we can fit the data to the model.
First, Cross Validation is a method used to divide the data into train and test sets (here a single hold-out split). The purpose is to test how well the model maintains its performance on unseen data. Here I will use random sampling to split our data 80-20.
# Set seed to keep the same indices
set.seed(412)
# Set 80% random value from the data
index_rand <- sample(nrow(aggression), 0.8*nrow(aggression))
agg.train <- aggression[index_rand,]
agg.test <- aggression[-index_rand,]
Now we check the class balance of the train data once again
0 1
0.5018797 0.4981203
A k-NN model requires 3 different objects: the predictors from the train data, the predictors from the test data, and the labels from the train data. confusionMatrix additionally requires the labels from the test data, so that the predictions can be compared against the actual values. We’ll now create all 4 objects.
Since k-NN measures closeness with Euclidean Distance, we need to scale our data so that all predictors are on the same scale. This prevents predictors with a wider range from dominating the distance.
I’ll use z-score scaling with the scale() function, because the variables have no fixed range.
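One possible way to build these four objects is sketched below on synthetic data. The names toy, train.predictor, test.predictor, train.label and test.label are hypothetical stand-ins for the article’s agg.train.predictor, agg.test.predictor, agg.train.label and agg.test.label. The sketch scales the test predictors with the mean and standard deviation of the train predictors, which is an assumption on my part and may differ from how the original objects were built.

```r
set.seed(412)

# Synthetic stand-in for the real data: a binary target and two predictors
toy <- data.frame(
  Aggression        = factor(sample(c(0, 1), 100, replace = TRUE)),
  Videos            = rnorm(100),
  Parental.Approach = rnorm(100, sd = 3)
)

# 80-20 random split, as in the article
idx       <- sample(nrow(toy), 0.8 * nrow(toy))
toy.train <- toy[idx, ]
toy.test  <- toy[-idx, ]

# Predictors: every column except the target, z-score scaled
train.predictor <- scale(toy.train[, -1])

# Reuse the train mean and sd so both sets live on the same scale
test.predictor <- scale(
  toy.test[, -1],
  center = attr(train.predictor, "scaled:center"),
  scale  = attr(train.predictor, "scaled:scale")
)

# Labels: the target column only
train.label <- toy.train$Aggression
test.label  <- toy.test$Aggression
```

Scaling the test set with the train statistics keeps information from the test data out of the preprocessing step.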
Our first machine learning model is Logistic Regression. It connects the regression value to a probability through the odds. The regression returns the log of odds, which we convert to odds. Finally, we convert the odds to a probability, ranging from 0 to 1, which is then used to separate the data classes.
We’ll use the stepwise regression method to select predictors; step() compares the candidate models by AIC.
# First we need models as a scope
model.all <- glm(formula = Aggression~., data = agg.train, family = "binomial")
model.none <- glm(formula = Aggression~1, data = agg.train, family = "binomial")
# Choosing model with stepwise method
model.aggression.back <- step(object = model.all, direction = "backward", trace = 0)
model.aggression.both <- step(object = model.none, scope = list(lower = model.none, upper = model.all),direction = "both", trace = 0)
model.aggression.forward <- step(object = model.none, scope = list(lower = model.none, upper = model.all),direction = "forward", trace = 0)
glm(formula = Aggression ~ Electronic.Games + Nutrition + Parental.Approach,
family = "binomial", data = agg.train)
glm(formula = Aggression ~ Parental.Approach + Nutrition + Electronic.Games,
family = "binomial", data = agg.train)
glm(formula = Aggression ~ Parental.Approach + Nutrition + Electronic.Games,
family = "binomial", data = agg.train)
All three methods selected the same predictors, so I’ll only take one model for further evaluation: model.aggression.both.
The most important and powerful part of Logistic Regression is interpretation. Every value in the summary can be interpreted.
Here is the summary of our logistic regression model
Call:
glm(formula = Aggression ~ Parental.Approach + Nutrition + Electronic.Games,
family = "binomial", data = agg.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8150 -1.1601 -0.7299 1.1711 1.5407
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02142 0.08836 -0.242 0.808478
Parental.Approach -0.32702 0.09731 -3.361 0.000778 ***
Nutrition 0.64011 0.29104 2.199 0.027852 *
Electronic.Games -0.58514 0.28254 -2.071 0.038357 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 737.5 on 531 degrees of freedom
Residual deviance: 718.5 on 528 degrees of freedom
AIC: 726.5
Number of Fisher Scoring iterations: 4
All variables are significant, with Parental.Approach being the most significant. Now let’s analyze the summary one by one.
1. AIC and Residual Deviance
The main focus in a Logistic Regression result is the AIC and the Residual Deviance. These two values largely represent how well the model fits. The AIC (Akaike Information Criterion) is 726.5; it estimates the relative amount of information lost by the model, so lower is better, and it is mainly useful for comparing models fitted to the same data. Null Deviance measures the lack of fit of a model with no predictors, while Residual Deviance measures the lack of fit of our current model. The difference between them is only 19 (737.5 against 718.5), so the predictors improve the fit only modestly.
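The numbers in the summary can be tied together with a little arithmetic. The sketch below assumes a binary 0/1 response, for which the AIC of a binomial glm equals the Residual Deviance plus twice the number of estimated coefficients; the variable names are hypothetical and the values are copied from the summary.

```r
# Values copied from the model summary
null_deviance     <- 737.5
residual_deviance <- 718.5
n_coefficients    <- 4   # intercept + 3 predictors

# Reduction in deviance gained by adding the three predictors
deviance_drop <- null_deviance - residual_deviance

# For a 0/1 response: AIC = residual deviance + 2 * number of coefficients
aic <- residual_deviance + 2 * n_coefficients

# Share of the null deviance explained by the predictors
explained_share <- deviance_drop / null_deviance

print(deviance_drop)
print(aic)
print(round(explained_share, 4))
```

The explained share of roughly 2.6% gives a sense of how modest the improvement over the null model is.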
2. Intercept
Intercept is the log of odds value when all predictors are 0.
[1] "Log of Odds : -0.06098"
[1] "Odds : 0.940842056428971"
[1] "Probability : 0.484759722365076"
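Interpretation : When Parental Approach, Nutrition, and Electronic Games are all 0, there is a 48.5% probability that the child will be aggressive. (The log of odds printed here, -0.06098, differs from the intercept of -0.02142 in the summary above; the interpretation follows the printed value.)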
3. Slope
Slope is the increase (or decrease) in the log of odds when one predictor is increased by 1, assuming all other predictors are held constant. Here I will look at one slope, the highest one, from the variable Nutrition.
[1] "Log of Odds : 0.51517"
[1] "Odds : 1.67392304452828"
[1] "Probability : 0.626017658942607"
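Interpretation : For each 1-unit increase in Nutrition, the odds of the child being aggressive are multiplied by 1.67, assuming the same Parental Approach and duration of playing Electronic Games. (The log of odds printed here, 0.51517, differs from the Nutrition estimate of 0.64011 in the summary above; the interpretation follows the printed value.)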
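The three printed values in the Intercept and Slope items can be reproduced with base R alone. This is a minimal sketch: interpret_log_odds is a hypothetical helper, and it uses plain arithmetic rather than the gtools package mentioned earlier. The two inputs are the log of odds values printed above.

```r
# Convert a log of odds value into odds and probability
interpret_log_odds <- function(log_odds) {
  odds        <- exp(log_odds)        # log of odds -> odds
  probability <- odds / (1 + odds)    # odds -> probability, same as plogis()
  c(log_odds = log_odds, odds = odds, probability = probability)
}

# Log of odds values printed in the Intercept and Slope items
intercept_values <- interpret_log_odds(-0.06098)
nutrition_values <- interpret_log_odds(0.51517)

print(round(intercept_values, 4))
print(round(nutrition_values, 4))
```

For a slope, the odds value is the one to read: it is the factor by which the odds change per 1-unit increase in the predictor.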
A model needs to be tested on the test data to see how accurately it can predict unseen data. If we are not satisfied with the result, we can tune the model by adjusting the predictors or arguments.
By default, the predict() function returns the log of odds. To return a probability we use type = "response".
Now let’s convert it into a binary class using 0.5 as the threshold.
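The two prediction steps can be sketched as follows. The model here is fitted on synthetic data, and toy, toy.model and new.data are hypothetical names; in the article the same calls would be made with model.aggression.both and agg.test.

```r
set.seed(412)

# Synthetic data: one predictor and a binary target that depends on it
toy   <- data.frame(x = rnorm(200))
toy$y <- factor(rbinom(200, 1, plogis(0.8 * toy$x)))

toy.model <- glm(y ~ x, data = toy, family = "binomial")
new.data  <- data.frame(x = c(-2, 0, 2))

# Default output of predict() for a glm: the log of odds (type = "link")
log.odds <- predict(toy.model, new.data)

# type = "response" returns probabilities between 0 and 1
prob <- predict(toy.model, new.data, type = "response")

# Probabilities above the 0.5 threshold are labeled "1", the rest "0"
pred.class <- ifelse(prob > 0.5, "1", "0")

print(data.frame(log.odds, prob, pred.class))
```

Lowering or raising the 0.5 threshold trades Recall against Precision, which is one more argument that can be adjusted during tuning.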
First we need a starting point for k, which we get by taking the square root of the number of train observations (532). Since our target is binary, we use an odd k to avoid tied votes.
[1] 23
So our candidate k values are the odd numbers around 23: {19,21,23,25,27}. I’ll use 23 for our first prediction, and the other candidates for model tuning.
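The rule of thumb above can be written out as a short sketch. The 532 train rows follow from the 80% split of 666 observations; the variable names are hypothetical, and the bump to the next odd number is only there in case the square root rounds to an even value.

```r
# Number of rows in the train data (80% of 666, truncated by sample())
n_train <- 532

# Rule of thumb: k is about the square root of the train size
k <- round(sqrt(n_train))

# With two classes an odd k cannot produce a tied vote
if (k %% 2 == 0) k <- k + 1

# Two odd candidates on each side of k for tuning
candidate_k <- seq(k - 4, k + 4, by = 2)

print(k)
print(candidate_k)
```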
We can evaluate our model by comparing the predictions with the actual values, either with table or with the confusionMatrix function from the caret library.
Two values in the confusion matrix deserve particular attention:
Sensitivity/Recall : the share of actual Positives that the model predicts as Positive. A low value means many False Negatives, where we predict Negative but the actual is Positive
Pos Pred Value/Precision : the share of predicted Positives that are actually Positive. A low value means many False Positives, where we predict Positive but the actual is Negative
How do the actual and predicted values compare?
table_logistic <- data.frame(logistic_predict, agg.test.label)
names(table_logistic) <- c("Prediction", "Actual")
table_logistic
A more precise comparison is by using confusionMatrix()
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 47 42
1 17 28
Accuracy : 0.5597
95% CI : (0.4714, 0.6453)
No Information Rate : 0.5224
P-Value [Acc > NIR] : 0.218416
Kappa : 0.1322
Mcnemar's Test P-Value : 0.001781
Sensitivity : 0.4000
Specificity : 0.7344
Pos Pred Value : 0.6222
Neg Pred Value : 0.5281
Prevalence : 0.5224
Detection Rate : 0.2090
Detection Prevalence : 0.3358
Balanced Accuracy : 0.5672
'Positive' Class : 1
From the result above we know that the model predicts the unseen data with an accuracy of 55.97%, only slightly above the No Information Rate of 52.24%. The Precision (Pos Pred Value, 62.22%) is higher than the Sensitivity (40.00%), which means the model produces fewer False Positives (17) than False Negatives (42): few children are predicted as Aggressive when they are actually Not Aggressive, but 42 of the 70 aggressive children in the test data are missed. If the goal is to educate as many aggressive children as possible, this low Sensitivity is a limitation to keep in mind.
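The headline statistics can be recomputed by hand from the four cells of the confusion matrix above, with 1 as the positive class. This is a minimal sketch with hypothetical variable names; the counts are copied from the matrix.

```r
# Cells of the logistic regression confusion matrix (positive class = 1)
tp <- 28   # predicted 1, actual 1
fn <- 42   # predicted 0, actual 1
fp <- 17   # predicted 1, actual 0
tn <- 47   # predicted 0, actual 0

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # Recall: share of actual positives found
specificity <- tn / (tn + fp)   # share of actual negatives found
precision   <- tp / (tp + fp)   # Pos Pred Value: share of predicted positives that are right

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```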
We can still tune our model to get better predictions; keep in mind that Precision (Pos Pred Value) is our focus metric.
How do the actual and predicted values compare?
table_knn <- data.frame(knn_predict, agg.test.label)
names(table_knn) <- c("Prediction", "Actual")
table_knn
A more precise comparison is by using confusionMatrix()
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 43 27
1 21 43
Accuracy : 0.6418
95% CI : (0.5544, 0.7227)
No Information Rate : 0.5224
P-Value [Acc > NIR] : 0.003472
Kappa : 0.285
Mcnemar's Test P-Value : 0.470486
Sensitivity : 0.6719
Specificity : 0.6143
Pos Pred Value : 0.6143
Neg Pred Value : 0.6719
Prevalence : 0.4776
Detection Rate : 0.3209
Detection Prevalence : 0.5224
Balanced Accuracy : 0.6431
'Positive' Class : 0
With the k-Nearest Neighbor method we get 64.18% accuracy, a pretty good result. Note that this matrix was computed with 0 as the positive class; taking 1 (aggressive) as the positive class, as in the logistic regression evaluation, the Precision is 43/64 = 67.19% and the Recall is 43/70 = 61.43%. The two are fairly close, while earlier I said that we want a higher Precision than Recall.
We can still tune our model to get better predictions; keep in mind that Precision (Pos Pred Value) is our focus metric.
Model Tuning is needed to enhance the performance of our model, so that we end up with the best model possible.
In logistic regression, we can modify the model by choosing different predictors.
Here I’ll test and compare the model with all variables and the model from the stepwise method.
Let’s see which model gives the better precision
# predict data test with model.all
predict_all <- predict(model.all, agg.test, type = "response")
# convert into 0 and 1
predict_all <- ifelse(predict_all > 0.5, "1", "0")
# Model with Stepwise Method
confusionMatrix(data = as.factor(logistic_predict), reference = as.factor(agg.test.label),positive = "1")$byClass[3]
Pos Pred Value
0.6222222
# Model with All Variables
confusionMatrix(data = as.factor(predict_all), reference = as.factor(agg.test.label),positive = "1")$byClass[3]
Pos Pred Value
0.6136364
The model from the stepwise method (model.aggression.both) has a slightly higher precision (0.622 against 0.614), so I keep it as the logistic regression model for this data.
For k-NN, we can tune the model by trying different values of k. We’ll use 19, 21, 25, and 27, based on the calculation I did before
knn_predict19 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 19)
knn_predict21 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 21)
knn_predict25 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 25)
knn_predict27 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 27)
Precision Comparison
[1] "k = 19"
Pos Pred Value
0.625
# k = 21
confusionMatrix(data = as.factor(knn_predict21), reference = as.factor(agg.test.label),positive = "1")$byClass[3]
Pos Pred Value
0.6349206
# k = 23
confusionMatrix(data = as.factor(knn_predict), reference = as.factor(agg.test.label),positive = "1")$byClass[3]
Pos Pred Value
0.671875
# k = 25
confusionMatrix(data = as.factor(knn_predict25), reference = as.factor(agg.test.label),positive = "1")$byClass[3]
Pos Pred Value
0.6461538
# k = 27
confusionMatrix(data = as.factor(knn_predict27), reference = as.factor(agg.test.label),positive = "1")$byClass[3]
Pos Pred Value
0.6333333
With k = 23, we have the highest precision.
So we can conclude the best k-NN Model is the one with k = 23 (knn_predict)
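The comparison across k values can also be written as a loop instead of one call per k. The sketch below runs on synthetic data, so its numbers are unrelated to the results above. It assumes the class package is installed, computes precision by hand rather than through caret, and uses the hypothetical helper precision_for_k.

```r
library(class)   # provides knn()

set.seed(412)

# Synthetic two-class data with two predictors
n <- 200
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n, sd = 0.5) > 0, "1", "0"))

# 80-20 random split
idx <- sample(n, 0.8 * n)

# Precision for class "1" on the test part, for a given k
precision_for_k <- function(k) {
  pred <- knn(train = x[idx, ], test = x[-idx, ], cl = y[idx], k = k)
  tp <- sum(pred == "1" & y[-idx] == "1")
  fp <- sum(pred == "1" & y[-idx] == "0")
  tp / (tp + fp)
}

candidate_k <- c(19, 21, 23, 25, 27)
precisions  <- sapply(candidate_k, precision_for_k)
names(precisions) <- candidate_k

# k with the highest precision among the candidates
best_k <- candidate_k[which.max(precisions)]

print(round(precisions, 4))
print(best_k)
```

Choosing k on the test data, as done here and in the article, makes the reported precision somewhat optimistic; a separate validation set would give a cleaner estimate.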
We have 2 models/predictions from two methods (Logistic Regression and k-NN), here is how they compare to each other.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 47 42
1 17 28
Accuracy : 0.5597
95% CI : (0.4714, 0.6453)
No Information Rate : 0.5224
P-Value [Acc > NIR] : 0.218416
Kappa : 0.1322
Mcnemar's Test P-Value : 0.001781
Sensitivity : 0.4000
Specificity : 0.7344
Pos Pred Value : 0.6222
Neg Pred Value : 0.5281
Prevalence : 0.5224
Detection Rate : 0.2090
Detection Prevalence : 0.3358
Balanced Accuracy : 0.5672
'Positive' Class : 1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 43 27
1 21 43
Accuracy : 0.6418
95% CI : (0.5544, 0.7227)
No Information Rate : 0.5224
P-Value [Acc > NIR] : 0.003472
Kappa : 0.285
Mcnemar's Test P-Value : 0.470486
Sensitivity : 0.6719
Specificity : 0.6143
Pos Pred Value : 0.6143
Neg Pred Value : 0.6719
Prevalence : 0.4776
Detection Rate : 0.3209
Detection Prevalence : 0.5224
Balanced Accuracy : 0.6431
'Positive' Class : 0
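The k-Nearest Neighbor method has the higher accuracy (64.18% against 55.97%), but accuracy alone doesn’t decide which model is better. In this case we prioritise Precision over Recall. Note that the two matrices above use different positive classes; with 1 as the positive class for both, k-NN also has the higher Precision (67.19% in the tuning step, against 62.22%). What the Logistic Regression model offers in return is that it is interpretable.
For that reason I keep Logistic Regression as the model to predict whether a child is aggressive or not.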
Finally, of all the models we have tested, I choose the Logistic Regression model from the stepwise method as the one to use. It is interpretable and has a reasonable Precision value, although the k-NN model scores higher on both accuracy and Precision. However, the AIC and the Residual Deviance are still high. I see two possible reasons for this. First, more variables are needed to determine whether a child is aggressive or not, such as teachers’ lessons, social environment, close friends, etc. Second, there is probably a better machine learning model for predicting a child’s aggressiveness.
Parental.Approach is the most significant variable, which makes sense since a child’s character is shaped largely by their parents. Nutrition and Electronic Games also have a significant effect on a child’s aggressiveness. So, to prevent a child from becoming aggressive, parents need to give more attention and education to their children; limiting the time spent playing electronic games would also help. Note that the coefficients of Parental.Approach and Electronic.Games are negative, so the direction of these effects should be read against the labelling used in this analysis (1 = Aggression lower than 0).