1 Introduction

1.1 Background

Bullying is a common problem affecting children all over the world. It is mainly driven by aggressive behaviour, where a child feels the urge to be superior to others. According to Wikipedia :

Bullying is the use of force, coercion, or threat, to abuse, aggressively dominate or intimidate. The behavior is often repeated and habitual.

This violates human rights and needs to be prevented as early as possible. In this article I will try to find out which factors could determine whether a child is aggressive or not.

1.2 Data Explanation

The dataset was collected from Kaggle. The study was designed to evaluate and make inferences about potential relationships between a child’s aggression and multiple potential predictors, for a little over 650 children in home settings. Half of the data looks at households in the U.S. and the other half examines households in England. The data includes 6 variables, which I will explain further in this article.

1.3 Libraries

For this analysis I’ll use 4 libraries: dplyr for data wrangling, gtools for the inv.logit() function used to interpret the Logistic Regression summary, class for the k-Nearest Neighbor function, and caret to evaluate our models.

library(dplyr)
library(gtools)
library(class)
library(caret)

1.4 Workflow

We will work with our data in this order :
1. Read Data
2. Data Wrangling
3. Exploratory Data Analysis
4. Cross Validation
* Data Pre-Processing (for k-NN)
5. Build Model (for Logistic Regression)
6. Predict
7. Model Evaluation
8. Model Tuning
9. Final Model

2 Data Preparation

2.1 Read Data

aggression <- read.csv("aggression.csv")
aggression

There are 6 variables :
1. Aggression : how aggressive a child is (our target, to be predicted from the other variables)
2. Videos : higher means more time spent watching videos on various platforms
3. Electronic.Games : higher means more time spent playing electronic games
4. sibling.agg : higher means more aggression seen from elder sibling
5. Nutrition : higher means the child had a healthy diet
6. Parental.Approach : higher means worse parenting

2.2 Data Wrangling

1. Data Structure and Transformation

glimpse(aggression)

Observations: 666
Variables: 6
$ Aggression        <dbl> 0.37, 0.77, -0.10, 0.02, -0.28, 0.16, 0.13, 0.01,...
$ Videos            <dbl> 0.17, -0.03, -0.07, 0.00, -0.68, 0.20, 0.19, 0.05...
$ Electronic.Games  <dbl> 0.14, 0.71, -0.39, -0.41, -0.28, 0.32, 0.31, -0.1...
$ sibling.agg       <dbl> -0.33, 0.58, -0.22, 0.05, -0.89, -0.15, 0.51, 0.1...
$ Nutrition         <dbl> -0.11, -0.02, 0.28, -0.26, 0.23, -0.37, -0.66, 0....
$ Parental.Approach <dbl> -0.28, -1.25, -0.33, -1.01, 0.49, -1.74, 0.09, -0...

In order to do classification we need a binary target that answers yes or no to a statement. Thus we’ll transform the Aggression variable into 0s and 1s, where 0 means the child is not aggressive and 1 means the child is aggressive. We’ll set 0 as the threshold: following the code below, a child whose Aggression is lower than 0 is labeled 1 (aggressive), and every other child is labeled 0.

aggression <- aggression %>% 
  mutate(Aggression = ifelse(Aggression < 0, "1", "0"))

Our target variable must also be a factor, so let’s convert it.

aggression$Aggression <- as.factor(aggression$Aggression)
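The two steps above can also be done in one call. This is a minimal sketch on a toy data frame (the name `toy` and its five scores, copied from the first rows of the glimpse output, are only for illustration); it fixes the factor levels explicitly so that "0" and "1" always appear in that order, even if one class were missing.

```r
# Toy scores standing in for the Aggression column (illustration only)
toy <- data.frame(Aggression = c(0.37, 0.77, -0.10, 0.02, -0.28))

# Same rule as in the article (score < 0 becomes "1"),
# with the factor levels stated explicitly
toy$Aggression <- factor(ifelse(toy$Aggression < 0, "1", "0"),
                         levels = c("0", "1"))

toy$Aggression
levels(toy$Aggression)
```

Stating the levels is optional here, but it keeps the later `positive = "1"` argument unambiguous.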

Check the data structure once again

glimpse(aggression)

Observations: 666
Variables: 6
$ Aggression        <fct> 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1...
$ Videos            <dbl> 0.17, -0.03, -0.07, 0.00, -0.68, 0.20, 0.19, 0.05...
$ Electronic.Games  <dbl> 0.14, 0.71, -0.39, -0.41, -0.28, 0.32, 0.31, -0.1...
$ sibling.agg       <dbl> -0.33, 0.58, -0.22, 0.05, -0.89, -0.15, 0.51, 0.1...
$ Nutrition         <dbl> -0.11, -0.02, 0.28, -0.26, 0.23, -0.37, -0.66, 0....
$ Parental.Approach <dbl> -0.28, -1.25, -0.33, -1.01, 0.49, -1.74, 0.09, -0...

2. Missing Values

Missing values would be a problem when building a machine learning model, so let’s check for them and eliminate any we find.

anyNA(aggression)

[1] FALSE

Fortunately there are no missing values
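Had anyNA() returned TRUE, one possible follow-up is sketched below on a toy data frame (the `toy` object and its values are made up for illustration): count the missing values per column, then keep only the complete rows.

```r
# Toy data with two missing values (hypothetical, not the real dataset)
toy <- data.frame(Videos    = c(0.17, NA, -0.07),
                  Nutrition = c(-0.11, -0.02, NA))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep only rows without any missing value
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

Dropping rows is the simplest option; whether it is appropriate depends on how many rows would be lost.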

2.3 Exploratory Data Analysis

1. Summary

Let’s check the summary of our data

summary(aggression)

 Aggression     Videos         Electronic.Games    sibling.agg       
 0:331      Min.   :-1.06000   Min.   :-1.15000   Min.   :-1.430000  
 1:335      1st Qu.:-0.18000   1st Qu.:-0.17000   1st Qu.:-0.157500  
            Median :-0.01000   Median : 0.00000   Median : 0.010000  
            Mean   :-0.02471   Mean   : 0.01314   Mean   : 0.008664  
            3rd Qu.: 0.15000   3rd Qu.: 0.19000   3rd Qu.: 0.187500  
            Max.   : 0.98000   Max.   : 1.62000   Max.   : 1.100000  
   Nutrition        Parental.Approach  
 Min.   :-1.28000   Min.   :-4.460000  
 1st Qu.:-0.16000   1st Qu.:-0.577500  
 Median : 0.01000   Median : 0.030000  
 Mean   : 0.01189   Mean   : 0.009279  
 3rd Qu.: 0.19000   3rd Qu.: 0.520000  
 Max.   : 1.22000   Max.   : 3.990000

The thing to focus on is our target variable: is it balanced or not?

prop.table(table(aggression$Aggression)) * 100


      0       1 
49.6997 50.3003

The target variable is balanced.

2. To check for perfect separation, let’s look at the relationship between each predictor and our target using boxplots

par(mfrow=c(3,2))
plot(aggression$Aggression,aggression$Videos,main = "Aggression vs Videos",xlab = "Aggression",ylab = "Videos")
plot(aggression$Aggression,aggression$Electronic.Games,main = "Aggression vs Electronic Games",xlab = "Aggression",ylab = "Electronic.Games")
plot(aggression$Aggression,aggression$sibling.agg,main = "Aggression vs Sibling Aggression",xlab = "Aggression",ylab = "sibling.agg")
plot(aggression$Aggression,aggression$Nutrition,main = "Aggression vs Nutrition",xlab = "Aggression",ylab = "Nutrition")
plot(aggression$Aggression,aggression$Parental.Approach,main = "Aggression vs Parental Approach",xlab = "Aggression",ylab = "Parental.Approach")

None of the predictors separates the two classes perfectly (the boxplots of both classes overlap for every variable), so the data is suitable for logistic regression modelling.
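The visual check can be backed by a small numeric one. The helper below, `separates_perfectly()`, is a hypothetical function written for this sketch, not part of any library. It only tests one predictor at a time, so it cannot detect separation that arises from a combination of predictors.

```r
# TRUE when the value ranges of the two classes do not overlap at all
separates_perfectly <- function(x, y) {
  r0 <- range(x[y == "0"])
  r1 <- range(x[y == "1"])
  r0[2] < r1[1] || r1[2] < r0[1]
}

# Toy example: one overlapping predictor, one perfectly separated predictor
toy_y       <- factor(c("0", "0", "0", "1", "1", "1"))
overlapping <- c(-0.5, 0.1, 0.4, -0.2, 0.3, 0.8)
separated   <- c(-0.9, -0.5, -0.1, 0.2, 0.6, 1.1)

separates_perfectly(overlapping, toy_y)  # FALSE
separates_perfectly(separated, toy_y)    # TRUE
```

On the real data this could be applied to every predictor with `sapply(aggression[-1], separates_perfectly, y = aggression$Aggression)`.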

3 Model Preparation

Both the Logistic Regression and k-Nearest Neighbor methods require some preparation before we can fit the data into the model.

3.1 Cross Validation

First, Cross Validation. Here it means dividing the data into train and test sets (a simple hold-out split). The purpose is to test how well the model maintains its performance on unseen data. I will use random sampling to split the data, 80% for training and 20% for testing.

# Set seed to keep the same indices
set.seed(412)

# Set 80% random value from the data
index_rand <- sample(nrow(aggression), 0.8*nrow(aggression))
agg.train <- aggression[index_rand,]
agg.test <- aggression[-index_rand,]

Now we check the class balance once again

prop.table(table(agg.train$Aggression))


        0         1 
0.5018797 0.4981203
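Plain random sampling kept the classes balanced here. When it does not, a stratified split is one alternative. The sketch below uses base R only and a simulated data frame (`toy`, with made-up columns `label` and `x`); it samples 80% of the rows inside each class.

```r
set.seed(412)

# Simulated data: 50 rows per class (illustration only)
toy <- data.frame(
  label = factor(rep(c("0", "1"), times = c(50, 50))),
  x     = rnorm(100)
)

# Row numbers grouped by class, then 80% sampled within each group
rows_by_class <- split(seq_len(nrow(toy)), toy$label)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = floor(0.8 * length(rows)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in the train set match the full data exactly
prop.table(table(toy_train$label))
```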

3.2 Data Pre-processing

A k-NN model requires 3 different objects: predictors from train data, predictors from test data, and labels from train data. confusionMatrix additionally requires the labels from test data to compare the predictions against the actual values. Now we’ll make all 4 objects.

Since k-NN measures closeness with Euclidean Distance, we need to scale our data so that the predictors are on the same scale. Otherwise a predictor with a wider range (here Parental.Approach, which spans about -4.46 to 3.99) would dominate the distance.

I’ll use z-score scaling with the scale() function, because the variables have no fixed range. The test data is scaled with the center and scale of the train data, so no information from the test data leaks into the model.

agg.train.predictor <- scale(agg.train[,-1])
agg.test.predictor <- scale(agg.test[,-1], 
                                 center = attr(agg.train.predictor, "scaled:center"),
                                 scale = attr(agg.train.predictor, "scaled:scale"))
agg.train.label <- agg.train[,1]
agg.test.label <- agg.test[,1]
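To make the scaling step concrete, here is a small sketch on simulated matrices (`train` and `test` are made-up stand-ins for the predictor columns). It shows that the scaled train data has mean 0 and standard deviation 1 per column, and that scaling the test data with the train attributes is the same as subtracting the train means and dividing by the train standard deviations by hand.

```r
set.seed(1)

# Simulated predictors (illustration only)
train <- matrix(rnorm(40, mean = 5, sd = 2), ncol = 2)
test  <- matrix(rnorm(10, mean = 5, sd = 2), ncol = 2)

train_scaled <- scale(train)
test_scaled  <- scale(test,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))

# Train columns now have mean 0 and standard deviation 1
round(colMeans(train_scaled), 10)
apply(train_scaled, 2, sd)

# Manual version of the test scaling, using train statistics only
manual <- sweep(sweep(test, 2, colMeans(train)), 2, apply(train, 2, sd), "/")
all.equal(manual, test_scaled, check.attributes = FALSE)
```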

4 Logistic Regression Model

4.1 Build Model

Our first machine learning model is based on Logistic Regression. It connects regression and probability through odds. The regression returns the log of odds, which we convert to odds. Finally, we convert the odds to a probability, ranging from 0 to 1. This probability is then used to separate the data classes.
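The chain from log of odds to probability can be sketched with a few made-up values. `plogis()` and `qlogis()` are the base R logistic and logit functions; the vector `log_odds` is only an example.

```r
# Example log of odds values (illustration only)
log_odds <- c(-2, 0, 2)

# Log of odds -> odds -> probability
odds <- exp(log_odds)
prob <- odds / (1 + odds)

# Base R does the same conversion in one step, and back again
all.equal(prob, plogis(log_odds))
all.equal(qlogis(prob), log_odds)

# A log of odds of 0 is a probability of exactly 0.5,
# so with a 0.5 threshold it is still labeled "0"
ifelse(prob > 0.5, "1", "0")
```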

We’ll use the stepwise regression method to find significant variables.

# First we need models as a scope
model.all <- glm(formula = Aggression~., data = agg.train, family = "binomial")
model.none <- glm(formula = Aggression~1, data = agg.train, family = "binomial")

# Choosing model with stepwise method
model.aggression.back <- step(object = model.all, direction = "backward", trace = 0)
model.aggression.both <- step(object = model.none, scope = list(lower = model.none, upper = model.all),direction = "both", trace = 0)
model.aggression.forward <- step(object = model.none, scope = list(lower = model.none, upper = model.all),direction = "forward", trace = 0)

model.aggression.back$call

glm(formula = Aggression ~ Electronic.Games + Nutrition + Parental.Approach, 
    family = "binomial", data = agg.train)

model.aggression.both$call

glm(formula = Aggression ~ Parental.Approach + Nutrition + Electronic.Games, 
    family = "binomial", data = agg.train)

model.aggression.forward$call

glm(formula = Aggression ~ Parental.Approach + Nutrition + Electronic.Games, 
    family = "binomial", data = agg.train)

All three methods derived the same predictors, so I’ll only take one model for further evaluation, that is model.aggression.both.

4.2 Interpretation

The most important and powerful part of Logistic Regression is Interpretation. Every value in the summary can be interpreted.
Here is the summary of our logistic regression model.

summary(model.aggression.both)


Call:
glm(formula = Aggression ~ Parental.Approach + Nutrition + Electronic.Games, 
    family = "binomial", data = agg.train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8150  -1.1601  -0.7299   1.1711   1.5407  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -0.02142    0.08836  -0.242 0.808478    
Parental.Approach -0.32702    0.09731  -3.361 0.000778 ***
Nutrition          0.64011    0.29104   2.199 0.027852 *  
Electronic.Games  -0.58514    0.28254  -2.071 0.038357 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 737.5  on 531  degrees of freedom
Residual deviance: 718.5  on 528  degrees of freedom
AIC: 726.5

Number of Fisher Scoring iterations: 4

All variables are significant at the 5% level, with Parental.Approach being the most significant. Now let’s analyze the summary one by one.

1. AIC and Residual Deviance

The main focus in a Logistic Regression result is the AIC and the Residual Deviance. These two values summarize how well the model fits. The AIC (Akaike Information Criterion) is 726.5; it estimates the information lost by the model, with a penalty for the number of parameters, and is mainly useful for comparing models fitted to the same data (lower is better). Null Deviance measures the lack of fit of a model with no predictors, while Residual Deviance measures the lack of fit of our current model. The difference between them is only 19 (737.5 - 718.5), so the three predictors improve the fit only slightly.
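As an arithmetic check, the numbers in the summary can be tied together. The sketch below uses the rounded values printed above (the variable names are my own). For a 0/1 target, the AIC equals the Residual Deviance plus twice the number of estimated coefficients, and the drop in deviance can be compared against a chi-squared distribution.

```r
# Values copied from the printed summary (rounded)
null_deviance     <- 737.5
residual_deviance <- 718.5
n_coefficients    <- 4      # intercept + 3 predictors

# AIC = Residual Deviance + 2 * number of coefficients
aic <- residual_deviance + 2 * n_coefficients
aic  # 726.5, as in the summary

# Drop in deviance and the degrees of freedom it costs (531 - 528)
deviance_drop <- null_deviance - residual_deviance
df_drop       <- 531 - 528

# Likelihood ratio test of the model against the intercept-only model
p_value <- pchisq(deviance_drop, df = df_drop, lower.tail = FALSE)
p_value
```

A small p-value here only says the predictors help compared with no predictors at all; it does not say the fit is good.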

2. Intercept

Intercept is the log of odds value when all predictors are 0. In the summary above it is -0.02142.

paste("Log of Odds : ", -0.02142)

[1] "Log of Odds :  -0.02142"

paste("Odds        : ", round(exp(-0.02142), 4))

[1] "Odds        :  0.9788"

paste("Probability : ", round(inv.logit(-0.02142), 4))

[1] "Probability :  0.4946"

Interpretation : When Parental Approach, Nutrition, and Electronic Games are all 0, there is a 49.5% probability that the child will be aggressive.

3. Slope

Slope is the change in the log of odds when one predictor increases by 1, assuming all other predictors are held constant. Here I will take one slope, the largest one, from variable Nutrition (0.64011 in the summary above).

paste("Log of Odds   : ", 0.64011)

[1] "Log of Odds   :  0.64011"

paste("Odds          : ", round(exp(0.64011), 4))

[1] "Odds          :  1.8967"

paste("Probability   : ", round(inv.logit(0.64011), 4))

[1] "Probability   :  0.6548"

Interpretation : For each one-unit increase in Nutrition, the odds of a child being aggressive are multiplied by about 1.90, assuming the same Parental Approach and duration of playing Electronic Games. This is an odds ratio, so it describes a change in odds rather than a direct change in probability.
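Typing coefficients by hand can easily drift from the fitted model. A minimal sketch of reading them from the model object instead is shown below; it fits a model on simulated data (`toy`, `x1`, `x2`, `y` and the true coefficients are all made up), so its numbers say nothing about the aggression data.

```r
set.seed(412)

# Simulated data with a known relationship (illustration only)
n   <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.1 + 0.6 * toy$x1 - 0.3 * toy$x2))

fit <- glm(y ~ x1 + x2, data = toy, family = "binomial")

# Coefficients are on the log of odds scale
log_odds <- coef(fit)

# Exponentiating gives the baseline odds (intercept) and odds ratios (slopes)
odds_ratios <- exp(log_odds)

# Probability of class 1 when every predictor is 0
baseline_prob <- plogis(log_odds[["(Intercept)"]])

round(odds_ratios, 3)
round(baseline_prob, 3)
```

With the article's model, the same calls would be `coef(model.aggression.both)` and `exp(coef(model.aggression.both))`.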

5 Predict

A model needs to be tested against test data to see how accurately it can predict unseen data. If we are not satisfied with the result, we can tune the model by changing the predictors or arguments.

5.1 Logistic Regression

The predict() function by default returns the log of odds. To return probabilities we use type = "response".

logistic_predict <- predict(object = model.aggression.both, newdata = agg.test, type = "response")

Now let’s convert it into binary class using 0.5 as threshold.

logistic_predict <- ifelse(logistic_predict > 0.5, "1", "0")

5.2 k-Nearest Neighbor

First we need a starting point for k. A common rule of thumb is the square root of the number of training observations. Since our target is binary, we use an odd k to avoid tied votes.

round(sqrt(nrow(agg.train)))

[1] 23

So k should be around 23 and odd, which gives the candidates {19,21,23,25,27}. I’ll use 23 for our first prediction, and the other values for model tuning.

knn_predict <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 23)
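To show what knn() does for a single test row, here is a simplified base R sketch. `knn_one()` is a hypothetical helper written for this example, not the implementation in library class: it computes the Euclidean distance to every train row, takes the k nearest, and returns the majority label. Unlike class::knn, which breaks ties at random, this sketch simply takes the first label with the most votes.

```r
# Classify one query point by majority vote among its k nearest train rows
knn_one <- function(train, labels, query, k) {
  dists   <- sqrt(rowSums(sweep(train, 2, query)^2))  # Euclidean distances
  nearest <- order(dists)[seq_len(k)]                 # indices of k closest rows
  votes   <- table(labels[nearest])                   # votes per class
  names(votes)[which.max(votes)]
}

# Toy train data: three points near (0, 0) and three near (5, 5)
train  <- matrix(c(0, 0,  0, 1,  1, 0,
                   5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
labels <- factor(c("0", "0", "0", "1", "1", "1"))

knn_one(train, labels, query = c(0.5, 0.5), k = 3)  # "0"
knn_one(train, labels, query = c(5.5, 5.5), k = 3)  # "1"
```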

6 Model Evaluation

We can evaluate our model by comparing the predictions with the actual values, either with a table or with the confusionMatrix() function from library caret.

Two values in the confusion matrix deserve more attention:

Sensitivity/Recall : the share of actual Positives that the model predicts as Positive, TP / (TP + FN). It drops as False Negatives (predicted Negative, actually Positive) increase
Pos Pred Value/Precision : the share of predicted Positives that are actually Positive, TP / (TP + FP). It drops as False Positives (predicted Positive, actually Negative) increase
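Both values can be computed by hand from the four cells of a confusion matrix. The sketch below uses the counts from the Logistic Regression confusion matrix in section 6.1, with class 1 as the positive class; the variable names are my own.

```r
# Cells of the Logistic Regression confusion matrix (positive class = 1)
tp <- 28  # predicted 1, actual 1
fp <- 17  # predicted 1, actual 0
fn <- 42  # predicted 0, actual 1
tn <- 47  # predicted 0, actual 0

recall    <- tp / (tp + fn)                   # Sensitivity
precision <- tp / (tp + fp)                   # Pos Pred Value
accuracy  <- (tp + tn) / (tp + fp + fn + tn)

round(c(recall = recall, precision = precision, accuracy = accuracy), 4)
# recall 0.4000, precision 0.6222, accuracy 0.5597
```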

6.1 Logistic Regression

How do the actual and predicted values compare?

table_logistic <- data.frame(logistic_predict, agg.test.label)
names(table_logistic) <-  c("Prediction", "Actual")
table_logistic

A more precise comparison is by using confusionMatrix()

confusionMatrix(data = as.factor(logistic_predict), reference = as.factor(agg.test.label),positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 47 42
         1 17 28
                                          
               Accuracy : 0.5597          
                 95% CI : (0.4714, 0.6453)
    No Information Rate : 0.5224          
    P-Value [Acc > NIR] : 0.218416        
                                          
                  Kappa : 0.1322          
                                          
 Mcnemar's Test P-Value : 0.001781        
                                          
            Sensitivity : 0.4000          
            Specificity : 0.7344          
         Pos Pred Value : 0.6222          
         Neg Pred Value : 0.5281          
             Prevalence : 0.5224          
         Detection Rate : 0.2090          
   Detection Prevalence : 0.3358          
      Balanced Accuracy : 0.5672          
                                          
       'Positive' Class : 1

From the result above, the model predicts the unseen data with a modest accuracy (55.97%), only slightly above the No Information Rate (52.24%). The Precision (Pos Pred Value, 62.22%) is higher than the Sensitivity (40.00%). This means the model produces fewer False Positives (17) than False Negatives (42): most children it flags as aggressive really are aggressive, but it misses 42 of the 70 aggressive children in the test data. A high Precision means few children are wrongly labeled as aggressive, which is the case we prefer here.

We can still tune our model for a better prediction, keeping in mind that Precision (Pos Pred Value) is our focus metric.

6.2 k-Nearest Neighbor

How do the actual and predicted values compare?

table_knn <- data.frame(knn_predict, agg.test.label)
names(table_knn) <-  c("Prediction", "Actual")
table_knn

A more precise comparison is by using confusionMatrix()

confusionMatrix(data = as.factor(knn_predict), reference = as.factor(agg.test.label),positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 43 27
         1 21 43
                                          
               Accuracy : 0.6418          
                 95% CI : (0.5544, 0.7227)
    No Information Rate : 0.5224          
    P-Value [Acc > NIR] : 0.003472        
                                          
                  Kappa : 0.285           
                                          
 Mcnemar's Test P-Value : 0.470486        
                                          
            Sensitivity : 0.6143          
            Specificity : 0.6719          
         Pos Pred Value : 0.6719          
         Neg Pred Value : 0.6143          
             Prevalence : 0.5224          
         Detection Rate : 0.3209          
   Detection Prevalence : 0.4776          
      Balanced Accuracy : 0.6431          
                                          
       'Positive' Class : 1

With the k-Nearest Neighbor method we get 64.18% accuracy, clearly above the No Information Rate (52.24%). Taking class 1 as the positive class, as for Logistic Regression, Precision is 43/64 = 67.19% and Recall is 43/70 = 61.43%, so the two are fairly close, with Precision slightly higher.

We can still tune our model for a better prediction, keeping in mind that Precision (Pos Pred Value) is our focus metric.

7 Model Tuning

Model Tuning is needed to enhance the performance of our model. Thus, we’ll end up with the best model possible.

7.1 Logistic Regression

In logistic regression, we can modify the models by choosing different predictors.

Here I’ll test and compare the model with all variables against the model from the stepwise method.
Let’s see which model gives the better precision.

# predict data test with model.all
predict_all <- predict(model.all, agg.test, type = "response")

# convert into 0 and 1
predict_all <- ifelse(predict_all > 0.5, "1", "0")

# Model with Stepwise Method
confusionMatrix(data = as.factor(logistic_predict), reference = as.factor(agg.test.label),positive = "1")$byClass[3]

Pos Pred Value 
     0.6222222

# Model with All Variables
confusionMatrix(data = as.factor(predict_all), reference = as.factor(agg.test.label),positive = "1")$byClass[3]

Pos Pred Value 
     0.6136364

The stepwise model has a slightly higher precision (0.6222 against 0.6136), so we can conclude that the model with stepwise method (model.aggression.both) is the best logistic regression model for this data.

7.2 k-Nearest Neighbor

For k-NN, an evaluation can be made by using different k. We’ll use 19, 21, 25, and 27, based on the calculations I’ve done before

knn_predict19 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 19)
knn_predict21 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 21)
knn_predict25 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 25)
knn_predict27 <- knn(train = agg.train.predictor, test = agg.test.predictor, cl = agg.train.label, k = 27)

Precision Comparison

# k = 19
confusionMatrix(data = as.factor(knn_predict19), reference = as.factor(agg.test.label),positive = "1")[4]$byClass[3]

Pos Pred Value 
         0.625

# k = 21
confusionMatrix(data = as.factor(knn_predict21), reference = as.factor(agg.test.label),positive = "1")[4]$byClass[3]

Pos Pred Value 
     0.6349206

# k = 23
confusionMatrix(data = as.factor(knn_predict), reference = as.factor(agg.test.label),positive = "1")[4]$byClass[3]

Pos Pred Value 
      0.671875

# k = 25
confusionMatrix(data = as.factor(knn_predict25), reference = as.factor(agg.test.label),positive = "1")[4]$byClass[3]

Pos Pred Value 
     0.6461538

# k = 27
confusionMatrix(data = as.factor(knn_predict27), reference = as.factor(agg.test.label),positive = "1")[4]$byClass[3]

Pos Pred Value 
     0.6333333

With k = 23, we have the highest precision.

So we can conclude the best k-NN Model is the one with k = 23 (knn_predict)
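The five near-identical calls above can be written as a loop. The sketch below runs on simulated data so that it stands alone; `make_data()` and `precision_for_k()` are hypothetical helpers, and the precisions it prints are unrelated to the aggression data. With the article's objects, `train$x`, `test$x`, `train$y` and `test$y` would be agg.train.predictor, agg.test.predictor, agg.train.label and agg.test.label.

```r
library(class)
set.seed(412)

# Simulated two-class data: class "1" is shifted by 1 on both predictors
make_data <- function(n) {
  y     <- factor(sample(c("0", "1"), n, replace = TRUE), levels = c("0", "1"))
  shift <- ifelse(y == "1", 1, 0)
  list(x = cbind(a = rnorm(n) + shift, b = rnorm(n) + shift), y = y)
}
train <- make_data(200)
test  <- make_data(60)

# Precision for class "1": TP / (TP + FP)
precision_for_k <- function(k) {
  pred <- knn(train = train$x, test = test$x, cl = train$y, k = k)
  sum(pred == "1" & test$y == "1") / sum(pred == "1")
}

k_values   <- c(19, 21, 23, 25, 27)
precisions <- sapply(k_values, precision_for_k)
names(precisions) <- paste0("k=", k_values)

precisions
best_k <- k_values[which.max(precisions)]
best_k
```

Because k is chosen on the same test data that is used for the final evaluation, the best precision found this way is likely to be somewhat optimistic.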

8 Closing

8.1 Final Model

We have 2 models/predictions from two methods (Logistic Regression and k-NN), here is how they compare to each other.

confusionMatrix(data = as.factor(logistic_predict), reference = as.factor(agg.test.label),positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 47 42
         1 17 28
                                          
               Accuracy : 0.5597          
                 95% CI : (0.4714, 0.6453)
    No Information Rate : 0.5224          
    P-Value [Acc > NIR] : 0.218416        
                                          
                  Kappa : 0.1322          
                                          
 Mcnemar's Test P-Value : 0.001781        
                                          
            Sensitivity : 0.4000          
            Specificity : 0.7344          
         Pos Pred Value : 0.6222          
         Neg Pred Value : 0.5281          
             Prevalence : 0.5224          
         Detection Rate : 0.2090          
   Detection Prevalence : 0.3358          
      Balanced Accuracy : 0.5672          
                                          
       'Positive' Class : 1

confusionMatrix(data = as.factor(knn_predict), reference = as.factor(agg.test.label),positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 43 27
         1 21 43
                                          
               Accuracy : 0.6418          
                 95% CI : (0.5544, 0.7227)
    No Information Rate : 0.5224          
    P-Value [Acc > NIR] : 0.003472        
                                          
                  Kappa : 0.285           
                                          
 Mcnemar's Test P-Value : 0.470486        
                                          
            Sensitivity : 0.6143          
            Specificity : 0.6719          
         Pos Pred Value : 0.6719          
         Neg Pred Value : 0.6143          
             Prevalence : 0.5224          
         Detection Rate : 0.3209          
   Detection Prevalence : 0.4776          
      Balanced Accuracy : 0.6431          
                                          
       'Positive' Class : 1

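The k-Nearest Neighbor method has the higher accuracy (64.18% against 55.97%) and, with class 1 as the positive class, also the higher Precision (43/64 = 67.19% against 62.22%). Higher scores alone don’t make it the better choice here, though: the Logistic Regression model is interpretable, so we can read how each predictor relates to the outcome.

For that reason, Logistic Regression is the model I prefer for explaining whether a child is aggressive or not.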

8.2 Conclusion

Finally, of all the models we have tested, the Logistic Regression model from the stepwise method is the one to use: it is interpretable and reaches a Precision of 62.22%. However, its AIC and Residual Deviance are high. I see two possible reasons for this. First, more variables are needed to determine whether a child is aggressive or not, such as teachers’ lessons, social environment, close friends, etc. Second, there is probably a better machine learning model for predicting a child’s aggression.

Parental.Approach is the most significant variable, which makes sense since a child’s character is shaped mostly by their parents. Nutrition and Electronic Games also have a significant relationship with a child’s aggression. So, to prevent a child from becoming aggressive, parents need to give more attention and education to their children; limiting the time spent playing electronic games may also help.

Child Aggression

Asido Rogate

2020-02-25