Assignment 1

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(glmnet)
library(caret)
library(tidyverse)
library(prettydoc)

Step 1: Dataset

This dataset was provided to us during the Econometrics course. It contains children's scores on math, science and reading tests. Besides these three test scores, it contains a variety of variables that could potentially influence the test scores, such as gender, race, living conditions, sports, activities outside school, and many others. We think it will be interesting to see whether we can predict pupils' grades from characteristics of the pupils and their environment. The dataset contains 40 variables, so we have enough data to apply subset selection or shrinkage.

library(haven)
testscores <- read_dta("testscores.dta")

Step 2: Three Plots

# Add a new variable to the dataset: the average of the three test scores
test <- testscores %>% mutate(avg_test = (reading_test + math_test + science_test) / 3)

# Density-scaled histograms per test, plus a density line for the average score.
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0).
ggplot() +
  geom_histogram(data = test, mapping = aes(reading_test, after_stat(density), colour = "Reading Test"), alpha = 0.1, fill = "white") +
  geom_histogram(data = test, mapping = aes(math_test, after_stat(density), colour = "Math Test"), alpha = 0.1, fill = "white") +
  geom_histogram(data = test, mapping = aes(science_test, after_stat(density), colour = "Science Test"), alpha = 0.1, fill = "white") +
  labs(title = "Test Scores Histogram",
       x = "Test Score",
       y = "Density") +
  # Named values tie each colour to its test explicitly
  scale_colour_manual(name = "Test Type",
                      values = c("Reading Test" = "red", "Math Test" = "orange", "Science Test" = "blue")) +
  geom_density(data = test, mapping = aes(avg_test)) +
  theme_minimal()

This histogram shows the distribution of the three test score variables: reading, math and science. The graph shows that their distributions are quite similar, although they are not all ‘smooth’. We therefore decided to take the average of the three test scores. The advantage of the average score over a specific score is that it gives us a more ‘complete’ picture of the effects of the variables on a child's test performance. Taking the average also dampens outliers: for example, a child with dyslexia could be an outlier on the reading test while being great at mathematics. The average test score is shown in the graph as the density line.
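To illustrate how averaging dampens a subject-specific outlier, here is a small sketch with three invented pupils. The scores are hypothetical and are not rows from the dataset; `rowMeans()` is simply an alternative to summing the three columns and dividing by three.

```r
# Hypothetical scores for three pupils (invented for illustration only)
scores <- data.frame(
  reading_test = c(100, 70, 95),   # pupil 2 has an unusually low reading score
  math_test    = c(102, 104, 96),
  science_test = c(98, 100, 94)
)

# Row-wise average of the three tests
scores$avg_test <- rowMeans(scores[, c("reading_test", "math_test", "science_test")])

# Pupil 2 is 30 points below pupil 1 in reading, but under 9 points below on average
scores$avg_test
```

In this toy example the low reading score of the second pupil is partly offset by the math and science scores, which is the effect described above.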

ggplot(test,aes(x = mom_educ, y = avg_test, colour = as.factor(mom_curr_married)))+
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Average Test Scores per Mother's Education Level",
       x = "Mother's Education Level", 
       y = "Average Test Result",
       colour = "Mother Currently Married?") +
  scale_color_discrete(labels = c("No", "Yes")) +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

In the previous plot we determined that avg_test would make the best dependent variable. We believe that a child's environment plays a big role in school performance, especially the family and home situation. This plot displays how a child's average test result relates to the education level of the mother. We assumed that married mothers live in a more stable household, in which the child can also ask the father for help with homework. This is why we plotted separate regression lines for married and non-married mothers. The plot shows that the mother's education has an effect in the same direction for both groups, but the slope and intercept differ quite a bit. Children of non-married mothers have a lower average test result than children of married mothers at the same education level. However, the effect of the mother's education on the child's average test score is bigger for non-married mothers, which could support our assumption.
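Fitting a separate line per group, as `geom_smooth(method = "lm")` does when a colour grouping is set, corresponds to a linear model with an interaction term. The sketch below illustrates this on simulated data. All numbers are invented for illustration and are not estimates from testscores.dta.

```r
# Simulated data: every value here is made up for illustration only
set.seed(1)
n        <- 200
married  <- rbinom(n, 1, 0.5)                # 1 = married, 0 = not married
mom_educ <- sample(1:5, n, replace = TRUE)   # education level 1 to 5
avg_test <- 85 + 5 * married + 3 * mom_educ - 1 * married * mom_educ + rnorm(n)

# One model with an interaction gives each group its own intercept and slope
fit <- lm(avg_test ~ mom_educ * married)
b   <- coef(fit)

slope_not_married <- b[["mom_educ"]]
slope_married     <- b[["mom_educ"]] + b[["mom_educ:married"]]
c(not_married = slope_not_married, married = slope_married)
```

A negative interaction coefficient, as in this simulation, means the line for married mothers is flatter, which is the pattern described for the plot.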

ggplot(test, aes(x = family_income, y = avg_test)) +
  geom_point(aes(colour = avg_test > mean(avg_test)), shape = "square") +
  geom_smooth(method = "loess") +
  labs(title = "Average Test Score per Annual Family Income",
       x = "Annual Family Income", 
       y = "Average Test Score", 
       colour = "Above Average Testscore") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

Annual family income can potentially explain a lot about the situation in which a family lives. For example, a higher income can often be explained by a higher education level of the earners in the family. Higher income in most cases means a more stable household for the children. Families with higher income may also be able to afford extra tutoring for their children. Higher income, and thus higher education, could also mean that the parents value school more than lower-educated parents do. The plot uses two colours: blue means that the child's average test score was above the mean, red means at or below the mean. The plot also contains a loess regression line, which shows the relationship between family income and average test score. The line makes clear that family income has a positive effect on average test score until family income reaches about 500k. A potential explanation for the negative effect beyond that point could be that above a certain income (around 500k) the parents are busier with work than with their children. Unfortunately, there are fewer data points at family incomes above 500k; this increases the standard error and makes the estimate in that range less reliable.

Step 3: Shrinkage

From the previous section we could clearly see that there is some sort of relation between the home situation and the grades children get. With this in mind we can formulate an interesting research question: which of the family characteristics have the biggest influence on the average test score of the child? The family characteristics variables in this dataset include the following:

mom_educ: the education level of the mother, a categorical variable that takes values from 1 to 5, with 5 meaning in possession of a college degree and 1 meaning only elementary school education. The mother and father have a lot of influence on the characteristics of the family, so it should be included.

mom_married_at_birth: a dummy variable that takes the value 1 if the mother was married at the birth of the child and 0 if she was not. Like the previous variable, the marital status of the mother/parents plays an important part in shaping the family characteristics.

family_income: a continuous variable representing the annual family income. Family income can play a big role in creating and shaping family characteristics.

mom_work_status: a categorical variable that is 1 if the mother works more than 35 hours per week, 2 if she works less than 35 hours per week, and 3 if she does not work at all. The working status of the parents, in this case the mother, contributes to the family characteristics.

siblings: the number of siblings the observed child has. The number of children in a family is a family characteristic.

hhsize: the household size, i.e. the number of people living in the house. Like siblings, the number of people living in the household is a family characteristic.

tv variables (tv_afternoon_mf, tv_afterdinner_mf, tv_saturday, tv_sunday): these variables represent the number of hours of TV watched on different days and times; mf means Monday to Friday. The number of hours a child gets to watch TV reflects a family habit, which can be seen as a family characteristic as well.

dinner_as_family: the number of nights the family has dinner together. This is a family habit, which can be seen as a family characteristic.

home_language_nonenglish: a dummy variable that is 1 if the home language is not English and 0 if it is English. The home language is a family characteristic.

mom_curr_married: a dummy variable representing the current marital status of the mother; 1 means that the mother is currently married. The marital status is an important factor in family characteristics.

family_type: a categorical variable that displays the type of family. 1 = 2 parents & siblings, 2 = 2 parents & no siblings, 3 = 1 parent & siblings, 4 = 1 parent & no siblings, 5 = other.

We decided to train our model using shrinkage, a family of methods that includes LASSO and ridge regression. LASSO sets the coefficients of irrelevant variables exactly to zero, which is useful in this case because the dataset contains many candidate predictors (37 once the three individual test scores are removed). Consequently, we end up with a selection of the most important variables.
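For reference, the penalty behind these two methods can be written out. The block below states the objective that glmnet minimises for a Gaussian response, following the glmnet documentation. The symbols are our own notation and do not appear in the analysis itself: y_i is avg_test for child i, x_i is that child's vector of predictors, n is the number of training rows, and lambda is the penalty value.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting alpha = 1 leaves only the L1 term (LASSO), and alpha = 0 leaves only the L2 term (ridge). The L1 term can set coefficients exactly to zero, whereas the L2 term only shrinks them towards zero, which is why LASSO doubles as a variable selection method.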

# To get the same results each time we run the process, we fix the random seed with set.seed().
set.seed(45)

 
# We don't want to include reading scores, math scores and science scores in our LASSO, because we are using the average test grade as the response variable. So, we deleted the first 3 columns.
testaverage <- test[-c(1:3)]


# We split our dataset into 50% for the training set, 30% for the validation set and 20% for the test set.
split <- c(rep("train", 4052), rep("valid", 2431), rep("test", 1622))

 
test1 <- testaverage %>% mutate(split = sample(split))


test_train <- test1 %>% filter(split == "train")
test_valid <- test1 %>% filter(split == "valid")
test_test <- test1 %>% filter(split == "test")
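The group sizes in the split above are hardcoded. A minimal sketch of deriving them from the row count instead is shown here, assuming the 8105 rows implied by 4052 + 2431 + 1622. The names `n_train`, `n_valid`, `n_test` and `labels` are our own and are not used elsewhere in the analysis.

```r
set.seed(45)

n <- 8105                           # assumed row count: 4052 + 2431 + 1622

n_train <- floor(0.5 * n)           # 50% for training
n_valid <- floor(0.3 * n)           # 30% for validation
n_test  <- n - n_train - n_valid    # the remainder goes to the test set

# Shuffle the labels so that rows are assigned to the groups at random
labels <- sample(rep(c("train", "valid", "test"),
                     times = c(n_train, n_valid, n_test)))

table(labels)
```

Taking the test set as the remainder guarantees that the three groups always add up to the number of rows, even when the percentages do not divide evenly.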

 
x_train <- model.matrix(avg_test ~ ., data = test_train %>% select(-split))
y_train <- test_train$avg_test

 
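# Cross-validate the LASSO penalty on the X matrix without the intercept column, matching the glmnet() call below
cv <- cv.glmnet(x_train[, -1], y_train, alpha = 1)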
# Display the best lambda value
cv$lambda.min

## [1] 0.1106629

plot(cv)

# Doing the actual LASSO.
result <- glmnet(x      = x_train[, -1],             # X matrix without intercept
                 y      = test_train$avg_test,       # Average test as response variable
                 family = "gaussian",                # Normally distributed errors
                 alpha  = 1,                         # LASSO penalty
                 lambda = cv$lambda.min)             # Penalty value
 
result

## 
## Call:  glmnet(x = x_train[, -1], y = test_train$avg_test, family = "gaussian",      alpha = 1, lambda = cv$lambda.min) 
## 
##   Df   %Dev Lambda
## 1 23 0.3273 0.1107

# These are the coefficients of the variables that influence average test scores. Since we used LASSO, the variables that don't influence average test score have a coefficient of zero (shown as a dot).
coef(result)

## 38 x 1 sparse Matrix of class "dgCMatrix"
##                                     s0
## (Intercept)               9.162644e+01
## region                    .           
## gender                   -1.203147e+00
## race                      .           
## bmi                       .           
## mom_educ                  2.116690e+00
## mom_married_at_birth      1.749263e+00
## family_income             1.181683e-05
## mom_work_status           3.519798e-02
## siblings                 -3.059882e-01
## hhsize                   -2.316023e-01
## pct_minority             -6.814436e-01
## part_dance                .           
## part_athletics            3.554333e-01
## part_club                 6.543029e-01
## part_music                1.780729e+00
## part_art                  .           
## tv_afternoon_mf          -3.375498e-02
## tv_afterdinner_mf         .           
## tv_saturday               .           
## tv_sunday                 .           
## dinner_as_family         -4.664585e-02
## home_language_nonenglish -1.943373e-03
## both_parents              .           
## school_type               .           
## problem_crowding         -4.662729e-02
## problem_turnover         -2.143462e-01
## problem_parents          -2.125321e-01
## problem_drugs            -5.681349e-01
## problem_gangs            -2.726330e-01
## problem_crime             .           
## problem_weapons           .           
## problem_attacks          -7.301999e-01
## has_library_card         -1.302411e-01
## has_home_computer         2.876820e+00
## school_has_security       .           
## mom_curr_married          1.319398e+00
## family_type               .

# We can also display only the variables with non-zero coefficients in the following way. Besides the intercept, 23 variables are left after the LASSO, which is about 62% of the 37 predictors.
rownames(coef(result))[which(coef(result) != 0)]

##  [1] "(Intercept)"              "gender"                  
##  [3] "mom_educ"                 "mom_married_at_birth"    
##  [5] "family_income"            "mom_work_status"         
##  [7] "siblings"                 "hhsize"                  
##  [9] "pct_minority"             "part_athletics"          
## [11] "part_club"                "part_music"              
## [13] "tv_afternoon_mf"          "dinner_as_family"        
## [15] "home_language_nonenglish" "problem_crowding"        
## [17] "problem_turnover"         "problem_parents"         
## [19] "problem_drugs"            "problem_gangs"           
## [21] "problem_attacks"          "has_library_card"        
## [23] "has_home_computer"        "mom_curr_married"

# We made a predicted versus observed plot for the model we generated with the `test_valid` data.
x_valid <- model.matrix(avg_test ~ ., data = test_valid %>% select(-split))[, -1]
y_pred <- as.numeric(predict(result, newx = x_valid))


tibble(Predicted = y_pred, Observed = test_valid$avg_test) %>% 
  ggplot(aes(x = Predicted, y = Observed)) +
  geom_point() + 
  geom_abline(slope = 1, intercept = 0, lty = 2) +
  theme_minimal() +
  labs(title = "Predicted versus observed average test scores")

# Compute the mean squared error (MSE) on the validation set.
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
mse(test_valid$avg_test, y_pred)

## [1] 56.23496

Step 4: Conclusion

The research question was as follows: which of the family characteristics (stated in Step 3) have the biggest influence on the average test score of the child? To answer this question we used the shrinkage method LASSO to reduce the number of variables in our model. The lambda used is 0.1107, which is quite a low penalty; this means that many of the coefficients are not set to zero. To answer our research question we look at which family-characteristic coefficient is the largest in absolute value. mom_educ has the largest coefficient with a value of 2.12, which means that for every one-level increase in the mother’s education the average test score increases by 2.12 points. We expected income to have a bigger effect than it does in our current model. However, this coefficient is small because income is measured per currency unit, say $1. The coefficient of family income is multiplied by the change in dollars, and this change is usually far bigger than the change in mother’s education, which only takes on values from 1 to 5.
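The scale argument can be made concrete with a little arithmetic. The sketch below copies the two coefficients from the `coef(result)` output; the income change of 100,000 is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the coef(result) output
beta_income   <- 1.181683e-05   # points per 1 currency unit of family income
beta_mom_educ <- 2.116690       # points per education level

# Hypothetical change in annual family income (illustrative value)
income_change <- 1e5
effect_income <- beta_income * income_change
effect_income                   # predicted change in average test score

# Income change that would match one extra level of mother's education
income_equivalent <- beta_mom_educ / beta_income
income_equivalent
```

Under this model, an income increase of 100,000 corresponds to roughly 1.18 points, and it would take an increase of roughly 179,000 to match one additional level of mother's education.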

When looking at how well the model predicts the data in the predicted versus observed graph, it becomes clear that the predictions have moderate accuracy and low precision. This is partially reflected by our MSE of 56.23, which corresponds to a root mean squared error of 56.23^0.5 = 7.5: a typical prediction is about 7.5 points away from the real average test score. The minimum and maximum test scores in the dataset are 67.84 and 117.44 respectively, so a typical prediction error of 7.5 points on this scale is not that bad. Given this reasonable fit, we conclude that the mother’s education level is the most influential family characteristic for the average test score of a child.
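The error calculation above can be reproduced directly. The sketch below uses only the numbers quoted in this report; the variable names are our own.

```r
mse_valid  <- 56.23496             # validation MSE reported in Step 3
rmse_valid <- sqrt(mse_valid)      # root mean squared error, in test score points
round(rmse_valid, 2)

# Range of the test scores, from the minimum and maximum quoted in the text
score_range <- 117.44 - 67.84

# Typical prediction error as a share of the score range
round(rmse_valid / score_range, 3)
```

The typical error of about 7.5 points is roughly 15% of the 49.6-point range of test scores.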

Luc Cantor, Niek Lieon & Alec Bueger

May 30, 2020
