It contains 41,016 observations including: postcode, state, year, age of child immunised, pc_immun (immunisation coverage percentage ranks), a caution flag, pc_immun_class (coverage grouped into classes 0-8, where 0 denotes 'NP'), PHN code and PHN number, and a range of SEIFA scores. For each of four SEIFA indices there are several fields (maximum, minimum, rank within Australia by decile, rank within Australia by percentage, and so on): the Index of Education & Occupation, where the main variable we want to test is IEO_SCORE; the Index of Economic Resources, where the main variable is IER_SCORE; the Index of Disadvantage, where the main variable is IRSD_SCORE; and the Index of Advantage/Disadvantage, where the main variable is IRSAD_SCORE. In each case the score variable is the straight ABS 'score' per postcode for that index, derived from Census data. Let's first import the data
raw_data <- read.csv("../cleaned_data/seifa_merged.csv")
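As a quick sanity check (a sketch only; output not shown), we can confirm the number of observations and look over the column names before going any further:
#We expect roughly 41,016 rows, with the immunisation fields and the four SEIFA score fields described above
dim(raw_data)
names(raw_data)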
We are going to be running a number of different models, so let’s write some functions to make our code reusable
First, let's clean and filter the data to remove NPs and also select a chosen age (1, 2 or 5) and a chosen year (2011-12 to 2016-17) for each model
# Description
# -----------
# Returns a data frame filtered by a specific age and a specific year
# Adds a column to identify whether a row has immun coverage of >=95% (high_coverage == 1) or < 95% (high_coverage == 0)
# Finally reduce the number of fields to just those we are interested in
# Parameters
# ----------
# dataSet (dataframe) - data frame of raw data to filter
# chosen_age (integer) - age value to filter on
# chosen_year (integer) - year value to filter on
# remove_NP (boolean) - if TRUE, filter out all records with pc_immun_class of 0 (i.e. those with a pc_immun class of 'NP'), since a record without a percentage class cannot be used to predict one. Default is FALSE
# as_factors (boolean) - if TRUE, convert the SEIFA score predictors to factors (default is TRUE)
# Returns - a data frame
get_data <- function(dataSet, chosen_age, chosen_year, remove_NP = FALSE, as_factors = TRUE) {
if (remove_NP) {
output = filter(dataSet, pc_immun_class != 0)
} else {
output = dataSet
}
output = output %>%
filter(age == chosen_age & year == chosen_year) %>%
mutate(high_coverage = as.factor(ifelse(pc_immun_class == 8, 1, 0))) %>%
select(postcode,pc_immun,IRSD_SCORE,IRSAD_SCORE,IER_SCORE,IEO_SCORE,high_coverage)
if(as_factors) {
output$IRSD_SCORE = as.factor(floor(output$IRSD_SCORE))
output$IRSAD_SCORE = as.factor(floor(output$IRSAD_SCORE))
output$IER_SCORE = as.factor(floor(output$IER_SCORE))
output$IEO_SCORE = as.factor(floor(output$IEO_SCORE))
}
return(output)
}
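As a minimal usage sketch (assuming dplyr is already loaded), the data for one-year-olds in 2016 with the 'NP' records removed would be obtained like this:
#Example call (output not shown): one-year-olds in 2016, 'NP' records removed, SEIFA scores as factors
example_age1 <- get_data(raw_data, chosen_age = 1, chosen_year = 2016, remove_NP = TRUE, as_factors = TRUE)
head(example_age1)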
Now, let's create a standard way of splitting our data into train and test sets, 80:20
# Description
# -----------
# Split data into random train and test sets, then add them back to a single dataframe.
# A new column called 'is_train' indicates whether a row belongs to the training set (1) or the test set (0)
# Parameters
# ----------
# dataSet (dataframe) - data frame of data to be split into train and test sets
# Returns - a data frame
train_test_set <- function(dataSet) {
trainset_size = floor(0.8 * nrow(dataSet))
set.seed(333)
#randomly pick observations for the training set using the sample function
trainset_indices = sample(seq_len(nrow(dataSet)), size = trainset_size)
#assign observations to training and testing sets
trainset = dataSet[trainset_indices, ]
testset = dataSet[-trainset_indices, ]
#Add a column called 'is_train' to each data frame: 1 for the training set, 0 for the test set
trainset$is_train = 1
testset$is_train = 0
#Bind the 2 data frames back together
output = rbind(trainset,testset)
return(output)
}
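Continuing the sketch from above, we can check that roughly 80% of rows end up in the training set:
#Counts of test (0) and training (1) rows; we expect an approximate 20:80 split
example_split <- train_test_set(example_age1)
table(example_split$is_train)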
This function will be a standard way of training a glm model, using binomial logistic regression
# Description
# -----------
# Trains a glm model and returns the model.
# Parameters
# ----------
# dataSet (data frame) - data frame of data with test and train split
# model_formula (chr) - the model formula to train (in the form "target ~ predictor")
# Returns - a glm model
train_model <- function(dataSet, model_formula) {
#filter just those rows that are the training set
trainSet = dataSet[dataSet$is_train == 1, ]
#remove the is_train column
trainSet = trainSet[-ncol(trainSet)]
#run the model
model = glm(formula = as.formula(model_formula),
data = trainSet,
family = "binomial")
return(model)
}
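A sketch of how this will be used (IRSD_SCORE here is just one of the four predictors we loop over later):
#Train a binomial logistic regression of high_coverage on IRSD_SCORE using the training rows only
example_model <- train_model(example_split, "high_coverage ~ IRSD_SCORE")
summary(example_model)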
We also need a standard way of calculating predictions. Note that in this instance we are not so much predicting a particular immunisation class as trying to build a model that has the ABILITY to predict the rare, high-immunisation class of >=95%. This is why this classification model predicts anything with a probability of 0.5 or above as a 1.
# Description
# -----------
# Calculates predictions from a glm model and adds probabilities and predictions to the test data frame.
# Parameters
# ----------
# dataSet (data frame) - test set of data
# model (glm model) - the model used to test predictions
# Returns - a data frame
get_predictions <- function(dataSet, model) {
#filter just those rows that are the test set
testSet = dataSet[dataSet$is_train == 0, ]
#remove the is_train column
testSet = testSet[-ncol(testSet)]
#Add the probability scores to the testSet data frame as a new column
testSet$probability = predict(model,
newdata = remove_missing_levels(model, testSet),
type = 'response')
#only keep results without NAs
testSet = testSet[complete.cases(testSet), ]
#Create the predictions. If the probability is greater than or equal to 0.5 then the prediction is set to 1
testSet$prediction = 0
testSet[testSet$probability >= 0.5, "prediction"] = 1
return(testSet)
}
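Continuing the sketch (note this call relies on remove_missing_levels(), which is defined further below):
#Add probability and prediction columns to the test rows of the example split
example_predictions <- get_predictions(example_split, example_model)
head(example_predictions[, c("postcode", "probability", "prediction", "high_coverage")])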
Now that we have a way of getting predictions, we need a standard way to calculate evaluation measures
# Description
# -----------
# Calculates accuracy, precision, recall and F1 scores
# Parameters
# ----------
# name (chr) - a descriptive name to identify the model if compared to others in a data frame (default is NA)
# tn (integer) - count of true negatives from a confusion matrix
# fp (integer) - count of false positives from a confusion matrix
# fn (integer) - count of false negatives from a confusion matrix
# tp (integer) - count of true positives from a confusion matrix
# Returns - a data frame
get_evaluation_measures <- function(name = NA, tn, fp, fn, tp) {
#If any of the counts is missing (NA), treat it as 0
tn = ifelse(is.na(tn), 0, tn)
fp = ifelse(is.na(fp), 0, fp)
fn = ifelse(is.na(fn), 0, fn)
tp = ifelse(is.na(tp), 0, tp)
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
F1 = 2 * ((precision * recall)/(precision + recall))
output = data.frame(name, tn, fp, fn, tp, accuracy, precision, recall, F1)
return(output)
}
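To make the formulae concrete, here is the calculation for the first GLM result reported further below (Age 1, IRSD_SCORE):
#tn = 130, fp = 32, fn = 93, tp = 23, taken from the Age 1 IRSD_SCORE row of the results table
#accuracy  = (23 + 130) / (23 + 130 + 32 + 93) = 0.5504
#precision = 23 / (23 + 32) = 0.4182
#recall    = 23 / (23 + 93) = 0.1983
get_evaluation_measures("Age 1 IRSD_SCORE", tn = 130, fp = 32, fn = 93, tp = 23)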
Area under the curve gets its own function because it requires a different calculation
# Description
# -----------
# Calculates area under the curve
# Parameters
# ----------
# probabilities - a vector of predicted probabilities
# targets - a vector of the actual class values
# Returns - a value (datatype of double) for area under the curve
# Note: prediction() and performance() below come from the ROCR package
get_auc <- function(probabilities, targets) {
probs = as.vector(probabilities)
pred = prediction(probs,targets)
perf_AUC = performance(pred, "auc")
AUC = perf_AUC@y.values[[1]]
return(AUC)
}
This function will use both of the functions above to get all the evaluation measures and put them into a data frame
# Description
# -----------
# Calculates and returns model evaluation measures for test data
# Parameters
# ----------
# dataSet (data frame) - data frame of data with test data, predictions and probabilities
# name (chr) - a descriptive name to identify the model if compared to others in a data frame (default is NA)
# Returns - a data frame
get_evaluation_measures_table <- function(dataSet, name = NA) {
#Note: this assumes both classes (0 and 1) appear in the predictions and the actuals;
#if one class is entirely missing, the four Freq cells below shift position
conf_matrix = table(pred=dataSet$prediction,true=dataSet$high_coverage)
conf_matrix = as.data.frame(conf_matrix)
evaluation_measures = get_evaluation_measures(name,
conf_matrix$Freq[1],
conf_matrix$Freq[2],
conf_matrix$Freq[3],
conf_matrix$Freq[4])
#Get the AUC and add it to our evaluation measures data frame
evaluation_measures$AUC = get_auc(dataSet$probability, dataSet$high_coverage)
return(evaluation_measures)
}
This function sets factor levels that appear only in the test data (and so are missing from the training data) to NA
remove_missing_levels <- function(fit, test_data) {
# drop empty factor levels in test data
test_data %>%
droplevels() %>%
as.data.frame() -> test_data
# 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
# account for it
if (any(class(fit) == "glmmPQL")) {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "", names(unlist(fit$contrasts))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
map(fit$contrasts, function(x) names(unmatrix(x))) %>%
unlist() -> factor_levels
factor_levels %>% str_split(":", simplify = TRUE) %>%
extract(, 1) -> factor_levels
model_factors <- as.data.frame(cbind(factors, factor_levels))
} else {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "", names(unlist(fit$xlevels))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
factor_levels <- unname(unlist(fit$xlevels))
model_factors <- as.data.frame(cbind(factors, factor_levels))
}
# Select column names in test data that are factor predictors in
# trained model
predictors <- names(test_data[names(test_data) %in% factors])
# For each factor predictor in your data, if the level is not in the model,
# set the value to NA
for (i in 1:length(predictors)) {
found <- test_data[, predictors[i]] %in% model_factors[model_factors$factors == predictors[i], ]$factor_levels
if (any(!found)) {
# track which variable
var <- predictors[i]
# set to NA
test_data[!found, predictors[i]] <- NA
# drop empty factor levels in test data
test_data %>%
droplevels() -> test_data
# issue warning to console
# message(sprintf(paste0("Setting missing levels in '%s', only present",
# " in test data but missing in train data,",
# " to 'NA'."),var))
}
}
return(test_data)
}
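A toy illustration (hypothetical data, not part of our analysis) of what this helper does:
#The factor level "c" appears only in the test data, so remove_missing_levels() sets it
#to NA before predict() is called, avoiding the usual "factor has new levels" error
toy_train <- data.frame(y = factor(c(0, 1, 0, 1)), x = factor(c("a", "a", "b", "b")))
toy_fit <- glm(y ~ x, data = toy_train, family = "binomial")
toy_test <- data.frame(y = factor(c(0, 1)), x = factor(c("a", "c")))
remove_missing_levels(toy_fit, toy_test)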
Now to run models for a variety of age categories and model formulae and output the evaluation measures as a data frame
#Get the predictors we want to run models for
predictors <- c("IRSD_SCORE", "IRSAD_SCORE", "IER_SCORE", "IEO_SCORE")
#Get the ages we want to predict for
ages <- c(1,2,5)
#Create an empty data frame to add the results to
glm_evaluation_measures <- data.frame()
#For all the ages
for(age in ages) {
#Get the appropriate filtered data and create the test/train splits
current_data = train_test_set(get_data(raw_data, age, 2016, TRUE, TRUE))
#For all our predictors, train a model based on the model formula, get predictions, and calculate evaluation criteria. Finally, add them to a data frame
for(predictor in predictors) {
current_model_formula = paste("high_coverage ~",predictor)
current_model = train_model(current_data, current_model_formula)
current_predictions = get_predictions(current_data, current_model)
current_evaluation_measures = get_evaluation_measures_table(current_predictions, paste("Age", age, predictor))
glm_evaluation_measures = rbind(glm_evaluation_measures, current_evaluation_measures)
}
}
# Take a look at the evaluation measures
glm_evaluation_measures
## name tn fp fn tp accuracy precision recall
## 1 Age 1 IRSD_SCORE 130 32 93 23 0.5503597 0.41818182 0.19827586
## 2 Age 1 IRSAD_SCORE 113 52 69 48 0.5709220 0.48000000 0.41025641
## 3 Age 1 IER_SCORE 119 47 88 26 0.5178571 0.35616438 0.22807018
## 4 Age 1 IEO_SCORE 97 65 69 42 0.5091575 0.39252336 0.37837838
## 5 Age 2 IRSD_SCORE 228 16 30 3 0.8339350 0.15789474 0.09090909
## 6 Age 2 IRSAD_SCORE 227 12 31 1 0.8413284 0.07692308 0.03125000
## 7 Age 2 IER_SCORE 240 7 33 0 0.8571429 0.00000000 0.00000000
## 8 Age 2 IEO_SCORE 237 7 33 2 0.8566308 0.22222222 0.05714286
## 9 Age 5 IRSD_SCORE 119 48 89 22 0.5071942 0.31428571 0.19819820
## 10 Age 5 IRSAD_SCORE 128 48 72 39 0.5818815 0.44827586 0.35135135
## 11 Age 5 IER_SCORE 130 43 96 14 0.5088339 0.24561404 0.12727273
## 12 Age 5 IEO_SCORE 105 69 54 59 0.5714286 0.46093750 0.52212389
## F1 AUC
## 1 0.26900585 0.4658897
## 2 0.44239631 0.5560995
## 3 0.27807487 0.5186800
## 4 0.38532110 0.5042821
## 5 0.11538462 0.6035147
## 6 0.04444444 0.4735879
## 7 NaN 0.6335419
## 8 0.09090909 0.4435597
## 9 0.24309392 0.4656902
## 10 0.39393939 0.5427416
## 11 0.16766467 0.4676038
## 12 0.48962656 0.5885464
Now let’s try LASSO
First, standardise the training of our LASSO models in a function
# Description
# -----------
# Train a LASSO model
# Parameters
# ----------
# dataSet (dataframe) - data frame of data with test and train split
# Returns - a LASSO model
train_lasso_model <- function(dataSet) {
#filter just those rows that are the training set
trainSet = dataSet[dataSet$is_train == 1, ]
#remove the is_train column, postcode and pc_immun
trainSet = trainSet[,-c(1:2,8)]
#only keep results without NAs
trainSet = trainSet[complete.cases(trainSet), ]
#convert training data to matrix format
lasso_x = model.matrix(high_coverage~.,trainSet)
lasso_y = trainSet$high_coverage
#family="binomial" => logistic regression, alpha=1 => lasso
#cv.glmnet() (from the glmnet package) cross-validates over a grid of lambda values, using AUC as the measure
output = cv.glmnet(lasso_x, lasso_y, alpha=1, family="binomial", type.measure="auc")
return(output)
}
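As a sketch (assuming the glmnet package is loaded), we can inspect the cross-validation results for a single age group before running the full loop below:
#Fit a cross-validated LASSO for one-year-olds, keeping the SEIFA scores numeric rather than factors
example_lasso_data <- train_test_set(get_data(raw_data, 1, 2016, TRUE, FALSE))
example_lasso_model <- train_lasso_model(example_lasso_data)
example_lasso_model$lambda.min   #lambda with the best cross-validated AUC
example_lasso_model$lambda.1se   #largest lambda within one standard error of that best value
plot(example_lasso_model)        #cross-validation curve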
Next, let's standardise the testing of our LASSO models in a function
# Description
# -----------
# Test a LASSO model
# Parameters
# ----------
# dataSet (dataframe) - data frame of data with test and train split
# model - LASSO model used to test prediction
# lambda - the value of lambda we want to use when determining probabilities
# Returns - a data frame of test data with probability and prediction columns added
test_lasso_model <- function(dataSet, model, lambda) {
#filter just those rows that are the test set
testSet = dataSet[dataSet$is_train == 0, ]
#remove the is_train column, postcode and pc_immun
testSet = testSet[,-c(1:2,8)]
#only keep results without NAs
testSet = testSet[complete.cases(testSet), ]
#Convert test data to a model matrix
lasso_x_test = model.matrix(high_coverage~.,testSet)
#Get prediction probabilities
lasso_prob = predict(model, newx = lasso_x_test, s=lambda, type="response")
#translate probabilities to predictions
lasso_predict = rep(0,nrow(testSet))
lasso_predict[lasso_prob>.5] <- 1
output = data.frame(lasso_prob, lasso_predict)
colnames(output) = c("probability", "prediction")
output = cbind(testSet, output)
return(output)
}
Now, we can go through the age groups, train and test a LASSO model for each, and collect the evaluation measures in a data frame that we will later combine with the results from our GLM models.
#Get the ages we want to predict for
ages <- c(1,2,5)
#Create an empty data frame to add the results to
lasso_evaluation_measures <- data.frame()
#For all the ages
for(age in ages) {
#Get the appropriate filtered data (note, we don't want factors for our predictors this time)
current_data = train_test_set(get_data(raw_data, age, 2016, TRUE, FALSE))
#Train the LASSO model
current_model = train_lasso_model(current_data)
#value of lambda that gives the best cross-validated performance
current_lambda_min = current_model$lambda.min
#largest (most regularised) value of lambda within one standard error of that best value
current_lambda_1se = current_model$lambda.1se
#Using lambda of 1se rather than the minimum lambda, see what predictions we get
current_prediction = test_lasso_model(current_data, current_model, current_lambda_1se)
#Get all the evaluation measures
current_evaluation_measures = get_evaluation_measures_table(current_prediction, paste("Age", age, "LASSO"))
#Add the evaluation measures to our evaluation measures table
lasso_evaluation_measures = rbind(lasso_evaluation_measures, current_evaluation_measures)
}
Let’s look at all the evaluation measures together
# Display the LASSO evaluation measures from all models
lasso_evaluation_measures
## name tn fp fn tp accuracy precision recall F1
## 1 Age 1 LASSO 163 9 113 8 0.5836177 0.4705882 0.0661157 0.1159420
## 2 Age 2 LASSO 259 35 0 0 0.8809524 0.0000000 NaN NaN
## 3 Age 5 LASSO 160 26 82 34 0.6423841 0.5666667 0.2931034 0.3863636
## AUC
## 1 0.6111859
## 2 0.5897408
## 3 0.6749629
# Display all the evaluation measures from all models
all_evaluation_measure <- rbind(lasso_evaluation_measures, glm_evaluation_measures)
all_evaluation_measure
## name tn fp fn tp accuracy precision recall
## 1 Age 1 LASSO 163 9 113 8 0.5836177 0.47058824 0.06611570
## 2 Age 2 LASSO 259 35 0 0 0.8809524 0.00000000 NaN
## 3 Age 5 LASSO 160 26 82 34 0.6423841 0.56666667 0.29310345
## 4 Age 1 IRSD_SCORE 130 32 93 23 0.5503597 0.41818182 0.19827586
## 5 Age 1 IRSAD_SCORE 113 52 69 48 0.5709220 0.48000000 0.41025641
## 6 Age 1 IER_SCORE 119 47 88 26 0.5178571 0.35616438 0.22807018
## 7 Age 1 IEO_SCORE 97 65 69 42 0.5091575 0.39252336 0.37837838
## 8 Age 2 IRSD_SCORE 228 16 30 3 0.8339350 0.15789474 0.09090909
## 9 Age 2 IRSAD_SCORE 227 12 31 1 0.8413284 0.07692308 0.03125000
## 10 Age 2 IER_SCORE 240 7 33 0 0.8571429 0.00000000 0.00000000
## 11 Age 2 IEO_SCORE 237 7 33 2 0.8566308 0.22222222 0.05714286
## 12 Age 5 IRSD_SCORE 119 48 89 22 0.5071942 0.31428571 0.19819820
## 13 Age 5 IRSAD_SCORE 128 48 72 39 0.5818815 0.44827586 0.35135135
## 14 Age 5 IER_SCORE 130 43 96 14 0.5088339 0.24561404 0.12727273
## 15 Age 5 IEO_SCORE 105 69 54 59 0.5714286 0.46093750 0.52212389
## F1 AUC
## 1 0.11594203 0.6111859
## 2 NaN 0.5897408
## 3 0.38636364 0.6749629
## 4 0.26900585 0.4658897
## 5 0.44239631 0.5560995
## 6 0.27807487 0.5186800
## 7 0.38532110 0.5042821
## 8 0.11538462 0.6035147
## 9 0.04444444 0.4735879
## 10 NaN 0.6335419
## 11 0.09090909 0.4435597
## 12 0.24309392 0.4656902
## 13 0.39393939 0.5427416
## 14 0.16766467 0.4676038
## 15 0.48962656 0.5885464