1 Goals:

  • For this single task: to classify free-text responses as quickly as possible
  • For the Analysis Workstream: to feed the output of the heuristics (classified free-text responses) into supervised classification algorithms as covariates, e.g., multinomial inverse regression.
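As a rough illustration of that second goal, the bucket labels produced by the heuristics can be expanded into indicator covariates for a downstream supervised model; the labels below are made up and this is a minimal base-R sketch, not the workstream's actual pipeline.

#### hypothetical sketch: bucket labels from the heuristics become covariates
#### for a downstream supervised model (e.g., multinomial inverse regression);
#### the labels are placeholders, not actual project data
buckets = factor(c("info provided by professionals",
                   "info that shows the vaccine is effective/can protect people",
                   "info provided by professionals"))
#### model.matrix() expands the factor into indicator columns that any
#### supervised learner can consume as covariates
X = model.matrix(~ buckets)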

2 Expected output from this task:

  • R/Python function that can read in the free text and assign the segment/bucket based on the heuristic rules.
  • Ultimately, we hope to classify as many responses as possible through the heuristic rules with high accuracy.

3 Steps (for each variable):

  • Step 0: Following the steps in the previous comment, manually tag hierarchical buckets for a subset (say, 40%) of free-text responses.

  • Step 1: Create heuristics by eyeballing the responses
    • After 40% of responses have been tagged, we identify the characteristics of free-text responses that fall into a specific “bucket”. These characteristics could be keywords, combinations of keywords, etc.
    • Then we generate a rough logic rule (heuristic) for them, for instance in the form “if the free-text string contains X AND Y OR Z, then bucket = some bucket”.
  • Step 2: Repeat Step 1 for all buckets

  • Step 3: We run the heuristic algorithm on the remaining 60% of free-text responses, then do a quick review of the results (see the sketch below):
    • If any of the 60% of responses cannot be assigned to a bucket, we hand-code them and edit/improve our heuristics
    • If any of the 60% of responses have been misclassified, we correct them manually and edit/improve our heuristics
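A minimal sketch of this review pass, assuming each held-out response already carries the heuristic prediction and a manual label; `holdout` and its columns are placeholder names, not existing objects:

#### hypothetical review pass over the held-out 60%
#### holdout: placeholder data frame with columns
####   response   - raw free text
####   predicted  - bucket assigned by the heuristics ("" if unassigned)
####   hand_coded - bucket assigned manually
unassigned    = holdout[holdout$predicted == "", ]
misclassified = holdout[holdout$predicted != "" &
                        holdout$predicted != holdout$hand_coded, ]
#### hand-code the unassigned responses and revise the keyword rules that
#### produced the misclassifications, then re-run
nrow(unassigned)
nrow(misclassified)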

4 Automation script

#### helper function: sub_level_10
#### level: info provided by professionals
#### arg1: best_treatment_selected (category code; TRUSTINFO, SOMETHINGELSE, NOTHING, OTHER are constants defined elsewhere)
#### arg2: response (free-text string)
sub_level_10 = function(arg1, arg2) {
  token1 = (arg1 == TRUSTINFO) || (arg1 == SOMETHINGELSE) || (arg1 == NOTHING) || (arg1 == OTHER)
  token2 = grepl("doctor", arg2, fixed=TRUE)
  token3 = grepl("nurse", arg2, fixed=TRUE)
  token4 = grepl("expert", arg2, fixed=TRUE) & (grepl("explanation", arg2, fixed=TRUE) || grepl("advice", arg2, fixed=TRUE))
  token5 = grepl("dockter", arg2, fixed=TRUE)
  token6 = grepl("drs", arg2, fixed=TRUE)
  token7 = grepl("health", arg2, fixed=TRUE) & grepl("worker", arg2, fixed=TRUE)
  token8 = grepl("health", arg2, fixed=TRUE) & grepl("desk", arg2, fixed=TRUE)
  token9 = (grepl("health", arg2, fixed=TRUE) || grepl("medical", arg2, fixed=TRUE)) & grepl("expert", arg2, fixed=TRUE)
  token10 = grepl("health industry", arg2, fixed=TRUE) & grepl("work", arg2, fixed=TRUE)
  token11 = grepl("medical training", arg2, fixed=TRUE) & grepl("friend", arg2, fixed=TRUE)
  token12 = grepl("info", arg2, fixed=TRUE) & grepl("hospital", arg2, fixed=TRUE)
  token13 = grepl("scientist", arg2, fixed=TRUE)
  token14 = grepl("medical", arg2, fixed=TRUE) & grepl("practitioner", arg2, fixed=TRUE)
  token15 = grepl("health professional", arg2, fixed=TRUE)
  if (token1 && any(token2, token3, token4, token5, token6, token7, token8,
                    token9, token10, token11, token12, token13, token14, token15)) {
    return("info provided by professionals")
  } else {
    return("")
  }
}
#### helper function: sub_level_17
#### level: info that shows the vaccine is effective/can protect people
#### arg1: best_treatment_selected (category code; TRUSTINFO, FAMILYSUPPORT, NOTHING are constants defined elsewhere)
#### arg2: response (free-text string)
sub_level_17 = function(arg1, arg2) {
  token1 = (arg1 == TRUSTINFO) || (arg1 == FAMILYSUPPORT) || (arg1 == NOTHING)
  token2 = grepl("100", arg2, fixed=TRUE) & (grepl("effect", arg2, fixed=TRUE) || grepl("safe", arg2, fixed=TRUE) || grepl("work", arg2, fixed=TRUE) || grepl("protect", arg2, fixed=TRUE))
  token3 = grepl("avoid", arg2, fixed=TRUE) & grepl("hospital", arg2, fixed=TRUE)
  token4 = grepl("vaccine", arg2, fixed=TRUE) & grepl("sav", arg2, fixed=TRUE) & grepl("live", arg2, fixed=TRUE)
  token5 = grepl("vacc", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE) & grepl("fatal", arg2, fixed=TRUE)
  token6 = grepl("vacc", arg2, fixed=TRUE) & grepl("protected", arg2, fixed=TRUE) & grepl("by", arg2, fixed=TRUE)
  token7 = grepl("immu", arg2, fixed=TRUE) & grepl("system", arg2, fixed=TRUE) & (grepl("boost", arg2, fixed=TRUE) || grepl("protection", arg2, fixed=TRUE))
  token8 = grepl("help", arg2, fixed=TRUE) & grepl("not", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE)
  token9 = grepl("n t", arg2, fixed=TRUE) & grepl("vacc", arg2, fixed=TRUE) & grepl("die", arg2, fixed=TRUE) & grepl("positive", arg2, fixed=TRUE)
  token10 = grepl("wont", arg2, fixed=TRUE) & grepl("kill", arg2, fixed=TRUE) & grepl("you", arg2, fixed=TRUE)
  token11 = grepl("bad", arg2, fixed=TRUE) & grepl("not taking", arg2, fixed=TRUE) & grepl("vaccine", arg2, fixed=TRUE)
  token12 = grepl("n t", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & grepl("sick", arg2, fixed=TRUE) & grepl("vaccin", arg2, fixed=TRUE)
  token13 = grepl("reduce", arg2, fixed=TRUE) & grepl("chance", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & (grepl("covid", arg2, fixed=TRUE) || grepl("virus", arg2, fixed=TRUE))
  token14 = grepl("wont", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE)
  token15 = grepl("explain", arg2, fixed=TRUE) & grepl("people", arg2, fixed=TRUE) & grepl("die", arg2, fixed=TRUE) & (grepl("get", arg2, fixed=TRUE) & grepl("jab", arg2, fixed=TRUE))
  token16 = grepl("impact", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE) & grepl("affected", arg2, fixed=TRUE)
  token17 = grepl("no longer", arg2, fixed=TRUE) & grepl("at risk", arg2, fixed=TRUE) & grepl("vaccinated", arg2, fixed=TRUE)
  token18 = grepl("prevent corona", arg2, fixed=TRUE)
  token19 = grepl("n t", arg2, fixed=TRUE) & grepl("contract", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE)
  token20 = grepl("n t", arg2, fixed=TRUE) & grepl("test", arg2, fixed=TRUE) & grepl("positive", arg2, fixed=TRUE) & grepl("after", arg2, fixed=TRUE) & grepl("vax", arg2, fixed=TRUE)
  token21 = grepl("virus", arg2, fixed=TRUE) & (grepl("against", arg2, fixed=TRUE) || grepl("protect", arg2, fixed=TRUE))
  token22 = grepl("chance", arg2, fixed=TRUE) & grepl("low", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE)
  token23 = grepl("reduce", arg2, fixed=TRUE) & grepl("risk", arg2, fixed=TRUE) & grepl("infect", arg2, fixed=TRUE)
  token24 = grepl("chance", arg2, fixed=TRUE) & grepl("get", arg2, fixed=TRUE) & grepl("infect", arg2, fixed=TRUE) & grepl("less", arg2, fixed=TRUE)
  token25 = grepl("covid", arg2, fixed=TRUE) & grepl("rat", arg2, fixed=TRUE) & grepl("drop", arg2, fixed=TRUE)
  token26 = grepl("virus", arg2, fixed=TRUE) & (grepl("against", arg2, fixed=TRUE) & grepl("protect", arg2, fixed=TRUE))
  token27 = grepl("reduce", arg2, fixed=TRUE) & grepl("spread", arg2, fixed=TRUE) & grepl("corona", arg2, fixed=TRUE)
  token28 = grepl("vacc", arg2, fixed=TRUE) & grepl("help", arg2, fixed=TRUE) & grepl("guarantee", arg2, fixed=TRUE)
  token29 = grepl("never", arg2, fixed=TRUE) & grepl("infected", arg2, fixed=TRUE) & grepl("vaccine", arg2, fixed=TRUE)
  token30 = grepl("dont", arg2, fixed=TRUE) & grepl("sick", arg2, fixed=TRUE) & grepl("take", arg2, fixed=TRUE) & grepl("vacc", arg2, fixed=TRUE)
  token31 = grepl("help", arg2, fixed=TRUE) & grepl("not", arg2, fixed=TRUE) & grepl("court", arg2, fixed=TRUE) & grepl("virus", arg2, fixed=TRUE)
  token32 = grepl("protect", arg2, fixed=TRUE) & grepl("against", arg2, fixed=TRUE) & grepl("covid", arg2, fixed=TRUE)

  if (token1 && any(token2, token3, token4, token5, token6, token7, token8,
                    token9, token10, token11, token12, token13, token14, token15,
                    token16, token17, token18, token19, token20, token21, token22,
                    token23, token24, token25, token26, token27, token28, token29,
                    token30, token31, token32)) {
    return("info that shows the vaccine is effective/can protect people")
  } else {
    return("")
  }
}
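Each helper returns either its bucket label or an empty string, so the single classification function described in Section 2 can be a thin wrapper that runs the rules in turn. A minimal sketch, assuming one helper per bucket and that responses can carry more than one level; classify_response and the helper list are illustrative names, not part of the existing script:

#### hypothetical wrapper: run each sub_level_* rule and collect every
#### non-empty bucket label; returns character(0) if no heuristic fires
classify_response = function(best_treatment_selected, response) {
  helpers = list(sub_level_10, sub_level_17)  # extend with the remaining rules
  labels = character(0)
  for (h in helpers) {
    label = h(best_treatment_selected, response)
    if (label != "") {
      labels = c(labels, label)
    }
  }
  labels
}
#### example call (argument values are made up):
#### classify_response(TRUSTINFO, "my doctor explained the vaccine to me")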

5 Prediction Results

The heuristics above were developed on data from Pilot 7. We then use 50 free-text responses from Pilots 5 and 6 as a quick test.

Metric              Value
Accuracy            80%
Unassigned Rate     12%
Misassigned Rate     8%
  • Accuracy: # responses predicted correctly / # responses
  • Unassigned Rate: # responses missing one or more levels in prediction / # responses
  • Misassigned Rate: # responses assigned one or more wrong levels in prediction / # responses
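A sketch of how these rates could be computed, simplified to a single predicted and hand-coded label per response (the actual buckets are hierarchical); `test`, `predicted`, and `hand_coded` are placeholder names as in the Step 3 sketch:

#### hypothetical metric computation over a test data frame `test` with
#### columns predicted ("" = unassigned) and hand_coded
n = nrow(test)
accuracy    = sum(test$predicted == test$hand_coded) / n
unassigned  = sum(test$predicted == "") / n
misassigned = sum(test$predicted != "" & test$predicted != test$hand_coded) / n
#### under this simplification the three rates sum to 1, matching 80% + 12% + 8%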

5.1 Appendix

Misclassified data:

Testing dataset: