Final Project - Subnet Classifier
Overview
One real-world problem my organization faces is that, because of its large, decentralized structure, it is often difficult to discern boundaries of responsibility, especially when it comes to tasks such as cleaning up an infected computer. Our cybersecurity operations center (CSOC) routinely sees intrusion events or anomalous network behavior that maps clearly to a campus IP address, but it is sometimes difficult to determine who is responsible for investigating the incident.
To that end, the goal of my final project is to build a predictive model that classifies IP networks by university division. The main deliverable is a document classification model that maps networks to university division codes. Since the project will be displayed publicly, network names and addresses have been anonymized or redacted. I ran a SQL query to generate a list of subnets in use on campus, 1,027 of which I manually classified to use as my training set. The approach is summarized below:
- A few large network blocks (e.g. VoIP, university housing, and campus wireless networks) have predefined ranges; as such, I’ll exclude these from the model and give them static mappings.
- Each network is manually assigned a “friendly” name and description by our network team, but these don’t necessarily use the official department name. Abbreviations are common but not standardized (e.g., “Mednet” is sometimes used to indicate the School of Medicine and Public Health; other times the acronym “SMPH” is listed). The division name or code is rarely given; networks are almost always labeled by the department within the division.
- I scraped various websites that correspond to each division, building a per-division corpus against which subnet text can be matched.
- A nearest-neighbor approach was employed using several distance measures: Euclidean, Manhattan, cosine, and Jaccard. A toy sketch of these measures is shown below.
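To make the distance measures concrete, the following sketch computes each of the four on a pair of small, hypothetical term-count vectors (illustrative only; these vectors and term names are not part of the model code):
# Two toy term-count vectors over the same vocabulary (hypothetical values)
a <- c(smph = 3, clinic = 1, housing = 0, wireless = 0)
b <- c(smph = 1, clinic = 0, housing = 2, wireless = 1)
# Euclidean distance: square root of the summed squared differences
sqrt(sum((a - b) ^ 2))
# Manhattan distance: sum of the absolute differences
sum(abs(a - b))
# Cosine distance: 1 minus the cosine similarity of the two vectors
1 - sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))
# Jaccard distance: 1 minus |intersection| / |union| of the terms present in each vector
1 - length(intersect(names(a)[a > 0], names(b)[b > 0])) /
length(union(names(a)[a > 0], names(b)[b > 0]))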
The following is a breakdown of the subnet counts:
total subnets: 27689
total not classified: 673 (2.43%)
total classified: 27016 (97.57% of total)
classified by allocation keyword: 818 (3.03% of classified)
classified by ip block: 20393 (75.48% of classified)
classified by regex: 5805 (21.49% of classified)
count in training set: 1027 (3.8% of classified)
Ingest the data
# Set model parameters
wtManualTokens <- 25 # weight of manual tokens
wtTrainingTokens <- 1 # weight of training tokens
wtCrawledTokens <- 1 # weight of crawled tokens
minTF <- 5 # min term frequency
maxTF <- 200 # max term frequency
maxNgrams <- 2 # max ngram length
# Array of max features per divcode that will be used in model training
maxFeaturesArray = c(10, 25, 50, 100)
# Read divcodes
dfdivcodes <- read.csv("divcodes.csv", header = F)
dfdivcodes <- dfdivcodes %>%
rename(divcode = V1, divname = V2)
# Read json containing all the subnets we're trying to classify, including the following:
# 27,689 total subnets
# 818 classified by allocation keyword (blackhole, reserved, not deployed)
# 20,393 large network blocks (voip, housing, wireless, etc)
# 5,805 classified by regex but not verified for accuracy
# 1,027 count to be used in the training set, verified for accuracy
# 673 unable to be classified by any of the above
rawjson <- fromJSON("subnets-classified.json", simplifyVector = TRUE)
# Extract variable names from json:
# ipstart: starting ip address in numeric format
# ipend: ending ip address in numeric format
# alloc: allocation code:
# ALLOCATED: subnet allocated but not deployed; no learning required
# ASSIGNED: subnet allocated and deployed; learning required
# BLACKHOLE: blackholed ip address; no learning required
# RESERVED: reserved for network equipment use; no learning required
# netname: network name
# descr: network description
# match: match code:
# 0 not matched
# 1 matched by regex (and either validated or not validated, depending on value of "train" field)
# 2 matched to a large network block (voip, housing, wireless, etc)
# 3 matched to an allocation code not requiring learning (ALLOCATED, BLACKHOLE, RESERVED)
# divcode: division code (if matched)
# div: division name (if matched)
# train: use in training set
# 0 don't use
# 1 use
vars <- names(rawjson[[1]])
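# For illustration only, a single record in the json might look like the following
# (hypothetical, anonymized values):
# { "ipstart": "2378502144", "ipend": "2378502399", "alloc": "ASSIGNED",
#   "netname": "smph-clinic-lab", "descr": "smph clinic lab subnet",
#   "match": 1, "divcode": "A53", "div": "School of Medicine and Public Health",
#   "train": 1 }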
# Create data frame
dfjson <- data.frame(matrix(vector(), length(rawjson), length(vars)), stringsAsFactors=F)
colnames(dfjson) <- vars
# Walk each variable, assigning each one to a column of the empty data frame
for (i in 1:(length(vars))) {
# Extract each subnet as a separate element
tmpnet <- sapply(rawjson, "[[", i)
# Add column to data frame
dfjson[i] <- tmpnet
}
# Concatenate netname and descr fields
dfjson <- dfjson %>%
mutate(text = tolower(paste(netname, descr)))
# Convert types
dfjson$ipstart <- as.character(dfjson$ipstart)
dfjson$ipend <- as.character(dfjson$ipend)
dfjson$match <- as.numeric(dfjson$match)
dfjson$train <- as.numeric(dfjson$train)
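# Optional sanity checks on the assembled data frame: overall structure, distribution
# of match codes, and the number of manually verified training rows
str(dfjson)
table(dfjson$match)
sum(dfjson$train == 1)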
Stopwords
# Load standard stop_words and read in custom stop_words from file
data(stop_words)
custom_stopwords <- read.delim("custom_stopwords.txt", sep = '\n', header = F) %>%
rename(word = V1) %>% mutate(lexicon = "custom")
custom_stopwords <- bind_rows(custom_stopwords, stop_words)
# Function to remove stopwords from ngrams
removeStopwords <- function(tokens, stopwords) {
  map2(
    tokens,
    lapply(tokens, function(x) {
      unlist(lapply(strsplit(x, " "), function(y) any(y %in% stopwords)))
    }),
    ~ .x[!.y]
  )
}
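# Illustrative behavior (hypothetical inputs): ngrams containing any stopword collapse
# to character(0), which is dropped later via filter(word != 'character(0)'); e.g.,
# removeStopwords(c("school of medicine", "medicine building"), c("of", "the"))
# would return list(character(0), "medicine building")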
Partitioning
# reserve validation set
dftrain <- dfjson %>%
filter(train == 1)
#dftrain <- dftrain %>%
# filter(divcode != 'net' & divcode != 'ban' & divcode != 'voip' & divcode != 'uwhc' &
# divcode != 'fisma' & divcode != 'wireless' & divcode != 'pci' & divcode != 'blackhole')
set.seed(777)
validation_prop = 0.3
validation_rows <- sample(nrow(dftrain), size = nrow(dftrain) * validation_prop, replace = F)
dftrain <- dftrain %>%
mutate(validation_set = ifelse(row_number() %in% validation_rows, 1, 0))
dfvalidate <- dftrain %>%
filter(validation_set == 1) %>%
mutate(fold = 0)
dftrain <- dftrain %>%
filter(validation_set == 0)
# Create training set partitioning into [numfolds] random groups
numfolds <- 5
set.seed(777)
num_rows <- sample(nrow(dftrain), replace = F)
dftrain <- dftrain[num_rows, ]
dftrain <- dftrain %>%
mutate(fold = ((row_number() - 1) %% numfolds) + 1)
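# Optional sanity check: fold sizes should be roughly equal, and the validation split
# should hold out about 30% of the labeled subnets
table(dftrain$fold)
nrow(dfvalidate) / (nrow(dfvalidate) + nrow(dftrain))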
Corpus
# Load list of web pages to scrape
dfcrawl <- read.csv("crawl-pages.csv", header = F)
dfcrawl <- dfcrawl %>%
rename(divcode = V1, url = V2) %>%
mutate(text = '')
# Function to read a page and dump contents to data frame
crawlPage <- function(df) {
results <- getURL(df$url, .opts=curlOptions(followlocation = TRUE))
#results <- read_html(df$url, encoding = "UTF-8")
#results <- GET(df$url)
dfcrawl <<- dfcrawl %>%
mutate(text = ifelse(url == df$url, results, text))
}
# Crawl pages, but only if we haven't already done so
if (file.exists("crawled-pages.csv")) {
dfcrawl <- read.csv("crawled-pages.csv", header = F)
dfcrawl <- dfcrawl %>%
rename(divcode = V1, url = V2, text = V3)
} else {
crawlPage(dfcrawl)
write.table(dfcrawl, "crawled-pages.csv", sep = ",", qmethod = "double", col.names = FALSE, row.names = FALSE)
}
# Normalize to lower case
dfcrawl <- dfcrawl %>%
mutate(text = tolower(text))
# Tokenize
dfcorpus <- dfcrawl %>%
unnest_tokens(word, text, token = "ngrams", n_min = 1, n = maxNgrams)
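# Illustration (hypothetical row): with n_min = 1 and n = maxNgrams = 2, a text value of
# "smph clinic lab" expands to the unigrams "smph", "clinic", "lab" plus the
# bigrams "smph clinic" and "clinic lab"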
# Set weight
dfcorpus <- dfcorpus %>%
mutate(wt = wtCrawledTokens)
# Remove words with numbers
dfcorpus <- dfcorpus %>%
filter(!grepl('\\d', word))
# Remove tokens consisting only of underscores
dfcorpus <- dfcorpus %>%
filter(!grepl('^_+$', word))
# Remove url field
dfcorpus <- dfcorpus %>%
select(-url)
# Load manual tokens; add flag indicating this is a manual token
dfmanual_tokens <- read.csv("manual-tokens.csv", header = F)
dfmanual_tokens <- dfmanual_tokens %>%
rename(word = V1, divcode = V2) %>%
mutate(wt = wtManualTokens)
#dfmanual_tokens <- dfmanual_tokens %>%
# filter(divcode != 'net' & divcode != 'ban' & divcode != 'voip' & divcode != 'uwhc' &
# divcode != 'fisma' & divcode != 'wireless' & divcode != 'pci' & divcode != 'blackhole')
# Remove single stopwords
dfcorpus <- dfcorpus %>%
anti_join(custom_stopwords, by = c("word"))
# Remove stopwords from ngrams - this takes a long time
dfcorpus$word <- removeStopwords(dfcorpus$word, custom_stopwords$word)
dfcorpus <- dfcorpus %>%
filter(word != 'character(0)')
dfcorpus$word <- as.character(dfcorpus$word)
# Merge manual tokens with corpus
dfcorpus <- dfcorpus %>%
bind_rows(dfmanual_tokens)
# Counts
dfcorpus <- dfcorpus %>%
group_by(divcode, word) %>%
summarize(n = sum(wt)) %>%
ungroup()
# Filter infrequent and too-frequent terms
dfcorpus <- dfcorpus %>%
filter(n >= minTF & n <= maxTF) %>%
arrange(desc(n))
# IDF
dfidf_corpus <- dfcorpus %>%
group_by(word) %>%
summarize(idf = 1 + log(nrow(dfcrawl) / sum(n), base = 10)) %>%
ungroup()
# TFIDF
dfcorpus <- dfcorpus %>%
inner_join(dfidf_corpus, by = c("word")) %>%
mutate(tfidf = idf * n)
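# Worked example (hypothetical numbers): if 120 pages were crawled and a term has a
# weighted count of 12 for one division and appears for no other division (so its
# corpus-wide sum(n) is also 12), then, per the formulas above,
# idf = 1 + log10(120 / 12) = 2
# tfidf = idf * n = 2 * 12 = 24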
# Filter in words with highest tfidf scores (not doing this)
#dfcorpus <- dfcorpus %>%
# filter(tfidf >= 0)
# Filter on only divcodes with more than a certain amount of terms (not doing this)
#dfcorpus_filter <- dfcorpus %>% group_by(divcode) %>%
# summarize(n = n()) %>%
# arrange(n) %>%
# filter(n > 5) %>%
# select(divcode)
Data Evaluation
# Divisions
kbl(tail(dfdivcodes, 20) %>% select(divname), caption =
"<i><font color=#000000><b>Table 1.</b> University divisions.</font></i>") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 12)
Table 1. University divisions.

|    | divname                                       |
|----|-----------------------------------------------|
| 20 | Division of Extension                         |
| 21 | College of Letters & Science (L&S)            |
| 22 | General Library System                        |
| 23 | Wisconsin State Lab of Hygiene (WSLH)         |
| 24 | School of Medicine and Public Health (SMPH)   |
| 25 | School of Nursing                             |
| 26 | School of Pharmacy                            |
| 27 | University Health Services (UHS)              |
| 28 | Officer Education                             |
| 29 | Facilities Planning & Management (FP&M)       |
| 30 | UW Police Department                          |
| 31 | Recreational Sports                           |
| 32 | University Housing                            |
| 33 | School of Veterinary Medicine                 |
| 34 | Wisconsin Veterinary Diagnostic Lab (WVDL)    |
| 35 | Division of Continuing Studies                |
| 36 | Wisconsin Institute for Discovery             |
| 37 | Wisconsin Union                               |
| 38 | Campus networks                               |
| 39 | University of Wisconsin Hospitals and Clinics |
# Most frequent terms
dfcorpus %>%
arrange(desc(n)) %>%
head(10) %>%
ggplot() +
geom_bar(aes(x = reorder(word, n), y = n), stat = "identity") +
ggtitle("Most frequent terms") +
xlab("Term") + ylab("Count") +
coord_flip()
[Figure: Most frequent terms]
# Word cloud
set.seed(777)
wordcloud(words = dfcorpus$word, freq = dfcorpus$n, min.freq = 1,
max.words = 75, random.order = F, rot.per = 0.35,
colors=brewer.pal(8, "Accent"), scale=c(1.5, 1))
[Figure: Word cloud of corpus terms]
# Load divcode table
dfdivcodes <- read.csv("divcodes.csv", header = F) %>%
rename(divcode = V1, divname = V2)
# Terms by division code
dfcorpus %>%
inner_join(dfdivcodes, by = c("divcode")) %>%
group_by(divcode, divname) %>%
summarize(n = sum(n)) %>%
ggplot() +
geom_bar(aes(x = reorder(divname, n), y = n), stat = "identity") +
ggtitle("Terms by division") +
xlab("Division") + ylab("Count") +
coord_flip()
[Figure: Terms by division]
#############################
# Build corpus function
#############################
buildCorpus <- function(holdoutfold, maxFeatures) {
# Tokenize, remove words with numbers, all underscores, and stopwords
dftrain_words <- dftrain %>%
filter(fold != holdoutfold) %>%
unnest_tokens(word, text, token = "ngrams", n_min = 1, n = maxNgrams) %>%
filter(!grepl('\\d', word)) %>%
filter(!grepl('^_+$', word)) %>%
anti_join(custom_stopwords, by = c("word"))
# Remove stopwords from ngrams - this takes a long time
dftrain_words$word <- removeStopwords(dftrain_words$word, custom_stopwords$word)
dftrain_words <- dftrain_words %>%
filter(word != 'character(0)')
dftrain_words$word <- as.character(dftrain_words$word)
# Calculate word counts per divcode
dftrain_words <- dftrain_words %>%
group_by(fold, ipstart, ipend, divcode, word) %>%
summarize(n = n() * wtTrainingTokens) %>%
ungroup()
# IDF
dfidf <- dftrain_words %>%
group_by(word) %>%
summarize(idf = 1 + log(nrow(dftrain) / n(), base = 10)) %>%
ungroup()
# TFIDF
dftrain_words <- dftrain_words %>%
inner_join(dfidf, by = c("word")) %>%
mutate(tfidf = idf * n)
# Create training corpus
dftrain_corpus <<- dfcorpus %>%
group_by(divcode, word) %>%
summarize(tfidf = sum(tfidf)) %>%
ungroup() %>%
select(divcode, word, tfidf)
# Append the training set to the corpus
dftrain_corpus <<- dftrain_corpus %>%
bind_rows(
dftrain_words %>%
distinct(divcode, word, tfidf)
) %>%
group_by(divcode, word) %>%
summarize(tfidf = sum(tfidf)) %>%
ungroup() %>%
select(divcode, word, tfidf)
# Select only the top [maxFeatures] features per divcode
dftrain_corpus <<- dftrain_corpus %>%
group_by(divcode) %>%
slice_max(tfidf, n = maxFeatures) %>%
ungroup()
# Filter out problematic entries
dftrain_corpus <<- dftrain_corpus %>%
filter(word != 'fp' | (word == 'fp' & divcode == 'net'))
# Create document term matrix, replacing NAs and sorting by column name
print(" Creating document term matrix")
dfcorpus_dtm <<- dftrain_corpus %>%
spread(key = word, value = tfidf) %>%
mutate_all(~replace(., is.na(.), 0)) %>%
select("divcode", sort(colnames(.)))
}
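# Example usage (illustrative): build the corpus holding out fold 1 with the top 50
# features per division; dftrain_corpus and dfcorpus_dtm are populated as side
# effects via the <<- assignments above, e.g.:
# buildCorpus(1, 50)
# head(dftrain_corpus)
# dim(dfcorpus_dtm)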
#############################
# Compare documents function
#############################
compareDocs <- function(dfset, holdoutfold, maxFeatures, distMethod, tstart) {
# Build corpus
buildCorpus(holdoutfold, maxFeatures)
# Tokenize, remove words with numbers, all underscores, and stopwords
dftrain_words <- dfset %>%
filter(fold == holdoutfold) %>%
unnest_tokens(word, text, token = "ngrams", n_min = 1, n = maxNgrams) %>%
filter(!grepl('\\d', word)) %>%
filter(!grepl('^_+$', word)) %>%
anti_join(custom_stopwords, by = c("word"))
# Remove stopwords from ngrams - this takes a long time
dftrain_words$word <- removeStopwords(dftrain_words$word, custom_stopwords$word)
dftrain_words <- dftrain_words %>%
filter(word != 'character(0)')
dftrain_words$word <- as.character(dftrain_words$word)
# Calculate word counts per divcode
dftrain_words <- dftrain_words %>%
group_by(fold, ipstart, ipend, divcode, word) %>%
summarize(n = n()) %>%
ungroup() %>%
mutate(tfidf = n) # added
# Iterate over each subnet in the set
print(" Iterating over subnets")
#for (row in 51:100) {
for (row in 1:nrow(dfset)) {
if (row %% 10 == 0) {
print(paste0(" holdoutfold=", holdoutfold, ", test set row=", row))
}
# Filter on just this network
dfnet_unpadded <- dftrain_words %>%
filter(ipstart == dfset[row, 'ipstart'] & ipend == dfset[row, 'ipend'])
#print(paste0(dfset[row, 'ipstart'],",",dfset[row, 'ipend'],",",dfset[row, 'divcode']))
# Fill in missing words in training set from corpus in order to calculate distances
dfnet <- dfnet_unpadded %>%
full_join(dftrain_corpus %>% distinct(word), by = c("word"))
# Set NAs to 0
dfnet$tfidf[is.na(dfnet$tfidf)] <- 0
# Fill in any missing subnet and divcode values
dfnet$ipstart <- dfset[row, 'ipstart']
dfnet$ipend <- dfset[row, 'ipend']
dfnet$divcode <- dfset[row, 'divcode']
# Remove duplicate rows resulting from the full_join
dfnet <- dfnet %>%
select(ipstart, ipend, divcode, word, tfidf) %>%
unique()
# Create document term matrix, sorting by column name
dfnet_dtm <- dfnet %>%
spread(key = word, value = tfidf) %>%
select("ipstart", "ipend", "divcode", sort(colnames(.)))
# Fill in missing columns in the corpus DTM
missing <- c()
for (i in colnames(dfnet_dtm)) {
found = 0
for (j in colnames(dfcorpus_dtm)) {
if (i == j) {
found = 1
}
}
if (found == 0 & i != 'ipstart' & i != 'ipend') {
missing <- c(missing, i)
}
}
# Create matrices to do matrix arithmetic;
# dtmnet_matrix is just one row repeated a bunch of times,
# equal to the number of rows in the dtmcorpus_matrix data frame.
dtmcorpus_matrix <- dfcorpus_dtm[,2:ncol(dfcorpus_dtm)]
dtmnet_matrix <- dfnet_dtm[,4:ncol(dfnet_dtm)]
dtmnet_matrix <- dtmnet_matrix[rep(seq_len(nrow(dtmnet_matrix)), each = nrow(dtmcorpus_matrix)), ]
# Fill in missing values in the corpus DTM
for (i in missing) {
dtmcorpus_matrix[,i] <- 0
}
# Loop through each distance method specified in the function parameters
for (j in distMethod) {
if (j == 'euc') {
# Calculate Euclidean distance
#print(" calculating Euclidean distance")
dfresult_tmp <<- (dtmnet_matrix - dtmcorpus_matrix) ^ 2 %>%
mutate(divcode = dfcorpus_dtm$divcode) %>%
select(divcode, sort(colnames(.)))
dfresult_tmp <<- dfresult_tmp %>%
mutate(method = j, dist = sqrt(rowSums(.[2:ncol(dfresult_tmp)]))) %>%
select(divcode, method, dist)
} else if (j == 'man') {
# Calculate Manhattan distance
#print(" calculating Manhattan distance")
dfresult_tmp <<- abs(dtmnet_matrix - dtmcorpus_matrix) %>%
mutate(divcode = dfcorpus_dtm$divcode) %>%
select(divcode, sort(colnames(.)))
dfresult_tmp <<- dfresult_tmp %>%
mutate(method = j, dist = rowSums(.[2:ncol(dfresult_tmp)])) %>%
select(divcode, method, dist)
} else if (j == 'cos') {
# Calculate cosine distance
#print(" calculating cosine distance")
dfresult_tmp <<- (dtmnet_matrix * dtmcorpus_matrix) %>%
mutate(divcode = dfcorpus_dtm$divcode) %>%
select(divcode, sort(colnames(.)))
dfresult_tmp <<- dfresult_tmp %>%
mutate(dist = rowSums(.[2:ncol(dfresult_tmp)]))
dfresult_tmp <<- dfresult_tmp %>%
mutate(method = j, dist = 1 - (dist / (sqrt(sum(dtmnet_matrix ^ 2)) * sqrt(sum(dtmcorpus_matrix)))))
dfresult_tmp <<- dfresult_tmp %>%
select(divcode, method, dist)
} else if (j == 'jac') {
# Calculate Jaccard distance
#print(" calculating Jaccard distance")
tmp_intersect <<- dftrain_corpus %>%
group_by(divcode, word) %>%
summarize(n = n()) %>%
ungroup() %>%
inner_join(select(dfnet_unpadded, word), by = c("word")) %>%
group_by(divcode) %>%
summarize(n = n()) %>%
ungroup()
tmp_union <- nrow(dftrain_corpus)
dfresult_tmp <<- tmp_intersect %>%
mutate(method = j, dist = 1 - (n / tmp_union)) %>%
select(divcode, method, dist)
}
# Prepare results
dfresult_tmp <<- dfresult_tmp %>%
select(divcode, method, dist) %>%
rename(divcode_predicted = divcode) %>%
mutate(
holdout_fold = holdoutfold,
ipstart = dfset[row, 'ipstart'],
ipend = dfset[row, 'ipend'],
divcode_actual = dfset[row, 'divcode'],
max_features = maxFeatures,
cum_proc_time = (as.numeric(as.POSIXct(Sys.time())) - tstart)
)
# Append to result df
if (exists('dfresult') && is.data.frame(get('dfresult'))) {
dfresult <<- dfresult %>%
bind_rows(dfresult_tmp)
} else {
dfresult <<- dfresult_tmp
}
# Diag
dfresult_tmp <<- dfresult_tmp %>%
mutate(text = dfset[row, 'text'])
} # loop through distance methods
} # Iterate over subnets
}
#############################
# End of document comparison function
#############################
#############################
# Training
#############################
# Skip if the results file already exists
if (file.exists("results.csv") & 1 == 1) {
# Load results
dfresult <- read.csv("results.csv")
dfresult$ipstart <- as.character(dfresult$ipstart)
dfresult$ipend <- as.character(dfresult$ipend)
} else {
if (1 == 1) {
# Create data frame to hold results and seed it to establish data types
dfresult <- data.frame(matrix(vector(), 0, 9), stringsAsFactors=F)
colnames(dfresult) <-
c("holdout_fold", "ipstart", "ipend", "divcode_predicted", "divcode_actual",
"method", "dist", "max_features", "cum_proc_time")
dfresult <- dfresult %>% mutate(
holdout_fold = 0,
ipstart = '',
ipend = '',
divcode_predicted = '',
divcode_actual = '',
method = '',
dist = 0,
max_features = 0,
cum_proc_time = 0
)
}
# Starting unix timestamp for calculating processing time
tstart <- as.numeric(as.POSIXct(Sys.time()))
# Iterate over holdout folds
print("Iterating over holdout folds")
distMethod <- c('euc', 'man', 'cos', 'jac')
#for (holdoutfold in 1) {
for (holdoutfold in 1:numfolds) {
for (maxfeatures in maxFeaturesArray) {
# Train
print(paste0("holdout fold=", holdoutfold, ", max features=", maxfeatures))
dftrain_set <- dftrain %>%
filter(fold == holdoutfold)
compareDocs(dftrain_set, holdoutfold, maxfeatures, distMethod, tstart)
}
} # Iterate over training folds
} # If results file exists
# Save results
resultsFilename = paste0("results_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv")
write.csv(dfresult, resultsFilename, row.names = F)
write.csv(dfresult, "results.csv", row.names = F)
dfresult_training <- dfresult
#dfresult <- dfresult %>%
# mutate(dist = ifelse(divcode_predicted == 'blackhole', 9999, dist))
Model evaluation
# Calculate column counts
dfcounts <- dfresult_training %>%
group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
summarize(n = 1) %>%
ungroup()
dfcounts_grouped <- dfcounts %>%
group_by(holdout_fold, max_features, method) %>%
summarize(n = n())
# Generate evaluation table
dfeval_mindist <- dfresult_training %>%
group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
slice_min(dist, with_ties = T) %>%
ungroup()
dfeval <- dfeval_mindist %>%
inner_join(dfcounts, by = c("holdout_fold", "ipstart", "ipend", "divcode_actual", "max_features", "method")) %>%
mutate(correct = ifelse(divcode_actual == divcode_predicted, 1, 0)) %>%
group_by(holdout_fold, max_features, method) %>%
summarize(ncorrect = sum(correct)) %>%
ungroup()
dfeval <- dfeval %>%
inner_join(dfcounts_grouped, by = c("holdout_fold", "max_features", "method")) %>%
mutate(pct = ncorrect / n) %>%
arrange(desc(pct))
# Plot ROC
dfeval %>%
ggplot() +
geom_point(aes(x = 100 * (1 - pct), y = 100 * pct, color = method, shape = as.character(holdout_fold), stroke = 1)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
facet_wrap(~max_features, labeller = label_both) +
ggtitle("ROC space for training folds") +
xlab("False positive %") + ylab("True positive %") +
xlim(0, 100) + ylim(0, 100) +
labs(shape = "Holdout Fold") + labs(color = "Distance Method")
[Figure: ROC space for training folds]
# Accuracy vs max features
dfeval %>%
ggplot(aes(x = max_features, y = 100 * pct, color = as.character(holdout_fold), stroke = 1)) +
geom_point() +
geom_smooth(method = lm, se = F) +
facet_wrap(~method, labeller = label_both) +
ggtitle("Accuracy vs max features") +
xlab("Max features") + ylab("Accuracy %") +
labs(color = "Holdout fold")
[Figure: Accuracy vs max features]
# Best results by distance method
dftop_dist <- dfeval %>%
group_by(method) %>%
slice_max(pct, with_ties = F) %>%
arrange(desc(pct))
kbl(dftop_dist, caption =
"<i><font color=#000000><b>Table 1.</b> Best results by distance method.</font></i>") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 1. Best results by distance method.

| holdout_fold | max_features | method | ncorrect | n   | pct       |
|--------------|--------------|--------|----------|-----|-----------|
| 3            | 50           | jac    | 140      | 142 | 0.9859155 |
| 4            | 100          | cos    | 91       | 144 | 0.6319444 |
| 1            | 10           | euc    | 0        | 144 | 0.0000000 |
| 1            | 10           | man    | 0        | 144 | 0.0000000 |
# Best results by distance method
dftop_dist_jac <- dfeval %>%
filter(method == 'jac') %>%
group_by(max_features) %>%
slice_max(pct) %>%
arrange(max_features)
kbl(dftop_dist_jac, caption =
"<i><font color=#000000><b>Table 2.</b> Jaccard distance results.</font></i>") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 2. Jaccard distance results.

| holdout_fold | max_features | method | ncorrect | n   | pct       |
|--------------|--------------|--------|----------|-----|-----------|
| 2            | 10           | jac    | 28       | 54  | 0.5185185 |
| 3            | 25           | jac    | 135      | 138 | 0.9782609 |
| 3            | 50           | jac    | 140      | 142 | 0.9859155 |
| 3            | 100          | jac    | 138      | 143 | 0.9650350 |
# Select best model
bestFold <- 3
bestMaxFeatures <- 50
bestDistMethod <- 'jac'
# Most predicted divcodes
dfeval_most_predicted <- dfeval_mindist %>%
filter(holdout_fold == bestFold & max_features == bestMaxFeatures & method == bestDistMethod) %>%
group_by(divcode_predicted) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
inner_join(dfdivcodes, by = c("divcode_predicted" = "divcode")) %>%
select(n, divname)
kbl(head(dfeval_most_predicted, 10), caption =
"<i><font color=#000000><b>Table 3.</b> Most commonly predicted divisions of best model.</font></i>") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 3. Most commonly predicted divisions of best model.

| n   | divname                                                                    |
|-----|----------------------------------------------------------------------------|
| 107 | Campus networks                                                            |
| 11  | Division of Information Technology (DoIT)                                  |
| 9   | Office of the Vice Chancellor for Research and Graduate Education (OVCRGE) |
| 7   | School of Medicine and Public Health (SMPH)                                |
| 5   | General Education Administration                                           |
| 5   | College of Agricultural & Life Sciences (CALS)                             |
| 5   | College of Letters & Science (L&S)                                         |
| 4   | T&L Collaboration                                                          |
| 4   | Division of Extension                                                      |
| 3   | College of Engineering                                                     |
# Most actual divcodes
dfeval_most_actual <- dfeval_mindist %>%
filter(holdout_fold == bestFold & max_features == bestMaxFeatures & method == bestDistMethod) %>%
group_by(divcode_actual) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
inner_join(dfdivcodes, by = c("divcode_actual" = "divcode")) %>%
select(n, divname)
kbl(head(dfeval_most_actual, 10), caption =
"<i><font color=#000000><b>Table 4.</b> Most common actual divisions of best model.</font></i>") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 4. Most common actual divisions of best model.

| n   | divname                                                                    |
|-----|----------------------------------------------------------------------------|
| 105 | Campus networks                                                            |
| 19  | Office of the Vice Chancellor for Research and Graduate Education (OVCRGE) |
| 12  | College of Letters & Science (L&S)                                         |
| 8   | School of Medicine and Public Health (SMPH)                                |
| 7   | Division of Information Technology (DoIT)                                  |
| 7   | College of Agricultural & Life Sciences (CALS)                             |
| 7   | College of Engineering                                                     |
| 5   | University Housing                                                         |
| 4   | Wisconsin State Lab of Hygiene (WSLH)                                      |
| 3   | Division of Extension                                                      |
# Top features of the best model
buildCorpus(bestFold, bestMaxFeatures)
## [1] " Creating document term matrix"
dftrain_corpus <- dftrain_corpus %>%
arrange(desc(tfidf))
kbl(head(dftrain_corpus, 10), caption =
"<i><font color=#000000><b>Table 5.</b> Top features of the best model.</font></i>") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 5. Top features of the best model.

| divcode | word                    | tfidf    |
|---------|-------------------------|----------|
| A53     | smph                    | 134.3875 |
| A42     | uwbadgers.com           | 129.9824 |
| A87     | www.vetmed.wisc.edu     | 126.1346 |
| A71     | facilities.fpm.wisc.edu | 125.3006 |
| A85     | halls                   | 124.4442 |
| A19     | pull                    | 123.8447 |
| A06     | newsbox                 | 123.5651 |
| A06     | newscol                 | 123.5651 |
| A06     | newsheads               | 123.5651 |
| A48     | ls.wisc.edu             | 122.9663 |
# Model speed calculations
dfperf <- dfresult_training %>%
group_by(holdout_fold, max_features) %>%
summarize(min_t = min(cum_proc_time), max_t = max(cum_proc_time)) %>%
ungroup() %>%
mutate(elapsed = max_t - min_t)
# Plot speed
dfperf %>%
ggplot(aes(x = max_features, y = elapsed,
color = as.character(max_features), shape = as.character(holdout_fold))) +
geom_point() +
ggtitle("Model speed evaluation") +
xlab("Maximum features per division") + ylab("Model runtime (seconds)")
[Figure: Model speed evaluation]
Validation set
# Skip if the results file already exists
if (file.exists("results_validation.csv") & 1 == 1) {
# Load results
dfresult <- read.csv("results_validation.csv")
dfresult$ipstart <- as.character(dfresult$ipstart)
dfresult$ipend <- as.character(dfresult$ipend)
} else {
# Create data frame to hold results and seed it to establish data types
dfresult <- data.frame(matrix(vector(), 0, 9), stringsAsFactors=F)
colnames(dfresult) <-
c("holdout_fold", "ipstart", "ipend", "divcode_predicted", "divcode_actual",
"method", "dist", "max_features", "cum_proc_time")
dfresult <- dfresult %>% mutate(
holdout_fold = 0,
ipstart = '',
ipend = '',
divcode_predicted = '',
divcode_actual = '',
method = '',
dist = 0,
max_features = 0,
cum_proc_time = 0
)
# Starting unix timestamp for calculating processing time
tstart <- as.numeric(as.POSIXct(Sys.time()))
# Validation set
print("Processing validation set")
dfvalidate$fold <- bestFold
compareDocs(dfvalidate, bestFold, bestMaxFeatures, bestDistMethod, tstart)
# Save results
resultsFilename = paste0("results_validation_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv")
write.csv(dfresult, resultsFilename, row.names = F)
write.csv(dfresult, "results.csv", row.names = F)
}
dfresult_validation <- dfresult
Evaluate validation set
# Calculate column counts
dfcounts <- dfresult_validation %>%
group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
summarize(n = 1) %>%
ungroup()
dfcounts_grouped <- dfcounts %>%
group_by(holdout_fold, max_features, method) %>%
summarize(n = sum(n))
# Generate evaluation table
dfeval_mindist <- dfresult_validation %>%
group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
slice_min(dist, with_ties = T) %>%
ungroup()
dfeval <- dfeval_mindist %>%
inner_join(dfcounts, by = c("holdout_fold", "ipstart", "ipend", "divcode_actual", "max_features", "method")) %>%
mutate(correct = ifelse(divcode_actual == divcode_predicted, 1, 0)) %>%
group_by(holdout_fold, max_features, method) %>%
summarize(ncorrect = sum(correct)) %>%
ungroup()
dfeval <- dfeval %>%
inner_join(dfcounts_grouped, by = c("holdout_fold", "max_features", "method")) %>%
mutate(pct = ncorrect / n) %>%
arrange(desc(pct))
# Plot ROC
dfeval %>%
ggplot(aes(x = 100 * (1 - pct), y = 100 * pct, color = method, shape = as.character(holdout_fold), stroke = 1.5)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
facet_wrap(~max_features, labeller = label_both) +
ggtitle("ROC space for validtion set") +
xlab("False positive %") + ylab("True positive %") +
xlim(0, 100) + ylim(0, 100) +
labs(shape = "Holdout Fold") + labs(color = "Distance Method") +
geom_text(aes(label = paste0(round(100 * pct, 1), "%")), hjust=-0.25, vjust=0)
[Figure: ROC space for validation set]
Inference for paired data
# Generate table of predicted values from selected model
dfpaired_predicted <- dfresult_validation %>%
filter(holdout_fold == bestFold & method == bestDistMethod & max_features == bestMaxFeatures) %>%
group_by(divcode_predicted) %>%
summarize(n_predicted = n()) %>%
ungroup() %>%
rename(divcode = divcode_predicted) %>%
arrange(divcode)
dfpaired_predicted <- dfpaired_predicted %>%
mutate(pct_predicted = n_predicted / sum(n_predicted))
# Rebuild the corpus using all folds
buildCorpus(0, bestMaxFeatures)
## [1] " Creating document term matrix"
dfpaired_actual <- dftrain_corpus %>%
group_by(divcode) %>%
summarize(n_actual = n()) %>%
ungroup() %>%
arrange(divcode)
dfpaired_actual <- dfpaired_actual %>%
mutate(pct_actual = n_actual / sum(n_actual))
# Join tables, filling in NAs
dfpaired <- dfpaired_actual %>%
full_join(dfpaired_predicted, by = c("divcode")) %>%
mutate_all(~replace(., is.na(.), 0)) %>%
mutate(diff = abs(pct_actual - pct_predicted))
head(dfpaired, 10)
# Inference for paired data
n_diff <- nrow(dfpaired)
x_diff <- mean(dfpaired$diff)
x_diff_null <- 0
s_diff <- sd(dfpaired$diff)
# Hypothesis test
# H_0: The mean difference between the actual and predicted division proportions is zero
# H_A: The mean difference between the actual and predicted division proportions is greater than zero
SE <- s_diff / sqrt(n_diff)
T <- (x_diff - x_diff_null) / SE
df <- n_diff - 1
p_val <- pt(T, df = df, lower.tail = F)
paste0("p-value = ", p_val)
## [1] "p-value = 0.000806161549978326"
# Results
# Plot t-distribution
ggplot(data = data.frame(x = c(-4, 4)), mapping = aes(x = x)) +
stat_function(fun = dt, n = 100, args = list(df = 38)) +
stat_function(fun = dt, n = 100, args = list(df = 38),
xlim = c(T, 4), geom = "area", fill = "#880000", alpha = .8)
[Figure: t-distribution (df = 38) with the area above the observed T statistic shaded]