Final Project - Subnet Classifier

Overview

One real-world problem my organization faces is that, because of its large, decentralized structure, it is often difficult to discern boundaries of responsibility, especially for tasks such as cleaning up an infected computer. Our cybersecurity operations center (CSOC) routinely sees intrusion events or anomalous network behavior that maps clearly to a campus IP address, but it is sometimes difficult to determine who is responsible for investigating the incident.

To that end, the goal of my final project is to employ a predictive model to classify IP networks by university division. The main deliverable is a document classification model that maps networks to university division codes. Since the project will be displayed publicly, network names and addresses have been anonymized or redacted. I ran a SQL query to generate a list of subnets in use on campus, 1027 of which I manually classified to use as my training set. My approach for the remaining networks is as follows:

  1. A few large network blocks (e.g., VoIP, university housing, and campus wireless networks) have predefined ranges; as such, I’ll exclude these from the model and give them static mappings.
  2. Each network is manually assigned a “friendly” name and description by our network team, but these don’t necessarily use the official department name. Abbreviations are common but not standardized (e.g., “Mednet” is sometimes used to indicate the School of Medicine and Public Health; other times the acronym “SMPH” is listed). The division name or code is rarely given; networks are almost always labeled by a department within the division.
  3. I scraped various websites that correspond to each division, building a corpus for each that I’ll try to map to a particular division.
  4. A nearest-neighbor approach is employed using four distance measures: Euclidean, Manhattan, cosine, and Jaccard (a minimal sketch of each appears below).
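
As a minimal sketch of these four measures (toy TF-IDF vectors for illustration only, not the model's actual document-term matrices):

# Two toy TF-IDF vectors representing a subnet description and a division corpus
a <- c(smph = 2.1, clinic = 0.0, housing = 1.3)
b <- c(smph = 1.7, clinic = 0.9, housing = 0.0)

sqrt(sum((a - b) ^ 2))        # Euclidean distance
sum(abs(a - b))               # Manhattan distance
1 - sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))    # cosine distance
1 - length(intersect(names(a[a > 0]), names(b[b > 0]))) /
    length(union(names(a[a > 0]), names(b[b > 0])))       # Jaccard distance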

The following is a breakdown of the subnet counts:

total subnets:                    27689
total not classified:               673  (2.43%)
total classified:                 27016  (97.57% of total)
classified by allocation keyword:   818  (3.03% of classified)
classified by ip block:           20393  (75.48% of classified)
classified by regex:               5805  (21.49% of classified)
count in training set:             1027  (3.8% of classified)

Ingest the data

# Set model parameters
wtManualTokens <- 25     # weight of manual tokens
wtTrainingTokens <- 1    # weight of training tokens
wtCrawledTokens <- 1     # weight of crawled tokens
minTF <- 5               # min term frequency
maxTF <- 200             # max term frequency
maxNgrams <- 2           # max ngram length

# Array of max features per divcode that will be used in model training
maxFeaturesArray = c(10, 25, 50, 100)

# Read divcodes
dfdivcodes <- read.csv("divcodes.csv", header = F)
dfdivcodes <- dfdivcodes %>%
    rename(divcode = V1, divname = V2)

# Read json containing all the subnets we're trying to classify, including the following:
# 27,689 total subnets
#    818 classified by allocation keyword (blackhole, reserved, not deployed)
# 20,393 large network blocks (voip, housing, wireless, etc)
#  5,805 classified by regex but not verified for accuracy
#  1,027 count to be used in the training set, verified for accuracy
#    673 unable to be classified by any of the above
rawjson <- fromJSON("subnets-classified.json", simplifyVector = TRUE)

# Extract variable names from json:
# ipstart:      starting ip address in numeric format
# ipend:        ending ip address in numeric format
# alloc:        allocation code:
#                   ALLOCATED       subnet allocated but not deployed       no learning required
#                   ASSIGNED        subnet allocated and deployed           learning required
#                   BLACKHOLE       blackholed ip address                   no learning required
#                   RESERVED        reserved for network equipment use      no learning required
# netname:      network name
# descr:        network description
# match:        match code:
#                   0               not matched
#                   1               matched by regex (and either validated or not validated, depending on value of "train" field)
#                   2               matched to a large network block (voip, housing, wireless, etc)
#                   3               matched to an allocation code not requiring learning (ALLOCATED, BLACKHOLE, RESERVED)
# divcode       division code (if matched)
# div           division name (if matched)
# train         use in training set
#                   0               don't use
#                   1               use
vars <- names(rawjson[[1]])

# Create data frame
dfjson <- data.frame(matrix(vector(), length(rawjson), length(vars)), stringsAsFactors=F)
colnames(dfjson) <- vars

# Walk each variable, assigning each one to a column of the empty data frame
for (i in 1:(length(vars))) {

    # Extract each subnet as a separate element
    tmpnet <- sapply(rawjson, "[[", i)

    # Add column to data frame
    dfjson[i] <- tmpnet
}
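
# Note: assuming each element of rawjson is a flat named list, the same frame
# could be built in one step (hypothetical alternative, not used below):
# dfjson <- bind_rows(lapply(rawjson, as.data.frame, stringsAsFactors = FALSE))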

# Concatenate netname and descr fields
dfjson <- dfjson %>%
    mutate(text = tolower(paste(netname, descr)))

# Convert types
dfjson$ipstart <- as.character(dfjson$ipstart)
dfjson$ipend <- as.character(dfjson$ipend)
dfjson$match <- as.numeric(dfjson$match)
dfjson$train <- as.numeric(dfjson$train)

Stopwords

# Load standard stop_words and read in custom stop_words from file
data(stop_words)
custom_stopwords <- read.delim("custom_stopwords.txt", sep = '\n', header = F) %>%
    rename(word = V1) %>% mutate(lexicon = "custom")
custom_stopwords <- bind_rows(custom_stopwords, stop_words)

# Function to remove stopwords from ngrams
removeStopwords <- function(tokens, stopwords) {
    map2(tokens, lapply(tokens, function(x) {
        unlist(
            lapply(
                strsplit(x, " "), function(y) {
                    any(y %in% stopwords) 
                }
            )
        ) 
       }
    ), ~ .x[!.y])
}
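
# Quick sanity check on hypothetical tokens: an ngram containing a stopword
# collapses to character(0), which is filtered out downstream
removeStopwords(c("smph network", "the subnet"), c("the"))
# returns: list("smph network", character(0))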

Partitioning

# reserve validation set
dftrain <- dfjson %>%
    filter(train == 1)
#dftrain <- dftrain %>%
#    filter(divcode != 'net' & divcode != 'ban' & divcode != 'voip' & divcode != 'uwhc' & 
#            divcode != 'fisma'  & divcode != 'wireless' & divcode != 'pci' & divcode != 'blackhole')
set.seed(777)
validation_prop = 0.3
validation_rows <- sample(nrow(dftrain), size = nrow(dftrain) * validation_prop, replace = F)
dftrain <- dftrain %>%
    mutate(validation_set = ifelse(row_number() %in% validation_rows, 1, 0))
dfvalidate <- dftrain %>%
    filter(validation_set == 1) %>%
    mutate(fold = 0)
dftrain <- dftrain %>%
    filter(validation_set == 0)

# Create training set partitioning into [numfolds] random groups
numfolds <- 5
set.seed(777)
num_rows <- sample(nrow(dftrain), replace = F)
dftrain <- dftrain[num_rows, ]
dftrain <- dftrain %>%
    mutate(fold = ((row_number() - 1) %% numfolds) + 1)
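
# Sanity check: with the modular assignment, fold sizes differ by at most one
table(dftrain$fold)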

Corpus

# Load list of web pages to scrape
dfcrawl <- read.csv("crawl-pages.csv", header = F)
dfcrawl <- dfcrawl %>%
    rename(divcode = V1, url = V2) %>%
    mutate(text = '')

# Function to read a page and dump contents to data frame
crawlPage <- function(df) {
    results <- getURL(df$url, .opts=curlOptions(followlocation = TRUE))
    #results <- read_html(df$url, encoding = "UTF-8")
    #results <- GET(df$url)
    dfcrawl <<- dfcrawl %>%
        mutate(text = ifelse(url == df$url, results, text))
}

# Crawl pages, but only if we haven't already done so
if (file.exists("crawled-pages.csv")) {
    dfcrawl <- read.csv("crawled-pages.csv", header = F)
    dfcrawl <- dfcrawl %>%
        rename(divcode = V1, url = V2, text = V3)
} else {
    crawlPage(dfcrawl)
    # write.table, since write.csv ignores col.names and the file is read back headerless
    write.table(dfcrawl, "crawled-pages.csv", sep = ",", col.names = F, row.names = F)
}
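
# A more defensive variant (sketch only, not used above): fetch each URL
# separately with tryCatch so one failed page doesn't abort the whole crawl
crawlPageSafe <- function(df) {
    for (i in seq_len(nrow(df))) {
        df$text[i] <- tryCatch(
            getURL(df$url[i], .opts = curlOptions(followlocation = TRUE)),
            error = function(e) '')
    }
    df
}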

# Normalize to lower case
dfcrawl <- dfcrawl %>%
    mutate(text = tolower(text))

# Tokenize
dfcorpus <- dfcrawl %>%
    unnest_tokens(word, text, token = "ngrams", n_min = 1, n = maxNgrams)

# Set weight
dfcorpus <- dfcorpus %>%
    mutate(wt = wtCrawledTokens)

# Remove words with numbers
dfcorpus <- dfcorpus %>% 
    filter(!grepl('\\d', word))

# Remove tokens consisting only of underscores
dfcorpus <- dfcorpus %>%
    filter(!grepl('^_+$', word))

# Remove url field
dfcorpus <- dfcorpus %>%
    select(-url)

# Load manual tokens; add flag indicating this is a manual token
dfmanual_tokens <- read.csv("manual-tokens.csv", header = F)
dfmanual_tokens <- dfmanual_tokens %>%
    rename(word = V1, divcode = V2) %>%
    mutate(wt = wtManualTokens)
#dfmanual_tokens <- dfmanual_tokens %>%
#    filter(divcode != 'net' & divcode != 'ban' & divcode != 'voip' & divcode != 'uwhc' & 
#            divcode != 'fisma'  & divcode != 'wireless' & divcode != 'pci' & divcode != 'blackhole')

# Remove single stopwords
dfcorpus <- dfcorpus %>%
    anti_join(custom_stopwords, by = c("word"))

# Remove stopwords from ngrams - this takes a long time
dfcorpus$word <- removeStopwords(dfcorpus$word, custom_stopwords$word)
dfcorpus <- dfcorpus %>%
    filter(word != 'character(0)')   # drop ngrams emptied entirely by stopword removal
dfcorpus$word <- as.character(dfcorpus$word)

# Merge manual tokens with corpus
dfcorpus <- dfcorpus %>%
    bind_rows(dfmanual_tokens)

# Counts
dfcorpus <- dfcorpus %>%
    group_by(divcode, word) %>%
    summarize(n = sum(wt)) %>%
    ungroup()

# Filter infrequent and too-frequent terms
dfcorpus <- dfcorpus %>%
    filter(n >= minTF & n <= maxTF) %>%
    arrange(desc(n))

# IDF
dfidf_corpus <- dfcorpus %>%
    group_by(word) %>%
    summarize(idf = 1 + log(nrow(dfcrawl) / sum(n), base = 10)) %>%
    ungroup()

# TFIDF
dfcorpus <- dfcorpus %>%
    inner_join(dfidf_corpus, by = c("word")) %>%
    mutate(tfidf = idf * n)
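
# Worked example of the weighting (illustrative numbers): with 40 crawled
# pages, a term whose counts sum to 5 gets idf = 1 + log10(40 / 5) ~= 1.9,
# so a division-level count of n = 12 yields tfidf ~= 22.9
1 + log(40 / 5, base = 10)    # 1.90309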

# Filter in words with highest tfidf scores (not doing this)
#dfcorpus <- dfcorpus %>%
#    filter(tfidf >= 0)

# Filter on only divcodes with more than a certain amount of terms (not doing this)
#dfcorpus_filter <- dfcorpus %>% group_by(divcode) %>%
#    summarize(n = n()) %>%
#    arrange(n) %>% 
#    filter(n > 5) %>%
#    select(divcode)

Data Evaluation

# Divisions
kbl(tail(dfdivcodes, 20) %>% select(divname), caption = 
    "<i><font color=#000000><b>Table 1.</b> University divisions.</font></i>") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 12)
Table 1. University divisions.
divname
20 Division of Extension
21 College of Letters & Science (L&S)
22 General Library System
23 Wisconsin State Lab of Hygiene (WSLH)
24 School of Medicine and Public Health (SMPH)
25 School of Nursing
26 School of Pharmacy
27 University Health Services (UHS)
28 Officer Education
29 Facilities Planning & Management (FP&M)
30 UW Police Department
31 Recreational Sports
32 University Housing
33 School of Veterinary Medicine
34 Wisconsin Veterinary Diagnostic Lab (WVDL)
35 Division of Continuing Studies
36 Wisconsin Institute for Discovery
37 Wisconsin Union
38 Campus networks
39 University of Wisconsin Hospitals and Clinics
# Most frequent terms
dfcorpus %>%
    arrange(desc(n)) %>%
    head(10) %>%
    ggplot() +
    geom_bar(aes(x = reorder(word, n), y = n), stat = "identity") +
    ggtitle("Most frequent terms") +
    xlab("Term") + ylab("Count") +
    coord_flip()

# Word cloud
set.seed(777)
wordcloud(words = dfcorpus$word, freq = dfcorpus$n, min.freq = 1,
    max.words = 75, random.order = F, rot.per = 0.35,
    colors=brewer.pal(8, "Accent"), scale=c(1.5, 1))

# Load divcode table
dfdivcodes <- read.csv("divcodes.csv", header = F) %>%
    rename(divcode = V1, divname = V2)

# Terms by division code
dfcorpus %>%
    inner_join(dfdivcodes, by = c("divcode")) %>%
    group_by(divcode, divname) %>%
    summarize(n = sum(n)) %>%
    ggplot() +
    geom_bar(aes(x = reorder(divname, n), y = n), stat = "identity") +
    ggtitle("Terms by division") +
    xlab("Division") + ylab("Count") +
    coord_flip()

#############################
# Build corpus function
#############################

buildCorpus <- function(holdoutfold, maxFeatures) {

    # Tokenize, remove words with numbers, all underscores, and stopwords
    dftrain_words <- dftrain %>%
        filter(fold != holdoutfold) %>%
        unnest_tokens(word, text, token = "ngrams", n_min = 1, n = maxNgrams) %>%
        filter(!grepl('\\d', word)) %>%
        filter(!grepl('^_+$', word)) %>%
        anti_join(custom_stopwords, by = c("word"))

    # Remove stopwords from ngrams - this takes a long time
    dftrain_words$word <- removeStopwords(dftrain_words$word, custom_stopwords$word)
    dftrain_words <- dftrain_words %>%
        filter(word != 'character(0)')
    dftrain_words$word <- as.character(dftrain_words$word)

    # Calculate word counts per divcode
    dftrain_words <- dftrain_words %>%
        group_by(fold, ipstart, ipend, divcode, word) %>%
        summarize(n = n() * wtTrainingTokens) %>%
        ungroup()

    # IDF
    dfidf <- dftrain_words %>%
        group_by(word) %>%
        summarize(idf = 1 + log(nrow(dftrain) / n(), base = 10)) %>%
        ungroup()
    
    # TFIDF
    dftrain_words <- dftrain_words %>%
        inner_join(dfidf, by = c("word")) %>%
        mutate(tfidf = idf * n)

    # Create training corpus
    dftrain_corpus <<- dfcorpus %>%
        group_by(divcode, word) %>%
        summarize(tfidf = sum(tfidf)) %>%
        ungroup() %>%
        select(divcode, word, tfidf)

    # Append the training set to the corpus
    dftrain_corpus <<- dftrain_corpus %>%
        bind_rows(
            dftrain_words %>% 
                distinct(divcode, word, tfidf)
        ) %>%
        group_by(divcode, word) %>%
        summarize(tfidf = sum(tfidf)) %>%
        ungroup() %>%
        select(divcode, word, tfidf)

    # Select only the top [maxFeatures] features per divcode
    dftrain_corpus <<- dftrain_corpus %>%
        group_by(divcode) %>%
        slice_max(tfidf, n = maxFeatures) %>%
        ungroup()
    
    # Filter out problematic entries
    dftrain_corpus <<- dftrain_corpus %>%
        filter(word != 'fp' | (word == 'fp' & divcode == 'net'))

    # Create document term matrix, replacing NAs and sorting by column name
    print("    Creating document term matrix")
    dfcorpus_dtm <<- dftrain_corpus %>% 
        spread(key = word, value = tfidf) %>%
        mutate_all(~replace(., is.na(.), 0)) %>% 
        select("divcode", sort(colnames(.)))

}
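
# Usage sketch (not executed here; the function populates dftrain_corpus and
# dfcorpus_dtm as side effects via <<-):
# buildCorpus(holdoutfold = 1, maxFeatures = 50)
# dim(dfcorpus_dtm)    # one row per divcode, one column per term (plus divcode)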

#############################
# Compare documents function
#############################

compareDocs <- function(dfset, holdoutfold, maxFeatures, distMethod, tstart) {
    
    # Build corpus
    buildCorpus(holdoutfold, maxFeatures)

    # Tokenize, remove words with numbers, all underscores, and stopwords
    dftrain_words <- dfset %>%
        filter(fold == holdoutfold) %>%
        unnest_tokens(word, text, token = "ngrams", n_min = 1, n = maxNgrams) %>%
        filter(!grepl('\\d', word)) %>%
        filter(!grepl('^_+$', word)) %>%
        anti_join(custom_stopwords, by = c("word"))

    # Remove stopwords from ngrams - this takes a long time
    dftrain_words$word <- removeStopwords(dftrain_words$word, custom_stopwords$word)
    dftrain_words <- dftrain_words %>%
        filter(word != 'character(0)')
    dftrain_words$word <- as.character(dftrain_words$word)
       
    # Calculate word counts per divcode
    dftrain_words <- dftrain_words %>%
        group_by(fold, ipstart, ipend, divcode, word) %>%
        summarize(n = n()) %>%
        ungroup() %>%
        mutate(tfidf = n)   # use raw counts as the tfidf for the document being classified

    # Iterate over each subnet in the set
    print("    Iterating over subnets")
    #for (row in 51:100) {
    for (row in 1:nrow(dfset)) {

        if (row %% 10 == 0) {
            print(paste0("        holdoutfold=", holdoutfold, ", test set row=", row))
        }

        # Filter on just this network
        dfnet_unpadded <- dftrain_words %>% 
            filter(ipstart == dfset[row, 'ipstart'] & ipend == dfset[row, 'ipend'])
        #print(paste0(dfset[row, 'ipstart'],",",dfset[row, 'ipend'],",",dfset[row, 'divcode']))

        # Fill in missing words in training set from corpus in order to calculate distances
        dfnet <- dfnet_unpadded %>%
            full_join(dftrain_corpus %>% distinct(word), by = c("word"))

        # Set NAs to 0
        dfnet$tfidf[is.na(dfnet$tfidf)] <- 0

        # Fill in any missing subnet and divcode values
        dfnet$ipstart <- dfset[row, 'ipstart']
        dfnet$ipend <- dfset[row, 'ipend']
        dfnet$divcode <- dfset[row, 'divcode']

        # Remove duplicate rows resulting from the full_join
        dfnet <- dfnet %>% 
            select(ipstart, ipend, divcode, word, tfidf) %>%
            unique()

        # Create document term matrix, sorting by column name
        dfnet_dtm <- dfnet %>% 
            spread(key = word, value = tfidf) %>%
            select("ipstart", "ipend", "divcode", sort(colnames(.)))
        
        # Find columns present in the subnet DTM but missing from the corpus DTM
        missing <- setdiff(colnames(dfnet_dtm), colnames(dfcorpus_dtm))
        missing <- missing[!missing %in% c('ipstart', 'ipend')]

        # Create matrices to do matrix arithmetic;
        # dtmnet_matrix is just one row repeated a bunch of times,
        # equal to the number of rows in the dtmcorpus_matrix data frame.
        dtmcorpus_matrix <- dfcorpus_dtm[,2:ncol(dfcorpus_dtm)]
        dtmnet_matrix <- dfnet_dtm[,4:ncol(dfnet_dtm)]
        dtmnet_matrix <- dtmnet_matrix[rep(seq_len(nrow(dtmnet_matrix)), each = nrow(dtmcorpus_matrix)), ]
        
        # Fill in missing values in the corpus DTM
        for (i in missing) {
            dtmcorpus_matrix[,i] <- 0
        }

        # Re-sort both matrices' columns so they align positionally
        # (data frame arithmetic matches by position, not by name)
        dtmcorpus_matrix <- dtmcorpus_matrix[, sort(colnames(dtmcorpus_matrix))]
        dtmnet_matrix <- dtmnet_matrix[, sort(colnames(dtmnet_matrix))]

        # Loop through each distance method specified in the function parameters
        for (j in distMethod) {

            if (j == 'euc') {
            
                # Calculate Euclidian distance
                #print("               calculating Euclidian distance")
                dfresult_tmp <<- (dtmnet_matrix - dtmcorpus_matrix) ^ 2 %>%
                    mutate(divcode = dfcorpus_dtm$divcode) %>%
                    select(divcode, sort(colnames(.)))
                dfresult_tmp <<- dfresult_tmp %>%
                    mutate(method = j, dist = sqrt(rowSums(.[2:ncol(dfresult_tmp)]))) %>%
                    select(divcode, method, dist)
    
            } else if (j == 'man') {
    
                # Calculate Manhattan distance
                #print("               calculating Manhattan distance")
                dfresult_tmp <<- abs(dtmnet_matrix - dtmcorpus_matrix) %>%
                    mutate(divcode = dfcorpus_dtm$divcode) %>%
                    select(divcode, sort(colnames(.)))
                dfresult_tmp <<- dfresult_tmp %>%
                    mutate(method = j, dist = rowSums(.[2:ncol(dfresult_tmp)])) %>%
                    select(divcode, method, dist)
    
            } else if (j == 'cos') {
                
                # Calculate cosine distance
                #print("               calculating cosine distance")
                dfresult_tmp <<- (dtmnet_matrix * dtmcorpus_matrix) %>%
                    mutate(divcode = dfcorpus_dtm$divcode) %>%
                    select(divcode, sort(colnames(.)))
                dfresult_tmp <<- dfresult_tmp %>%
                    mutate(dist = rowSums(.[2:ncol(dfresult_tmp)]))
                dfresult_tmp <<- dfresult_tmp %>%
                    # normalize the dot product by the L2 norms of both vectors
                    mutate(method = j, dist = 1 - (dist /
                        (sqrt(rowSums(dtmnet_matrix ^ 2)) * sqrt(rowSums(dtmcorpus_matrix ^ 2)))))
                dfresult_tmp <<- dfresult_tmp %>%
                    select(divcode, method, dist)
    
            } else if (j == 'jac') {
                        
                # Calculate Jaccard distance
                #print("               calculating Jaccard distance")
                tmp_intersect <<- dftrain_corpus %>%
                    group_by(divcode, word) %>%
                    summarize(n = n()) %>%
                    ungroup() %>%
                    inner_join(select(dfnet_unpadded, word), by = c("word")) %>%
                    group_by(divcode) %>%
                    summarize(n = n()) %>%
                    ungroup()
                # Approximate the union term with the total corpus vocabulary size
                tmp_union <- nrow(dftrain_corpus)
                dfresult_tmp <<- tmp_intersect %>%
                    mutate(method = j, dist = 1 - (n / tmp_union)) %>%
                    select(divcode, method, dist)
            }

            # Prepare results
            dfresult_tmp <<- dfresult_tmp %>%
                select(divcode, method, dist) %>%
                rename(divcode_predicted = divcode) %>%
                mutate(
                    holdout_fold = holdoutfold, 
                    ipstart = dfset[row, 'ipstart'],
                    ipend = dfset[row, 'ipend'],
                    divcode_actual = dfset[row, 'divcode'],
                    max_features = maxFeatures,
                    cum_proc_time = (as.numeric(as.POSIXct(Sys.time())) - tstart)
                )

            # Append to result df
            if (exists('dfresult') && is.data.frame(get('dfresult'))) {
                dfresult <<- dfresult %>%
                    bind_rows(dfresult_tmp)
            } else {
                dfresult <<- dfresult_tmp
            }
        
            # Diag
            dfresult_tmp <<- dfresult_tmp %>%
                mutate(text = dfset[row, 'text'])

        }   # loop through distance methods

    }   # Iterate over subnets
    
}

#############################
# End of document comparison function
#############################
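
# Usage sketch (not executed here): score fold 1 against a corpus built from
# the remaining folds, Jaccard only; rows are appended to the global dfresult
# tstart <- as.numeric(as.POSIXct(Sys.time()))
# compareDocs(dftrain %>% filter(fold == 1), 1, 50, c('jac'), tstart)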


#############################
# Training
#############################

# Skip if the results file already exists
if (file.exists("results.csv") & 1 == 1) {

    # Load results
    dfresult <- read.csv("results.csv")
    dfresult$ipstart <- as.character(dfresult$ipstart)
    dfresult$ipend <- as.character(dfresult$ipend)

} else {

    # Create data frame to hold results and seed it to establish data types
    dfresult <- data.frame(matrix(vector(), 0, 9), stringsAsFactors=F)
    colnames(dfresult) <- 
        c("holdout_fold", "ipstart", "ipend", "divcode_predicted", "divcode_actual", 
            "method", "dist", "max_features", "cum_proc_time")
    dfresult <- dfresult %>% mutate(
        holdout_fold = 0, 
        ipstart = '',
        ipend = '',
        divcode_predicted = '',
        divcode_actual = '',
        method = '',
        dist = 0,
        max_features = 0,
        cum_proc_time = 0
    )
    
    # Starting unix timestamp for calculating processing time
    tstart <- as.numeric(as.POSIXct(Sys.time()))
    
    # Iterate over holdout folds
    print("Iterating over holdout folds")
    distMethod <- c('euc', 'man', 'cos', 'jac')
    #for (holdoutfold in 1) {
    for (holdoutfold in 1:numfolds) {
    
        for (maxfeatures in maxFeaturesArray) {
            
            # Train
            print(paste0("holdout fold=", holdoutfold, ", max features=", maxfeatures))
            dftrain_set <- dftrain %>%
                filter(fold == holdoutfold)
            compareDocs(dftrain_set, holdoutfold, maxfeatures, distMethod, tstart)
        }

    }   # Iterate over training folds

}   # If results file exists

# Save results
resultsFilename = paste0("results_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv")
write.csv(dfresult, resultsFilename, row.names = F)
write.csv(dfresult, "results.csv", row.names = F)
dfresult_training <- dfresult

#dfresult <- dfresult %>%
#    mutate(dist = ifelse(divcode_predicted == 'blackhole', 9999, dist))

Model evaluation

# Count scored subnets per holdout fold / max features / method
dfcounts <- dfresult_training %>%
    group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
    summarize(n = 1) %>%
    ungroup()
dfcounts_grouped <- dfcounts %>%
    group_by(holdout_fold, max_features, method) %>%
    summarize(n = n())

# Generate evaluation table
dfeval_mindist <- dfresult_training %>%
    group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
    slice_min(dist, with_ties = T) %>%
    ungroup()
dfeval <- dfeval_mindist %>%
    inner_join(dfcounts, by = c("holdout_fold", "ipstart", "ipend", "divcode_actual", "max_features", "method")) %>% 
    mutate(correct = ifelse(divcode_actual == divcode_predicted, 1, 0)) %>%
    group_by(holdout_fold, max_features, method) %>%
    summarize(ncorrect = sum(correct)) %>%
    ungroup()
dfeval <- dfeval %>%
    inner_join(dfcounts_grouped, by = c("holdout_fold", "max_features", "method")) %>%
    mutate(pct = ncorrect / n) %>%
    arrange(desc(pct))
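
# Example of the accuracy arithmetic: a fold where 140 of 142 subnets are
# classified correctly yields the best Jaccard result reported below
140 / 142    # 0.9859155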

# Plot ROC
dfeval %>%
    ggplot() +
    geom_point(aes(x = 100 * (1 - pct), y = 100 * pct, color = method, shape = as.character(holdout_fold), stroke = 1)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    facet_wrap(~max_features, labeller = label_both) +
    ggtitle("ROC space for training folds") +
    xlab("False positive %") + ylab("True positive %") +
    xlim(0, 100) + ylim(0, 100) + 
    labs(shape = "Holdout Fold") + labs(color = "Distance Method")

# Accuracy vs max features
dfeval %>%
    ggplot(aes(x = max_features, y = 100 * pct, color = as.character(holdout_fold), stroke = 1)) +
    geom_point() +
    geom_smooth(method = lm, se = F) +
    facet_wrap(~method, labeller = label_both) +
    ggtitle("Accuracy vs max features") +
    xlab("Max features") + ylab("Accuracy %") +
    labs(color = "Holdout fold")

# Best results by distance method
dftop_dist <- dfeval %>%
    group_by(method) %>%
    slice_max(pct, with_ties = F) %>%
    arrange(desc(pct))
kbl(dftop_dist, caption = 
  "<i><font color=#000000><b>Table 2.</b> Best results by distance method.</font></i>") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 2. Best results by distance method.
holdout_fold max_features method ncorrect n pct
3 50 jac 140 142 0.9859155
4 100 cos 91 144 0.6319444
1 10 euc 0 144 0.0000000
1 10 man 0 144 0.0000000
# Jaccard results by max features
dftop_dist_jac <- dfeval %>%
    filter(method == 'jac') %>%
    group_by(max_features) %>%
    slice_max(pct) %>%
    arrange(max_features)
kbl(dftop_dist_jac, caption = 
  "<i><font color=#000000><b>Table 3.</b> Jaccard distance results.</font></i>") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 3. Jaccard distance results.
holdout_fold max_features method ncorrect n pct
2 10 jac 28 54 0.5185185
3 25 jac 135 138 0.9782609
3 50 jac 140 142 0.9859155
3 100 jac 138 143 0.9650350
# Select best model
bestFold <- 3
bestMaxFeatures <- 50
bestDistMethod <- 'jac'

# Most predicted divcodes
dfeval_most_predicted <- dfeval_mindist %>%
    filter(holdout_fold == bestFold & max_features == bestMaxFeatures & method == bestDistMethod) %>%
    group_by(divcode_predicted) %>%
    summarize(n = n()) %>%
    arrange(desc(n)) %>%
    inner_join(dfdivcodes, by = c("divcode_predicted" = "divcode")) %>%
    select(n, divname)
kbl(head(dfeval_most_predicted, 10), caption = 
  "<i><font color=#000000><b>Table 4.</b> Most commonly predicted divisions of best model.</font></i>") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 4. Most commonly predicted divisions of best model.
n divname
107 Campus networks
11 Division of Information Technology (DoIT)
9 Office of the Vice Chancellor for Research and Graduate Education (OVCRGE)
7 School of Medicine and Public Health (SMPH)
5 General Education Administration
5 College of Agricultural & Life Sciences (CALS)
5 College of Letters & Science (L&S)
4 T&L Collaboration
4 Division of Extension
3 College of Engineering
# Most actual divcodes
dfeval_most_actual <- dfeval_mindist %>%
    filter(holdout_fold == bestFold & max_features == bestMaxFeatures & method == bestDistMethod) %>%
    group_by(divcode_actual) %>%
    summarize(n = n()) %>%
    arrange(desc(n)) %>%
    inner_join(dfdivcodes, by = c("divcode_actual" = "divcode")) %>%
    select(n, divname)
kbl(head(dfeval_most_actual, 10), caption = 
  "<i><font color=#000000><b>Table 5.</b> Most common actual divisions of best model.</font></i>") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 5. Most common actual divisions of best model.
n divname
105 Campus networks
19 Office of the Vice Chancellor for Research and Graduate Education (OVCRGE)
12 College of Letters & Science (L&S)
8 School of Medicine and Public Health (SMPH)
7 Division of Information Technology (DoIT)
7 College of Agricultural & Life Sciences (CALS)
7 College of Engineering
5 University Housing
4 Wisconsin State Lab of Hygiene (WSLH)
3 Division of Extension
# Top features of the best model
buildCorpus(bestFold, bestMaxFeatures)
## [1] "    Creating document term matrix"
dftrain_corpus <- dftrain_corpus %>% 
    arrange(desc(tfidf))
kbl(head(dftrain_corpus, 10), caption = 
  "<i><font color=#000000><b>Table 6.</b> Top features of the best model.</font></i>") %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"), font_size = 14)
Table 6. Top features of the best model.
divcode word tfidf
A53 smph 134.3875
A42 uwbadgers.com 129.9824
A87 www.vetmed.wisc.edu 126.1346
A71 facilities.fpm.wisc.edu 125.3006
A85 halls 124.4442
A19 pull 123.8447
A06 newsbox 123.5651
A06 newscol 123.5651
A06 newsheads 123.5651
A48 ls.wisc.edu 122.9663
# Model speed calculations
dfperf <- dfresult_training %>%
    group_by(holdout_fold, max_features) %>%
    summarize(min_t = min(cum_proc_time), max_t = max(cum_proc_time)) %>%
    ungroup() %>%
    mutate(elapsed = max_t - min_t)

# Plot speed
dfperf %>%
    ggplot(aes(x = max_features, y = elapsed, 
        color = as.character(max_features), shape = as.character(holdout_fold))) +
    geom_point() +
    ggtitle("Model speed evaluation") + 
    xlab("Maximum features per division") + ylab("Model runtime (seconds)")

Validation set

# Skip if the results file already exists
if (file.exists("results_validation.csv") & 1 == 1) {

    # Load results
    dfresult <- read.csv("results_validation.csv")
    dfresult$ipstart <- as.character(dfresult$ipstart)
    dfresult$ipend <- as.character(dfresult$ipend)

} else {

    # Create data frame to hold results and seed it to establish data types
    dfresult <- data.frame(matrix(vector(), 0, 9), stringsAsFactors=F)
    colnames(dfresult) <- 
        c("holdout_fold", "ipstart", "ipend", "divcode_predicted", "divcode_actual", 
            "method", "dist", "max_features", "cum_proc_time")
    dfresult <- dfresult %>% mutate(
        holdout_fold = 0, 
        ipstart = '',
        ipend = '',
        divcode_predicted = '',
        divcode_actual = '',
        method = '',
        dist = 0,
        max_features = 0,
        cum_proc_time = 0
    )
    
    # Starting unix timestamp for calculating processing time
    tstart <- as.numeric(as.POSIXct(Sys.time()))
    
    # Validation set
    print("Processing validation set")
    dfvalidate$fold <- bestFold
    compareDocs(dfvalidate, bestFold, bestMaxFeatures, bestDistMethod, tstart)
    
    # Save results
    resultsFilename = paste0("results_validation_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv")
    write.csv(dfresult, resultsFilename, row.names = F)
    write.csv(dfresult, "results.csv", row.names = F)
    
}
dfresult_validation <- dfresult

Evaluate validation set

# Count scored subnets per holdout fold / max features / method
dfcounts <- dfresult_validation %>%
    group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
    summarize(n = 1) %>%
    ungroup() 
dfcounts_grouped <- dfcounts %>%
    group_by(holdout_fold, max_features, method) %>%
    summarize(n = sum(n))

# Generate evaluation table
dfeval_mindist <- dfresult_validation %>%
    group_by(holdout_fold, ipstart, ipend, divcode_actual, max_features, method) %>%
    slice_min(dist, with_ties = T) %>%
    ungroup()
dfeval <- dfeval_mindist %>%
    inner_join(dfcounts, by = c("holdout_fold", "ipstart", "ipend", "divcode_actual", "max_features", "method")) %>% 
    mutate(correct = ifelse(divcode_actual == divcode_predicted, 1, 0)) %>%
    group_by(holdout_fold, max_features, method) %>%
    summarize(ncorrect = sum(correct)) %>%
    ungroup()
dfeval <- dfeval %>%
    inner_join(dfcounts_grouped, by = c("holdout_fold", "max_features", "method")) %>%
    mutate(pct = ncorrect / n) %>%
    arrange(desc(pct))

# Plot ROC
dfeval %>%
    ggplot(aes(x = 100 * (1 - pct), y = 100 * pct, color = method, shape = as.character(holdout_fold), stroke = 1.5)) +
    geom_point() +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
    facet_wrap(~max_features, labeller = label_both) +
    ggtitle("ROC space for validtion set") +
    xlab("False positive %") + ylab("True positive %") +
    xlim(0, 100) + ylim(0, 100) + 
    labs(shape = "Holdout Fold") + labs(color = "Distance Method") +
    geom_text(aes(label = paste0(round(100 * pct, 1), "%")), hjust=-0.25, vjust=0)

Paired t-test

# Generate table of predicted values from selected model
dfpaired_predicted <- dfresult_validation %>%
    filter(holdout_fold == bestFold & method == bestDistMethod & max_features == bestMaxFeatures) %>%
    group_by(divcode_predicted) %>%
    summarize(n_predicted = n()) %>%
    ungroup() %>%
    rename(divcode = divcode_predicted) %>%
    arrange(divcode)
dfpaired_predicted <- dfpaired_predicted %>%
    mutate(pct_predicted = n_predicted / sum(n_predicted))

# Rebuild the corpus using all folds
buildCorpus(0, bestMaxFeatures)
## [1] "    Creating document term matrix"
dfpaired_actual <- dftrain_corpus %>%
    group_by(divcode) %>%
    summarize(n_actual = n()) %>%
    ungroup() %>%
    arrange(divcode)
dfpaired_actual <- dfpaired_actual %>%
    mutate(pct_actual = n_actual / sum(n_actual))

# Join tables, filling in NAs
dfpaired <- dfpaired_actual %>%
    full_join(dfpaired_predicted, by = c("divcode")) %>%
    mutate_all(~replace(., is.na(.), 0)) %>%
    mutate(diff = abs(pct_actual - pct_predicted))
head(dfpaired, 10)
# Inference for paired data
n_diff <- nrow(dfpaired)
x_diff <- mean(dfpaired$diff)
x_diff_null <- 0
s_diff <- sd(dfpaired$diff)

# Hypothesis test
# H_0: the mean difference between actual and predicted division proportions is zero
# H_A: the mean difference is greater than zero
SE <- s_diff / sqrt(n_diff)
t_stat <- (x_diff - x_diff_null) / SE   # named t_stat to avoid masking base R's T (TRUE)
df <- n_diff - 1
p_val <- pt(t_stat, df = df, lower.tail = F)
paste0("p-value = ", p_val)
## [1] "p-value = 0.000806161549978326"
# Results
# Plot the t-distribution, shading the upper tail beyond the observed statistic
ggplot(data = data.frame(x = c(-4, 4)), mapping = aes(x = x)) +
    stat_function(fun = dt, n = 100, args = list(df = df)) +
    stat_function(fun = dt, n = 100, args = list(df = df),
    xlim = c(t_stat, 4), geom = "area", fill = "#880000", alpha = .8)