Project 4

By Brian Weinfeld

April 12th, 2018

For this project, I used a data set from UCI’s Machine Learning Repository which can be found here. I selected a data set that contained several websites that have been archived and identified as being either about ‘Bands’, ‘BioMedical’, ‘Goats’, or ‘Sheep’.

My goal is to perform an analysis on this data and see if I can use machine learning to correctly identify websites based on their words.

I began by writing a function Website.Parse that is designed to read each webpage, give it with a unique document id and it’s associated classification.

Website.Parse <- function(folder, type){
  path <- paste0('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\607project4\\', folder, '\\', type, '\\')
  list.files(path=path, pattern="\\d+") %>%
  map_dfr(~getURL(paste0('file:///', path, .x)) %>% 
            htmlParse() %>%
            xpathSApply('/html//body', xmlValue) %>%
            strsplit(split='\\n') %>%
            unlist() %>%
            as.tibble() %>%
            add_column(type=type, .before=1) %>%
            add_column(id=paste(type, .x, sep='_'), .before=1)
        )
}

I read in the all the websites and then cleaned up the data using tidytext. In addition, I removed all numbers and non-ascii symbols. Finally, I removed all stop words and a number of other reoccuring terms that I did not want in the final analysis. My guess is that a few of the websites had malformed tags as all my custom stop words are html tags.

custom.stopwords <- data_frame(word=c('div', 'p', 'h', 'img'))

data <- c('Bands', 'BioMedical', 'Goats', 'Sheep') %>%
  map_df(~Website.Parse('Train', .)) %>%
  unnest_tokens(word, value) %>%
  anti_join(stop_words) %>%
  anti_join(custom.stopwords) %>%
  filter(!str_detect(word, '(\\d|\\.|[^A-Za-z])+')) %>%
  filter(stringi::stri_enc_mark(word) == 'ASCII') %>%
  mutate(word = wordStem(word)) 
kable(data[1:10, ])
id type word
Bands_1 Bands el
Bands_1 Bands sob
Bands_1 Bands el
Bands_1 Bands sob
Bands_1 Bands skin
Bands_1 Bands cat
Bands_1 Bands excerpt
Bands_1 Bands song
Bands_1 Bands min
Bands_1 Bands colleg

A quick examination finds 61 ‘Band’ websites, 136 ‘BioMedical’ websites, 70 ‘Goats’ websites, and 65 ‘Sheep’ websites.

data %>%
  group_by(type) %>%
  summarise(count = n_distinct(id)) %>%
  kable()
type count
Bands 61
BioMedical 136
Goats 70
Sheep 65

I created a tf_idf table and found the top 10 words for each group.

data.tfidf <- data %>%
  count(type, word) %>%
  bind_tf_idf(word, type, n) %>%
  arrange(type, tf_idf) %>%
  mutate(order = row_number()) %>%
  group_by(type) %>%
  top_n(10, tf_idf)
kable(data.tfidf[1:10, ])
type word n tf idf tf_idf order
Bands jazz 26 0.0017710 1.3862944 0.0024551 4245
Bands songwrit 28 0.0019072 1.3862944 0.0026440 4246
Bands band 138 0.0093999 0.2876821 0.0027042 4247
Bands cassett 33 0.0022478 1.3862944 0.0031161 4248
Bands drum 39 0.0026565 1.3862944 0.0036827 4249
Bands pop 47 0.0032014 1.3862944 0.0044381 4250
Bands excerpt 63 0.0042913 1.3862944 0.0059490 4251
Bands min 67 0.0045637 1.3862944 0.0063267 4252
Bands vocal 82 0.0055855 1.3862944 0.0077431 4253
Bands song 249 0.0169607 0.6931472 0.0117563 4254

The below graph shows the plotted data. The words associated with each category all seem reasonable and mostly distinct. This should aid in the identification of websites. The two most similar (both intuitively and statistically) are ‘Goats’ and ‘Sheep’. In fact, Sheep is one of the top identifiying words in the ‘Goats’ category. It seems most probably that misclassification could occurr between these two sets.

ggplot(data.tfidf, aes(order, tf_idf, fill=type)) +
  geom_bar(show.legend=FALSE, stat='identity') +
  facet_wrap(~type, scales='free') +
  coord_flip() +
  theme(axis.text.x=element_text(angle=-30, vjust=1, hjust=0)) +
  scale_x_continuous(
    breaks = data.tfidf$order,
    labels = data.tfidf$word
  ) +
  labs(x=NULL,
       y=NULL,
       title='Top 10 Best Identifying Words by Topic')

Next, I wanted to perform a sentiment analysis on each of the four types of websites. I joined the data with the ‘afinn’ sentiment dataframe. This dataframe contains a list of words with an associated score from -5 to 5 based on how negative or positive the word is. I then plotted the data along with each group’s average.

It appears that ‘Bands’ websites are most likely to use positive language while ‘Sheep’ websites are more prone to using neutral language.

afinn.data <- data %>%
  inner_join(get_sentiments('afinn')) %>%
  count(type, word, score) %>%
  mutate(total = n/sum(n))

afinn.vline.data <- afinn.data %>%
  group_by(type) %>%
  summarise(avg = mean(n*score))

ggplot(afinn.data, aes(score, total, fill=factor(score))) +
  geom_bar(stat='identity') +
  geom_vline(data=afinn.vline.data, aes(xintercept=avg)) +
  facet_wrap(~type, ncol=1) +
  scale_x_continuous(limits=c(-5,5), breaks=seq(-5,5,1)) +
  scale_y_continuous(limits=c(0,.15), breaks=seq(0,.15,.05), expand=c(0,0), labels=percent) +
  scale_fill_brewer(palette='RdYlGn') + 
  labs(x='Sentiment Score',
       y='Frequency',
       title='Use of Strong Words by Topic') +
  theme_bw() + 
  theme(legend.position='none',
        panel.grid.minor=element_blank(),
        panel.grid.major.x=element_blank(),
        strip.background=element_rect(fill='grey70'))

Sentiment analysis is prone to misrepresenting the positivity or negativity of words. Since words are only examined one at a time the possibility exists that words are preceeded by a negation that changes their meaning (ie: not bad). While these instances are assumed to be a small overall percent of the uses of a word, the greater risk is that a particular domain may have certain words that are viewed as positive or negative only in their specific context. For example, a nature website by use ‘bear’ neutrally while financial documents may be talking about a bear market.

I plotted the top 5 most frequently used sentimental words for each group.

afinn.frequent.data <- afinn.data %>%
  group_by(type) %>%
  top_n(5, n) %>%
  arrange(type, desc(n))
kable(afinn.frequent.data)
type word score n total
Bands love 3 40 0.0180343
Bands hard -1 35 0.0157800
Bands humor 2 22 0.0099188
Bands hope 2 19 0.0085663
Bands god 1 18 0.0081154
BioMedical cancer -1 79 0.0356177
BioMedical care 2 70 0.0315600
BioMedical support 2 56 0.0252480
BioMedical prevent -1 46 0.0207394
BioMedical grant 1 26 0.0117223
Goats fair 2 15 0.0067628
Goats care 2 14 0.0063120
Goats growth 2 14 0.0063120
Goats prevent -1 14 0.0063120
Goats vitamin 1 13 0.0058611
Sheep ha 2 57 0.0256988
Sheep die -3 9 0.0040577
Sheep help 2 9 0.0040577
Sheep kill -3 9 0.0040577
Sheep love 3 8 0.0036069
Sheep suffer -2 8 0.0036069

BioMedical stands out as having the highest potential for misrepresentation. In this context cancer may be a neutral term and injury could possibly be used often as a positive term. For example in “prevent injury”. I decided to further examine the BioMedical data by forming bigrams.

bigram.data <- c('BioMedical') %>%
  map_df(~Website.Parse('Train', .)) %>%
  unnest_tokens(word, value, token='ngrams', n=2) %>%
  separate(word, c('word1', 'word2'), sep=' ') %>%
  anti_join(stop_words, by=c('word1'='word')) %>%
  anti_join(stop_words, by=c('word2'='word')) %>%
  anti_join(custom.stopwords, by=c('word1'='word')) %>%
  anti_join(custom.stopwords, by=c('word2'='word')) %>%
  filter(!str_detect(word1, '(\\d|\\.|[^A-Za-z])+')) %>%
  filter(!str_detect(word2, '(\\d|\\.|[^A-Za-z])+')) %>%
  filter(stringi::stri_enc_mark(word1) == 'ASCII') %>%
  filter(stringi::stri_enc_mark(word2) == 'ASCII')
kable(bigram.data[1:10, ])
id type word1 word2
BioMedical_0-143302914 BioMedical journals conferences
BioMedical_0-143302914 BioMedical current awareness
BioMedical_0-143302914 BioMedical awareness services
BioMedical_0-143302914 BioMedical services biosciences
BioMedical_0-143302914 BioMedical biosciences information
BioMedical_0-143302914 BioMedical information categorized
BioMedical_0-143302914 BioMedical complete list
BioMedical_0-143302914 BioMedical biomedical www
BioMedical_0-143302914 BioMedical www sites
BioMedical_0-143302914 BioMedical sites biology

I filtered the bigrams to display only those that feature one of the most frequent BioMedical terms and then displayed the graph of frequent word pairings.

afinn.biomedical.frequent.data <- afinn.frequent.data %>%
  filter(type == 'BioMedical')

bigram.data %>% 
  count(word1, word2, sort=TRUE) %>%
  filter(word1 %in% afinn.biomedical.frequent.data$word | word2 %in% afinn.biomedical.frequent.data$word,
         n >= 2) %>%
  mutate(word1 = wordStem(word1)) %>%
  mutate(word2 = wordStem(word2)) %>%
  graph_from_data_frame() %>%
  ggraph(layout='fr') +
  geom_edge_link(aes(edge_alpha=n), show.legend=FALSE, arrow=arrow(type='closed', length=unit(.15, 'inches'))) + 
  geom_node_point(color='lightblue', size=5) +
  geom_node_text(aes(label=name), vjust=1, hjust=1) +
  theme_void() +
  labs(title='Frequently Used BioMedical Word Pairings')

As anticipated, ‘cancer’ has several contexts where it appears to be a neutral term, like ‘cancer center’ for example. ‘Grant’ appears to be discussing a medical grant as opposed to granting someone a something.

If I were performing a deeper analysis of BioMedical websites it seems prudent to create my own stop words dictionary to look for commonly used key words.

Finally, I wanted to see if the websites were unique enough to perform a model classificaiton on them. I prepared the data for training and testing by removing the sparse terms and adding an identifying column named type. I randomly selected 10% of the websites in the data and set them aside. The remaining 90% became the training data.

ml.data <- data %>%
  count(type, id, word) %>%
  cast_dtm(id, word, n) %>%
  removeSparseTerms(sparse=0.99) %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(type = str_extract(row.names(.), '[^_]+'))

test.numbers <- sample(1:323, 32, replace=FALSE)

train.data <- ml.data %>%
  filter(!(row_number() %in% test.numbers))

test.data <- ml.data %>%
  filter(row_number() %in% test.numbers)

The training data was run through a random forest model and then the test.data was passed through for prediction. The confusion matrix for the training set shows highly accurate classification.

train.results <- train(type~., data=train.data,
           method='rf',
           trControl=trainControl(method='cv',
                                  number = 2
                                  ),
           verbose=FALSE)

test.results <- predict(train.results, newdata=test.data)

train.results$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2071
## 
##         OOB estimate of  error rate: 8%
## Confusion matrix:
##            Bands BioMedical Goats Sheep class.error
## Bands         51          0     0     0  0.00000000
## BioMedical     0        117     0     7  0.05645161
## Goats          0          2    56     7  0.13846154
## Sheep          0          7     1    52  0.13333333

The model had a very high success rate. Only two websites in the testing set were misidentified. Both of them were, as predicted, due to the similarity between Goats and Sheep websites.

present.results <- data_frame(original=test.data$type, prediction=test.results)

table(present.results$original, present.results$prediction)
##             
##              Bands BioMedical Goats Sheep
##   Bands         10          0     0     0
##   BioMedical     0         12     0     0
##   Goats          0          0     4     1
##   Sheep          0          0     1     4

The analysis has been highly successful. The random forest model was approximately 94% successful in identifying websites.