Project 4

By Brian Weinfeld

April 12th, 2018

Frequency Analysis

For this project, I used a data set from UCI’s Machine Learning Repository which can be found here. I selected a data set that contained several websites that have been archived and identified as being either about ‘Bands’, ‘BioMedical’, ‘Goats’, or ‘Sheep’.

My goal is to perform an analysis on this data and see if I can use machine learning to correctly identify websites based on their words.

I began by writing a function Website.Parse that is designed to read each webpage, give it with a unique document id and it’s associated classification.

Website.Parse <- function(folder, type){
  path <- paste0('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\607project4\\', folder, '\\', type, '\\')
  list.files(path=path, pattern="\\d+") %>%
  map_dfr(~getURL(paste0('file:///', path, .x)) %>% 
            htmlParse() %>%
            xpathSApply('/html//body', xmlValue) %>%
            strsplit(split='\\n') %>%
            unlist() %>%
            as.tibble() %>%
            add_column(type=type, .before=1) %>%
            add_column(id=paste(type, .x, sep='_'), .before=1)
        )
}

I read in the all the websites and then cleaned up the data using tidytext. In addition, I removed all numbers and non-ascii symbols. Finally, I removed all stop words and a number of other reoccuring terms that I did not want in the final analysis. My guess is that a few of the websites had malformed tags as all my custom stop words are html tags.

custom.stopwords <- data_frame(word=c('div', 'p', 'h', 'img'))

data <- c('Bands', 'BioMedical', 'Goats', 'Sheep') %>%
  map_df(~Website.Parse('Train', .)) %>%
  unnest_tokens(word, value) %>%
  anti_join(stop_words) %>%
  anti_join(custom.stopwords) %>%
  filter(!str_detect(word, '(\\d|\\.|[^A-Za-z])+')) %>%
  filter(stringi::stri_enc_mark(word) == 'ASCII') %>%
  mutate(word = wordStem(word)) 
kable(data[1:10, ])

id	type	word
Bands_1	Bands	el
Bands_1	Bands	sob
Bands_1	Bands	el
Bands_1	Bands	sob
Bands_1	Bands	skin
Bands_1	Bands	cat
Bands_1	Bands	excerpt
Bands_1	Bands	song
Bands_1	Bands	min
Bands_1	Bands	colleg

A quick examination finds 61 ‘Band’ websites, 136 ‘BioMedical’ websites, 70 ‘Goats’ websites, and 65 ‘Sheep’ websites.

data %>%
  group_by(type) %>%
  summarise(count = n_distinct(id)) %>%
  kable()

type	count
Bands	61
BioMedical	136
Goats	70
Sheep	65

I created a tf_idf table and found the top 10 words for each group.

data.tfidf <- data %>%
  count(type, word) %>%
  bind_tf_idf(word, type, n) %>%
  arrange(type, tf_idf) %>%
  mutate(order = row_number()) %>%
  group_by(type) %>%
  top_n(10, tf_idf)
kable(data.tfidf[1:10, ])

type	word	n	tf	idf	tf_idf	order
Bands	jazz	26	0.0017710	1.3862944	0.0024551	4245
Bands	songwrit	28	0.0019072	1.3862944	0.0026440	4246
Bands	band	138	0.0093999	0.2876821	0.0027042	4247
Bands	cassett	33	0.0022478	1.3862944	0.0031161	4248
Bands	drum	39	0.0026565	1.3862944	0.0036827	4249
Bands	pop	47	0.0032014	1.3862944	0.0044381	4250
Bands	excerpt	63	0.0042913	1.3862944	0.0059490	4251
Bands	min	67	0.0045637	1.3862944	0.0063267	4252
Bands	vocal	82	0.0055855	1.3862944	0.0077431	4253
Bands	song	249	0.0169607	0.6931472	0.0117563	4254

The below graph shows the plotted data. The words associated with each category all seem reasonable and mostly distinct. This should aid in the identification of websites. The two most similar (both intuitively and statistically) are ‘Goats’ and ‘Sheep’. In fact, Sheep is one of the top identifiying words in the ‘Goats’ category. It seems most probably that misclassification could occurr between these two sets.

ggplot(data.tfidf, aes(order, tf_idf, fill=type)) +
  geom_bar(show.legend=FALSE, stat='identity') +
  facet_wrap(~type, scales='free') +
  coord_flip() +
  theme(axis.text.x=element_text(angle=-30, vjust=1, hjust=0)) +
  scale_x_continuous(
    breaks = data.tfidf$order,
    labels = data.tfidf$word
  ) +
  labs(x=NULL,
       y=NULL,
       title='Top 10 Best Identifying Words by Topic')

Sentiment Analysis

Next, I wanted to perform a sentiment analysis on each of the four types of websites. I joined the data with the ‘afinn’ sentiment dataframe. This dataframe contains a list of words with an associated score from -5 to 5 based on how negative or positive the word is. I then plotted the data along with each group’s average.

It appears that ‘Bands’ websites are most likely to use positive language while ‘Sheep’ websites are more prone to using neutral language.

afinn.data <- data %>%
  inner_join(get_sentiments('afinn')) %>%
  count(type, word, score) %>%
  mutate(total = n/sum(n))

afinn.vline.data <- afinn.data %>%
  group_by(type) %>%
  summarise(avg = mean(n*score))

ggplot(afinn.data, aes(score, total, fill=factor(score))) +
  geom_bar(stat='identity') +
  geom_vline(data=afinn.vline.data, aes(xintercept=avg)) +
  facet_wrap(~type, ncol=1) +
  scale_x_continuous(limits=c(-5,5), breaks=seq(-5,5,1)) +
  scale_y_continuous(limits=c(0,.15), breaks=seq(0,.15,.05), expand=c(0,0), labels=percent) +
  scale_fill_brewer(palette='RdYlGn') + 
  labs(x='Sentiment Score',
       y='Frequency',
       title='Use of Strong Words by Topic') +
  theme_bw() + 
  theme(legend.position='none',
        panel.grid.minor=element_blank(),
        panel.grid.major.x=element_blank(),
        strip.background=element_rect(fill='grey70'))

Sentiment analysis is prone to misrepresenting the positivity or negativity of words. Since words are only examined one at a time the possibility exists that words are preceeded by a negation that changes their meaning (ie: not bad). While these instances are assumed to be a small overall percent of the uses of a word, the greater risk is that a particular domain may have certain words that are viewed as positive or negative only in their specific context. For example, a nature website by use ‘bear’ neutrally while financial documents may be talking about a bear market.

I plotted the top 5 most frequently used sentimental words for each group.

afinn.frequent.data <- afinn.data %>%
  group_by(type) %>%
  top_n(5, n) %>%
  arrange(type, desc(n))
kable(afinn.frequent.data)

type	word	score	n	total
Bands	love	3	40	0.0180343
Bands	hard	-1	35	0.0157800
Bands	humor	2	22	0.0099188
Bands	hope	2	19	0.0085663
Bands	god	1	18	0.0081154
BioMedical	cancer	-1	79	0.0356177
BioMedical	care	2	70	0.0315600
BioMedical	support	2	56	0.0252480
BioMedical	prevent	-1	46	0.0207394
BioMedical	grant	1	26	0.0117223
Goats	fair	2	15	0.0067628
Goats	care	2	14	0.0063120
Goats	growth	2	14	0.0063120
Goats	prevent	-1	14	0.0063120
Goats	vitamin	1	13	0.0058611
Sheep	ha	2	57	0.0256988
Sheep	die	-3	9	0.0040577
Sheep	help	2	9	0.0040577
Sheep	kill	-3	9	0.0040577
Sheep	love	3	8	0.0036069
Sheep	suffer	-2	8	0.0036069

BioMedical stands out as having the highest potential for misrepresentation. In this context cancer may be a neutral term and injury could possibly be used often as a positive term. For example in “prevent injury”. I decided to further examine the BioMedical data by forming bigrams.

bigram.data <- c('BioMedical') %>%
  map_df(~Website.Parse('Train', .)) %>%
  unnest_tokens(word, value, token='ngrams', n=2) %>%
  separate(word, c('word1', 'word2'), sep=' ') %>%
  anti_join(stop_words, by=c('word1'='word')) %>%
  anti_join(stop_words, by=c('word2'='word')) %>%
  anti_join(custom.stopwords, by=c('word1'='word')) %>%
  anti_join(custom.stopwords, by=c('word2'='word')) %>%
  filter(!str_detect(word1, '(\\d|\\.|[^A-Za-z])+')) %>%
  filter(!str_detect(word2, '(\\d|\\.|[^A-Za-z])+')) %>%
  filter(stringi::stri_enc_mark(word1) == 'ASCII') %>%
  filter(stringi::stri_enc_mark(word2) == 'ASCII')
kable(bigram.data[1:10, ])

id	type	word1	word2
BioMedical_0-143302914	BioMedical	journals	conferences
BioMedical_0-143302914	BioMedical	current	awareness
BioMedical_0-143302914	BioMedical	awareness	services
BioMedical_0-143302914	BioMedical	services	biosciences
BioMedical_0-143302914	BioMedical	biosciences	information
BioMedical_0-143302914	BioMedical	information	categorized
BioMedical_0-143302914	BioMedical	complete	list
BioMedical_0-143302914	BioMedical	biomedical	www
BioMedical_0-143302914	BioMedical	www	sites
BioMedical_0-143302914	BioMedical	sites	biology

I filtered the bigrams to display only those that feature one of the most frequent BioMedical terms and then displayed the graph of frequent word pairings.

afinn.biomedical.frequent.data <- afinn.frequent.data %>%
  filter(type == 'BioMedical')

bigram.data %>% 
  count(word1, word2, sort=TRUE) %>%
  filter(word1 %in% afinn.biomedical.frequent.data$word | word2 %in% afinn.biomedical.frequent.data$word,
         n >= 2) %>%
  mutate(word1 = wordStem(word1)) %>%
  mutate(word2 = wordStem(word2)) %>%
  graph_from_data_frame() %>%
  ggraph(layout='fr') +
  geom_edge_link(aes(edge_alpha=n), show.legend=FALSE, arrow=arrow(type='closed', length=unit(.15, 'inches'))) + 
  geom_node_point(color='lightblue', size=5) +
  geom_node_text(aes(label=name), vjust=1, hjust=1) +
  theme_void() +
  labs(title='Frequently Used BioMedical Word Pairings')

As anticipated, ‘cancer’ has several contexts where it appears to be a neutral term, like ‘cancer center’ for example. ‘Grant’ appears to be discussing a medical grant as opposed to granting someone a something.

If I were performing a deeper analysis of BioMedical websites it seems prudent to create my own stop words dictionary to look for commonly used key words.

Document Classification

Finally, I wanted to see if the websites were unique enough to perform a model classificaiton on them. I prepared the data for training and testing by removing the sparse terms and adding an identifying column named type. I randomly selected 10% of the websites in the data and set them aside. The remaining 90% became the training data.

ml.data <- data %>%
  count(type, id, word) %>%
  cast_dtm(id, word, n) %>%
  removeSparseTerms(sparse=0.99) %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(type = str_extract(row.names(.), '[^_]+'))

test.numbers <- sample(1:323, 32, replace=FALSE)

train.data <- ml.data %>%
  filter(!(row_number() %in% test.numbers))

test.data <- ml.data %>%
  filter(row_number() %in% test.numbers)

The training data was run through a random forest model and then the test.data was passed through for prediction. The confusion matrix for the training set shows highly accurate classification.

train.results <- train(type~., data=train.data,
           method='rf',
           trControl=trainControl(method='cv',
                                  number = 2
                                  ),
           verbose=FALSE)

test.results <- predict(train.results, newdata=test.data)

train.results$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2071
## 
##         OOB estimate of  error rate: 8%
## Confusion matrix:
##            Bands BioMedical Goats Sheep class.error
## Bands         51          0     0     0  0.00000000
## BioMedical     0        117     0     7  0.05645161
## Goats          0          2    56     7  0.13846154
## Sheep          0          7     1    52  0.13333333

The model had a very high success rate. Only two websites in the testing set were misidentified. Both of them were, as predicted, due to the similarity between Goats and Sheep websites.

present.results <- data_frame(original=test.data$type, prediction=test.results)

table(present.results$original, present.results$prediction)

##             
##              Bands BioMedical Goats Sheep
##   Bands         10          0     0     0
##   BioMedical     0         12     0     0
##   Goats          0          0     4     1
##   Sheep          0          0     1     4

The analysis has been highly successful. The random forest model was approximately 94% successful in identifying websites.