For this project, I used a data set from UCI’s Machine Learning Repository which can be found here. I selected a data set that contained several websites that have been archived and identified as being either about ‘Bands’, ‘BioMedical’, ‘Goats’, or ‘Sheep’.
My goal is to perform an analysis on this data and see if I can use machine learning to correctly identify websites based on their words.
I began by writing a function Website.Parse that is designed to read each webpage, give it with a unique document id and it’s associated classification.
Website.Parse <- function(folder, type){
path <- paste0('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\607project4\\', folder, '\\', type, '\\')
list.files(path=path, pattern="\\d+") %>%
map_dfr(~getURL(paste0('file:///', path, .x)) %>%
htmlParse() %>%
xpathSApply('/html//body', xmlValue) %>%
strsplit(split='\\n') %>%
unlist() %>%
as.tibble() %>%
add_column(type=type, .before=1) %>%
add_column(id=paste(type, .x, sep='_'), .before=1)
)
}
I read in the all the websites and then cleaned up the data using tidytext. In addition, I removed all numbers and non-ascii symbols. Finally, I removed all stop words and a number of other reoccuring terms that I did not want in the final analysis. My guess is that a few of the websites had malformed tags as all my custom stop words are html tags.
custom.stopwords <- data_frame(word=c('div', 'p', 'h', 'img'))
data <- c('Bands', 'BioMedical', 'Goats', 'Sheep') %>%
map_df(~Website.Parse('Train', .)) %>%
unnest_tokens(word, value) %>%
anti_join(stop_words) %>%
anti_join(custom.stopwords) %>%
filter(!str_detect(word, '(\\d|\\.|[^A-Za-z])+')) %>%
filter(stringi::stri_enc_mark(word) == 'ASCII') %>%
mutate(word = wordStem(word))
kable(data[1:10, ])
| id | type | word |
|---|---|---|
| Bands_1 | Bands | el |
| Bands_1 | Bands | sob |
| Bands_1 | Bands | el |
| Bands_1 | Bands | sob |
| Bands_1 | Bands | skin |
| Bands_1 | Bands | cat |
| Bands_1 | Bands | excerpt |
| Bands_1 | Bands | song |
| Bands_1 | Bands | min |
| Bands_1 | Bands | colleg |
A quick examination finds 61 ‘Band’ websites, 136 ‘BioMedical’ websites, 70 ‘Goats’ websites, and 65 ‘Sheep’ websites.
data %>%
group_by(type) %>%
summarise(count = n_distinct(id)) %>%
kable()
| type | count |
|---|---|
| Bands | 61 |
| BioMedical | 136 |
| Goats | 70 |
| Sheep | 65 |
I created a tf_idf table and found the top 10 words for each group.
data.tfidf <- data %>%
count(type, word) %>%
bind_tf_idf(word, type, n) %>%
arrange(type, tf_idf) %>%
mutate(order = row_number()) %>%
group_by(type) %>%
top_n(10, tf_idf)
kable(data.tfidf[1:10, ])
| type | word | n | tf | idf | tf_idf | order |
|---|---|---|---|---|---|---|
| Bands | jazz | 26 | 0.0017710 | 1.3862944 | 0.0024551 | 4245 |
| Bands | songwrit | 28 | 0.0019072 | 1.3862944 | 0.0026440 | 4246 |
| Bands | band | 138 | 0.0093999 | 0.2876821 | 0.0027042 | 4247 |
| Bands | cassett | 33 | 0.0022478 | 1.3862944 | 0.0031161 | 4248 |
| Bands | drum | 39 | 0.0026565 | 1.3862944 | 0.0036827 | 4249 |
| Bands | pop | 47 | 0.0032014 | 1.3862944 | 0.0044381 | 4250 |
| Bands | excerpt | 63 | 0.0042913 | 1.3862944 | 0.0059490 | 4251 |
| Bands | min | 67 | 0.0045637 | 1.3862944 | 0.0063267 | 4252 |
| Bands | vocal | 82 | 0.0055855 | 1.3862944 | 0.0077431 | 4253 |
| Bands | song | 249 | 0.0169607 | 0.6931472 | 0.0117563 | 4254 |
The below graph shows the plotted data. The words associated with each category all seem reasonable and mostly distinct. This should aid in the identification of websites. The two most similar (both intuitively and statistically) are ‘Goats’ and ‘Sheep’. In fact, Sheep is one of the top identifiying words in the ‘Goats’ category. It seems most probably that misclassification could occurr between these two sets.
ggplot(data.tfidf, aes(order, tf_idf, fill=type)) +
geom_bar(show.legend=FALSE, stat='identity') +
facet_wrap(~type, scales='free') +
coord_flip() +
theme(axis.text.x=element_text(angle=-30, vjust=1, hjust=0)) +
scale_x_continuous(
breaks = data.tfidf$order,
labels = data.tfidf$word
) +
labs(x=NULL,
y=NULL,
title='Top 10 Best Identifying Words by Topic')
Next, I wanted to perform a sentiment analysis on each of the four types of websites. I joined the data with the ‘afinn’ sentiment dataframe. This dataframe contains a list of words with an associated score from -5 to 5 based on how negative or positive the word is. I then plotted the data along with each group’s average.
It appears that ‘Bands’ websites are most likely to use positive language while ‘Sheep’ websites are more prone to using neutral language.
afinn.data <- data %>%
inner_join(get_sentiments('afinn')) %>%
count(type, word, score) %>%
mutate(total = n/sum(n))
afinn.vline.data <- afinn.data %>%
group_by(type) %>%
summarise(avg = mean(n*score))
ggplot(afinn.data, aes(score, total, fill=factor(score))) +
geom_bar(stat='identity') +
geom_vline(data=afinn.vline.data, aes(xintercept=avg)) +
facet_wrap(~type, ncol=1) +
scale_x_continuous(limits=c(-5,5), breaks=seq(-5,5,1)) +
scale_y_continuous(limits=c(0,.15), breaks=seq(0,.15,.05), expand=c(0,0), labels=percent) +
scale_fill_brewer(palette='RdYlGn') +
labs(x='Sentiment Score',
y='Frequency',
title='Use of Strong Words by Topic') +
theme_bw() +
theme(legend.position='none',
panel.grid.minor=element_blank(),
panel.grid.major.x=element_blank(),
strip.background=element_rect(fill='grey70'))
Sentiment analysis is prone to misrepresenting the positivity or negativity of words. Since words are only examined one at a time the possibility exists that words are preceeded by a negation that changes their meaning (ie: not bad). While these instances are assumed to be a small overall percent of the uses of a word, the greater risk is that a particular domain may have certain words that are viewed as positive or negative only in their specific context. For example, a nature website by use ‘bear’ neutrally while financial documents may be talking about a bear market.
I plotted the top 5 most frequently used sentimental words for each group.
afinn.frequent.data <- afinn.data %>%
group_by(type) %>%
top_n(5, n) %>%
arrange(type, desc(n))
kable(afinn.frequent.data)
| type | word | score | n | total |
|---|---|---|---|---|
| Bands | love | 3 | 40 | 0.0180343 |
| Bands | hard | -1 | 35 | 0.0157800 |
| Bands | humor | 2 | 22 | 0.0099188 |
| Bands | hope | 2 | 19 | 0.0085663 |
| Bands | god | 1 | 18 | 0.0081154 |
| BioMedical | cancer | -1 | 79 | 0.0356177 |
| BioMedical | care | 2 | 70 | 0.0315600 |
| BioMedical | support | 2 | 56 | 0.0252480 |
| BioMedical | prevent | -1 | 46 | 0.0207394 |
| BioMedical | grant | 1 | 26 | 0.0117223 |
| Goats | fair | 2 | 15 | 0.0067628 |
| Goats | care | 2 | 14 | 0.0063120 |
| Goats | growth | 2 | 14 | 0.0063120 |
| Goats | prevent | -1 | 14 | 0.0063120 |
| Goats | vitamin | 1 | 13 | 0.0058611 |
| Sheep | ha | 2 | 57 | 0.0256988 |
| Sheep | die | -3 | 9 | 0.0040577 |
| Sheep | help | 2 | 9 | 0.0040577 |
| Sheep | kill | -3 | 9 | 0.0040577 |
| Sheep | love | 3 | 8 | 0.0036069 |
| Sheep | suffer | -2 | 8 | 0.0036069 |
BioMedical stands out as having the highest potential for misrepresentation. In this context cancer may be a neutral term and injury could possibly be used often as a positive term. For example in “prevent injury”. I decided to further examine the BioMedical data by forming bigrams.
bigram.data <- c('BioMedical') %>%
map_df(~Website.Parse('Train', .)) %>%
unnest_tokens(word, value, token='ngrams', n=2) %>%
separate(word, c('word1', 'word2'), sep=' ') %>%
anti_join(stop_words, by=c('word1'='word')) %>%
anti_join(stop_words, by=c('word2'='word')) %>%
anti_join(custom.stopwords, by=c('word1'='word')) %>%
anti_join(custom.stopwords, by=c('word2'='word')) %>%
filter(!str_detect(word1, '(\\d|\\.|[^A-Za-z])+')) %>%
filter(!str_detect(word2, '(\\d|\\.|[^A-Za-z])+')) %>%
filter(stringi::stri_enc_mark(word1) == 'ASCII') %>%
filter(stringi::stri_enc_mark(word2) == 'ASCII')
kable(bigram.data[1:10, ])
| id | type | word1 | word2 |
|---|---|---|---|
| BioMedical_0-143302914 | BioMedical | journals | conferences |
| BioMedical_0-143302914 | BioMedical | current | awareness |
| BioMedical_0-143302914 | BioMedical | awareness | services |
| BioMedical_0-143302914 | BioMedical | services | biosciences |
| BioMedical_0-143302914 | BioMedical | biosciences | information |
| BioMedical_0-143302914 | BioMedical | information | categorized |
| BioMedical_0-143302914 | BioMedical | complete | list |
| BioMedical_0-143302914 | BioMedical | biomedical | www |
| BioMedical_0-143302914 | BioMedical | www | sites |
| BioMedical_0-143302914 | BioMedical | sites | biology |
I filtered the bigrams to display only those that feature one of the most frequent BioMedical terms and then displayed the graph of frequent word pairings.
afinn.biomedical.frequent.data <- afinn.frequent.data %>%
filter(type == 'BioMedical')
bigram.data %>%
count(word1, word2, sort=TRUE) %>%
filter(word1 %in% afinn.biomedical.frequent.data$word | word2 %in% afinn.biomedical.frequent.data$word,
n >= 2) %>%
mutate(word1 = wordStem(word1)) %>%
mutate(word2 = wordStem(word2)) %>%
graph_from_data_frame() %>%
ggraph(layout='fr') +
geom_edge_link(aes(edge_alpha=n), show.legend=FALSE, arrow=arrow(type='closed', length=unit(.15, 'inches'))) +
geom_node_point(color='lightblue', size=5) +
geom_node_text(aes(label=name), vjust=1, hjust=1) +
theme_void() +
labs(title='Frequently Used BioMedical Word Pairings')
As anticipated, ‘cancer’ has several contexts where it appears to be a neutral term, like ‘cancer center’ for example. ‘Grant’ appears to be discussing a medical grant as opposed to granting someone a something.
If I were performing a deeper analysis of BioMedical websites it seems prudent to create my own stop words dictionary to look for commonly used key words.
Finally, I wanted to see if the websites were unique enough to perform a model classificaiton on them. I prepared the data for training and testing by removing the sparse terms and adding an identifying column named type. I randomly selected 10% of the websites in the data and set them aside. The remaining 90% became the training data.
ml.data <- data %>%
count(type, id, word) %>%
cast_dtm(id, word, n) %>%
removeSparseTerms(sparse=0.99) %>%
as.matrix() %>%
as.data.frame() %>%
mutate(type = str_extract(row.names(.), '[^_]+'))
test.numbers <- sample(1:323, 32, replace=FALSE)
train.data <- ml.data %>%
filter(!(row_number() %in% test.numbers))
test.data <- ml.data %>%
filter(row_number() %in% test.numbers)
The training data was run through a random forest model and then the test.data was passed through for prediction. The confusion matrix for the training set shows highly accurate classification.
train.results <- train(type~., data=train.data,
method='rf',
trControl=trainControl(method='cv',
number = 2
),
verbose=FALSE)
test.results <- predict(train.results, newdata=test.data)
train.results$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2071
##
## OOB estimate of error rate: 8%
## Confusion matrix:
## Bands BioMedical Goats Sheep class.error
## Bands 51 0 0 0 0.00000000
## BioMedical 0 117 0 7 0.05645161
## Goats 0 2 56 7 0.13846154
## Sheep 0 7 1 52 0.13333333
The model had a very high success rate. Only two websites in the testing set were misidentified. Both of them were, as predicted, due to the similarity between Goats and Sheep websites.
present.results <- data_frame(original=test.data$type, prediction=test.results)
table(present.results$original, present.results$prediction)
##
## Bands BioMedical Goats Sheep
## Bands 10 0 0 0
## BioMedical 0 12 0 0
## Goats 0 0 4 1
## Sheep 0 0 1 4
The analysis has been highly successful. The random forest model was approximately 94% successful in identifying websites.