The University of Southern California’s LGBT Resource Center (LGBTRC) website was rudimentarily text mined to identify groups based on webpage similarity. Finding these similarities were meant to serve as a starting point for reorganizing the website menus, so that information would be more intuitive to access. After removing high-frequency words and word pairs (bigrams) that did not help differentiate among webpages, five major groups were found. The subjects were: queer resources for a universal audience, queer resources specific to USC, USC resources, navigating gender, and navigating graduation.
The two major forms of analyses used were term frequency tables, and hierarchical clustering.
Term frequency tables recorded the counts of each word, or word-pair, appearing within individual webpages. The more similar the counts, the more similar the webpages. In addition, the types of terms that appeared within tables, suggested general topics of the webpage, or webpage group.
Hierarchical clustering organized webpages by branching levels of similarity, separating pages into groups based on how different they were. The branching structure provided a simple outline for possible menus.
The following aspects of each page were recorded:
Aspect | Definition |
---|---|
ID | Each page was assigned an ID number based on alphabetical order. |
Page | Absolute link (full URL) to webpage |
Title | Title of webpage |
Text | Entire text of page, without any paragraph indents. |
Section | Location in dropdown menu on unreformatted LGBTRC website |
Links | Total number of links. Includes links to LGBTRC website, USC sites, non-USC sites, email, videos, and image links. |
LGBTRC | Total number of links to any page hosted under lgbtrc.usc.edu, including files. Excludes emails. |
USC | Total number of links to pages hosted under usc.edu OR survey forms, excluding lgbtrc.edu pages and email. |
Outside | Total number of links to any page hosted outside of lgbtrc.usc.edu and usc.edu, including files. Excludes emails. |
Total number of links to email addresses. | |
Videos | Total number of links to videos. |
Linked Images | Total number of images containing links to either media files or websites. |
Unlinked Images | Total number of images without links. |
Images | Total number of images. |
Parent | Name of parent page (a parent page is any page with additional pages, i.e. children, nested under it on a website). |
Issues | Any immediately observable problems the page had. Recorded with the aim of returning to fix page-specific issues after website is reorganized. |
First, convert data on Google Spreadsheets into CSV format. Import dataset.
q <- read.csv("C:\\Users\\greya\\Downloads\\v1.csv")
#Google Drive corrupted the file halfway through this project. This was the code
#used to fix the issue; may not be needed for other computers.
q <- q[-c(110:133),]
q$Page <- factor(q$Page)
q$Title <- factor(q$Title)
q$Text <- factor(q$Text)
q$Parent <- factor(q$Parent)
q$Issues <- factor(q$Issues)
This is what the initial data looks like:
q[1, ]
## ID Page Title
## 1 1 https://lgbtrc.usc.edu/events-2/ (no title)
## Text
## 1 First uRAP Meeting | August 31st | 7:30 <U+0096> 8:30pm Resource Center Open House | September 3rd | 5:00 <U+0096> 7:00pm First FAB Meeting | September 3rd | 7:00pm First Rainbow International Lunch | September 25th | 12:00 <U+0096> 1:30pm
## Section Links LGBTRC USC Outside Email Videos Linked.Images
## 1 None 4 4 0 0 0 0 0
## Unlinked.Images Total.Images Parent
## 1 1 1 None
## Issues
## 1 invisible, 4 redundant links to same location of event page, which has no information. missing calendar, accessibility image s
Exclude any aspects containing information that will not help with grouping. These aspects include:
Aspect | Reason for Exclusion |
---|---|
Page | ID serves as a shorter, easier-to-read identifier |
Section | Old category—because the goal is to recreate new methods of page categorization, any old categories should not be used to evaluate similarity among pages. |
Parent | Old category—see aspect “Section” for explanation. |
Issues | Grouping pages by how difficult they are to access, does not help with accessibility. |
This leaves us with:
qe <- q[,-c(2, 5, 15:16)] #dataset with excluded variables
Given the remaining aspects, there are several analysis levels that can be used to compare pages. Here they are listed from the fewest to the most variables:
Analysis Number | Aspects Involved |
---|---|
1 | Text |
2 | Bigram |
3 | Text, Links |
4 | Bigram, Links |
5 | Text, LGBTRC, USC, Outside |
6 | Bigram, LGBTRC, USC, Outside |
7 | Text, LGBTRC, USC, Outside, Email, Videos, Images |
8 | Bigram, LGBTRC, USC, Outside, Email, Videos, Images |
9 | Text, LGBTRC, USC, Outside, Email, Videos, Linked Images, Unlinked Images |
10 | Bigram, LGBTRC, USC, Outside, Email, Videos, Images |
If body text alone is insufficient for analysis, bigrams may become the unit of focus. Likewise, if bigrams are insufficient, body text will become the unit of focus.
Analysis 1 examined the similarity of pages by body text alone. The groups formed by this analysis were numerous and topically incoherent, suggesting that bigrams were a more appropriate unit of focus.
Analysis 2a examined the similarity of pages by bigram. The groups formed here were fewer and more topically coherent, but possibly confounded by high-frequency bigrams that had exponentially higher counts than the other bigrams.
Analysis 2b was the same as 2a, but with the most frequent bigram removed. This analysis provided the most topically coherent groups.
Analysis 2c was the same as 2a, but with all high-frequency bigrams removed. This analysis provided the fewest groups. However, these groups were topically incoherent and not useful for organizing menus.
Using the dataset with excluded variables (qe), we will begin preparing a cleaned version of the set for analysis 1, on whether pages sharing similar body text should be grouped together. Cleaning will be done with the package, tm; for more information on how tm works, see the package’s vignette on CRAN by Ingo Feinerer:
#replace any non A~Z character with a space
qe$Text <- gsub("[^[:alnum:][:space:]]"," ", qe$Text)
#convert to tm corpus
tc <- Corpus(VectorSource(qe$Text))
#The converted column is stored in a new variable, tc, to expedite typing for cleaning:
tc <- tm_map(tc, content_transformer(tolower)) #makes all text lowercase
tc <- tm_map(tc, removeNumbers) #removes numbers
tc <- tm_map(tc, removeWords, stopwords("english")) #removes stopwords, e.g. and
tc <- tm_map(tc, removePunctuation) #removes punctuation
tc <- tm_map(tc, stripWhitespace) #removes extra spaces and paragraph indents
tc <- tm_map(tc, stemDocument) #reduces words to root form, e.g. cars to car
inspect(tc[1]) #check results to see if additional cleaning needed
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 1
##
## [1] first urap meet august st pm resourc center open hous septemb rd pm first fab meet septemb rd pm first rainbow intern lunch septemb th pm
#(abbreviated example of only one result shown here)
#remove additional:
tc <- tm_map(tc, removeWords, c("lgbt", "also", "can", "may", "los"))
Now that the text has been cleaned, we can create a document term matrix (dtm) to track the frequency of words on each webpage:
dtmTC <- DocumentTermMatrix(tc)
inspect(dtmTC)
## <<DocumentTermMatrix (documents: 109, terms: 2700)>>
## Non-/sparse entries: 9484/284816
## Sparsity : 97%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs alli center communiti gay gender program resourc student transgend
## 107 7 26 7 44 5 14 13 84 5
## 108 0 16 8 1 4 11 7 39 1
## 11 6 11 3 2 1 11 7 29 1
## 13 0 4 3 0 2 1 2 22 0
## 28 10 4 7 2 0 5 4 12 1
## 42 2 4 2 3 0 0 3 10 3
## 44 4 1 9 9 87 0 0 0 11
## 55 0 5 0 1 9 0 4 17 1
## 61 0 5 0 3 3 2 9 7 2
## 92 0 6 0 0 5 3 1 9 10
## Terms
## Docs usc
## 107 32
## 108 23
## 11 18
## 13 8
## 28 9
## 42 12
## 44 0
## 55 7
## 61 4
## 92 8
As shown in the subset above, each webpage is listed by its ID number as a “doc.” Terms are listed as columns, and the counts of every term are listed for all the docs/webpages.
From the dtm, you can tell that most of the terms are sparse (i.e. are unique to individual documents). For the sake of determining page similarity, all terms that do not appear at least once, among at least two different pages, shall be removed:
#To accomplish this, the function "trim" from quanteda, which requires a conversion to the dfm datatype, will be used:
dfmTC <- as.dfm(dtmTC) #convert to dfm
dfmTC <- dfm_trim(dfmTC, min_count = 1, min_docfreq = 2) #trim terms
dtmTC <- as.DocumentTermMatrix(dfmTC) #revert to dtm
inspect(dtmTC) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1262)>>
## Non-/sparse entries: 8046/129512
## Sparsity : 94%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs alli center communiti gay gender program resourc student transgend
## 107 7 26 7 44 5 14 13 84 5
## 108 0 16 8 1 4 11 7 39 1
## 11 6 11 3 2 1 11 7 29 1
## 13 0 4 3 0 2 1 2 22 0
## 28 10 4 7 2 0 5 4 12 1
## 42 2 4 2 3 0 0 3 10 3
## 44 4 1 9 9 87 0 0 0 11
## 55 0 5 0 1 9 0 4 17 1
## 61 0 5 0 3 3 2 9 7 2
## 92 0 6 0 0 5 3 1 9 10
## Terms
## Docs usc
## 107 32
## 108 23
## 11 18
## 13 8
## 28 9
## 42 12
## 44 0
## 55 7
## 61 4
## 92 8
Below are remaining terms, organized by frequency across the entire website:
termsTC <- colSums(as.matrix(dtmTC)) #get and store term counts
dftermsTC <- data.frame(word = names(termsTC),freq=termsTC) #put counts into table
dftermsTC <- dplyr::arrange(dftermsTC, desc(freq)) #sort counts in descending order
dftermsTC[1:30,] #display table of top 30 words
## word freq
## 1 student 508
## 2 usc 292
## 3 gender 204
## 4 center 197
## 5 communiti 178
## 6 resourc 167
## 7 program 150
## 8 alli 144
## 9 gay 134
## 10 transgend 125
## 11 graduat 116
## 12 ident 116
## 13 support 109
## 14 queer 108
## 15 lesbian 107
## 16 campus 103
## 17 person 95
## 18 sexual 89
## 19 group 85
## 20 univers 80
## 21 peopl 80
## 22 use 77
## 23 one 76
## 24 educ 73
## 25 servic 72
## 26 organ 71
## 27 angel 70
## 28 health 70
## 29 inform 68
## 30 event 68
# Visuals:
dftermsTC[1:30,] %>%
mutate(word = fct_reorder(word, freq)) %>%
ggplot(aes(x=word, y=freq)) +
geom_bar(stat="identity", fill = "magenta4") +
coord_flip() +
ggtitle("Frequency Distribution of Top 30 Words") +
labs(y="Frequency", x = "Word")
The most frequent word is “student,” suggesting that the resources of the site are primarily geared towards students rather than faculty and staff. Because the term “student” is so frequent, it may be unhelpful to include it as a term for determining similarity, if almost the pages contain it. Studying the n-grams (clusters of words that appear together) may give some more insight.
Two n-grams will be examined: bigrams (likelihood of 2 words appearing side by side) and trigrams (likelihood of 3 words appearing side by side).
#BIGRAMS
tcq <- corpus(tc$content, docvars = tc$dmeta) #convert to quanteda corpus
bicolloTC <- textstat_collocations(tcq, size = 2) #calculate bigrams
bicolloTC <- dplyr::arrange(bicolloTC, desc(count)) #sort by descending count
#display frequent and significant bigrams containing "student"
#frequent is defined as having a double-digit count
#significant is defined by the Wald test z-statistic >= 1
bicolloTC[grep("student", bicolloTC$collocation), ] %>%
filter(count >= 10) %>%
filter(z >= 1)
## collocation count count_nested length lambda z
## 1 graduat student 40 0 2 3.0022993 15.020162
## 2 alli student 35 0 2 2.4994263 12.580821
## 3 student affair 33 0 2 7.8345679 5.496063
## 4 student organ 27 0 2 3.1335443 12.690186
## 5 student assembl 25 0 2 4.4088724 12.308725
## 6 divis student 21 0 2 7.3639908 5.144968
## 7 usc student 16 0 2 0.7655627 2.972386
## 8 student health 13 0 2 2.1363037 6.984828
## 9 undergradu student 11 0 2 3.1060552 8.211545
## 10 colleg student 11 0 2 2.5418022 7.343925
## 11 student servic 11 0 2 1.9044482 5.869243
## 12 intern student 10 0 2 3.0687012 7.808554
## 13 student staff 10 0 2 2.0292016 5.919633
## 14 transgend student 10 0 2 1.1761338 3.613176
These bigrams show that “student” is used as a qualifier in diverse contexts, and should be left in the data. What about the trigrams?
#TRIGRAMS
tricolloTC <- textstat_collocations(tcq, size = 3) #calculate trigrams
tricolloTC <- dplyr::arrange(tricolloTC, desc(count)) #sort by descending count
#display frequent and significant trigrams containing "student"
#frequent is defined as having a double-digit count
#significant is defined by the Wald test z-statistic >= 1
tricolloTC[grep("student", tricolloTC$collocation), ] %>%
filter(count >= 10) %>%
filter(z >= 1)
## collocation count count_nested length lambda z
## 1 queer alli student 17 0 3 4.411056 2.898238
## 2 alli student assembl 17 0 3 1.997375 1.309327
At the trigram level, including the term “student” is redundant; both listed trigrams reference QuASA, the Queer and Ally Student Assembly. This is less evidence to suggest eliminating “student” as a stopword, and more evidence that bigrams are the appropriate size of word grouping to analyze.
Conclusion: “student” should be kept in the analysis, despite its frequency.
Dendrograms are perfect for showing the similarities among objects, especially webpages, which are nested hierarchically under each other. Because dendrograms are hierarchical, there are several types of methods you can choose from to organize the graph. The method used here will be complete clustering (separating pages by maximum difference).
To build a dendrogram, we need to make a distance matrix, which gives a quantitative calculation of how similar pages are to each other. In this case, the metric used will be cosine similarity; Kan Nishida gives a good explanation of what cosine similarity is, and why it’s ideal for text similarity analysis.
#Compute and store cosine similarity distance matrix on webpages
disTC <- textstat_simil(dfmTC, method = "cosine", margin = "documents")
#Create Dendrogram
denTC <- hclust(disTC, method = "complete")
#Label Dendrogram
labels(denTC) <- as.character(qe$Title[ #convert to character; order Titles by
order.dendrogram(as.dendrogram(denTC))]) #order of appearance in dendrogram
#Plot Dendrogram
plot(as.dendrogram(denTC), main = "Complete Linkage Clustering: Analysis 1,
Text", horiz = TRUE)
This dendogram has several places where it could be cut into menu headers. Stephanie Lin of UX Booth suggests that, ideally there should be three to four levels of hierarchy in a website’s navigation. Using that guideline, the dendrogram might be cut here, creating roughly 11 groups:
plot(as.dendrogram(denTC), main = "Complete Linkage Clustering: Analysis 1,
Text", horiz = TRUE)
abline(v=0.45)
#color dendrogram by groups based on line of division
#cut tree at line and divide into groups
group = cutree(as.dendrogram(denTC), h = 0.45)
labelColors = c("fire brick", "dark orange 1", "gold", "dark green", "blue", "cornflower blue", "magenta 4", "black", "brown", "cyan 3", "deep pink", "medium turquoise")
#use custom function to get color labels
colLab <- function(n) {
if (is.leaf(n)) {
a <- attributes(n)
labCol <- labelColors[group[which(names(group) == a$label)]]
attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
}
n
}
#apply custom function to color dendrogram
colDenTC <- dendrapply(as.dendrogram(denTC), colLab)
plot(colDenTC, main = "Complete Linkage Clustering: Analysis 1,
Text", horiz = TRUE)
abline(v = 0.45)
While many pages sharing similar topics, such as offering resources, were placed together, most pages were topically divergent based solely on title. Here are the top ten term frequency tables for each group.
ga1 <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)
#GROUP 1
xtc <- tc[(dplyr::filter(ga1, Group == 1))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g1 <- dftermsXTC #store results
head(dfta1g1, 10) #display table
## word freq
## 1 student 53
## 2 chang 36
## 3 usc 30
## 4 program 28
## 5 name 28
## 6 center 24
## 7 resourc 20
## 8 communiti 20
## 9 gender 20
## 10 assess 18
#GROUP 2
xtc <- tc[(dplyr::filter(ga1, Group == 2))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g2 <- dftermsXTC #store results
head(dfta1g2, 10) #display table
## word freq
## 1 student 26
## 2 gender 14
## 3 alli 13
## 4 teammat 13
## 5 gay 12
## 6 communiti 10
## 7 support 10
## 8 program 9
## 9 athlet 9
## 10 coach 9
#GROUP 3
xtc <- tc[(dplyr::filter(ga1, Group == 3))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g3 <- dftermsXTC #store results
head(dfta1g3, 10) #display table
## word freq
## 1 student 69
## 2 usc 40
## 3 center 29
## 4 resourc 27
## 5 graduat 27
## 6 famili 26
## 7 communiti 24
## 8 program 23
## 9 alli 22
## 10 lcc 22
#GROUP 4
xtc <- tc[(dplyr::filter(ga1, Group == 4))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g4 <- dftermsXTC #store results
head(dfta1g4, 10) #display table
## word freq
## 1 student 34
## 2 usc 24
## 3 award 23
## 4 receiv 13
## 5 program 12
## 6 center 11
## 7 honor 11
## 8 queer 10
## 9 year 10
## 10 affair 9
#GROUP 5
xtc <- tc[(dplyr::filter(ga1, Group == 5))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g5 <- dftermsXTC #store results
head(dfta1g5, 10) #display table
## word freq
## 1 student 41
## 2 usc 32
## 3 communiti 23
## 4 center 22
## 5 resourc 21
## 6 gender 18
## 7 alli 18
## 8 transgend 17
## 9 support 16
## 10 come 16
#GROUP 6
xtc <- tc[(dplyr::filter(ga1, Group == 6))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g6 <- dftermsXTC #store results
head(dfta1g6, 10) #display table
## word freq
## 1 student 72
## 2 usc 37
## 3 resourc 31
## 4 center 30
## 5 communiti 26
## 6 graduat 24
## 7 transgend 24
## 8 program 23
## 9 queer 22
## 10 support 21
#GROUP 7
xtc <- tc[(dplyr::filter(ga1, Group == 7))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g7 <- dftermsXTC #store results
head(dfta1g7, 10) #display table
## word freq
## 1 gender 98
## 2 student 76
## 3 person 52
## 4 usc 52
## 5 use 51
## 6 male 50
## 7 one 47
## 8 ident 47
## 9 see 47
## 10 femal 42
#GROUP 8
xtc <- tc[(dplyr::filter(ga1, Group == 8))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g8 <- dftermsXTC #store results
head(dfta1g8, 10) #display table
## word freq
## 1 student 25
## 2 usc 18
## 3 communiti 14
## 4 gender 14
## 5 resourc 14
## 6 floor 13
## 7 transgend 11
## 8 center 10
## 9 ident 10
## 10 sexual 9
#GROUP 9
xtc <- tc[(dplyr::filter(ga1, Group == 9))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g9 <- dftermsXTC #store results
head(dfta1g9, 10) #display table
## word freq
## 1 student 11
## 2 usc 11
## 3 queer 8
## 4 alli 7
## 5 communiti 6
## 6 first 4
## 7 graduat 4
## 8 lawrenc 4
## 9 lindsey 4
## 10 organ 4
#GROUP 10
xtc <- tc[(dplyr::filter(ga1, Group == 10))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g10 <- dftermsXTC #store results
head(dfta1g10, 10) #display table
## word freq
## 1 student 17
## 2 alli 9
## 3 admiss 8
## 4 usc 8
## 5 experi 7
## 6 guid 6
## 7 center 5
## 8 colleg 5
## 9 campus 4
## 10 floor 4
#GROUP 11
xtc <- tc[(dplyr::filter(ga1, Group == 11))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g11 <- dftermsXTC #store results
head(dfta1g11, 10) #display table
## word freq
## 1 student 84
## 2 gay 44
## 3 usc 32
## 4 center 26
## 5 lesbian 24
## 6 first 19
## 7 graduat 18
## 8 alumni 14
## 9 program 14
## 10 board 13
Examining the frequency tables, we find that many of the same words dominate the top of the list across groups. This suggests that the websites were sorted poorly, because the unit of analysis was too specific. Rather than determining word-by-word, we should have been looking by word associations, e,g, bigrams, to focus on differentiation by term context, rather than just term frequency.
Repeating the process of creating a distance matrix and dendrogram for bigrams, requires a new document term matrix:
#create tm corpus from bigrams
bt <- tokens(tcq, what = "word", ngram = 2)
bc <- VCorpus(VectorSource(bt)) #Though we used SimpleCorpus before, SimpleCorpus
#finds incorrect docs. VCorpus fixes the issue.
#create bigrams dtm
dtmBC <- DocumentTermMatrix(bc)
inspect(dtmBC)
## <<DocumentTermMatrix (documents: 109, terms: 12701)>>
## Non-/sparse entries: 15948/1368461
## Sparsity : 99%
## Maximal term length: 29
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs alli_student bisexu_transgend faculti_staff gay_lesbian gender_ident
## 107 3 4 3 22 0
## 108 0 1 3 0 2
## 11 4 0 1 0 0
## 13 0 0 0 0 0
## 28 3 0 4 1 0
## 42 1 1 0 2 0
## 44 0 3 0 2 27
## 55 0 0 0 1 2
## 61 0 1 0 0 1
## 92 0 0 0 0 3
## Terms
## Docs graduat_student lesbian_gay rainbow_floor resourc_center usc_edu
## 107 7 1 3 13 0
## 108 1 1 0 5 17
## 11 3 0 1 7 0
## 13 6 0 0 2 2
## 28 2 0 0 2 2
## 42 0 1 0 3 0
## 44 0 2 0 0 0
## 55 1 0 0 4 0
## 61 0 3 2 5 1
## 92 0 0 0 0 1
#trim sparse bigrams:
#spare = appears at least once, in at least two docs/webpages
dfmBC <- as.dfm(dtmBC) #convert to dfm
dfmBC <- dfm_trim(dfmBC, min_count = 1, min_docfreq = 2) #trim bigrams
dtmBC <- as.DocumentTermMatrix(dfmBC) #revert to dtm
inspect(dtmBC) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1755)>>
## Non-/sparse entries: 5002/186293
## Sparsity : 97%
## Maximal term length: 22
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs alli_student bisexu_transgend faculti_staff gay_lesbian gender_ident
## 107 3 4 3 22 0
## 108 0 1 3 0 2
## 11 4 0 1 0 0
## 18 1 1 0 0 2
## 28 3 0 4 1 0
## 34 1 1 1 0 0
## 35 0 4 2 2 0
## 44 0 3 0 2 27
## 75 0 1 0 0 2
## 76 0 1 0 0 2
## Terms
## Docs graduat_student lesbian_gay rainbow_floor resourc_center usc_edu
## 107 7 1 3 13 0
## 108 1 1 0 5 17
## 11 3 0 1 7 0
## 18 0 1 4 2 0
## 28 2 0 0 2 2
## 34 2 1 1 0 2
## 35 4 3 0 0 0
## 44 0 2 0 0 0
## 75 1 1 6 4 3
## 76 2 1 8 2 2
#bigram frequency tables:
cBC <- colSums(as.matrix(dtmBC)) #get and store bigram counts
dfcBC <- data.frame(bigram = names(cBC),freq=cBC) #put counts into table
dfcBC <- dplyr::arrange(dfcBC, desc(freq)) #sort counts in descending order
dfcBC[1:30,] #display table of top 30 bigrams
## bigram freq
## 1 resourc_center 102
## 2 gender_ident 73
## 3 usc_edu 60
## 4 gay_lesbian 51
## 5 graduat_student 40
## 6 rainbow_floor 40
## 7 bisexu_transgend 38
## 8 lesbian_gay 36
## 9 alli_student 35
## 10 faculti_staff 35
## 11 student_affair 33
## 12 gay_bisexu 30
## 13 sexual_orient 30
## 14 pm_pm 27
## 15 sexual_gender 27
## 16 student_organ 27
## 17 angel_ca 25
## 18 student_assembl 25
## 19 one_s 23
## 20 queer_alli 22
## 21 divis_student 21
## 22 alumni_associ 20
## 23 usc_s 20
## 24 lambda_alumni 19
## 25 mentor_program 19
## 26 peer_mentor 19
## 27 program_servic 19
## 28 male_femal 18
## 29 first_year 16
## 30 usc_student 16
# Visuals:
dfcBC[1:30,] %>%
mutate(bigram = fct_reorder(bigram, freq)) %>%
ggplot(aes(x=bigram, y=freq)) +
geom_bar(stat="identity", fill = "magenta3") +
coord_flip() +
ggtitle("Frequency Distribution of Top 30 Bigrams") +
labs(y="Frequency", x = "Bigram")
Based on these top 30 bigrams, we can already see clearer topic associations appearing, affirming the suspicion from earlier that bigrams would be more appropriate for determining page similarity.
The extremely high frequency of the bigram “resource center” suggests that it may not be helpful in determining page similarity. Therefore, two dendrograms will be created, one including “resource center,” and one excluding it.
#Compute and store cosine similarity distance matrix on webpages
disBC <- textstat_simil(dfmBC, method = "cosine", margin = "documents")
#Create Dendrogram
denBC <- hclust(disBC, method = "complete")
#Label Dendrogram
labels(denBC) <- as.character(qe$Title[ #convert to character; order Titles by
order.dendrogram(as.dendrogram(denBC))]) #order of appearance in dendrogram
#Plot Dendrogram
plot(as.dendrogram(denBC), main = "Complete Linkage Clustering: Analysis 2a,
Bigram, with Resource Center", horiz = TRUE)
From a cursory visual examination, we can already tell that this dendrogram is much neater, and has fewer splits than the previous one based only on body text. Splitting groups up based on the goal of three-to-four levels of hierarchy yields 6 groups:
#color dendrogram by groups based on line of division
group = cutree(as.dendrogram(denBC), h = 0.35)
colDenBC <- dendrapply(as.dendrogram(denBC), colLab)
plot(colDenBC, main = "Complete Linkage Clustering: Analysis 2a,
Bigram, with Resource Center", horiz = TRUE)
abline(v = 0.35)
Although these groups are much more logical than before, they still aren’t perfect. Is it because of “resource center’s” frequency interfering?
ga2rc <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)
#GROUP 1
xbc <- bc[(dplyr::filter(ga2rc, Group == 1))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg1 <- dftermsXBC #store results
head(dfta2rcg1, 10) #display table
## word freq
## 1 resourc_center 19
## 2 usc_edu 12
## 3 lesbian_gay 11
## 4 gender_ident 10
## 5 sexual_orient 10
## 6 bisexu_transgend 9
## 7 gay_bisexu 9
## 8 rainbow_floor 8
## 9 sexual_gender 8
## 10 usc_lambda 8
#GROUP 2
xbc <- bc[(dplyr::filter(ga2rc, Group == 2))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg2 <- dftermsXBC #store results
head(dfta2rcg2, 10) #display table
## word freq
## 1 resourc_center 16
## 2 name_chang 9
## 3 gender_ident 8
## 4 faculti_staff 7
## 5 intern_student 7
## 6 make_gift 6
## 7 program_servic 6
## 8 registrar_s 6
## 9 titl_ix 6
## 10 usc_edu 6
#GROUP 3
xbc <- bc[(dplyr::filter(ga2rc, Group == 3))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg3 <- dftermsXBC #store results
head(dfta2rcg3, 10) #display table
## word freq
## 1 gender_ident 37
## 2 resourc_center 31
## 3 gay_lesbian 26
## 4 pm_pm 21
## 5 one_s 17
## 6 male_femal 16
## 7 rainbow_floor 16
## 8 graduat_student 13
## 9 student_affair 13
## 10 bisexu_transgend 12
#GROUP 4
xbc <- bc[(dplyr::filter(ga2rc, Group == 4))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg4 <- dftermsXBC #store results
head(dfta2rcg4, 10) #display table
## word freq
## 1 resourc_center 16
## 2 receiv_usc 11
## 3 divis_student 10
## 4 student_affair 10
## 5 affair_honor 9
## 6 graduat_student 9
## 7 alli_student 8
## 8 campus_activ 8
## 9 student_assembl 8
## 10 usc_campus 8
#GROUP 5
xbc <- bc[(dplyr::filter(ga2rc, Group == 5))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg5 <- dftermsXBC #store results
head(dfta2rcg5, 10) #display table
## word freq
## 1 usc_edu 27
## 2 angel_ca 24
## 3 resourc_center 19
## 4 contact_trousdal 12
## 5 gender_ident 11
## 6 faculti_staff 10
## 7 bisexu_transgend 9
## 8 gay_lesbian 9
## 9 parkway_stu 9
## 10 rainbow_floor 9
#GROUP 6
xbc <- bc[(dplyr::filter(ga2rc, Group == 6))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg6 <- dftermsXBC #store results
head(dfta2rcg6, 10) #display table
## word freq
## 1 safe_zone 3
## 2 transgend_individu 3
## 3 zone_program 3
## 4 alli_tran 2
## 5 come_process 2
## 6 faculti_staff 2
## 7 gender_ident 2
## 8 individu_gender 2
## 9 need_alli 2
## 10 term_transgend 2
“Resource center” shows up within the top 3, suggesting that it should be removed to allow for more differentiation.
dfmBCx <- dfm_remove(dfmBC, "resourc_center") #remove resource center bigram
dtmBCx <- as.DocumentTermMatrix(dfmBCx) #revert to dtm
inspect(dtmBCx) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1754)>>
## Non-/sparse entries: 4958/186228
## Sparsity : 97%
## Maximal term length: 22
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs alli_student bisexu_transgend faculti_staff gay_lesbian gender_ident
## 107 3 4 3 22 0
## 108 0 1 3 0 2
## 11 4 0 1 0 0
## 18 1 1 0 0 2
## 28 3 0 4 1 0
## 34 1 1 1 0 0
## 35 0 4 2 2 0
## 44 0 3 0 2 27
## 75 0 1 0 0 2
## 76 0 1 0 0 2
## Terms
## Docs graduat_student lesbian_gay rainbow_floor student_affair usc_edu
## 107 7 1 3 12 0
## 108 1 1 0 3 17
## 11 3 0 1 9 0
## 18 0 1 4 2 0
## 28 2 0 0 1 2
## 34 2 1 1 0 2
## 35 4 3 0 0 0
## 44 0 2 0 0 0
## 75 1 1 6 0 3
## 76 2 1 8 0 2
#bigram frequency tables:
cBCx <- colSums(as.matrix(dtmBCx)) #get and store bigram counts
dfcBCx <- data.frame(bigram = names(cBCx),freq=cBCx) #put counts into table
dfcBCx <- dplyr::arrange(dfcBCx, desc(freq)) #sort counts in descending order
dfcBCx[1:30,] #display table of top 30 bigrams
## bigram freq
## 1 gender_ident 73
## 2 usc_edu 60
## 3 gay_lesbian 51
## 4 graduat_student 40
## 5 rainbow_floor 40
## 6 bisexu_transgend 38
## 7 lesbian_gay 36
## 8 alli_student 35
## 9 faculti_staff 35
## 10 student_affair 33
## 11 gay_bisexu 30
## 12 sexual_orient 30
## 13 pm_pm 27
## 14 sexual_gender 27
## 15 student_organ 27
## 16 angel_ca 25
## 17 student_assembl 25
## 18 one_s 23
## 19 queer_alli 22
## 20 divis_student 21
## 21 alumni_associ 20
## 22 usc_s 20
## 23 lambda_alumni 19
## 24 mentor_program 19
## 25 peer_mentor 19
## 26 program_servic 19
## 27 male_femal 18
## 28 first_year 16
## 29 usc_student 16
## 30 gender_neutral 15
# Visuals:
dfcBCx[1:30,] %>%
mutate(bigram = fct_reorder(bigram, freq)) %>%
ggplot(aes(x=bigram, y=freq)) +
geom_bar(stat="identity", fill = "magenta3") +
coord_flip() +
ggtitle("Frequency Distribution of Top 30 Bigrams") +
labs(y="Frequency", x = "Bigram")
The exponentially high frequencies of “gender identity,” “usc edu,” and “gay lesbian” may mean that dendrograms should be redrawn with these bigrams removed as well.
#Compute and store cosine similarity distance matrix on webpages
disBCx <- textstat_simil(dfmBCx, method = "cosine", margin = "documents")
#Create Dendrogram
denBCx <- hclust(disBCx, method = "complete")
#Label Dendrogram
labels(denBCx) <- as.character(qe$Title[ #convert to character; order Titles by
order.dendrogram(as.dendrogram(denBCx))]) #order of appearance in dendrogram
#Plot Dendrogram
plot(as.dendrogram(denBCx), main = "Complete Linkage Clustering: Analysis 2b,
Bigram, Excluding 'Resource Center'", horiz = TRUE)
The webpages, again, have been clustered into topically more related groups. Splitting groups up based on the goal of three-to-four levels of hierarchy yields 5 groups:
#color dendrogram by groups based on line of division
group = cutree(as.dendrogram(denBCx), h = 0.385)
colDenBCx <- dendrapply(as.dendrogram(denBCx), colLab)
plot(colDenBCx, main = "Complete Linkage Clustering: Analysis 2a,
Bigram, Excluding 'Resource Center'", horiz = TRUE)
abline(v = 0.385)
What bigrams define these 5 groups?
ga2x <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)
#remove "resource center" from corpus
bcx <- tm_map(bc, removeWords, "resourc_center")
#GROUP 1
xbc <- bcx[(dplyr::filter(ga2x, Group == 1))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg1 <- dftermsXBC #store results
head(dfta2xg1, 10) #display table
## word freq
## 1 bisexu_transgend 9
## 2 gender_ident 9
## 3 lesbian_gay 9
## 4 usc_edu 9
## 5 gay_bisexu 8
## 6 gay_lesbian 8
## 7 rainbow_floor 8
## 8 alli_communiti 7
## 9 program_servic 7
## 10 sexual_gender 7
#GROUP 2
xbc <- bcx[(dplyr::filter(ga2x, Group == 2))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg2 <- dftermsXBC #store results
head(dfta2xg2, 10) #display table
## word freq
## 1 gender_ident 41
## 2 gay_lesbian 30
## 3 male_femal 16
## 4 one_s 16
## 5 rainbow_floor 15
## 6 bisexu_transgend 14
## 7 sexual_orient 14
## 8 graduat_student 13
## 9 student_affair 13
## 10 faculti_staff 12
#GROUP 3
xbc <- bcx[(dplyr::filter(ga2x, Group == 3))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg3 <- dftermsXBC #store results
head(dfta2xg3, 10) #display table
## word freq
## 1 usc_edu 26
## 2 angel_ca 23
## 3 pm_pm 20
## 4 contact_trousdal 12
## 5 student_affair 12
## 6 alli_student 11
## 7 receiv_usc 11
## 8 student_assembl 11
## 9 campus_activ 10
## 10 divis_student 10
#GROUP 4
xbc <- bcx[(dplyr::filter(ga2x, Group == 4))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg4 <- dftermsXBC #store results
head(dfta2xg4, 10) #display table
## word freq
## 1 usc_edu 9
## 2 gender_ident 8
## 3 graduat_student 7
## 4 parent_famili 7
## 5 gay_lesbian 6
## 6 gender_neutral 6
## 7 lesbian_gay 6
## 8 neutral_hous 6
## 9 parent_circl 6
## 10 rainbow_floor 6
#GROUP 5
xbc <- bcx[(dplyr::filter(ga2x, Group == 5))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg5 <- dftermsXBC #store results
head(dfta2xg5, 10) #display table
## word freq
## 1 undergradu_graduat 6
## 2 commenc_celebr 5
## 3 faculti_staff 5
## 4 gender_ident 5
## 5 lavend_commenc 5
## 6 lcc_lcc 5
## 7 usc_edu 5
## 8 alli_student 4
## 9 alumni_associ 4
## 10 bias_incid 4
Group 1 seems to contain general information about being queer, while Group 2 also does, but with more of a focus on students, faculty, and staff. Group 3, on the contrary, is heavily focused on USC without relation to queer identity. Group 4 seems to focus on parents and gender neutral housing, while Group 5 is centered on graduation.
Because none of the remaining high-frequency terms (“gender identity,” “usc edu,” and “gay lesbian”) consistently occur within the top 3 across groups, this may be a good stopping point for bigram analysis. But just in case, what happens when we remove all the high-frequency bigrams?
Again, the high-frequency bigrams are: “resource center,” “gender identity,” “usc edu,” and “gay lesbian.”
dfmBCxz <- dfm_remove(dfmBCx, c("gender_ident", "usc_edu", "gay_lesbian")) #remove resource center bigram
dtmBCxz <- as.DocumentTermMatrix(dfmBCxz) #revert to dtm
inspect(dtmBCxz) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1751)>>
## Non-/sparse entries: 4878/185981
## Sparsity : 97%
## Maximal term length: 22
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs alli_student bisexu_transgend faculti_staff gay_bisexu
## 107 3 4 3 1
## 108 0 1 3 1
## 11 4 0 1 0
## 18 1 1 0 1
## 28 3 0 4 0
## 34 1 1 1 1
## 35 0 4 2 3
## 44 0 3 0 1
## 75 0 1 0 1
## 76 0 1 0 1
## Terms
## Docs graduat_student lesbian_gay pm_pm rainbow_floor sexual_orient
## 107 7 1 0 3 2
## 108 1 1 0 0 2
## 11 3 0 0 1 0
## 18 0 1 0 4 0
## 28 2 0 0 0 0
## 34 2 1 0 1 0
## 35 4 3 0 0 1
## 44 0 2 0 0 6
## 75 1 1 0 6 0
## 76 2 1 0 8 0
## Terms
## Docs student_affair
## 107 12
## 108 3
## 11 9
## 18 2
## 28 1
## 34 0
## 35 0
## 44 0
## 75 0
## 76 0
#bigram frequency tables:
cBCxz <- colSums(as.matrix(dtmBCxz)) #get and store bigram counts
dfcBCxz <- data.frame(bigram = names(cBCxz),freq=cBCxz) #put counts into table
dfcBCxz <- dplyr::arrange(dfcBCxz, desc(freq)) #sort counts in descending order
dfcBCxz[1:30,] #display table of top 30 bigrams
## bigram freq
## 1 graduat_student 40
## 2 rainbow_floor 40
## 3 bisexu_transgend 38
## 4 lesbian_gay 36
## 5 alli_student 35
## 6 faculti_staff 35
## 7 student_affair 33
## 8 gay_bisexu 30
## 9 sexual_orient 30
## 10 pm_pm 27
## 11 sexual_gender 27
## 12 student_organ 27
## 13 angel_ca 25
## 14 student_assembl 25
## 15 one_s 23
## 16 queer_alli 22
## 17 divis_student 21
## 18 alumni_associ 20
## 19 usc_s 20
## 20 lambda_alumni 19
## 21 mentor_program 19
## 22 peer_mentor 19
## 23 program_servic 19
## 24 male_femal 18
## 25 first_year 16
## 26 usc_student 16
## 27 gender_neutral 15
## 28 ident_express 15
## 29 lesbian_bisexu 15
## 30 undergradu_graduat 15
# Visuals:
dfcBCxz[1:30,] %>%
mutate(bigram = fct_reorder(bigram, freq)) %>%
ggplot(aes(x=bigram, y=freq)) +
geom_bar(stat="identity", fill = "mediumpurple") +
coord_flip() +
ggtitle("Frequency Distribution of Top 30 Bigrams") +
labs(y="Frequency", x = "Bigram")
These top 30 bigrams are more evenly distributed; hypothetically the differentiation in how they are distributed among pages, should provide the clearest topical grouping.
#Compute and store cosine similarity distance matrix on webpages
disBCxz <- textstat_simil(dfmBCxz, method = "cosine", margin = "documents")
#Create Dendrogram
denBCxz <- hclust(disBCxz, method = "complete")
#Label Dendrogram
labels(denBCxz) <- as.character(qe$Title[ #convert to character; order Titles by
order.dendrogram(as.dendrogram(denBCxz))]) #order of appearance in dendrogram
#Plot Dendrogram
plot(as.dendrogram(denBCxz), main = "Complete Linkage Clustering: Analysis 2c,
Bigram, Excluding All High-Frequency Bigrams", horiz = TRUE)
Against expectations, the webpages have been clustered into less coherent categories. Splitting groups up based on the goal of three-to-four levels of hierarchy yields 4 groups, which suggests that too many terms have been removed, making categories too broad:
#color dendrogram by groups based on line of division
group = cutree(as.dendrogram(denBCxz), h = 0.4)
colDenBCxz <- dendrapply(as.dendrogram(denBCxz), colLab)
plot(colDenBCxz, main = "Complete Linkage Clustering: Analysis 2a,
Bigram, Excluding Resource Center", horiz = TRUE)
abline(v = 0.4)
What bigrams define these 4 groups, if any?
ga2xz <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)
#remove "resource center" from corpus
bcxz <- tm_map(bcx, removeWords, c("gender_ident", "usc_edu", "gay_lesbian"))
#GROUP 1
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 1))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg1 <- dftermsXBC #store results
head(dfta2xzg1, 10) #display table
## word freq
## 1 pm_pm 21
## 2 alli_student 10
## 3 make_gift 8
## 4 gender_neutral 7
## 5 graduat_student 7
## 6 program_servic 7
## 7 faculti_staff 6
## 8 neutral_hous 6
## 9 peer_mentor 6
## 10 rainbow_floor 6
#GROUP 2
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 2))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg2 <- dftermsXBC #store results
head(dfta2xzg2, 10) #display table
## word freq
## 1 angel_ca 23
## 2 one_s 17
## 3 male_femal 16
## 4 rainbow_floor 16
## 5 sexual_orient 16
## 6 lesbian_gay 13
## 7 sexual_gender 13
## 8 bisexu_transgend 12
## 9 contact_trousdal 12
## 10 chest_surgeri 10
#GROUP 3
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 3))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg3 <- dftermsXBC #store results
head(dfta2xzg3, 10) #display table
## word freq
## 1 graduat_student 20
## 2 student_affair 15
## 3 faculti_staff 14
## 4 bisexu_transgend 13
## 5 student_organ 12
## 6 alumni_associ 11
## 7 rainbow_floor 11
## 8 lambda_alumni 10
## 9 lesbian_gay 10
## 10 alli_student 9
#GROUP 4
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 4))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg4 <- dftermsXBC #store results
head(dfta2xzg4, 10) #display table
## word freq
## 1 receiv_usc 11
## 2 student_assembl 10
## 3 affair_honor 9
## 4 alli_student 9
## 5 divis_student 9
## 6 lesbian_gay 9
## 7 name_chang 9
## 8 student_affair 9
## 9 bisexu_transgend 8
## 10 campus_activ 8
Based on these frequency tables, there is no coherent subject defining each quality.
The most coherent groups thus far, are the ones suggested by Analysis 2b:
These groups can serve as potential menu headers on the website.
In addition, we have determined that bigrams are more useful than body text in analyzing page similarity. After striking text analyses, the following analysis levels remain (renumbered to reflect changes):
Analysis Number | Aspects Involved |
---|---|
3 | Bigram, Links |
4 | Bigram, LGBTRC, USC, Outside |
5 | Bigram, LGBTRC, USC, Outside, Email, Videos, Images |
6 | Bigram, LGBTRC, USC, Outside, Email, Videos, Images |
After consulting with Advocacy Directors, it was determined that Analysis 2b would be sufficient for a starting point in reorganizing the website’s menus, and that no further analyses would be done.