Abstract

The University of Southern California’s LGBT Resource Center (LGBTRC) website was text mined at a basic level to identify groups of similar webpages. These similarities were meant to serve as a starting point for reorganizing the website menus, so that information would be more intuitive to access. After removing high-frequency words and word pairs (bigrams) that did not help differentiate among webpages, five major groups were found. Their subjects were: queer resources for a universal audience, queer resources specific to USC, USC resources, navigating gender, and navigating graduation.

Methods Used

The two major forms of analysis used were term frequency tables and hierarchical clustering.

Term frequency tables recorded the counts of each word, or word pair, appearing within individual webpages. The more similar the counts, the more similar the webpages. In addition, the types of terms that appeared within the tables suggested the general topics of a webpage or webpage group.

Hierarchical clustering organized webpages by branching levels of similarity, separating pages into groups based on how different they were. The branching structure provided a simple outline for possible menus.
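As a quick illustration of both methods, below is a toy example with made-up counts for three hypothetical pages (base R only; the actual analysis later in this report relies on the tm and quanteda packages):

#toy term frequency table: three hypothetical pages and three terms (made-up counts)
toy <- matrix(c(5, 1, 4,   #pageA
                0, 6, 1,   #pageB
                2, 0, 3),  #pageC
              nrow = 3, byrow = TRUE,
              dimnames = list(c("pageA", "pageB", "pageC"),
                              c("gender", "graduation", "resources")))
toy #pageA and pageC have similar counts; pageB does not

#hierarchical clustering of the pages, using complete linkage on pairwise distances;
#pageA and pageC join first because their term counts are closest
plot(hclust(dist(toy), method = "complete"))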

Data Collection

The following aspects of each page were recorded:

Aspect           Definition
ID               Each page was assigned an ID number based on alphabetical order.
Page             Absolute link (full URL) to the webpage.
Title            Title of the webpage.
Text             Entire text of the page, without any paragraph indents.
Section          Location in the dropdown menu on the LGBTRC website before reorganization.
Links            Total number of links. Includes links to the LGBTRC website, USC sites, non-USC sites, email, videos, and image links.
LGBTRC           Total number of links to any page hosted under lgbtrc.usc.edu, including files. Excludes emails.
USC              Total number of links to pages hosted under usc.edu or survey forms, excluding lgbtrc.usc.edu pages and email.
Outside          Total number of links to any page hosted outside of lgbtrc.usc.edu and usc.edu, including files. Excludes emails.
Email            Total number of links to email addresses.
Videos           Total number of links to videos.
Linked Images    Total number of images containing links to either media files or websites.
Unlinked Images  Total number of images without links.
Images           Total number of images.
Parent           Name of the parent page (a parent page is any page with additional pages, i.e. children, nested under it on a website).
Issues           Any immediately observable problems the page had. Recorded with the aim of returning to fix page-specific issues after the website is reorganized.

Preparing for Analysis

First, export the data from Google Sheets into CSV format, then import the dataset into R.

#load packages used throughout this analysis
library(tm)         #text cleaning and document term matrices
library(quanteda)   #dfm trimming, collocations, and cosine similarity
library(dplyr)      #data manipulation and the pipe operator
library(ggplot2)    #frequency plots
library(forcats)    #reordering factors for plots
library(dendextend) #allows cutree() to be applied to dendrogram objects

q <- read.csv("C:\\Users\\greya\\Downloads\\v1.csv")

#Google Drive corrupted the file halfway through this project. This was the code
#used to fix the issue; may not be needed for other computers.

q <- q[-c(110:133),]
q$Page <- factor(q$Page)
q$Title <- factor(q$Title)
q$Text <- factor(q$Text)
q$Parent <- factor(q$Parent)
q$Issues <- factor(q$Issues)

This is what the initial data looks like:

q[1, ]
##   ID                             Page      Title
## 1  1 https://lgbtrc.usc.edu/events-2/ (no title)
##                                                                                                                                                                                                                         Text
## 1 First uRAP Meeting | August 31st | 7:30 <U+0096> 8:30pm Resource Center Open House | September 3rd | 5:00 <U+0096> 7:00pm First FAB Meeting | September 3rd | 7:00pm First Rainbow International Lunch | September 25th | 12:00 <U+0096> 1:30pm
##   Section Links LGBTRC USC Outside Email Videos Linked.Images
## 1    None     4      4   0       0     0      0             0
##   Unlinked.Images Total.Images Parent
## 1               1            1   None
##                                                                                                                           Issues
## 1 invisible, 4 redundant links to same location of event page, which has no information. missing calendar, accessibility image s

Exclude any aspects containing information that will not help with grouping. These aspects include:

Aspect    Reason for Exclusion
Page      ID serves as a shorter, easier-to-read identifier.
Section   Old category: because the goal is to create new ways of categorizing pages, old categories should not be used to evaluate similarity among pages.
Parent    Old category: see the aspect “Section” for explanation.
Issues    Grouping pages by how difficult they are to access does not help with accessibility.

This leaves us with:

qe <- q[,-c(2, 5, 15:16)] #dataset with excluded variables

Figuring out Levels of Analysis

Given the remaining aspects, there are several analysis levels that can be used to compare pages. Here they are listed from the fewest to the most variables:

Analysis Number  Aspects Involved
1                Text
2                Bigram
3                Text, Links
4                Bigram, Links
5                Text, LGBTRC, USC, Outside
6                Bigram, LGBTRC, USC, Outside
7                Text, LGBTRC, USC, Outside, Email, Videos, Images
8                Bigram, LGBTRC, USC, Outside, Email, Videos, Images
9                Text, LGBTRC, USC, Outside, Email, Videos, Linked Images, Unlinked Images
10               Bigram, LGBTRC, USC, Outside, Email, Videos, Linked Images, Unlinked Images

Body text and bigrams are alternative units of analysis: if body text alone proves insufficient, bigrams will become the unit of focus for the remaining levels, and vice versa.

Summary of Analyses Performed

Analysis 1 examined the similarity of pages by body text alone. The groups formed by this analysis were numerous and topically incoherent, suggesting that bigrams were a more appropriate unit of focus.

Analysis 2a examined the similarity of pages by bigram. The groups formed here were fewer and more topically coherent, but possibly confounded by high-frequency bigrams whose counts were far higher than those of the other bigrams.

Analysis 2b was the same as 2a, but with the most frequent bigram removed. This analysis provided the most topically coherent groups.

Analysis 2c was the same as 2a, but with all high-frequency bigrams removed. This analysis provided the fewest groups. However, these groups were topically incoherent and not useful for organizing menus.

Analysis 1: Similarity by Text

Using the dataset with excluded variables (qe), we will prepare a cleaned version of the text for Analysis 1, which asks whether pages sharing similar body text should be grouped together. Cleaning will be done with the tm package; for more information on how tm works, see the package’s vignette on CRAN by Ingo Feinerer:

#replace any character that is not a letter, number, or space with a space
qe$Text <- gsub("[^[:alnum:][:space:]]"," ", qe$Text) 

#convert to tm corpus
tc <- Corpus(VectorSource(qe$Text)) 

#The converted column is stored in a new variable, tc, to expedite typing for cleaning:

  tc <- tm_map(tc, content_transformer(tolower)) #makes all text lowercase
  tc <- tm_map(tc, removeNumbers) #removes numbers
  tc <- tm_map(tc, removeWords, stopwords("english")) #removes stopwords, e.g. and
  tc <- tm_map(tc, removePunctuation) #removes punctuation
  tc <- tm_map(tc, stripWhitespace) #removes extra spaces and paragraph indents
  tc <- tm_map(tc, stemDocument) #reduces words to root form, e.g. cars to car

inspect(tc[1]) #check results to see if additional cleaning needed
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 1
## 
## [1] first urap meet august st pm resourc center open hous septemb rd pm first fab meet septemb rd pm first rainbow intern lunch septemb th pm
               #(abbreviated example of only one result shown here)

#remove additional words that do not help differentiate pages:  
tc <- tm_map(tc, removeWords, c("lgbt", "also", "can", "may", "los"))

Common Terms on the Website

Now that the text has been cleaned, we can create a document term matrix (dtm) to track the frequency of words on each webpage:

dtmTC <- DocumentTermMatrix(tc)
inspect(dtmTC) 
## <<DocumentTermMatrix (documents: 109, terms: 2700)>>
## Non-/sparse entries: 9484/284816
## Sparsity           : 97%
## Maximal term length: 18
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  alli center communiti gay gender program resourc student transgend
##   107    7     26         7  44      5      14      13      84         5
##   108    0     16         8   1      4      11       7      39         1
##   11     6     11         3   2      1      11       7      29         1
##   13     0      4         3   0      2       1       2      22         0
##   28    10      4         7   2      0       5       4      12         1
##   42     2      4         2   3      0       0       3      10         3
##   44     4      1         9   9     87       0       0       0        11
##   55     0      5         0   1      9       0       4      17         1
##   61     0      5         0   3      3       2       9       7         2
##   92     0      6         0   0      5       3       1       9        10
##      Terms
## Docs  usc
##   107  32
##   108  23
##   11   18
##   13    8
##   28    9
##   42   12
##   44    0
##   55    7
##   61    4
##   92    8

As shown in the subset above, each webpage is listed by its ID number as a “doc.” Terms are listed as columns, and the counts of every term are listed for all the docs/webpages.

From the dtm, you can tell that most of the terms are sparse (i.e., they appear in only one or a few documents). For the sake of determining page similarity, all terms that do not appear in at least two different pages will be removed:

#To accomplish this, the function "dfm_trim" from quanteda, which requires a conversion to the dfm datatype, will be used:

dfmTC <- as.dfm(dtmTC) #convert to dfm
dfmTC <- dfm_trim(dfmTC, min_count = 1, min_docfreq = 2) #trim terms
dtmTC <- as.DocumentTermMatrix(dfmTC) #revert to dtm
inspect(dtmTC) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1262)>>
## Non-/sparse entries: 8046/129512
## Sparsity           : 94%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  alli center communiti gay gender program resourc student transgend
##   107    7     26         7  44      5      14      13      84         5
##   108    0     16         8   1      4      11       7      39         1
##   11     6     11         3   2      1      11       7      29         1
##   13     0      4         3   0      2       1       2      22         0
##   28    10      4         7   2      0       5       4      12         1
##   42     2      4         2   3      0       0       3      10         3
##   44     4      1         9   9     87       0       0       0        11
##   55     0      5         0   1      9       0       4      17         1
##   61     0      5         0   3      3       2       9       7         2
##   92     0      6         0   0      5       3       1       9        10
##      Terms
## Docs  usc
##   107  32
##   108  23
##   11   18
##   13    8
##   28    9
##   42   12
##   44    0
##   55    7
##   61    4
##   92    8

Below are the remaining terms, organized by frequency across the entire website:

termsTC <- colSums(as.matrix(dtmTC)) #get and store term counts
dftermsTC <- data.frame(word = names(termsTC),freq=termsTC) #put counts into table
dftermsTC <- dplyr::arrange(dftermsTC, desc(freq)) #sort counts in descending order
dftermsTC[1:30,] #display table of top 30 words
##         word freq
## 1    student  508
## 2        usc  292
## 3     gender  204
## 4     center  197
## 5  communiti  178
## 6    resourc  167
## 7    program  150
## 8       alli  144
## 9        gay  134
## 10 transgend  125
## 11   graduat  116
## 12     ident  116
## 13   support  109
## 14     queer  108
## 15   lesbian  107
## 16    campus  103
## 17    person   95
## 18    sexual   89
## 19     group   85
## 20   univers   80
## 21     peopl   80
## 22       use   77
## 23       one   76
## 24      educ   73
## 25    servic   72
## 26     organ   71
## 27     angel   70
## 28    health   70
## 29    inform   68
## 30     event   68
# Visuals:
dftermsTC[1:30,] %>%
  mutate(word = fct_reorder(word, freq)) %>%
  ggplot(aes(x=word, y=freq)) +
    geom_bar(stat="identity", fill = "magenta4") +
    coord_flip() +
    ggtitle("Frequency Distribution of Top 30 Words") +
    labs(y="Frequency", x = "Word")

The most frequent word is “student,” suggesting that the resources of the site are primarily geared towards students rather than faculty and staff. Because the term “student” is so frequent, it may be unhelpful to include it as a term for determining similarity if almost all of the pages contain it. Studying the n-grams (clusters of words that appear together) may give some more insight.

Analyzing N-Grams: Should “Student” be Left in the Data?

Two n-gram sizes will be examined: bigrams (pairs of words likely to appear side by side) and trigrams (sequences of three words likely to appear side by side).

#BIGRAMS
tcq <- corpus(tc$content, docvars = tc$dmeta) #convert to quanteda corpus
bicolloTC <- textstat_collocations(tcq, size = 2) #calculate bigrams
bicolloTC <- dplyr::arrange(bicolloTC, desc(count)) #sort by descending count

#display frequent and significant bigrams containing "student"
  #frequent is defined as having a double-digit count
  #significant is defined by the Wald test z-statistic >= 1
  bicolloTC[grep("student", bicolloTC$collocation), ] %>%
    filter(count >= 10) %>%
    filter(z >= 1)
##           collocation count count_nested length    lambda         z
## 1     graduat student    40            0      2 3.0022993 15.020162
## 2        alli student    35            0      2 2.4994263 12.580821
## 3      student affair    33            0      2 7.8345679  5.496063
## 4       student organ    27            0      2 3.1335443 12.690186
## 5     student assembl    25            0      2 4.4088724 12.308725
## 6       divis student    21            0      2 7.3639908  5.144968
## 7         usc student    16            0      2 0.7655627  2.972386
## 8      student health    13            0      2 2.1363037  6.984828
## 9  undergradu student    11            0      2 3.1060552  8.211545
## 10     colleg student    11            0      2 2.5418022  7.343925
## 11     student servic    11            0      2 1.9044482  5.869243
## 12     intern student    10            0      2 3.0687012  7.808554
## 13      student staff    10            0      2 2.0292016  5.919633
## 14  transgend student    10            0      2 1.1761338  3.613176

These bigrams show that “student” is used as a qualifier in diverse contexts, and should be left in the data. What about the trigrams?

#TRIGRAMS
tricolloTC <- textstat_collocations(tcq, size = 3) #calculate trigrams
tricolloTC <- dplyr::arrange(tricolloTC, desc(count)) #sort by descending count

#display frequent and significant trigrams containing "student"
  #frequent is defined as having a double-digit count
  #significant is defined by the Wald test z-statistic >= 1
  tricolloTC[grep("student", tricolloTC$collocation), ] %>%
    filter(count >= 10) %>%
    filter(z >= 1)
##            collocation count count_nested length   lambda        z
## 1   queer alli student    17            0      3 4.411056 2.898238
## 2 alli student assembl    17            0      3 1.997375 1.309327

At the trigram level, including the term “student” is redundant; both listed trigrams reference QuASA, the Queer and Ally Student Assembly. This is less evidence to suggest eliminating “student” as a stopword, and more evidence that bigrams are the appropriate size of word grouping to analyze.

Conclusion: “student” should be kept in the analysis, despite its frequency.

Forming Groups

Dendrograms are well suited to showing the similarities among objects, especially webpages, which are nested hierarchically under each other. Because dendrograms are hierarchical, there are several linkage methods you can choose from to organize the graph. The method used here will be complete linkage (separating pages by the maximum difference between group members).

To build a dendrogram, we need a distance matrix, which quantifies how similar each pair of pages is. In this case, the metric used will be cosine similarity; Kan Nishida gives a good explanation of what cosine similarity is, and why it’s ideal for text similarity analysis.
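In short, cosine similarity treats each page’s term counts as a vector and compares the directions of those vectors, so two pages with proportionally similar vocabularies score near 1 even if one page is much longer than the other. A minimal sketch with two made-up count vectors (not taken from the website data):

#minimal sketch of cosine similarity, using two made-up term-count vectors
pageA <- c(student = 10, gender = 2, graduat = 0)
pageB <- c(student = 5,  gender = 1, graduat = 0) #half the counts, same proportions
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(pageA, pageB) #equals 1: same direction, so maximally similar despite different page lengths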

#Compute and store cosine similarity distance matrix on webpages
disTC <- textstat_simil(dfmTC, method = "cosine", margin = "documents")

#Create Dendrogram
denTC <- hclust(disTC, method = "complete")
  #Label Dendrogram
  labels(denTC) <- as.character(qe$Title[ #convert to character; order Titles by 
      order.dendrogram(as.dendrogram(denTC))]) #order of appearance in dendrogram 
  #Plot Dendrogram
  plot(as.dendrogram(denTC), main = "Complete Linkage Clustering: Analysis 1, 
       Text", horiz = TRUE)

This dendrogram has several places where it could be cut into menu headers. Stephanie Lin of UX Booth suggests that, ideally, there should be three to four levels of hierarchy in a website’s navigation. Using that guideline, the dendrogram might be cut here, creating roughly 11 groups:

plot(as.dendrogram(denTC), main = "Complete Linkage Clustering: Analysis 1, 
       Text", horiz = TRUE)
abline(v=0.45)

#color dendrogram by groups based on line of division
#cut tree at line and divide into groups
group = cutree(as.dendrogram(denTC), h = 0.45)
labelColors = c("firebrick", "darkorange1", "gold", "darkgreen", "blue", "cornflowerblue", "magenta4", "black", "brown", "cyan3", "deeppink", "mediumturquoise") #one color per group; R color names contain no spaces
#use custom function to get color labels
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- labelColors[group[which(names(group) == a$label)]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
#apply custom function to color dendrogram
colDenTC <- dendrapply(as.dendrogram(denTC), colLab) 
plot(colDenTC, main = "Complete Linkage Clustering: Analysis 1, 
       Text", horiz = TRUE)
abline(v = 0.45)

While many pages sharing similar topics, such as offering resources, were placed together, most groups look topically divergent when judged by page titles alone. Here are the top ten term frequency tables for each group.

Top Ten Term Frequency Tables

ga1 <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)

#GROUP 1
xtc <- tc[(dplyr::filter(ga1, Group == 1))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g1 <- dftermsXTC #store results
head(dfta1g1, 10) #display table
##         word freq
## 1    student   53
## 2      chang   36
## 3        usc   30
## 4    program   28
## 5       name   28
## 6     center   24
## 7    resourc   20
## 8  communiti   20
## 9     gender   20
## 10    assess   18
#GROUP 2
xtc <- tc[(dplyr::filter(ga1, Group == 2))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g2 <- dftermsXTC #store results
head(dfta1g2, 10) #display table
##         word freq
## 1    student   26
## 2     gender   14
## 3       alli   13
## 4    teammat   13
## 5        gay   12
## 6  communiti   10
## 7    support   10
## 8    program    9
## 9     athlet    9
## 10     coach    9
#GROUP 3
xtc <- tc[(dplyr::filter(ga1, Group == 3))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g3 <- dftermsXTC #store results
head(dfta1g3, 10) #display table
##         word freq
## 1    student   69
## 2        usc   40
## 3     center   29
## 4    resourc   27
## 5    graduat   27
## 6     famili   26
## 7  communiti   24
## 8    program   23
## 9       alli   22
## 10       lcc   22
#GROUP 4
xtc <- tc[(dplyr::filter(ga1, Group == 4))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g4 <- dftermsXTC #store results
head(dfta1g4, 10) #display table
##       word freq
## 1  student   34
## 2      usc   24
## 3    award   23
## 4   receiv   13
## 5  program   12
## 6   center   11
## 7    honor   11
## 8    queer   10
## 9     year   10
## 10  affair    9
#GROUP 5
xtc <- tc[(dplyr::filter(ga1, Group == 5))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g5 <- dftermsXTC #store results
head(dfta1g5, 10) #display table
##         word freq
## 1    student   41
## 2        usc   32
## 3  communiti   23
## 4     center   22
## 5    resourc   21
## 6     gender   18
## 7       alli   18
## 8  transgend   17
## 9    support   16
## 10      come   16
#GROUP 6
xtc <- tc[(dplyr::filter(ga1, Group == 6))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g6 <- dftermsXTC #store results
head(dfta1g6, 10) #display table
##         word freq
## 1    student   72
## 2        usc   37
## 3    resourc   31
## 4     center   30
## 5  communiti   26
## 6    graduat   24
## 7  transgend   24
## 8    program   23
## 9      queer   22
## 10   support   21
#GROUP 7
xtc <- tc[(dplyr::filter(ga1, Group == 7))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g7 <- dftermsXTC #store results
head(dfta1g7, 10) #display table
##       word freq
## 1   gender   98
## 2  student   76
## 3   person   52
## 4      usc   52
## 5      use   51
## 6     male   50
## 7      one   47
## 8    ident   47
## 9      see   47
## 10   femal   42
#GROUP 8
xtc <- tc[(dplyr::filter(ga1, Group == 8))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g8 <- dftermsXTC #store results
head(dfta1g8, 10) #display table
##         word freq
## 1    student   25
## 2        usc   18
## 3  communiti   14
## 4     gender   14
## 5    resourc   14
## 6      floor   13
## 7  transgend   11
## 8     center   10
## 9      ident   10
## 10    sexual    9
#GROUP 9
xtc <- tc[(dplyr::filter(ga1, Group == 9))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g9 <- dftermsXTC #store results
head(dfta1g9, 10) #display table
##         word freq
## 1    student   11
## 2        usc   11
## 3      queer    8
## 4       alli    7
## 5  communiti    6
## 6      first    4
## 7    graduat    4
## 8    lawrenc    4
## 9    lindsey    4
## 10     organ    4
#GROUP 10
xtc <- tc[(dplyr::filter(ga1, Group == 10))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g10 <- dftermsXTC #store results
head(dfta1g10, 10) #display table
##       word freq
## 1  student   17
## 2     alli    9
## 3   admiss    8
## 4      usc    8
## 5   experi    7
## 6     guid    6
## 7   center    5
## 8   colleg    5
## 9   campus    4
## 10   floor    4
#GROUP 11
xtc <- tc[(dplyr::filter(ga1, Group == 11))$ID] #pull all docs from tc from group
dtmXTC <- DocumentTermMatrix(xtc) #convert to dtm
termsXTC <- colSums(as.matrix(dtmXTC)) #get and store term counts
dftermsXTC <- data.frame(word = names(termsXTC),freq=termsXTC) #put counts into table
dftermsXTC <- dplyr::arrange(dftermsXTC, desc(freq)) #sort counts in descending order
dfta1g11 <- dftermsXTC #store results
head(dfta1g11, 10) #display table
##       word freq
## 1  student   84
## 2      gay   44
## 3      usc   32
## 4   center   26
## 5  lesbian   24
## 6    first   19
## 7  graduat   18
## 8   alumni   14
## 9  program   14
## 10   board   13

Examining the frequency tables, we find that many of the same words dominate the top of the list across groups. This suggests that the webpages were sorted poorly because the unit of analysis was too specific. Rather than comparing pages word by word, we should look at word associations, e.g. bigrams, to focus on differentiation by term context rather than just term frequency.

Analysis 2: Similarity by Bigram

Repeating the process of creating a distance matrix and dendrogram for bigrams requires a new document term matrix:

#create tm corpus from bigrams
bt <- tokens(tcq, what = "word", ngram = 2)
bc <- VCorpus(VectorSource(bt)) #Though we used SimpleCorpus before, SimpleCorpus
                                #finds incorrect docs. VCorpus fixes the issue.

#create bigrams dtm
dtmBC <- DocumentTermMatrix(bc)
inspect(dtmBC) 
## <<DocumentTermMatrix (documents: 109, terms: 12701)>>
## Non-/sparse entries: 15948/1368461
## Sparsity           : 99%
## Maximal term length: 29
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  alli_student bisexu_transgend faculti_staff gay_lesbian gender_ident
##   107            3                4             3          22            0
##   108            0                1             3           0            2
##   11             4                0             1           0            0
##   13             0                0             0           0            0
##   28             3                0             4           1            0
##   42             1                1             0           2            0
##   44             0                3             0           2           27
##   55             0                0             0           1            2
##   61             0                1             0           0            1
##   92             0                0             0           0            3
##      Terms
## Docs  graduat_student lesbian_gay rainbow_floor resourc_center usc_edu
##   107               7           1             3             13       0
##   108               1           1             0              5      17
##   11                3           0             1              7       0
##   13                6           0             0              2       2
##   28                2           0             0              2       2
##   42                0           1             0              3       0
##   44                0           2             0              0       0
##   55                1           0             0              4       0
##   61                0           3             2              5       1
##   92                0           0             0              0       1
#trim sparse bigrams
#sparse = does not appear at least once in at least two docs/webpages
dfmBC <- as.dfm(dtmBC) #convert to dfm
dfmBC <- dfm_trim(dfmBC, min_count = 1, min_docfreq = 2) #trim bigrams
dtmBC <- as.DocumentTermMatrix(dfmBC) #revert to dtm
inspect(dtmBC) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1755)>>
## Non-/sparse entries: 5002/186293
## Sparsity           : 97%
## Maximal term length: 22
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  alli_student bisexu_transgend faculti_staff gay_lesbian gender_ident
##   107            3                4             3          22            0
##   108            0                1             3           0            2
##   11             4                0             1           0            0
##   18             1                1             0           0            2
##   28             3                0             4           1            0
##   34             1                1             1           0            0
##   35             0                4             2           2            0
##   44             0                3             0           2           27
##   75             0                1             0           0            2
##   76             0                1             0           0            2
##      Terms
## Docs  graduat_student lesbian_gay rainbow_floor resourc_center usc_edu
##   107               7           1             3             13       0
##   108               1           1             0              5      17
##   11                3           0             1              7       0
##   18                0           1             4              2       0
##   28                2           0             0              2       2
##   34                2           1             1              0       2
##   35                4           3             0              0       0
##   44                0           2             0              0       0
##   75                1           1             6              4       3
##   76                2           1             8              2       2
#bigram frequency tables:
cBC <- colSums(as.matrix(dtmBC)) #get and store bigram counts
dfcBC <- data.frame(bigram = names(cBC),freq=cBC) #put counts into table
dfcBC <- dplyr::arrange(dfcBC, desc(freq)) #sort counts in descending order
dfcBC[1:30,] #display table of top 30 bigrams
##              bigram freq
## 1    resourc_center  102
## 2      gender_ident   73
## 3           usc_edu   60
## 4       gay_lesbian   51
## 5   graduat_student   40
## 6     rainbow_floor   40
## 7  bisexu_transgend   38
## 8       lesbian_gay   36
## 9      alli_student   35
## 10    faculti_staff   35
## 11   student_affair   33
## 12       gay_bisexu   30
## 13    sexual_orient   30
## 14            pm_pm   27
## 15    sexual_gender   27
## 16    student_organ   27
## 17         angel_ca   25
## 18  student_assembl   25
## 19            one_s   23
## 20       queer_alli   22
## 21    divis_student   21
## 22    alumni_associ   20
## 23            usc_s   20
## 24    lambda_alumni   19
## 25   mentor_program   19
## 26      peer_mentor   19
## 27   program_servic   19
## 28       male_femal   18
## 29       first_year   16
## 30      usc_student   16
# Visuals:
dfcBC[1:30,] %>%
  mutate(bigram = fct_reorder(bigram, freq)) %>%
  ggplot(aes(x=bigram, y=freq)) +
    geom_bar(stat="identity", fill = "magenta3") +
    coord_flip() +
    ggtitle("Frequency Distribution of Top 30 Bigrams") +
    labs(y="Frequency", x = "Bigram")

Based on these top 30 bigrams, we can already see clearer topic associations appearing, affirming the suspicion from earlier that bigrams would be more appropriate for determining page similarity.

The extremely high frequency of the bigram “resource center” suggests that it may not be helpful in determining page similarity. Therefore, two dendrograms will be created, one including “resource center,” and one excluding it.

Analysis 2a: Including “Resource Center”

#Compute and store cosine similarity distance matrix on webpages
disBC <- textstat_simil(dfmBC, method = "cosine", margin = "documents")

#Create Dendrogram
denBC <- hclust(disBC, method = "complete")
  #Label Dendrogram
  labels(denBC) <- as.character(qe$Title[ #convert to character; order Titles by 
      order.dendrogram(as.dendrogram(denBC))]) #order of appearance in dendrogram 
  #Plot Dendrogram
  plot(as.dendrogram(denBC), main = "Complete Linkage Clustering: Analysis 2a, 
       Bigram, with Resource Center", horiz = TRUE)

From a cursory visual examination, we can already tell that this dendrogram is much neater, and has fewer splits than the previous one based only on body text. Splitting groups up based on the goal of three-to-four levels of hierarchy yields 6 groups:

#color dendrogram by groups based on line of division
group = cutree(as.dendrogram(denBC), h = 0.35)
colDenBC <- dendrapply(as.dendrogram(denBC), colLab) 
plot(colDenBC, main = "Complete Linkage Clustering: Analysis 2a, 
       Bigram, with Resource Center", horiz = TRUE)
abline(v = 0.35)

Although these groups are much more logical than before, they still aren’t perfect. Is it because of “resource center’s” frequency interfering?

Top Ten Bigram Frequency Tables

ga2rc <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)

#GROUP 1
xbc <- bc[(dplyr::filter(ga2rc, Group == 1))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg1 <- dftermsXBC #store results
head(dfta2rcg1, 10) #display table
##                word freq
## 1    resourc_center   19
## 2           usc_edu   12
## 3       lesbian_gay   11
## 4      gender_ident   10
## 5     sexual_orient   10
## 6  bisexu_transgend    9
## 7        gay_bisexu    9
## 8     rainbow_floor    8
## 9     sexual_gender    8
## 10       usc_lambda    8
#GROUP 2
xbc <- bc[(dplyr::filter(ga2rc, Group == 2))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg2 <- dftermsXBC #store results
head(dfta2rcg2, 10) #display table
##              word freq
## 1  resourc_center   16
## 2      name_chang    9
## 3    gender_ident    8
## 4   faculti_staff    7
## 5  intern_student    7
## 6       make_gift    6
## 7  program_servic    6
## 8     registrar_s    6
## 9         titl_ix    6
## 10        usc_edu    6
#GROUP 3
xbc <- bc[(dplyr::filter(ga2rc, Group == 3))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg3 <- dftermsXBC #store results
head(dfta2rcg3, 10) #display table
##                word freq
## 1      gender_ident   37
## 2    resourc_center   31
## 3       gay_lesbian   26
## 4             pm_pm   21
## 5             one_s   17
## 6        male_femal   16
## 7     rainbow_floor   16
## 8   graduat_student   13
## 9    student_affair   13
## 10 bisexu_transgend   12
#GROUP 4
xbc <- bc[(dplyr::filter(ga2rc, Group == 4))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg4 <- dftermsXBC #store results
head(dfta2rcg4, 10) #display table
##               word freq
## 1   resourc_center   16
## 2       receiv_usc   11
## 3    divis_student   10
## 4   student_affair   10
## 5     affair_honor    9
## 6  graduat_student    9
## 7     alli_student    8
## 8     campus_activ    8
## 9  student_assembl    8
## 10      usc_campus    8
#GROUP 5
xbc <- bc[(dplyr::filter(ga2rc, Group == 5))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg5 <- dftermsXBC #store results
head(dfta2rcg5, 10) #display table
##                word freq
## 1           usc_edu   27
## 2          angel_ca   24
## 3    resourc_center   19
## 4  contact_trousdal   12
## 5      gender_ident   11
## 6     faculti_staff   10
## 7  bisexu_transgend    9
## 8       gay_lesbian    9
## 9       parkway_stu    9
## 10    rainbow_floor    9
#GROUP 6
xbc <- bc[(dplyr::filter(ga2rc, Group == 6))$ID] #pull all docs from bc from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2rcg6 <- dftermsXBC #store results
head(dfta2rcg6, 10) #display table
##                  word freq
## 1           safe_zone    3
## 2  transgend_individu    3
## 3        zone_program    3
## 4           alli_tran    2
## 5        come_process    2
## 6       faculti_staff    2
## 7        gender_ident    2
## 8     individu_gender    2
## 9           need_alli    2
## 10     term_transgend    2

“Resource center” shows up within the top three bigrams of five of the six groups, suggesting that it should be removed to allow for more differentiation.

Analysis 2b: Excluding “Resource Center”

dfmBCx <- dfm_remove(dfmBC, "resourc_center") #remove resource center bigram
dtmBCx <- as.DocumentTermMatrix(dfmBCx) #revert to dtm
inspect(dtmBCx) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1754)>>
## Non-/sparse entries: 4958/186228
## Sparsity           : 97%
## Maximal term length: 22
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  alli_student bisexu_transgend faculti_staff gay_lesbian gender_ident
##   107            3                4             3          22            0
##   108            0                1             3           0            2
##   11             4                0             1           0            0
##   18             1                1             0           0            2
##   28             3                0             4           1            0
##   34             1                1             1           0            0
##   35             0                4             2           2            0
##   44             0                3             0           2           27
##   75             0                1             0           0            2
##   76             0                1             0           0            2
##      Terms
## Docs  graduat_student lesbian_gay rainbow_floor student_affair usc_edu
##   107               7           1             3             12       0
##   108               1           1             0              3      17
##   11                3           0             1              9       0
##   18                0           1             4              2       0
##   28                2           0             0              1       2
##   34                2           1             1              0       2
##   35                4           3             0              0       0
##   44                0           2             0              0       0
##   75                1           1             6              0       3
##   76                2           1             8              0       2
#bigram frequency tables:
cBCx <- colSums(as.matrix(dtmBCx)) #get and store bigram counts
dfcBCx <- data.frame(bigram = names(cBCx),freq=cBCx) #put counts into table
dfcBCx <- dplyr::arrange(dfcBCx, desc(freq)) #sort counts in descending order
dfcBCx[1:30,] #display table of top 30 bigrams
##              bigram freq
## 1      gender_ident   73
## 2           usc_edu   60
## 3       gay_lesbian   51
## 4   graduat_student   40
## 5     rainbow_floor   40
## 6  bisexu_transgend   38
## 7       lesbian_gay   36
## 8      alli_student   35
## 9     faculti_staff   35
## 10   student_affair   33
## 11       gay_bisexu   30
## 12    sexual_orient   30
## 13            pm_pm   27
## 14    sexual_gender   27
## 15    student_organ   27
## 16         angel_ca   25
## 17  student_assembl   25
## 18            one_s   23
## 19       queer_alli   22
## 20    divis_student   21
## 21    alumni_associ   20
## 22            usc_s   20
## 23    lambda_alumni   19
## 24   mentor_program   19
## 25      peer_mentor   19
## 26   program_servic   19
## 27       male_femal   18
## 28       first_year   16
## 29      usc_student   16
## 30   gender_neutral   15
# Visuals:
dfcBCx[1:30,] %>%
  mutate(bigram = fct_reorder(bigram, freq)) %>%
  ggplot(aes(x=bigram, y=freq)) +
    geom_bar(stat="identity", fill = "magenta3") +
    coord_flip() +
    ggtitle("Frequency Distribution of Top 30 Bigrams") +
    labs(y="Frequency", x = "Bigram")

The disproportionately high frequencies of “gender identity,” “usc edu,” and “gay lesbian” may mean that dendrograms should be redrawn with these bigrams removed as well.

#Compute and store cosine similarity distance matrix on webpages
disBCx <- textstat_simil(dfmBCx, method = "cosine", margin = "documents")

#Create Dendrogram
denBCx <- hclust(disBCx, method = "complete")
  #Label Dendrogram
  labels(denBCx) <- as.character(qe$Title[ #convert to character; order Titles by 
      order.dendrogram(as.dendrogram(denBCx))]) #order of appearance in dendrogram 
  #Plot Dendrogram
  plot(as.dendrogram(denBCx), main = "Complete Linkage Clustering: Analysis 2b, 
       Bigram, Excluding 'Resource Center'", horiz = TRUE)

The webpages, again, have been clustered into topically more related groups. Splitting groups up based on the goal of three-to-four levels of hierarchy yields 5 groups:

#color dendrogram by groups based on line of division
group = cutree(as.dendrogram(denBCx), h = 0.385)
colDenBCx <- dendrapply(as.dendrogram(denBCx), colLab) 
plot(colDenBCx, main = "Complete Linkage Clustering: Analysis 2b, 
       Bigram, Excluding 'Resource Center'", horiz = TRUE)
abline(v = 0.385)

What bigrams define these 5 groups?

Top Ten Bigram Frequency Tables

ga2x <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)

#remove "resource center" from corpus
bcx <- tm_map(bc, removeWords, "resourc_center")
#GROUP 1
xbc <- bcx[(dplyr::filter(ga2x, Group == 1))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg1 <- dftermsXBC #store results
head(dfta2xg1, 10) #display table
##                word freq
## 1  bisexu_transgend    9
## 2      gender_ident    9
## 3       lesbian_gay    9
## 4           usc_edu    9
## 5        gay_bisexu    8
## 6       gay_lesbian    8
## 7     rainbow_floor    8
## 8    alli_communiti    7
## 9    program_servic    7
## 10    sexual_gender    7
#GROUP 2
xbc <- bcx[(dplyr::filter(ga2x, Group == 2))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg2 <- dftermsXBC #store results
head(dfta2xg2, 10) #display table
##                word freq
## 1      gender_ident   41
## 2       gay_lesbian   30
## 3        male_femal   16
## 4             one_s   16
## 5     rainbow_floor   15
## 6  bisexu_transgend   14
## 7     sexual_orient   14
## 8   graduat_student   13
## 9    student_affair   13
## 10    faculti_staff   12
#GROUP 3
xbc <- bcx[(dplyr::filter(ga2x, Group == 3))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg3 <- dftermsXBC #store results
head(dfta2xg3, 10) #display table
##                word freq
## 1           usc_edu   26
## 2          angel_ca   23
## 3             pm_pm   20
## 4  contact_trousdal   12
## 5    student_affair   12
## 6      alli_student   11
## 7        receiv_usc   11
## 8   student_assembl   11
## 9      campus_activ   10
## 10    divis_student   10
#GROUP 4
xbc <- bcx[(dplyr::filter(ga2x, Group == 4))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg4 <- dftermsXBC #store results
head(dfta2xg4, 10) #display table
##               word freq
## 1          usc_edu    9
## 2     gender_ident    8
## 3  graduat_student    7
## 4    parent_famili    7
## 5      gay_lesbian    6
## 6   gender_neutral    6
## 7      lesbian_gay    6
## 8     neutral_hous    6
## 9     parent_circl    6
## 10   rainbow_floor    6
#GROUP 5
xbc <- bcx[(dplyr::filter(ga2x, Group == 5))$ID] #pull all bcx docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xg5 <- dftermsXBC #store results
head(dfta2xg5, 10) #display table
##                  word freq
## 1  undergradu_graduat    6
## 2      commenc_celebr    5
## 3       faculti_staff    5
## 4        gender_ident    5
## 5      lavend_commenc    5
## 6             lcc_lcc    5
## 7             usc_edu    5
## 8        alli_student    4
## 9       alumni_associ    4
## 10         bias_incid    4

Group 1 seems to contain general information about being queer, while Group 2 also does, but with more of a focus on students, faculty, and staff. Group 3, by contrast, is heavily focused on USC without relation to queer identity. Group 4 seems to focus on parents and gender neutral housing, while Group 5 is centered on graduation.

Because none of the remaining high-frequency terms (“gender identity,” “usc edu,” and “gay lesbian”) consistently occur within the top 3 across groups, this may be a good stopping point for bigram analysis. But just in case, what happens when we remove all the high-frequency bigrams?

Analysis 2c: Excluding All High-Frequency Bigrams

Again, the high-frequency bigrams are: “resource center,” “gender identity,” “usc edu,” and “gay lesbian.”

dfmBCxz <- dfm_remove(dfmBCx, c("gender_ident", "usc_edu", "gay_lesbian")) #remove remaining high-frequency bigrams
dtmBCxz <- as.DocumentTermMatrix(dfmBCxz) #revert to dtm
inspect(dtmBCxz) #show changed dtm
## <<DocumentTermMatrix (documents: 109, terms: 1751)>>
## Non-/sparse entries: 4878/185981
## Sparsity           : 97%
## Maximal term length: 22
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  alli_student bisexu_transgend faculti_staff gay_bisexu
##   107            3                4             3          1
##   108            0                1             3          1
##   11             4                0             1          0
##   18             1                1             0          1
##   28             3                0             4          0
##   34             1                1             1          1
##   35             0                4             2          3
##   44             0                3             0          1
##   75             0                1             0          1
##   76             0                1             0          1
##      Terms
## Docs  graduat_student lesbian_gay pm_pm rainbow_floor sexual_orient
##   107               7           1     0             3             2
##   108               1           1     0             0             2
##   11                3           0     0             1             0
##   18                0           1     0             4             0
##   28                2           0     0             0             0
##   34                2           1     0             1             0
##   35                4           3     0             0             1
##   44                0           2     0             0             6
##   75                1           1     0             6             0
##   76                2           1     0             8             0
##      Terms
## Docs  student_affair
##   107             12
##   108              3
##   11               9
##   18               2
##   28               1
##   34               0
##   35               0
##   44               0
##   75               0
##   76               0
#bigram frequency tables:
cBCxz <- colSums(as.matrix(dtmBCxz)) #get and store bigram counts
dfcBCxz <- data.frame(bigram = names(cBCxz),freq=cBCxz) #put counts into table
dfcBCxz <- dplyr::arrange(dfcBCxz, desc(freq)) #sort counts in descending order
dfcBCxz[1:30,] #display table of top 30 bigrams
##                bigram freq
## 1     graduat_student   40
## 2       rainbow_floor   40
## 3    bisexu_transgend   38
## 4         lesbian_gay   36
## 5        alli_student   35
## 6       faculti_staff   35
## 7      student_affair   33
## 8          gay_bisexu   30
## 9       sexual_orient   30
## 10              pm_pm   27
## 11      sexual_gender   27
## 12      student_organ   27
## 13           angel_ca   25
## 14    student_assembl   25
## 15              one_s   23
## 16         queer_alli   22
## 17      divis_student   21
## 18      alumni_associ   20
## 19              usc_s   20
## 20      lambda_alumni   19
## 21     mentor_program   19
## 22        peer_mentor   19
## 23     program_servic   19
## 24         male_femal   18
## 25         first_year   16
## 26        usc_student   16
## 27     gender_neutral   15
## 28      ident_express   15
## 29     lesbian_bisexu   15
## 30 undergradu_graduat   15
# Visuals:
dfcBCxz[1:30,] %>%
  mutate(bigram = fct_reorder(bigram, freq)) %>%
  ggplot(aes(x=bigram, y=freq)) +
    geom_bar(stat="identity", fill = "mediumpurple") +
    coord_flip() +
    ggtitle("Frequency Distribution of Top 30 Bigrams") +
    labs(y="Frequency", x = "Bigram")

These top 30 bigrams are more evenly distributed; hypothetically, the differences in how they are distributed among pages should provide the clearest topical grouping.

#Compute and store cosine similarity distance matrix on webpages
disBCxz <- textstat_simil(dfmBCxz, method = "cosine", margin = "documents")

#Create Dendrogram
denBCxz <- hclust(disBCxz, method = "complete")
  #Label Dendrogram
  labels(denBCxz) <- as.character(qe$Title[ #convert to character; order Titles by 
      order.dendrogram(as.dendrogram(denBCxz))]) #order of appearance in dendrogram 
  #Plot Dendrogram
  plot(as.dendrogram(denBCxz), main = "Complete Linkage Clustering: Analysis 2c, 
       Bigram, Excluding All High-Frequency Bigrams", horiz = TRUE)

Against expectations, the webpages have been clustered into less coherent categories. Splitting groups up based on the goal of three-to-four levels of hierarchy yields 4 groups, which suggests that too many terms have been removed, making categories too broad:

#color dendrogram by groups based on line of division
group = cutree(as.dendrogram(denBCxz), h = 0.4)
colDenBCxz <- dendrapply(as.dendrogram(denBCxz), colLab) 
plot(colDenBCxz, main = "Complete Linkage Clustering: Analysis 2c, 
       Bigram, Excluding All High-Frequency Bigrams", horiz = TRUE)
abline(v = 0.4)

What bigrams define these 4 groups, if any?

Top Ten Bigram Frequency Tables

ga2xz <- data.frame(ID = 1:109, Title=names(group), Group=group, row.names = NULL)

#remove remaining high-frequency bigrams from corpus
bcxz <- tm_map(bcx, removeWords, c("gender_ident", "usc_edu", "gay_lesbian"))
#GROUP 1
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 1))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg1 <- dftermsXBC #store results
head(dfta2xzg1, 10) #display table
##               word freq
## 1            pm_pm   21
## 2     alli_student   10
## 3        make_gift    8
## 4   gender_neutral    7
## 5  graduat_student    7
## 6   program_servic    7
## 7    faculti_staff    6
## 8     neutral_hous    6
## 9      peer_mentor    6
## 10   rainbow_floor    6
#GROUP 2
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 2))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg2 <- dftermsXBC #store results
head(dfta2xzg2, 10) #display table
##                word freq
## 1          angel_ca   23
## 2             one_s   17
## 3        male_femal   16
## 4     rainbow_floor   16
## 5     sexual_orient   16
## 6       lesbian_gay   13
## 7     sexual_gender   13
## 8  bisexu_transgend   12
## 9  contact_trousdal   12
## 10    chest_surgeri   10
#GROUP 3
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 3))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg3 <- dftermsXBC #store results
head(dfta2xzg3, 10) #display table
##                word freq
## 1   graduat_student   20
## 2    student_affair   15
## 3     faculti_staff   14
## 4  bisexu_transgend   13
## 5     student_organ   12
## 6     alumni_associ   11
## 7     rainbow_floor   11
## 8     lambda_alumni   10
## 9       lesbian_gay   10
## 10     alli_student    9
#GROUP 4
xbc <- bcxz[(dplyr::filter(ga2xz, Group == 4))$ID] #pull all bcxz docs from group
dtmXBC <- DocumentTermMatrix(xbc) #convert to dtm
termsXBC <- colSums(as.matrix(dtmXBC)) #get and store term counts
dftermsXBC <- data.frame(word = names(termsXBC),freq=termsXBC) #put counts into table
dftermsXBC <- dplyr::arrange(dftermsXBC, desc(freq)) #sort counts in descending order
dfta2xzg4 <- dftermsXBC #store results
head(dfta2xzg4, 10) #display table
##                word freq
## 1        receiv_usc   11
## 2   student_assembl   10
## 3      affair_honor    9
## 4      alli_student    9
## 5     divis_student    9
## 6       lesbian_gay    9
## 7        name_chang    9
## 8    student_affair    9
## 9  bisexu_transgend    8
## 10     campus_activ    8

Based on these frequency tables, there is no coherent subject defining each group.

Conclusion

The most coherent groups thus far are the ones suggested by Analysis 2b:

  1. Being Queer
  2. Queer at USC
  3. Being at USC
  4. Navigating Gender
  5. Navigating Graduation

These groups can serve as potential menu headers on the website.

In addition, we have determined that bigrams are more useful than body text in analyzing page similarity. After striking the text-based analyses, the following analysis levels remain (renumbered to reflect the changes); a rough sketch of how one of these combined analyses might be set up follows the table:

Analysis Number  Aspects Involved
3                Bigram, Links
4                Bigram, LGBTRC, USC, Outside
5                Bigram, LGBTRC, USC, Outside, Email, Videos, Images
6                Bigram, LGBTRC, USC, Outside, Email, Videos, Linked Images, Unlinked Images
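For reference only, here is a hypothetical sketch of how Analysis 4 (Bigram, LGBTRC, USC, Outside) might be set up, reusing the objects created above and mirroring the steps of Analysis 2b. It was never run for this project, and how (or whether) to weight the link counts against the bigram counts remains an open choice:

#hypothetical sketch only (never run for this project): append the three
#link-count aspects to the trimmed bigram matrix, then cluster as in Analysis 2b
linkCounts <- as.matrix(qe[, c("LGBTRC", "USC", "Outside")]) #raw link counts per page
combined <- as.dfm(cbind(as.matrix(dfmBCx), linkCounts)) #pages x (bigrams + link aspects)
disComb <- textstat_simil(combined, method = "cosine", margin = "documents")
denComb <- hclust(disComb, method = "complete") #same clustering approach as before
plot(as.dendrogram(denComb), main = "Hypothetical Sketch: Analysis 4, 
     Bigram and Link Counts", horiz = TRUE)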

After consulting with Advocacy Directors, it was determined that Analysis 2b would be sufficient for a starting point in reorganizing the website’s menus, and that no further analyses would be done.