A common scenario is being asked to analyse a WebCenter Sites installation without having direct access to the system.

In this post we show how you can take simple Sites Explorer exports and interrogate them in R.

First we load the XML library to process the HTML files:

library(XML)

Now let's open the AssetPublication table from the zip file and read the data into a data frame:

# create a connection pointing to the file in the zip we want to process
f <- unz("AssetPublication.zip", "AssetPublication.html")

# read all of the text and pass it to the readHTMLTable function
# (named "tables" rather than "t" to avoid masking base R's t() function)
tables <- readHTMLTable(readLines(f))

# readHTMLTable returns a list with one element per table in the file; we want the AssetPublication table
assetPublication <- tables$AssetPublication

close(f)


head(assetPublication)
##              id         pubid         assettype       assetid
## 1 1343556967936 1112198287026          SitePlan     198287026
## 2 1380563708125 1112198287026 HIG60MessageAsset 1380563708121
## 3 1380563707998 1112198287026     VIP01Spndacct 1380563707995
## 4 1380563707936 1112198287026     VIP01Spndacct 1380563707933
## 5 1380563707922 1112198287026       VIP02Agency 1380563707919
## 6 1380563707917 1112198287026       VIP02Agency 1380563707914
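
Because the same unzip, read and select steps apply to every export, they could be wrapped in a small helper. The sketch below is one way to do that; `parse_sites_table` and `read_sites_export` are hypothetical names, not part of the XML package or Sites Explorer, and the wrapper assumes each zip is named after the single HTML file it contains, as with the AssetPublication export above.

```r
library(XML)

# Hypothetical helper: parse the HTML text of an export and
# return the table whose name matches table_name.
parse_sites_table <- function(html_lines, table_name) {
  tables <- readHTMLTable(html_lines)
  if (!table_name %in% names(tables)) {
    stop("No table named '", table_name, "' in the export")
  }
  tables[[table_name]]
}

# Hypothetical wrapper: assumes <table_name>.zip contains <table_name>.html.
read_sites_export <- function(table_name, dir = ".") {
  con <- unz(file.path(dir, paste0(table_name, ".zip")),
             paste0(table_name, ".html"))
  on.exit(close(con))  # release the connection even if parsing fails
  parse_sites_table(readLines(con), table_name)
}
```

With that in place, `read_sites_export("AssetPublication")` would stand in for the steps above, and a misspelled table name fails with an error instead of quietly returning `NULL`.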

So now we have the AssetPublication table. Let's have a look at the asset counts.

table(assetPublication$assettype)
## 
##             AdvCols     ArticleCategory           AttrTypes 
##                  15                   6                  13 
##          AVIArticle            AVIImage           Content_A 
##                 139                 249                  13 
##           Content_C          Content_CD           Content_F 
##                  15                   2                   2 
##           Content_P          Content_PD    ContentAttribute 
##                   1                   1                  24 
##          ContentDef    ContentParentDef           CSElement 
##                   3                   1                  47 
##              Device         DeviceGroup           Dimension 
##                  25                  11                   4 
##        DimensionSet          Document_A          Document_C 
##                   1                  10                  35 
##         Document_CD          Document_F          Document_P 
##                   1                   3                   4 
##         Document_PD         FSIIVisitor     FSIIVisitorAttr 
##                   1                   2                  11 
##      FSIIVisitorDef      FW_Application             FW_View 
##                   1                   5                   5 
##     HIG02SiteAttrib        HIG03Content           HIG11Link 
##                 178                   1                   3 
##      HIG14Component        HIG23GBDCase        HIG24GBDStat 
##                   5                   1                   1 
## HIG33TrainingModule       HIG35UniqueID       HIG48HubVideo 
##                   2                   2                   1 
##   HIG60MessageAsset        HLI03Article       ImageCategory 
##                   1                   2                  11 
##             Media_A             Media_C            Media_CD 
##                  11                  27                   1 
##             Media_F             Media_P            Media_PD 
##                   3                   3                   1 
##                Page       PageAttribute      PageDefinition 
##                  30                  18                   9 
##           Product_A           Product_C          Product_CD 
##                  15                  17                   1 
##           Product_F           Product_P          Product_PD 
##                   4                  20                   4 
##          Promotions          ScalarVals            Segments 
##                   1                  12                   3 
##           SiteEntry            SitePlan               Slots 
##                   9                   4                  17 
##          StyleSheet            Template           TestAsset 
##                  10                 139                   2 
##       VIP01Spndacct         VIP02Agency             WebRoot 
##                   2                   2                   3 
##             YouTube 
##                   5

Well, that isn't very pretty. The dplyr package gives us some easy-to-use functions for processing the data.

Let's find the top 10 asset types in this table:

library(dplyr)
assetPublication %>% count(assettype, sort=TRUE) %>% top_n(10)
## Source: local data frame [10 x 2]
## 
##           assettype     n
##              (fctr) (int)
## 1          AVIImage   249
## 2   HIG02SiteAttrib   178
## 3        AVIArticle   139
## 4          Template   139
## 5         CSElement    47
## 6        Document_C    35
## 7              Page    30
## 8           Media_C    27
## 9            Device    25
## 10 ContentAttribute    24
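
If dplyr is not available, the same ranking can be approximated in base R. This is a minimal sketch on a made-up vector, and `top_asset_types` is a hypothetical helper name; with the real export you would pass `assetPublication$assettype`. Unlike `top_n`, `head` cuts off at exactly `n` entries and does not keep ties.

```r
# Toy data for illustration only; with the real export, use
# assetPublication$assettype instead of this made-up vector.
assettype <- c("Template", "Page", "Template", "CSElement",
               "Template", "Page")

# Count each type, sort the counts in descending order,
# and keep the first n entries.
top_asset_types <- function(x, n = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  head(counts, n)
}

top2 <- top_asset_types(assettype, n = 2)
print(top2)
```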

What if we want to break that down by site? For that we need the site names, which live in the Publication table.

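# read the Publication table, which holds the site names
publication <- readHTMLTable(readLines(unz("Publication.zip", "Publication.html")))$Publication
# attach the site name to each row by matching pubid to the publication id
assetPublicationWithNames <- assetPublication %>% inner_join(publication, by = c("pubid" = "id"))
# top 5 asset types per site
assetPublicationWithNames %>% group_by(name, assettype) %>% tally(sort = TRUE) %>% top_n(5)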
## Source: local data frame [13 x 3]
## Groups: name [3]
## 
##           name        assettype     n
##         (fctr)           (fctr) (int)
## 1    AdminSite  HIG02SiteAttrib   178
## 2    AdminSite   FW_Application     5
## 3    AdminSite          FW_View     5
## 4    avisports         AVIImage   249
## 5    avisports       AVIArticle   139
## 6    avisports         Template    91
## 7    avisports ContentAttribute    24
## 8    avisports        CSElement    21
## 9  FirstSiteII         Template    48
## 10 FirstSiteII       Document_C    35
## 11 FirstSiteII          Media_C    27
## 12 FirstSiteII        CSElement    26
## 13 FirstSiteII        Product_P    20
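
One thing to keep in mind is that `inner_join` silently drops any AssetPublication row whose `pubid` has no match in the Publication table. A quick way to check for that is `anti_join`, sketched here on toy tables with made-up ids and site names; with the real data you would use `assetPublication` and `publication` instead.

```r
library(dplyr)

# Toy tables for illustration only; the ids and names are made up.
asset_pub <- data.frame(
  pubid     = c("1", "1", "2", "9"),
  assettype = c("Page", "Template", "Page", "Page"),
  stringsAsFactors = FALSE
)
pubs <- data.frame(
  id   = c("1", "2"),
  name = c("SiteA", "SiteB"),
  stringsAsFactors = FALSE
)

# Rows of asset_pub whose pubid has no matching id in pubs.
orphans <- asset_pub %>% anti_join(pubs, by = c("pubid" = "id"))

# The inner join keeps only the rows that do have a match.
joined <- asset_pub %>% inner_join(pubs, by = c("pubid" = "id"))

print(orphans)
```

If `orphans` is empty, the join has not lost any rows and the per-site counts cover the whole table.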

As one final example, let's look at the number of assets per site.

aps <- assetPublicationWithNames %>% group_by(name) %>% tally() %>% arrange(n)
pie(aps$n, labels = aps$name, col = rainbow(length(aps$n)))

Or as a bar chart:

library(ggplot2)
ggplot(aps, aes(x=name, y=n, fill=name)) + 
  geom_bar(stat="identity") +
  theme_bw(base_size = 20) +
  ylab("Number of Assets") +
  theme(axis.title.x = element_blank()) +
  guides(fill=guide_legend(title=NULL))
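
The per-site totals behind both charts can also be expressed as percentage shares, which is sometimes easier to quote in a report. A minimal sketch with made-up totals follows; with the real data you could build the named vector with `setNames(aps$n, as.character(aps$name))`.

```r
# Made-up per-site totals for illustration only.
site_counts <- c(SiteA = 50, SiteB = 150, SiteC = 300)

# prop.table() divides each count by the overall total;
# multiply by 100 and round to get percentages.
shares <- round(100 * prop.table(site_counts), 1)
print(shares)
```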