A common situation is being asked to analyse a WebCenter Sites installation without having direct access to the system.
In this post we show how to take simple Sites Explorer exports and interrogate them in R.
First we load the XML library to process the HTML files.
library(XML)
Now let's open the AssetPublication table from the zip file and read the data into a data frame.
# create an object pointing to the file in the zip we want to process
f <- unz("AssetPublication.zip", "AssetPublication.html")
# read all of the text and pass it to the readHTMLTable function
# (the result is called "tables" rather than "t" so it does not mask base R's t() function)
tables <- readHTMLTable(readLines(f))
# readHTMLTable returns a list with one element per table in the file; we want the AssetPublication table
assetPublication <- tables$AssetPublication
close(f)
head(assetPublication)
##              id         pubid         assettype       assetid
## 1 1343556967936 1112198287026          SitePlan     198287026
## 2 1380563708125 1112198287026 HIG60MessageAsset 1380563708121
## 3 1380563707998 1112198287026     VIP01Spndacct 1380563707995
## 4 1380563707936 1112198287026     VIP01Spndacct 1380563707933
## 5 1380563707922 1112198287026       VIP02Agency 1380563707919
## 6 1380563707917 1112198287026       VIP02Agency 1380563707914
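Because the export follows a simple pattern (a zip named after the table, containing an HTML file of the same name), the read steps can be wrapped in a small helper. This is only a sketch: `read_sites_table` and its `dir` argument are hypothetical names, not part of the XML package, and it assumes the naming convention seen for AssetPublication also holds for other tables.

```r
# Hypothetical helper: read one Sites Explorer export into a data frame.
# Assumes <table_name>.zip contains <table_name>.html, as in the example above.
read_sites_table <- function(table_name, dir = ".") {
  zip_path <- file.path(dir, paste0(table_name, ".zip"))
  con <- unz(zip_path, paste0(table_name, ".html"))
  # Always release the connection, even if reading fails
  on.exit(close(con))
  html <- readLines(con, warn = FALSE)
  # stringsAsFactors = FALSE keeps ids and names as plain character strings
  tables <- XML::readHTMLTable(html, stringsAsFactors = FALSE)
  tables[[table_name]]
}

# Usage, guarded so the sketch is harmless when the export is not present
if (file.exists("AssetPublication.zip")) {
  ap <- read_sites_table("AssetPublication")
  str(ap)
}
```

Calling `XML::readHTMLTable` with the namespace prefix means the helper can be defined before `library(XML)` has been run. Note that with `stringsAsFactors = FALSE` the columns are character rather than factor, so printed column types will differ from the output shown in this post.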
So now we have the AssetPublication table. Let's have a look at the asset counts.
table(assetPublication$assettype)
##
##             AdvCols     ArticleCategory           AttrTypes
##                  15                   6                  13
##          AVIArticle            AVIImage           Content_A
##                 139                 249                  13
##           Content_C          Content_CD           Content_F
##                  15                   2                   2
##           Content_P          Content_PD    ContentAttribute
##                   1                   1                  24
##          ContentDef    ContentParentDef           CSElement
##                   3                   1                  47
##              Device         DeviceGroup           Dimension
##                  25                  11                   4
##        DimensionSet          Document_A          Document_C
##                   1                  10                  35
##         Document_CD          Document_F          Document_P
##                   1                   3                   4
##         Document_PD         FSIIVisitor     FSIIVisitorAttr
##                   1                   2                  11
##      FSIIVisitorDef      FW_Application             FW_View
##                   1                   5                   5
##     HIG02SiteAttrib        HIG03Content           HIG11Link
##                 178                   1                   3
##      HIG14Component        HIG23GBDCase        HIG24GBDStat
##                   5                   1                   1
## HIG33TrainingModule       HIG35UniqueID       HIG48HubVideo
##                   2                   2                   1
##   HIG60MessageAsset        HLI03Article       ImageCategory
##                   1                   2                  11
##             Media_A             Media_C            Media_CD
##                  11                  27                   1
##             Media_F             Media_P            Media_PD
##                   3                   3                   1
##                Page       PageAttribute      PageDefinition
##                  30                  18                   9
##           Product_A           Product_C          Product_CD
##                  15                  17                   1
##           Product_F           Product_P          Product_PD
##                   4                  20                   4
##          Promotions          ScalarVals            Segments
##                   1                  12                   3
##           SiteEntry            SitePlan               Slots
##                   9                   4                  17
##          StyleSheet            Template           TestAsset
##                  10                 139                   2
##       VIP01Spndacct         VIP02Agency             WebRoot
##                   2                   2                   3
##             YouTube
##                   5
Well, that isn't very pretty. The dplyr package provides easy-to-use functions for processing the data.
Let's find the top 10 asset types in this table:
library(dplyr)
assetPublication %>% count(assettype, sort=TRUE) %>% top_n(10)
## Source: local data frame [10 x 2]
##
##           assettype     n
##              (fctr) (int)
## 1          AVIImage   249
## 2   HIG02SiteAttrib   178
## 3        AVIArticle   139
## 4          Template   139
## 5         CSElement    47
## 6        Document_C    35
## 7              Page    30
## 8           Media_C    27
## 9            Device    25
## 10 ContentAttribute    24
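One thing to keep in mind is that `top_n()` keeps ties, so it can return more than 10 rows when several asset types share the tenth-highest count. As a cross-check that needs no extra packages, the same ranking can be sketched in base R. The `assettype` vector below is illustrative only, rebuilt from five of the counts printed above, and the example takes a top 3 so that the cut falls on the 139/139 tie.

```r
# Illustrative data rebuilt from five of the counts printed above
assettype <- rep(
  c("AVIImage", "HIG02SiteAttrib", "AVIArticle", "Template", "CSElement"),
  times = c(249, 178, 139, 139, 47)
)

# Frequency table, largest count first
counts <- sort(table(assettype), decreasing = TRUE)

# head() returns exactly 3 entries, cutting through the 139/139 tie
top3 <- head(counts, 3)

# Keeping every type whose count reaches the 3rd-highest value preserves ties,
# which is how top_n() behaves
top3_with_ties <- counts[as.vector(counts) >= counts[[3]]]

print(top3)
print(top3_with_ties)
```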
What if we want to break that down by site? For that we need the site names from the Publication table.
publication <- readHTMLTable(readLines(unz("Publication.zip", "Publication.html")))$Publication
assetPublicationWithNames <- assetPublication %>% inner_join(publication, c("pubid"="id"))
assetPublicationWithNames %>% group_by(name, assettype) %>% tally(sort = TRUE) %>% top_n(5)
## Source: local data frame [13 x 3]
## Groups: name [3]
##
##           name        assettype     n
##         (fctr)           (fctr) (int)
## 1    AdminSite  HIG02SiteAttrib   178
## 2    AdminSite   FW_Application     5
## 3    AdminSite          FW_View     5
## 4    avisports         AVIImage   249
## 5    avisports       AVIArticle   139
## 6    avisports         Template    91
## 7    avisports ContentAttribute    24
## 8    avisports        CSElement    21
## 9  FirstSiteII         Template    48
## 10 FirstSiteII       Document_C    35
## 11 FirstSiteII          Media_C    27
## 12 FirstSiteII        CSElement    26
## 13 FirstSiteII        Product_P    20
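Note that `inner_join()` silently drops any AssetPublication row whose `pubid` has no match in Publication, so it can be worth checking for orphans before trusting the per-site counts. A minimal base R sketch of such a check follows. The two small data frames (`ap_demo`, `pub_demo`) and all ids in them are made up for illustration and do not come from the export.

```r
# Made-up stand-ins for the two tables (ids are not real export values)
ap_demo <- data.frame(
  id        = c("a1", "a2", "a3", "a4"),
  pubid     = c("1", "1", "2", "99"),
  assettype = c("Page", "Template", "Page", "CSElement"),
  stringsAsFactors = FALSE
)
pub_demo <- data.frame(
  id   = c("1", "2"),
  name = c("AdminSite", "avisports"),
  stringsAsFactors = FALSE
)

# Rows that an inner join on pubid == id would drop
orphans <- ap_demo[!ap_demo$pubid %in% pub_demo$id, ]
print(orphans)

# merge() with all.x = TRUE is the base R counterpart of a left join:
# unmatched rows are kept, with NA in the name column
joined <- merge(ap_demo, pub_demo,
                by.x = "pubid", by.y = "id", all.x = TRUE)
print(joined)
```

If `orphans` is empty, the inner join loses nothing and the counts per site cover every row in AssetPublication.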
As one final example, let's look at the number of assets per site.
aps <- assetPublicationWithNames %>% group_by(name) %>% tally() %>% arrange(n)
pie(aps$n, labels = aps$name, col = rainbow(length(aps$n)))
Or as a bar chart:
library(ggplot2)
ggplot(aps, aes(x = name, y = n, fill = name)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 20) +
  ylab("Number of Assets") +
  theme(axis.title.x = element_blank()) +
  guides(fill = guide_legend(title = NULL))
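Neither chart shows each site's share as a number. One way to add it, sketched here in base R, is to compute a percentage per site and fold it into the pie labels. The `aps_demo` data frame and its counts are hypothetical stand-ins for `aps`, chosen only so the arithmetic is easy to follow.

```r
# Hypothetical per-site totals, standing in for the aps data frame
aps_demo <- data.frame(
  name = c("AdminSite", "FirstSiteII", "avisports"),
  n    = c(200, 300, 500),
  stringsAsFactors = FALSE
)

# Share of all assets held by each site, as a whole-number percentage
aps_demo$pct <- round(100 * aps_demo$n / sum(aps_demo$n))
aps_demo$label <- paste0(aps_demo$name, " (", aps_demo$pct, "%)")
print(aps_demo)

# The labels can then be passed to pie() in place of the bare names
pie(aps_demo$n, labels = aps_demo$label, col = rainbow(nrow(aps_demo)))
```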