Example of Analysis in Rmarkdown + knitr in RStudio

Below is the workflow for an analysis on the Bornholm grave ornament data or in the archdata package. The source info for this data is at the bottom of the page. The scope of the analysis is conducting a Correspondence Analysis on the count of different 39 different types of grave ornaments across 7 time periods from 77 female Iron Age graves in Bornholm, Denmark. This analysis, while I believe valid, it just an example to show a reproducible research workflow. The context for use of this example include a research documenting their own analysis, sharing analysis with a co-worker or adviser, preparing for publication, or just exploring some data that you find interesting.

The end result is a HTML, PDF, and/or Word document that contains the entire throughput of your analysis, all the packages, analytic steps, and results. From this, anyone else can recreate your analysis for confirmation or extension. Perhaps the best part is that the person repeating it is future you that recalls when you did that great CA analysis on the Bornholm data. This process solves the problem of, “now where did I put those files…”"

library("archdata")
library("dplyr")
library("MASS")
library("factoextra")
library("FactoMineR")
library("gplots")
library("ggmap")

Load data from archdata package and look at structure

data(Bornholm)
bh <- Bornholm
str(bh)
## 'data.frame':    77 obs. of  42 variables:
##  $ Number: num  70 61 62 69 28 32 45 18 2 42 ...
##  $ Site  : chr  "Tranekar" "Lillevang-Melsted 2" "Lillevang-Melsted 4" "Ellegård" ...
##  $ Period: Factor w/ 7 levels "1a","1b","2a",..: 7 7 6 6 7 6 6 6 6 6 ...
##  $ N2c   : num  1 2 0 0 0 0 0 0 0 0 ...
##  $ R3d   : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ N2a   : num  1 0 0 0 1 0 0 0 0 0 ...
##  $ Q3b   : num  0 2 1 2 0 1 1 0 0 0 ...
##  $ R3c   : num  0 0 1 1 0 0 0 0 0 0 ...
##  $ N1    : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ Q3c   : num  0 0 1 1 2 0 2 1 1 0 ...
##  $ O1    : num  0 0 1 0 0 0 1 1 1 1 ...
##  $ O2    : num  0 0 0 2 0 0 0 1 0 1 ...
##  $ N2e   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ I3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ R3b   : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ K1a   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q3a   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ I2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ K1c   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ K1b   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ H     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q3d   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ J1d   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ S1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ D     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Q2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ S3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ P2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ P4    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ G3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ E2a   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ P3    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ R3a   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ R1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ E2b   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ G2    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ I1b   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ G1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ F     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ P1    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ I1a   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ A2e   : num  0 0 0 0 0 0 0 0 0 0 ...

Location Map

base_map <- get_map(location = "Bornholm", zoom = 10)
ggmap(base_map, extent = "normal") +
coord_cartesian(xlim = c(14.6, 15.22),
                ylim = c(54.95, 55.33),
                expand = FALSE)

Background of current research

Look at Carlson version of Baxter(1994) Correspondence Analysis

Bornholm.ca <- corresp(bh[, 4:42], nf=2)
# Symetric Biplot of Ornamentation by Site labeled by Time period
plot(Bornholm.ca$rscore, pch=substring(bh$Period, 1, 1), cex=.75)

# Boxplot of 1st CA dimension raw score by Time period
boxplot(Bornholm.ca$rscore[, 1]~bh$Period, main="First CA Axis by Period")

Current Analysis

Instead of looking at each site, this analysis is more interested in change over time aggregated over sites. This is an attempt to remove the variance between sites and look for a larger pattern from which Baxter’s (1994) analysis is drawn.

The approach here is to aggregate the counts of each ornamentation style per time period and then redo the Correspondence Analysis on these date. If the CA results are verifiable via a Chi Square (\(\chi^2\)) test and row/column contributions, then we may be able to speak to the larger pattern of ornamentation throughout these periods.

The first step is to use dplyr and tibble packages to aggregate and prepare Bornholm data

bh2 <- group_by(bh, Period) %>%
  dplyr::select(-Number, -Site) %>%
  summarise_each(funs(sum)) %>%
  mutate(Period = as.character(Period)) %>%
  tibble::column_to_rownames(., "Period") %>%
  data.frame()
## Warning: Setting row names on a tibble is deprecated.

Take a look at a sample of the results, make sure it looks right

print(bh2[,1:15]) # example of first 10 columns
   N2c R3d N2a Q3b R3c N1 Q3c O1 O2 N2e I3 R3b K1a Q3a I2
1a   0   0   0   0   0  0   0  0  0   0  0   0   0   0  0
1b   0   0   0   0   0  0   0  0  0   0  0   0   0   0  0
2a   0   0   0   0   0  0   0  0  0   0  0   0   0   0  0
2b   0   0   0   0   0  0   0  0  0   0  0  13   1   1  3
2c   0   0   0   0   0  0   0  0  0   0  0   4   2   1  0
3a   0   0   0   5   2  2   8  6 11   0  5  13   0   0  1
3b   3   1   2   3   0  1   2  0  0   3  3   3   0   0  0

First take a look at the distribution of ornamentation over time. This relies on the gplots package balloonplot() function.

# convert to matrix for balloonplot() function
bh2_table <- as.table(as.matrix(bh2))
# balloon plotto show frequency of styles per time period
balloonplot(x = t(bh2_table), main = "Ornamentation by Style and Time Period", xlab ="", ylab="",
            label = FALSE, show.margins = FALSE)

# mosaic plot visualizes the freqncy as the length of the bar and deviation for the expected-by-random frequency as color (red for less than expected, blue for more than expected)
mosaicplot(bh2_table, shade = TRUE, las=2, main = "Ornamentation by Style and Time Period")

Conduct the CA analysis using the same FactoMineR::CA() to conduct the same type of analysis, but use a package that gives more information back for further study. These results will be plotted with the factoextra packages. This package returns a friendly ggplot2 object to extend. From there, the CA will be redone with the FactoMineR package to give us more options to review the findings

ca3 <- CA(bh2, ncp = 2, graph = FALSE) # ncp = 2 indicates 2 CA dimensions 
summary(ca3, nb.dec = 2, ncp = 2) 
## 
## Call:
## CA(X = bh2, ncp = 2, graph = FALSE) 
## 
## The chi square of independence between the two variables is equal to 806.2874 (p-value =  1.341565e-65 ).
## 
## Eigenvalues
##                       Dim.1  Dim.2  Dim.3  Dim.4  Dim.5  Dim.6
## Variance               0.82   0.60   0.39   0.34   0.26   0.07
## % of var.             33.06  24.13  15.85  13.79  10.38   2.78
## Cumulative % of var.  33.06  57.20  73.05  86.84  97.22 100.00
## 
## Rows
##       Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1a  |    419.72 |  -1.06  19.76   0.38 |   0.89  19.18   0.27 |
## 1b  |    340.24 |  -0.91  19.23   0.46 |   0.44   6.11   0.11 |
## 2a  |    145.89 |  -0.39   2.21   0.12 |  -0.58   6.66   0.27 |
## 2b  |    264.18 |   0.04   0.04   0.00 |  -0.91  33.83   0.76 |
## 2c  |    327.39 |   0.35   0.89   0.02 |  -1.20  14.03   0.26 |
## 3a  |    472.79 |   1.24  34.74   0.60 |   0.36   4.06   0.05 |
## 3b  |    495.50 |   1.67  23.14   0.38 |   1.19  16.14   0.19 |
## 
## Columns (the 10 first)
##       Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## N2c |    127.19 |   1.85   3.87   0.25 |   1.55   3.70   0.17 |
## R3d |     42.40 |   1.85   1.29   0.25 |   1.55   1.23   0.17 |
## N2a |     84.79 |   1.85   2.58   0.25 |   1.55   2.47   0.17 |
## Q3b |     78.75 |   1.56   7.26   0.75 |   0.87   3.15   0.24 |
## R3c |     27.22 |   1.38   1.42   0.43 |   0.47   0.23   0.05 |
## N1  |     28.20 |   1.54   2.65   0.77 |   0.83   1.06   0.22 |
## Q3c |     94.27 |   1.47   8.12   0.70 |   0.69   2.42   0.15 |
## O1  |     81.65 |   1.38   4.26   0.43 |   0.47   0.68   0.05 |
## O2  |    149.69 |   1.38   7.81   0.43 |   0.47   1.25   0.05 |
## N2e |    127.19 |   1.85   3.87   0.25 |   1.55   3.70   0.17 |
# use factoextra::fviz_ca_biplot() to create the same style plot as the Symetric Biplot above
fviz_ca_biplot(ca3) + # the basic plot from the fviz_ca_biplot() function
  theme_minimal()     # add further ggplot2 calls

The results above show a pretty convincing pattern between the relationship of the time periods (blue) and Ornamentation (red). Three relatively distinct clusters or ornamentation appear in each of the upper-left, upper-right quadrants, and lower half. The central portions of these clusters coincide with the time periods of 1, 3, and 2 respectively.

Looking at the model a bit more in depth: Calculate the correlation coefficient between the rows and columns

eig <- get_eigenvalue(ca3)
trace <- sum(eig$eigenvalue) 
cor.coef <- sqrt(trace)
cor.coef # > 0.2 = significant (general rule)
## [1] 1.570258

The coefficient of 1.5702582 is considered an important correlation result (Bendixen 1995, 576; Healey 2013, 289-290). The summary results of the ca3 object above give the \(\chi^2\) results of the table values. The same value is calculated independently below.

chi2 <- trace*sum(as.matrix(bh2))
# Degree of freedom
df <- (nrow(bh2) - 1) * (ncol(bh2) - 1)
# P-value
pval <- pchisq(chi2, df = df, lower.tail = FALSE)
paste0("chisq = ", round(chi2,3), ", p-value = ", round(pval,3)) # same as summary(ca3)
## [1] "chisq = 806.287, p-value = 0"

The results are the same as the summary above.

Next, is to take a look at the justification for number of dimensions being evaluated. The point of this portion of the analysis to find a threshold to use for deciding on the number of dimensions. The threshold used here is to chose only those dimensions that explain more variance in the data that should be explained by random chance alone.

row_random <- (1/(nrow(bh2)-1))*100 # expected explained var by random for row
col_random <- (1/(ncol(bh2)-1))*100 # expected explained var by random for col

fviz_screeplot(ca3) +
  geom_hline(yintercept = max(row_random, col_random), linetype=2, color="red") +
  theme_bw()

# dimension 1 and 2 are included because dim > 3 could be by radom
# e.g. below max(row_random, col_random)

The threshold (red dotted line) is drawn as the maximum value of the variance explained by random for either rows or columns. In this case the row value is 16.6666667%, the column value is 2.6315789%, and there for the max is 16.6666667%. As depicted in the graphic, a choice of evaluating the first two dimensions is justified under these assumptions.

To understand which of rows/columns and observations contribute to the overall significance of the results, the rows and columns coordinates are colored by their contribution values

# contirbution of rows to global solution
fviz_ca_row(ca3, col.row="contrib")+
  scale_color_gradient2(low="white", mid="blue", 
                        high="red", midpoint=10)+
  theme_bw()

# contirbution of columns to global solution
fviz_ca_col(ca3, col.col="contrib")+
  scale_color_gradient2(low="white", mid="blue", 
                        high="red", midpoint=5)+
  theme_bw()

It appears to be the observations at the negative poles of dimension 1 that are most contributing to the results; these include time periods 1 and 3.

To compliment the symmetric biplots above, an asymmetric biplot is presented to understand the relationship if rows and columns within a common measurement space. As such, the distance and location relative to the angle of the row coordinates (blue) can be interpreted directly as they are projected into the same measurement space.

# asymetric biplot in the measuremnet space of rows
fviz_ca_biplot(ca3, map ="rowprincipal", arrow = c(TRUE, FALSE)) +
  theme_bw()

Concluding observations

Generally, a series of styles are well clustered to time period 1a, with a related set at 1b. The periods of 2a, 2b, and 2c are distance from both 1 and 3 and contain a continuum of styles. There are a few overlapping styles (Q3d, R3b) between periods 2 and 3. Finally, in the later time periods, 3a does not appear to have any stylistic cluster, but 3b contains the most distinct and distant cluster of any time period

The images that show this best are the basic symmetric plot and the asymmetric plot in row measurement space. The code below will take the basic plots above and dress them up a bit with calls to ggplot2. These can be used in presentation or publication

# dressed up symetric biplot
fviz_ca_biplot(ca3, arrow = c(TRUE, FALSE), label = "row", labelsize = 5,
               jitter = list(what = "label", width = 0.1, height = 0.1)) +
  theme_bw() +
  labs(title="CA Analysis: Symetric Biplot for Ornament Style and Time Period ",
       subtitle="39 ornamentations from 77 female graves at Iron age sites in Bornholm, Denmark",
       caption="Data: Ørsnes, M. (1966)") +
  theme(
    panel.border = element_rect(colour = "gray90"),
    axis.text.x = element_text(angle = 0, size = 10, family = "Trebuchet MS"),
    axis.text.y = element_text(size = 9, family = "Trebuchet MS"),
    axis.title.y = element_text(size = 11, family = "Trebuchet MS"),
    axis.title.x = element_text(size = 11, family = "Trebuchet MS"),
    plot.caption = element_text(size = 10, hjust = 0, margin=margin(t=10), 
                                family = "Trebuchet MS"),
    plot.title=element_text(family="TrebuchetMS-Bold"),
    plot.subtitle=element_text(family="TrebuchetMS-Italic")
  )

#contribution biplot
fviz_ca_biplot(ca3, map ="rowprincipal", arrow = c(TRUE, FALSE), repel = TRUE) +
  theme_bw() +
  labs(title="CA Analysis: Conribution Biplot for Ornament Style and Time Period ",
       subtitle="39 ornamentations from 77 female graves at Iron age sites in Bornholm, Denmark",
       caption="Data: Ørsnes, M. (1966)") +
  theme(
    panel.border = element_rect(colour = "gray90"),
    axis.text.x = element_text(angle = 0, size = 10, family = "Trebuchet MS"),
    axis.text.y = element_text(size = 9, family = "Trebuchet MS"),
    axis.title.y = element_text(size = 11, family = "Trebuchet MS"),
    axis.title.x = element_text(size = 11, family = "Trebuchet MS"),
    plot.caption = element_text(size = 10, hjust = 0, margin=margin(t=10), 
                                family = "Trebuchet MS"),
    plot.title=element_text(family="TrebuchetMS-Bold"),
    plot.subtitle=element_text(family="TrebuchetMS-Italic")
  )

Data Source

These data are from the archdata package in R and can be retrieved by typing once the package is installed and loaded. The information below is form the package documentation at: https://cran.r-project.org/web/packages/archdata/archdata.pdf

Details

Nielsen used data on 39 different types of ornaments from Ørsnes (1966) to seriate a series of 77 Germanic Iron Age graves from Bornholm, Denmark (1988, Table 4 and Figure 7). Baxter re-analyzed the data to illustrate correspondence analysis (1994: 104-107, Table A6). These data were taken from Nielsen’s Table 4 showing her seriation. Baxter’s version is scrambled in order to evaluate different seriation methods and does not include the ornament types (illustrated in Nielson’s Figure 7). The data include Ørsnes’s period designation (1966).

Source

Baxter, M. J. 1994. Exploratory Multivariate Analysis in Archaeology. Edinburgh University Press. Edinburgh.

Nielsen, D. H. 1988. Correspondence analysis applied to hords and graves of the Germanic Iron Age. In Multivariate Archaeology: Numerical Approaches in Scandinavian Archaeology, edited by Torsten Madsen, pp 37-54. Jutland Archaeological Society Publications XXI. Arahus University Press.

Ørsnes, M. 1966. Form og stil i Sydskandinaviens yngre germanske jernalder. Nationalmuseets skrifter. Arkæologisk-historisk række 11. Copenhagen.

RStudio Environment Information

sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggmap_2.6.1        gplots_3.0.1       FactoMineR_1.33   
## [4] factoextra_1.0.3   ggplot2_2.1.0.9000 MASS_7.3-45       
## [7] dplyr_0.5.0        archdata_1.1      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7          formatR_1.4          plyr_1.8.4          
##  [4] bitops_1.0-6         tools_3.3.1          digest_0.6.10       
##  [7] evaluate_0.9         tibble_1.2           gtable_0.2.0        
## [10] lattice_0.20-33      png_0.1-7            DBI_0.5             
## [13] mapproj_1.2-4        ggrepel_0.5          yaml_2.1.13         
## [16] proto_0.3-10         stringr_1.1.0        knitr_1.14          
## [19] cluster_2.0.4        maps_3.1.1           RgoogleMaps_1.4.1   
## [22] gtools_3.5.0         caTools_1.17.1       flashClust_1.01-2   
## [25] grid_3.3.1           scatterplot3d_0.3-37 data.table_1.9.6    
## [28] R6_2.1.3             jpeg_0.1-8           rmarkdown_1.0       
## [31] sp_1.2-3             gdata_2.17.0         reshape2_1.4.1      
## [34] magrittr_1.5         scales_0.4.0         htmltools_0.3.5     
## [37] leaps_2.9            assertthat_0.1       geosphere_1.5-5     
## [40] colorspace_1.2-6     labeling_0.3         KernSmooth_2.23-15  
## [43] stringi_1.1.1        lazyeval_0.2.0       munsell_0.4.3       
## [46] rjson_0.2.15         chron_2.3-47