This is a quick analysis and visualisation of package revision data. The data is in csv form, saved in a folder called “data”, one level up from the working directory of this R work. The relative path is from the R-script in a ‘code’ folder, one level up into the ‘data’ folder.

Working directory

The working directory for this code to run properly needs to be set relative to where the data is stored. If the data is stored in the same folder as this document is saved and subsequentyl executed in R or Rstudio, ensure that the “path_to_file” variable is set accordingly. To see your working directory, type “getwd()” in the R console. To set your working directory, type “setwd(”path_to_working_directory_required“)” and complete the biut between braces accordingly.

Packages required

We need to install packages to plot the data in a grid format. For this work, we’ll use the “lattice” package.

library(stringr)
library(lattice)

Getting the data

The headers for the columns are given explicitly, to ensure the correct headers are used in the analysis. In this case, when reading the data in, the argument “headers” is set to FALSE.

columns <- c("SMF", "SUBSYS", "COUNT", "PLAN", "YEAR", "MONTH", "DAY", "HOUR", "TRAN", "COLLECTION", "PACKAGE", "SQL", "DB2ELAP", "DB2TCB")
path_to_file = "../data/Packrev.csv"
data <- read.csv(path_to_file, header = FALSE, skip = 1, col.names = columns, stringsAsFactors = FALSE)

Verifying the data has read in correctly, accessing only the first 10 rows of data (note: if you want to see the output of the command, uncomment line 2 of the code below):

data_test = data[1:5,]
#data_test

Quick glance at the structure of the data

The data has mixed types, some columns are numbers, others are date fields and the remaining are text. The data type in these columns needs to be properly set in order to get appropriate plots. Running the code below shows that some of the text columns (character, or “chr” type) have leading and trailing white spaces. The date columns are read in as integers and that needs to be converted to date-time objects for R. The Packages, collection and SQL columns need to be converted to factors

str(data)

## 'data.frame':    11196 obs. of  14 variables:
##  $ SMF       : chr  " XE21  " " XE21  " " XE21  " " XE21  " ...
##  $ SUBSYS    : chr  "  S10A   " "  S10A   " "  S10A   " "  S10A   " ...
##  $ COUNT     : int  341 341 1 1 1 1 3 1 1 12 ...
##  $ PLAN      : chr  "  FL2A1700 " "  FL2A1700 " "  GLBD1700 " "  GLBD1700 " ...
##  $ YEAR      : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ MONTH     : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ DAY       : int  28 28 28 28 28 28 28 28 28 28 ...
##  $ HOUR      : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ TRAN      : int  3406 3406 1 1 1 1 3 1 1 12 ...
##  $ COLLECTION: chr  "  AUTH1700_COM " "  FL2A1700_CHO " "  AUTH1700_COM " "  AUTH1700_COM " ...
##  $ PACKAGE   : chr  "  PTADRVAC     " "  FLCMMAGT     " "  PTADRVAC     " "  PTADRVU@     " ...
##  $ SQL       : int  1702 8084 2 2 2 3 4 2 3 48 ...
##  $ DB2ELAP   : num  0.6231 6.4295 0.0102 0.3388 0.0171 ...
##  $ DB2TCB    : num  0.475548 0.981472 0.000722 0.001568 0.00063 ...

data$Date <- paste(data$YEAR, data$MONTH, data$DAY, sep ="/")
#first combine the columns into one
data$Date <- as.Date(data$Date, format="%Y/%m/%d")
#adding the hour into the data time column
data$DateHour <- paste(paste(data$YEAR, data$MONTH, data$DAY, sep = "/"),data$HOUR, sep = " ")
#convert the column into a Date/Time column
data$DateHour <- as.POSIXct(data$DateHour, format="%Y/%m/%d %H", tz = "MST") + (0:3) * 3600
str(data$DateHour[1:10])

##  POSIXct[1:10], format: "2014-07-28 12:00:00" "2014-07-28 13:00:00" ...

Plotting

some variables we need to convert to factors to use in the plots:

data$CollectionFactors <- as.factor(data$COLLECTION)
data$SQLFactors <- as.factor(data$SQL)
data$PackageFactors <- as.factor(str_trim(data$PACKAGE, side = c("both")))

Selecting specific data

Subsetting the data for only the rows containing the package of interest We first find out which packages appear the most in the data

PackageFrequency <- table(data$PackageFactors)
barplot(PackageFrequency)

OrderedPackageFrequency <- PackageFrequency[order(PackageFrequency, decreasing = TRUE)]

#Packages of interest: selecting the top 5
packagesOfInterest <- names(OrderedPackageFrequency[1:5])

#packagesOfInterest names have been changed to accomodate the use of # and @ in the package name.
#All five top:
PTADRVACdata <- data[which(data$PackageFactors == packagesOfInterest[1]),]
LCA__HCAdata <- data[which(data$PackageFactors == packagesOfInterest[2]),]
SA_CDBdata <- data[which(data$PackageFactors == packagesOfInterest[3]),]
PRA__HCAdata <- data[which(data$PackageFactors == packagesOfInterest[4]),]
PTASSPIdata <- data[which(data$PackageFactors == packagesOfInterest[5]),]
#putting them together:
SubsetData <- rbind(PTADRVACdata, LCA__HCAdata, SA_CDBdata, PRA__HCAdata, PTASSPIdata)

Using the Lattice package: Plotting data in SQL column as a function of Date

#SQL data for the PTADRVAC package as a function of date:
plot(PTADRVACdata$DateHour, PTADRVACdata$SQL, main="Scatterplot Example for PTADRVAC data", xlab="Date", ylab="SQL", pch=19)

#continue for the other packages

Package_revision_visualisations

Christiane Heiligers

01 July 2016