This is a quick analysis and visualisation of package revision data. The data is in csv form, saved in a folder called “data”, one level up from the working directory of this R work. The relative path is from the R-script in a ‘code’ folder, one level up into the ‘data’ folder.
The working directory for this code to run properly needs to be set relative to where the data is stored. If the data is stored in the same folder as this document is saved and subsequentyl executed in R or Rstudio, ensure that the “path_to_file” variable is set accordingly. To see your working directory, type “getwd()” in the R console. To set your working directory, type “setwd(”path_to_working_directory_required“)” and complete the biut between braces accordingly.
We need to install packages to plot the data in a grid format. For this work, we’ll use the “lattice” package.
library(stringr)
library(lattice)
The headers for the columns are given explicitly, to ensure the correct headers are used in the analysis. In this case, when reading the data in, the argument “headers” is set to FALSE.
columns <- c("SMF", "SUBSYS", "COUNT", "PLAN", "YEAR", "MONTH", "DAY", "HOUR", "TRAN", "COLLECTION", "PACKAGE", "SQL", "DB2ELAP", "DB2TCB")
path_to_file = "../data/Packrev.csv"
data <- read.csv(path_to_file, header = FALSE, skip = 1, col.names = columns, stringsAsFactors = FALSE)
Verifying the data has read in correctly, accessing only the first 10 rows of data (note: if you want to see the output of the command, uncomment line 2 of the code below):
data_test = data[1:5,]
#data_test
The data has mixed types, some columns are numbers, others are date fields and the remaining are text. The data type in these columns needs to be properly set in order to get appropriate plots. Running the code below shows that some of the text columns (character, or “chr” type) have leading and trailing white spaces. The date columns are read in as integers and that needs to be converted to date-time objects for R. The Packages, collection and SQL columns need to be converted to factors
str(data)
## 'data.frame': 11196 obs. of 14 variables:
## $ SMF : chr " XE21 " " XE21 " " XE21 " " XE21 " ...
## $ SUBSYS : chr " S10A " " S10A " " S10A " " S10A " ...
## $ COUNT : int 341 341 1 1 1 1 3 1 1 12 ...
## $ PLAN : chr " FL2A1700 " " FL2A1700 " " GLBD1700 " " GLBD1700 " ...
## $ YEAR : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ MONTH : int 7 7 7 7 7 7 7 7 7 7 ...
## $ DAY : int 28 28 28 28 28 28 28 28 28 28 ...
## $ HOUR : int 12 12 12 12 12 12 12 12 12 12 ...
## $ TRAN : int 3406 3406 1 1 1 1 3 1 1 12 ...
## $ COLLECTION: chr " AUTH1700_COM " " FL2A1700_CHO " " AUTH1700_COM " " AUTH1700_COM " ...
## $ PACKAGE : chr " PTADRVAC " " FLCMMAGT " " PTADRVAC " " PTADRVU@ " ...
## $ SQL : int 1702 8084 2 2 2 3 4 2 3 48 ...
## $ DB2ELAP : num 0.6231 6.4295 0.0102 0.3388 0.0171 ...
## $ DB2TCB : num 0.475548 0.981472 0.000722 0.001568 0.00063 ...
data$Date <- paste(data$YEAR, data$MONTH, data$DAY, sep ="/")
#first combine the columns into one
data$Date <- as.Date(data$Date, format="%Y/%m/%d")
#adding the hour into the data time column
data$DateHour <- paste(paste(data$YEAR, data$MONTH, data$DAY, sep = "/"),data$HOUR, sep = " ")
#convert the column into a Date/Time column
data$DateHour <- as.POSIXct(data$DateHour, format="%Y/%m/%d %H", tz = "MST") + (0:3) * 3600
str(data$DateHour[1:10])
## POSIXct[1:10], format: "2014-07-28 12:00:00" "2014-07-28 13:00:00" ...
some variables we need to convert to factors to use in the plots:
data$CollectionFactors <- as.factor(data$COLLECTION)
data$SQLFactors <- as.factor(data$SQL)
data$PackageFactors <- as.factor(str_trim(data$PACKAGE, side = c("both")))
Subsetting the data for only the rows containing the package of interest We first find out which packages appear the most in the data
PackageFrequency <- table(data$PackageFactors)
barplot(PackageFrequency)
OrderedPackageFrequency <- PackageFrequency[order(PackageFrequency, decreasing = TRUE)]
#Packages of interest: selecting the top 5
packagesOfInterest <- names(OrderedPackageFrequency[1:5])
#packagesOfInterest names have been changed to accomodate the use of # and @ in the package name.
#All five top:
PTADRVACdata <- data[which(data$PackageFactors == packagesOfInterest[1]),]
LCA__HCAdata <- data[which(data$PackageFactors == packagesOfInterest[2]),]
SA_CDBdata <- data[which(data$PackageFactors == packagesOfInterest[3]),]
PRA__HCAdata <- data[which(data$PackageFactors == packagesOfInterest[4]),]
PTASSPIdata <- data[which(data$PackageFactors == packagesOfInterest[5]),]
#putting them together:
SubsetData <- rbind(PTADRVACdata, LCA__HCAdata, SA_CDBdata, PRA__HCAdata, PTASSPIdata)
Using the Lattice package: Plotting data in SQL column as a function of Date
#SQL data for the PTADRVAC package as a function of date:
plot(PTADRVACdata$DateHour, PTADRVACdata$SQL, main="Scatterplot Example for PTADRVAC data", xlab="Date", ylab="SQL", pch=19)
#continue for the other packages