Introduction to GGPlot

source(file = "data/load_interaction_data.R")

Summary

summary(interaction_data)

##       ci_name                 ci_type     
##  SAN000182: 10668   application   :99483  
##  DTA000616:  7911   subapplication:20745  
##  WBA000133:  6897   storage       :11829  
##  SUB000456:  6476   computer      : 8410  
##  DTA000057:  5708   software      : 1981  
##  SBA000659:  4008   displaydevice : 1662  
##  (Other)  :105336   (Other)       : 2894  
##                     ci_subtype    service_component   interaction_id  
##  Server Based Application:48469   WBS000073:33528   SD0000001:     1  
##  Web Based Application   :39225   WBS000128:14130   SD0000002:     1  
##  Desktop Application     :24239   WBS000092: 7219   SD0000003:     1  
##  SAN                     :11675   WBS000094: 6825   SD0000004:     1  
##  Citrix                  : 3836   WBS000091: 6411   SD0000005:     1  
##  Laptop                  : 3604   WBS000089: 6011   SD0000006:     1  
##  (Other)                 :15956   (Other)  :72880   (Other)  :146998  
##            status       impact    urgency   priority 
##  Closed       :146998   1:    2   1:   32   1:    2  
##  Open - Linked:     6   2:  904   2:  950   2:  947  
##                         3:16357   3:16074   3:16340  
##                         4:75552   4:76645   4:77534  
##                         5:54189   5:53303   5:52181  
##                                                      
##                                                      
##                     category          km_number     
##  complaint              :    63   KM0002125:  5727  
##  incident               :115704   KM0001935:  1768  
##  problem                :     5   KM0002126:  1698  
##  request for change     :     1   KM0001968:  1606  
##  request for information: 31183   KM0001625:  1415  
##  service request        :    48   KM0000075:  1389  
##                                   (Other)  :133401  
##             open_time                 close_time    
##  14-1-2014 9:07  :    12   25-11-2013 12:02:    55  
##  20-1-2014 11:16 :    12   10-10-2013 12:29:    49  
##  1-10-2013 11:35 :    10   28-1-2014 14:39 :    49  
##  14-1-2014 15:25 :    10   10-10-2013 12:30:    48  
##  18-12-2013 11:45:    10   10-10-2013 12:31:    47  
##  23-12-2013 10:25:    10   10-10-2013 12:32:    46  
##  (Other)         :146940   (Other)         :146710  
##                        closure_code   first_call_resolution
##  Other                       :54487   N:53008              
##  Software                    :45573   Y:93996              
##  Referred                    : 9793                        
##  User error                  : 9052                        
##  No error - works as designed: 5975                        
##  Hardware                    : 4224                        
##  (Other)                     :17900                        
##   handle_time         related_incident
##  Min.   :    0.0              :94250  
##  1st Qu.:  176.0   #MULTIVALUE:  873  
##  Median :  324.0   IM0000220  :  239  
##  Mean   :  444.7   IM0031184  :   85  
##  3rd Qu.:  562.0   IM0016639  :   69  
##  Max.   :22530.0   IM0014444  :   54  
##                    (Other)    :51434

ci_name, ci_type, ci_subtype and service_component are four different categorical variables with a very large number of possible values. Even so, their distributions appear to be highly skewed: a small number of values occur very often, and many values ‘possibly’ occur only rarely.
interaction_id is unique, as an id should be.
Almost all interactions in this data set are closed (all but 6).
The lion's share of the interactions have an impact, urgency and priority of 4 or 5.
Almost all interactions are either incidents or requests for information.
closure_code appears to be very skewed.
handle_time appears to contain outliers (max) and impossible values (min).
related_incident contains a remarkable value (#MULTIVALUE). These may be interactions for which multiple incidents were opened.

Visual Analysis

GGPlot is based on a grammar that decomposes a graph into separate, independent elements. Three basic elements that you need to understand to get started are:
* Data
* Aesthetics (or aesthetic mappings)
* Geometric objects

Data is the core of your graph: it contains the information that you want to visualize. GGPlot assumes that your data comes in the form of a data.frame. Geometric objects are how you want to visualize the data, i.e. as a line, as points, as boxplots, as bars, and so on. Every geometric object has several aesthetics which describe how the object should be visualized. For example, a point has an ‘x-coordinate’ and a ‘y-coordinate’, but also a colour, a shape and a size.

Now when we want to visualize data with GGPlot, we will have to identify the data, mention the geometric object we want to use and map the data to the aesthetics of the geometric object, i.e. we have to connect a specific column of the data set to a specific aesthetic.

A scatterplot

A scatterplot can be a useful graph to visualize the relationship between two continuous variables. Since we don’t have two continuous variables, let’s assume that we want to visualize the relationship between the handle time of an interaction and its urgency. To do so, we use the following code:

library(ggplot2)
ggplot()+layer(data=interaction_data, 
               mapping = aes(x = urgency, y = handle_time), 
               geom="point")

This code starts by telling R that we are going to make a plot (by calling the ‘ggplot’ function). Next, we add a layer which defines the data and geometric object to use and which provides a mapping of aesthetics to the data. In this example, we simply connect the urgency variable to the x-coordinate and the handle_time variable to the y-coordinate.

A layer is another component of a graph in the GGPlot grammar that we haven’t discussed yet. GGPlot builds its graphs in layers. Each layer adds a specific geometry to the plot. We will go deeper into layers further down the road. For now it suffices to understand that layers are what make something appear in the plot. The result of the code above is shown below:

This plot reveals that as the urgency level increases, the handle time also increases and the spread also appears to increase. On the other hand, there is so much data that many points overlap and merge into a straight line, and it becomes hard to guess how much data there actually is within such a line.

A possible solution is to make small changes to the exact position of the data points, such that they no longer overlap. GGPlot supports this by means of position adjustments. Position adjustments are yet another component of a GGPlot graph and consist of a function which makes small alterations to the exact position of a data point on the plot. In this case, we use the jitter adjustment.

ggplot()+layer(data=interaction_data, 
               mapping = aes(x = urgency, y = handle_time), 
               geom="point",
               position = "jitter")

While there are many data points that still overlap, this graph already gives a better impression of the amount of data for each urgency level. The plot might even seem to suggest that the average handle_time for an urgency 4 case is higher than for an urgency 5 case. However, this plot is not clear enough to draw such conclusions.
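To make the jitter adjustment more concrete, here is a small sketch on a made-up data frame (`toy` and its values are invented for illustration and are not taken from interaction_data). It assumes a recent ggplot2 release (3.0 or later), where `layer()` expects the stat to be named explicitly and `position_jitter()` accepts `width`, `height` and `seed`. Restricting the jitter to the horizontal direction spreads the points within each urgency level while leaving the plotted handle_time values exact.

```r
library(ggplot2)

# Made-up stand-in for interaction_data (hypothetical values)
toy <- data.frame(
  urgency     = factor(c(3, 3, 4, 4, 4, 5, 5, 5)),
  handle_time = c(120, 120, 300, 300, 300, 450, 450, 450)
)

# Jitter only along x; height = 0 leaves handle_time untouched,
# seed makes the random offsets reproducible
p_jitter <- ggplot() +
  layer(data = toy,
        mapping = aes(x = urgency, y = handle_time),
        geom = "point",
        stat = "identity",
        position = position_jitter(width = 0.2, height = 0, seed = 1))

# Positions as they would be drawn, one row per point
drawn <- layer_data(p_jitter)
```

Inspecting `drawn` shows that only the x positions were moved, each by at most the chosen width.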

Let’s step back for a moment and have a look at some univariate visualizations, i.e. the distributions of urgency and handle_time. Let’s start with urgency, which is an ordered categorical variable. To visualize the distribution of a categorical variable, the classical bar-plot is well suited. This plot will draw a bar (= the geometric object) for each category of the categorical variable (= the x-coordinate). The height of the bar (= the y-coordinate) is determined by the number of observations for each category.

Note that the latter is a problem, since our data set does not store the number of observations for each category. To find out the actual number, we need to apply a function which counts the observations in each category. GGPlot provides such a function, the ‘bin’ function. The ‘bin’ function is yet another component of a plot: the statistical transformation. A statistical transformation transforms the data prior to plotting, which can change the values of an existing column or add a new column. The ‘bin’ function adds a new column called ..count.., which contains the number of times a specific category has been observed in the data set. Now we can create the bar plot:

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = urgency, y = ..count..),
                 geom = "bar",
                 stat = "bin")

Note that these different components are indeed independent of each other and that we can easily change the geometric object.

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = urgency, y = ..count..),
                 geom = "point",
                 stat = "bin")
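As a point of comparison, the sketch below repeats the counting step on a tiny made-up data frame. It assumes a newer ggplot2 release (3.3 or later), in which, as far as we know, categorical values are counted by a stat named "count" and the computed column is referenced with `after_stat(count)` rather than `..count..`. The data frame `toy` is hypothetical.

```r
library(ggplot2)

# Hypothetical data: one observation of urgency 3, two of 4, three of 5
toy <- data.frame(urgency = factor(c(3, 4, 4, 5, 5, 5)))

# Same structure as the layer() calls above, with the stat spelled "count"
p_count <- ggplot() +
  layer(data = toy,
        mapping = aes(x = urgency, y = after_stat(count)),
        geom = "bar",
        stat = "count",
        position = "identity")

# The column added by the statistical transformation
counts <- layer_data(p_count)$count
```

The `counts` vector holds one entry per category, which is exactly the bar height that the geometric object receives.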

To plot the density of the handle_time, we follow the same approach as for the urgency level. We will need to transform the data such that we know how many times a specific handle_time value occurred in the data set. However, since handle_time is a continuous variable, it probably doesn’t make sense to count how many times each specific value occurred, as most of them only occur once. Therefore, we should ‘bin’ the values into intervals and count how many observations fall in each interval. Again, this requires a statistical transformation of the data and is also handled in GGPlot by the ‘bin’ function.

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = handle_time, y = ..count..),
                 geom = "bar",
                 stat = "bin")

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

GGPlot is smart enough to detect that handle_time is a continuous variable and automatically creates the required bins. By default, GGPlot assumes that we want 30 bins of equal width. This is not a particularly smart setting, but it is done on purpose to force you to experiment. To achieve proper binning, we must set the ‘binwidth’ parameter of the statistical transformation function ‘bin’.

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = handle_time, y = ..count..),
                 geom = "bar",
                 stat = "bin",
                 stat_params = list(binwidth=60))
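The effect of the binwidth parameter can be checked on a handful of made-up values. The sketch below assumes a newer ggplot2 release (3.3 or later), where, to our knowledge, `layer()` takes a single `params` list instead of `stat_params`; the data frame `toy` and the `boundary = 0` setting are illustrative choices, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical handle times
toy <- data.frame(handle_time = c(30, 95, 110, 250, 260, 400, 610))

# Bins of width 60, with bin edges aligned to 0
p_hist <- ggplot() +
  layer(data = toy,
        mapping = aes(x = handle_time, y = after_stat(count)),
        geom = "bar",
        stat = "bin",
        position = "stack",
        params = list(binwidth = 60, boundary = 0))

# One row per bin: x is the bin centre, count the number of observations
bins <- layer_data(p_hist)
centre_spacing <- diff(bins$x)
```

The bin centres in `bins$x` lie 60 apart, and the counts add up to the number of rows in `toy`, so every observation ends up in exactly one bin.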

Statistical transformation functions are not the only components with parameters; geometric objects have them too. In fact, the parameters of a geometric object are the aesthetics it supports. The main difference is that aesthetics are mapped to the data and vary with the data, whereas parameters are set to a specific value and do not change within the plot. For example, we can change the colour of the bars by setting the fill parameter.

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = handle_time, y = ..count..),
                 geom = "bar",
                 geom_params = list(fill = "skyblue1"),
                 stat = "bin",
                 stat_params = list(binwidth=60))

If you want to let the fill color of the histogram vary with the priority level, you must not set the fill parameter to a specific color, but map it to the priority variable.

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = handle_time, y = ..count.., fill = priority),
                 geom = "bar",
                 stat = "bin",
                 stat_params = list(binwidth=60))
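The difference between setting and mapping an aesthetic becomes visible when we look at the data that a layer hands to its geometric object. The sketch below uses a made-up data frame (`toy` is hypothetical) and assumes a newer ggplot2 release (3.3 or later), where fixed values are passed through the `params` list of `layer()`.

```r
library(ggplot2)

# Hypothetical data with two priority levels
toy <- data.frame(
  handle_time = c(30, 95, 110, 250, 260, 400),
  priority    = factor(c(4, 4, 5, 5, 4, 5))
)

# fill SET to one colour: every bar gets the same value
p_set <- ggplot() +
  layer(data = toy,
        mapping = aes(x = handle_time),
        geom = "bar", stat = "bin", position = "stack",
        params = list(fill = "skyblue1", binwidth = 60))

# fill MAPPED to priority: the colour varies with the data
p_mapped <- ggplot() +
  layer(data = toy,
        mapping = aes(x = handle_time, fill = priority),
        geom = "bar", stat = "bin", position = "stack",
        params = list(binwidth = 60))

fills_set    <- unique(layer_data(p_set)$fill)
fills_mapped <- unique(layer_data(p_mapped)$fill)
```

Here `fills_set` contains only the colour we chose, while `fills_mapped` contains one colour per priority level, picked by the fill scale.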

Time for simplification

So far we have always used the ‘layer’ function to add a visualization and defined all of the required parameters separately, i.e. data, mapping, geom, geom_params, stat, stat_params and position. This can turn into very verbose code for some simple plots. Therefore, ggplot offers simplifications for adding layers by relying on defaults.

A first simplification applies when you want to add a layer with a specific geometric object in mind. For example, to add a layer with ‘points’, we can use the ‘geom_point’ function. This allows us to rewrite the first block of code below as the shorter second block:

ggplot()+layer(data=interaction_data, 
               mapping = aes(x = urgency, y = handle_time), 
               geom="point",
               position = "jitter")

ggplot()+geom_point(data=interaction_data, 
                    mapping = aes(x = urgency, y = handle_time), 
                    position = "jitter")

As you can see, geom_point automatically assumes that you want to use the ‘point’ geometry object and thus sets the ‘geom’ parameter to ‘point’ by default. However, this is not the only layer parameter which is set automatically. To see all the settings, we can save the plot and call the summary function.

p <- ggplot()+geom_point(data=interaction_data, 
                    mapping = aes(x = urgency, y = handle_time), 
                    position = "jitter")
summary(p)

## data: [x]
## faceting: facet_null() 
## -----------------------------------
## mapping: x = urgency, y = handle_time 
## geom_point: na.rm = FALSE 
## stat_identity:  
## position_jitter: (width = NULL, height = NULL)

As you might derive from the output, the geom_point also sets the statistical transformation to the identity function (i.e. f(x) = x), which results in no transformation. Actually, for all the code above where we didn’t specify the statistical transformation, the identity function was used.

Now, GGPlot does not always assume that you want the identity function by default. For example, when you use the ‘geom_bar’ function to create a layer with bars, GGPlot automatically assumes that you want to use the bin transformation function to calculate the actual heights and also sets the y-coordinate automatically to the ‘..count..’ variable which is created by the bin function.

p <- ggplot() + geom_bar(data = interaction_data, 
                 mapping = aes(x = urgency))
summary(p)

## data: [x]
## faceting: facet_null() 
## -----------------------------------
## mapping: x = urgency 
## geom_bar:  
## stat_bin:  
## position_stack: (width = NULL, height = NULL)

Note that the code above is already significantly shorter than the original code:

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = urgency, y = ..count..),
                 geom = "bar",
                 stat = "bin")
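The defaults that a geom_x function fills in can also be read directly from the layer object it returns. The sketch below assumes a ggplot2 release from 2.0 onwards, in which, as far as we know, geom_bar counts with a separate stat called "count" while geom_histogram uses "bin" (so the stat reported for geom_bar differs from the summary output shown earlier). The helper `layer_defaults` is a hypothetical name introduced only for this illustration.

```r
library(ggplot2)

# Hypothetical helper: report which geom, stat and position a layer uses
layer_defaults <- function(l) {
  c(geom     = class(l$geom)[1],
    stat     = class(l$stat)[1],
    position = class(l$position)[1])
}

point_defaults <- layer_defaults(geom_point())
bar_defaults   <- layer_defaults(geom_bar())
hist_defaults  <- layer_defaults(geom_histogram())
```

Printing the three vectors side by side shows which statistical transformation and position adjustment each shortcut selects when nothing else is specified.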

The following example shows you how you can set statistical and geometric parameters in a simplified manner. Remember the following code that we have studied above:

ggplot() + layer(data = interaction_data, 
                 mapping = aes(x = handle_time, y = ..count..),
                 geom = "bar",
                 geom_params = list(fill = "skyblue1"),
                 stat = "bin",
                 stat_params = list(binwidth=60))

This can be rewritten as:

ggplot() + geom_bar(data = interaction_data, 
                 mapping = aes(x = handle_time),
                 fill = "skyblue1",
                 binwidth=60)

Typically, however, we will define the data and aesthetic mappings not per layer, but once at the plot level. The ‘ggplot()’ function expects its first argument to be the data and its second argument to be the aesthetic mappings. So we can rewrite the previous code as follows:

ggplot(interaction_data, aes(x = handle_time)) + 
  geom_bar(fill = "skyblue1", 
           binwidth=60)

If you want to change the aesthetic mappings (or data) for a specific layer, you can always overwrite or extend the defaults in the ‘ggplot()’ function. Note however that the layer functions (or the geom_x functions) expect the aesthetic mappings as the first element and the data as the second element!

ggplot(interaction_data, aes(x = handle_time)) + 
  geom_bar(aes(fill = priority), 
           binwidth=60)

library(ggplot2)
ggplot(interaction_data, aes(x = urgency,  y = handle_time)) + geom_boxplot() + 
  geom_point(aes(colour=impact), position="jitter", alpha=0.05)

Statistical Layers

The previous simplifications always created layers based on a specific geometric object and relied on reasonable default values. One can also create a layer starting with a specific statistical transformation in mind. For example, to draw the empirical cumulative distribution of a continuous variable, one has to apply a statistical transformation that calculates, for each observed value x, the probability of observing a value of x or lower, P(X <= x). GGPlot makes it relatively easy to visualize the results of such a statistical transformation. The example below shows this for the handle_time variable.

ggplot(interaction_data, aes(x = handle_time)) + stat_ecdf()

The above plot shows that the x-axis allows for negative numbers, which do not make sense in the case of handle_time. To control the axis, we have to define yet another component of the plot, i.e. scales. We can add many different scales (i.e. axes and legends) to a plot, but in this case we require a continuous scale for the x-axis.

ggplot(interaction_data, aes(x = handle_time)) + stat_ecdf() + 
  scale_x_continuous(limits = c(0, 25000))

## Warning: Removed 2 rows containing missing values (geom_path).
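To make the transformation behind the empirical cumulative distribution concrete, the following sketch computes it with base R's `ecdf()` on a few made-up handle times (the values are hypothetical). It only illustrates the definition P(X <= x) and is not meant as a description of how stat_ecdf is implemented.

```r
# Hypothetical handle times
handle_time <- c(120, 300, 300, 450, 900)

# ecdf() returns a function that maps x to the share of observations <= x
F_hat <- ecdf(handle_time)

# Share of observations with a handle time of at most 300
p_300 <- F_hat(300)

# The same share computed by hand
manual <- mean(handle_time <= 300)
```

Three of the five values are at most 300, so both `p_300` and `manual` equal 0.6.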