Aim

Implement various EDA techniques using R

Algorithm

Step 1: Create R code chunks for coding

Step 2: Create boxplot using built-in dataset

Step 3: Create a histogram using R

Step 4: Create a scatter using R

Step 5: Create a Running time series using R

R Markdown

Boxplot:

The Youden-Beale Experiment. We have used this dataset in Chapter 2, Section 4, and in a few other places too. We need to compare here if the two virus extracts have a varying effect on the tobacco leaf or not. We have already read this dataset into R on more than one occasion. First, the boxplot is generated without the notches for yb data.frame using the boxplot function. The median for Preparation_1 certainly appears higher than for Preparation_2, see Part A of Figure 3.1. Thus, we are tempted to check whether the medians for the two preparations are significantly different with the notched boxplot. Now, the boxplot is generated to produce the notches with the option notch=TRUE. Appropriate headers for a figure are specified with the title function. Most importantly, we have used a powerful graphical technique of R through par, which is useful in setting graphical parameters

library(ACSWR)
data(yb)
par(mfrow=c(1,2))
boxplot(yb)
title("A: Boxplot for Youden-Beale Data")
boxplot(yb,notch=TRUE)
## Warning in (function (z, notch = FALSE, width = NULL, varwidth = FALSE, : some
## notches went outside hinges ('box'): maybe set notch=FALSE
title("B: Notched Boxplots Now")

summary(yb$Preparation_1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00    8.75   13.50   15.00   18.50   31.00
summary(yb$Preparation_2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    6.75   10.50   11.00   14.75   18.00

Histogram:

The histogram was invented by the eminent statistician Karl Pearson and is one of the earliest types of graphical display. It goes without saying that its origin is earlier than EDA, at least the EDA envisioned by Tukey, and yet it is considered by many EDA experts to be a very useful graphical technique, and makes it to the list of one of the very useful practices of EDA. The basic idea is to plot a bar over an interval proportional to the frequency of the observations that lie in that interval. If the sample size is moderately good in some sense and the sample is a true representation of a population, the histogram reveals the shape of the true underlying uncertainty curve. Though histograms are plotted as two-dimensional, they are essentially one-dimensional plots in the sense that the shape of the uncertainty curve is revealed without even looking at the range of the x-axis. Furthermore, the Pareto chart, stem-and-leaf plot, and a few others may be shown as special cases of the histogram.We begin with a “cooked” dataset for understanding a range of uncertainty curves. Understanding Histogram of Various Uncertainty Curves. In the dataset sample, we have data from five different probability distributions. Towards understanding the plausible distribution of the samples, we plot the histogram and see how useful it is.

data(sample)
layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), nrow=2, ncol=6, byrow=TRUE),respect=FALSE)
#matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), nrow=2, ncol=6, byrow=TRUE)
hist(sample[,1],main="Histogram of Sample 1",xlab="sample1", ylab="frequency")
hist(sample[,2],main="Histogram of Sample 2",xlab="sample2", ylab="frequency")
hist(sample[,3],main="Histogram of Sample 3",xlab="sample3", ylab="frequency")
hist(sample[,4],main="Histogram of Sample 4",xlab="sample4", ylab="frequency")
hist(sample[,5],main="Histogram of Sample 5",xlab="sample5", ylab="frequency")

### Histogram extensions

The histogram displays the frequencies over the intervals and for moderately large number of observations, it reflects the underlying probability distribution. The boxplot shows how evenly the data is distributed across the five important measures, although it cannot reveal the probability distribution in a better way than a boxplot. The boxplot helps in identifying the outliers in a more apparent way than the histogram. Hence, it would be very useful if we could bring together both these ideas in a closer way than look at them differently for outliers and probability distributions. An effective way of obtaining such a display is to place the boxplot along the x-axis of the histogram. This helps in clearly identifying outliers and also the appropriate probability distribution. The R package sfsmisc contains a function histBxp, which nicely places the boxplot along the x-axis of the histogram.

library(sfsmisc)
par(mfrow=c(1,3))
histBxp(sample$Sample_1,col="orange",boxcol="red",xlab="x")
histBxp(sample$Sample_2,col="grey",boxcol="grey",xlab="x")
histBxp(sample$Sample_3,col="brown",boxcol="brown",xlab="x")
title("Boxplot and Histogram Complementing",outer=TRUE,line=-1)

### Pareto Chart

The Pareto chart has been designed to address the implicit questions answered by the Pareto law. The common understanding of the Pareto law is that “majority resources” is consumed by a “minority user”. The most common of the percentages is the 80–20 rule, implying that 80% of the effects come from 20% of the causes. The Pareto law is also known as the law of vital few, or the 80–20 rule. The Pareto chart gives very smart answers by completely answering how much is owned by how many. Montgomery (2005), page 148, has listed the Pareto chart as one of the seven major tools of Statistical Process Control. A Pareto chart is a bar graph. The lengths of the bars represent frequency or cost (time or money), and are arranged with longest bars on the left and the shortest to the right. In this way the chart visually depicts which situations are more significant. This cause analysis tool is considered one of the seven basic quality tools.

library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
freq <- c(14,2,1,2,3,8,1)
names(freq) <- c("Contamination","Corrosion","Doping", "Metallization", "Miscellaneous", "Oxide Effect","Silicon Effect")
pareto.chart(freq)

##                 
## Pareto chart analysis for freq
##                   Frequency  Cum.Freq. Percentage Cum.Percent.
##   Contamination   14.000000  14.000000  45.161290    45.161290
##   Oxide Effect     8.000000  22.000000  25.806452    70.967742
##   Miscellaneous    3.000000  25.000000   9.677419    80.645161
##   Corrosion        2.000000  27.000000   6.451613    87.096774
##   Metallization    2.000000  29.000000   6.451613    93.548387
##   Doping           1.000000  30.000000   3.225806    96.774194
##   Silicon Effect   1.000000  31.000000   3.225806   100.000000

Results & Discussions

Various visualization techniques for data analysis are implemented in R.