Implement various EDA techniques using R
Step 1: Create R code chunks for coding
Step 2: Create boxplot using built-in dataset
Step 3: Create a histogram using R
Step 4: Create a scatter using R
Step 5: Create a Running time series using R
The Youden-Beale Experiment. We have used this dataset in Chapter 2, Section 4, and in a few other places too. We need to compare here if the two virus extracts have a varying effect on the tobacco leaf or not. We have already read this dataset into R on more than one occasion. First, the boxplot is generated without the notches for yb data.frame using the boxplot function. The median for Preparation_1 certainly appears higher than for Preparation_2, see Part A of Figure 3.1. Thus, we are tempted to check whether the medians for the two preparations are significantly different with the notched boxplot. Now, the boxplot is generated to produce the notches with the option notch=TRUE. Appropriate headers for a figure are specified with the title function. Most importantly, we have used a powerful graphical technique of R through par, which is useful in setting graphical parameters
library(ACSWR)
data(yb)
par(mfrow=c(1,2))
boxplot(yb)
title("A: Boxplot for Youden-Beale Data")
boxplot(yb,notch=TRUE)
## Warning in (function (z, notch = FALSE, width = NULL, varwidth = FALSE, : some
## notches went outside hinges ('box'): maybe set notch=FALSE
title("B: Notched Boxplots Now")
summary(yb$Preparation_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 8.75 13.50 15.00 18.50 31.00
summary(yb$Preparation_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 6.75 10.50 11.00 14.75 18.00
The histogram was invented by the eminent statistician Karl Pearson and is one of the earliest types of graphical display. It goes without saying that its origin is earlier than EDA, at least the EDA envisioned by Tukey, and yet it is considered by many EDA experts to be a very useful graphical technique, and makes it to the list of one of the very useful practices of EDA. The basic idea is to plot a bar over an interval proportional to the frequency of the observations that lie in that interval. If the sample size is moderately good in some sense and the sample is a true representation of a population, the histogram reveals the shape of the true underlying uncertainty curve. Though histograms are plotted as two-dimensional, they are essentially one-dimensional plots in the sense that the shape of the uncertainty curve is revealed without even looking at the range of the x-axis. Furthermore, the Pareto chart, stem-and-leaf plot, and a few others may be shown as special cases of the histogram.We begin with a “cooked” dataset for understanding a range of uncertainty curves. Understanding Histogram of Various Uncertainty Curves. In the dataset sample, we have data from five different probability distributions. Towards understanding the plausible distribution of the samples, we plot the histogram and see how useful it is.
data(sample)
layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), nrow=2, ncol=6, byrow=TRUE),respect=FALSE)
#matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), nrow=2, ncol=6, byrow=TRUE)
hist(sample[,1],main="Histogram of Sample 1",xlab="sample1", ylab="frequency")
hist(sample[,2],main="Histogram of Sample 2",xlab="sample2", ylab="frequency")
hist(sample[,3],main="Histogram of Sample 3",xlab="sample3", ylab="frequency")
hist(sample[,4],main="Histogram of Sample 4",xlab="sample4", ylab="frequency")
hist(sample[,5],main="Histogram of Sample 5",xlab="sample5", ylab="frequency")
### Histogram extensions
The histogram displays the frequencies over the intervals and for moderately large number of observations, it reflects the underlying probability distribution. The boxplot shows how evenly the data is distributed across the five important measures, although it cannot reveal the probability distribution in a better way than a boxplot. The boxplot helps in identifying the outliers in a more apparent way than the histogram. Hence, it would be very useful if we could bring together both these ideas in a closer way than look at them differently for outliers and probability distributions. An effective way of obtaining such a display is to place the boxplot along the x-axis of the histogram. This helps in clearly identifying outliers and also the appropriate probability distribution. The R package sfsmisc contains a function histBxp, which nicely places the boxplot along the x-axis of the histogram.
library(sfsmisc)
par(mfrow=c(1,3))
histBxp(sample$Sample_1,col="orange",boxcol="red",xlab="x")
histBxp(sample$Sample_2,col="grey",boxcol="grey",xlab="x")
histBxp(sample$Sample_3,col="brown",boxcol="brown",xlab="x")
title("Boxplot and Histogram Complementing",outer=TRUE,line=-1)
### Pareto Chart
The Pareto chart has been designed to address the implicit questions answered by the Pareto law. The common understanding of the Pareto law is that “majority resources” is consumed by a “minority user”. The most common of the percentages is the 80–20 rule, implying that 80% of the effects come from 20% of the causes. The Pareto law is also known as the law of vital few, or the 80–20 rule. The Pareto chart gives very smart answers by completely answering how much is owned by how many. Montgomery (2005), page 148, has listed the Pareto chart as one of the seven major tools of Statistical Process Control. A Pareto chart is a bar graph. The lengths of the bars represent frequency or cost (time or money), and are arranged with longest bars on the left and the shortest to the right. In this way the chart visually depicts which situations are more significant. This cause analysis tool is considered one of the seven basic quality tools.
library(qcc)
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
freq <- c(14,2,1,2,3,8,1)
names(freq) <- c("Contamination","Corrosion","Doping", "Metallization", "Miscellaneous", "Oxide Effect","Silicon Effect")
pareto.chart(freq)
##
## Pareto chart analysis for freq
## Frequency Cum.Freq. Percentage Cum.Percent.
## Contamination 14.000000 14.000000 45.161290 45.161290
## Oxide Effect 8.000000 22.000000 25.806452 70.967742
## Miscellaneous 3.000000 25.000000 9.677419 80.645161
## Corrosion 2.000000 27.000000 6.451613 87.096774
## Metallization 2.000000 29.000000 6.451613 93.548387
## Doping 1.000000 30.000000 3.225806 96.774194
## Silicon Effect 1.000000 31.000000 3.225806 100.000000
Various visualization techniques for data analysis are implemented
in R.