Exploratory Data Analysis in R. Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following:
* histogram
* boxplot
* scatterplot
Do the graphics provide insight into any relationships in the data?
library(ggplot2)
#Begin Copied Code
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
#End Copied Code
The original source of the data set is: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
wine <- read.csv(file= "winequality-red.csv" , header=TRUE, sep=";" )
wine$quality <-as.factor(wine$quality)
str(wine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
#View(wine)
summary(wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
box1 <- ggplot(wine, aes(quality, fixed.acidity)) + geom_boxplot()
box2 <-ggplot(wine, aes(quality, volatile.acidity)) + geom_boxplot()
box3 <-ggplot(wine, aes(quality, citric.acid)) + geom_boxplot()
box4 <-ggplot(wine, aes(quality, residual.sugar)) + geom_boxplot()
box5 <-ggplot(wine, aes(quality, chlorides)) + geom_boxplot()
box6 <-ggplot(wine, aes(quality, free.sulfur.dioxide)) + geom_boxplot()
box7 <-ggplot(wine, aes(quality, total.sulfur.dioxide)) + geom_boxplot()
box8 <-ggplot(wine, aes(quality, density)) + geom_boxplot()
box9 <-ggplot(wine, aes(quality, pH)) + geom_boxplot()
box10 <- ggplot(wine, aes(quality, sulphates)) + geom_boxplot()
box11 <- ggplot(wine, aes(quality, alcohol)) + geom_boxplot()
multiplot(box1, box2, box3, box4, box5, box6, box7, box8,
box9, box10, box11, cols=2)
hist1 <- ggplot ( wine, aes (fixed.acidity, fill = quality) ) + geom_histogram(bins=10)
hist2 <- ggplot ( wine, aes (volatile.acidity, fill = quality) ) + geom_histogram(bins=10)
hist3 <- ggplot ( wine, aes (citric.acid, fill = quality) ) + geom_histogram(bins=10)
hist4 <- ggplot ( wine, aes (residual.sugar, fill = quality) ) + geom_histogram(bins=10)
hist5 <- ggplot ( wine, aes (chlorides, fill = quality) ) + geom_histogram(bins=10)
hist6 <- ggplot ( wine, aes (free.sulfur.dioxide, fill = quality) ) + geom_histogram(bins=10)
hist7 <- ggplot ( wine, aes (total.sulfur.dioxide, fill = quality) ) + geom_histogram(bins=10)
hist8 <- ggplot ( wine, aes (density, fill = quality) ) + geom_histogram(bins=10)
hist9 <- ggplot ( wine, aes (pH, fill = quality) ) + geom_histogram(bins=10)
hist10 <- ggplot ( wine, aes (sulphates, fill = quality) ) + geom_histogram(bins=10)
hist11 <- ggplot ( wine, aes (alcohol, fill = quality) ) + geom_histogram(bins=10)
multiplot(hist1, hist2,hist3,hist4,hist5,hist6,hist7,hist8,hist9,hist10,
hist11, cols=2)
ggplot(wine, aes( x= pH, y= fixed.acidity)) +geom_point()
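With 1,599 points, a plain scatterplot can suffer from overplotting. One possible refinement, sketched here, is to draw the points semi-transparently and overlay a least-squares line. The helper name `scatter_with_trend` and the small `demo` data frame are illustrative assumptions; the `demo` values are made up and are not taken from the wine data.

```r
library(ggplot2)

# Hypothetical helper (not part of the original analysis): scatterplot of two
# numeric columns with semi-transparent points and a fitted straight line.
scatter_with_trend <- function(df, xvar, yvar) {
  ggplot(df, aes(x = .data[[xvar]], y = .data[[yvar]])) +
    geom_point(alpha = 0.3) +                                  # transparency reduces overplotting
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # linear trend line
    labs(x = xvar, y = yvar)
}

# Toy data, invented for illustration only (not real wine measurements).
demo <- data.frame(pH            = c(3.0, 3.2, 3.3, 3.5, 3.6),
                   fixed.acidity = c(11.0, 9.5, 8.8, 7.4, 7.0))
p <- scatter_with_trend(demo, "pH", "fixed.acidity")
print(p)
```

On the real data the equivalent call would be `scatter_with_trend(wine, "pH", "fixed.acidity")`.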
library(corrplot) #package corrplot
#Copied Code
#https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
M <- cor(wine[,1:11])
corrplot(M, method = "ellipse") #plot matrix
#end copied code
The histograms and boxplots provide a lot of insight into the chemical characteristics of good-quality versus poor-quality red wine. It appears from the graphs that high-quality red wine is higher in alcohol, citric acid and sulphates than low-quality red wine, and that low-quality red wine is higher in volatile acidity than high-quality red wine.
Thinking about it, this makes sense: citric acid, sulphates and alcohol are among the components that shape a wine's flavor, while volatile acidity is mostly acetic acid, which at high levels gives wine an unpleasant vinegar taste.
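The visual impression from the boxplots can be cross-checked numerically by computing the median of each variable per quality level. The following is a minimal sketch; the helper name `median_by_quality` is an assumption, and the `demo` data frame contains made-up values used only to show the shape of the result.

```r
# Hypothetical helper: median of selected numeric columns per quality level.
median_by_quality <- function(df, vars, group = "quality") {
  aggregate(df[vars], by = df[group], FUN = median)
}

# Toy data, invented for illustration only (not real wine measurements).
demo <- data.frame(
  quality          = factor(c(3, 3, 5, 5, 7, 7)),
  alcohol          = c(9.0, 9.4, 9.8, 10.2, 11.4, 12.0),
  volatile.acidity = c(0.90, 0.80, 0.60, 0.56, 0.40, 0.36)
)
medians <- median_by_quality(demo, c("alcohol", "volatile.acidity"))
print(medians)
```

On the real data, a call such as `median_by_quality(wine, c("alcohol", "citric.acid", "sulphates", "volatile.acidity"))` would give one row per quality level, which makes it easier to see whether the medians actually rise or fall with quality.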
Scatterplots are used to find correlations between two variables. I found it very hard to read the scatterplots, so I looked for another method of visualizing the correlations. I stumbled upon the corrplot library. Using this library, it is very easy to see which variables are correlated with each other. Among the correlations it shows, pH is negatively correlated with fixed acidity, citric acid and density. This makes sense, because the more acidic something is, the lower its pH value. This visualization also shows that fixed acidity is positively correlated with citric acid and density. The final correlation that stands out is that alcohol is negatively correlated with density. This makes sense, since alcohol is less dense than water.
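The correlations read off the corrplot can also be listed as numbers by ranking variable pairs by the absolute value of their Pearson correlation. This is only a sketch: the helper name `top_correlations` is an assumption, and the `demo` data frame is invented for illustration.

```r
# Hypothetical helper: variable pairs ordered by absolute Pearson correlation.
top_correlations <- function(df, n = 5) {
  m <- cor(df)
  idx <- which(upper.tri(m), arr.ind = TRUE)  # each unordered pair once
  pairs <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx]
  )
  pairs <- pairs[order(-abs(pairs$r)), ]      # strongest first, sign ignored
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Toy data, invented for illustration only (not real wine measurements).
demo <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 10),  # exactly twice a, so perfectly correlated with it
  c = c(5, 3, 4, 1, 2)
)
top <- top_correlations(demo, n = 2)
print(top)
```

On the real data, `top_correlations(wine[, 1:11])` would use the same eleven numeric columns that were passed to `cor()` for the corrplot.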