Assignment

Exploratory Data Analysis in R. Choose an interesting dataset and use R graphics to describe the data. You may use base R graphics, or a graphics package of your choice. You should include at least one example of each of the following:

* histogram
* boxplot
* scatterplot

Do the graphics provide insight into any relationships in the data?

library(ggplot2)

Function to create a matrix of ggplot2 graphs

#Begin Copied Code
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
#End Copied Code

Read in Wine Quality dataset

Original source of the data set: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

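# The file is semicolon-delimited, so sep must be set explicitly
wine <- read.csv(file = "winequality-red.csv", header = TRUE, sep = ";")
# Treat quality as a categorical variable so it can group the boxplots
# and fill the histograms
wine$quality <- as.factor(wine$quality)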
str(wine)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
#View(wine)

Summary Statistics

summary(wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18

Boxplots

box1 <- ggplot(wine, aes(quality, fixed.acidity)) + geom_boxplot()
box2 <- ggplot(wine, aes(quality, volatile.acidity)) + geom_boxplot()
box3 <- ggplot(wine, aes(quality, citric.acid)) + geom_boxplot()
box4 <- ggplot(wine, aes(quality, residual.sugar)) + geom_boxplot()
box5 <- ggplot(wine, aes(quality, chlorides)) + geom_boxplot()
box6 <- ggplot(wine, aes(quality, free.sulfur.dioxide)) + geom_boxplot()
box7 <- ggplot(wine, aes(quality, total.sulfur.dioxide)) + geom_boxplot()
box8 <- ggplot(wine, aes(quality, density)) + geom_boxplot()
box9 <- ggplot(wine, aes(quality, pH)) + geom_boxplot()
box10 <- ggplot(wine, aes(quality, sulphates)) + geom_boxplot()
box11 <- ggplot(wine, aes(quality, alcohol)) + geom_boxplot()

multiplot(box1, box2, box3, box4, box5, box6, box7, box8,
          box9, box10, box11, cols=2)

Histograms

hist1 <- ggplot(wine, aes(fixed.acidity, fill = quality)) + geom_histogram(bins = 10)
hist2 <- ggplot(wine, aes(volatile.acidity, fill = quality)) + geom_histogram(bins = 10)
hist3 <- ggplot(wine, aes(citric.acid, fill = quality)) + geom_histogram(bins = 10)
hist4 <- ggplot(wine, aes(residual.sugar, fill = quality)) + geom_histogram(bins = 10)
hist5 <- ggplot(wine, aes(chlorides, fill = quality)) + geom_histogram(bins = 10)
hist6 <- ggplot(wine, aes(free.sulfur.dioxide, fill = quality)) + geom_histogram(bins = 10)
hist7 <- ggplot(wine, aes(total.sulfur.dioxide, fill = quality)) + geom_histogram(bins = 10)
hist8 <- ggplot(wine, aes(density, fill = quality)) + geom_histogram(bins = 10)
hist9 <- ggplot(wine, aes(pH, fill = quality)) + geom_histogram(bins = 10)
hist10 <- ggplot(wine, aes(sulphates, fill = quality)) + geom_histogram(bins = 10)
hist11 <- ggplot(wine, aes(alcohol, fill = quality)) + geom_histogram(bins = 10)

multiplot(hist1, hist2, hist3, hist4, hist5, hist6, hist7, hist8,
          hist9, hist10, hist11, cols = 2)
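
The eleven near-identical calls above could also be generated in a loop. The following is a minimal sketch, not part of the original analysis. `make_histograms` is a hypothetical helper name, and the toy data frame (made-up values) stands in for `wine` so the block runs on its own. It assumes a ggplot2 version that supports the `.data` pronoun (3.0.0 or later).

```r
library(ggplot2)

# Hypothetical helper: build one histogram per column named in `vars`,
# filling the bars by the column named in `fill`.
make_histograms <- function(df, vars, fill, bins = 10) {
  plots <- lapply(vars, function(v) {
    # .data[[v]] looks the column up by its name stored in v
    ggplot(df, aes(x = .data[[v]], fill = .data[[fill]])) +
      geom_histogram(bins = bins) +
      labs(x = v, fill = fill)
  })
  names(plots) <- vars
  plots
}

# Toy data standing in for the wine data frame (values are made up)
toy <- data.frame(
  a = c(1, 2, 2, 3, 3, 3, 4, 5),
  b = c(5, 4, 4, 3, 3, 2, 2, 1),
  quality = factor(c(5, 5, 6, 6, 6, 7, 7, 7))
)
hists <- make_histograms(toy, vars = c("a", "b"), fill = "quality")

# With the real data this might look like:
# hists <- make_histograms(wine, names(wine)[1:11], "quality")
# multiplot(plotlist = hists, cols = 2)
```

Because `lapply` calls the inner function once per variable, each plot keeps its own value of `v`, and the resulting list can be handed to `multiplot` through its `plotlist` argument.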

Scatterplot

ggplot(wine, aes(x = pH, y = fixed.acidity)) + geom_point()

Correlation Plot

library(corrplot) # package corrplot
# Begin copied code
# https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
# Correlation matrix of the 11 numeric columns (column 12, quality, is a factor)
M <- cor(wine[, 1:11])
corrplot(M, method = "ellipse") # plot matrix
# End copied code

Conclusions

The histograms and boxplots provide a lot of insight into the chemical characteristics of good-quality versus poor-quality red wine. The graphs suggest that high-quality red wine is higher in alcohol, citric acid, and sulphates than low-quality red wine, and that low-quality red wine is higher in volatile acidity than high-quality red wine.

Thinking about it, this makes sense: citric acid, sulphates, and alcohol all contribute to a wine's flavor, while volatile acidity is mainly acetic acid, much of it produced by spoilage bacteria, which gives wine an unpleasant vinegar-like taste at high levels.
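
The group differences described above can also be checked numerically rather than read off the plots. The following is a minimal base R sketch. `median_by_group` is a hypothetical helper name, and the toy data frame uses made-up values, not values from the wine dataset.

```r
# Hypothetical helper: median of each column in `vars`
# within each level of the grouping column `group`
median_by_group <- function(df, vars, group) {
  aggregate(df[vars], by = df[group], FUN = median)
}

# Toy data standing in for the wine data frame (values are made up)
toy <- data.frame(
  alcohol = c(9.0, 9.4, 10.0, 10.6, 11.5, 12.1),
  volatile.acidity = c(0.90, 0.80, 0.60, 0.50, 0.40, 0.30),
  quality = factor(c(3, 3, 5, 5, 8, 8))
)
med <- median_by_group(toy, c("alcohol", "volatile.acidity"), "quality")
print(med)

# With the real data this might look like:
# median_by_group(wine,
#                 c("alcohol", "citric.acid", "sulphates", "volatile.acidity"),
#                 "quality")
```

The result has one row per quality level, so a trend across levels shows up as a column of medians that rises or falls from top to bottom.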

Scatterplots are used to look for relationships between two variables. I found the scatterplots hard to read, so I looked for another way to visualize the correlations and came across the corrplot library, which makes it very easy to see which variables are correlated with each other. For example, pH is negatively correlated with fixed acidity, citric acid, and density. This makes sense for the acids, because the more acidic something is, the lower its pH. The plot also shows that fixed acidity is positively correlated with citric acid and density. The final correlation that stands out is that alcohol is negatively correlated with density, which makes sense because alcohol is less dense than water.
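
The correlations picked out by eye from the plot can also be listed as numbers. The following is a minimal base R sketch. `strongest_correlations` is a hypothetical helper name, and the toy data frame uses made-up values, so the printed numbers say nothing about the wine data.

```r
# Hypothetical helper: list variable pairs sorted by absolute correlation.
# M is a square correlation matrix with row and column names.
strongest_correlations <- function(M, n = 5) {
  # Upper triangle only: each pair appears once and the diagonal is skipped
  idx <- which(upper.tri(M), arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(M)[idx[, "row"]],
    var2 = colnames(M)[idx[, "col"]],
    r = M[idx],
    stringsAsFactors = FALSE
  )
  pairs <- pairs[order(-abs(pairs$r)), ]
  rownames(pairs) <- NULL
  head(pairs, n)
}

# Toy data (made up): y is an exact multiple of x
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, 4, 1, 2)
)
top <- strongest_correlations(cor(toy), n = 2)
print(top)

# With the real data this might look like:
# strongest_correlations(cor(wine[, 1:11]))
```

Sorting by absolute value keeps strong negative correlations, such as the one between pH and fixed acidity, alongside the strong positive ones.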