Instructions.

Create an R markdown document (.Rmd) and from that document, produce a knitted html document that includes the following content under each of the listed headers.

Introduction

Brief description of your chosen dataset and the question you are trying to answer.

Methods

Explain the steps you will take in your analysis.

  • How will you ‘get to know’ your dataset?
  • What graph will you use to answer your question?
  • Will you need to wrangle (i.e. rearrange) your data?
  • If so, how will you do it?

Results

Your informative (and beautiful) graph.

Conclusion

Answer your question and use your results as evidence.

In other words, what is the story in your graph?

Reflection

Are you happy with your work?

  • What was especially challenging?
  • What did you learn?
  • Anything else I should know?

Introduction.

This dataset consists of chemical measurements of wine samples from the same region of Italy but from three different cultivars (cultivated varieties), labeled simply 1 (n = 59), 2 (n = 71), and 3 (n = 48). The quantities of 13 chemical constituents were measured for each wine sample. Units were not provided, as this dataset is an incomplete version of the original, which included 30 chemical constituents.

Constituents:

  • Alcohol
  • Malic acid
  • Ash
  • Alkalinity of ash
  • Magnesium
  • Total phenols
  • Flavonoids
  • Nonflavonoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

Using this dataset, I will address the question: How do the quantities of chemical constituents differ among wine cultivars grown in the same region?

Methods.

Because this dataset has a large number of response variables, I will focus on a subset of fewer, likely correlated variables: alcohol, malic acid, ash, and alkalinity of ash (each of which affects pH); total phenols, flavonoids, and nonflavonoid phenols (plant antioxidants); and hue and color intensity (relating to color). This will require subsetting the dataset to include only the variables of interest.
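The subsetting step could be organized around a named list of variable groups. The following is a minimal sketch rather than the code used for this report: `var.groups` and `subset_group()` are hypothetical names, and the column names are assumed to match those assigned to the data frame in the Results section.

```r
# Hypothetical grouping of the variables of interest (names are illustrative).
var.groups <- list(
  pH           = c("Alcohol", "Malic_acid", "Ash", "Alkalinity_of_ash"),
  antioxidants = c("Total_phenols", "Flavonoids", "Nonflavanoid_phenols"),
  color        = c("Color_intensity", "Hue")
)

# Return the grouping column plus the columns belonging to one variable group.
subset_group <- function(data, group, id = "Cultivar") {
  stopifnot(group %in% names(var.groups))
  cols <- c(id, var.groups[[group]])
  missing.cols <- setdiff(cols, names(data))
  if (length(missing.cols) > 0) {
    stop("Columns not found: ", paste(missing.cols, collapse = ", "))
  }
  data[, cols, drop = FALSE]
}
```

Under these assumptions, `subset_group(wine, "pH")` would return Cultivar together with alcohol, malic acid, ash, and alkalinity of ash, and each group could then be passed to its own scatterplot matrix.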

I will visualize these response variables in three ways. First, I will use ggpairs() from the GGally package to create a scatterplot matrix of the variables. If the matrix contains too many variables to show their relationships clearly, I will create a separate matrix for each of the three groupings above (pH, antioxidants, and color). I will initially use symbols to differentiate between wine cultivars, but if the points are too dense to distinguish, I will try colors instead.

Second, I will create a three-dimensional plot for the pH and antioxidant subgroups, with symbols differentiating between wine cultivars and a color scale to represent hue, potentially with an opacity filter to represent color intensity. This will utilize the scatterplot3d() function from the scatterplot3d package.

Third, I will construct profiles of the variables for each wine cultivar. Side-by-side bar charts will likely be the best option for visualizing differences. This may require rearranging the dataset or creating a separate profile for each response variable subgroup (primarily pH and antioxidants, in this case).
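The rearranging mentioned here usually means going from wide format (one column per constituent) to long format (one row per cultivar and constituent). The sketch below illustrates the idea using the per-cultivar means reported in Tables 1 to 3 of the Results; `wine.means`, `wine.long`, and `plot_profile()` are hypothetical names, and the plotting function assumes the ggplot2 package is installed.

```r
# Per-cultivar means copied from the summary tables in the Results section.
wine.means <- data.frame(
  Cultivar          = factor(c("1", "2", "3")),
  Alcohol           = c(13.74, 12.28, 13.15),
  Malic_acid        = c(2.011, 1.933, 3.334),
  Ash               = c(2.456, 2.245, 2.437),
  Alkalinity_of_ash = c(17.04, 20.24, 21.42)
)

# Rearrange from wide to long with base R:
# one row per cultivar-constituent pair.
value.cols <- setdiff(names(wine.means), "Cultivar")
wine.long <- data.frame(
  Cultivar    = rep(wine.means$Cultivar, times = length(value.cols)),
  Constituent = rep(value.cols, each = nrow(wine.means)),
  Mean        = unlist(wine.means[value.cols], use.names = FALSE)
)

# Bar chart with one panel per constituent, each with its own y-axis,
# because the constituents are on very different scales.
plot_profile <- function(long.data) {
  ggplot2::ggplot(long.data,
                  ggplot2::aes(x = Cultivar, y = Mean, fill = Cultivar)) +
    ggplot2::geom_col() +
    ggplot2::facet_wrap(~ Constituent, scales = "free_y") +
    ggplot2::labs(y = "Mean value")
}
```

Calling `plot_profile(wine.long)` would draw the panels. With the full dataset, `tidyr::pivot_longer()` is one way to do the same rearranging before summarising.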

Results.

# searching data() in R for a desired dataset.
# data()

# Data chosen from recommendation list on assignment page.

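# Loading packages used below
library(knitr)          # kable()
library(GGally)         # ggpairs()
library(ggplot2)        # ggplot()
library(dplyr)          # group_by(), summarise()
library(scatterplot3d)  # scatterplot3d()
library(PNWColors)      # pnw_palette()

# Loading data
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")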

# Examining data
head(wine)
summary(wine)
dim(wine)

# Column names were not imported when dataset was loaded. 

# Changing column names using order provided by original website.
wine.names <- c("Cultivar", 
                "Alcohol", 
                "Malic_acid", 
                "Ash", 
                "Alkalinity_of_ash",
                "Magnesium",
                "Total_phenols",
                "Flavonoids",
                "Nonflavanoid_phenols",
                "Proanthocyanins",
                "Color_intensity",
                "Hue",
                "OD280_OD315_of_diluted_wines",
                "Proline")
colnames(wine) <- wine.names

#double-checking
head(wine)

# converting Cultivar from numeric variable to factor
wine$Cultivar <- as.factor(wine$Cultivar)

# subset with variables of interest
wine1 <- wine[, c("Cultivar", "Alcohol", "Malic_acid", "Ash", "Alkalinity_of_ash",
                  "Total_phenols", "Flavonoids", "Nonflavanoid_phenols",
                  "Color_intensity", "Hue")]
head(wine1)
dim(wine1)

#maybe that's still too many variables. narrowing it down to 4 continuous response variables
wine2 <- wine[,c("Cultivar", "Alcohol", "Malic_acid", "Ash", "Alkalinity_of_ash")]

After exploring the data, I realized my initial set of variables was too large for any realistic visualization. I narrowed the responses to alcohol, malic acid, ash, and alkalinity of ash, grouped by Cultivar (the main variable of interest). I plan to include more of my original variables as I work on the plots.

# n, mean, median, quartiles, min, and max by Cultivar 1
kable(summary(subset(wine2, Cultivar == "1")))
Cultivar   Alcohol          Malic_acid       Ash              Alkalinity_of_ash
1:59       Min.   :12.85    Min.   :1.350    Min.   :2.040    Min.   :11.20
2: 0       1st Qu.:13.40    1st Qu.:1.665    1st Qu.:2.295    1st Qu.:16.00
3: 0       Median :13.75    Median :1.770    Median :2.440    Median :16.80
NA         Mean   :13.74    Mean   :2.011    Mean   :2.456    Mean   :17.04
NA         3rd Qu.:14.10    3rd Qu.:1.935    3rd Qu.:2.615    3rd Qu.:18.70
NA         Max.   :14.83    Max.   :4.040    Max.   :3.220    Max.   :25.00
#Tables need polishing and better labels.

Table 1. Data summary of wine constituents for Cultivar 1 (n, mean, median, quartiles, minimum, and maximum).

# n, mean, median, quartiles, min, and max by Cultivar 2
kable(summary(subset(wine2, Cultivar == "2")))
Cultivar   Alcohol          Malic_acid       Ash              Alkalinity_of_ash
1: 0       Min.   :11.03    Min.   :0.740    Min.   :1.360    Min.   :10.60
2:71       1st Qu.:11.91    1st Qu.:1.270    1st Qu.:2.000    1st Qu.:18.00
3: 0       Median :12.29    Median :1.610    Median :2.240    Median :20.00
NA         Mean   :12.28    Mean   :1.933    Mean   :2.245    Mean   :20.24
NA         3rd Qu.:12.52    3rd Qu.:2.145    3rd Qu.:2.420    3rd Qu.:22.00
NA         Max.   :13.86    Max.   :5.800    Max.   :3.230    Max.   :30.00

Table 2. Data summary of wine constituents for Cultivar 2 (n, mean, median, quartiles, minimum, and maximum).

# n, mean, median, quartiles, min, and max by Cultivar 3
kable(summary(subset(wine2, Cultivar == "3")))
Cultivar   Alcohol          Malic_acid       Ash              Alkalinity_of_ash
1: 0       Min.   :12.20    Min.   :1.240    Min.   :2.100    Min.   :17.50
2: 0       1st Qu.:12.80    1st Qu.:2.587    1st Qu.:2.300    1st Qu.:20.00
3:48       Median :13.16    Median :3.265    Median :2.380    Median :21.00
NA         Mean   :13.15    Mean   :3.334    Mean   :2.437    Mean   :21.42
NA         3rd Qu.:13.51    3rd Qu.:3.958    3rd Qu.:2.603    3rd Qu.:23.00
NA         Max.   :14.34    Max.   :5.650    Max.   :2.860    Max.   :27.00

Table 3. Data summary of wine constituents for Cultivar 3 (n, mean, median, quartiles, minimum, and maximum).

#First attempt at scattermatrix - kept for reference as I figure out customizing colors

#names(pnw_palettes)

#pal1 <- pnw_palette("Starfish", n = 3)
#pal1


#ggpairs(data = wine2, mapping = aes(colour = Cultivar),
#        upper = list(continuous = function(data, mapping, ...) {
#          ggally_cor(data = data, mapping = mapping) + scale_fill_manual(values = pal1)}),
#        lower = list(continuous = function(data, mapping, ...) {
#          ggally_smooth(data = data, mapping = mapping, alpha = .2) + scale_colour_manual(values = pal1)}),
#        diag = list(continuous = function(data, mapping, ...) {
#          ggally_barDiag(data = data, mapping = mapping, alpha = .5) + scale_fill_manual(values = pal1)}))


#?mapping
#?ggpairs

## Could not get all plots in the matrix to take on the customized color palette. Moving on for now, will circle back.
#scatterplot matrix using smaller subset
plot1 <- ggpairs(wine2, mapping = ggplot2::aes(pch = Cultivar, color = Cultivar), progress = FALSE)
plot1

# Pearson correlation coefficient tests to retrieve p-values associated with R values in plot1
cor.test(wine2$Malic_acid, wine2$Alcohol)
cor.test(wine2$Ash, wine2$Alkalinity_of_ash)

Figure 1. Scatterplot matrix for the wine subset (cultivar, alcohol, malic acid, ash, alkalinity of ash). Panels on the diagonal show the distribution of each variable. Panels below the diagonal show histograms by cultivar and scatterplots for pairs of variables. Panels above the diagonal show the Pearson correlation coefficient for each pair of continuous variables, and boxplots of each continuous variable by cultivar.

Most pairs of variables are weakly correlated. The exception is alcohol and malic acid, which show no significant correlation (R = 0.094, t(176) = 1.258, p = 0.210). The strongest correlation is between ash and alkalinity of ash (R = 0.443, t(176) = 6.562, p < 0.001).
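The t statistics reported above follow directly from the correlation coefficients and the sample size. As a rough check, assuming the standard relationship t = R * sqrt(n - 2) / sqrt(1 - R^2) with n - 2 degrees of freedom, the values can be approximately reproduced by hand. `cor_t_test()` is a hypothetical helper, and small differences from the cor.test() output are expected because R is rounded to three decimals here.

```r
# Convert a Pearson correlation r from n paired observations
# into its t statistic, degrees of freedom, and two-sided p-value.
cor_t_test <- function(r, n) {
  df <- n - 2
  t.stat <- r * sqrt(df) / sqrt(1 - r^2)
  p.value <- 2 * pt(-abs(t.stat), df = df)
  c(t = t.stat, df = df, p = p.value)
}

cor_t_test(r = 0.094, n = 178)  # alcohol vs. malic acid
cor_t_test(r = 0.443, n = 178)  # ash vs. alkalinity of ash
```

With n = 178 wine samples, both tests have 176 degrees of freedom, which matches the values reported above.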

Some loose grouping by cultivar can be seen for most variables, particularly for alcohol on its own and for alcohol plotted against malic acid. Further analysis would be needed to determine how distinct these three groups are.

# Creating vector for pch to assign shapes by Cultivar
# (the same three codes are reused in the legend so the symbols match)
cultivar.shapes <- c(16, 17, 18)
shapes <- cultivar.shapes[as.numeric(wine$Cultivar)]

# 3D scatterplot of alcohol, malic acid, and alkalinity of ash, with cultivar by shape and color
scatterplot3d(x = wine$Alcohol,
              y = wine$Malic_acid,
              z = wine$Alkalinity_of_ash,
              pch = shapes,
              color = as.numeric(wine$Cultivar),
              xlab = "Alcohol",
              ylab = "Malic acid",
              zlab = "Alkalinity of ash",
              type = "h")
# One legend entry per cultivar level, using the colors and shapes plotted above
legend("topright", legend = levels(wine$Cultivar), col = seq_along(levels(wine$Cultivar)), pch = cultivar.shapes, title = "Cultivar")

#need to change Malic acid axis label

Figure 2. Three-dimensional scatterplot of wine alcohol, malic acid, and alkalinity of ash. Different colors and shapes correspond to different cultivars, as shown in the legend. Vertical lines are drawn beneath the points to make their positions easier to judge.

The 3D scatterplot further shows the clusters emerging in the alcohol-malic acid plot in Figure 1. Cultivar 1 appears to tend toward higher alcohol and lower malic acid and alkalinity of ash. Cultivars 2 and 3 have a similar spread of alkalinity and possibly alcohol, but cultivar 2 tends toward lower malic acid than cultivar 3.

#profile plots, attempt 1
wine.stats <- wine |>
  group_by(Cultivar) |>
  summarise(
    count = n(),
    meanAlc = mean(Alcohol,na.rm=TRUE),
    sdAlc = sd(Alcohol, na.rm=TRUE),
    seAlc = sdAlc/sqrt(count),
    ci95lower = meanAlc - seAlc*1.96,
    ci95upper = meanAlc + seAlc*1.96
  )

ggplot(data = wine.stats, aes(x = Cultivar, y = meanAlc)) +
  geom_point()
# I had some trouble getting the profile plots to work before the draft deadline. Will work on this for the final assignment. 
pal1 <- pnw_palette("Sunset", 178, type = "continuous")
# pal1
ggplot(data = wine, aes(x = Alcohol, y = Alkalinity_of_ash, color = Malic_acid, shape = Cultivar)) +
  geom_point() 

#?pnw_palette
# still can't get my custom palettes to work, despite having used PNW Colors in many previous classes.

Figure 3. Alcohol, alkalinity of ash, and malic acid contents of wine by cultivar. Symbols indicate wine cultivar, and colors show malic acid concentration.

Conclusion.

Wine constituents, in particular those that contribute to pH, appear to differ slightly among the three cultivars, despite all being grown in the same region of Italy. The cultivars differ more in their malic acid content than in ash and its alkalinity; the 3D plot best shows the clusters of points for each cultivar. Cultivar 1 has higher alcohol and lower malic acid and alkalinity than cultivars 2 and 3, while cultivars 2 and 3 differ from each other primarily in their malic acid content.

This dataset contains many variables, and including more of them would likely shed more light on how these three cultivars differ. Furthermore, these groupings are based only on visual inspection; statistical analysis should follow to determine whether the differences are significant.
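One possible form for that follow-up analysis is a one-way ANOVA for each constituent, with a Tukey post hoc comparison and a rank-based Kruskal-Wallis test as a check. The sketch below is an assumption about how this could be done, not an analysis performed for this report. `compare_groups()` is a hypothetical helper, and the demonstration uses R's built-in iris data as a stand-in so that the block runs without downloading the wine data.

```r
# Compare one response across groups with a one-way ANOVA,
# Tukey's HSD for pairwise differences, and a Kruskal-Wallis test.
# The grouping column must be a factor.
compare_groups <- function(data, response, group = "Cultivar") {
  model.formula <- reformulate(group, response = response)
  anova.fit <- aov(model.formula, data = data)
  list(
    anova   = summary(anova.fit),
    tukey   = TukeyHSD(anova.fit),
    kruskal = kruskal.test(model.formula, data = data)
  )
}

# Stand-in demonstration with built-in data (not the wine data).
demo <- compare_groups(iris, response = "Sepal.Length", group = "Species")
demo$anova
demo$kruskal
```

For the wine data, a call such as `compare_groups(wine, "Malic_acid")` would test whether mean malic acid differs among the three cultivars, provided Cultivar has already been converted to a factor.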

Reflection.

I struggled with including as many variables as I had originally intended. I wanted to do multiple plots because I have previously used scatterplot matrices and wanted to try new forms, but they’re very helpful visualization tools, so I did both.

I could not get the profile plots to work though. I think I was making some headway but ran out of time for the draft.

I learned a lot about 3D plotting, but there is still a lot to unpack there. I wasn’t able to get axis rotation to work, which would help with seeing groups/patterns.
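For a static plot, one option is the `angle` argument of scatterplot3d(), which sets the angle between the x and y axes and so changes the viewing direction. The following is a minimal sketch under that assumption: `pch_by_level()` and `plot_angles()` are hypothetical helpers, the column names are those used in this report, and the scatterplot3d package must be installed for `plot_angles()` to run.

```r
# Map each factor level to a plotting symbol (pch code).
pch_by_level <- function(f, codes = c(16, 17, 18)) {
  f <- as.factor(f)
  stopifnot(nlevels(f) <= length(codes))
  codes[as.integer(f)]
}

# Draw the same 3D scatterplot once per viewing angle.
plot_angles <- function(data, angles = c(20, 40, 60, 80)) {
  for (a in angles) {
    scatterplot3d::scatterplot3d(
      x = data$Alcohol,
      y = data$Malic_acid,
      z = data$Alkalinity_of_ash,
      pch = pch_by_level(data$Cultivar),
      color = as.integer(as.factor(data$Cultivar)),
      angle = a,
      main = paste("angle =", a),
      xlab = "Alcohol", ylab = "Malic acid", zlab = "Alkalinity of ash"
    )
  }
  invisible(angles)
}
```

Calling `plot_angles(wine)` would produce one plot per angle, which makes it easier to pick the view that separates the cultivars best.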

I also could not get color gradients for additional variables to work, or even customize the colors of the points. While this is not a huge deal, nice colors in plots go a long way in my opinion.
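A likely explanation is that creating a palette object does not change a plot by itself; the colors have to be handed to a ggplot2 scale. The sketch below shows one way this could work for a continuous variable using `scale_color_gradientn()`. The hex values in `ramp.colors` are placeholders built with base R, `plot_gradient()` is a hypothetical helper, and the column names are those used in this report.

```r
# Stand-in color ramp built with base R; the hex values are placeholders.
# A PNWColors palette (a vector of hex colors) could be passed instead.
ramp.colors <- grDevices::colorRampPalette(c("#2c7bb6", "#ffffbf", "#d7191c"))(100)

# Continuous variable mapped to a custom color gradient; cultivar mapped to shape.
plot_gradient <- function(data, palette = ramp.colors) {
  ggplot2::ggplot(data,
                  ggplot2::aes(x = Alcohol, y = Alkalinity_of_ash,
                               color = Malic_acid, shape = Cultivar)) +
    ggplot2::geom_point(size = 2) +
    # The palette only takes effect once it is attached to a scale.
    ggplot2::scale_color_gradientn(colours = as.character(palette)) +
    ggplot2::labs(x = "Alcohol", y = "Alkalinity of ash", color = "Malic acid")
}
```

For a discrete variable such as Cultivar, the analogous step would be `scale_color_manual(values = ...)` with one color per level. Note that color and fill are separate aesthetics, so a fill scale has no effect on points colored through `color`.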

New: I added the third figure in response to Jenna’s very helpful comments! I agree with her assessment - the 2D map with the cultivar as shape and malic acid as color is much easier to interpret than the static 3D plot.