Project_1

Author

Samantha Barbaro

Approach

I visited Napa Valley twice and enjoyed the wine. I’d like to learn more about residual sugar levels and alcohol percentage.

Dataset: https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/red_wine_dataset.csv

(NB: I later realized these were Portuguese wines)

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/red_wine_dataset.csv"

red_wine <- read.csv(url)

glimpse(red_wine)
Rows: 1,599
Columns: 1
$ fixed.acidity..volatile.acidity...citric.acid...residual.sugar...chlorides...free.sulfur.dioxide...total.sulfur.dioxide...density...pH...sulphates...alcohol...quality. <chr> …

This data needs to be properly formatted in a table

It’s currently in just one column. First, let’s look at the headers to see what we’re dealing with.

head(red_wine,1)
  fixed.acidity..volatile.acidity...citric.acid...residual.sugar...chlorides...free.sulfur.dioxide...total.sulfur.dioxide...density...pH...sulphates...alcohol...quality.
1                                                                                                                        7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5

Make it a tibble

red_wine_tibble <- as_tibble(red_wine)

Delimit by semicolons

I tried a few tactics here. First, I tried the “separate” function, but I could not get it to accept a semicolon as the delimiter, even though it worked with spaces. Then, I tried replacing the semicolons with spaces. That gave me a warning that the argument was not an atomic vector and was being coerced, most likely because I passed the whole data frame rather than a single column.
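Here is a small sketch of what I think the atomic-vector warning was about, using base R only. The data frame `one_col` and its column name `raw` are made up for illustration, and the two strings are shortened rows from this file. This is my best guess at the cause, not a confirmed diagnosis.

```r
# Hypothetical one-column data frame, similar in shape to red_wine
one_col <- data.frame(raw = c("7.4;0.7;0", "7.8;0.88;0"))

# A data frame is a list of columns, so it is not an atomic vector
frame_is_atomic <- is.atomic(one_col)

# A single column pulled out with $ is an atomic character vector
column_is_atomic <- is.atomic(one_col$raw)

# String functions are happiest with the column, not the whole frame
spaced <- gsub(";", " ", one_col$raw, fixed = TRUE)

print(frame_is_atomic)
print(column_is_atomic)
print(spaced)
```

Pointing the replacement at one column (with `$` or inside `mutate`) should avoid the coercion.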

Eventually, I asked Gemini to help me with syntax for the separate_wider_delim function, which I was having trouble with. Apparently, this is a well-known data set, because Gemini even included the correct columns for me.

red_wine_split <- red_wine_tibble %>%
    separate_wider_delim(
        cols = everything(),            
        delim = ";",             
        names = c("fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar", 
                  "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", 
                  "density", "ph", "sulphates", "alcohol", "quality")
    )

## We'll take a look at the new columns here.

glimpse(red_wine_split)
Rows: 1,599
Columns: 12
$ fixed_acidity        <chr> "7.4", "7.8", "7.8", "11.2", "7.4", "7.4", "7.9",…
$ volatile_acidity     <chr> "0.7", "0.88", "0.76", "0.28", "0.7", "0.66", "0.…
$ citric_acid          <chr> "0", "0", "0.04", "0.56", "0", "0", "0.06", "0", …
$ residual_sugar       <chr> "1.9", "2.6", "2.3", "1.9", "1.9", "1.8", "1.6", …
$ chlorides            <chr> "0.076", "0.098", "0.092", "0.075", "0.076", "0.0…
$ free_sulfur_dioxide  <chr> "11", "25", "15", "17", "11", "13", "15", "15", "…
$ total_sulfur_dioxide <chr> "34", "67", "54", "60", "34", "40", "59", "21", "…
$ density              <chr> "0.9978", "0.9968", "0.997", "0.998", "0.9978", "…
$ ph                   <chr> "3.51", "3.2", "3.26", "3.16", "3.51", "3.51", "3…
$ sulphates            <chr> "0.56", "0.68", "0.65", "0.58", "0.56", "0.56", "…
$ alcohol              <chr> "9.4", "9.8", "9.8", "9.8", "9.4", "9.4", "9.4", …
$ quality              <chr> "5", "5", "5", "6", "5", "5", "5", "7", "7", "5",…
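In hindsight, another route would probably have been to tell the reader function about the semicolons up front, so the file never lands in a single column. The sketch below uses base R's `read.csv` with `sep = ";"` on three rows copied from the output above. The inline text and the simplified header names are my assumptions for illustration; the real file's header may be formatted differently.

```r
# Three rows taken from the glimpse() output above.
# The header line is simplified (assumption), not copied from the file.
sample_text <- "fixed_acidity;volatile_acidity;citric_acid;residual_sugar;chlorides;free_sulfur_dioxide;total_sulfur_dioxide;density;ph;sulphates;alcohol;quality
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5"

# sep = ";" tells read.csv that values are separated by semicolons
wine_sample <- read.csv(text = sample_text, sep = ";")

# Each value gets its own column, and the columns are parsed as numbers
str(wine_sample)
```

With the real file, the equivalent call would be something like `read.csv(url, sep = ";")`, or `read_delim(url, delim = ";")` from readr.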

Sugar vs. alcohol percentage

I wanted to understand if there is a correlation between residual sugar and alcohol percentage. I was also interested in the quality rating. Here’s a quick peek at those three columns.

red_wine_split %>%
    select(residual_sugar, alcohol, quality) %>%
    print(n = 15)
# A tibble: 1,599 × 3
   residual_sugar alcohol quality
   <chr>          <chr>   <chr>  
 1 1.9            9.4     5      
 2 2.6            9.8     5      
 3 2.3            9.8     5      
 4 1.9            9.8     6      
 5 1.9            9.4     5      
 6 1.8            9.4     5      
 7 1.6            9.4     5      
 8 1.2            10      7      
 9 2              9.5     7      
10 6.1            10.5    5      
11 1.8            9.2     5      
12 6.1            10.5    5      
13 1.6            9.9     5      
14 1.6            9.1     5      
15 3.8            9.2     5      
# ℹ 1,584 more rows

Converting values to numeric

I created a dot plot and thought there was no correlation between residual sugar and alcohol. Then I realized the y axis started at 9, went to 13, then back to 10 (disaster!): the values were being plotted as text labels, not numbers. I checked the type of data with the class function and confirmed it was stored as “character.”

So, here we turn the alcohol percentage and residual sugar columns into numeric data.

class(red_wine_split$alcohol)
[1] "character"
red_wine_split <- red_wine_split %>%
    mutate(
        alcohol = as.numeric(alcohol),
        residual_sugar = as.numeric(residual_sugar)
    )
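To show why the conversion matters, here is a minimal sketch with three values taken from the peek at the columns above. The data frame name `wine_chr` is made up for illustration, and `across()` is just one way to convert several columns in a single step.

```r
library(dplyr)

# Three rows from the table above, stored as text like the real data
wine_chr <- data.frame(
    residual_sugar = c("1.9", "2.6", "6.1"),
    alcohol = c("9.4", "9.8", "10.5")
)

# Convert both columns to numeric at once
wine_num <- wine_chr %>%
    mutate(across(c(residual_sugar, alcohol), as.numeric))

# As text, values are compared character by character, so "10.5" sorts first
text_order <- sort(wine_chr$alcohol)

# As numbers, 10.5 sorts last, which is what a plot axis needs
number_order <- sort(wine_num$alcohol)

print(text_order)
print(number_order)
```

Text that looks like numbers does not order like numbers, which is the same kind of problem that scrambled my axis.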

Dot plot

Then we make the dot plot (a scatter plot, drawn with geom_point). It compares residual sugar levels to alcohol percentage. I’ve also mapped the quality ratings to colors.

ggplot(data = red_wine_split, aes(x = residual_sugar, y = alcohol, color = quality)) + 
    geom_point()

I love the colors R chose for me.

It doesn’t seem like there’s much of a correlation between residual sugar and alcohol, given the cluster on the left. It is interesting that the wines with a 5 quality rating are clustered closer to the bottom, with lower alcohol percentages, while the 6s tend to have a higher alcohol percentage.

I asked an expert, who explained (using entirely too many words) that an alcohol percentage below 15 means none of these wines are fortified. So, no Port or Madeira in this data set!
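One way to put a number on the visual impression of sugar versus alcohol would be a correlation coefficient. The sketch below runs base R's `cor()` on only the 15 rows printed earlier, so its result is illustrative and says nothing definitive about the full data set. The vector names `sugar` and `alcohol_pct` are made up for this example.

```r
# The 15 rows shown in the earlier peek at the columns
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1, 1.8, 6.1, 1.6, 1.6, 3.8)
alcohol_pct <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5, 9.2, 10.5, 9.9, 9.1, 9.2)

# Pearson correlation: -1 is a perfect negative relationship,
# 0 is no linear relationship, 1 is a perfect positive one
r_sample <- cor(sugar, alcohol_pct)

print(r_sample)
```

On the full data, the same idea would be `cor(red_wine_split$residual_sugar, red_wine_split$alcohol)`, once both columns are numeric.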

Checking quality ratings

I was also interested in these quality ratings, so I used the count function to see what the distinct quality ratings were. They range from 3 to 8 in this data set.

red_wine_split |> count(quality)
# A tibble: 6 × 2
  quality     n
  <chr>   <int>
1 3              10
2 4              53
3 5             681
4 6             638
5 7             199
6 8              18

Then, I sorted by count, from most to least frequent. The most common quality rating was 5, followed by 6.

red_wine_split |> count(quality, sort = TRUE)
# A tibble: 6 × 2
  quality     n
  <chr>   <int>
1 5         681
2 6         638
3 7         199
4 4          53
5 8          18
6 3          10
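The counts are easier to compare as shares of the total. This is a small sketch that re-enters the six counts from the output above by hand; the data frame name `quality_counts` and the `share` column are my own additions.

```r
# Counts copied from the count() output above
quality_counts <- data.frame(
    quality = c(3, 4, 5, 6, 7, 8),
    n = c(10, 53, 681, 638, 199, 18)
)

# Share of all wines that received each rating
quality_counts$share <- quality_counts$n / sum(quality_counts$n)

# Most common rating first
quality_sorted <- quality_counts[order(-quality_counts$n), ]

print(quality_sorted)
```

Starting from the tibble instead, something like `red_wine_split |> count(quality, sort = TRUE) |> mutate(share = n / sum(n))` should give the same shares.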

Bar graph

I graphed this quality data. I originally used color = quality, and it just gave the gray bars a very thin outline, which wasn’t very visually interesting. Then I guessed (or it’s possible I’d seen this done before) that using “fill” would change the whole bar color, and I was right!

ggplot(data = red_wine_split, aes(x = quality, fill = quality)) + 
    geom_bar()

Citations

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed January 28, 2026.