JNorthamDDS8521-Assignment6

The Wine_Quality_Data from GitHub (Sdt 320, 2025) was used for wine quality analysis. The Wine_Quality database consists of 6497 observations and 13 variables. Twelve numeric variables, 11 of which are numeric variables including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol content. The quality rating is an integer data type. Color was also included as a character data type. Inspection of the distribution of red and white wine types revealed 1599 observations of red wine and 4898 observations of white wine (Sdt 320, 2025). The purpose of this project is identify a relationship between the target variable (volatile acidity) and quality ratings using a bootstrap resampling.

#Install "tidymodels"
install.packages("tidymodels", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'tidymodels' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\RtmpK8eYzy\downloaded_packages

#Load packages
library(tidyverse)

## Warning: package 'readr' was built under R version 4.5.1

## Warning: package 'purrr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)
library(boot)
library(ggplot2)
library(moments)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.1

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(rsample)

## Warning: package 'rsample' was built under R version 4.5.2

library(tidymodels)

## Warning: package 'tidymodels' was built under R version 4.5.2

## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.11     ✔ tailor       0.1.0 
## ✔ dials        1.4.2      ✔ tune         2.0.1 
## ✔ infer        1.1.0      ✔ workflows    1.3.0 
## ✔ modeldata    1.5.1      ✔ workflowsets 1.1.1 
## ✔ parsnip      1.4.0      ✔ yardstick    1.3.2 
## ✔ recipes      1.3.1

## Warning: package 'broom' was built under R version 4.5.2

## Warning: package 'dials' was built under R version 4.5.2

## Warning: package 'modeldata' was built under R version 4.5.2

## Warning: package 'parsnip' was built under R version 4.5.2

## Warning: package 'tailor' was built under R version 4.5.2

## Warning: package 'tune' was built under R version 4.5.2

## Warning: package 'workflows' was built under R version 4.5.2

## Warning: package 'workflowsets' was built under R version 4.5.2

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()        masks purrr::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ recipes::fixed()         masks stringr::fixed()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ yardstick::spec()        masks readr::spec()
## ✖ recipes::step()          masks stats::step()

#Get working directory
getwd()

## [1] "C:/Users/benke/Downloads/New folder (3)"

#Set working directory
setwd("C:/Users/benke/Downloads/New folder (3)")

#Upload dataset
winedf <- read.csv("Wine_Quality_Data.csv")

#View dataset
head(winedf)

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality color
## 1       5   red
## 2       5   red
## 3       5   red
## 4       6   red
## 5       5   red
## 6       5   red

#Examine data types
str(winedf)

## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed_acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile_acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric_acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual_sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free_sulfur_dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total_sulfur_dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : chr  "red" "red" "red" "red" ...

#Determine red and white wine counts
winedf %>% count(color)

##   color    n
## 1   red 1599
## 2 white 4898

#View data types
glimpse(winedf)

## Rows: 6,497
## Columns: 13
## $ fixed_acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color                <chr> "red", "red", "red", "red", "red", "red", "red", …

#View summary statistics
summary(winedf)

##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0        Min.   :0.9871  
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0        1st Qu.:0.9923  
##  Median :0.04700   Median : 29.00      Median :118.0        Median :0.9949  
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7        Mean   :0.9947  
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0        3rd Qu.:0.9970  
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.210   Median :0.5100   Median :10.30   Median :6.000  
##  Mean   :3.219   Mean   :0.5313   Mean   :10.49   Mean   :5.818  
##  3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :9.000  
##     color          
##  Length:6497       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

#Check for missing values in rows
rows_NA <- sum(rowSums(is.na(winedf)) > 0)
rows_NA

## [1] 0

#Check for missing values in columns
col_NA <- sum(colSums(is.na(winedf)) > 0)
col_NA

## [1] 0

#Create function to view statistics
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Variance = round(var(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

#Apply statistic function to dataset
descriptive_stats <- do.call(
  rbind,
  lapply(names(winedf), function(col) compute_stats(winedf[[col]], col))
)

#View statistics
descriptive_stats

##                Variable   Mean Median St.Deviation Variance  Range   IQR
## 1         fixed_acidity   7.22   7.00         1.30     1.68  12.10  1.30
## 2      volatile_acidity   0.34   0.29         0.16     0.03   1.50  0.17
## 3           citric_acid   0.32   0.31         0.15     0.02   1.66  0.14
## 4        residual_sugar   5.44   3.00         4.76    22.64  65.20  6.30
## 5             chlorides   0.06   0.05         0.04     0.00   0.60  0.03
## 6   free_sulfur_dioxide  30.53  29.00        17.75   315.04 288.00 24.00
## 7  total_sulfur_dioxide 115.74 118.00        56.52  3194.72 434.00 79.00
## 8               density   0.99   0.99         0.00     0.00   0.05  0.00
## 9                    pH   3.22   3.21         0.16     0.03   1.29  0.21
## 10            sulphates   0.53   0.51         0.15     0.02   1.78  0.17
## 11              alcohol  10.49  10.30         1.19     1.42   6.90  1.80
## 12              quality   5.82   6.00         0.87     0.76   6.00  1.00
##    Skewness Kurtosis
## 1      1.72     8.06
## 2      1.49     5.82
## 3      0.47     5.39
## 4      1.44     7.35
## 5      5.40    53.86
## 6      1.22    10.90
## 7      0.00     2.63
## 8      0.50     9.60
## 9      0.39     3.37
## 10     1.80    11.65
## 11     0.57     2.47
## 12     0.19     3.23

#View distribution of quality variable
ggplot(winedf, aes(x=quality, fill=quality)) +
  geom_bar(aes(fill=quality), fill="turquoise", bin=1) +
  labs(x="Group Comparison of Quality Rating", y="Count")

## Warning in geom_bar(aes(fill = quality), fill = "turquoise", bin = 1): Ignoring
## unknown parameters: `bin`

#Create subset of numerical features
num_winedf <- winedf[c(1:12)]
head(num_winedf)

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

#Create subset of numerical features
num_winedf <- winedf[c(1:12)]
head(num_winedf)

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

#View pairs analysis to identify features with relationships for further investigation
pairs(num_winedf)

corr_winedf <- cor(num_winedf)
corr_winedf

##                      fixed_acidity volatile_acidity citric_acid residual_sugar
## fixed_acidity           1.00000000       0.21900826  0.32443573    -0.11198128
## volatile_acidity        0.21900826       1.00000000 -0.37798132    -0.19601117
## citric_acid             0.32443573      -0.37798132  1.00000000     0.14245123
## residual_sugar         -0.11198128      -0.19601117  0.14245123     1.00000000
## chlorides               0.29819477       0.37712428  0.03899801    -0.12894050
## free_sulfur_dioxide    -0.28273543      -0.35255731  0.13312581     0.40287064
## total_sulfur_dioxide   -0.32905390      -0.41447619  0.19524198     0.49548159
## density                 0.45890998       0.27129565  0.09615393     0.55251695
## pH                     -0.25270047       0.26145440 -0.32980819    -0.26731984
## sulphates               0.29956774       0.22598368  0.05619730    -0.18592741
## alcohol                -0.09545152      -0.03764039 -0.01049349    -0.35941477
## quality                -0.07674321      -0.26569948  0.08553172    -0.03698048
##                        chlorides free_sulfur_dioxide total_sulfur_dioxide
## fixed_acidity         0.29819477         -0.28273543          -0.32905390
## volatile_acidity      0.37712428         -0.35255731          -0.41447619
## citric_acid           0.03899801          0.13312581           0.19524198
## residual_sugar       -0.12894050          0.40287064           0.49548159
## chlorides             1.00000000         -0.19504479          -0.27963045
## free_sulfur_dioxide  -0.19504479          1.00000000           0.72093408
## total_sulfur_dioxide -0.27963045          0.72093408           1.00000000
## density               0.36261466          0.02571684           0.03239451
## pH                    0.04470798         -0.14585390          -0.23841310
## sulphates             0.39559331         -0.18845725          -0.27572682
## alcohol              -0.25691558         -0.17983843          -0.26573964
## quality              -0.20066550          0.05546306          -0.04138545
##                          density          pH    sulphates      alcohol
## fixed_acidity         0.45890998 -0.25270047  0.299567744 -0.095451523
## volatile_acidity      0.27129565  0.26145440  0.225983680 -0.037640386
## citric_acid           0.09615393 -0.32980819  0.056197300 -0.010493492
## residual_sugar        0.55251695 -0.26731984 -0.185927405 -0.359414771
## chlorides             0.36261466  0.04470798  0.395593307 -0.256915580
## free_sulfur_dioxide   0.02571684 -0.14585390 -0.188457249 -0.179838435
## total_sulfur_dioxide  0.03239451 -0.23841310 -0.275726820 -0.265739639
## density               1.00000000  0.01168608  0.259478495 -0.686745422
## pH                    0.01168608  1.00000000  0.192123407  0.121248467
## sulphates             0.25947850  0.19212341  1.000000000 -0.003029195
## alcohol              -0.68674542  0.12124847 -0.003029195  1.000000000
## quality              -0.30585791  0.01950570  0.038485446  0.444318520
##                          quality
## fixed_acidity        -0.07674321
## volatile_acidity     -0.26569948
## citric_acid           0.08553172
## residual_sugar       -0.03698048
## chlorides            -0.20066550
## free_sulfur_dioxide   0.05546306
## total_sulfur_dioxide -0.04138545
## density              -0.30585791
## pH                    0.01950570
## sulphates             0.03848545
## alcohol               0.44431852
## quality               1.00000000

corr_winedf %>%
  kbl(caption = "Correlation of Numerical Wine Quality Variables") %>%
  kable_classic()

Correlation of Numerical Wine Quality Variables
	fixed_acidity	volatile_acidity	citric_acid	residual_sugar	chlorides	free_sulfur_dioxide	total_sulfur_dioxide	density	pH	sulphates	alcohol	quality
fixed_acidity	1.0000000	0.2190083	0.3244357	-0.1119813	0.2981948	-0.2827354	-0.3290539	0.4589100	-0.2527005	0.2995677	-0.0954515	-0.0767432
volatile_acidity	0.2190083	1.0000000	-0.3779813	-0.1960112	0.3771243	-0.3525573	-0.4144762	0.2712956	0.2614544	0.2259837	-0.0376404	-0.2656995
citric_acid	0.3244357	-0.3779813	1.0000000	0.1424512	0.0389980	0.1331258	0.1952420	0.0961539	-0.3298082	0.0561973	-0.0104935	0.0855317
residual_sugar	-0.1119813	-0.1960112	0.1424512	1.0000000	-0.1289405	0.4028706	0.4954816	0.5525170	-0.2673198	-0.1859274	-0.3594148	-0.0369805
chlorides	0.2981948	0.3771243	0.0389980	-0.1289405	1.0000000	-0.1950448	-0.2796304	0.3626147	0.0447080	0.3955933	-0.2569156	-0.2006655
free_sulfur_dioxide	-0.2827354	-0.3525573	0.1331258	0.4028706	-0.1950448	1.0000000	0.7209341	0.0257168	-0.1458539	-0.1884572	-0.1798384	0.0554631
total_sulfur_dioxide	-0.3290539	-0.4144762	0.1952420	0.4954816	-0.2796304	0.7209341	1.0000000	0.0323945	-0.2384131	-0.2757268	-0.2657396	-0.0413855
density	0.4589100	0.2712956	0.0961539	0.5525170	0.3626147	0.0257168	0.0323945	1.0000000	0.0116861	0.2594785	-0.6867454	-0.3058579
pH	-0.2527005	0.2614544	-0.3298082	-0.2673198	0.0447080	-0.1458539	-0.2384131	0.0116861	1.0000000	0.1921234	0.1212485	0.0195057
sulphates	0.2995677	0.2259837	0.0561973	-0.1859274	0.3955933	-0.1884572	-0.2757268	0.2594785	0.1921234	1.0000000	-0.0030292	0.0384854
alcohol	-0.0954515	-0.0376404	-0.0104935	-0.3594148	-0.2569156	-0.1798384	-0.2657396	-0.6867454	0.1212485	-0.0030292	1.0000000	0.4443185
quality	-0.0767432	-0.2656995	0.0855317	-0.0369805	-0.2006655	0.0554631	-0.0413855	-0.3058579	0.0195057	0.0384854	0.4443185	1.0000000

#View scatterplot of pH and chlorides
ggplot(winedf, aes(x = pH, y = chlorides)) +
  geom_point(color = "turquoise", size = 2) +
  labs(title = "Relationship between pH and Chlorides",
       x = "pH",
       y = "Chlorides") +
  theme_minimal()

#View scatterplot of quality and pH
ggplot(winedf, aes(x = quality, y = pH)) +
  geom_point(color = "turquoise", size = 2) +
  labs(title = "Relationship between Quality and pH",
       x = "Quality",
       y = "pH") +
  theme_minimal()

#View scatterplot of quality and pH
ggplot(winedf, aes(x = pH, y = quality)) +
  geom_point(color = "turquoise", size = 2) +
  labs(title = "Relationship between Quality and pH",
       x = "pH",
       y = "Quality") +
  theme_minimal()

#View scatterplot of pH and sulphates
ggplot(winedf, aes(x = pH, y = sulphates)) +
  geom_point(color = "turquoise", size = 2) +
  labs(title = "Relationship between pH and Sulfates",
       x = "pH",
       y = "Sulfates") +
  theme_minimal()

#View scatterplot of pH and fixed acidity
ggplot(winedf, aes(x = pH, y = fixed_acidity)) +
  geom_point(color = "turquoise", size = 2) +
  labs(title = "Relationship between pH and Fixed Acidity",
       x = "pH",
       y = "Fixed Acidity") +
  theme_minimal()

#View scatterplot of quality and volatile acidity
ggplot(winedf, aes(x = quality, y = volatile_acidity)) +
  geom_point(color = "turquoise", size = 2) +
  labs(title = "Relationship between Quality and Volatile Acidity",
       x = "Quality",
       y = "Volatile Acidity") +
  theme_minimal()

## Hypothesis: The average volatile acidity of high quality wine (rating >=7) is significantly lower than that of lower-quality wine (rating <7)

#Subset based on wine quality
subset1 <- subset(num_winedf, quality >= 7)
subset2 <- subset(num_winedf, quality < 7)
head(subset1)

##     fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 8             7.3             0.65        0.00            1.2     0.065
## 9             7.8             0.58        0.02            2.0     0.073
## 17            8.5             0.28        0.56            1.8     0.092
## 38            8.1             0.38        0.28            2.1     0.066
## 63            7.5             0.52        0.16            1.9     0.085
## 129           8.0             0.59        0.16            1.8     0.065
##     free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 8                    15                   21  0.9946 3.39      0.47    10.0
## 9                     9                   18  0.9968 3.36      0.57     9.5
## 17                   35                  103  0.9969 3.30      0.75    10.5
## 38                   13                   30  0.9968 3.23      0.73     9.7
## 63                   12                   35  0.9968 3.38      0.62     9.5
## 129                   3                   16  0.9962 3.42      0.92    10.5
##     quality
## 8         7
## 9         7
## 17        7
## 38        7
## 63        7
## 129       7

head(subset2)

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

#Calculate the difference in means for subsets for sample statistic
SampleStatistic <- mean(subset1$volatile_acidity) - mean(subset2$volatile_acidity)
SampleStatistic

## [1] -0.06284923

#Calculate t-test prior to bootstrapping for comparison
samplet <- t.test(subset1, subset2, var.equal = TRUE)
samplet

## 
##  Two Sample t-test
## 
## data:  subset1 and subset2
## t = -1.2263, df = 77962, p-value = 0.2201
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0280658  0.2367345
## sample estimates:
## mean of x mean of y 
##  14.74022  15.13589

#Bootstrap using sample() with replacement, with a for loop to calculate the test statistic for 10000 randomizations. Results are stored in "results". In each iteration of the loop, the sample() function randomly samples, with replacement, 100 observations from each subset. The difference of the means is stored in "result[i]".

set.seed(1234)

results <- numeric(10000)
for (i in 1:10000) {
  data.subset1 <- sample(subset1$volatile_acidity, size=100, replace=TRUE)
  data.subset2 <- sample(subset2$volatile_acidity, size = 100, replace=TRUE)
  results[i] <- mean(data.subset1) - mean(data.subset2)
}

#Create a histogram of the statistic's distribution
par(cex.main = 1.5, cex.lab=1.5, cex.axis=1.25)
hist(results, breaks = 15, main = "Distribution of difference in means of volatile acidity",
     xlab="Difference in means", col="turquoise", cex=1.5)

#Calculate confidence interval
alpha = 0.05
n1=length(subset1)
n2=length(subset2)
se = sd(results)
ci.l <- mean(results)-qt(1-alpha/2, n1-1)*se
ci.u <- mean(results)+qt(1-alpha/2, n1-1)*se
ci.l

## [1] -0.1088172

ci.u

## [1] -0.01699331

#Calculate confidence interval from bootstrap distribution
alpha = 0.05

ci_lb = quantile(results, probs=(alpha/2))
ci_ub = quantile(results, probs=(1-(alpha/2)))
ci_lb

##    2.5% 
## -0.1044

ci_ub

##       97.5% 
## -0.02269625

ci_lb %>%
  kbl(caption = "Confidence Interval Lower Limit") %>%
  kable_classic()

Confidence Interval Lower Limit
	x
2.5%	-0.1044

ci_ub %>%
  kbl(caption = "Confidence Interval Upper Limit") %>%
  kable_classic()

Confidence Interval Upper Limit
	x
97.5%	-0.0226963

citation("tidyverse")

## To cite package 'tidyverse' in publications use:
## 
##   Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
##   Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
##   E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
##   Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
##   the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
##   doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {Welcome to the {tidyverse}},
##     author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
##     year = {2019},
##     journal = {Journal of Open Source Software},
##     volume = {4},
##     number = {43},
##     pages = {1686},
##     doi = {10.21105/joss.01686},
##   }

citation("dplyr")

## To cite package 'dplyr' in publications use:
## 
##   Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
##   Grammar of Data Manipulation_. doi:10.32614/CRAN.package.dplyr
##   <https://doi.org/10.32614/CRAN.package.dplyr>, R package version
##   1.1.4, <https://CRAN.R-project.org/package=dplyr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {dplyr: A Grammar of Data Manipulation},
##     author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
##     year = {2023},
##     note = {R package version 1.1.4},
##     url = {https://CRAN.R-project.org/package=dplyr},
##     doi = {10.32614/CRAN.package.dplyr},
##   }

citation("readr")

## To cite package 'readr' in publications use:
## 
##   Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
##   Data_. doi:10.32614/CRAN.package.readr
##   <https://doi.org/10.32614/CRAN.package.readr>, R package version
##   2.1.5, <https://CRAN.R-project.org/package=readr>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {readr: Read Rectangular Text Data},
##     author = {Hadley Wickham and Jim Hester and Jennifer Bryan},
##     year = {2024},
##     note = {R package version 2.1.5},
##     url = {https://CRAN.R-project.org/package=readr},
##     doi = {10.32614/CRAN.package.readr},
##   }

citation("boot")

## To cite the 'boot' package in publications use:
## 
##   Angelo Canty and Brian Ripley (2024). boot: Bootstrap R (S-Plus)
##   Functions. R package version 1.3-31.
## 
##   Davison, A. C. & Hinkley, D. V. (1997) Bootstrap Methods and Their
##   Applications. Cambridge University Press, Cambridge. ISBN
##   0-521-57391-2
## 
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.

citation("ggplot2")

## To cite ggplot2 in publications, please use
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }

citation("moments")

## To cite package 'moments' in publications use:
## 
##   Komsta L, Novomestky F (2022). _moments: Moments, Cumulants,
##   Skewness, Kurtosis and Related Tests_.
##   doi:10.32614/CRAN.package.moments
##   <https://doi.org/10.32614/CRAN.package.moments>, R package version
##   0.14.1, <https://CRAN.R-project.org/package=moments>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests},
##     author = {Lukasz Komsta and Frederick Novomestky},
##     year = {2022},
##     note = {R package version 0.14.1},
##     url = {https://CRAN.R-project.org/package=moments},
##     doi = {10.32614/CRAN.package.moments},
##   }
## 
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.

citation("scales")

## To cite package 'scales' in publications use:
## 
##   Wickham H, Pedersen T, Seidel D (2025). _scales: Scale Functions for
##   Visualization_. doi:10.32614/CRAN.package.scales
##   <https://doi.org/10.32614/CRAN.package.scales>, R package version
##   1.4.0, <https://CRAN.R-project.org/package=scales>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {scales: Scale Functions for Visualization},
##     author = {Hadley Wickham and Thomas Lin Pedersen and Dana Seidel},
##     year = {2025},
##     note = {R package version 1.4.0},
##     url = {https://CRAN.R-project.org/package=scales},
##     doi = {10.32614/CRAN.package.scales},
##   }

citation("kableExtra")

## To cite package 'kableExtra' in publications use:
## 
##   Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
##   Pipe Syntax_. doi:10.32614/CRAN.package.kableExtra
##   <https://doi.org/10.32614/CRAN.package.kableExtra>, R package version
##   1.4.0, <https://CRAN.R-project.org/package=kableExtra>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
##     author = {Hao Zhu},
##     year = {2024},
##     note = {R package version 1.4.0},
##     url = {https://CRAN.R-project.org/package=kableExtra},
##     doi = {10.32614/CRAN.package.kableExtra},
##   }

citation("rsample")

## To cite package 'rsample' in publications use:
## 
##   Frick H, Chow F, Kuhn M, Mahoney M, Silge J, Wickham H (2025).
##   _rsample: General Resampling Infrastructure_.
##   doi:10.32614/CRAN.package.rsample
##   <https://doi.org/10.32614/CRAN.package.rsample>, R package version
##   1.3.1, <https://CRAN.R-project.org/package=rsample>.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {rsample: General Resampling Infrastructure},
##     author = {Hannah Frick and Fanny Chow and Max Kuhn and Michael Mahoney and Julia Silge and Hadley Wickham},
##     year = {2025},
##     note = {R package version 1.3.1},
##     url = {https://CRAN.R-project.org/package=rsample},
##     doi = {10.32614/CRAN.package.rsample},
##   }

citation("tidymodels")

## To cite package 'tidymodels' in publications use:
## 
##   Kuhn et al., (2020). Tidymodels: a collection of packages for
##   modeling and machine learning using tidyverse principles.
##   https://www.tidymodels.org
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles.},
##     author = {Max Kuhn and Hadley Wickham},
##     url = {https://www.tidymodels.org},
##     year = {2020},
##   }

Sdt320. (2025). Wine_quality_dataset/Wine_Quality_Data.csv at main · Sdt320/Wine_quality_dataset. GitHub. https://github.com/Sdt320/Wine_quality_dataset/blob/main/Wine_Quality_Data.csv

JNorthamDDS8521-Assignment6

Heather Northam

2025-12-22