The Wine_Quality_Data from GitHub (Sdt 320, 2025) was used for wine quality analysis. The Wine_Quality database consists of 6497 observations and 13 variables. Twelve numeric variables, 11 of which are numeric variables including fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol content. The quality rating is an integer data type. Color was also included as a character data type. Inspection of the distribution of red and white wine types revealed 1599 observations of red wine and 4898 observations of white wine (Sdt 320, 2025). The purpose of this project is identify a relationship between the target variable (volatile acidity) and quality ratings using a bootstrap resampling.
#Install "tidymodels"
install.packages("tidymodels", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'tidymodels' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\benke\AppData\Local\Temp\RtmpK8eYzy\downloaded_packages
#Load packages
library(tidyverse)
## Warning: package 'readr' was built under R version 4.5.1
## Warning: package 'purrr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(boot)
library(ggplot2)
library(moments)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.1
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(rsample)
## Warning: package 'rsample' was built under R version 4.5.2
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.5.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom 1.0.11 ✔ tailor 0.1.0
## ✔ dials 1.4.2 ✔ tune 2.0.1
## ✔ infer 1.1.0 ✔ workflows 1.3.0
## ✔ modeldata 1.5.1 ✔ workflowsets 1.1.1
## ✔ parsnip 1.4.0 ✔ yardstick 1.3.2
## ✔ recipes 1.3.1
## Warning: package 'broom' was built under R version 4.5.2
## Warning: package 'dials' was built under R version 4.5.2
## Warning: package 'modeldata' was built under R version 4.5.2
## Warning: package 'parsnip' was built under R version 4.5.2
## Warning: package 'tailor' was built under R version 4.5.2
## Warning: package 'tune' was built under R version 4.5.2
## Warning: package 'workflows' was built under R version 4.5.2
## Warning: package 'workflowsets' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
#Get working directory
getwd()
## [1] "C:/Users/benke/Downloads/New folder (3)"
#Set working directory
setwd("C:/Users/benke/Downloads/New folder (3)")
#Upload dataset
winedf <- read.csv("Wine_Quality_Data.csv")
#View dataset
head(winedf)
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality color
## 1 5 red
## 2 5 red
## 3 5 red
## 4 6 red
## 5 5 red
## 6 5 red
#Examine data types
str(winedf)
## 'data.frame': 6497 obs. of 13 variables:
## $ fixed_acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile_acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric_acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual_sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free_sulfur_dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total_sulfur_dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ color : chr "red" "red" "red" "red" ...
#Determine red and white wine counts
winedf %>% count(color)
## color n
## 1 red 1599
## 2 white 4898
#View data types
glimpse(winedf)
## Rows: 6,497
## Columns: 13
## $ fixed_acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile_acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric_acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual_sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free_sulfur_dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total_sulfur_dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
## $ color <chr> "red", "red", "red", "red", "red", "red", "red", …
#View summary statistics
summary(winedf)
## fixed_acidity volatile_acidity citric_acid residual_sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.800
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.000
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.443
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.100
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :65.800
## chlorides free_sulfur_dioxide total_sulfur_dioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0390
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.30 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.49 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :9.000
## color
## Length:6497
## Class :character
## Mode :character
##
##
##
#Check for missing values in rows
rows_NA <- sum(rowSums(is.na(winedf)) > 0)
rows_NA
## [1] 0
#Check for missing values in columns
col_NA <- sum(colSums(is.na(winedf)) > 0)
col_NA
## [1] 0
#Create function to view statistics
compute_stats <- function(column, name) {
if (is.numeric(column) || is.integer(column)) {
data.frame(
Variable = name,
Mean = round(mean(column, na.rm = TRUE), 2),
Median = round(median(column, na.rm = TRUE), 2),
St.Deviation = round(sd(column, na.rm = TRUE), 2),
Variance = round(var(column, na.rm = TRUE), 2),
Range = round(diff(range(column, na.rm = TRUE)), 2),
IQR = round(IQR(column, na.rm = TRUE), 2),
Skewness = round(skewness(column, na.rm = TRUE), 2),
Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
stringsAsFactors = FALSE
)
} else {
NULL
}
}
#Apply statistic function to dataset
descriptive_stats <- do.call(
rbind,
lapply(names(winedf), function(col) compute_stats(winedf[[col]], col))
)
#View statistics
descriptive_stats
## Variable Mean Median St.Deviation Variance Range IQR
## 1 fixed_acidity 7.22 7.00 1.30 1.68 12.10 1.30
## 2 volatile_acidity 0.34 0.29 0.16 0.03 1.50 0.17
## 3 citric_acid 0.32 0.31 0.15 0.02 1.66 0.14
## 4 residual_sugar 5.44 3.00 4.76 22.64 65.20 6.30
## 5 chlorides 0.06 0.05 0.04 0.00 0.60 0.03
## 6 free_sulfur_dioxide 30.53 29.00 17.75 315.04 288.00 24.00
## 7 total_sulfur_dioxide 115.74 118.00 56.52 3194.72 434.00 79.00
## 8 density 0.99 0.99 0.00 0.00 0.05 0.00
## 9 pH 3.22 3.21 0.16 0.03 1.29 0.21
## 10 sulphates 0.53 0.51 0.15 0.02 1.78 0.17
## 11 alcohol 10.49 10.30 1.19 1.42 6.90 1.80
## 12 quality 5.82 6.00 0.87 0.76 6.00 1.00
## Skewness Kurtosis
## 1 1.72 8.06
## 2 1.49 5.82
## 3 0.47 5.39
## 4 1.44 7.35
## 5 5.40 53.86
## 6 1.22 10.90
## 7 0.00 2.63
## 8 0.50 9.60
## 9 0.39 3.37
## 10 1.80 11.65
## 11 0.57 2.47
## 12 0.19 3.23
#View distribution of quality variable
ggplot(winedf, aes(x=quality, fill=quality)) +
geom_bar(aes(fill=quality), fill="turquoise", bin=1) +
labs(x="Group Comparison of Quality Rating", y="Count")
## Warning in geom_bar(aes(fill = quality), fill = "turquoise", bin = 1): Ignoring
## unknown parameters: `bin`
#Create subset of numerical features
num_winedf <- winedf[c(1:12)]
head(num_winedf)
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
#Create subset of numerical features
num_winedf <- winedf[c(1:12)]
head(num_winedf)
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
#View pairs analysis to identify features with relationships for further investigation
pairs(num_winedf)
corr_winedf <- cor(num_winedf)
corr_winedf
## fixed_acidity volatile_acidity citric_acid residual_sugar
## fixed_acidity 1.00000000 0.21900826 0.32443573 -0.11198128
## volatile_acidity 0.21900826 1.00000000 -0.37798132 -0.19601117
## citric_acid 0.32443573 -0.37798132 1.00000000 0.14245123
## residual_sugar -0.11198128 -0.19601117 0.14245123 1.00000000
## chlorides 0.29819477 0.37712428 0.03899801 -0.12894050
## free_sulfur_dioxide -0.28273543 -0.35255731 0.13312581 0.40287064
## total_sulfur_dioxide -0.32905390 -0.41447619 0.19524198 0.49548159
## density 0.45890998 0.27129565 0.09615393 0.55251695
## pH -0.25270047 0.26145440 -0.32980819 -0.26731984
## sulphates 0.29956774 0.22598368 0.05619730 -0.18592741
## alcohol -0.09545152 -0.03764039 -0.01049349 -0.35941477
## quality -0.07674321 -0.26569948 0.08553172 -0.03698048
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## fixed_acidity 0.29819477 -0.28273543 -0.32905390
## volatile_acidity 0.37712428 -0.35255731 -0.41447619
## citric_acid 0.03899801 0.13312581 0.19524198
## residual_sugar -0.12894050 0.40287064 0.49548159
## chlorides 1.00000000 -0.19504479 -0.27963045
## free_sulfur_dioxide -0.19504479 1.00000000 0.72093408
## total_sulfur_dioxide -0.27963045 0.72093408 1.00000000
## density 0.36261466 0.02571684 0.03239451
## pH 0.04470798 -0.14585390 -0.23841310
## sulphates 0.39559331 -0.18845725 -0.27572682
## alcohol -0.25691558 -0.17983843 -0.26573964
## quality -0.20066550 0.05546306 -0.04138545
## density pH sulphates alcohol
## fixed_acidity 0.45890998 -0.25270047 0.299567744 -0.095451523
## volatile_acidity 0.27129565 0.26145440 0.225983680 -0.037640386
## citric_acid 0.09615393 -0.32980819 0.056197300 -0.010493492
## residual_sugar 0.55251695 -0.26731984 -0.185927405 -0.359414771
## chlorides 0.36261466 0.04470798 0.395593307 -0.256915580
## free_sulfur_dioxide 0.02571684 -0.14585390 -0.188457249 -0.179838435
## total_sulfur_dioxide 0.03239451 -0.23841310 -0.275726820 -0.265739639
## density 1.00000000 0.01168608 0.259478495 -0.686745422
## pH 0.01168608 1.00000000 0.192123407 0.121248467
## sulphates 0.25947850 0.19212341 1.000000000 -0.003029195
## alcohol -0.68674542 0.12124847 -0.003029195 1.000000000
## quality -0.30585791 0.01950570 0.038485446 0.444318520
## quality
## fixed_acidity -0.07674321
## volatile_acidity -0.26569948
## citric_acid 0.08553172
## residual_sugar -0.03698048
## chlorides -0.20066550
## free_sulfur_dioxide 0.05546306
## total_sulfur_dioxide -0.04138545
## density -0.30585791
## pH 0.01950570
## sulphates 0.03848545
## alcohol 0.44431852
## quality 1.00000000
corr_winedf %>%
kbl(caption = "Correlation of Numerical Wine Quality Variables") %>%
kable_classic()
| fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 1.0000000 | 0.2190083 | 0.3244357 | -0.1119813 | 0.2981948 | -0.2827354 | -0.3290539 | 0.4589100 | -0.2527005 | 0.2995677 | -0.0954515 | -0.0767432 |
| volatile_acidity | 0.2190083 | 1.0000000 | -0.3779813 | -0.1960112 | 0.3771243 | -0.3525573 | -0.4144762 | 0.2712956 | 0.2614544 | 0.2259837 | -0.0376404 | -0.2656995 |
| citric_acid | 0.3244357 | -0.3779813 | 1.0000000 | 0.1424512 | 0.0389980 | 0.1331258 | 0.1952420 | 0.0961539 | -0.3298082 | 0.0561973 | -0.0104935 | 0.0855317 |
| residual_sugar | -0.1119813 | -0.1960112 | 0.1424512 | 1.0000000 | -0.1289405 | 0.4028706 | 0.4954816 | 0.5525170 | -0.2673198 | -0.1859274 | -0.3594148 | -0.0369805 |
| chlorides | 0.2981948 | 0.3771243 | 0.0389980 | -0.1289405 | 1.0000000 | -0.1950448 | -0.2796304 | 0.3626147 | 0.0447080 | 0.3955933 | -0.2569156 | -0.2006655 |
| free_sulfur_dioxide | -0.2827354 | -0.3525573 | 0.1331258 | 0.4028706 | -0.1950448 | 1.0000000 | 0.7209341 | 0.0257168 | -0.1458539 | -0.1884572 | -0.1798384 | 0.0554631 |
| total_sulfur_dioxide | -0.3290539 | -0.4144762 | 0.1952420 | 0.4954816 | -0.2796304 | 0.7209341 | 1.0000000 | 0.0323945 | -0.2384131 | -0.2757268 | -0.2657396 | -0.0413855 |
| density | 0.4589100 | 0.2712956 | 0.0961539 | 0.5525170 | 0.3626147 | 0.0257168 | 0.0323945 | 1.0000000 | 0.0116861 | 0.2594785 | -0.6867454 | -0.3058579 |
| pH | -0.2527005 | 0.2614544 | -0.3298082 | -0.2673198 | 0.0447080 | -0.1458539 | -0.2384131 | 0.0116861 | 1.0000000 | 0.1921234 | 0.1212485 | 0.0195057 |
| sulphates | 0.2995677 | 0.2259837 | 0.0561973 | -0.1859274 | 0.3955933 | -0.1884572 | -0.2757268 | 0.2594785 | 0.1921234 | 1.0000000 | -0.0030292 | 0.0384854 |
| alcohol | -0.0954515 | -0.0376404 | -0.0104935 | -0.3594148 | -0.2569156 | -0.1798384 | -0.2657396 | -0.6867454 | 0.1212485 | -0.0030292 | 1.0000000 | 0.4443185 |
| quality | -0.0767432 | -0.2656995 | 0.0855317 | -0.0369805 | -0.2006655 | 0.0554631 | -0.0413855 | -0.3058579 | 0.0195057 | 0.0384854 | 0.4443185 | 1.0000000 |
#View scatterplot of pH and chlorides
ggplot(winedf, aes(x = pH, y = chlorides)) +
geom_point(color = "turquoise", size = 2) +
labs(title = "Relationship between pH and Chlorides",
x = "pH",
y = "Chlorides") +
theme_minimal()
#View scatterplot of quality and pH
ggplot(winedf, aes(x = quality, y = pH)) +
geom_point(color = "turquoise", size = 2) +
labs(title = "Relationship between Quality and pH",
x = "Quality",
y = "pH") +
theme_minimal()
#View scatterplot of quality and pH
ggplot(winedf, aes(x = pH, y = quality)) +
geom_point(color = "turquoise", size = 2) +
labs(title = "Relationship between Quality and pH",
x = "pH",
y = "Quality") +
theme_minimal()
#View scatterplot of pH and sulphates
ggplot(winedf, aes(x = pH, y = sulphates)) +
geom_point(color = "turquoise", size = 2) +
labs(title = "Relationship between pH and Sulfates",
x = "pH",
y = "Sulfates") +
theme_minimal()
#View scatterplot of pH and fixed acidity
ggplot(winedf, aes(x = pH, y = fixed_acidity)) +
geom_point(color = "turquoise", size = 2) +
labs(title = "Relationship between pH and Fixed Acidity",
x = "pH",
y = "Fixed Acidity") +
theme_minimal()
#View scatterplot of quality and volatile acidity
ggplot(winedf, aes(x = quality, y = volatile_acidity)) +
geom_point(color = "turquoise", size = 2) +
labs(title = "Relationship between Quality and Volatile Acidity",
x = "Quality",
y = "Volatile Acidity") +
theme_minimal()
## Hypothesis: The average volatile acidity of high quality wine (rating
>=7) is significantly lower than that of lower-quality wine (rating
<7)
#Subset based on wine quality
subset1 <- subset(num_winedf, quality >= 7)
subset2 <- subset(num_winedf, quality < 7)
head(subset1)
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 8 7.3 0.65 0.00 1.2 0.065
## 9 7.8 0.58 0.02 2.0 0.073
## 17 8.5 0.28 0.56 1.8 0.092
## 38 8.1 0.38 0.28 2.1 0.066
## 63 7.5 0.52 0.16 1.9 0.085
## 129 8.0 0.59 0.16 1.8 0.065
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 8 15 21 0.9946 3.39 0.47 10.0
## 9 9 18 0.9968 3.36 0.57 9.5
## 17 35 103 0.9969 3.30 0.75 10.5
## 38 13 30 0.9968 3.23 0.73 9.7
## 63 12 35 0.9968 3.38 0.62 9.5
## 129 3 16 0.9962 3.42 0.92 10.5
## quality
## 8 7
## 9 7
## 17 7
## 38 7
## 63 7
## 129 7
head(subset2)
## fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
#Calculate the difference in means for subsets for sample statistic
SampleStatistic <- mean(subset1$volatile_acidity) - mean(subset2$volatile_acidity)
SampleStatistic
## [1] -0.06284923
#Calculate t-test prior to bootstrapping for comparison
samplet <- t.test(subset1, subset2, var.equal = TRUE)
samplet
##
## Two Sample t-test
##
## data: subset1 and subset2
## t = -1.2263, df = 77962, p-value = 0.2201
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.0280658 0.2367345
## sample estimates:
## mean of x mean of y
## 14.74022 15.13589
#Bootstrap using sample() with replacement, with a for loop to calculate the test statistic for 10000 randomizations. Results are stored in "results". In each iteration of the loop, the sample() function randomly samples, with replacement, 100 observations from each subset. The difference of the means is stored in "result[i]".
set.seed(1234)
results <- numeric(10000)
for (i in 1:10000) {
data.subset1 <- sample(subset1$volatile_acidity, size=100, replace=TRUE)
data.subset2 <- sample(subset2$volatile_acidity, size = 100, replace=TRUE)
results[i] <- mean(data.subset1) - mean(data.subset2)
}
#Create a histogram of the statistic's distribution
par(cex.main = 1.5, cex.lab=1.5, cex.axis=1.25)
hist(results, breaks = 15, main = "Distribution of difference in means of volatile acidity",
xlab="Difference in means", col="turquoise", cex=1.5)
#Calculate confidence interval
alpha = 0.05
n1=length(subset1)
n2=length(subset2)
se = sd(results)
ci.l <- mean(results)-qt(1-alpha/2, n1-1)*se
ci.u <- mean(results)+qt(1-alpha/2, n1-1)*se
ci.l
## [1] -0.1088172
ci.u
## [1] -0.01699331
#Calculate confidence interval from bootstrap distribution
alpha = 0.05
ci_lb = quantile(results, probs=(alpha/2))
ci_ub = quantile(results, probs=(1-(alpha/2)))
ci_lb
## 2.5%
## -0.1044
ci_ub
## 97.5%
## -0.02269625
ci_lb %>%
kbl(caption = "Confidence Interval Lower Limit") %>%
kable_classic()
| x | |
|---|---|
| 2.5% | -0.1044 |
ci_ub %>%
kbl(caption = "Confidence Interval Upper Limit") %>%
kable_classic()
| x | |
|---|---|
| 97.5% | -0.0226963 |
citation("tidyverse")
## To cite package 'tidyverse' in publications use:
##
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
## E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
## Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
## the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
## doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }
citation("dplyr")
## To cite package 'dplyr' in publications use:
##
## Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
## Grammar of Data Manipulation_. doi:10.32614/CRAN.package.dplyr
## <https://doi.org/10.32614/CRAN.package.dplyr>, R package version
## 1.1.4, <https://CRAN.R-project.org/package=dplyr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {dplyr: A Grammar of Data Manipulation},
## author = {Hadley Wickham and Romain François and Lionel Henry and Kirill Müller and Davis Vaughan},
## year = {2023},
## note = {R package version 1.1.4},
## url = {https://CRAN.R-project.org/package=dplyr},
## doi = {10.32614/CRAN.package.dplyr},
## }
citation("readr")
## To cite package 'readr' in publications use:
##
## Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
## Data_. doi:10.32614/CRAN.package.readr
## <https://doi.org/10.32614/CRAN.package.readr>, R package version
## 2.1.5, <https://CRAN.R-project.org/package=readr>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {readr: Read Rectangular Text Data},
## author = {Hadley Wickham and Jim Hester and Jennifer Bryan},
## year = {2024},
## note = {R package version 2.1.5},
## url = {https://CRAN.R-project.org/package=readr},
## doi = {10.32614/CRAN.package.readr},
## }
citation("boot")
## To cite the 'boot' package in publications use:
##
## Angelo Canty and Brian Ripley (2024). boot: Bootstrap R (S-Plus)
## Functions. R package version 1.3-31.
##
## Davison, A. C. & Hinkley, D. V. (1997) Bootstrap Methods and Their
## Applications. Cambridge University Press, Cambridge. ISBN
## 0-521-57391-2
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
citation("ggplot2")
## To cite ggplot2 in publications, please use
##
## H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
## Springer-Verlag New York, 2016.
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## author = {Hadley Wickham},
## title = {ggplot2: Elegant Graphics for Data Analysis},
## publisher = {Springer-Verlag New York},
## year = {2016},
## isbn = {978-3-319-24277-4},
## url = {https://ggplot2.tidyverse.org},
## }
citation("moments")
## To cite package 'moments' in publications use:
##
## Komsta L, Novomestky F (2022). _moments: Moments, Cumulants,
## Skewness, Kurtosis and Related Tests_.
## doi:10.32614/CRAN.package.moments
## <https://doi.org/10.32614/CRAN.package.moments>, R package version
## 0.14.1, <https://CRAN.R-project.org/package=moments>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests},
## author = {Lukasz Komsta and Frederick Novomestky},
## year = {2022},
## note = {R package version 0.14.1},
## url = {https://CRAN.R-project.org/package=moments},
## doi = {10.32614/CRAN.package.moments},
## }
##
## ATTENTION: This citation information has been auto-generated from the
## package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
citation("scales")
## To cite package 'scales' in publications use:
##
## Wickham H, Pedersen T, Seidel D (2025). _scales: Scale Functions for
## Visualization_. doi:10.32614/CRAN.package.scales
## <https://doi.org/10.32614/CRAN.package.scales>, R package version
## 1.4.0, <https://CRAN.R-project.org/package=scales>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {scales: Scale Functions for Visualization},
## author = {Hadley Wickham and Thomas Lin Pedersen and Dana Seidel},
## year = {2025},
## note = {R package version 1.4.0},
## url = {https://CRAN.R-project.org/package=scales},
## doi = {10.32614/CRAN.package.scales},
## }
citation("kableExtra")
## To cite package 'kableExtra' in publications use:
##
## Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
## Pipe Syntax_. doi:10.32614/CRAN.package.kableExtra
## <https://doi.org/10.32614/CRAN.package.kableExtra>, R package version
## 1.4.0, <https://CRAN.R-project.org/package=kableExtra>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {kableExtra: Construct Complex Table with 'kable' and Pipe Syntax},
## author = {Hao Zhu},
## year = {2024},
## note = {R package version 1.4.0},
## url = {https://CRAN.R-project.org/package=kableExtra},
## doi = {10.32614/CRAN.package.kableExtra},
## }
citation("rsample")
## To cite package 'rsample' in publications use:
##
## Frick H, Chow F, Kuhn M, Mahoney M, Silge J, Wickham H (2025).
## _rsample: General Resampling Infrastructure_.
## doi:10.32614/CRAN.package.rsample
## <https://doi.org/10.32614/CRAN.package.rsample>, R package version
## 1.3.1, <https://CRAN.R-project.org/package=rsample>.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {rsample: General Resampling Infrastructure},
## author = {Hannah Frick and Fanny Chow and Max Kuhn and Michael Mahoney and Julia Silge and Hadley Wickham},
## year = {2025},
## note = {R package version 1.3.1},
## url = {https://CRAN.R-project.org/package=rsample},
## doi = {10.32614/CRAN.package.rsample},
## }
citation("tidymodels")
## To cite package 'tidymodels' in publications use:
##
## Kuhn et al., (2020). Tidymodels: a collection of packages for
## modeling and machine learning using tidyverse principles.
## https://www.tidymodels.org
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles.},
## author = {Max Kuhn and Hadley Wickham},
## url = {https://www.tidymodels.org},
## year = {2020},
## }
Sdt320. (2025). Wine_quality_dataset/Wine_Quality_Data.csv at main · Sdt320/Wine_quality_dataset. GitHub. https://github.com/Sdt320/Wine_quality_dataset/blob/main/Wine_Quality_Data.csv