Introduction

I plan to use the Wine dataset from the UCI Machine Learning Repository, which contains chemical measurements of wines derived from three different cultivars grown in the same region of Italy. I chose this dataset because it includes a clear target variable and multiple quantitative features that are well suited for practicing data transformations and cleaning tasks.

Planned Workflow

The Wine dataset is tabular, with a clear target variable (the cultivar class) and thirteen quantitative features. I will access it through a public URL so that the analysis is fully reproducible. After loading the data into R, I will inspect its structure and select a subset of relevant variables for the final transformed data frame.
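As a minimal sketch of the load-and-inspect step, the snippet below reads a few comma-separated rows without a header and then checks the structure. To stay self-contained it uses an inline text sample (the first three columns of the first three rows shown in the output later in this report) instead of the URL; the object name sample_raw is only illustrative.

```r
# Inline stand-in for the remote CSV: class, alcohol, malic_acid for three rows
csv_text <- "1,14.23,1.71
1,13.20,1.78
1,13.16,2.36"

# With header = FALSE, R assigns placeholder names V1, V2, V3
sample_raw <- read.csv(text = csv_text, header = FALSE)
default_names <- names(sample_raw)

# Replace the placeholders with descriptive names
colnames(sample_raw) <- c("class", "alcohol", "malic_acid")

# Inspect dimensions and column types
dim(sample_raw)
str(sample_raw)
```

Because the file carries no header row, the placeholder names are the main reason the column names have to be assigned by hand after loading.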

Anticipated Data Challenges

Possible challenges include missing values, inconsistent data types, and abbreviated or coded variable values that are not immediately interpretable. I anticipate needing to recode certain variables and rename columns to improve clarity and usability for future analysis.
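One way to check for the first two challenges is sketched below. The data frame toy is a hypothetical stand-in with made-up values, built to contain one missing value and one numeric column stored as text; the same calls could be applied to the loaded data.

```r
# Hypothetical stand-in data (values invented for illustration only)
toy <- data.frame(
  class   = c(1, 2, 3, 1),
  alcohol = c(14.2, NA, 13.1, 12.9),           # one missing value
  proline = c("1065", "1050", "735", "1480"),  # numbers stored as text
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# Record the class of each column to spot inconsistent types
col_types <- sapply(toy, class)

# Convert the text column to numeric once the problem is identified
toy$proline <- as.numeric(toy$proline)

na_counts
col_types
```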

Data Loading

# The raw file has no header row, so read it with header = FALSE
wine_raw <- read.csv(
  "https://raw.githubusercontent.com/Kristoffgit/data-607-week-1/refs/heads/main/wine.csv",
  header = FALSE
)
# Assign descriptive column names based on the UCI documentation
colnames(wine_raw) <- c(
  "class",
  "alcohol",
  "malic_acid",
  "ash",
  "alcalinity_of_ash",
  "magnesium",
  "total_phenols",
  "flavanoids",
  "nonflavanoid_phenols",
  "proanthocyanins",
  "color_intensity",
  "hue",
  "od280_od315",
  "proline"
)
head(wine_raw)
##   class alcohol malic_acid  ash alcalinity_of_ash magnesium total_phenols
## 1     1   14.23       1.71 2.43              15.6       127          2.80
## 2     1   13.20       1.78 2.14              11.2       100          2.65
## 3     1   13.16       2.36 2.67              18.6       101          2.80
## 4     1   14.37       1.95 2.50              16.8       113          3.85
## 5     1   13.24       2.59 2.87              21.0       118          2.80
## 6     1   14.20       1.76 2.45              15.2       112          3.27
##   flavanoids nonflavanoid_phenols proanthocyanins color_intensity  hue
## 1       3.06                 0.28            2.29            5.64 1.04
## 2       2.76                 0.26            1.28            4.38 1.05
## 3       3.24                 0.30            2.81            5.68 1.03
## 4       3.49                 0.24            2.18            7.80 0.86
## 5       2.69                 0.39            1.82            4.32 1.04
## 6       3.39                 0.34            1.97            6.75 1.05
##   od280_od315 proline
## 1        3.92    1065
## 2        3.40    1050
## 3        3.17    1185
## 4        3.45    1480
## 5        2.93     735
## 6        2.85    1450
# Recode the numeric class codes as labeled factor levels
wine_raw$class <- factor(
  wine_raw$class,
  levels = c(1, 2, 3),
  labels = c("Cultivar_1", "Cultivar_2", "Cultivar_3")
)
table(wine_raw$class)
## 
## Cultivar_1 Cultivar_2 Cultivar_3 
##         59         71         48
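Because factor() converts any value not listed in levels to NA without a warning, it can be worth confirming that the recode matched every row. The sketch below uses a hypothetical vector class_codes that includes one unexpected code (4) to show what the check would catch; it is not drawn from the Wine data.

```r
# Hypothetical class codes; 4 is deliberately outside the expected levels
class_codes <- c(1, 2, 3, 2, 4)

recoded <- factor(
  class_codes,
  levels = c(1, 2, 3),
  labels = c("Cultivar_1", "Cultivar_2", "Cultivar_3")
)

# Count values that were present before recoding but became NA afterwards
n_unmatched <- sum(is.na(recoded) & !is.na(class_codes))

# Include NA in the frequency table so unmatched codes are visible
table(recoded, useNA = "ifany")
```

A result of zero for n_unmatched on the real data would indicate that every class code mapped to a cultivar label.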
# Keep the target variable and a subset of relevant features
wine_clean <- wine_raw[, c(
  "class",
  "alcohol",
  "malic_acid",
  "total_phenols",
  "flavanoids",
  "color_intensity",
  "proline"
)]

head(wine_clean)
##        class alcohol malic_acid total_phenols flavanoids color_intensity
## 1 Cultivar_1   14.23       1.71          2.80       3.06            5.64
## 2 Cultivar_1   13.20       1.78          2.65       2.76            4.38
## 3 Cultivar_1   13.16       2.36          2.80       3.24            5.68
## 4 Cultivar_1   14.37       1.95          3.85       3.49            7.80
## 5 Cultivar_1   13.24       2.59          2.80       2.69            4.32
## 6 Cultivar_1   14.20       1.76          3.27       3.39            6.75
##   proline
## 1    1065
## 2    1050
## 3    1185
## 4    1480
## 5     735
## 6    1450

Conclusions and Next Steps

In this assignment, I loaded the Wine dataset from a public GitHub URL and transformed it into a more interpretable format ready for analysis. I assigned column names based on the UCI documentation, recoded the target variable (class) into labeled cultivars, and created a smaller data frame (wine_clean) containing the target plus a subset of relevant features. To extend this work, I could explore how environmental factors, such as where the grapes are grown, influence the chemical composition of wines. Because this dataset covers wines from a single region of Italy, comparing it with similar data from other regions could help determine whether the observed differences are specific to the cultivars or are influenced by growing conditions.