I plan to use the Wine dataset from the UCI Machine Learning Repository, which contains chemical measurements of wines derived from three different cultivars grown in the same region of Italy. I chose this dataset because it includes a clear target variable and multiple quantitative features that are well suited for practicing data transformations and cleaning tasks.
I will begin by selecting a tabular dataset that contains a clear outcome or target variable along with several additional features. The dataset will be accessed through a public URL to ensure the analysis is fully reproducible. After loading the data into R, I will inspect its structure and identify a subset of relevant variables to include in the final transformed data frame.
Possible challenges include missing values, inconsistent data types, and abbreviated or coded variable values that are not immediately interpretable. I anticipate needing to recode certain variables and rename columns to improve clarity and usability for future analysis.
wine_raw <- read.csv(
  "https://raw.githubusercontent.com/Kristoffgit/data-607-week-1/refs/heads/main/wine.csv",
  header = FALSE
)
colnames(wine_raw) <- c(
  "class",
  "alcohol",
  "malic_acid",
  "ash",
  "alcalinity_of_ash",
  "magnesium",
  "total_phenols",
  "flavanoids",
  "nonflavanoid_phenols",
  "proanthocyanins",
  "color_intensity",
  "hue",
  "od280_od315",
  "proline"
)
head(wine_raw)
##   class alcohol malic_acid  ash alcalinity_of_ash magnesium total_phenols
## 1     1   14.23       1.71 2.43              15.6       127          2.80
## 2     1   13.20       1.78 2.14              11.2       100          2.65
## 3     1   13.16       2.36 2.67              18.6       101          2.80
## 4     1   14.37       1.95 2.50              16.8       113          3.85
## 5     1   13.24       2.59 2.87              21.0       118          2.80
## 6     1   14.20       1.76 2.45              15.2       112          3.27
##   flavanoids nonflavanoid_phenols proanthocyanins color_intensity  hue
## 1       3.06                 0.28            2.29            5.64 1.04
## 2       2.76                 0.26            1.28            4.38 1.05
## 3       3.24                 0.30            2.81            5.68 1.03
## 4       3.49                 0.24            2.18            7.80 0.86
## 5       2.69                 0.39            1.82            4.32 1.04
## 6       3.39                 0.34            1.97            6.75 1.05
##   od280_od315 proline
## 1        3.92    1065
## 2        3.40    1050
## 3        3.17    1185
## 4        3.45    1480
## 5        2.93     735
## 6        2.85    1450
wine_raw$class <- factor(
  wine_raw$class,
  levels = c(1, 2, 3),
  labels = c("Cultivar_1", "Cultivar_2", "Cultivar_3")
)
table(wine_raw$class)
##
## Cultivar_1 Cultivar_2 Cultivar_3
##         59         71         48
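One thing worth checking after a recode like this is that `factor()` turns any value not listed in `levels` into `NA` without a warning. The following is a minimal sketch of that check. The vector `codes` is hypothetical (made up for illustration, not taken from the dataset) and deliberately includes one unexpected code.

```r
# Hypothetical class codes; the 4 is deliberately outside the expected 1 to 3.
codes <- c(1, 2, 3, 2, 4)

recoded <- factor(
  codes,
  levels = c(1, 2, 3),
  labels = c("Cultivar_1", "Cultivar_2", "Cultivar_3")
)

# Values not listed in `levels` are silently converted to NA.
n_unmatched <- sum(is.na(recoded))
print(n_unmatched)

# useNA = "ifany" adds an NA category to the counts when any are present.
print(table(recoded, useNA = "ifany"))
```

For the wine data, the three counts shown above sum to 178 (59 + 71 + 48). Comparing that sum to `nrow(wine_raw)` is a quick way to confirm that no class codes were lost during recoding.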
wine_clean <- wine_raw[, c(
  "class",
  "alcohol",
  "malic_acid",
  "total_phenols",
  "flavanoids",
  "color_intensity",
  "proline"
)]
head(wine_clean)
##        class alcohol malic_acid total_phenols flavanoids color_intensity
## 1 Cultivar_1   14.23       1.71          2.80       3.06            5.64
## 2 Cultivar_1   13.20       1.78          2.65       2.76            4.38
## 3 Cultivar_1   13.16       2.36          2.80       3.24            5.68
## 4 Cultivar_1   14.37       1.95          3.85       3.49            7.80
## 5 Cultivar_1   13.24       2.59          2.80       2.69            4.32
## 6 Cultivar_1   14.20       1.76          3.27       3.39            6.75
##   proline
## 1    1065
## 2    1050
## 3    1185
## 4    1480
## 5     735
## 6    1450
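Selecting columns by name stops with an "undefined columns selected" error if any name is misspelled, so a small guard can make this step easier to debug. The sketch below works under stated assumptions: `wine_sample`, `keep_cols`, and `wine_subset` are hypothetical names, and the data frame holds only the first three rows shown above with a reduced set of columns.

```r
# Hypothetical stand-in for the recoded data: first three rows shown above.
wine_sample <- data.frame(
  class   = factor(c("Cultivar_1", "Cultivar_1", "Cultivar_1")),
  alcohol = c(14.23, 13.20, 13.16),
  ash     = c(2.43, 2.14, 2.67),
  proline = c(1065, 1050, 1185)
)

# Columns to keep in the reduced data frame.
keep_cols <- c("class", "alcohol", "proline")

# Report any requested names that are absent before subsetting.
missing_cols <- setdiff(keep_cols, names(wine_sample))
if (length(missing_cols) > 0) {
  stop("Columns not found: ", paste(missing_cols, collapse = ", "))
}

# Subset by name; column order follows keep_cols.
wine_subset <- wine_sample[, keep_cols]
print(dim(wine_subset))
```

The same pattern would apply to `wine_raw` with the seven column names used for `wine_clean`.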
In this assignment, I loaded the Wine dataset from a public GitHub
URL and transformed it into a more interpretable format ready for
analysis. I assigned column names based on the UCI documentation and
recoded the target variable (class) into labeled cultivars. I then
created a smaller dataset (wine_clean) containing the target plus a
subset of relevant features. To extend this work, I could explore how
environmental factors, such as where the grapes are grown, influence
the chemical composition of wines. Because this dataset covers wines
from a single region in Italy, comparing it to similar data from other
regions could help determine whether the observed differences are
specific to cultivars or are influenced by growing conditions.