Question 1
This data was collected as part of a psychometric analysis investigating whether our customers will or will not buy our product and why. The outcome is Behavioural Intention (B_INT).
Determine how many dimensions are in the data.
Dimensions:
The dimensions of a dataset are its variables or features. In this case, each variable represents a different aspect of the psychometric analysis.
The number of dimensions is therefore the number of columns in the dataset.
Here, I have used three different functions to inspect the shape of the dataset:
- ncol() returns the number of columns in a data frame.
- nrow() returns the number of rows in a data frame.
- dim() returns the dimensions of the data frame as a vector, where the first element is the number of rows and the second element is the number of columns.
Finally, I have used the skim() function from the skimr package to get a summary of the dataset, which includes the number of rows, columns, missing values, and basic statistics for each column.
# Load data (read.csv() is base R, so readr does not need to be loaded)
utaut_data <- read.csv("utaut_data.csv")
print(paste("Columns: ", ncol(utaut_data))) # number of columns
## [1] "Columns: 66"
print(paste("Rows: ",nrow(utaut_data))) # number of rows
## [1] "Rows: 216"
#dimensions of the dataset
dim(utaut_data) # rows and columns
## [1] 216 66
# skim the dataset
library(skimr)
#skim is used to get a summary of the data,
#it contains the number of rows, columns, missing values,
#and basic statistics for each column.
skim(utaut_data)
| Name | utaut_data |
| Number of rows | 216 |
| Number of columns | 66 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 65 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| qualitative | 0 | 1 | 0 | 654 | 4 | 212 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ANX1 | 0 | 1 | 2.40 | 1.37 | 1 | 1.00 | 2 | 3.00 | 7 | ▇▂▁▁▁ |
| ANX2 | 0 | 1 | 2.19 | 1.34 | 1 | 1.00 | 2 | 3.00 | 7 | ▇▂▁▁▁ |
| ANX3 | 0 | 1 | 2.04 | 1.23 | 1 | 1.00 | 2 | 2.00 | 7 | ▇▁▁▁▁ |
| ANX4 | 0 | 1 | 1.98 | 1.13 | 1 | 1.00 | 2 | 2.00 | 6 | ▇▁▁▁▁ |
| ATTU1 | 0 | 1 | 5.86 | 0.97 | 2 | 5.00 | 6 | 6.00 | 7 | ▁▁▃▇▃ |
| ATTU2 | 0 | 1 | 5.87 | 1.08 | 2 | 5.00 | 6 | 7.00 | 7 | ▁▁▃▇▆ |
| ATTU3 | 0 | 1 | 5.46 | 1.22 | 2 | 5.00 | 6 | 6.00 | 7 | ▂▃▇▇▆ |
| ATTU4 | 0 | 1 | 5.55 | 1.20 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▂▃▇ |
| DEMO1 | 0 | 1 | 5.69 | 1.05 | 2 | 5.00 | 6 | 6.00 | 7 | ▁▂▅▇▅ |
| DEMO2 | 0 | 1 | 5.64 | 1.23 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▁▃▇ |
| DEMO3 | 0 | 1 | 5.55 | 1.27 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▂▂▇ |
| DEMO4 | 0 | 1 | 5.63 | 1.15 | 2 | 5.00 | 6 | 6.00 | 7 | ▁▂▅▇▅ |
| FACL1 | 0 | 1 | 5.91 | 0.95 | 2 | 6.00 | 6 | 7.00 | 7 | ▁▁▂▇▅ |
| FACL2 | 0 | 1 | 5.96 | 0.89 | 3 | 6.00 | 6 | 7.00 | 7 | ▁▁▃▇▅ |
| FACL3 | 0 | 1 | 5.02 | 1.25 | 1 | 4.00 | 5 | 6.00 | 7 | ▁▁▇▅▇ |
| FACL4 | 0 | 1 | 5.31 | 1.26 | 1 | 5.00 | 5 | 6.00 | 7 | ▁▁▃▅▇ |
| HAB1 | 0 | 1 | 4.69 | 1.42 | 1 | 4.00 | 5 | 6.00 | 7 | ▂▂▆▆▇ |
| HAB2 | 0 | 1 | 3.42 | 1.74 | 1 | 2.00 | 3 | 5.00 | 7 | ▇▅▃▂▃ |
| HAB3 | 0 | 1 | 3.47 | 1.75 | 1 | 2.00 | 3 | 5.00 | 7 | ▇▃▃▃▃ |
| HAB4 | 0 | 1 | 5.01 | 1.36 | 1 | 4.00 | 5 | 6.00 | 7 | ▁▂▃▇▇ |
| IMAG1 | 0 | 1 | 3.17 | 1.41 | 1 | 2.00 | 3 | 4.00 | 7 | ▇▂▆▂▂ |
| IMAG2 | 0 | 1 | 3.37 | 1.43 | 1 | 2.00 | 4 | 4.00 | 7 | ▇▃▇▂▂ |
| IMAG3 | 0 | 1 | 2.97 | 1.47 | 1 | 2.00 | 3 | 4.00 | 7 | ▇▂▅▁▁ |
| INFL3 | 0 | 1 | 3.69 | 1.44 | 1 | 2.00 | 4 | 4.00 | 7 | ▅▂▇▃▂ |
| INNOV1 | 0 | 1 | 5.60 | 1.10 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▁▃▇ |
| INNOV2 | 0 | 1 | 5.64 | 1.17 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▁▃▇ |
| INNOV3 | 0 | 1 | 4.55 | 1.57 | 1 | 3.00 | 5 | 6.00 | 7 | ▃▅▅▆▇ |
| INT1 | 0 | 1 | 3.77 | 1.65 | 1 | 2.00 | 4 | 5.00 | 7 | ▇▅▇▅▅ |
| INT2 | 0 | 1 | 3.72 | 1.73 | 1 | 2.00 | 4 | 5.00 | 7 | ▇▃▅▅▅ |
| INT3 | 0 | 1 | 3.71 | 1.72 | 1 | 2.00 | 4 | 5.00 | 7 | ▇▃▆▅▅ |
| MOTIV1 | 0 | 1 | 5.34 | 1.22 | 2 | 5.00 | 5 | 6.00 | 7 | ▂▃▇▇▅ |
| MOTIV2 | 0 | 1 | 5.53 | 1.13 | 2 | 5.00 | 6 | 6.00 | 7 | ▁▂▅▇▃ |
| MOTIV3 | 0 | 1 | 5.08 | 1.26 | 2 | 4.00 | 5 | 6.00 | 7 | ▃▃▇▆▃ |
| NORM1 | 0 | 1 | 3.75 | 1.35 | 1 | 3.00 | 4 | 4.00 | 7 | ▃▂▇▂▂ |
| NORM2 | 0 | 1 | 3.72 | 1.43 | 1 | 3.00 | 4 | 4.00 | 7 | ▅▂▇▂▂ |
| PEOU1 | 0 | 1 | 5.90 | 0.99 | 3 | 5.00 | 6 | 7.00 | 7 | ▁▁▃▇▅ |
| PEOU2 | 0 | 1 | 5.80 | 1.03 | 2 | 5.00 | 6 | 6.25 | 7 | ▁▁▅▇▅ |
| PEOU3 | 0 | 1 | 5.88 | 1.06 | 1 | 5.00 | 6 | 7.00 | 7 | ▁▁▁▂▇ |
| PEOU4 | 0 | 1 | 5.76 | 1.10 | 1 | 5.00 | 6 | 7.00 | 7 | ▁▁▁▃▇ |
| PEOU5 | 0 | 1 | 5.82 | 1.09 | 2 | 5.00 | 6 | 7.00 | 7 | ▁▂▃▇▆ |
| PEOU6 | 0 | 1 | 5.92 | 0.96 | 2 | 5.00 | 6 | 7.00 | 7 | ▁▁▃▇▅ |
| PERF1 | 0 | 1 | 5.26 | 1.45 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▂▃▇ |
| PERF2 | 0 | 1 | 5.05 | 1.34 | 1 | 4.00 | 5 | 6.00 | 7 | ▁▂▃▆▇ |
| PERF3 | 0 | 1 | 5.16 | 1.40 | 1 | 4.00 | 5 | 6.00 | 7 | ▁▁▂▅▇ |
| PERF4 | 0 | 1 | 4.95 | 1.40 | 1 | 4.00 | 5 | 6.00 | 7 | ▂▁▅▆▇ |
| QUAL1 | 0 | 1 | 5.49 | 1.04 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▂▃▇ |
| QUAL2 | 0 | 1 | 5.70 | 1.04 | 1 | 5.00 | 6 | 6.00 | 7 | ▁▁▁▂▇ |
| RELEV1 | 0 | 1 | 4.01 | 1.60 | 1 | 3.00 | 4 | 5.00 | 7 | ▆▅▇▇▅ |
| RELEV2 | 0 | 1 | 4.74 | 1.61 | 1 | 4.00 | 5 | 6.00 | 7 | ▃▁▃▆▇ |
| SEFF1 | 0 | 1 | 5.89 | 0.97 | 3 | 5.00 | 6 | 7.00 | 7 | ▁▁▃▇▅ |
| SEFF2 | 0 | 1 | 5.77 | 1.23 | 2 | 5.00 | 6 | 7.00 | 7 | ▁▃▃▇▇ |
| SEFF3 | 0 | 1 | 5.90 | 1.08 | 2 | 5.00 | 6 | 7.00 | 7 | ▁▂▃▇▇ |
| SEFF4 | 0 | 1 | 5.88 | 1.04 | 3 | 5.00 | 6 | 7.00 | 7 | ▁▂▃▇▆ |
| USEFUL2 | 0 | 1 | 5.00 | 1.45 | 1 | 4.00 | 5 | 6.00 | 7 | ▂▂▂▇▇ |
| USEFUL4 | 0 | 1 | 5.07 | 1.39 | 1 | 4.00 | 5 | 6.00 | 7 | ▂▁▃▇▇ |
| USEFUL5 | 0 | 1 | 5.16 | 1.37 | 1 | 5.00 | 5 | 6.00 | 7 | ▁▁▂▆▇ |
| VALUE1 | 0 | 1 | 4.58 | 1.50 | 1 | 3.75 | 5 | 6.00 | 7 | ▃▃▃▇▇ |
| VALUE2 | 0 | 1 | 4.48 | 1.46 | 1 | 4.00 | 5 | 6.00 | 7 | ▃▃▆▇▇ |
| VALUE3 | 0 | 1 | 4.59 | 1.41 | 1 | 4.00 | 5 | 6.00 | 7 | ▂▂▅▇▇ |
| VOL1 | 0 | 1 | 6.17 | 1.03 | 1 | 6.00 | 6 | 7.00 | 7 | ▁▁▁▁▇ |
| VOL2 | 0 | 1 | 6.20 | 1.23 | 1 | 6.00 | 7 | 7.00 | 7 | ▁▁▁▁▇ |
| VOL3 | 0 | 1 | 6.19 | 1.14 | 2 | 6.00 | 7 | 7.00 | 7 | ▁▁▁▅▇ |
| age | 0 | 1 | 2.25 | 0.95 | 1 | 2.00 | 2 | 3.00 | 5 | ▃▇▃▂▁ |
| Experience | 0 | 1 | 4.51 | 1.14 | 1 | 4.00 | 5 | 5.00 | 6 | ▁▁▂▇▁ |
| gender | 0 | 1 | 1.44 | 0.50 | 1 | 1.00 | 1 | 2.00 | 2 | ▇▁▁▁▆ |
Summary of Dataset:
- The dataset has 216 rows (observations) and 66 columns (dimensions).
- Of these 66 columns, 65 are numeric and 1 is a character column (qualitative), which appears to hold free-text responses (212 unique values, up to 654 characters long).
- No column has missing values, although 4 entries of qualitative are empty strings.
- In total, the dataset has 66 dimensions, most of them numeric variables that show different aspects of how customers think and act.
In conclusion:
Each column in the dataset is a dimension because it records a different part of the survey. The column names also suggest that related items are grouped by prefix (for example, ANX1 to ANX4 or PEOU1 to PEOU6), so several columns may measure the same underlying factor, such as effort expectancy, social influence, or facilitating conditions. The survey covered many psychological and behavioral factors that could help explain whether customers plan to buy the product, which is why there are so many numeric variables.
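The grouping of items by prefix can be checked with a short sketch. The vector below is only an illustrative subset of column names copied from the skim() output, and the names `item_names`, `item_prefix`, and `items_per_prefix` are made up for this example.

```r
# Illustrative subset of column names copied from the skim() output above
item_names <- c("ANX1", "ANX2", "ANX3", "ANX4",
                "FACL1", "FACL2", "FACL3", "FACL4",
                "PEOU1", "PEOU2", "PEOU3", "PEOU4", "PEOU5", "PEOU6",
                "NORM1", "NORM2")

# Strip the trailing item number to recover the prefix of each column
item_prefix <- sub("[0-9]+$", "", item_names)

# Count how many items share each prefix
items_per_prefix <- table(item_prefix)
items_per_prefix

# Number of distinct prefixes in this subset
length(items_per_prefix)
```

On the full dataset, `names(utaut_data)` could replace the hand-typed vector. Columns without a trailing number, such as age and gender, would then show up as their own groups.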
Question 2
Normalize the data in Table 2.17, showing calculations.
Table 2.17
| Age | Income ($) |
|---|---|
| 25 | 49,000 |
| 56 | 156,000 |
| 65 | 99,000 |
| 32 | 192,000 |
| 41 | 39,000 |
| 49 | 57,000 |
Normalization: Normalization is the process of rescaling data so that all variables are on a comparable scale, which prevents variables with large ranges or units from dominating the analysis.
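Types of Normalization
Two types are applied below: min-max normalization, which rescales each variable to the range 0 to 1, and z-score normalization, which rescales each variable to mean 0 and standard deviation 1.
The formula for min-max normalization is:
\[X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}\]
Where:
- \(X\) is the original value
- \(X_{min}\) is the minimum value in the dataset
- \(X_{max}\) is the maximum value in the dataset
- \(X_{norm}\) is the normalized value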
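As a hand check of the min-max formula, take the second record in Table 2.17 (Age 56, Income $156,000). The minimum and maximum of each column are read directly from the table:

\[Age_{norm} = \frac{56 - 25}{65 - 25} = \frac{31}{40} = 0.775\]
\[Income_{norm} = \frac{156{,}000 - 39{,}000}{192{,}000 - 39{,}000} = \frac{117{,}000}{153{,}000} \approx 0.7647\]

The other records follow the same pattern, with only the numerator changing.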
# Function for min-max normalisation
min_max_normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
# apply the function to the columns
IncomeAge <- data.frame(
Age = c(25, 56, 65, 32, 41, 49),
Income = c(49000, 156000, 99000, 192000, 39000, 57000))
rownames(IncomeAge) <- paste0("R", 1:6)
# Normalize the Age and Income columns
IncomeAge$Normalized_Age <- min_max_normalize(IncomeAge$Age)
IncomeAge$Normalized_Income <- min_max_normalize(IncomeAge$Income)
IncomeAge
## Age Income Normalized_Age Normalized_Income
## R1 25 49000 0.000 0.06535948
## R2 56 156000 0.775 0.76470588
## R3 65 99000 1.000 0.39215686
## R4 32 192000 0.175 1.00000000
## R5 41 39000 0.400 0.00000000
## R6 49 57000 0.600 0.11764706
# Z-score normalization function
z_score_normalize <- function(x) {
return ((x - mean(x)) / sd(x))
}
# Apply z-score normalization to the columns
IncomeAge$Z_Score_Age <- z_score_normalize(IncomeAge$Age)
IncomeAge$Z_Score_Income <- z_score_normalize(IncomeAge$Income)
IncomeAge
## Age Income Normalized_Age Normalized_Income Z_Score_Age Z_Score_Income
## R1 25 49000 0.000 0.06535948 -1.3132530 -0.790026927
## R2 56 156000 0.775 0.76470588 0.7567898 0.911977392
## R3 65 99000 1.000 0.39215686 1.3577700 0.005302194
## R4 32 192000 0.175 1.00000000 -0.8458239 1.484614360
## R5 41 39000 0.400 0.00000000 -0.2448438 -0.949092751
## R6 49 57000 0.600 0.11764706 0.2893608 -0.662774268
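The z-score columns can be checked by hand in the same way. The formula behind z_score_normalize() is shown below, where \(\bar{X}\) is the column mean and \(s\) is the sample standard deviation (R's sd() divides by \(n - 1\)). The intermediate values are rounded.

\[z = \frac{X - \bar{X}}{s}\]
\[\bar{X}_{Age} = \frac{25 + 56 + 65 + 32 + 41 + 49}{6} = \frac{268}{6} \approx 44.67\]
\[s_{Age} = \sqrt{\frac{\sum (X - \bar{X}_{Age})^2}{6 - 1}} \approx \sqrt{\frac{1121.33}{5}} \approx 14.98\]
\[z_{25} = \frac{25 - 44.67}{14.98} \approx -1.313\]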
Summary:
After min-max normalization, all values for Age and Income are on the same scale (0 to 1).
- For Age, the youngest person (25) becomes 0, and the oldest (65) becomes 1.
- For Income, the lowest ($39,000) becomes 0, and the highest ($192,000) becomes 1.
- Everyone else is mapped between 0 and 1.
The z-score columns instead center each variable at mean 0 with standard deviation 1, so their values can be negative or greater than 1.
Either way, neither Age nor Income dominates because of its units.
Question 3
Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the square root of the sum of the squared differences. For the first two records in Table 2.17, it is:
\[ d = \sqrt{(25 - 56)^2 + (49{,}000 - 156{,}000)^2} \]
Can normalizing the data change which two records are farthest from each other in terms of Euclidean distance?
# Function to calculate Euclidean distance
euclidean_distance <- function(x1, x2) {
sqrt(sum((x1 - x2)^2))
}
# Calculate Euclidean distance between the first two records in the original data
record1 <- c(IncomeAge$Age[1], IncomeAge$Income[1])
record2 <- c(IncomeAge$Age[2], IncomeAge$Income[2])
distance_original <- euclidean_distance(record1, record2)
distance_original
## [1] 107000
# Find the farthest records
find_farthest_records <- function(data) {
  max_distance <- 0
  record_pair <- c(NA, NA)
  # Compare every pair of records (i, j) with i < j
  for (i in 1:(nrow(data) - 1)) {
    for (j in (i + 1):nrow(data)) {
      dist <- euclidean_distance(c(data$Age[i], data$Income[i]),
                                 c(data$Age[j], data$Income[j]))
      if (dist > max_distance) {
        max_distance <- dist
        record_pair <- c(i, j)
      }
    }
  }
  list(pair = record_pair, distance = max_distance)
}
# Find farthest records in original data
farthest_original <- find_farthest_records(IncomeAge)
print(paste("Farthest records in original data are R", farthest_original$pair[1],
"and R", farthest_original$pair[2], "with a distance of",
farthest_original$distance))
## [1] "Farthest records in original data are R 4 and R 5 with a distance of 153000.000264706"
# Find farthest records in normalized data
farthest_normalized <- find_farthest_records(data.frame(
  Age = IncomeAge$Normalized_Age,
  Income = IncomeAge$Normalized_Income))
print(paste("Farthest records in normalized data are R", farthest_normalized$pair[1],
"and R", farthest_normalized$pair[2], "with a distance of",
farthest_normalized$distance))
## [1] "Farthest records in normalized data are R 1 and R 3 with a distance of 1.05204397781113"
Answer:
Yes, normalizing the data can change which two records are farthest from each other in terms of Euclidean distance, and here it does. In the original data, the distance is driven almost entirely by Income because of its much larger scale: the farthest pair is R4 and R5 (distance ≈ 153,000), the records with the highest and lowest incomes. After min-max normalization, both variables are on the same scale (0 to 1), so Age contributes as much as Income. The farthest pair becomes R1 and R3 (distance ≈ 1.052), the youngest and oldest records.
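As a cross-check on the loop above, the same comparison can be sketched with base R's dist(), which computes Euclidean distances by default. The helper names `rescale01` and `farthest_pair` are made up for this sketch, and the data are retyped from Table 2.17 so the block runs on its own.

```r
# Data retyped from Table 2.17
age <- c(25, 56, 65, 32, 41, 49)
income <- c(49000, 156000, 99000, 192000, 39000, 57000)
raw <- cbind(age, income)

# Min-max rescaling of one column to the range 0 to 1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
normalized <- apply(raw, 2, rescale01)

# Return the row indices of the two records with the largest distance
farthest_pair <- function(m) {
  d <- as.matrix(dist(m))                        # full pairwise distance matrix
  idx <- which(d == max(d), arr.ind = TRUE)[1, ] # first cell holding the maximum
  sort(unname(idx))
}

farthest_pair(raw)        # records 4 and 5
farthest_pair(normalized) # records 1 and 3
```

Using dist() avoids the explicit double loop, at the cost of building the whole distance matrix, which is fine for six records.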
Question 4
The data are in the file Wine.csv. Perform PCA on the data, showing the first four principal components.
# Load the mlba package
if (!require(mlba)) {
library(devtools)
install_github("gedeck/mlba/mlba", force=TRUE)
}
# Load Wine data from mlba package
data(Wine, package = "mlba")
wine.df <- Wine
# skim the data
skimr::skim(wine.df)
| Name | wine.df |
| Number of rows | 178 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Type | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Alcohol | 0 | 1 | 13.00 | 0.81 | 11.03 | 12.36 | 13.05 | 13.68 | 14.83 | ▂▇▇▇▃ |
| Malic_Acid | 0 | 1 | 2.34 | 1.12 | 0.74 | 1.60 | 1.87 | 3.08 | 5.80 | ▇▅▂▂▁ |
| Ash | 0 | 1 | 2.37 | 0.27 | 1.36 | 2.21 | 2.36 | 2.56 | 3.23 | ▁▂▇▅▁ |
| Ash_Alcalinity | 0 | 1 | 19.49 | 3.34 | 10.60 | 17.20 | 19.50 | 21.50 | 30.00 | ▁▆▇▃▁ |
| Magnesium | 0 | 1 | 99.74 | 14.28 | 70.00 | 88.00 | 98.00 | 107.00 | 162.00 | ▅▇▃▁▁ |
| Total_Phenols | 0 | 1 | 2.30 | 0.63 | 0.98 | 1.74 | 2.36 | 2.80 | 3.88 | ▅▇▇▇▁ |
| Flavanoids | 0 | 1 | 2.03 | 1.00 | 0.34 | 1.20 | 2.13 | 2.88 | 5.08 | ▆▆▇▂▁ |
| Nonflavanoid_Phenols | 0 | 1 | 0.36 | 0.12 | 0.13 | 0.27 | 0.34 | 0.44 | 0.66 | ▃▇▅▃▂ |
| Proanthocyanins | 0 | 1 | 1.59 | 0.57 | 0.41 | 1.25 | 1.56 | 1.95 | 3.58 | ▃▇▆▂▁ |
| Color_Intensity | 0 | 1 | 5.06 | 2.32 | 1.28 | 3.22 | 4.69 | 6.20 | 13.00 | ▆▇▃▂▁ |
| Hue | 0 | 1 | 0.96 | 0.23 | 0.48 | 0.78 | 0.96 | 1.12 | 1.71 | ▅▇▇▃▁ |
| OD280_OD315 | 0 | 1 | 2.61 | 0.71 | 1.27 | 1.94 | 2.78 | 3.17 | 4.00 | ▆▃▆▇▃ |
| Proline | 0 | 1 | 746.89 | 314.91 | 278.00 | 500.50 | 673.50 | 985.00 | 1680.00 | ▇▇▅▃▁ |
# Perform PCA on the wine data
wine.pca <- prcomp(wine.df[, -1])
summary(wine.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 314.9632 13.13527 3.07215 2.23409 1.10853 0.91710 0.5282
## Proportion of Variance 0.9981 0.00174 0.00009 0.00005 0.00001 0.00001 0.0000
## Cumulative Proportion 0.9981 0.99983 0.99992 0.99997 0.99998 0.99999 1.0000
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.3891 0.3348 0.2678 0.1938 0.1452 0.09057
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion 1.0000 1.0000 1.0000 1.0000 1.0000 1.00000
# Show the scores of each wine on the first 4 principal components
wine.pca$x[, 1:4]
## PC1 PC2 PC3 PC4
## [1,] -318.562979 -21.49213073 3.13073470 -0.25011376
## [2,] -303.097420 5.36471768 6.82283550 -0.86403475
## [3,] -438.061133 6.53730945 -1.11322298 0.91241068
## [4,] -733.240139 -0.19272903 -0.91725702 -0.54125065
## [5,] 11.571428 -18.48999461 -0.55442208 1.36089609
## [6,] -703.231192 0.33215867 0.94937533 -0.35999383
## [7,] -542.971581 13.51896664 2.12694283 0.05556614
## [8,] -548.401860 -11.44943238 0.04049242 1.34945380
## [9,] -298.036863 8.18015784 3.88097517 -0.91064304
## [10,] -298.049553 7.10154294 1.55845533 -1.67704202
## [11,] -763.079712 8.33431723 -1.88629037 1.59831835
## [12,] -532.943228 14.28763377 0.13033524 0.88222479
## [13,] -572.834410 21.00501433 0.37261486 0.30578060
## [14,] -402.925358 16.10263522 5.67513986 -1.48207065
## [15,] -800.053394 12.11844719 3.04652991 -1.74362051
## [16,] -563.245578 -2.21482866 -0.52551098 -0.66067725
## [17,] -533.379651 -10.80480223 -2.47652734 1.07715922
## [18,] -383.317591 -8.47741982 -1.98974501 0.38454435
## [19,] -933.118387 8.35447844 -1.93291276 -0.82650477
## [20,] -98.403177 -14.39265941 4.10374616 -0.93780103
## [21,] -33.593594 -25.56795650 4.03358615 -0.99367472
## [22,] -23.146438 -1.81747309 0.88716084 0.31629532
## [23,] -288.093030 3.96304175 2.05371381 1.45708782
## [24,] -267.981513 9.57070401 0.75841183 1.63579347
## [25,] -98.019886 5.49584606 -0.38729766 2.16549611
## [26,] -83.498744 -22.89162155 -4.27883653 3.82342832
## [27,] -447.925596 14.79733125 1.21983445 0.79472419
## [28,] -537.919165 15.38834607 0.16469233 1.90244700
## [29,] -168.210468 -4.25531096 -0.10338441 1.34711983
## [30,] -288.008247 8.95973155 2.16149821 0.27145553
## [31,] -538.026452 8.21273882 -5.16839628 2.47728974
## [32,] -768.092939 7.37989737 -3.22996241 1.01541462
## [33,] -243.150751 0.14391493 1.68080273 1.27761976
## [34,] -488.601280 -23.56532504 -1.17162642 1.66199059
## [35,] -348.231007 -4.03808015 -0.44767500 1.88763930
## [36,] -173.079957 2.79292165 -1.50979198 1.21655368
## [37,] -133.286424 -7.77272958 3.71926608 -0.21371279
## [38,] -358.018559 8.15798412 0.07320061 1.48986642
## [39,] -273.044432 6.72507431 3.03479237 0.96213541
## [40,] -13.643002 -27.84683215 6.64055073 -1.95399429
## [41,] -48.429642 -16.32194976 3.36121013 -1.31779970
## [42,] -287.884092 14.88511854 -0.80693653 1.18004484
## [43,] -348.095348 5.04342609 2.83896837 -0.27280950
## [44,] 66.814655 -4.38482992 2.36654298 -0.34343985
## [45,] -138.227010 -4.73068836 2.19316584 0.10699322
## [46,] -333.254806 -5.31933116 -0.78375813 0.54995943
## [47,] -318.111848 3.48694494 1.99385738 0.03531697
## [48,] -238.111485 3.05152100 2.22732714 -0.90436054
## [49,] -313.119915 2.30536316 -0.85590353 -0.08986194
## [50,] -513.187748 0.86375924 -0.96389277 -2.04393504
## [51,] -402.939017 15.05330285 4.25268054 -2.55794679
## [52,] -517.928963 14.99755489 -0.29543348 0.70374354
## [53,] -443.266411 -3.26050204 3.22533487 -1.60996988
## [54,] -628.286071 -4.04114538 -0.35793153 0.37006381
## [55,] -313.397613 -12.60891351 2.00469936 -0.37035604
## [56,] -373.333291 -9.66419863 -2.29204575 0.85049738
## [57,] -223.417106 -14.21689241 2.38786504 -1.11917463
## [58,] -523.072851 7.09662526 0.12928837 0.24875462
## [59,] -538.182097 1.35678013 0.23097851 -0.26525709
## [60,] 227.032690 7.99496797 9.94713134 -1.14211555
## [61,] 66.850122 -2.31986654 4.12689933 0.11626887
## [62,] 296.824437 -5.46847570 3.66175522 -2.57505965
## [63,] 116.970473 3.72638342 2.21999884 0.18292420
## [64,] 327.059126 6.94813081 2.08570803 -0.08646051
## [65,] 391.752486 -11.17167826 3.02474464 0.60831546
## [66,] 68.905760 0.55862054 1.89506922 0.08594298
## [67,] 245.212524 17.49363930 4.99323593 -1.99681057
## [68,] 237.241410 17.53102555 0.65601423 -0.09660360
## [69,] -3.296156 -10.09581160 3.21779225 0.79035284
## [70,] 27.966121 -51.63012519 4.99946128 1.51405246
## [71,] -123.130138 -1.05983451 -0.81102288 2.27011220
## [72,] 337.104122 7.62385512 -3.33557626 2.51476928
## [73,] 275.096026 7.75315213 -2.93117724 1.70431803
## [74,] -238.716757 -35.25182737 -9.00507574 6.46174472
## [75,] -139.094057 1.20747620 -1.19825468 3.10807878
## [76,] 318.876329 -2.81077113 5.00732155 -1.01830637
## [77,] 355.060253 7.53070183 4.80195138 -1.75352681
## [78,] 246.633597 -16.55843665 3.44015940 0.78140649
## [79,] -3.771690 -36.03109236 6.13187473 0.36672729
## [80,] 283.842248 -6.36186080 -1.30886277 2.51826726
## [81,] 469.059350 5.45892398 3.20131486 1.07273741
## [82,] 33.131339 13.18710496 0.89421990 0.92776508
## [83,] 117.290392 19.57861247 -3.39719663 3.77944343
## [84,] 232.132988 10.53010326 -2.43010911 -0.06967566
## [85,] 226.953299 1.77605879 3.09059588 1.06753180
## [86,] 296.852152 -4.45280309 3.64545501 1.06890802
## [87,] 252.046392 5.21261759 -1.45231083 2.64590054
## [88,] 185.108857 8.31858202 -4.81593129 3.99378098
## [89,] 67.180752 14.52469022 -1.47148703 2.55661956
## [90,] 122.431326 27.50403870 -3.08038596 4.04680260
## [91,] 267.185333 14.05389009 2.34355042 1.09184357
## [92,] 297.104415 8.41503161 -0.97271545 1.20491910
## [93,] 252.213885 15.25157025 -0.01055997 1.18178509
## [94,] 457.022215 3.69733793 4.01928491 0.79973656
## [95,] 401.851274 -5.33140022 3.84371981 0.24118777
## [96,] -191.183397 -58.79376238 2.36881154 3.28506528
## [97,] 121.279178 -36.41751489 0.32317103 2.63414973
## [98,] 319.088273 9.19307820 5.13148563 0.06072652
## [99,] 87.080293 10.22657269 1.53970742 0.55188385
## [100,] 341.040900 5.75056559 3.48018898 1.10072198
## [101,] 36.931615 2.17041593 2.59030958 1.01140983
## [102,] 185.073713 8.52051902 2.25097671 1.38163733
## [103,] 308.882387 -3.75165655 0.64573575 1.73494290
## [104,] 332.089296 7.88372875 2.04129847 1.69966262
## [105,] 75.153138 13.41382794 -0.26944061 2.18948833
## [106,] 432.009742 2.03294673 -0.01297855 1.57866101
## [107,] 237.206697 15.56196793 1.54724661 0.70650691
## [108,] 259.148368 11.07881635 -1.59997285 1.62990718
## [109,] 434.957780 0.05952603 3.09505248 0.88500532
## [110,] 66.990696 4.57145764 0.64900869 2.74788244
## [111,] 184.736439 -10.51646316 1.70289086 1.63245952
## [112,] 422.042874 4.24492194 1.07980318 1.92901901
## [113,] 139.818663 -5.73555207 0.53402919 0.90699791
## [114,] 313.060948 6.16709401 0.42958374 1.63877803
## [115,] 362.129633 9.25929540 -0.83904435 2.10006124
## [116,] 340.115006 8.68973443 0.34593057 2.85790723
## [117,] 252.109472 9.27587015 0.55770574 2.63672220
## [118,] 401.695973 -15.42589027 0.24785619 2.67763346
## [119,] 375.171092 13.18505025 4.60945469 -1.44279916
## [120,] 183.097929 9.55354432 1.90848316 2.48374537
## [121,] 121.945241 1.58889428 0.65563543 1.88171817
## [122,] 281.534569 -24.50826119 -6.48504587 2.42218920
## [123,] 381.826778 -9.18910976 -4.13334205 3.61261836
## [124,] 367.089380 7.18821589 -0.21865503 1.26717316
## [125,] 367.157175 11.19632022 0.36550887 1.53842878
## [126,] 369.103601 8.17580491 0.63984872 1.67682732
## [127,] 395.078836 6.67983187 0.06189456 0.80817650
## [128,] 281.030369 2.53586007 -6.78592530 4.08101907
## [129,] 405.061288 4.45107367 -2.20721536 3.22344065
## [130,] 167.237642 16.73118573 -1.73687914 2.09258255
## [131,] 116.473000 -24.26602006 2.78924950 -0.18758517
## [132,] 216.784540 -8.13044956 0.23992681 -1.34628911
## [133,] 186.915112 -1.71005736 -3.85614652 -0.15509302
## [134,] 146.769530 -8.92033411 -1.36448130 -0.32739567
## [135,] 97.132974 13.05828383 1.67585772 -1.63723441
## [136,] 51.982047 4.81229310 0.16073012 -2.79611640
## [137,] 27.097931 10.22999622 -1.83765584 0.51889957
## [138,] 231.951534 -0.53490741 -4.83149311 0.21186915
## [139,] 167.076246 8.75672311 -0.17737763 -1.91760509
## [140,] 156.868045 -4.16316249 -3.70307659 0.68875440
## [141,] 146.946490 1.09672896 -1.00960254 -0.06331136
## [142,] -32.904293 11.30536564 -1.34137698 -0.76466366
## [143,] 226.926210 -1.38827837 -2.96295367 0.55626903
## [144,] 197.004347 4.22842870 -0.12287404 -0.77868229
## [145,] -108.312677 -10.35133044 -0.48108543 -3.26865157
## [146,] -83.118263 -0.80686970 -1.79684297 0.96420008
## [147,] 332.195840 13.81329889 -0.17093584 -2.07373848
## [148,] 122.126092 11.46992586 -3.00064371 -2.88747351
## [149,] 97.019389 5.90644258 -2.98282629 -3.28558130
## [150,] 196.624807 -16.87311392 -2.25389963 -4.40408659
## [151,] 246.449199 -27.81009538 -3.75074929 -2.87467515
## [152,] 266.628788 -17.15023200 -2.68311926 -5.42609326
## [153,] 321.570741 -22.16434010 -4.42434279 -1.04671719
## [154,] 71.897659 0.39965640 -0.70775316 -5.99935436
## [155,] 106.816618 -5.20612307 -0.73329402 -2.68645834
## [156,] 22.020257 6.22943925 -3.86483282 -2.76113172
## [157,] 267.031732 5.92308914 -0.70328446 -5.20401306
## [158,] -132.999841 4.86770544 -8.75379841 0.23220542
## [159,] 86.913999 -0.07963578 -7.17131368 -5.68849896
## [160,] 127.061290 8.29141684 -4.57728089 -5.57063640
## [161,] 227.068836 7.61863030 -1.85508543 -3.12235391
## [162,] 66.755431 -8.47028990 -0.54913654 -1.43344098
## [163,] 176.765098 -9.47319000 -1.84287292 -0.69150257
## [164,] 71.768398 -7.51063784 1.01222585 -1.38291711
## [165,] 132.048248 7.25503031 -3.66332810 -4.29589844
## [166,] 227.077845 7.59337930 -3.10400630 -2.10397325
## [167,] 51.689506 -12.35808497 -4.74361265 -4.59492845
## [168,] 62.085011 10.55401530 -1.94213604 -5.45880165
## [169,] -3.182765 -5.39136150 -5.85723969 -2.00624823
## [170,] 116.674818 -14.53337035 -5.75495650 -2.28351323
## [171,] 236.921208 -0.46303700 0.97019374 -1.84720991
## [172,] 277.083578 8.74033191 -0.83969362 -5.73827405
## [173,] 87.027403 7.10459575 -1.96051582 -4.89097541
## [174,] 6.980211 4.54113657 -2.47470686 -3.15591973
## [175,] -3.131605 -2.33519051 -4.30993061 -1.56218098
## [176,] -88.458074 -18.77628456 -2.23757651 -4.82070817
## [177,] -93.456242 -18.67081912 -1.78839152 -3.70935193
## [178,] 186.943190 0.21333080 -5.63050984 -3.47298694
Summary of the dataset
The summary shows that the wine data contains no missing values and is complete. However, the variables are measured on very different scales. For example, Proline has values in the hundreds (mean ≈ 747, max = 1680), while others such as Nonflavanoid_Phenols are less than 1.
a. The data are in the file Wine.csv. Consider the rows labeled “Proportion of Variance.” Explain why the value for PC1 is so much greater than that of any other column.
Answer:
In the unstandardized PCA, PC1 has a proportion of variance of 0.9981, far higher than any other component. This is because the wine features are recorded on very different scales. PCA on raw data looks for the directions of greatest variance, and variance depends on the units of measurement. Proline has a standard deviation of about 315, while Ash and Hue have standard deviations below 0.3, so Proline's variance dwarfs that of every other variable. PC1 therefore lines up almost entirely with Proline (its standard deviation, 314.96, is nearly identical to Proline's 314.91) and captures almost all of the variation, making the other components seem unimportant. This does not mean that Proline explains everything about the wines by itself; it dominates only because of the size of its numbers.
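A rough way to see this without running PCA is to compare each variable's share of the total variance. The sketch below uses the rounded standard deviations from the skim() table, so the result is only approximate, and the names `wine_sd`, `wine_var`, and `var_share` are made up for the example.

```r
# Standard deviations copied (rounded) from the skim() table for wine.df
wine_sd <- c(Alcohol = 0.81, Malic_Acid = 1.12, Ash = 0.27,
             Ash_Alcalinity = 3.34, Magnesium = 14.28, Total_Phenols = 0.63,
             Flavanoids = 1.00, Nonflavanoid_Phenols = 0.12,
             Proanthocyanins = 0.57, Color_Intensity = 2.32, Hue = 0.23,
             OD280_OD315 = 0.71, Proline = 314.91)

# Variance of each variable and its share of the total variance
wine_var <- wine_sd^2
var_share <- wine_var / sum(wine_var)

# The three variables with the largest share
round(sort(var_share, decreasing = TRUE)[1:3], 4)
```

Proline alone accounts for roughly 99.8% of the total variance, which is close to the 0.9981 reported for PC1. Magnesium comes a distant second, in line with the small PC2.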
After Normalization
# standardize the data
data.std <- scale(wine.df[,-1])
# skim the standardized data
skimr::skim(as.data.frame(data.std))
| Name | as.data.frame(data.std) |
| Number of rows | 178 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Alcohol | 0 | 1 | 0 | 1 | -2.43 | -0.79 | 0.06 | 0.83 | 2.25 | ▂▇▇▇▃ |
| Malic_Acid | 0 | 1 | 0 | 1 | -1.43 | -0.66 | -0.42 | 0.67 | 3.10 | ▇▅▂▂▁ |
| Ash | 0 | 1 | 0 | 1 | -3.67 | -0.57 | -0.02 | 0.70 | 3.15 | ▁▂▇▅▁ |
| Ash_Alcalinity | 0 | 1 | 0 | 1 | -2.66 | -0.69 | 0.00 | 0.60 | 3.15 | ▁▆▇▃▁ |
| Magnesium | 0 | 1 | 0 | 1 | -2.08 | -0.82 | -0.12 | 0.51 | 4.36 | ▅▇▃▁▁ |
| Total_Phenols | 0 | 1 | 0 | 1 | -2.10 | -0.88 | 0.10 | 0.81 | 2.53 | ▅▇▇▇▁ |
| Flavanoids | 0 | 1 | 0 | 1 | -1.69 | -0.83 | 0.11 | 0.85 | 3.05 | ▆▆▇▂▁ |
| Nonflavanoid_Phenols | 0 | 1 | 0 | 1 | -1.86 | -0.74 | -0.18 | 0.61 | 2.40 | ▃▇▅▃▂ |
| Proanthocyanins | 0 | 1 | 0 | 1 | -2.06 | -0.60 | -0.06 | 0.63 | 3.48 | ▃▇▆▂▁ |
| Color_Intensity | 0 | 1 | 0 | 1 | -1.63 | -0.79 | -0.16 | 0.49 | 3.43 | ▆▇▃▂▁ |
| Hue | 0 | 1 | 0 | 1 | -2.09 | -0.77 | 0.03 | 0.71 | 3.29 | ▅▇▇▃▁ |
| OD280_OD315 | 0 | 1 | 0 | 1 | -1.89 | -0.95 | 0.24 | 0.79 | 1.96 | ▆▃▆▇▃ |
| Proline | 0 | 1 | 0 | 1 | -1.49 | -0.78 | -0.23 | 0.76 | 2.96 | ▇▇▅▃▁ |
# Perform PCA on the wine data after standardizing the data and show the summary
pcs.std <- prcomp(data.std)
summary(pcs.std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
# Show the loadings (variable weights) of the first 4 principal components
pcs.std$rotation[, 1:4]
## PC1 PC2 PC3 PC4
## Alcohol -0.144329395 -0.483651548 -0.20738262 -0.01785630
## Malic_Acid 0.245187580 -0.224930935 0.08901289 0.53689028
## Ash 0.002051061 -0.316068814 0.62622390 -0.21417556
## Ash_Alcalinity 0.239320405 0.010590502 0.61208035 0.06085941
## Magnesium -0.141992042 -0.299634003 0.13075693 -0.35179658
## Total_Phenols -0.394660845 -0.065039512 0.14617896 0.19806835
## Flavanoids -0.422934297 0.003359812 0.15068190 0.15229479
## Nonflavanoid_Phenols 0.298533103 -0.028779488 0.17036816 -0.20330102
## Proanthocyanins -0.313429488 -0.039301722 0.14945431 0.39905653
## Color_Intensity 0.088616705 -0.529995672 -0.13730621 0.06592568
## Hue -0.296714564 0.279235148 0.08522192 -0.42777141
## OD280_OD315 -0.376167411 0.164496193 0.16600459 0.18412074
## Proline -0.286752227 -0.364902832 -0.12674592 -0.23207086
# Compare variance explained
non_norm_var <- summary(wine.pca)$importance[2,]
norm_var <- summary(pcs.std)$importance[2,]
# Put into a table
variance_table <- data.frame(
  Component = paste0("PC", seq_along(non_norm_var)),
  Non_Normalized = round(non_norm_var, 4),
  Normalized = round(norm_var, 4),
  row.names = NULL
) # Creates a data frame to compare variance explained; row.names = NULL avoids a duplicate PC column
knitr::kable(variance_table, caption = "Proportion of Variance Explained: Non-normalized vs Normalized PCA")
| Component | Non_Normalized | Normalized |
|---|---|---|
| PC1 | 0.9981 | 0.3620 |
| PC2 | 0.0017 | 0.1921 |
| PC3 | 0.0001 | 0.1112 |
| PC4 | 0.0000 | 0.0707 |
| PC5 | 0.0000 | 0.0656 |
| PC6 | 0.0000 | 0.0494 |
| PC7 | 0.0000 | 0.0424 |
| PC8 | 0.0000 | 0.0268 |
| PC9 | 0.0000 | 0.0222 |
| PC10 | 0.0000 | 0.0193 |
| PC11 | 0.0000 | 0.0174 |
| PC12 | 0.0000 | 0.0130 |
| PC13 | 0.0000 | 0.0080 |
b. Comment on the use of normalization (standardization) in part (a).
Answer:
Normalization (standardization) rescales every variable to mean 0 and standard deviation 1, so that all variables are measured on the same scale. The PCA in part (a) was run on the raw data, which is why Proline took over the analysis. Once the data are standardized, a feature can no longer dominate just because its values are larger. The results are more balanced: PC1 now explains 36.2% of the variance instead of 99.8%, and the first four components together explain about 73.6%. Because the wine variables are measured in different units, standardizing before PCA is the appropriate choice here. To put it simply, normalization makes things fair so that each variable has an equal chance to show how important it is.
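Standardizing with scale() and then calling prcomp() should be equivalent to letting prcomp() standardize through its scale. argument. The sketch below illustrates this on USArrests, a dataset that ships with base R and is used here only as a stand-in for wine.df[, -1]. The names `pca_manual` and `pca_builtin` are made up for the example.

```r
# USArrests ships with base R; it stands in for wine.df[, -1] in this sketch
x <- USArrests

# Option 1: standardize first, then run PCA
pca_manual <- prcomp(scale(x))

# Option 2: let prcomp() center and scale the variables itself
pca_builtin <- prcomp(x, scale. = TRUE)

# Both routes should give the same component standard deviations
all.equal(pca_manual$sdev, pca_builtin$sdev)

# Proportion of variance explained by each component
round(pca_builtin$sdev^2 / sum(pca_builtin$sdev^2), 3)
```

For the wine data, the second form would read prcomp(wine.df[, -1], scale. = TRUE), which saves the separate scale() step.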