Business Data Mining

Question 1

This data was collected as part of a psychometric analysis investigating whether our customers will or will not buy our product and why. The outcome is Behavioural Intention (B_INT).

Determine how many dimensions are in the data.

Dimensions:

Dimensions are the number of variables or features in a dataset. In this case, each variable represents a different aspect of the psychometric analysis.

The number of dimensions is the number of columns in the dataset.

Here, I have used three base R functions to check the size of the dataset; the column count gives the number of dimensions.

ncol() - This function returns the number of columns in a data frame.
nrow() - This function returns the number of rows in a data frame.
dim() - This function returns the dimensions of the data frame as a vector, where the first element is the number of rows and the second element is the number of columns.
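As a quick illustration of how these three functions relate, the sketch below uses a small made-up data frame (toy is hypothetical, not the survey data) and checks that dim() is simply nrow() and ncol() combined.

```r
# Hypothetical toy data frame, used only to illustrate the three functions
toy <- data.frame(id = 1:3,
                  score = c(4.5, 3.0, 5.5),
                  group = c("a", "b", "a"))

n_rows <- nrow(toy)   # number of observations
n_cols <- ncol(toy)   # number of variables (dimensions)
size   <- dim(toy)    # rows first, then columns

print(size)
# dim() carries the same information as nrow() and ncol() together
print(identical(size, c(n_rows, n_cols)))
```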

Finally, I used the skim() function from the skimr package to get a summary of the dataset, which includes the number of rows, columns, missing values, and basic statistics for each column.

# Load data (read.csv is base R, so the readr package is not needed)
utaut_data <- read.csv("utaut_data.csv")


print(paste("Columns: ", ncol(utaut_data)))   # number of columns

## [1] "Columns:  66"

print(paste("Rows: ",nrow(utaut_data)))   # number of rows

## [1] "Rows:  216"

#dimensions of the dataset
dim(utaut_data)   # rows and columns

## [1] 216  66

# skim the dataset
library(skimr)
# skim() summarises the data: number of rows and columns, missing values,
# and basic statistics for each column.
skim(utaut_data)

Data summary
Name	utaut_data
Number of rows	216
Number of columns	66
_______________________
Column type frequency:
character	1
numeric	65
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
qualitative	0	1	0	654	4	212	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ANX1	1	2.40	1.37	1	1.00	2	3.00	7	▇▂▁▁▁
ANX2	1	2.19	1.34	1	1.00	2	3.00	7	▇▂▁▁▁
ANX3	1	2.04	1.23	1	1.00	2	2.00	7	▇▁▁▁▁
ANX4	1	1.98	1.13	1	1.00	2	2.00	6	▇▁▁▁▁
ATTU1	1	5.86	0.97	2	5.00	6	6.00	7	▁▁▃▇▃
ATTU2	1	5.87	1.08	2	5.00	6	7.00	7	▁▁▃▇▆
ATTU3	1	5.46	1.22	2	5.00	6	6.00	7	▂▃▇▇▆
ATTU4	1	5.55	1.20	1	5.00	6	6.00	7	▁▁▂▃▇
DEMO1	1	5.69	1.05	2	5.00	6	6.00	7	▁▂▅▇▅
DEMO2	1	5.64	1.23	1	5.00	6	6.00	7	▁▁▁▃▇
DEMO3	1	5.55	1.27	1	5.00	6	6.00	7	▁▁▂▂▇
DEMO4	1	5.63	1.15	2	5.00	6	6.00	7	▁▂▅▇▅
FACL1	1	5.91	0.95	2	6.00	6	7.00	7	▁▁▂▇▅
FACL2	1	5.96	0.89	3	6.00	6	7.00	7	▁▁▃▇▅
FACL3	1	5.02	1.25	1	4.00	5	6.00	7	▁▁▇▅▇
FACL4	1	5.31	1.26	1	5.00	5	6.00	7	▁▁▃▅▇
HAB1	1	4.69	1.42	1	4.00	5	6.00	7	▂▂▆▆▇
HAB2	1	3.42	1.74	1	2.00	3	5.00	7	▇▅▃▂▃
HAB3	1	3.47	1.75	1	2.00	3	5.00	7	▇▃▃▃▃
HAB4	1	5.01	1.36	1	4.00	5	6.00	7	▁▂▃▇▇
IMAG1	1	3.17	1.41	1	2.00	3	4.00	7	▇▂▆▂▂
IMAG2	1	3.37	1.43	1	2.00	4	4.00	7	▇▃▇▂▂
IMAG3	1	2.97	1.47	1	2.00	3	4.00	7	▇▂▅▁▁
INFL3	1	3.69	1.44	1	2.00	4	4.00	7	▅▂▇▃▂
INNOV1	1	5.60	1.10	1	5.00	6	6.00	7	▁▁▁▃▇
INNOV2	1	5.64	1.17	1	5.00	6	6.00	7	▁▁▁▃▇
INNOV3	1	4.55	1.57	1	3.00	5	6.00	7	▃▅▅▆▇
INT1	1	3.77	1.65	1	2.00	4	5.00	7	▇▅▇▅▅
INT2	1	3.72	1.73	1	2.00	4	5.00	7	▇▃▅▅▅
INT3	1	3.71	1.72	1	2.00	4	5.00	7	▇▃▆▅▅
MOTIV1	1	5.34	1.22	2	5.00	5	6.00	7	▂▃▇▇▅
MOTIV2	1	5.53	1.13	2	5.00	6	6.00	7	▁▂▅▇▃
MOTIV3	1	5.08	1.26	2	4.00	5	6.00	7	▃▃▇▆▃
NORM1	1	3.75	1.35	1	3.00	4	4.00	7	▃▂▇▂▂
NORM2	1	3.72	1.43	1	3.00	4	4.00	7	▅▂▇▂▂
PEOU1	1	5.90	0.99	3	5.00	6	7.00	7	▁▁▃▇▅
PEOU2	1	5.80	1.03	2	5.00	6	6.25	7	▁▁▅▇▅
PEOU3	1	5.88	1.06	1	5.00	6	7.00	7	▁▁▁▂▇
PEOU4	1	5.76	1.10	1	5.00	6	7.00	7	▁▁▁▃▇
PEOU5	1	5.82	1.09	2	5.00	6	7.00	7	▁▂▃▇▆
PEOU6	1	5.92	0.96	2	5.00	6	7.00	7	▁▁▃▇▅
PERF1	1	5.26	1.45	1	5.00	6	6.00	7	▁▁▂▃▇
PERF2	1	5.05	1.34	1	4.00	5	6.00	7	▁▂▃▆▇
PERF3	1	5.16	1.40	1	4.00	5	6.00	7	▁▁▂▅▇
PERF4	1	4.95	1.40	1	4.00	5	6.00	7	▂▁▅▆▇
QUAL1	1	5.49	1.04	1	5.00	6	6.00	7	▁▁▂▃▇
QUAL2	1	5.70	1.04	1	5.00	6	6.00	7	▁▁▁▂▇
RELEV1	1	4.01	1.60	1	3.00	4	5.00	7	▆▅▇▇▅
RELEV2	1	4.74	1.61	1	4.00	5	6.00	7	▃▁▃▆▇
SEFF1	1	5.89	0.97	3	5.00	6	7.00	7	▁▁▃▇▅
SEFF2	1	5.77	1.23	2	5.00	6	7.00	7	▁▃▃▇▇
SEFF3	1	5.90	1.08	2	5.00	6	7.00	7	▁▂▃▇▇
SEFF4	1	5.88	1.04	3	5.00	6	7.00	7	▁▂▃▇▆
USEFUL2	1	5.00	1.45	1	4.00	5	6.00	7	▂▂▂▇▇
USEFUL4	1	5.07	1.39	1	4.00	5	6.00	7	▂▁▃▇▇
USEFUL5	1	5.16	1.37	1	5.00	5	6.00	7	▁▁▂▆▇
VALUE1	1	4.58	1.50	1	3.75	5	6.00	7	▃▃▃▇▇
VALUE2	1	4.48	1.46	1	4.00	5	6.00	7	▃▃▆▇▇
VALUE3	1	4.59	1.41	1	4.00	5	6.00	7	▂▂▅▇▇
VOL1	1	6.17	1.03	1	6.00	6	7.00	7	▁▁▁▁▇
VOL2	1	6.20	1.23	1	6.00	7	7.00	7	▁▁▁▁▇
VOL3	1	6.19	1.14	2	6.00	7	7.00	7	▁▁▁▅▇
age	1	2.25	0.95	1	2.00	2	3.00	5	▃▇▃▂▁
Experience	1	4.51	1.14	1	4.00	5	5.00	6	▁▁▂▇▁
gender	1	1.44	0.50	1	1.00	1	2.00	2	▇▁▁▁▆

Summary of Dataset:

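- The dataset has 216 rows (observations) and 66 columns (dimensions).

- Of these 66 columns, 65 are numeric and 1 (qualitative) is a character column. With 212 unique values and entries of up to 654 characters, it appears to hold free-text responses rather than categories.

- Most of the numeric columns are survey items that capture different aspects of how customers think and act. The last three (age, Experience, gender) look like coded respondent attributes.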

In conclusion:

Each column in the dataset is a dimension because it records a different part of the survey. The item names suggest that several columns measure the same underlying factor (for example, ANX1 to ANX4 or PEOU1 to PEOU6), so dimensions can also be read at the factor level, such as effort expectations, social influence, or facilitating conditions. The large number of numeric variables shows that the survey covered many psychological and behavioural factors that could help explain whether customers plan to buy the product.
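One way to see how items group into broader factors is to strip the trailing digits from the column names. The sketch below does this for a handful of names copied from the skim output. It assumes, without confirmation from the survey documentation, that items sharing a prefix belong to the same construct.

```r
# A few item names copied from the skim output (not the full dataset)
item_names <- c("ANX1", "ANX2", "ANX3", "ANX4",
                "ATTU1", "ATTU2", "ATTU3", "ATTU4",
                "NORM1", "NORM2",
                "VOL1", "VOL2", "VOL3")

# Assumption: the prefix before the trailing digits names the construct
construct <- sub("[0-9]+$", "", item_names)

items_per_construct <- table(construct)   # items counted per prefix
n_constructs <- length(unique(construct)) # number of distinct prefixes

print(items_per_construct)
print(n_constructs)
```

On the full dataset the same idea could be applied to names(utaut_data), keeping in mind that columns without a numeric suffix, such as age or gender, would each form their own group.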

Question 2

Normalize the data in Table 2.17, showing calculations.

Table 2.17

Age	Income ($)
25	49,000
56	156,000
65	99,000
32	192,000
41	39,000
49	57,000

Normalization: Normalization is the process of rescaling data so that all variables are on a comparable scale, which prevents variables with large ranges or units from dominating the analysis.

Types of Normalization

Min-Max Normalization: Min–max puts values into a fixed 0–1 scale, useful when we want proportional comparisons.

The formula for min-max normalization is:

\[X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}\]
Where:

$X$ is the original value
$X_{min}$ is the minimum value in the dataset
$X_{max}$ is the maximum value in the dataset
$X_{norm}$ is the normalized value

Z-Score Normalization: Z-score normalization rescales values so that the mean becomes 0 and the standard deviation becomes 1.

The formula for z-score normalization is:

\[Z = \frac{X - \mu}{\sigma}\]
Where:

$X$ is the original value
$\mu$ is the mean of the dataset
$\sigma$ is the standard deviation of the dataset (in R, sd() computes the sample standard deviation with an n - 1 denominator)
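As a worked check of both formulas, take the second record of Table 2.17 (Age = 56, Income = 156,000). The Age column has minimum 25, maximum 65, mean of about 44.667, and sample standard deviation of about 14.976; the Income column has minimum 39,000 and maximum 192,000.

\[ Age_{norm} = \frac{56 - 25}{65 - 25} = \frac{31}{40} = 0.775 \]

\[ Income_{norm} = \frac{156{,}000 - 39{,}000}{192{,}000 - 39{,}000} = \frac{117{,}000}{153{,}000} \approx 0.765 \]

\[ Z_{Age} = \frac{56 - 44.667}{14.976} \approx 0.757 \]

The other records follow the same steps with their own values of $X$.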

# Function for min-max normalisation
min_max_normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

# Build the data frame from Table 2.17
IncomeAge <- data.frame(
  Age = c(25, 56, 65, 32, 41, 49),
  Income = c(49000, 156000, 99000, 192000, 39000, 57000))
rownames(IncomeAge) <- paste0("R", 1:6)

# Normalize the Age and Income columns
IncomeAge$Normalized_Age <- min_max_normalize(IncomeAge$Age)
IncomeAge$Normalized_Income <- min_max_normalize(IncomeAge$Income)
IncomeAge

##    Age Income Normalized_Age Normalized_Income
## R1  25  49000          0.000        0.06535948
## R2  56 156000          0.775        0.76470588
## R3  65  99000          1.000        0.39215686
## R4  32 192000          0.175        1.00000000
## R5  41  39000          0.400        0.00000000
## R6  49  57000          0.600        0.11764706

# Z-score normalization function
z_score_normalize <- function(x) {
  return ((x - mean(x)) / sd(x))
}
# Apply z-score normalization to the columns
IncomeAge$Z_Score_Age <- z_score_normalize(IncomeAge$Age)
IncomeAge$Z_Score_Income <- z_score_normalize(IncomeAge$Income)
IncomeAge

##    Age Income Normalized_Age Normalized_Income Z_Score_Age Z_Score_Income
## R1  25  49000          0.000        0.06535948  -1.3132530   -0.790026927
## R2  56 156000          0.775        0.76470588   0.7567898    0.911977392
## R3  65  99000          1.000        0.39215686   1.3577700    0.005302194
## R4  32 192000          0.175        1.00000000  -0.8458239    1.484614360
## R5  41  39000          0.400        0.00000000  -0.2448438   -0.949092751
## R6  49  57000          0.600        0.11764706   0.2893608   -0.662774268

Summary:

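After min-max normalization, all values for Age and Income are on the same scale (0 to 1).

- For Age, the youngest person (25) becomes 0, and the oldest (65) becomes 1.

- For Income, the lowest ($39,000) becomes 0, and the highest ($192,000) becomes 1.

- Every other value is mapped proportionally between 0 and 1.

After z-score normalization, each column has mean 0 and standard deviation 1, so each value is expressed as the number of standard deviations above or below its column mean.

Either way, neither Age nor Income dominates simply because of its units.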
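The properties claimed for the two methods can be checked directly. The sketch below recomputes both rescalings for the Age column of Table 2.17 and inspects the resulting range, mean, and standard deviation; the variable names are chosen here for illustration only.

```r
age <- c(25, 56, 65, 32, 41, 49)   # Age column of Table 2.17

# Min-max: smallest value maps to 0, largest to 1
min_max <- (age - min(age)) / (max(age) - min(age))

# Z-score: centre on the mean, divide by the sample standard deviation
z_score <- (age - mean(age)) / sd(age)

print(range(min_max))             # lower and upper end of the rescaled values
print(round(mean(z_score), 10))   # mean after z-scoring
print(sd(z_score))                # standard deviation after z-scoring
```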

Question 3

Statistical distance between records can be measured in several ways.
Consider Euclidean distance, measured as the square root of the sum of the squared differences.

For the first two records in Table 2.17, it is:

\[ d = \sqrt{(25 - 56)^2 + (49{,}000 - 156{,}000)^2} \]

Can normalizing the data change which two records are farthest from each other in terms of Euclidean distance?

# Function to calculate Euclidean distance
euclidean_distance <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}
# Calculate Euclidean distance between the first two records in the original data
record1 <- c(IncomeAge$Age[1], IncomeAge$Income[1])
record2 <- c(IncomeAge$Age[2], IncomeAge$Income[2])
distance_original <- euclidean_distance(record1, record2)
distance_original

## [1] 107000

# Find the farthest records
find_farthest_records <- function(data) {
  max_distance <- 0
  record_pair <- c(NA, NA)
  for (i in 1:(nrow(data)-1)) {
    for (j in (i+1):nrow(data)) {
      dist <- euclidean_distance(c(data$Age[i], data$Income[i]),
                                 c(data$Age[j], data$Income[j]))
      if (dist > max_distance) {
        max_distance <- dist
        record_pair <- c(i, j)
      }
    }
  }
  list(pair = record_pair, distance = max_distance)
}
# Find farthest records in original data
farthest_original <- find_farthest_records(IncomeAge)
print(paste("Farthest records in original data are R", farthest_original$pair[1], 
      "and R", farthest_original$pair[2], "with a distance of", 
      farthest_original$distance))

## [1] "Farthest records in original data are R 4 and R 5 with a distance of 153000.000264706"

# Find farthest records in normalized data
farthest_normalized <- find_farthest_records(data.frame(
  Age = IncomeAge$Normalized_Age, Income = IncomeAge$Normalized_Income))
print(paste("Farthest records in normalized data are R", farthest_normalized$pair[1], 
      "and R", farthest_normalized$pair[2], "with a distance of", 
      farthest_normalized$distance))

## [1] "Farthest records in normalized data are R 1 and R 3 with a distance of 1.05204397781113"

Answer:

Yes, normalizing the data can change which two records are farthest from each other in terms of Euclidean distance. In the original data, the distance is driven almost entirely by Income because its scale is far larger than that of Age, so the farthest pair is R4 and R5 (distance of about 153,000), the records with the highest and lowest income. After min-max normalization, both variables are on the same scale (0 to 1), which lets Age contribute on an equal footing. The farthest pair then becomes R1 and R3 (distance of about 1.052), the youngest and oldest records.
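The same comparison can be made more compactly with base R's dist(), which computes all pairwise Euclidean distances at once. This is a sketch of an alternative to the nested loop, not a replacement for it; the helper name farthest_pair is introduced here for illustration.

```r
age    <- c(25, 56, 65, 32, 41, 49)
income <- c(49000, 156000, 99000, 192000, 39000, 57000)

min_max <- function(x) (x - min(x)) / (max(x) - min(x))

raw    <- cbind(Age = age, Income = income)
scaled <- cbind(Age = min_max(age), Income = min_max(income))
rownames(raw)    <- paste0("R", 1:6)
rownames(scaled) <- paste0("R", 1:6)

# Hypothetical helper: labels of the two rows with the largest distance
farthest_pair <- function(m) {
  d   <- as.matrix(dist(m))                       # Euclidean by default
  idx <- which(d == max(d), arr.ind = TRUE)[1, ]  # first (row, col) hit
  sort(rownames(d)[idx])
}

print(farthest_pair(raw))      # pair on the original scale
print(farthest_pair(scaled))   # pair after min-max normalization
```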

Question 4

The data are in the file Wine.csv. Perform PCA on the data, showing the first four principal components.

# Load the mlba package
if (!require(mlba)) {
  library(devtools)
  install_github("gedeck/mlba/mlba", force=TRUE)
}

## Loading required package: mlba

## Loading required package: caret

## Loading required package: ggplot2

## Loading required package: lattice

## Loading required package: forecast

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

# Load Wine data from mlba package
data(Wine, package = "mlba")
wine.df <- Wine

# skim the data
skimr::skim(wine.df)

Data summary
Name	wine.df
Number of rows	178
Number of columns	14
_______________________
Column type frequency:
character	1
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Type	0	1	1	1	0	3	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Alcohol	1	13.00	0.81	11.03	12.36	13.05	13.68	14.83	▂▇▇▇▃
Malic_Acid	1	2.34	1.12	0.74	1.60	1.87	3.08	5.80	▇▅▂▂▁
Ash	1	2.37	0.27	1.36	2.21	2.36	2.56	3.23	▁▂▇▅▁
Ash_Alcalinity	1	19.49	3.34	10.60	17.20	19.50	21.50	30.00	▁▆▇▃▁
Magnesium	1	99.74	14.28	70.00	88.00	98.00	107.00	162.00	▅▇▃▁▁
Total_Phenols	1	2.30	0.63	0.98	1.74	2.36	2.80	3.88	▅▇▇▇▁
Flavanoids	1	2.03	1.00	0.34	1.20	2.13	2.88	5.08	▆▆▇▂▁
Nonflavanoid_Phenols	1	0.36	0.12	0.13	0.27	0.34	0.44	0.66	▃▇▅▃▂
Proanthocyanins	1	1.59	0.57	0.41	1.25	1.56	1.95	3.58	▃▇▆▂▁
Color_Intensity	1	5.06	2.32	1.28	3.22	4.69	6.20	13.00	▆▇▃▂▁
Hue	1	0.96	0.23	0.48	0.78	0.96	1.12	1.71	▅▇▇▃▁
OD280_OD315	1	2.61	0.71	1.27	1.94	2.78	3.17	4.00	▆▃▆▇▃
Proline	1	746.89	314.91	278.00	500.50	673.50	985.00	1680.00	▇▇▅▃▁

# Perform PCA on the wine data
wine.pca <- prcomp(wine.df[, -1])
summary(wine.pca)

## Importance of components:
##                             PC1      PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     314.9632 13.13527 3.07215 2.23409 1.10853 0.91710 0.5282
## Proportion of Variance   0.9981  0.00174 0.00009 0.00005 0.00001 0.00001 0.0000
## Cumulative Proportion    0.9981  0.99983 0.99992 0.99997 0.99998 0.99999 1.0000
##                           PC8    PC9   PC10   PC11   PC12    PC13
## Standard deviation     0.3891 0.3348 0.2678 0.1938 0.1452 0.09057
## Proportion of Variance 0.0000 0.0000 0.0000 0.0000 0.0000 0.00000
## Cumulative Proportion  1.0000 1.0000 1.0000 1.0000 1.0000 1.00000

# Show the scores on the first 4 principal components (one row per wine)
wine.pca$x[, 1:4]

##                PC1          PC2         PC3         PC4
##   [1,] -318.562979 -21.49213073  3.13073470 -0.25011376
##   [2,] -303.097420   5.36471768  6.82283550 -0.86403475
##   [3,] -438.061133   6.53730945 -1.11322298  0.91241068
##   [4,] -733.240139  -0.19272903 -0.91725702 -0.54125065
##   [5,]   11.571428 -18.48999461 -0.55442208  1.36089609
##   [6,] -703.231192   0.33215867  0.94937533 -0.35999383
##   [7,] -542.971581  13.51896664  2.12694283  0.05556614
##   [8,] -548.401860 -11.44943238  0.04049242  1.34945380
##   [9,] -298.036863   8.18015784  3.88097517 -0.91064304
##  [10,] -298.049553   7.10154294  1.55845533 -1.67704202
##  [11,] -763.079712   8.33431723 -1.88629037  1.59831835
##  [12,] -532.943228  14.28763377  0.13033524  0.88222479
##  [13,] -572.834410  21.00501433  0.37261486  0.30578060
##  [14,] -402.925358  16.10263522  5.67513986 -1.48207065
##  [15,] -800.053394  12.11844719  3.04652991 -1.74362051
##  [16,] -563.245578  -2.21482866 -0.52551098 -0.66067725
##  [17,] -533.379651 -10.80480223 -2.47652734  1.07715922
##  [18,] -383.317591  -8.47741982 -1.98974501  0.38454435
##  [19,] -933.118387   8.35447844 -1.93291276 -0.82650477
##  [20,]  -98.403177 -14.39265941  4.10374616 -0.93780103
##  [21,]  -33.593594 -25.56795650  4.03358615 -0.99367472
##  [22,]  -23.146438  -1.81747309  0.88716084  0.31629532
##  [23,] -288.093030   3.96304175  2.05371381  1.45708782
##  [24,] -267.981513   9.57070401  0.75841183  1.63579347
##  [25,]  -98.019886   5.49584606 -0.38729766  2.16549611
##  [26,]  -83.498744 -22.89162155 -4.27883653  3.82342832
##  [27,] -447.925596  14.79733125  1.21983445  0.79472419
##  [28,] -537.919165  15.38834607  0.16469233  1.90244700
##  [29,] -168.210468  -4.25531096 -0.10338441  1.34711983
##  [30,] -288.008247   8.95973155  2.16149821  0.27145553
##  [31,] -538.026452   8.21273882 -5.16839628  2.47728974
##  [32,] -768.092939   7.37989737 -3.22996241  1.01541462
##  [33,] -243.150751   0.14391493  1.68080273  1.27761976
##  [34,] -488.601280 -23.56532504 -1.17162642  1.66199059
##  [35,] -348.231007  -4.03808015 -0.44767500  1.88763930
##  [36,] -173.079957   2.79292165 -1.50979198  1.21655368
##  [37,] -133.286424  -7.77272958  3.71926608 -0.21371279
##  [38,] -358.018559   8.15798412  0.07320061  1.48986642
##  [39,] -273.044432   6.72507431  3.03479237  0.96213541
##  [40,]  -13.643002 -27.84683215  6.64055073 -1.95399429
##  [41,]  -48.429642 -16.32194976  3.36121013 -1.31779970
##  [42,] -287.884092  14.88511854 -0.80693653  1.18004484
##  [43,] -348.095348   5.04342609  2.83896837 -0.27280950
##  [44,]   66.814655  -4.38482992  2.36654298 -0.34343985
##  [45,] -138.227010  -4.73068836  2.19316584  0.10699322
##  [46,] -333.254806  -5.31933116 -0.78375813  0.54995943
##  [47,] -318.111848   3.48694494  1.99385738  0.03531697
##  [48,] -238.111485   3.05152100  2.22732714 -0.90436054
##  [49,] -313.119915   2.30536316 -0.85590353 -0.08986194
##  [50,] -513.187748   0.86375924 -0.96389277 -2.04393504
##  [51,] -402.939017  15.05330285  4.25268054 -2.55794679
##  [52,] -517.928963  14.99755489 -0.29543348  0.70374354
##  [53,] -443.266411  -3.26050204  3.22533487 -1.60996988
##  [54,] -628.286071  -4.04114538 -0.35793153  0.37006381
##  [55,] -313.397613 -12.60891351  2.00469936 -0.37035604
##  [56,] -373.333291  -9.66419863 -2.29204575  0.85049738
##  [57,] -223.417106 -14.21689241  2.38786504 -1.11917463
##  [58,] -523.072851   7.09662526  0.12928837  0.24875462
##  [59,] -538.182097   1.35678013  0.23097851 -0.26525709
##  [60,]  227.032690   7.99496797  9.94713134 -1.14211555
##  [61,]   66.850122  -2.31986654  4.12689933  0.11626887
##  [62,]  296.824437  -5.46847570  3.66175522 -2.57505965
##  [63,]  116.970473   3.72638342  2.21999884  0.18292420
##  [64,]  327.059126   6.94813081  2.08570803 -0.08646051
##  [65,]  391.752486 -11.17167826  3.02474464  0.60831546
##  [66,]   68.905760   0.55862054  1.89506922  0.08594298
##  [67,]  245.212524  17.49363930  4.99323593 -1.99681057
##  [68,]  237.241410  17.53102555  0.65601423 -0.09660360
##  [69,]   -3.296156 -10.09581160  3.21779225  0.79035284
##  [70,]   27.966121 -51.63012519  4.99946128  1.51405246
##  [71,] -123.130138  -1.05983451 -0.81102288  2.27011220
##  [72,]  337.104122   7.62385512 -3.33557626  2.51476928
##  [73,]  275.096026   7.75315213 -2.93117724  1.70431803
##  [74,] -238.716757 -35.25182737 -9.00507574  6.46174472
##  [75,] -139.094057   1.20747620 -1.19825468  3.10807878
##  [76,]  318.876329  -2.81077113  5.00732155 -1.01830637
##  [77,]  355.060253   7.53070183  4.80195138 -1.75352681
##  [78,]  246.633597 -16.55843665  3.44015940  0.78140649
##  [79,]   -3.771690 -36.03109236  6.13187473  0.36672729
##  [80,]  283.842248  -6.36186080 -1.30886277  2.51826726
##  [81,]  469.059350   5.45892398  3.20131486  1.07273741
##  [82,]   33.131339  13.18710496  0.89421990  0.92776508
##  [83,]  117.290392  19.57861247 -3.39719663  3.77944343
##  [84,]  232.132988  10.53010326 -2.43010911 -0.06967566
##  [85,]  226.953299   1.77605879  3.09059588  1.06753180
##  [86,]  296.852152  -4.45280309  3.64545501  1.06890802
##  [87,]  252.046392   5.21261759 -1.45231083  2.64590054
##  [88,]  185.108857   8.31858202 -4.81593129  3.99378098
##  [89,]   67.180752  14.52469022 -1.47148703  2.55661956
##  [90,]  122.431326  27.50403870 -3.08038596  4.04680260
##  [91,]  267.185333  14.05389009  2.34355042  1.09184357
##  [92,]  297.104415   8.41503161 -0.97271545  1.20491910
##  [93,]  252.213885  15.25157025 -0.01055997  1.18178509
##  [94,]  457.022215   3.69733793  4.01928491  0.79973656
##  [95,]  401.851274  -5.33140022  3.84371981  0.24118777
##  [96,] -191.183397 -58.79376238  2.36881154  3.28506528
##  [97,]  121.279178 -36.41751489  0.32317103  2.63414973
##  [98,]  319.088273   9.19307820  5.13148563  0.06072652
##  [99,]   87.080293  10.22657269  1.53970742  0.55188385
## [100,]  341.040900   5.75056559  3.48018898  1.10072198
## [101,]   36.931615   2.17041593  2.59030958  1.01140983
## [102,]  185.073713   8.52051902  2.25097671  1.38163733
## [103,]  308.882387  -3.75165655  0.64573575  1.73494290
## [104,]  332.089296   7.88372875  2.04129847  1.69966262
## [105,]   75.153138  13.41382794 -0.26944061  2.18948833
## [106,]  432.009742   2.03294673 -0.01297855  1.57866101
## [107,]  237.206697  15.56196793  1.54724661  0.70650691
## [108,]  259.148368  11.07881635 -1.59997285  1.62990718
## [109,]  434.957780   0.05952603  3.09505248  0.88500532
## [110,]   66.990696   4.57145764  0.64900869  2.74788244
## [111,]  184.736439 -10.51646316  1.70289086  1.63245952
## [112,]  422.042874   4.24492194  1.07980318  1.92901901
## [113,]  139.818663  -5.73555207  0.53402919  0.90699791
## [114,]  313.060948   6.16709401  0.42958374  1.63877803
## [115,]  362.129633   9.25929540 -0.83904435  2.10006124
## [116,]  340.115006   8.68973443  0.34593057  2.85790723
## [117,]  252.109472   9.27587015  0.55770574  2.63672220
## [118,]  401.695973 -15.42589027  0.24785619  2.67763346
## [119,]  375.171092  13.18505025  4.60945469 -1.44279916
## [120,]  183.097929   9.55354432  1.90848316  2.48374537
## [121,]  121.945241   1.58889428  0.65563543  1.88171817
## [122,]  281.534569 -24.50826119 -6.48504587  2.42218920
## [123,]  381.826778  -9.18910976 -4.13334205  3.61261836
## [124,]  367.089380   7.18821589 -0.21865503  1.26717316
## [125,]  367.157175  11.19632022  0.36550887  1.53842878
## [126,]  369.103601   8.17580491  0.63984872  1.67682732
## [127,]  395.078836   6.67983187  0.06189456  0.80817650
## [128,]  281.030369   2.53586007 -6.78592530  4.08101907
## [129,]  405.061288   4.45107367 -2.20721536  3.22344065
## [130,]  167.237642  16.73118573 -1.73687914  2.09258255
## [131,]  116.473000 -24.26602006  2.78924950 -0.18758517
## [132,]  216.784540  -8.13044956  0.23992681 -1.34628911
## [133,]  186.915112  -1.71005736 -3.85614652 -0.15509302
## [134,]  146.769530  -8.92033411 -1.36448130 -0.32739567
## [135,]   97.132974  13.05828383  1.67585772 -1.63723441
## [136,]   51.982047   4.81229310  0.16073012 -2.79611640
## [137,]   27.097931  10.22999622 -1.83765584  0.51889957
## [138,]  231.951534  -0.53490741 -4.83149311  0.21186915
## [139,]  167.076246   8.75672311 -0.17737763 -1.91760509
## [140,]  156.868045  -4.16316249 -3.70307659  0.68875440
## [141,]  146.946490   1.09672896 -1.00960254 -0.06331136
## [142,]  -32.904293  11.30536564 -1.34137698 -0.76466366
## [143,]  226.926210  -1.38827837 -2.96295367  0.55626903
## [144,]  197.004347   4.22842870 -0.12287404 -0.77868229
## [145,] -108.312677 -10.35133044 -0.48108543 -3.26865157
## [146,]  -83.118263  -0.80686970 -1.79684297  0.96420008
## [147,]  332.195840  13.81329889 -0.17093584 -2.07373848
## [148,]  122.126092  11.46992586 -3.00064371 -2.88747351
## [149,]   97.019389   5.90644258 -2.98282629 -3.28558130
## [150,]  196.624807 -16.87311392 -2.25389963 -4.40408659
## [151,]  246.449199 -27.81009538 -3.75074929 -2.87467515
## [152,]  266.628788 -17.15023200 -2.68311926 -5.42609326
## [153,]  321.570741 -22.16434010 -4.42434279 -1.04671719
## [154,]   71.897659   0.39965640 -0.70775316 -5.99935436
## [155,]  106.816618  -5.20612307 -0.73329402 -2.68645834
## [156,]   22.020257   6.22943925 -3.86483282 -2.76113172
## [157,]  267.031732   5.92308914 -0.70328446 -5.20401306
## [158,] -132.999841   4.86770544 -8.75379841  0.23220542
## [159,]   86.913999  -0.07963578 -7.17131368 -5.68849896
## [160,]  127.061290   8.29141684 -4.57728089 -5.57063640
## [161,]  227.068836   7.61863030 -1.85508543 -3.12235391
## [162,]   66.755431  -8.47028990 -0.54913654 -1.43344098
## [163,]  176.765098  -9.47319000 -1.84287292 -0.69150257
## [164,]   71.768398  -7.51063784  1.01222585 -1.38291711
## [165,]  132.048248   7.25503031 -3.66332810 -4.29589844
## [166,]  227.077845   7.59337930 -3.10400630 -2.10397325
## [167,]   51.689506 -12.35808497 -4.74361265 -4.59492845
## [168,]   62.085011  10.55401530 -1.94213604 -5.45880165
## [169,]   -3.182765  -5.39136150 -5.85723969 -2.00624823
## [170,]  116.674818 -14.53337035 -5.75495650 -2.28351323
## [171,]  236.921208  -0.46303700  0.97019374 -1.84720991
## [172,]  277.083578   8.74033191 -0.83969362 -5.73827405
## [173,]   87.027403   7.10459575 -1.96051582 -4.89097541
## [174,]    6.980211   4.54113657 -2.47470686 -3.15591973
## [175,]   -3.131605  -2.33519051 -4.30993061 -1.56218098
## [176,]  -88.458074 -18.77628456 -2.23757651 -4.82070817
## [177,]  -93.456242 -18.67081912 -1.78839152 -3.70935193
## [178,]  186.943190   0.21333080 -5.63050984 -3.47298694

Summary of the dataset

The summary shows that the wine data contains no missing values and is complete. However, the variables are measured on very different scales. For example, Proline has values in the hundreds (mean ≈ 747, max = 1680), while others such as Nonflavanoid_Phenols are less than 1.

a. The data are in the file Wine.csv. Consider the rows labeled “Proportion of Variance.” Explain why the value for PC1 is so much greater than that of any other column.

Answer:

In the unstandardized PCA, PC1 accounts for 99.81% of the variance, far more than any other component. This is because the wine variables are recorded on very different scales, and PCA without scaling works on the raw variances. Proline has a standard deviation of about 315 (a variance of roughly 99,000), while the next largest, Magnesium, has a standard deviation of about 14 (a variance of roughly 200), and variables such as Ash or Hue have standard deviations below 1. PC1 therefore points almost entirely along Proline, and its standard deviation (314.96) nearly matches Proline's (314.91). This does not mean that Proline explains the wines better than the other variables; it dominates only because of the size of its values.
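A rough check of this argument uses only the standard deviations reported in the skim summary of wine.df. The sketch below squares them to get variances and computes each variable's share of the total; because the inputs are rounded to two decimals, the result is approximate.

```r
# Standard deviations copied from the skim output (rounded to 2 decimals)
wine_sd <- c(Alcohol = 0.81, Malic_Acid = 1.12, Ash = 0.27,
             Ash_Alcalinity = 3.34, Magnesium = 14.28, Total_Phenols = 0.63,
             Flavanoids = 1.00, Nonflavanoid_Phenols = 0.12,
             Proanthocyanins = 0.57, Color_Intensity = 2.32, Hue = 0.23,
             OD280_OD315 = 0.71, Proline = 314.91)

wine_var  <- wine_sd^2                 # variance of each variable
var_share <- wine_var / sum(wine_var)  # share of the total variance

# Largest shares first
print(round(sort(var_share, decreasing = TRUE)[1:3], 4))
```

Proline alone holds well over 99% of the total variance, which is why the first unscaled component is so dominant.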

After Normalization

# standardize the data
data.std <- scale(wine.df[,-1])
# skim the standardized data
skimr::skim(as.data.frame(data.std))

Data summary
Name	as.data.frame(data.std)
Number of rows	178
Number of columns	13
_______________________
Column type frequency:
numeric	13
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	sd	p0	p25	p50	p75	p100	hist
Alcohol	1	1	-2.43	-0.79	0.06	0.83	2.25	▂▇▇▇▃
Malic_Acid	1	1	-1.43	-0.66	-0.42	0.67	3.10	▇▅▂▂▁
Ash	1	1	-3.67	-0.57	-0.02	0.70	3.15	▁▂▇▅▁
Ash_Alcalinity	1	1	-2.66	-0.69	0.00	0.60	3.15	▁▆▇▃▁
Magnesium	1	1	-2.08	-0.82	-0.12	0.51	4.36	▅▇▃▁▁
Total_Phenols	1	1	-2.10	-0.88	0.10	0.81	2.53	▅▇▇▇▁
Flavanoids	1	1	-1.69	-0.83	0.11	0.85	3.05	▆▆▇▂▁
Nonflavanoid_Phenols	1	1	-1.86	-0.74	-0.18	0.61	2.40	▃▇▅▃▂
Proanthocyanins	1	1	-2.06	-0.60	-0.06	0.63	3.48	▃▇▆▂▁
Color_Intensity	1	1	-1.63	-0.79	-0.16	0.49	3.43	▆▇▃▂▁
Hue	1	1	-2.09	-0.77	0.03	0.71	3.29	▅▇▇▃▁
OD280_OD315	1	1	-1.89	-0.95	0.24	0.79	1.96	▆▃▆▇▃
Proline	1	1	-1.49	-0.78	-0.23	0.76	2.96	▇▇▅▃▁

# Perform PCA on the wine data after standardizing the data and show the summary
pcs.std <- prcomp(data.std)
summary(pcs.std)

## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion  0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
##                            PC8     PC9   PC10    PC11    PC12    PC13
## Standard deviation     0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion  0.92018 0.94240 0.9617 0.97907 0.99205 1.00000

# Show the loadings (rotation matrix) of the first 4 principal components
pcs.std$rotation[, 1:4]

##                               PC1          PC2         PC3         PC4
## Alcohol              -0.144329395 -0.483651548 -0.20738262 -0.01785630
## Malic_Acid            0.245187580 -0.224930935  0.08901289  0.53689028
## Ash                   0.002051061 -0.316068814  0.62622390 -0.21417556
## Ash_Alcalinity        0.239320405  0.010590502  0.61208035  0.06085941
## Magnesium            -0.141992042 -0.299634003  0.13075693 -0.35179658
## Total_Phenols        -0.394660845 -0.065039512  0.14617896  0.19806835
## Flavanoids           -0.422934297  0.003359812  0.15068190  0.15229479
## Nonflavanoid_Phenols  0.298533103 -0.028779488  0.17036816 -0.20330102
## Proanthocyanins      -0.313429488 -0.039301722  0.14945431  0.39905653
## Color_Intensity       0.088616705 -0.529995672 -0.13730621  0.06592568
## Hue                  -0.296714564  0.279235148  0.08522192 -0.42777141
## OD280_OD315          -0.376167411  0.164496193  0.16600459  0.18412074
## Proline              -0.286752227 -0.364902832 -0.12674592 -0.23207086
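The output above lists loadings (one row per variable), whereas the earlier unstandardized output listed scores (one row per wine). The sketch below shows how the two relate, using the built-in USArrests data purely as a stand-in so that it runs without the mlba package.

```r
# USArrests ships with base R; it is a stand-in here, not the Wine data
x_std <- scale(USArrests)
pca   <- prcomp(x_std)

loadings <- pca$rotation   # one row per variable, one column per component
scores   <- pca$x          # one row per observation, one column per component

# Scores are the standardized data projected onto the loadings
manual_scores <- x_std %*% loadings

print(dim(loadings))
print(dim(scores))
print(all.equal(unname(scores), unname(manual_scores)))
```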

# Compare variance explained
non_norm_var <- summary(wine.pca)$importance[2,]
norm_var     <- summary(pcs.std)$importance[2,]
# Put into a table
variance_table <- data.frame(
  Component = paste0("PC", 1:length(non_norm_var)),
  Non_Normalized = round(non_norm_var, 4),
  Normalized = round(norm_var, 4)
) # Creates a data frame to compare variance explained

knitr::kable(variance_table,
             caption = "Proportion of Variance Explained: Non-normalized vs Normalized PCA")

Proportion of Variance Explained: Non-normalized vs Normalized PCA
	Component	Non_Normalized	Normalized
PC1	PC1	0.9981	0.3620
PC2	PC2	0.0017	0.1921
PC3	PC3	0.0001	0.1112
PC4	PC4	0.0000	0.0707
PC5	PC5	0.0000	0.0656
PC6	PC6	0.0000	0.0494
PC7	PC7	0.0000	0.0424
PC8	PC8	0.0000	0.0268
PC9	PC9	0.0000	0.0222
PC10	PC10	0.0000	0.0193
PC11	PC11	0.0000	0.0174
PC12	PC12	0.0000	0.0130
PC13	PC13	0.0000	0.0080

b. Comment on the use of normalization (standardization) in part (a).

Answer:

Standardization rescales every variable to mean 0 and standard deviation 1, so all variables are measured on the same scale. Once the data are standardized, no variable can dominate the analysis just because it is recorded in larger numbers. The results are more balanced: PC1 now explains 36.2% of the variance instead of 99.81%, and the first four components together explain about 73.6%. Standardization is appropriate here because the wine variables are measured on very different scales, and it gives each variable an equal chance to contribute to the components.
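prcomp() can also standardize internally through its scale. argument, which should give the same component standard deviations as calling scale() first. The sketch below illustrates this and the effect on the variance shares, again using the built-in USArrests data as a stand-in for the Wine data.

```r
# USArrests is a stand-in dataset that ships with base R
x <- USArrests

pca_raw    <- prcomp(x)                  # no scaling
pca_manual <- prcomp(scale(x))           # standardize first, then PCA
pca_scaled <- prcomp(x, scale. = TRUE)   # let prcomp standardize

# Proportion of variance explained by each component
share <- function(p) p$sdev^2 / sum(p$sdev^2)

print(round(share(pca_raw), 4))
print(round(share(pca_scaled), 4))
print(all.equal(pca_manual$sdev, pca_scaled$sdev))
```

With standardized data the component variances sum to the number of variables, so each variable starts with one unit of variance to contribute.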

Business Data Mining – Homework 1