Homework assignment No.7

  1. Find principal components of your data (use: Matlab: pca, R: princomp, or other).

  2. Make visualizations in a two-dimensional projection.

  3. Make the attribute axis representation.

  4. Select the most informative / non-informative attributes according to the longest / shortest axes.

  5. Create different subsets of attributes (informative only, non-informative only, no non-informative attributes, etc.) and visualize the data using the nonlinear projection method MDS.

  6. Present different visualizations, comment on the result.

Visualizations

library(ggplot2)

# PCA on centered, standardized attributes
pc <- prcomp(data7, center = TRUE, scale. = TRUE)
# Percentage of variance explained by each component
paaisk <- round((pc$sdev^2 / sum(pc$sdev^2) * 100), 2)
ggplot(as.data.frame(pc$x), 
       aes(x = PC1, y = PC2)) + 
  labs(title = "Scatterplot of the first two principal components",
       x = paste0("PC1 (", paaisk[1], "%)"), y = paste0("PC2 (", paaisk[2], "%)")) + 
  geom_point() + theme_classic()

The 1st principal component captures most of the variance, \(58.4\%\). The 2nd principal component explains \(24.8\%\), less than half as much as the 1st. Together the first two components account for \(83.2\%\) of the total variance, so the two-dimensional projection represents the data reasonably well.
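The percentages above come from the squared standard deviations of the components. As a minimal sketch of that calculation, the block below uses the built-in `USArrests` data as a stand-in (an assumption, since `data7` is not shown here) and also reports the running total. The names `pc_demo`, `explained` and `cumulative` are illustrative only.

```r
# Stand-in data: the built-in USArrests set (an assumption, not the homework data)
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
explained <- pc_demo$sdev^2 / sum(pc_demo$sdev^2) * 100

# The running total shows how much variance the first k components cover
cumulative <- cumsum(explained)

print(round(data.frame(PC = seq_along(explained),
                       explained = explained,
                       cumulative = cumulative), 2))
```

The `cumulative` column is what justifies keeping only the first two components for plotting.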

plot(seq_along(paaisk), paaisk, type = "b", 
     xlim = c(1, 6.4), ylim = c(0, 62),
     xlab = "Principal component", ylab = "Variance explained, %",
     main = "Variance explained by each principal component")
# Offset the labels slightly to the right of and above each point
text(seq_along(paaisk) + 0.3, paaisk + 1.7, labels = paste0(paaisk, "%"))

In the bi-plot, the attribute vectors have roughly the same length, so the attributes carry a comparable amount of information in the first two principal components. Two groups of attributes can be distinguished:

  1. FirstDimension and SecondDimension;

  2. Value, LDM, Volume, Weight.

ggbiplot(pc,
         obs.scale = 1,
         var.scale = 1,
         ellipse = TRUE,
         circle = TRUE,
         ellipse.prob = 0.68) +
  labs(title = "Bi-plot") + 
  theme_classic()
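The axis lengths read off the bi-plot can also be checked numerically. The sketch below assumes that the length of an attribute axis is the length of its loading vector in the PC1-PC2 plane after scaling by the component standard deviations; `ggbiplot` may apply an additional overall rescaling, which does not change the ranking. It uses the built-in `USArrests` data as a stand-in, and `pc_demo`, `scaled_loadings` and `axis_length` are illustrative names.

```r
# Stand-in data: the built-in USArrests set (an assumption, not the homework data)
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings scaled by the component standard deviations; for standardized data
# these equal the correlations between attributes and components
scaled_loadings <- sweep(pc_demo$rotation, 2, pc_demo$sdev, FUN = "*")

# Length of each attribute axis in the PC1-PC2 plane
axis_length <- sqrt(rowSums(scaled_loadings[, 1:2]^2))

print(round(sort(axis_length, decreasing = TRUE), 3))
```

Attributes with the longest axes are the best represented in the projection. Sorting the lengths gives a direct way to pick informative and non-informative subsets.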

Classical (metric) multidimensional scaling was used to visualize the two attribute subsets. I also clustered the objects into 8 groups using k-means, as that is the number of unique Unit Types.
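One property of this choice is worth keeping in mind: classical MDS applied to Euclidean distances is a linear method, and its coordinates should match the PCA scores of the same centered data up to the sign of each axis. The sketch below checks this on the built-in `USArrests` data, used as a stand-in (an assumption); `X`, `mds_demo`, `pca_demo` and `max_diff` are illustrative names.

```r
# Stand-in data, standardized (an assumption, not the homework data)
X <- scale(USArrests)

mds_demo <- cmdscale(dist(X), k = 2)  # classical MDS on Euclidean distances
pca_demo <- prcomp(X)$x[, 1:2]        # scores on the first two components

# Columns may differ in sign, so compare absolute values
max_diff <- max(abs(abs(mds_demo) - abs(pca_demo)))
print(max_diff)
```

If a genuinely nonlinear projection is wanted, a non-metric variant such as `MASS::isoMDS` could be tried on the same distance matrix.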

mds <- pirms %>%
  dist() %>%       # Euclidean distances between objects
  cmdscale()       # classical MDS, two dimensions by default
# Name the columns before conversion so that as_tibble() does not warn
colnames(mds) <- c("Dim.1", "Dim.2")
mds <- as_tibble(mds)

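# Cluster the MDS coordinates into 8 groups with k-means
set.seed(1)  # arbitrary seed so that the random start, and thus the grouping, is reproducible
clust <- kmeans(mds, 8)$cluster %>% as.factor()
mds <- mds %>%
  mutate(groups = clust)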
# Plot and color by groups
ggscatter(mds, x = "Dim.1", y = "Dim.2", 
          label = rownames(pirms),
          color = "groups",
          palette = "Set1",
          size = 1, 
          ellipse = TRUE,
          ellipse.type = "convex",
          repel = TRUE)
## Warning: ggrepel: 34 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

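mds <- antrs %>%
  dist() %>%       # Euclidean distances between objects
  cmdscale()       # classical MDS, two dimensions by default
# Name the columns before conversion so that as_tibble() does not warn
colnames(mds) <- c("Dim.1", "Dim.2")
mds <- as_tibble(mds)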

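# Cluster the MDS coordinates into 8 groups with k-means
set.seed(1)  # arbitrary seed so that the random start, and thus the grouping, is reproducible
clust <- kmeans(mds, 8)$cluster %>% as.factor()
mds <- mds %>%
  mutate(groups = clust)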
# Plot and color by groups
ggscatter(mds, x = "Dim.1", y = "Dim.2", 
          label = rownames(antrs),
          color = "groups",
          palette = "Set1",
          size = 1, 
          ellipse = TRUE,
          ellipse.type = "convex",
          repel = TRUE)
## Warning: ggrepel: 27 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
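Since the number of clusters was chosen to match the number of Unit Types, one way to comment on the result is to cross-tabulate the k-means groups against the known labels. The sketch below shows the idea on the built-in `iris` data, where `Species` plays the role of the Unit Type (an assumption; the homework label column is not shown here). The names `coords`, `clusters` and `agreement` are illustrative.

```r
set.seed(1)                                   # arbitrary seed for reproducibility
coords <- cmdscale(dist(scale(iris[, 1:4])))  # 2-D classical MDS coordinates
k <- nlevels(iris$Species)                    # one cluster per known class
clusters <- kmeans(coords, centers = k, nstart = 25)$cluster

# Rows: k-means cluster, columns: known class
agreement <- table(cluster = clusters, class = iris$Species)
print(agreement)
```

A table dominated by one large count per row would indicate that the clusters found in the projection follow the known classes; counts spread across a row would indicate that they do not.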