Find the principal components of your data (Matlab: pca, R: princomp, or another tool).
Visualize the data in a two-dimensional projection.
Plot the attribute axes in that projection.
Select the most and least informative attributes according to the longest and shortest axes, respectively.
Create different subsets of attributes (informative only, non-informative only, all except the non-informative ones, etc.) and visualize the data using the nonlinear projection method MDS.
Present the different visualizations and comment on the results.
library(ggplot2)

# PCA on centered, standardized attributes
pc <- prcomp(data7, center = TRUE, scale. = TRUE)
# Percentage of the total variance explained by each component
paaisk <- round(pc$sdev^2 / sum(pc$sdev^2) * 100, 2)
ggplot(as.data.frame(pc$x),
       aes(x = PC1, y = PC2)) +
  labs(title = "Scatterplot of the first two principal components",
       x = paste0("PC1 (", paaisk[1], "%)"), y = paste0("PC2 (", paaisk[2], "%)")) +
  geom_point() + theme_classic()
The first principal component captures most of the variance, explaining \(58.4\%\) of it. The second principal component explains \(24.8\%\), less than half as much as the first. Together, the first two components account for \(83.2\%\) of the total variance.
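The percentages quoted above come from the squared standard deviations of the components. The following is a minimal sketch of that calculation, including the cumulative share; since `data7` is not available here, the built-in `USArrests` data set and the names `pc_demo`, `var_share` and `cum_share` are stand-ins chosen only for illustration.

```r
# Stand-in data: `data7` from the report is not available here,
# so the built-in USArrests data set is used purely for illustration.
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each component is the square of its standard deviation;
# dividing by the total gives the share explained, in percent.
var_share <- pc_demo$sdev^2 / sum(pc_demo$sdev^2) * 100

# The cumulative share shows how much a k-dimensional projection retains.
cum_share <- cumsum(var_share)

print(round(data.frame(component  = seq_along(var_share),
                       share      = var_share,
                       cumulative = cum_share), 2))
```

The second row of the `cumulative` column is the figure to read off when judging a two-dimensional projection.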
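# Scree plot: share of the variance explained by each component
plot(seq_along(paaisk), paaisk, type = "b",
     xlim = c(1, 6.4), ylim = c(0, 62),
     xlab = "Principal component", ylab = "Variance explained, %",
     main = "Total variance explained by each principal component")
# Label each point, shifted right and up so the text does not cover the line
text(seq_along(paaisk) + 0.3, paaisk + 1.7, labels = paste0(paaisk, "%"))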
The attribute vectors in the biplot are of roughly equal length, so all attributes contribute a comparable amount of information to the first two components. Two groups of attributes can be distinguished:
FirstDimension and SecondDimension;
Value, LDM, Volume, Weight.
library(ggbiplot)

ggbiplot(pc,
         obs.scale = 1,
         var.scale = 1,
         ellipse = TRUE,
         circle = TRUE,
         ellipse.prob = 0.68) +
  labs(title = "Bi-plot") +
  theme_classic()
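The arrow lengths read off the biplot can also be checked numerically. The sketch below assumes standardized data, in which case a loading multiplied by the component's standard deviation equals the correlation between the attribute and the component; the length of an attribute's vector in the PC1-PC2 plane is then the square root of the sum of its two squared correlations. `USArrests` and the names `pc_demo`, `cors` and `arrow_len` are stand-ins, and ggbiplot may rescale the arrows by a common factor for display, so only relative lengths should be compared with the plot.

```r
# Stand-in data: the report's `data7` is not available here.
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Multiply column k of the loading matrix by the k-th component sd.
# For standardized data this gives attribute-component correlations.
cors <- sweep(pc_demo$rotation, 2, pc_demo$sdev, FUN = "*")

# Length of each attribute vector in the plane of the first two components.
# Values close to 1 mean the attribute is almost fully represented there.
arrow_len <- sqrt(rowSums(cors[, 1:2]^2))

print(round(sort(arrow_len, decreasing = TRUE), 3))
```

Attributes with the shortest vectors are the candidates for the "non-informative" subset, and attributes whose vectors point in similar directions form groups like the two listed above.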
Classical (metric) multidimensional scaling was used to visualize the two attribute subsets. I also clustered the objects into 8 groups with k-means, since that is the number of unique Unit Types.
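Two properties of this pipeline are worth keeping in mind when reading the plots. Classical MDS applied to Euclidean distances yields the same configuration as the PCA scores of the same data, up to the sign of each axis, and k-means starts from random centers, so cluster labels can change between runs unless a seed is fixed. The sketch below illustrates both points; `USArrests`, the seed, `centers = 4` and `nstart = 25` are assumptions made for the example, not values from the report.

```r
# Stand-in data: the report's attribute subsets are not available here.
x <- scale(USArrests)

# Classical (metric) MDS on Euclidean distances, two dimensions.
mds_demo <- cmdscale(dist(x), k = 2)
colnames(mds_demo) <- c("Dim.1", "Dim.2")

# PCA scores of the same standardized data.
pca_scores <- prcomp(x)$x[, 1:2]

# The two configurations agree up to the sign of each axis.
max_diff <- max(abs(abs(mds_demo) - abs(pca_scores)))
print(max_diff)

# k-means uses random starting centers; a fixed seed and several
# restarts make the grouping reproducible and more stable.
set.seed(1)
km <- kmeans(mds_demo, centers = 4, nstart = 25)
print(table(km$cluster))
```

If a genuinely nonlinear embedding is wanted, a different distance measure or a non-metric MDS variant would be needed; that choice is outside what the report itself does.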
library(dplyr)   # %>%, mutate(), as_tibble()
library(ggpubr)  # ggscatter()

# Classical MDS on the Euclidean distances between objects
mds <- pirms %>%
  dist() %>%
  cmdscale()
# Name the columns before converting, which avoids the tibble name-repair warning
colnames(mds) <- c("Dim.1", "Dim.2")
mds <- as_tibble(mds)
# Cluster the MDS coordinates into 8 groups with k-means
clust <- kmeans(mds, 8)$cluster %>% as.factor()
mds <- mds %>%
  mutate(groups = clust)
# Plot and color by groups
ggscatter(mds, x = "Dim.1", y = "Dim.2",
label = rownames(pirms),
color = "groups",
palette = "Set1",
size = 1,
ellipse = TRUE,
ellipse.type = "convex",
repel = TRUE)
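# Classical MDS on the Euclidean distances between objects
mds <- antrs %>%
  dist() %>%
  cmdscale()
# Name the columns before converting to a tibble
colnames(mds) <- c("Dim.1", "Dim.2")
mds <- as_tibble(mds)
# Cluster the MDS coordinates into 8 groups with k-means
clust <- kmeans(mds, 8)$cluster %>% as.factor()
mds <- mds %>%
  mutate(groups = clust)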
# Plot and color by groups
ggscatter(mds, x = "Dim.1", y = "Dim.2",
label = rownames(antrs),
color = "groups",
palette = "Set1",
size = 1,
ellipse = TRUE,
ellipse.type = "convex",
repel = TRUE)