DAT 301 - hw3

2024-09-22

Information about ggplot plots 1-2

Used the mpg dataset in ggplot2
ggplot plots 1-2: Made two scatter plots of city vs. highway MPG, one for cars released in 1999 and one for cars released in 2008. K-means clustering was used to group the data points into 4 clusters.
The analysis could be used by car manufacturers to classify new vehicles into categories such as compact, midsize, SUV, pickup, etc. Additionally, it can identify competing models within and outside the manufacturer’s lineup.

KMeans Overview Part 1

K-means clustering is an unsupervised machine learning algorithm that divides data into k clusters.

Steps:

1 - Randomly create k centroids.

2 - Calculate the distance between each data point and every centroid, then assign the point to the nearest centroid. The Euclidean distance between a data point x_i and a centroid C_k can be calculated using the equation below, where p is the number of features and j indexes them. \[ d(x_i, C_k) = \sqrt{\sum_{j=1}^{p} (x_{ij} - C_{kj})^2} \]

KMeans Overview Part 2

3 - Recalculate the coordinates of each of the k centroids as the mean of all points assigned to it using the equation below, where S_k is the set of points currently assigned to centroid C_k. \[ C_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i \]

4 - Repeat steps 2-3 until the centroid locations and the cluster assignments stop changing between iterations.
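
The four steps can be sketched in base R as shown below. This is a minimal illustration, not the algorithm behind R's built-in kmeans() (which defaults to the Hartigan-Wong method). The function name simple_kmeans, the toy data, and the choice to start from k randomly picked data points are all assumptions made for the example.

```r
# Toy 2-D data (hypothetical values, not taken from the mpg dataset)
x <- rbind(
  c(1, 1), c(1.5, 2), c(1, 1.5),
  c(8, 8), c(9, 8.5), c(8.5, 9)
)

# Hypothetical helper illustrating steps 1-4
simple_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: use k randomly chosen data points as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    # (the square root is skipped because it does not change which is nearest)
    dists <- sapply(seq_len(k), function(j) {
      centre <- matrix(centroids[j, ], nrow(x), ncol(x), byrow = TRUE)
      rowSums((x - centre)^2)
    })
    new_cluster <- max.col(-dists, ties.method = "first")

    # Step 4: stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Step 3: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centroids)
}

set.seed(123)
fit <- simple_kmeans(x, k = 2)
```

Here fit$cluster holds one cluster label per row of x and fit$centers holds the final centroid coordinates.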

Code for KMeans

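library(ggplot2)  # provides the mpg dataset
library(dplyr)    # provides %>%, filter(), select(), mutate()

set.seed(123)  # fix the seed so the random starting centroids are reproducible

# mpg2008 is not defined on this slide; it is assumed here to hold
# the city and highway MPG columns for cars made in 2008
mpg2008 <- mpg %>%
  filter(year == 2008) %>%
  select(cty, hwy)

kmeans_result <- kmeans(mpg2008, centers = 4)  # centers = 4 requests 4 clusters

# Add cluster assignments for cars made in 2008
mpg2008Clusters <- mpg %>%
  filter(year == 2008) %>%
  mutate(cluster = as.factor(kmeans_result$cluster))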
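
Since the same clustering is applied to both model years, one possible way to produce the two scatter plots is to wrap the steps in a small helper. The sketch below is an assumption about how the plots could be built, not the original plotting code. The helper name plot_mpg_clusters, the choice of cty and hwy as the only clustering features, and the plot labels are all hypothetical.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper: cluster one model year on city/highway MPG and plot it
plot_mpg_clusters <- function(model_year, k = 4) {
  cars <- mpg %>% filter(year == model_year)

  set.seed(123)  # same seed as above so the result is reproducible
  fit <- kmeans(select(cars, cty, hwy), centers = k)

  cars <- mutate(cars, cluster = as.factor(fit$cluster))

  ggplot(cars, aes(x = cty, y = hwy, color = cluster)) +
    geom_point() +
    labs(
      title = paste("City vs. highway MPG,", model_year),
      x = "City MPG",
      y = "Highway MPG",
      color = "Cluster"
    )
}

plot_1999 <- plot_mpg_clusters(1999)
plot_2008 <- plot_mpg_clusters(2008)
```

Printing plot_1999 or plot_2008 draws the corresponding scatter plot. Because several cars share the same MPG values, geom_jitter() could be swapped in for geom_point() to reduce overplotting.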

ggplot plot 1

ggplot plot 2

Information about plotly plot

Used the mpg dataset in ggplot2
Created a plot to visualize car class distribution by car manufacturers (2008)
Consumers can use it to identify which manufacturers dominate in specific class segments, potentially offering more options and availability in those segments
Manufacturers can use it to identify their main competitors in each class segment
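
A plot of this kind could be built roughly as follows. This is a sketch under assumptions: the original chart type is not stated, so a stacked bar chart is used here, and the names class_counts and class_plot, along with the axis labels, are made up for the example. Each bar segment counts rows of mpg, which are model variants rather than distinct models.

```r
library(ggplot2)  # loaded only for the mpg dataset
library(dplyr)
library(plotly)

# Count 2008 entries for every manufacturer/class combination
class_counts <- mpg %>%
  dplyr::filter(year == 2008) %>%
  count(manufacturer, class)

# Stacked bars: one bar per manufacturer, one colored segment per class
class_plot <- plot_ly(
  class_counts,
  x = ~manufacturer,
  y = ~n,
  color = ~class,
  type = "bar"
) %>%
  layout(
    barmode = "stack",
    title = "Car class distribution by manufacturer (2008)",
    xaxis = list(title = "Manufacturer"),
    yaxis = list(title = "Number of entries in mpg")
  )
```

Printing class_plot renders the interactive chart. Hovering over a segment shows the count for that manufacturer and class.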