Nika_Boh_MVA

Source: Data retrieved on Jan, 29th, 2024, from: https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction/data

library(readr)
AirplaneSatisfaction <- read_csv("~/Bootcamp_working/AirplaneSatisfaction.csv")

## Rows: 129880 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Gender, Customer Type, Type of Travel, Class, Satisfaction
## dbl (19): ID, Age, Flight Distance, Departure Delay, Arrival Delay, Departur...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mydata <- data.frame(AirplaneSatisfaction)
nrow(mydata)

## [1] 129880

head(mydata)

##   ID Gender Age Customer.Type Type.of.Travel    Class Flight.Distance
## 1  1   Male  48    First-time       Business Business             821
## 2  2 Female  35     Returning       Business Business             821
## 3  3   Male  41     Returning       Business Business             853
## 4  4   Male  50     Returning       Business Business            1905
## 5  5 Female  49     Returning       Business Business            3470
## 6  6   Male  43     Returning       Business Business            3788
##   Departure.Delay Arrival.Delay Departure.and.Arrival.Time.Convenience
## 1               2             5                                      3
## 2              26            39                                      2
## 3               0             0                                      4
## 4               0             0                                      2
## 5               0             1                                      3
## 6               0             0                                      4
##   Ease.of.Online.Booking Check.in.Service Online.Boarding Gate.Location
## 1                      3                4               3             3
## 2                      2                3               5             2
## 3                      4                4               5             4
## 4                      2                3               4             2
## 5                      3                3               5             3
## 6                      4                3               5             4
##   On.board.Service Seat.Comfort Leg.Room.Service Cleanliness Food.and.Drink
## 1                3            5                2           5              5
## 2                5            4                5           5              3
## 3                3            5                3           5              5
## 4                5            5                5           4              4
## 5                3            4                4           5              4
## 6                4            4                4           3              3
##   In.flight.Service In.flight.Wifi.Service In.flight.Entertainment
## 1                 5                      3                       5
## 2                 5                      2                       5
## 3                 3                      4                       3
## 4                 5                      2                       5
## 5                 3                      3                       3
## 6                 4                      4                       4
##   Baggage.Handling            Satisfaction
## 1                5 Neutral or Dissatisfied
## 2                5               Satisfied
## 3                3               Satisfied
## 4                5               Satisfied
## 5                3               Satisfied
## 6                4               Satisfied

Description of the data set:

Our data set includes 129880 units (rows) and 24 variables (columns).

ID: Unique passenger identifier, numeric type of a variable
Gender: Gender of the passenger (Female/Male)
Age: Age of the passenger, numeric type of a variable
Customer Type: Type of airline customer (First-time/Returning)
Type of Travel: Purpose of the flight (Business/Personal)
Class: Travel class in the airplane for the passenger seat
Flight Distance:Flight distance in miles
Departure Delay: Flight departure delay in minutes
Arrival Delay:Flight arrival delay in minutes
Departure and Arrival Time Convenience: “Satisfaction level with the convenience of the flight departure and arrival times from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Ease of Online Booking: “Satisfaction level with the online booking experience from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Check-in Service: “Satisfaction level with the check-in service from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Online Boarding: “Satisfaction level with the online boarding experience from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Gate Location: “Satisfaction level with the gate location in the airport from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
On-board Service: “Satisfaction level with the on-boarding service in the airport from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Seat Comfort: “Satisfaction level with the comfort of the airplane seat from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Leg Room Service: “Satisfaction level with the leg room of the airplane seat from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Cleanliness. “Satisfaction level with the cleanliness of the airplane from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Food and Drink: “Satisfaction level with the food and drinks on the airplane from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
In-flight Service: “Satisfaction level with the in-flight service from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
In-flight Wifi Service: “Satisfaction level with the in-flight Wifi service from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
In-flight Entertainment: “Satisfaction level with the in-flight entertainment from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Baggage Handling: “Satisfaction level with the baggage handling from the airline from 1 (lowest) to 5 (highest) - 0 means”“not applicable”“”
Satisfaction: Overall satisfaction level with the airline (Satisfied/Neutral or unsatisfied)

# Convert columns to numeric

mydata[c(1,3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)] <- lapply(mydata[c(1,3, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)], as.numeric)

set.seed(1) #Setting initial point of sampling
mydata <- mydata[sample(nrow(mydata), 300), ]    #Random sample of 300 units

mydata$ID <- seq(1, nrow(mydata))
rownames(mydata) <- mydata$ID

any(is.na(mydata))

## [1] TRUE

library(tidyr)
mydata <- drop_na(mydata)

any(is.na(mydata))

## [1] FALSE

table(mydata$Gender)

## 
## Female   Male 
##    147    151

table(mydata$Customer.Type)

## 
## First-time  Returning 
##         44        254

table(mydata$Type.of.Travel)

## 
## Business Personal 
##      199       99

table(mydata$Class)

## 
##     Business      Economy Economy Plus 
##          146          127           25

table(mydata$Satisfaction)

## 
## Neutral or Dissatisfied               Satisfied 
##                     167                     131

This was done to see, which levels of the variables should be set a a reference group, when factoring.

mydata$GenderF <- factor(mydata$Gender,            #Factoring the categorical variables
                             levels = c("Male", "Female"), 
                             labels = c("Male", "Female"))

mydata$Customer.TypeF <- factor(mydata$Customer.Type, 
                             levels = c( "Returning", "First-time"), 
                             labels = c("Returning", "First-time"))

mydata$Type.of.TravelF <- factor(mydata$Type.of.Travel, 
                             levels = c("Business", "Personal"), 
                             labels = c("Business", "Personal"))

mydata$ClassF <- factor(mydata$Class, 
                             levels = c("Business","Economy","Economy Plus"), 
                             labels = c("Business","Economy","Economy Plus"))
                             
mydata$SatisfactionF <- factor(mydata$Satisfaction, 
                             levels = c("Neutral or Dissatisfied", "Satisfied"), 
                             labels = c("Neutral or Dissatisfied", "Satisfied"))

summary(mydata[ , c(-2, -4, -9, -10, -11, -12, -13, -14, -15, -16, -17, -18, -19, -20, -21, -22, -23, -24)])

##        ID              Age        Type.of.Travel        Class          
##  Min.   :  1.00   Min.   : 7.00   Length:298         Length:298        
##  1st Qu.: 75.25   1st Qu.:28.25   Class :character   Class :character  
##  Median :149.50   Median :40.50   Mode  :character   Mode  :character  
##  Mean   :149.98   Mean   :40.31                                        
##  3rd Qu.:223.75   3rd Qu.:51.00                                        
##  Max.   :300.00   Max.   :79.00                                        
##  Flight.Distance  Departure.Delay    GenderF       Customer.TypeF
##  Min.   :  67.0   Min.   :  0.00   Male  :151   Returning :254   
##  1st Qu.: 375.5   1st Qu.:  0.00   Female:147   First-time: 44   
##  Median : 765.0   Median :  0.00                                 
##  Mean   :1142.2   Mean   : 11.03                                 
##  3rd Qu.:1649.2   3rd Qu.:  6.00                                 
##  Max.   :3989.0   Max.   :211.00                                 
##  Type.of.TravelF          ClassF                    SatisfactionF
##  Business:199    Business    :146   Neutral or Dissatisfied:167  
##  Personal: 99    Economy     :127   Satisfied              :131  
##                  Economy Plus: 25                                
##                                                                  
##                                                                  
##

Minimum age of the passenger in the sample is 7 years old and the oldest passenger is 79 years old. We have 151 males and 147 females in the sample. Out of 298 people, 146 of them are travelling in a Business class, 127 of them are travelling in Economy class, and 25 of them are in Economy Plus.
Out of 298 passengers, we have 199 passengers, who are travelling for a Business purpose, and 99 of them are travelling due to Personal reasons.
The average value of a delay of departure is 11.03 minutes.
75% of passengers are travelling up to 1649,2 miles, and the remaining 25% of the passengers are travelling longer routes.

mydata$Flight.Distance_z <- scale(mydata$Flight.Distance)  #Standardizing the variables used
mydata$Departure.Delay_z <- scale(mydata$Departure.Delay)
mydata$Seat.Comfort_z <- scale(mydata$Seat.Comfort)

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[ , c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")]),
      type = "pearson")

##                   Flight.Distance_z Departure.Delay_z Seat.Comfort_z
## Flight.Distance_z              1.00              0.02           0.08
## Departure.Delay_z              0.02              1.00          -0.07
## Seat.Comfort_z                 0.08             -0.07           1.00
## 
## n= 298 
## 
## 
## P
##                   Flight.Distance_z Departure.Delay_z Seat.Comfort_z
## Flight.Distance_z                   0.7301            0.1919        
## Departure.Delay_z 0.7301                              0.2437        
## Seat.Comfort_z    0.1919            0.2437

We see a weak correlation between the variables, which is what we aim to see here.

mydata$Dissimilarity <- sqrt(mydata$Flight.Distance_z^2 + mydata$Departure.Delay_z^2 + mydata$Seat.Comfort_z^2)

head(mydata[order(-mydata$Dissimilarity), c(1, 30, 31, 32, 33)], 10)   #Check for the outliers based on the Dissimilarity index

##      ID Flight.Distance_z Departure.Delay_z Seat.Comfort_z Dissimilarity
## 145 145       -0.23415541         7.0150083     -0.2987280      7.025269
## 203 203        0.05848708         6.1379851      0.4431132      6.154237
## 219 219       -0.62335001         5.9625805     -1.0405691      6.084712
## 43   43       -0.54571017         4.5593434     -1.7824103      4.925688
## 196 196       -0.83934804         4.4190197      0.4431132      4.519800
## 40   40        0.36207878         3.9980486      1.1849543      4.185643
## 254 256       -0.13660791         3.6472393     -0.2987280      3.662001
## 79   79       -0.27397071         3.3315109     -1.0405691      3.500973
## 146 146        2.73407554        -0.3870674     -1.7824103      3.286636
## 234 236        0.21177600         2.6649733     -1.7824103      3.213085

mydata1 <- mydata[c(-145, -203, -219, -43), ]   #Deleting the units

mydata1$Flight.Distance_z <- scale(mydata1$Flight.Distance_z)  #Standardizing the variables AGAIN
mydata1$Departure.Delay_z <- scale(mydata1$Departure.Delay_z)
mydata1$Seat.Comfort_z <- scale(mydata1$Seat.Comfort_z)

library(ggplot2)
#install.packages("factoextra")
library(factoextra)

## Warning: package 'factoextra' was built under R version 4.3.2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

distance <- get_dist(mydata1[c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")],
                     method = "euclidian")       #Calculating Euclidean distances and adding them to the dataset

distance2 <- distance^2                          #Creating squared Euclidean distances

fviz_dist(distance2)

From the graph we can see the formation of squares but to determine if the data really is clusterable, we need to check it with Hopkins statistic.

get_clust_tendency(mydata1[c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")],
                   n = nrow(mydata1) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8703446
## 
## $plot
## NULL

We wish to have a Hopkins statistic above 0.5, which is achieved here. The higher the Hopkins, the more the data is clusterable. With the value of 0.87, our data is clusterable and we can continue with analysis on how many clusters we should have.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

WARD <- mydata1[c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")] %>%
           get_dist(method = "euclidian") %>%
           hclust(method = "ward.D2")

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 294

fviz_dend(WARD)  #Draw a dendrogram

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From the dendrogram, we can see that we can do the cut where we get 4 clusters. If we were wondering how many steps we would need to merge the units into 1 cluster, we would need it to be (n - 1), which happens after 294 steps (originally our sample had 300 units and then we deleted 2 as they had missing values, and then additional 3 because they were outliers based on the dissimilarity index - so 295- 1 = 294).
We could not cut it into more groups are the horizontal lines would be too close together.

fviz_dend(WARD,
          k = 4,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)

mydata1$ClusterWard <- cutree(WARD,
                             k = 4)
head(mydata1[c(1, 7, 8, 16, 34)])

##   ID Flight.Distance Departure.Delay Seat.Comfort ClusterWard
## 1  1             669               0            3           1
## 2  2             854              15            5           1
## 3  3            1917               0            5           2
## 4  4             387               8            4           1
## 5  5            2475              15            4           2
## 6  6             468              92            3           3

We got a new variable called ClusterWard, which tells us which of the cluster groups the unit was assigned to.

Initial_leaders <- aggregate(mydata1[ , c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")],
                             by = list(mydata1$ClusterWard),
                             FUN = mean)
Initial_leaders

##   Group.1 Flight.Distance_z Departure.Delay_z Seat.Comfort_z
## 1       1       -0.58260538        -0.3399479      0.5715136
## 2       2        1.46980516        -0.3313738      0.6355359
## 3       3        0.26337499         2.4171960      0.2807911
## 4       4       -0.02815888        -0.1149369     -1.2982233

Here, we have our initial leaders, which are used for K-Means clustering technique later on.

K_MEANS <- hkmeans(mydata1[c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")], 
                   k = 4,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 4 clusters of sizes 149, 54, 19, 72
## 
## Cluster means:
##   Flight.Distance_z Departure.Delay_z Seat.Comfort_z
## 1        -0.4972643        -0.2694633      0.5191469
## 2         1.6547944        -0.2010963      0.4069226
## 3         0.2463884         3.1825929      0.1608489
## 4        -0.2770541        -0.1313894     -1.4219838
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   2   1   2   3   1   1   1   2   4   2   1   2   1   1   2   1   2   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   4   4   1   4   4   1   3   1   2   1   1   2   4   1   1   2   2   4   1   3 
##  41  42  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61 
##   2   1   4   1   1   1   4   4   1   1   1   4   1   2   1   4   1   1   1   4 
##  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81 
##   1   1   1   2   1   1   4   2   1   1   2   2   1   1   2   2   1   3   4   4 
##  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 
##   1   4   4   1   1   1   1   2   1   1   4   2   4   2   2   1   4   1   2   1 
## 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 
##   1   1   2   2   1   2   1   1   3   1   1   1   1   3   1   1   4   4   1   1 
## 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 
##   1   4   1   1   1   1   1   2   1   2   1   1   1   1   4   4   1   2   2   4 
## 142 143 144 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
##   4   4   4   2   1   4   1   1   1   1   1   1   2   1   4   4   1   2   4   2 
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 
##   4   1   1   1   2   1   1   1   1   4   1   2   1   4   4   1   4   2   4   1 
## 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 
##   1   1   1   4   1   1   2   4   1   1   1   4   1   3   4   1   4   1   1   2 
## 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 220 221 222 223 224 
##   4   4   4   1   4   3   1   1   1   1   4   4   4   4   1   3   2   1   1   1 
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 
##   3   4   2   4   4   1   4   1   1   3   4   2   4   1   1   3   2   3   3   1 
## 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 
##   2   1   4   4   1   1   4   1   1   3   3   1   1   4   4   1   3   1   2   4 
## 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 
##   2   2   2   1   1   1   1   3   1   2   1   1   1   2   4   1   2   1   1   4 
## 285 286 287 288 289 290 291 292 293 294 295 296 297 298 
##   4   1   1   2   4   3   4   1   1   1   4   4   1   2 
## 
## Within cluster sum of squares by cluster:
## [1] 90.30902 69.43067 62.22226 63.77169
##  (between_SS / total_SS =  67.5 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

Our cluster one has 150 units (150/295 = 51%), cluster two has 61 (21%), cluster three contains 12 units (4,1%), and the fourth one includes 72 units (24%).
The ratio between SS of between the groups with the total SS is 67.7%, which is good. We want to maximize it (being as close as 100%!), because we want the clusters to be homogeneous (homogeneity within the cluster) and in between the clusters we want to have a heterogeneity (heterogeneity between the cluster groups).

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

Now, let’s check the reclassifications with the following tables.

mydata1$ClusterK_Means <- K_MEANS$cluster
head(mydata1[c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z")])

##   Flight.Distance_z Departure.Delay_z Seat.Comfort_z
## 1        -0.4730148        -0.4257827     -0.3081386
## 2        -0.2898817         0.3051249      1.1769884
## 3         0.7623915        -0.4257827      1.1769884
## 4        -0.7521692        -0.0359653      0.4344249
## 5         1.3147607         0.3051249      0.4344249
## 6        -0.6719865         4.0571175     -0.3081386

table(mydata1$ClusterWard)

## 
##   1   2   3   4 
## 130  48  29  87

Those are our initial numbers of units in each of the four clusters, done with hierarchical clustering.

table(mydata1$ClusterK_Means)

## 
##   1   2   3   4 
## 149  54  19  72

Those are our final numbers on units in each of the four clusters, which was done by K-Means. We see some of the reclassifications.

table(mydata1$ClusterWard, mydata1$ClusterK_Means)

##    
##       1   2   3   4
##   1 130   0   0   0
##   2   5  43   0   0
##   3   7   3  19   0
##   4   7   8   0  72

The second row: The second group initially had 47 units (by hierarchical clustering), however, after the K-Means, 5 units were reclassified into the 1st group, and the remaining 43 units stayed in the 2nd group.
The first column: In the first group we have 149 units (130 + 5 + 7 + 7), after the K-Means clustering. Out of the 149 units, 130 were in the first group initially, 5 of the units were reclassified from the group 2, 7 of the units were reclassified from the group 3, and 7 cam from the fourth group.

The explanation looks the same for all other rows and columns, except that when looking at the third and fourth column, there was no reclassification happening.

Centroids <- K_MEANS$centers
Centroids

##   Flight.Distance_z Departure.Delay_z Seat.Comfort_z
## 1        -0.4972643        -0.2694633      0.5191469
## 2         1.6547944        -0.2010963      0.4069226
## 3         0.2463884         3.1825929      0.1608489
## 4        -0.2770541        -0.1313894     -1.4219838

Those values will be used for a better graphical visualization of the differences between the groups

library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$Id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z"))

Figure$Groups <- factor(Figure$Id, 
                        levels = c(1, 2, 3, 4), 
                        labels = c("1", "2", "3", "4"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z"), 
                            labels = c("Flight.Distance_z", "Departure.Delay_z", "Seat.Comfort_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 4) +
  geom_line(aes(group = Id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-1.5, 3.5)

Group 1: This groups consists of passengers that fly the shortest distances. The value of the departure delays is the lowest, which means they had the least minutes of departure delays, in comparison with other groups. However, they valued their seat comfort the highest among all the groups.
Group 2: This group of passengers take the longest distances when flying. Their experience with departure delay is kind of average, and they are more satisfied with their seat comfortableness than group 4 and 3.
Group 3: The departure delays were far above average for this group (it in an unpleasant thing). This group tends to fly shorter distances (slightly below average) and they find the comfortableness of their seat slightly worse than the average.
Group 4: The group’s values are all below the average - however, the value of departure delay being below the average is a good thing, because it means that this group had a better experience than group 2 and 3 because there were less delays. This group was the most dissatisfied with the comfortableness of their seats. In this group, the passengers are flying shorter distances of flights (below average).
Groups 1, 3 and 4 fly shorter distances (below average) than group 2. Group 2 flies the longest distance of flights, in comparison with other groups here.

fit <- aov(cbind(Flight.Distance_z, Departure.Delay_z, Seat.Comfort_z) ~ as.factor(ClusterK_Means), 
                 data = mydata1)

summary(fit)

##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 191.39  63.798  182.09 < 2.2e-16 ***
## Residuals                 290 101.61   0.350                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 206.695  68.898  231.51 < 2.2e-16 ***
## Residuals                 290  86.305   0.298                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   3 195.177  65.059  192.87 < 2.2e-16 ***
## Residuals                 290  97.823   0.337                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA:

H0: All means are equal.

H1: At least one of the means differs.

As we can see from the p-value, in all three tests, we can reject the null hypothesis. It means that the means differ among our cluster groups (p < 0.001).

Validation

aggregate(mydata1$Age,                     #Validation with age
          by = list(mydata1$ClusterK_Means), 
          FUN = "mean")

##   Group.1        x
## 1       1 41.58389
## 2       2 41.50000
## 3       3 46.36842
## 4       4 35.19444

Here we see the mean of each group. With the following ANOVa test, we can see if those values have a statistical effect on the groups, which were classified with the help of K-Means.

fit <- aov(Age ~ as.factor(ClusterK_Means), 
           data = mydata1)
summary(fit)

##                            Df Sum Sq Mean Sq F value  Pr(>F)   
## as.factor(ClusterK_Means)   3   2900   966.6   4.452 0.00447 **
## Residuals                 290  62963   217.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The hypotheses:

H0: The means of age are are the same across all clusters.

H1: At least one mean differs.

We reject the null hypothesis (p = 0.004) and conclude that at least one mean differs from others. We found a statistical effect of age on the clusters.

The third group has on average the oldest passengers and the fourth group has the youngest, on average.

chisq_results <- chisq.test(mydata1$GenderF, as.factor(mydata1$GenderF))  #Validation with the Gender

chisq_results

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata1$GenderF and as.factor(mydata1$GenderF)
## X-squared = 290.01, df = 1, p-value < 2.2e-16

The hypotheses:

H0: There is no association between the gender of the passenger is flying in and the clusters made with K-Means.

H1: There is an association between the gender of the passenger is flying in and the clusters made with K-Means.

Based on the data set, we reject the null hypothesis at p < 0.001, and conclude that there is an association between the gender the passenger is flying in and the clusters made with K-Means.

round(chisq_results$expected, 2)

##                as.factor(mydata1$GenderF)
## mydata1$GenderF  Male Female
##          Male   75.51  73.49
##          Female 73.49  71.51

All expected frequencies are above 5, which is great.

chisq_results1 <- chisq.test(mydata1$Type.of.TravelF, as.factor(mydata1$Type.of.TravelF))   #Validation with a Type of Travel

chisq_results1

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mydata1$Type.of.TravelF and as.factor(mydata1$Type.of.TravelF)
## X-squared = 289.52, df = 1, p-value < 2.2e-16

The hypotheses:

H0: There is no association between the passenger’s type of travel and the clusters made with K-Means.

H1: There is an association between the passenger’s type of travel and the clusters made with K-Means.

Based on the data set, we reject the null hypothesis at p < 0.001, and conclude that there is an association between the passenger’s type of travel (purpose) and the clusters made by K-Means. There is a statistically significant difference in types between the groups.

addmargins(chisq_results1$observed)     #Observed frequencies

##                        
## mydata1$Type.of.TravelF Business Personal Sum
##                Business      196        0 196
##                Personal        0       98  98
##                Sum           196       98 294

round(chisq_results1$expected, 2)     #Expected frequencies

##                        
## mydata1$Type.of.TravelF Business Personal
##                Business   130.67    65.33
##                Personal    65.33    32.67

Conclusion for GROUP 1:

The segment size is the biggest - 149 units, which is 51% (149/294). Passengers in the first group are on average 42 years old.

Nika_Boh_MVA_HW4

Nika Boh

2024-01-29

Validation

Conclusion for GROUP 1: