Student: Katja Volk Štefić
mydata <- read.table("./Housing.csv", header=TRUE, sep = ",", dec = ".")
head(mydata)
## price area bedrooms bathrooms stories mainroad guestroom
## 1 13300000 7420 4 2 3 yes no
## 2 12250000 8960 4 4 4 yes no
## 3 12250000 9960 3 2 2 yes no
## 4 12215000 7500 4 2 2 yes no
## 5 11410000 7420 4 1 2 yes yes
## 6 10850000 7500 3 3 1 yes no
## basement hotwaterheating airconditioning parking prefarea
## 1 no no yes 2 yes
## 2 no no yes 3 no
## 3 yes no no 2 yes
## 4 yes no yes 3 yes
## 5 yes no yes 2 no
## 6 yes no yes 2 yes
## furnishingstatus
## 1 furnished
## 2 furnished
## 3 semi-furnished
## 4 furnished
## 5 furnished
## 6 semi-furnished
The unit of observation is one house.
The sample size is 545.
Explanation of data:
Price: The price of the house in euros.
Area: The area or size of the house in square feet.
Bedrooms: The number of bedrooms in the house.
Bathrooms: The number of bathrooms in the house.
Stories: The number of stories or floors in the house.
Mainroad: Categorical variable indicating whether the house is located near the main road or not.
Guestroom: Categorical variable indicating whether the house has a guest room or not.
Basement: Categorical variable indicating whether the house has a basement or not.
Hotwaterheating: Categorical variable indicating whether the house has hot water heating or not.
Airconditioning: Categorical variable indicating whether the house has air conditioning or not.
Parking: The number of parking spaces available with the house.
Prefarea: Categorical variable indicating whether the house is in a preferred area or not.
Furnishingstatus: Categorical variable indicating the furnishing status of the house (unfurnished, semi-furnished or furnished).
The source of the data is Kaggle.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>% mutate(ID = row_number()) #Creating ID
mydata$furnishingstatusF <- factor(mydata$furnishingstatus,
levels = c("semi-furnished", "unfurnished", "furnished"),
labels = c("semi-furnished", "unfurnished", "furnished"))
mydata$mainroadF <- factor(mydata$mainroad,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$guestroomF <- factor(mydata$guestroom,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$basementF <- factor(mydata$basement,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$hotwaterheatingF <- factor(mydata$hotwaterheating,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$airconditioningF <- factor(mydata$airconditioning,
levels = c("yes", "no"),
labels = c("yes", "no"))
mydata$prefareaF <- factor(mydata$prefarea,
levels = c("yes", "no"),
labels = c("yes", "no"))
summary(mydata[ , c(-6, -7, -8, -9, -10, -12, -13, -14)])
## price area bedrooms bathrooms
## Min. : 1750000 Min. : 1650 Min. :1.000 Min. :1.000
## 1st Qu.: 3430000 1st Qu.: 3600 1st Qu.:2.000 1st Qu.:1.000
## Median : 4340000 Median : 4600 Median :3.000 Median :1.000
## Mean : 4766729 Mean : 5151 Mean :2.965 Mean :1.286
## 3rd Qu.: 5740000 3rd Qu.: 6360 3rd Qu.:3.000 3rd Qu.:2.000
## Max. :13300000 Max. :16200 Max. :6.000 Max. :4.000
## stories parking furnishingstatusF mainroadF
## Min. :1.000 Min. :0.0000 semi-furnished:227 yes:468
## 1st Qu.:1.000 1st Qu.:0.0000 unfurnished :178 no : 77
## Median :2.000 Median :0.0000 furnished :140
## Mean :1.806 Mean :0.6936
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.000 Max. :3.0000
## guestroomF basementF hotwaterheatingF airconditioningF prefareaF
## yes: 97 yes:191 yes: 25 yes:172 yes:128
## no :448 no :354 no :520 no :373 no :417
##
##
##
##
Descriptive statistics:
Bedrooms: 1st Qu.: 2.000. At least 25% of houses have 2 bedrooms or fewer and at least 75% of houses have 2 bedrooms or more.
Parking: Max.:3.0000 The maximum number of parking spaces is 3.
Area: Mean: 5151. The average size of a house is 5151 square feet.
mydata$price_z <- scale(mydata$price)
mydata$area_z <- scale(mydata$area)
mydata$bedrooms_z <- scale(mydata$bedrooms)
mydata$stories_z <- scale(mydata$stories) #Standardization of variables
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata[, c("price_z", "area_z", "bedrooms_z", "stories_z")]),
type = "pearson") #Correlation matrix
## price_z area_z bedrooms_z stories_z
## price_z 1.00 0.54 0.37 0.42
## area_z 0.54 1.00 0.15 0.08
## bedrooms_z 0.37 0.15 1.00 0.41
## stories_z 0.42 0.08 0.41 1.00
##
## n= 545
##
##
## P
## price_z area_z bedrooms_z stories_z
## price_z 0e+00 0e+00 0e+00
## area_z 0e+00 4e-04 5e-02
## bedrooms_z 0e+00 4e-04 0e+00
## stories_z 0e+00 5e-02 0e+00
mydata$Dissimilarity <- sqrt(mydata$price_z^2 + mydata$area_z^2 + mydata$bedrooms_z^2 + mydata$stories_z^2) #Creating new variable to find outliers
head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity
## price area bedrooms bathrooms stories mainroad guestroom
## 8 10150000 16200 5 3 2 yes no
## 2 12250000 8960 4 4 4 yes no
## 1 13300000 7420 4 2 3 yes no
## 126 5943000 15600 3 1 1 yes no
## 11 9800000 13200 3 1 2 yes no
## 3 12250000 9960 3 2 2 yes no
## 7 10150000 8580 4 3 4 yes no
## 4 12215000 7500 4 2 2 yes no
## 396 3500000 3600 6 1 2 yes no
## 67 6930000 13200 2 1 1 yes no
## basement hotwaterheating airconditioning parking prefarea
## 8 no no no 0 no
## 2 no no yes 3 no
## 1 no no yes 2 yes
## 126 no no yes 2 no
## 11 yes no yes 2 yes
## 3 yes no no 2 yes
## 7 no no yes 2 yes
## 4 yes no yes 3 yes
## 396 no no no 1 no
## 67 yes yes no 1 no
## furnishingstatus ID furnishingstatusF mainroadF guestroomF
## 8 unfurnished 8 unfurnished yes no
## 2 furnished 2 furnished yes no
## 1 furnished 1 furnished yes no
## 126 semi-furnished 126 semi-furnished yes no
## 11 furnished 11 furnished yes no
## 3 semi-furnished 3 semi-furnished yes no
## 7 semi-furnished 7 semi-furnished yes no
## 4 furnished 4 furnished yes no
## 396 unfurnished 396 unfurnished yes no
## 67 furnished 67 furnished yes no
## basementF hotwaterheatingF airconditioningF prefareaF price_z
## 8 no no no no 2.8780778
## 2 no no yes no 4.0008085
## 1 no no yes yes 4.5621739
## 126 no no yes no 0.6288740
## 11 yes no yes yes 2.6909560
## 3 yes no no yes 4.0008085
## 7 no no yes yes 2.8780778
## 4 yes no yes yes 3.9820963
## 396 no no no no -0.6772361
## 67 yes yes no no 1.1565574
## area_z bedrooms_z stories_z Dissimilarity
## 8 5.0915856 2.75702753 0.2242042 6.469857
## 2 1.7553969 1.40213123 2.5296997 5.239584
## 1 1.0457655 1.40213123 1.3769519 5.076320
## 126 4.8151058 0.04723492 -0.9285436 4.944204
## 11 3.7091869 0.04723492 0.2242042 4.588225
## 3 2.2161964 0.04723492 0.2242042 4.579355
## 7 1.5802930 1.40213123 2.5296997 4.375615
## 4 1.0826295 1.40213123 0.2242042 4.364106
## 396 -0.7144887 4.11192384 0.2242042 4.234068
## 67 3.7091869 -1.30766139 -0.9285436 4.203316
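The outlier screen above can be illustrated on a small example. This is only a sketch on made-up numbers (the data frame `toy` and its values are hypothetical, not taken from Housing.csv): each variable is standardized, and the dissimilarity is the Euclidean distance of a unit from the centroid, which sits at zero on every standardized variable.

```r
# Toy data: hypothetical values, not from Housing.csv
toy <- data.frame(price = c(100, 120, 110, 400),
                  area  = c(50, 55, 52, 60))

# scale() returns a one-column matrix; as.numeric() keeps plain vectors
toy$price_z <- as.numeric(scale(toy$price))
toy$area_z  <- as.numeric(scale(toy$area))

# Euclidean distance of each unit from the centroid (0, 0) in z-space
toy$Dissimilarity <- sqrt(toy$price_z^2 + toy$area_z^2)

# Units sorted from most to least unusual
toy[order(-toy$Dissimilarity), ]
```

In this toy example the fourth unit, which has by far the highest price, ends up at the top of the sorted list.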
library(dplyr)
mydata <- mydata %>%
filter(!ID %in% c(8)) #Removing ID8
mydata$price_z <- scale(mydata$price)
mydata$area_z <- scale(mydata$area)
mydata$bedrooms_z <- scale(mydata$bedrooms)
mydata$stories_z <- scale(mydata$stories) #Standardizing again because we removed ID8
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#Calculating Euclidean distances
distance <- get_dist(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")],
method = "euclidean")
distance2 <- distance^2
fviz_dist(distance2) #Showing dissimilarity matrix
get_clust_tendency(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")], #Hopkins statistics
n = nrow(mydata) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.8720085
##
## $plot
## NULL
The Hopkins statistic is 0.87, which is above 0.5, meaning that the data is clusterable.
library(dplyr)
WARD <- mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")] %>%
get_dist(method = "euclidean") %>% #Selecting distance
hclust(method = "ward.D2") #Selecting algorithm
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 544
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none"
## instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at
## <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning
## was generated.
fviz_dend(WARD,
k = 6,
cex = 0.5,
palette = "jama",
color_labels_by_k = TRUE,
rect = TRUE)
Based on the dendrogram, six clusters appear to be the most appropriate number of groups.
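As a rough illustration of how a dendrogram is read, here is a minimal base R sketch on simulated data (the object `toy` is hypothetical and contains two well separated groups by construction). It uses the same distance and linkage as above; a large jump in the last merge heights suggests where to cut the tree.

```r
set.seed(1)  # reproducible toy data, not the housing sample
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Ward's method on Euclidean distances between standardized units
hc <- hclust(dist(scale(toy), method = "euclidean"), method = "ward.D2")

# Heights of the last five merges; a big jump at the end points to few clusters
tail(hc$height, 5)

# Cutting the tree into two groups
groups <- cutree(hc, k = 2)
table(groups)
```

With real data the jump is usually less clear-cut, so the choice of the number of clusters remains partly a judgment call.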
mydata$ClusterWard <- cutree(WARD,
k = 6) #Cutting the dendrogram
head(mydata)
## price area bedrooms bathrooms stories mainroad guestroom
## 1 13300000 7420 4 2 3 yes no
## 2 12250000 8960 4 4 4 yes no
## 3 12250000 9960 3 2 2 yes no
## 4 12215000 7500 4 2 2 yes no
## 5 11410000 7420 4 1 2 yes yes
## 6 10850000 7500 3 3 1 yes no
## basement hotwaterheating airconditioning parking prefarea
## 1 no no yes 2 yes
## 2 no no yes 3 no
## 3 yes no no 2 yes
## 4 yes no yes 3 yes
## 5 yes no yes 2 no
## 6 yes no yes 2 yes
## furnishingstatus ID furnishingstatusF mainroadF guestroomF
## 1 furnished 1 furnished yes no
## 2 furnished 2 furnished yes no
## 3 semi-furnished 3 semi-furnished yes no
## 4 furnished 4 furnished yes no
## 5 furnished 5 furnished yes yes
## 6 semi-furnished 6 semi-furnished yes no
## basementF hotwaterheatingF airconditioningF prefareaF price_z
## 1 no no yes yes 4.598473
## 2 no no yes no 4.033297
## 3 yes no no yes 4.033297
## 4 yes no yes yes 4.014458
## 5 yes no yes no 3.581156
## 6 yes no yes yes 3.279728
## area_z bedrooms_z stories_z Dissimilarity ClusterWard
## 1 1.080257 1.41585012 1.3761612 5.076320 1
## 2 1.806791 1.41585012 2.5279023 5.239584 1
## 3 2.278567 0.05262452 0.2244201 4.579355 1
## 4 1.117999 1.41585012 0.2244201 4.364106 1
## 5 1.080257 1.41585012 0.2244201 3.965420 1
## 6 1.117999 0.05262452 -0.9273209 3.551634 1
Initial_leaders <- aggregate(mydata[, c("price_z", "area_z", "bedrooms_z", "stories_z")],
by = list(mydata$ClusterWard),
FUN = mean)
#Calculating positions of initial leaders
Initial_leaders
## Group.1 price_z area_z bedrooms_z stories_z
## 1 1 2.1159784 0.7430451 0.703254916 0.19824420
## 2 2 1.0788085 0.5626733 0.365495966 2.11252027
## 3 3 0.2733168 1.2371021 -0.006646161 -0.52671535
## 4 4 -0.1741098 -0.2995748 1.616324468 0.15667066
## 5 5 -0.5446579 -0.7107584 0.052624518 -0.01965744
## 6 6 -0.7029118 -0.4151160 -1.331901480 -0.75635938
library(factoextra)
K_MEANS <- hkmeans(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")],
k = 6,
hc.metric = "euclidean",
hc.method = "ward.D2")
#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarchical clustering
K_MEANS
## Hierarchical K-means clustering with 6 clusters of sizes 35, 63, 87, 69, 166, 124
##
## Cluster means:
## price_z area_z bedrooms_z stories_z
## 1 2.3390592 0.8617579 0.79266127 0.191513249
## 2 1.1001755 0.6180447 0.37720204 2.107425396
## 3 0.2649337 1.3553620 -0.08839882 -0.675791282
## 4 -0.1584563 -0.2716459 1.63317594 0.157652538
## 5 -0.4575413 -0.6600597 0.05262452 0.009335959
## 6 -0.7043712 -0.4733981 -1.33258859 -0.750844488
##
## Clustering vector:
## [1] 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1
## [33] 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 3 2 2 2 2 1 1 1 2 2 2 1 3 1 2 3
## [65] 3 3 3 3 3 5 2 2 2 5 4 2 2 3 2 3 5 3 2 5 2 2 5 3 4 5 3 2 3 2 2 3
## [97] 3 2 2 3 2 2 2 3 2 4 2 4 4 3 2 4 3 3 3 3 4 3 3 3 3 4 4 2 3 3 2 2
## [129] 2 5 2 2 2 3 2 4 4 2 3 2 4 3 4 4 2 3 5 2 4 5 4 4 5 5 3 3 4 3 5 2
## [161] 2 4 3 3 3 3 6 4 4 3 3 3 4 5 3 3 3 3 5 4 3 5 3 5 5 3 3 6 6 4 3 3
## [193] 6 3 4 3 5 5 5 5 5 6 4 5 3 6 5 5 3 5 3 4 4 6 5 3 3 6 3 2 4 3 3 3
## [225] 6 2 6 5 3 6 5 5 5 5 5 6 5 4 5 5 5 5 5 5 5 5 2 6 4 5 5 3 6 4 6 5
## [257] 3 5 6 5 5 6 5 6 5 5 5 4 5 5 4 5 4 4 6 6 3 5 6 6 6 5 4 3 3 5 5 5
## [289] 6 4 5 4 6 4 5 5 5 3 3 5 5 5 5 3 5 5 5 5 4 3 6 5 5 6 6 4 5 5 4 5
## [321] 5 5 5 5 4 4 5 5 5 6 3 4 5 6 6 3 4 6 4 4 6 3 6 6 5 6 5 6 5 6 6 6
## [353] 5 3 3 4 4 6 5 6 3 6 6 5 6 6 6 6 6 6 5 5 6 6 5 5 5 5 5 6 6 5 4 6
## [385] 6 5 5 5 4 5 5 5 3 5 4 6 6 5 6 6 3 6 3 5 5 6 5 6 6 5 5 5 5 6 5 5
## [417] 6 4 4 6 6 6 5 5 6 5 5 6 4 6 4 3 4 4 6 5 5 6 6 4 5 6 5 5 6 6 6 6
## [449] 5 5 6 3 6 5 5 5 5 5 6 3 6 5 6 6 6 5 5 6 6 5 5 4 3 4 6 5 6 5 4 5
## [481] 6 5 5 6 6 6 4 4 5 5 6 5 5 6 5 6 6 6 5 5 5 6 5 6 5 6 6 6 6 5 5 6
## [513] 5 5 5 6 6 6 6 6 6 5 4 6 6 6 6 6 5 5 5 6 5 4 6 4 5 6 6 6 5 6 5 5
##
## Within cluster sum of squares by cluster:
## [1] 89.83247 89.56444 152.40088 96.71577 144.40695 83.35783
## (between_SS / total_SS = 69.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "data" "hclust"
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
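The idea described in the comment above (K-Means started from the centroids of the Ward groups) can be sketched with base R only. This is an illustration under assumptions, not the internals of `hkmeans()`: the data `toy` is simulated and the names `hc_groups`, `leaders` and `km` are made up for this example.

```r
set.seed(1)  # simulated, standardized toy data with two clear groups
toy <- scale(rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
                   matrix(rnorm(40, 3, 0.3), ncol = 2)))

# Step 1: Ward's hierarchical clustering gives a starting partition
hc_groups <- cutree(hclust(dist(toy), method = "ward.D2"), k = 2)

# Step 2: the group means of that partition serve as initial leaders
leaders <- aggregate(toy, by = list(hc_groups), FUN = mean)[, -1]

# Step 3: K-Means starts from those leaders instead of random centers
km <- kmeans(toy, centers = as.matrix(leaders))

# Cross-tabulation shows which units were reclassified by K-Means
table(Ward = hc_groups, KMeans = km$cluster)
```

Because the toy groups are far apart, no unit changes its group here; in the housing data the cross-tabulation further below shows a number of reclassifications.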
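mydata <- mydata %>%
filter(!ID %in% c( 1, 2, 3, 11, 67, 70, 126)) #Removing seven more units by ID
mydata_clu_std <- as.data.frame(scale(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")])) #Re-standardized copy; it is not used below, the clustering reuses the z-scores already stored in mydata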
library(factoextra)
#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarchical clustering
K_MEANS <- hkmeans(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")],
k = 6,
hc.metric = "euclidean",
hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 6 clusters of sizes 34, 63, 87, 66, 163, 124
##
## Cluster means:
## price_z area_z bedrooms_z stories_z
## 1 2.0306881 0.7135213 0.85452193 0.08892119
## 2 1.1001755 0.6180447 0.37720204 2.10742540
## 3 0.2182471 1.2299348 -0.07272956 -0.68902969
## 4 -0.1959932 -0.3348116 1.62239945 0.17206827
## 5 -0.4622489 -0.6773290 0.05262452 0.01950915
## 6 -0.7043712 -0.4733981 -1.33258859 -0.75084449
##
## Clustering vector:
## [1] 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 2 2
## [33] 2 2 2 1 2 2 2 2 2 2 2 1 3 2 2 2 2 1 1 1 2 2 2 1 3 1 2 3 3 3 3 5
## [65] 2 2 2 5 4 2 2 3 2 1 5 3 2 5 2 2 5 3 1 5 3 2 3 2 2 3 3 2 2 3 2 2
## [97] 2 3 2 4 2 4 4 3 2 4 3 3 3 3 4 3 3 3 3 4 1 2 3 2 2 2 3 2 2 2 3 2
## [129] 4 4 2 3 2 4 3 4 4 2 3 5 2 3 5 4 4 5 5 3 3 4 3 5 2 2 4 3 3 3 3 6
## [161] 4 4 3 3 3 4 5 3 3 3 3 5 4 3 5 3 5 5 3 3 6 6 4 3 3 6 3 4 3 5 5 5
## [193] 5 5 6 4 5 3 6 5 5 3 5 3 4 4 6 5 3 3 6 3 2 4 3 3 3 6 2 6 5 3 6 5
## [225] 5 5 5 5 6 5 4 5 5 5 5 5 5 5 5 2 6 4 5 5 3 6 4 6 5 3 5 6 5 5 6 5
## [257] 6 5 5 5 4 5 5 4 5 4 4 6 6 3 5 6 6 6 5 4 3 3 5 5 5 6 4 5 4 6 4 5
## [289] 5 5 3 3 5 5 5 5 3 5 5 5 5 4 3 6 5 5 6 6 4 5 5 4 5 5 5 5 5 4 4 3
## [321] 5 5 6 3 4 5 6 6 3 4 6 4 4 6 3 6 6 5 6 5 6 5 6 6 6 5 3 3 4 4 6 5
## [353] 6 3 6 6 5 6 6 6 6 6 6 5 5 6 6 5 5 5 5 5 6 6 5 4 6 6 5 5 5 4 5 5
## [385] 5 3 5 4 6 6 5 6 6 3 6 3 5 5 6 5 6 6 5 5 5 5 6 5 5 6 4 4 6 6 6 5
## [417] 5 6 5 5 6 4 6 4 3 4 4 6 5 3 6 6 4 5 6 5 5 6 6 6 6 5 5 6 3 6 5 5
## [449] 5 5 5 6 3 6 5 6 6 6 5 5 6 6 5 5 4 3 4 6 5 6 5 4 5 6 5 5 6 6 6 4
## [481] 4 5 5 6 5 5 6 5 6 6 6 5 5 5 6 5 6 5 6 6 6 6 5 5 6 5 5 5 6 6 6 6
## [513] 6 6 5 4 6 6 6 6 6 5 5 5 6 5 4 6 4 5 6 6 6 5 6 5 5
##
## Within cluster sum of squares by cluster:
## [1] 66.42777 89.56444 128.87874 85.35492 138.34875 83.35783
## (between_SS / total_SS = 70.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault" "data" "hclust"
fviz_cluster(K_MEANS,
palette = "jama",
repel = FALSE,
ggtheme = theme_classic())
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("ID", "ClusterWard", "ClusterK_Means")])
## ID ClusterWard ClusterK_Means
## 1 4 1 1
## 2 5 1 1
## 3 6 1 1
## 4 7 1 2
## 5 9 1 1
## 6 10 2 2
table(mydata$ClusterWard) #Checking for reclassifications
##
## 1 2 3 4 5 6
## 40 61 89 68 151 128
table(mydata$ClusterK_Means)
##
## 1 2 3 4 5 6
## 34 63 87 66 163 124
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3 4 5 6
## 1 32 1 3 0 4 0
## 2 0 61 0 0 0 0
## 3 1 1 74 0 13 0
## 4 1 0 1 66 0 0
## 5 0 0 5 0 146 0
## 6 0 0 4 0 0 124
Centroids <- K_MEANS$centers
Centroids
## price_z area_z bedrooms_z stories_z
## 1 2.0306881 0.7135213 0.85452193 0.08892119
## 2 1.1001755 0.6180447 0.37720204 2.10742540
## 3 0.2182471 1.2299348 -0.07272956 -0.68902969
## 4 -0.1959932 -0.3348116 1.62239945 0.17206827
## 5 -0.4622489 -0.6773290 0.05262452 0.01950915
## 6 -0.7043712 -0.4733981 -1.33258859 -0.75084449
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$ID <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(price_z, area_z, bedrooms_z, stories_z))
Figure$Groups <- factor(Figure$ID,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("1", "2", "3", "4", "5", "6"))
Figure$nameFactor <- factor(Figure$name,
levels = c("price_z", "area_z", "bedrooms_z", "stories_z"),
labels = c("price_z", "area_z", "bedrooms_z", "stories_z"))
ggplot(Figure, aes(x = nameFactor, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Groups, col = Groups), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables") +
ylim(-2.5, 2.5)
fit <- aov(cbind(price_z, area_z, bedrooms_z, stories_z) ~ as.factor(ClusterK_Means),
data = mydata)
#Performing ANOVAs.
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 318.87 63.773 212.33 < 2.2e-16 ***
## Residuals 531 159.49 0.300
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 282.13 56.426 160.86 < 2.2e-16 ***
## Residuals 531 186.26 0.351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 428.61 85.721 426.86 < 2.2e-16 ***
## Residuals 531 106.64 0.201
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 5 393.28 78.656 299.29 < 2.2e-16 ***
## Residuals 531 139.55 0.263
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We perform ANOVAs, assuming that normality and homoskedasticity hold.
H0: μG1 = μG2 = μG3 = μG4 = μG5 = μG6. H1: at least one μ is different.
We can reject H0 (p<0.001) for all four cluster variables: for each variable the mean differs between at least two groups, so all cluster variables contribute to separating the groups.
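To make the logic of these tests concrete, the following sketch runs a separate one-way ANOVA for each variable on simulated data. Everything in it is hypothetical (`toy`, `x_z`, `y_z`, `group`): `x_z` is generated with different group means, while `y_z` has no group effect built in, so only the first variable should separate the groups clearly.

```r
set.seed(1)  # simulated data, not the housing sample
toy <- data.frame(
  group = factor(rep(1:3, each = 20)),
  x_z   = c(rnorm(20, -2), rnorm(20, 0), rnorm(20, 2)),  # group means differ
  y_z   = rnorm(60)                                      # no group effect built in
)

# One one-way ANOVA per variable; keep the p-value of the group effect
p_values <- sapply(c("x_z", "y_z"), function(v) {
  fit_v <- aov(reformulate("group", response = v), data = toy)
  summary(fit_v)[[1]][["Pr(>F)"]][1]
})
p_values
```

A variable with a large p-value would be a candidate for removal from the set of cluster variables, since it does not help to distinguish the groups.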
chisq_results <- chisq.test(mydata$airconditioningF, as.factor(mydata$ClusterK_Means))
#Validation
#Pearson Chi2 test
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$airconditioningF and as.factor(mydata$ClusterK_Means)
## X-squared = 96.282, df = 5, p-value < 2.2e-16
addmargins(chisq_results$observed)
##
## mydata$airconditioningF 1 2 3 4 5 6 Sum
## yes 20 48 29 20 32 19 168
## no 14 15 58 46 131 105 369
## Sum 34 63 87 66 163 124 537
round(chisq_results$expected, 2)
##
## mydata$airconditioningF 1 2 3 4 5 6
## yes 10.64 19.71 27.22 20.65 50.99 38.79
## no 23.36 43.29 59.78 45.35 112.01 85.21
round(chisq_results$res, 2)
##
## mydata$airconditioningF 1 2 3 4 5 6
## yes 2.87 6.37 0.34 -0.14 -2.66 -3.18
## no -1.94 -4.30 -0.23 0.10 1.79 2.14
H0: There is no association between the variables. H1: There is an association between the variables.
We can reject H0 (p<0.001) and conclude that there is an association between group membership and having air conditioning.
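The expected counts and residuals printed above can be reproduced by hand from the observed air-conditioning table. The sketch below copies the observed counts from that table; the object names are made up for this example. It also prints the two-sided standard normal cut-offs that are commonly compared with the residuals, which is an approximate rule of thumb and is assumed here to be the rule behind the α levels quoted in the group descriptions.

```r
# Observed counts copied from the air-conditioning table above
observed <- matrix(c(20, 48, 29, 20,  32,  19,
                     14, 15, 58, 46, 131, 105),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(airconditioning = c("yes", "no"),
                                   cluster = as.character(1:6)))

# Expected count = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual = (observed - expected) / sqrt(expected)
pearson_res <- (observed - expected) / sqrt(expected)
round(pearson_res, 2)

# The test statistic is the sum of the squared Pearson residuals
chi_sq <- sum(pearson_res^2)
chi_sq

# Two-sided standard normal cut-offs for alpha = 0.05, 0.01 and 0.001
cutoffs <- round(qnorm(1 - c(0.05, 0.01, 0.001) / 2), 2)
cutoffs  # 1.96 2.58 3.29
```

For example, the residual for houses with air conditioning in group 1 is 2.87, which exceeds 2.58 but not 3.29, hence the α=0.01 level.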
chisq_results <- chisq.test(mydata$guestroomF, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$guestroomF and as.factor(mydata$ClusterK_Means)
## X-squared = 40.849, df = 5, p-value = 1.007e-07
addmargins(chisq_results$observed)
##
## mydata$guestroomF 1 2 3 4 5 6 Sum
## yes 13 16 30 9 15 14 97
## no 21 47 57 57 148 110 440
## Sum 34 63 87 66 163 124 537
round(chisq_results$expected, 2)
##
## mydata$guestroomF 1 2 3 4 5 6
## yes 6.14 11.38 15.72 11.92 29.44 22.4
## no 27.86 51.62 71.28 54.08 133.56 101.6
round(chisq_results$res, 2)
##
## mydata$guestroomF 1 2 3 4 5 6
## yes 2.77 1.37 3.60 -0.85 -2.66 -1.77
## no -1.30 -0.64 -1.69 0.40 1.25 0.83
H0: There is no association between the variables. H1: There is an association between the variables.
We can reject H0 (p<0.001) and conclude that there is an association between group membership and having a guest room.
For hierarchical clustering, Ward’s algorithm was used, and based on the dendrogram it was decided to classify the houses into six groups. The classification was then further optimized using K-Means clustering.
Group 1 (6.33%) contains houses that are above average in price, area (size of the house) and number of bedrooms, and only slightly above average in number of stories. Group 1 has more houses with air conditioning than expected (α=0.01) and more houses with a guest room than expected (α=0.01).
Group 2 (11.73%) contains houses that are above average on all cluster variables (price, area, number of bedrooms and number of stories), most markedly in number of stories. Group 2 has more houses with air conditioning than expected (α=0.001) and fewer houses without air conditioning than expected (α=0.001). It also has more houses with a guest room than expected, but this difference is not statistically significant.
Group 3 (16.20%) contains houses that are above average in price and area (size of the house) and below average in number of bedrooms and stories. Group 3 has more houses with a guest room than expected (α=0.001). The number of houses with air conditioning is only marginally above the expected number, and this difference is not statistically significant.
Group 4 (12.29%) contains houses that are below average in price and area (size of the house) and above average in number of bedrooms and stories. The numbers of houses with air conditioning and with a guest room are both slightly below the expected numbers, but neither difference is statistically significant.
Group 5 (30.35%) contains houses that are below average in price and area (size of the house) and very close to average (marginally above) in number of bedrooms and stories. Group 5 has fewer houses with air conditioning than expected (α=0.01) and fewer houses with a guest room than expected (α=0.01).
Group 6 (23.09%) contains houses that are below average on all cluster variables (price, area, number of bedrooms and number of stories). Group 6 has fewer houses with air conditioning than expected (α=0.01) and more houses without air conditioning than expected (α=0.05). It also has fewer houses with a guest room than expected, but this difference is not statistically significant.
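The group shares quoted in the descriptions follow directly from the K-Means cluster sizes reported earlier (34, 63, 87, 66, 163 and 124 houses out of 537). A short check, with variable names chosen only for this example:

```r
# K-Means cluster sizes as reported in the output above
sizes <- c(34, 63, 87, 66, 163, 124)

# Share of each group in percent, rounded to two decimals
shares <- round(100 * sizes / sum(sizes), 2)
names(shares) <- paste("Group", 1:6)
shares
```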