Homework_4_MA

Student: Katja Volk Štefić

mydata <- read.table("./Housing.csv", header=TRUE, sep = ",", dec = ".")

head(mydata)

##      price area bedrooms bathrooms stories mainroad guestroom
## 1 13300000 7420        4         2       3      yes        no
## 2 12250000 8960        4         4       4      yes        no
## 3 12250000 9960        3         2       2      yes        no
## 4 12215000 7500        4         2       2      yes        no
## 5 11410000 7420        4         1       2      yes       yes
## 6 10850000 7500        3         3       1      yes        no
##   basement hotwaterheating airconditioning parking prefarea
## 1       no              no             yes       2      yes
## 2       no              no             yes       3       no
## 3      yes              no              no       2      yes
## 4      yes              no             yes       3      yes
## 5      yes              no             yes       2       no
## 6      yes              no             yes       2      yes
##   furnishingstatus
## 1        furnished
## 2        furnished
## 3   semi-furnished
## 4        furnished
## 5        furnished
## 6   semi-furnished

The unit of observation is one house.

The sample size is 545.

Explanation of data:

Price: The price of the house in euros.
Area: The area or size of the house in square feet.
Bedrooms: The number of bedrooms in the house.
Bathrooms: The number of bathrooms in the house.
Stories: The number of stories or floors in the house.
Mainroad: Categorical variable indicating whether the house is located near the main road or not.
Guestroom: Categorical variable indicating whether the house has a guest room or not.
Basement: Categorical variable indicating whether the house has a basement or not.
Hotwaterheating: Categorical variable indicating whether the house has hot water heating or not.
Airconditioning: Categorical variable indicating whether the house has air conditioning or not.
Parking: The number of parking spaces available with the house.
Prefarea: Categorical variable indicating whether the house is in a preferred area or not.
Furnishingstatus: The furnishing status of the house (unfurnished, semi-furnished or furnished).

The source of the data is Kaggle.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mydata <- mydata %>% mutate(ID = row_number()) #Creating ID

mydata$furnishingstatusF <- factor(mydata$furnishingstatus, 
                         levels = c("semi-furnished", "unfurnished", "furnished"), 
                         labels = c("semi-furnished", "unfurnished", "furnished"))
mydata$mainroadF <- factor(mydata$mainroad,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$guestroomF <- factor(mydata$guestroom,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$basementF <- factor(mydata$basement,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$hotwaterheatingF <- factor(mydata$hotwaterheating,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$airconditioningF <- factor(mydata$airconditioning,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))
mydata$prefareaF <- factor(mydata$prefarea,
                           levels = c("yes", "no"),
                           labels = c("yes", "no"))

summary(mydata[ , c(-6, -7, -8, -9, -10, -12, -13, -14)])

##      price               area          bedrooms       bathrooms    
##  Min.   : 1750000   Min.   : 1650   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 3430000   1st Qu.: 3600   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 4340000   Median : 4600   Median :3.000   Median :1.000  
##  Mean   : 4766729   Mean   : 5151   Mean   :2.965   Mean   :1.286  
##  3rd Qu.: 5740000   3rd Qu.: 6360   3rd Qu.:3.000   3rd Qu.:2.000  
##  Max.   :13300000   Max.   :16200   Max.   :6.000   Max.   :4.000  
##     stories         parking            furnishingstatusF mainroadF
##  Min.   :1.000   Min.   :0.0000   semi-furnished:227     yes:468  
##  1st Qu.:1.000   1st Qu.:0.0000   unfurnished   :178     no : 77  
##  Median :2.000   Median :0.0000   furnished     :140              
##  Mean   :1.806   Mean   :0.6936                                   
##  3rd Qu.:2.000   3rd Qu.:1.0000                                   
##  Max.   :4.000   Max.   :3.0000                                   
##  guestroomF basementF hotwaterheatingF airconditioningF prefareaF
##  yes: 97    yes:191   yes: 25          yes:172          yes:128  
##  no :448    no :354   no :520          no :373          no :417  
##                                                                  
##                                                                  
##                                                                  
##

Descriptive statistics:

Bedrooms: 1st Qu.: 2.000 About 25% of houses have 2 bedrooms or fewer and about 75% of houses have 2 bedrooms or more.

Parking: Max.:3.0000 The maximum number of parking spaces is 3.

Area: Mean: 5151 The average size of a house is 5151 square feet.

Research question: Can we make homogeneous groups of houses based on price of the house, size of the house, number of bedrooms and number of stories?

mydata$price_z <- scale(mydata$price)
mydata$area_z   <- scale(mydata$area)
mydata$bedrooms_z <- scale(mydata$bedrooms)
mydata$stories_z <- scale(mydata$stories) #Standardization of variables
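
As a quick check of what `scale()` does in the step above, the sketch below standardizes a small vector by hand and compares the result with `scale()`. The vector `x` is made up for illustration and is not taken from the housing data.

```r
# Toy values for illustration only (not from Housing.csv)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with centering/scaling attributes,
# so convert it to a plain numeric vector before comparing
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
```

After standardization every variable has mean 0 and standard deviation 1, so price (in millions) does not dominate the Euclidean distances over bedrooms or stories.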

library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

rcorr(as.matrix(mydata[, c("price_z", "area_z", "bedrooms_z", "stories_z")]), 
      type = "pearson") #Correlation matrix

##            price_z area_z bedrooms_z stories_z
## price_z       1.00   0.54       0.37      0.42
## area_z        0.54   1.00       0.15      0.08
## bedrooms_z    0.37   0.15       1.00      0.41
## stories_z     0.42   0.08       0.41      1.00
## 
## n= 545 
## 
## 
## P
##            price_z area_z bedrooms_z stories_z
## price_z            0e+00  0e+00      0e+00    
## area_z     0e+00          4e-04      5e-02    
## bedrooms_z 0e+00   4e-04             0e+00    
## stories_z  0e+00   5e-02  0e+00

mydata$Dissimilarity <- sqrt(mydata$price_z^2 + mydata$area_z^2 + mydata$bedrooms_z^2 + mydata$stories_z^2) #Creating new variable to find outliers

head(mydata[order(-mydata$Dissimilarity), ], 10) #10 units with the highest value of dissimilarity

##        price  area bedrooms bathrooms stories mainroad guestroom
## 8   10150000 16200        5         3       2      yes        no
## 2   12250000  8960        4         4       4      yes        no
## 1   13300000  7420        4         2       3      yes        no
## 126  5943000 15600        3         1       1      yes        no
## 11   9800000 13200        3         1       2      yes        no
## 3   12250000  9960        3         2       2      yes        no
## 7   10150000  8580        4         3       4      yes        no
## 4   12215000  7500        4         2       2      yes        no
## 396  3500000  3600        6         1       2      yes        no
## 67   6930000 13200        2         1       1      yes        no
##     basement hotwaterheating airconditioning parking prefarea
## 8         no              no              no       0       no
## 2         no              no             yes       3       no
## 1         no              no             yes       2      yes
## 126       no              no             yes       2       no
## 11       yes              no             yes       2      yes
## 3        yes              no              no       2      yes
## 7         no              no             yes       2      yes
## 4        yes              no             yes       3      yes
## 396       no              no              no       1       no
## 67       yes             yes              no       1       no
##     furnishingstatus  ID furnishingstatusF mainroadF guestroomF
## 8        unfurnished   8       unfurnished       yes         no
## 2          furnished   2         furnished       yes         no
## 1          furnished   1         furnished       yes         no
## 126   semi-furnished 126    semi-furnished       yes         no
## 11         furnished  11         furnished       yes         no
## 3     semi-furnished   3    semi-furnished       yes         no
## 7     semi-furnished   7    semi-furnished       yes         no
## 4          furnished   4         furnished       yes         no
## 396      unfurnished 396       unfurnished       yes         no
## 67         furnished  67         furnished       yes         no
##     basementF hotwaterheatingF airconditioningF prefareaF    price_z
## 8          no               no               no        no  2.8780778
## 2          no               no              yes        no  4.0008085
## 1          no               no              yes       yes  4.5621739
## 126        no               no              yes        no  0.6288740
## 11        yes               no              yes       yes  2.6909560
## 3         yes               no               no       yes  4.0008085
## 7          no               no              yes       yes  2.8780778
## 4         yes               no              yes       yes  3.9820963
## 396        no               no               no        no -0.6772361
## 67        yes              yes               no        no  1.1565574
##         area_z  bedrooms_z  stories_z Dissimilarity
## 8    5.0915856  2.75702753  0.2242042      6.469857
## 2    1.7553969  1.40213123  2.5296997      5.239584
## 1    1.0457655  1.40213123  1.3769519      5.076320
## 126  4.8151058  0.04723492 -0.9285436      4.944204
## 11   3.7091869  0.04723492  0.2242042      4.588225
## 3    2.2161964  0.04723492  0.2242042      4.579355
## 7    1.5802930  1.40213123  2.5296997      4.375615
## 4    1.0826295  1.40213123  0.2242042      4.364106
## 396 -0.7144887  4.11192384  0.2242042      4.234068
## 67   3.7091869 -1.30766139 -0.9285436      4.203316
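
The Dissimilarity variable is the Euclidean distance of a unit from the "average house" (all z-scores equal to 0). A minimal sketch that recomputes it for two units, using the standardized values copied from the output above; the object names are only illustrative:

```r
# Standardized values for ID 8, copied from the output above
z_id8 <- c(price_z = 2.8780778, area_z = 5.0915856,
           bedrooms_z = 2.75702753, stories_z = 0.2242042)

# Euclidean distance from the origin of the standardized space
dissimilarity_id8 <- sqrt(sum(z_id8^2))

# The same computation for several units at once (one row per unit);
# the second row holds the values of ID 396 from the output above
z_matrix <- rbind(id8 = z_id8,
                  id396 = c(-0.6772361, -0.7144887, 4.11192384, 0.2242042))
dissimilarity_all <- sqrt(rowSums(z_matrix^2))

round(dissimilarity_all, 4)
```

ID 8 is far from every other unit mainly because of its area (z-score above 5), which is why it is removed as an outlier in the next step.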

library(dplyr)
mydata <- mydata %>%
  filter(!ID %in% c(8)) #Removing ID8

mydata$price_z <- scale(mydata$price)
mydata$area_z   <- scale(mydata$area)
mydata$bedrooms_z <- scale(mydata$bedrooms)
mydata$stories_z <- scale(mydata$stories) #Standardizing again because we removed ID8

library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#Calculating Euclidean distances
distance <- get_dist(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")], 
                     method = "euclidean")

distance2 <- distance^2

fviz_dist(distance2) #Showing dissimilarity matrix

get_clust_tendency(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")], #Hopkins statistics
                   n = nrow(mydata) - 1, 
                   graph = FALSE)

## $hopkins_stat
## [1] 0.8720085
## 
## $plot
## NULL

The Hopkins statistic is 0.87, which is above 0.5, meaning that the data are clusterable.
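
To illustrate the idea behind the Hopkins statistic, here is a simplified base-R sketch. It is not the factoextra implementation: the function name `hopkins_sketch`, its arguments and the simulated data are assumptions made for this example. It compares nearest-neighbour distances of uniformly generated points with those of sampled real points, so values near 1 suggest clustered data and values near 0.5 suggest no structure.

```r
# Simplified Hopkins-type statistic (hypothetical helper, illustration only)
hopkins_sketch <- function(X, m = floor(nrow(X) / 10)) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Distance from a point p to its nearest neighbour among the rows of pts
  nearest <- function(p, pts) min(sqrt(colSums((t(pts) - p)^2)))

  # m real units, sampled without replacement
  idx <- sample(n, m)
  w <- sapply(idx, function(i) nearest(X[i, ], X[-i, , drop = FALSE]))

  # m artificial points, uniform within the observed range of each variable
  U <- matrix(vapply(seq_len(d),
                     function(j) runif(m, min(X[, j]), max(X[, j])),
                     numeric(m)),
              nrow = m)
  u <- apply(U, 1, nearest, pts = X)

  sum(u) / (sum(u) + sum(w))
}

# Simulated example: two tight, well-separated blobs (not the housing data)
set.seed(42)
clustered <- rbind(matrix(rnorm(200, mean = 0, sd = 0.1), ncol = 2),
                   matrix(rnorm(200, mean = 5, sd = 0.1), ncol = 2))
h_clustered <- hopkins_sketch(clustered, m = 20)
h_clustered
```

For strongly clustered data like this, the artificial points lie much farther from the data than real points lie from each other, which pushes the statistic well above 0.5.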

library(dplyr) 

WARD <- mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")] %>%
  get_dist(method = "euclidean") %>%  #Selecting distance
  hclust(method = "ward.D2") #Selecting algorithm

WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 544

library(factoextra)
fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none"
## instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at
##   <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning
## was generated.

fviz_dend(WARD, 
          k = 6,
          cex = 0.5, 
          palette = "jama",
          color_labels_by_k = TRUE, 
          rect = TRUE)

Based on the dendrogram, the most suitable number of clusters is 6.

mydata$ClusterWard <- cutree(WARD, 
                             k = 6) #Cutting the dendrogram
head(mydata)

##      price area bedrooms bathrooms stories mainroad guestroom
## 1 13300000 7420        4         2       3      yes        no
## 2 12250000 8960        4         4       4      yes        no
## 3 12250000 9960        3         2       2      yes        no
## 4 12215000 7500        4         2       2      yes        no
## 5 11410000 7420        4         1       2      yes       yes
## 6 10850000 7500        3         3       1      yes        no
##   basement hotwaterheating airconditioning parking prefarea
## 1       no              no             yes       2      yes
## 2       no              no             yes       3       no
## 3      yes              no              no       2      yes
## 4      yes              no             yes       3      yes
## 5      yes              no             yes       2       no
## 6      yes              no             yes       2      yes
##   furnishingstatus ID furnishingstatusF mainroadF guestroomF
## 1        furnished  1         furnished       yes         no
## 2        furnished  2         furnished       yes         no
## 3   semi-furnished  3    semi-furnished       yes         no
## 4        furnished  4         furnished       yes         no
## 5        furnished  5         furnished       yes        yes
## 6   semi-furnished  6    semi-furnished       yes         no
##   basementF hotwaterheatingF airconditioningF prefareaF  price_z
## 1        no               no              yes       yes 4.598473
## 2        no               no              yes        no 4.033297
## 3       yes               no               no       yes 4.033297
## 4       yes               no              yes       yes 4.014458
## 5       yes               no              yes        no 3.581156
## 6       yes               no              yes       yes 3.279728
##     area_z bedrooms_z  stories_z Dissimilarity ClusterWard
## 1 1.080257 1.41585012  1.3761612      5.076320           1
## 2 1.806791 1.41585012  2.5279023      5.239584           1
## 3 2.278567 0.05262452  0.2244201      4.579355           1
## 4 1.117999 1.41585012  0.2244201      4.364106           1
## 5 1.080257 1.41585012  0.2244201      3.965420           1
## 6 1.117999 0.05262452 -0.9273209      3.551634           1

Initial_leaders <- aggregate(mydata[, c("price_z", "area_z", "bedrooms_z", "stories_z")], 
                             by = list(mydata$ClusterWard), 
                             FUN = mean) 
#Calculating positions of initial leaders

Initial_leaders

##   Group.1    price_z     area_z   bedrooms_z   stories_z
## 1       1  2.1159784  0.7430451  0.703254916  0.19824420
## 2       2  1.0788085  0.5626733  0.365495966  2.11252027
## 3       3  0.2733168  1.2371021 -0.006646161 -0.52671535
## 4       4 -0.1741098 -0.2995748  1.616324468  0.15667066
## 5       5 -0.5446579 -0.7107584  0.052624518 -0.01965744
## 6       6 -0.7029118 -0.4151160 -1.331901480 -0.75635938

library(factoextra)
K_MEANS <- hkmeans(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")], 
                   k = 6,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")
#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarchical clustering

K_MEANS

## Hierarchical K-means clustering with 6 clusters of sizes 35, 63, 87, 69, 166, 124
## 
## Cluster means:
##      price_z     area_z  bedrooms_z    stories_z
## 1  2.3390592  0.8617579  0.79266127  0.191513249
## 2  1.1001755  0.6180447  0.37720204  2.107425396
## 3  0.2649337  1.3553620 -0.08839882 -0.675791282
## 4 -0.1584563 -0.2716459  1.63317594  0.157652538
## 5 -0.4575413 -0.6600597  0.05262452  0.009335959
## 6 -0.7043712 -0.4733981 -1.33258859 -0.750844488
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1
##  [33] 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 1 3 2 2 2 2 1 1 1 2 2 2 1 3 1 2 3
##  [65] 3 3 3 3 3 5 2 2 2 5 4 2 2 3 2 3 5 3 2 5 2 2 5 3 4 5 3 2 3 2 2 3
##  [97] 3 2 2 3 2 2 2 3 2 4 2 4 4 3 2 4 3 3 3 3 4 3 3 3 3 4 4 2 3 3 2 2
## [129] 2 5 2 2 2 3 2 4 4 2 3 2 4 3 4 4 2 3 5 2 4 5 4 4 5 5 3 3 4 3 5 2
## [161] 2 4 3 3 3 3 6 4 4 3 3 3 4 5 3 3 3 3 5 4 3 5 3 5 5 3 3 6 6 4 3 3
## [193] 6 3 4 3 5 5 5 5 5 6 4 5 3 6 5 5 3 5 3 4 4 6 5 3 3 6 3 2 4 3 3 3
## [225] 6 2 6 5 3 6 5 5 5 5 5 6 5 4 5 5 5 5 5 5 5 5 2 6 4 5 5 3 6 4 6 5
## [257] 3 5 6 5 5 6 5 6 5 5 5 4 5 5 4 5 4 4 6 6 3 5 6 6 6 5 4 3 3 5 5 5
## [289] 6 4 5 4 6 4 5 5 5 3 3 5 5 5 5 3 5 5 5 5 4 3 6 5 5 6 6 4 5 5 4 5
## [321] 5 5 5 5 4 4 5 5 5 6 3 4 5 6 6 3 4 6 4 4 6 3 6 6 5 6 5 6 5 6 6 6
## [353] 5 3 3 4 4 6 5 6 3 6 6 5 6 6 6 6 6 6 5 5 6 6 5 5 5 5 5 6 6 5 4 6
## [385] 6 5 5 5 4 5 5 5 3 5 4 6 6 5 6 6 3 6 3 5 5 6 5 6 6 5 5 5 5 6 5 5
## [417] 6 4 4 6 6 6 5 5 6 5 5 6 4 6 4 3 4 4 6 5 5 6 6 4 5 6 5 5 6 6 6 6
## [449] 5 5 6 3 6 5 5 5 5 5 6 3 6 5 6 6 6 5 5 6 6 5 5 4 3 4 6 5 6 5 4 5
## [481] 6 5 5 6 6 6 4 4 5 5 6 5 5 6 5 6 6 6 5 5 5 6 5 6 5 6 6 6 6 5 5 6
## [513] 5 5 5 6 6 6 6 6 6 5 4 6 6 6 6 6 5 5 5 6 5 4 6 4 5 6 6 6 5 6 5 5
## 
## Within cluster sum of squares by cluster:
## [1]  89.83247  89.56444 152.40088  96.71577 144.40695  83.35783
##  (between_SS / total_SS =  69.8 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"
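
The two-step logic (Ward's method to find initial leaders, then K-Means starting from those leaders) can also be written out by hand in base R. The sketch below does this on simulated toy data and is only meant to show the mechanics. It is an approximation of the idea, not a reproduction of `hkmeans()` internals, and all object names are illustrative.

```r
set.seed(123)
# Simulated toy data: three well-separated blobs (not the housing data)
toy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))
toy_z <- scale(toy)

# Step 1: hierarchical clustering with Ward's method, cut into 3 groups
ward <- hclust(dist(toy_z, method = "euclidean"), method = "ward.D2")
ward_groups <- cutree(ward, k = 3)

# Step 2: initial leaders = centroids of the Ward groups
leaders <- aggregate(as.data.frame(toy_z),
                     by = list(ward_groups),
                     FUN = mean)
init_centers <- as.matrix(leaders[, -1])

# Step 3: K-Means started from the initial leaders
km <- kmeans(toy_z, centers = init_centers)

# Reclassifications between the two solutions
table(Ward = ward_groups, KMeans = km$cluster)
```

Because K-Means starts from the Ward centroids, cluster `i` in the K-Means solution corresponds to Ward group `i`, which is what makes a cross-tabulation of the two solutions easy to read.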

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

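mydata <- mydata %>%
  filter(!ID %in% c( 1, 2, 3, 11, 67, 70, 126)) #Removing 7 more units

mydata_clu_std <- as.data.frame(scale(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")])) #Re-standardized copy; note that hkmeans() below still uses the *_z columns of mydata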

library(factoextra)

#Performing K-Means clustering - initial leaders are chosen as centroids of groups, found with hierarchical clustering
K_MEANS <- hkmeans(mydata[c("price_z", "area_z", "bedrooms_z", "stories_z")], 
                   k = 6,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")

K_MEANS

## Hierarchical K-means clustering with 6 clusters of sizes 34, 63, 87, 66, 163, 124
## 
## Cluster means:
##      price_z     area_z  bedrooms_z   stories_z
## 1  2.0306881  0.7135213  0.85452193  0.08892119
## 2  1.1001755  0.6180447  0.37720204  2.10742540
## 3  0.2182471  1.2299348 -0.07272956 -0.68902969
## 4 -0.1959932 -0.3348116  1.62239945  0.17206827
## 5 -0.4622489 -0.6773290  0.05262452  0.01950915
## 6 -0.7043712 -0.4733981 -1.33258859 -0.75084449
## 
## Clustering vector:
##   [1] 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 1 1 2 2
##  [33] 2 2 2 1 2 2 2 2 2 2 2 1 3 2 2 2 2 1 1 1 2 2 2 1 3 1 2 3 3 3 3 5
##  [65] 2 2 2 5 4 2 2 3 2 1 5 3 2 5 2 2 5 3 1 5 3 2 3 2 2 3 3 2 2 3 2 2
##  [97] 2 3 2 4 2 4 4 3 2 4 3 3 3 3 4 3 3 3 3 4 1 2 3 2 2 2 3 2 2 2 3 2
## [129] 4 4 2 3 2 4 3 4 4 2 3 5 2 3 5 4 4 5 5 3 3 4 3 5 2 2 4 3 3 3 3 6
## [161] 4 4 3 3 3 4 5 3 3 3 3 5 4 3 5 3 5 5 3 3 6 6 4 3 3 6 3 4 3 5 5 5
## [193] 5 5 6 4 5 3 6 5 5 3 5 3 4 4 6 5 3 3 6 3 2 4 3 3 3 6 2 6 5 3 6 5
## [225] 5 5 5 5 6 5 4 5 5 5 5 5 5 5 5 2 6 4 5 5 3 6 4 6 5 3 5 6 5 5 6 5
## [257] 6 5 5 5 4 5 5 4 5 4 4 6 6 3 5 6 6 6 5 4 3 3 5 5 5 6 4 5 4 6 4 5
## [289] 5 5 3 3 5 5 5 5 3 5 5 5 5 4 3 6 5 5 6 6 4 5 5 4 5 5 5 5 5 4 4 3
## [321] 5 5 6 3 4 5 6 6 3 4 6 4 4 6 3 6 6 5 6 5 6 5 6 6 6 5 3 3 4 4 6 5
## [353] 6 3 6 6 5 6 6 6 6 6 6 5 5 6 6 5 5 5 5 5 6 6 5 4 6 6 5 5 5 4 5 5
## [385] 5 3 5 4 6 6 5 6 6 3 6 3 5 5 6 5 6 6 5 5 5 5 6 5 5 6 4 4 6 6 6 5
## [417] 5 6 5 5 6 4 6 4 3 4 4 6 5 3 6 6 4 5 6 5 5 6 6 6 6 5 5 6 3 6 5 5
## [449] 5 5 5 6 3 6 5 6 6 6 5 5 6 6 5 5 4 3 4 6 5 6 5 4 5 6 5 5 6 6 6 4
## [481] 4 5 5 6 5 5 6 5 6 6 6 5 5 5 6 5 6 5 6 6 6 6 5 5 6 5 5 5 6 6 6 6
## [513] 6 6 5 4 6 6 6 6 6 5 5 5 6 5 4 6 4 5 6 6 6 5 6 5 5
## 
## Within cluster sum of squares by cluster:
## [1]  66.42777  89.56444 128.87874  85.35492 138.34875  83.35783
##  (between_SS / total_SS =  70.6 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "data"         "hclust"

fviz_cluster(K_MEANS, 
             palette = "jama", 
             repel = FALSE,
             ggtheme = theme_classic())

mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("ID", "ClusterWard", "ClusterK_Means")])

##   ID ClusterWard ClusterK_Means
## 1  4           1              1
## 2  5           1              1
## 3  6           1              1
## 4  7           1              2
## 5  9           1              1
## 6 10           2              2

table(mydata$ClusterWard) #Checking for reclassifications

## 
##   1   2   3   4   5   6 
##  40  61  89  68 151 128

table(mydata$ClusterK_Means)

## 
##   1   2   3   4   5   6 
##  34  63  87  66 163 124

table(mydata$ClusterWard, mydata$ClusterK_Means)

##    
##       1   2   3   4   5   6
##   1  32   1   3   0   4   0
##   2   0  61   0   0   0   0
##   3   1   1  74   0  13   0
##   4   1   0   1  66   0   0
##   5   0   0   5   0 146   0
##   6   0   0   4   0   0 124
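
The cross-tabulation can be summarized as the number of units that changed group between the Ward and the K-Means solution. A small sketch using the counts copied from the table above; it relies on the assumption that K-Means group `i` corresponds to Ward group `i`, which is plausible here because K-Means was started from the Ward centroids:

```r
# Cross-tabulation copied from the output above
# (rows: Ward clusters, columns: K-Means clusters)
crosstab <- matrix(
  c(32,  1,  3,  0,   4,   0,
     0, 61,  0,  0,   0,   0,
     1,  1, 74,  0,  13,   0,
     1,  0,  1, 66,   0,   0,
     0,  0,  5,  0, 146,   0,
     0,  0,  4,  0,   0, 124),
  nrow = 6, byrow = TRUE,
  dimnames = list(Ward = 1:6, KMeans = 1:6)
)

n_total <- sum(crosstab)        # all units
n_same  <- sum(diag(crosstab))  # units that kept their group
n_moved <- n_total - n_same     # reclassified units
share_moved <- round(100 * n_moved / n_total, 2)

c(total = n_total, same = n_same, moved = n_moved, share_moved = share_moved)
```

Most reclassifications involve Ward group 3, from which 13 units moved to K-Means group 5.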

Centroids <- K_MEANS$centers
Centroids

##      price_z     area_z  bedrooms_z   stories_z
## 1  2.0306881  0.7135213  0.85452193  0.08892119
## 2  1.1001755  0.6180447  0.37720204  2.10742540
## 3  0.2182471  1.2299348 -0.07272956 -0.68902969
## 4 -0.1959932 -0.3348116  1.62239945  0.17206827
## 5 -0.4622489 -0.6773290  0.05262452  0.01950915
## 6 -0.7043712 -0.4733981 -1.33258859 -0.75084449

library(ggplot2)
library(tidyr)

Figure <- as.data.frame(Centroids)
Figure$ID <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c(price_z, area_z, bedrooms_z, stories_z))

Figure$Groups <- factor(Figure$ID, 
                        levels = c(1, 2, 3, 4, 5, 6), 
                        labels = c("1", "2", "3", "4", "5", "6"))

Figure$nameFactor <- factor(Figure$name, 
                            levels = c("price_z", "area_z", "bedrooms_z", "stories_z"), 
                            labels = c("price_z", "area_z", "bedrooms_z", "stories_z"))

ggplot(Figure, aes(x = nameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-2.5, 2.5)

fit <- aov(cbind(price_z, area_z, bedrooms_z, stories_z) ~ as.factor(ClusterK_Means), 
                 data = mydata)
#Performing ANOVAs.
summary(fit)

##  Response 1 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 318.87  63.773  212.33 < 2.2e-16 ***
## Residuals                 531 159.49   0.300                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 282.13  56.426  160.86 < 2.2e-16 ***
## Residuals                 531 186.26   0.351                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 428.61  85.721  426.86 < 2.2e-16 ***
## Residuals                 531 106.64   0.201                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   5 393.28  78.656  299.29 < 2.2e-16 ***
## Residuals                 531 139.55   0.263                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We perform ANOVAs, assuming that normality and homoskedasticity hold.

H0: μG1 = μG2 = μG3 = μG4 = μG5 = μG6 H1: At least one μ is different.

We can reject H0 (p<0.001) for all cluster variables, meaning that the groups differ on every cluster variable, which supports the choice of cluster variables.
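
The F values in the ANOVA tables follow directly from the sums of squares and degrees of freedom. A minimal sketch that recomputes the F value for the first response (price_z) from the numbers printed above; the helper name `f_from_ss` is made up for this example:

```r
# Hypothetical helper: F statistic and p-value from sums of squares
f_from_ss <- function(ss_between, ss_within, df_between, df_within) {
  ms_between <- ss_between / df_between  # mean square between groups
  ms_within  <- ss_within / df_within    # mean square within groups
  f_value <- ms_between / ms_within
  p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
  list(f_value = f_value, p_value = p_value)
}

# Response 1 (price_z): values copied from the ANOVA output above
price_check <- f_from_ss(ss_between = 318.87, ss_within = 159.49,
                         df_between = 5, df_within = 531)
round(price_check$f_value, 2)
```

The degrees of freedom are 6 - 1 = 5 between groups and 537 - 6 = 531 within groups, and the recomputed F agrees with the printed value of 212.33 up to rounding of the sums of squares.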

chisq_results <- chisq.test(mydata$airconditioningF, as.factor(mydata$ClusterK_Means))
#Validation
#Pearson Chi2 test

chisq_results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$airconditioningF and as.factor(mydata$ClusterK_Means)
## X-squared = 96.282, df = 5, p-value < 2.2e-16

addmargins(chisq_results$observed)

##                        
## mydata$airconditioningF   1   2   3   4   5   6 Sum
##                     yes  20  48  29  20  32  19 168
##                     no   14  15  58  46 131 105 369
##                     Sum  34  63  87  66 163 124 537

round(chisq_results$expected, 2)

##                        
## mydata$airconditioningF     1     2     3     4      5     6
##                     yes 10.64 19.71 27.22 20.65  50.99 38.79
##                     no  23.36 43.29 59.78 45.35 112.01 85.21

round(chisq_results$res, 2)

##                        
## mydata$airconditioningF     1     2     3     4     5     6
##                     yes  2.87  6.37  0.34 -0.14 -2.66 -3.18
##                     no  -1.94 -4.30 -0.23  0.10  1.79  2.14

H0: There is no association between the variables. H1: There is an association between the variables.

We can reject H0 (p<0.001), concluding that there is an association between cluster membership and air conditioning.
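
The residuals reported by `chisq.test()` are Pearson residuals, (observed - expected) / sqrt(expected). The sketch below rebuilds the air conditioning table from the observed counts above and flags each cell against the conventional two-sided normal cut-offs (1.96, 2.58, 3.29 for α = 0.05, 0.01, 0.001). Those cut-offs are an assumption of this example, as they are not stated in the output.

```r
# Observed counts copied from the output above
observed <- matrix(
  c(20, 48, 29, 20,  32,  19,
    14, 15, 58, 46, 131, 105),
  nrow = 2, byrow = TRUE,
  dimnames = list(airconditioning = c("yes", "no"), cluster = 1:6)
)

chi <- chisq.test(observed)

# Expected counts and Pearson residuals computed by hand
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
pearson_res <- (observed - expected) / sqrt(expected)

# Label each cell with the strictest alpha at which it is significant
flags <- cut(abs(pearson_res),
             breaks = c(0, 1.96, 2.58, 3.29, Inf),
             labels = c("n.s.", "0.05", "0.01", "0.001"),
             right = FALSE)
flag_matrix <- matrix(as.character(flags), nrow = 2,
                      dimnames = dimnames(observed))

round(pearson_res, 2)
flag_matrix
```

A positive residual means more houses than expected in that cell and a negative residual means fewer, and only cells flagged with an α value should be interpreted as significant deviations.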

chisq_results <- chisq.test(mydata$guestroomF, as.factor(mydata$ClusterK_Means))

chisq_results

## 
##  Pearson's Chi-squared test
## 
## data:  mydata$guestroomF and as.factor(mydata$ClusterK_Means)
## X-squared = 40.849, df = 5, p-value = 1.007e-07

addmargins(chisq_results$observed)

##                  
## mydata$guestroomF   1   2   3   4   5   6 Sum
##               yes  13  16  30   9  15  14  97
##               no   21  47  57  57 148 110 440
##               Sum  34  63  87  66 163 124 537

round(chisq_results$expected, 2)

##                  
## mydata$guestroomF     1     2     3     4      5     6
##               yes  6.14 11.38 15.72 11.92  29.44  22.4
##               no  27.86 51.62 71.28 54.08 133.56 101.6

round(chisq_results$res, 2)

##                  
## mydata$guestroomF     1     2     3     4     5     6
##               yes  2.77  1.37  3.60 -0.85 -2.66 -1.77
##               no  -1.30 -0.64 -1.69  0.40  1.25  0.83

H0: There is no association between the variables. H1: There is an association between the variables.

We can reject H0 (p<0.001), concluding that there is an association between cluster membership and having a guest room.

For hierarchical clustering, Ward's algorithm was used, and based on the analysis of the dendrogram it was decided to classify houses into six groups. The classification was further optimized using K-Means clustering.

Group 1 (6.33%) contains houses that are above average in price, area (size of the house) and number of bedrooms, and slightly above average in number of stories. Group 1 has more houses than expected with air conditioning (α=0.01) and more houses than expected with a guest room (α=0.01).

Group 2 (11.73%) contains houses that are above average on all four cluster variables (price, area, number of bedrooms and number of stories). Group 2 has more houses than expected with air conditioning (α=0.001) and fewer houses than expected without air conditioning (α=0.001). It also has more houses than expected with a guest room, although this difference is not statistically significant.

Group 3 (16.20%) contains houses that are above average in price and area and below average in number of bedrooms and stories. Group 3 has more houses than expected with a guest room (α=0.001). It also has slightly more houses than expected with air conditioning, but this difference is not statistically significant.

Group 4 (12.29%) contains houses that are below average in price and area and above average in number of bedrooms and stories. Group 4 has slightly fewer houses than expected with air conditioning and with a guest room, but neither difference is statistically significant.

Group 5 (30.35%) contains houses that are below average in price and area and very slightly above average in number of bedrooms and stories. Group 5 has fewer houses than expected with air conditioning (α=0.01) and fewer houses than expected with a guest room (α=0.01).

Group 6 (23.09%) contains houses that are below average on all four cluster variables (price, area, number of bedrooms and number of stories). Group 6 has fewer houses than expected with air conditioning (α=0.01) and more houses than expected without air conditioning (α=0.05). It also has fewer houses than expected with a guest room, although this difference is not statistically significant.


2024-01-29
