This document looks at clustering for house types. In particular, it looks at data from Mount_Pleasant_Real_Estate.csv and HousePrices.csv. The goal is to create clusters of houses based on price and square footage and to see whether the two CSVs show any similarities.
house[1:3,]
## # A tibble: 3 x 24
## ID `List Price` `Duplex?` Bedrooms `Baths - Total` `Baths - Full`
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 115 369900 Yes 3 2.5 2
## 2 117 375000 Yes 3 2.5 2
## 3 5 769900 No 4 3.5 3
## # ... with 18 more variables: `Baths - Half` <dbl>, Stories <dbl>,
## # Subdivision <chr>, `Square Footage` <dbl>, `Year Built` <dbl>,
## # Acreage <dbl>, `New Owned?` <chr>, `House Style` <chr>, `Covered Parking
## # Spots` <dbl>, `Misc Exterior` <chr>, `Has Pool?` <chr>, `Has Dock?` <chr>,
## # `Fenced Yard` <chr>, `Screened Porch?` <chr>, Amenities <chr>, `Golf
## # Course?` <chr>, `Fireplace?` <chr>, `Number of Fireplaces` <dbl>
set.seed(1)
grpHouse <- kmeans(house[,c("List Price","Square Footage")], centers=3, nstart=10)
grpHouse
## K-means clustering with 3 clusters of sizes 16, 148, 81
##
## Cluster means:
## List Price Square Footage
## 1 1534312.5 4942.812
## 2 451724.2 2438.318
## 3 797579.1 3782.148
##
## Clustering vector:
## [1] 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 3 3 3 3 3 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
## [112] 1 1 1 1 2 2 2 2 2 1 2 2 2 2 3 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3
## [223] 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 3
##
## Within cluster sum of squares by cluster:
## [1] 8.951311e+11 1.775633e+12 1.062754e+12
## (between_SS / total_SS = 84.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
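One thing worth checking is feature scale. kmeans() groups rows by Euclidean distance, and List Price is in the hundreds of thousands while Square Footage is in the thousands, so price is likely to dominate the result. The sketch below uses a small made-up data frame (`toy`, with hypothetical `price` and `sqft` columns, not the real file) to compare clustering on raw values against clustering on columns standardized with scale(). It is an illustration of the idea under those assumptions, not the analysis run in this document.

```r
# Hypothetical toy data standing in for house[, c("List Price", "Square Footage")]
set.seed(1)
toy <- data.frame(
  price = c(rnorm(20, 400000, 30000), rnorm(20, 800000, 50000)),
  sqft  = c(rnorm(20, 2400, 300),     rnorm(20, 3800, 300))
)

# Standardize each column to mean 0 and standard deviation 1
# so that both features contribute comparably to the distance
toyScaled <- scale(toy)

grpRaw    <- kmeans(toy,       centers = 2, nstart = 10)
grpScaled <- kmeans(toyScaled, centers = 2, nstart = 10)

# Cross-tabulate the two assignments to see where they disagree
table(raw = grpRaw$cluster, scaled = grpScaled$cluster)
```

If the two assignments differ noticeably on the real data, that would suggest the unscaled clusters are mostly price bands.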
o=order(grpHouse$cluster)
data.frame(house$ID[o],grpHouse$cluster[o])
## house.ID.o. grpHouse.cluster.o.
## 1 135 1
## 2 136 1
## 3 137 1
## 4 138 1
## 5 139 1
## 6 140 1
## 7 142 1
## 8 143 1
## 9 144 1
## 10 150 1
## 11 141 1
## 12 241 1
## 13 240 1
## 14 242 1
## 15 243 1
## 16 244 1
## 17 115 2
## 18 117 2
## 19 62 2
## 20 61 2
## 21 25 2
## 22 27 2
## 23 60 2
## 24 59 2
## 25 32 2
## 26 40 2
## 27 57 2
## 28 58 2
## 29 56 2
## 30 54 2
## 31 52 2
## 32 55 2
## 33 50 2
## 34 49 2
## 35 47 2
## 36 46 2
## 37 45 2
## 38 44 2
## 39 43 2
## 40 42 2
## 41 41 2
## 42 38 2
## 43 37 2
## 44 36 2
## 45 35 2
## 46 33 2
## 47 34 2
## 48 31 2
## 49 30 2
## 50 28 2
## 51 29 2
## 52 24 2
## 53 22 2
## 54 23 2
## 55 19 2
## 56 18 2
## 57 15 2
## 58 13 2
## 59 158 2
## 60 164 2
## 61 163 2
## 62 172 2
## 63 167 2
## 64 176 2
## 65 116 2
## 66 109 2
## 67 114 2
## 68 111 2
## 69 112 2
## 70 118 2
## 71 119 2
## 72 120 2
## 73 121 2
## 74 122 2
## 75 123 2
## 76 124 2
## 77 125 2
## 78 126 2
## 79 127 2
## 80 128 2
## 81 129 2
## 82 130 2
## 83 131 2
## 84 132 2
## 85 133 2
## 86 134 2
## 87 145 2
## 88 146 2
## 89 147 2
## 90 148 2
## 91 149 2
## 92 113 2
## 93 153 2
## 94 154 2
## 95 155 2
## 96 162 2
## 97 165 2
## 98 110 2
## 99 1 2
## 100 3 2
## 101 231 2
## 102 229 2
## 103 8 2
## 104 10 2
## 105 11 2
## 106 228 2
## 107 225 2
## 108 14 2
## 109 226 2
## 110 16 2
## 111 17 2
## 112 223 2
## 113 20 2
## 114 21 2
## 115 222 2
## 116 221 2
## 117 26 2
## 118 218 2
## 119 215 2
## 120 214 2
## 121 213 2
## 122 39 2
## 123 212 2
## 124 211 2
## 125 210 2
## 126 48 2
## 127 205 2
## 128 51 2
## 129 53 2
## 130 203 2
## 131 202 2
## 132 201 2
## 133 200 2
## 134 198 2
## 135 197 2
## 136 63 2
## 137 196 2
## 138 195 2
## 139 192 2
## 140 190 2
## 141 191 2
## 142 193 2
## 143 194 2
## 144 199 2
## 145 204 2
## 146 206 2
## 147 207 2
## 148 208 2
## 149 209 2
## 150 216 2
## 151 217 2
## 152 219 2
## 153 220 2
## 154 224 2
## 155 227 2
## 156 230 2
## 157 232 2
## 158 233 2
## 159 234 2
## 160 235 2
## 161 236 2
## 162 237 2
## 163 238 2
## 164 239 2
## 165 5 3
## 166 9 3
## 167 12 3
## 168 7 3
## 169 6 3
## 170 4 3
## 171 2 3
## 172 64 3
## 173 66 3
## 174 67 3
## 175 69 3
## 176 70 3
## 177 72 3
## 178 73 3
## 179 74 3
## 180 77 3
## 181 80 3
## 182 81 3
## 183 85 3
## 184 87 3
## 185 91 3
## 186 92 3
## 187 93 3
## 188 95 3
## 189 97 3
## 190 98 3
## 191 99 3
## 192 100 3
## 193 101 3
## 194 102 3
## 195 103 3
## 196 104 3
## 197 105 3
## 198 106 3
## 199 107 3
## 200 108 3
## 201 156 3
## 202 96 3
## 203 94 3
## 204 90 3
## 205 88 3
## 206 89 3
## 207 86 3
## 208 84 3
## 209 83 3
## 210 82 3
## 211 79 3
## 212 78 3
## 213 76 3
## 214 75 3
## 215 71 3
## 216 68 3
## 217 65 3
## 218 157 3
## 219 152 3
## 220 151 3
## 221 159 3
## 222 161 3
## 223 166 3
## 224 168 3
## 225 169 3
## 226 171 3
## 227 189 3
## 228 174 3
## 229 175 3
## 230 177 3
## 231 178 3
## 232 179 3
## 233 180 3
## 234 181 3
## 235 182 3
## 236 183 3
## 237 184 3
## 238 185 3
## 239 186 3
## 240 187 3
## 241 188 3
## 242 173 3
## 243 170 3
## 244 160 3
## 245 245 3
plot(house$`List Price`, house$`Square Footage`, type="n", xlim=c(119900,2000000), xlab="List Price", ylab="Square Footage")
text(x=house$`List Price`, y=house$`Square Footage`, labels=house$ID, col=grpHouse$cluster+1)
set.seed(1)
grpHouse <- kmeans(house[,c("List Price","Square Footage")], centers=7, nstart=10)
o=order(grpHouse$cluster)
data.frame(house$ID[o],grpHouse$cluster[o])
## house.ID.o. grpHouse.cluster.o.
## 1 149 1
## 2 153 1
## 3 154 1
## 4 155 1
## 5 165 1
## 6 3 1
## 7 8 1
## 8 10 1
## 9 20 1
## 10 39 1
## 11 51 1
## 12 53 1
## 13 63 1
## 14 2 2
## 15 64 2
## 16 66 2
## 17 67 2
## 18 69 2
## 19 70 2
## 20 72 2
## 21 73 2
## 22 74 2
## 23 77 2
## 24 156 2
## 25 79 2
## 26 78 2
## 27 76 2
## 28 75 2
## 29 71 2
## 30 68 2
## 31 65 2
## 32 157 2
## 33 159 2
## 34 161 2
## 35 166 2
## 36 168 2
## 37 169 2
## 38 171 2
## 39 174 2
## 40 175 2
## 41 173 2
## 42 170 2
## 43 160 2
## 44 5 3
## 45 9 3
## 46 12 3
## 47 7 3
## 48 6 3
## 49 4 3
## 50 80 3
## 51 81 3
## 52 85 3
## 53 87 3
## 54 91 3
## 55 92 3
## 56 93 3
## 57 95 3
## 58 97 3
## 59 98 3
## 60 99 3
## 61 100 3
## 62 101 3
## 63 102 3
## 64 103 3
## 65 104 3
## 66 105 3
## 67 106 3
## 68 107 3
## 69 108 3
## 70 96 3
## 71 94 3
## 72 90 3
## 73 88 3
## 74 89 3
## 75 86 3
## 76 84 3
## 77 83 3
## 78 82 3
## 79 189 3
## 80 177 3
## 81 178 3
## 82 179 3
## 83 180 3
## 84 181 3
## 85 182 3
## 86 183 3
## 87 184 3
## 88 185 3
## 89 186 3
## 90 187 3
## 91 188 3
## 92 25 4
## 93 27 4
## 94 60 4
## 95 59 4
## 96 32 4
## 97 40 4
## 98 57 4
## 99 58 4
## 100 56 4
## 101 54 4
## 102 52 4
## 103 55 4
## 104 50 4
## 105 49 4
## 106 47 4
## 107 46 4
## 108 45 4
## 109 44 4
## 110 43 4
## 111 42 4
## 112 41 4
## 113 38 4
## 114 37 4
## 115 36 4
## 116 35 4
## 117 33 4
## 118 34 4
## 119 31 4
## 120 30 4
## 121 28 4
## 122 29 4
## 123 24 4
## 124 22 4
## 125 23 4
## 126 19 4
## 127 18 4
## 128 15 4
## 129 13 4
## 130 164 4
## 131 163 4
## 132 172 4
## 133 167 4
## 134 176 4
## 135 116 4
## 136 109 4
## 137 114 4
## 138 111 4
## 139 112 4
## 140 118 4
## 141 119 4
## 142 120 4
## 143 121 4
## 144 122 4
## 145 113 4
## 146 110 4
## 147 215 4
## 148 214 4
## 149 213 4
## 150 212 4
## 151 211 4
## 152 210 4
## 153 205 4
## 154 203 4
## 155 202 4
## 156 201 4
## 157 200 4
## 158 198 4
## 159 197 4
## 160 196 4
## 161 195 4
## 162 192 4
## 163 190 4
## 164 191 4
## 165 193 4
## 166 194 4
## 167 199 4
## 168 204 4
## 169 206 4
## 170 207 4
## 171 208 4
## 172 209 4
## 173 216 4
## 174 217 4
## 175 115 5
## 176 117 5
## 177 62 5
## 178 61 5
## 179 158 5
## 180 123 5
## 181 124 5
## 182 125 5
## 183 126 5
## 184 127 5
## 185 128 5
## 186 129 5
## 187 130 5
## 188 131 5
## 189 132 5
## 190 133 5
## 191 134 5
## 192 145 5
## 193 146 5
## 194 147 5
## 195 148 5
## 196 162 5
## 197 1 5
## 198 231 5
## 199 229 5
## 200 11 5
## 201 228 5
## 202 225 5
## 203 14 5
## 204 226 5
## 205 16 5
## 206 17 5
## 207 223 5
## 208 21 5
## 209 222 5
## 210 221 5
## 211 26 5
## 212 218 5
## 213 48 5
## 214 219 5
## 215 220 5
## 216 224 5
## 217 227 5
## 218 230 5
## 219 232 5
## 220 233 5
## 221 234 5
## 222 235 5
## 223 236 5
## 224 237 5
## 225 238 5
## 226 239 5
## 227 135 6
## 228 136 6
## 229 137 6
## 230 138 6
## 231 139 6
## 232 140 6
## 233 241 6
## 234 240 6
## 235 242 6
## 236 142 7
## 237 143 7
## 238 144 7
## 239 150 7
## 240 152 7
## 241 151 7
## 242 141 7
## 243 243 7
## 244 244 7
## 245 245 7
plot(house$`List Price`, house$`Square Footage`, type="n", xlim=c(119900,2000000), xlab="List Price", ylab="Square Footage")
text(x=house$`List Price`, y=house$`Square Footage`, labels=house$ID, col=rainbow(7)[grpHouse$cluster])
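With 245 houses, ID labels drawn with text() can overlap heavily. A possible alternative is to draw small points colored by cluster and mark the cluster centers on top. The sketch below assumes a made-up data frame (`toy`, with hypothetical `price` and `sqft` columns) in place of the real data, and three clusters chosen only for illustration.

```r
# Hypothetical toy data; swap in the real columns to reproduce the plot
set.seed(1)
toy <- data.frame(
  price = c(rnorm(30, 400000, 40000), rnorm(30, 800000, 60000),
            rnorm(10, 1500000, 100000)),
  sqft  = c(rnorm(30, 2400, 300), rnorm(30, 3800, 300),
            rnorm(10, 5000, 400))
)

fit <- kmeans(toy, centers = 3, nstart = 10)

# Small filled points colored by cluster instead of overlapping ID labels
plot(toy$price, toy$sqft, col = fit$cluster + 1, pch = 19, cex = 0.6,
     xlab = "List Price", ylab = "Square Footage")

# Mark each cluster center with a large black cross
points(fit$centers[, "price"], fit$centers[, "sqft"],
       pch = 4, cex = 2, lwd = 2)
```

The ID labels are lost in this version, so it suits a quick look at cluster shape rather than looking up individual houses.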
## HousePrices.csv
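# Recode Brick as a 0/1 indicator (Yes = 1, anything else = 0)
price$Brick <- ifelse(price$Brick == "Yes", 1, 0)
# Recode Neighborhood as a numeric code: East = 0, North = 1, any other value = 2
price$Neighborhood <- ifelse(price$Neighborhood == "East", 0,
                             ifelse(price$Neighborhood == "North", 1, 2))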
price[1:3,]
## # A tibble: 3 x 8
## HomeID Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 114300 1790 2 2 2 0 0
## 2 2 114200 2030 4 2 3 0 0
## 3 3 114800 1740 3 2 1 0 0
set.seed(1)
grpPrice <- kmeans(price[,c("Price","SqFt")], centers=3, nstart=10)
grpHouse  # prints the 7-cluster Mount Pleasant fit from above; the HousePrices fit is stored in grpPrice
## K-means clustering with 7 clusters of sizes 13, 30, 48, 83, 52, 9, 10
##
## Cluster means:
## List Price Square Footage
## 1 202676.9 1226.692
## 2 889835.9 4024.800
## 3 719980.1 3621.208
## 4 529085.0 2780.012
## 5 390506.4 2195.827
## 6 1703555.6 5337.444
## 7 1256678.5 4284.000
##
## Clustering vector:
## [1] 5 5 3 3 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [38] 4 4 4 4 4 4 4 3 3 3 3 2 5 4 4 4 4 4 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 3
## [75] 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6
## [112] 6 7 7 7 5 5 5 5 1 7 4 1 1 1 2 5 1 4 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 7 7
## [149] 7 5 1 5 5 1 1 5 5 5 5 5 5 5 5 1 5 5 5 5 5 4 4 4 1 4 4 4 5 4 1 1 4 4 4 4 4
## [186] 4 1 4 4 4 2 2 2 2 2 2 3 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 2
## [223] 4 4 2 5 5 2 5 5 5 5 5 5 5 5 5 5 5 6 6 6 7 7 7
##
## Within cluster sum of squares by cluster:
## [1] 19574867214 94603319520 117755598618 159035680150 99066014324
## [6] 244996767968 145702188467
## (between_SS / total_SS = 96.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
o=order(grpPrice$cluster)
data.frame(price$HomeID[o],grpPrice$cluster[o])
## price.HomeID.o. grpPrice.cluster.o.
## 1 15 1
## 2 20 1
## 3 30 1
## 4 31 1
## 5 45 1
## 6 61 1
## 7 63 1
## 8 70 1
## 9 71 1
## 10 78 1
## 11 82 1
## 12 83 1
## 13 86 1
## 14 88 1
## 15 95 1
## 16 100 1
## 17 104 1
## 18 117 1
## 19 1 2
## 20 2 2
## 21 3 2
## 22 4 2
## 23 5 2
## 24 6 2
## 25 9 2
## 26 10 2
## 27 12 2
## 28 13 2
## 29 18 2
## 30 19 2
## 31 21 2
## 32 22 2
## 33 23 2
## 34 24 2
## 35 28 2
## 36 29 2
## 37 32 2
## 38 35 2
## 39 36 2
## 40 37 2
## 41 40 2
## 42 41 2
## 43 43 2
## 44 46 2
## 45 48 2
## 46 49 2
## 47 50 2
## 48 52 2
## 49 53 2
## 50 55 2
## 51 62 2
## 52 64 2
## 53 66 2
## 54 69 2
## 55 73 2
## 56 76 2
## 57 79 2
## 58 85 2
## 59 87 2
## 60 90 2
## 61 92 2
## 62 101 2
## 63 102 2
## 64 105 2
## 65 107 2
## 66 109 2
## 67 110 2
## 68 111 2
## 69 112 2
## 70 113 2
## 71 114 2
## 72 115 2
## 73 116 2
## 74 118 2
## 75 120 2
## 76 121 2
## 77 122 2
## 78 124 2
## 79 126 2
## 80 128 2
## 81 7 3
## 82 8 3
## 83 11 3
## 84 14 3
## 85 16 3
## 86 17 3
## 87 25 3
## 88 26 3
## 89 27 3
## 90 33 3
## 91 34 3
## 92 38 3
## 93 39 3
## 94 42 3
## 95 44 3
## 96 47 3
## 97 51 3
## 98 54 3
## 99 56 3
## 100 57 3
## 101 58 3
## 102 59 3
## 103 60 3
## 104 65 3
## 105 67 3
## 106 68 3
## 107 72 3
## 108 74 3
## 109 75 3
## 110 77 3
## 111 80 3
## 112 81 3
## 113 84 3
## 114 89 3
## 115 91 3
## 116 93 3
## 117 94 3
## 118 96 3
## 119 97 3
## 120 98 3
## 121 99 3
## 122 103 3
## 123 106 3
## 124 108 3
## 125 119 3
## 126 123 3
## 127 125 3
## 128 127 3
plot(price$`Price`, price$`SqFt`, type="n", xlim=c(69100,211200), xlab="Price", ylab="SqFt")
text(x=price$`Price`, y=price$`SqFt`, labels=price$HomeID, col=grpPrice$cluster+1)
set.seed(1)
grpPrice <- kmeans(price[,c("Price","SqFt")], centers=7, nstart=10)
o=order(grpPrice$cluster)
data.frame(price$HomeID[o],grpPrice$cluster[o])
## price.HomeID.o. grpPrice.cluster.o.
## 1 30 1
## 2 31 1
## 3 61 1
## 4 82 1
## 5 86 1
## 6 104 1
## 7 117 1
## 8 1 2
## 9 2 2
## 10 3 2
## 11 5 2
## 12 6 2
## 13 9 2
## 14 12 2
## 15 19 2
## 16 21 2
## 17 22 2
## 18 32 2
## 19 35 2
## 20 36 2
## 21 37 2
## 22 49 2
## 23 53 2
## 24 64 2
## 25 66 2
## 26 79 2
## 27 92 2
## 28 102 2
## 29 109 2
## 30 111 2
## 31 112 2
## 32 113 2
## 33 114 2
## 34 118 2
## 35 124 2
## 36 126 2
## 37 4 3
## 38 18 3
## 39 23 3
## 40 29 3
## 41 48 3
## 42 52 3
## 43 55 3
## 44 69 3
## 45 85 3
## 46 105 3
## 47 7 4
## 48 8 4
## 49 16 4
## 50 17 4
## 51 25 4
## 52 26 4
## 53 38 4
## 54 44 4
## 55 51 4
## 56 57 4
## 57 58 4
## 58 60 4
## 59 68 4
## 60 72 4
## 61 75 4
## 62 80 4
## 63 81 4
## 64 84 4
## 65 91 4
## 66 93 4
## 67 94 4
## 68 96 4
## 69 99 4
## 70 106 4
## 71 119 4
## 72 123 4
## 73 125 4
## 74 127 4
## 75 15 5
## 76 20 5
## 77 45 5
## 78 63 5
## 79 70 5
## 80 71 5
## 81 78 5
## 82 83 5
## 83 88 5
## 84 95 5
## 85 100 5
## 86 11 6
## 87 14 6
## 88 27 6
## 89 33 6
## 90 34 6
## 91 39 6
## 92 42 6
## 93 47 6
## 94 54 6
## 95 56 6
## 96 59 6
## 97 65 6
## 98 67 6
## 99 74 6
## 100 77 6
## 101 89 6
## 102 97 6
## 103 98 6
## 104 103 6
## 105 108 6
## 106 115 6
## 107 128 6
## 108 10 7
## 109 13 7
## 110 24 7
## 111 28 7
## 112 40 7
## 113 41 7
## 114 43 7
## 115 46 7
## 116 50 7
## 117 62 7
## 118 73 7
## 119 76 7
## 120 87 7
## 121 90 7
## 122 101 7
## 123 107 7
## 124 110 7
## 125 116 7
## 126 120 7
## 127 121 7
## 128 122 7
plot(price$`Price`, price$`SqFt`, type="n", xlim=c(69100,211200), xlab="Price", ylab="SqFt")
text(x=price$`Price`, y=price$`SqFt`, labels=price$HomeID, col=rainbow(7)[grpPrice$cluster])
The plots for Mount_Pleasant_Real_Estate.csv and HousePrices.csv both show a roughly linear relationship between price and square footage: larger houses tend to be priced higher. As for the clustering, the clusters for Mount_Pleasant_Real_Estate.csv were noticeably more crowded. Some were so dense that the individual house IDs were hard to read, and the clusters looked like bright blobs of color. Because the houses in Mount_Pleasant_Real_Estate.csv sit closer together within each cluster, square footage and price appear to be similar among the houses in a cluster. This suggests that the housing groups in Mount_Pleasant_Real_Estate.csv are more clearly defined and easier to differentiate than those in HousePrices.csv.
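The visual impression of how well defined the clusters are can be backed up with a number. The kmeans() output above reports between_SS / total_SS of 84.3 % for three clusters and 96.3 % for seven clusters on the Mount Pleasant data. Computing the same ratio over a range of k for each dataset would make the comparison more direct. The sketch below does this for a made-up data frame (`toy`, with hypothetical `price` and `sqft` columns); the values it produces say nothing about the real files.

```r
# Hypothetical toy data; replace with the price and square-footage
# columns of either dataset to run the comparison for real
set.seed(1)
toy <- data.frame(
  price = c(rnorm(30, 400000, 40000), rnorm(30, 800000, 60000),
            rnorm(10, 1500000, 100000)),
  sqft  = c(rnorm(30, 2400, 300), rnorm(30, 3800, 300),
            rnorm(10, 5000, 400))
)

# Share of total variance explained by the clustering, for k = 2..8
ks <- 2:8
ratio <- sapply(ks, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  fit$betweenss / fit$totss
})

# A curve that flattens early suggests that extra clusters add little
plot(ks, ratio, type = "b",
     xlab = "Number of clusters k", ylab = "between_SS / total_SS")
```

The ratio generally rises as k grows, so it is more useful for comparing the two datasets at the same k than for picking the largest value.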