By pawaneet Kaur (Pauline)
In this document, we will be using the demographic census data of New Zealand (2013) to conduct a multivariate data exploration. I will include:
The process of displaying and cleaning up data
Issues of Multivariate data
Dimensional reduction
Classification and clustering
library(sf) # spatial data
library(tmap) # simple mapping
library(MASS) # multivariate methods
library(scatterplot3d) # 3D plots
library(tidyr) # Tidying up data
library(ggplot2) # ploting
library(dplyr)
My Area of interest in Auckland, New Zealand. I will be unzipping, clipping, and opening the nz-2193-cau-2013.zip file on ArcGIS pro.
sfa <- st_read('Auckland.shp')
sfa <- drop_na(sfa)
sfa.d <- st_drop_geometry(sfa)
Most of the variables display the number of people participating. Plotting and mapping is a way of viewing the data.
tm_shape(sfa) +
tm_polygons(col='ExSmoker') +
tm_legend(legend.outside=T)
I will be cleaning up our dataset.
we have 64 list of columns in our dataset from the 'names()'
names(sfa)
## [1] "OBJECTID" "AU2013" "AUName" "PopUR13" "PopUR06"
## [6] "PopUR01" "Male_UR13" "Female_UR1" "Median_Age" "Age0_4"
## [11] "Age5_9" "Age10_14" "Age15_19" "Age20_24" "Age25_29"
## [16] "Age30_34" "Age35_39" "Age40_44" "Age45_49" "Age50_54"
## [21] "Age55_59" "Age60_64" "Age65plus" "YrUR0" "YrUR1_4"
## [26] "YrUR5_9" "YrUR10_14" "YrUR15_29" "YrUR30plus" "URSame5"
## [31] "UR5inNZ" "UR5notborn" "UR5oversea" "UR5nfa" "NZborn"
## [36] "NotNZborn" "Arr0_9" "Arr10_19" "Arr20_29" "Arr30_39"
## [41] "Arr40_49" "Arr50plus" "European" "Maori" "Pacific"
## [46] "Asian" "MELAA" "Other" "LangEng" "LangMaori"
## [51] "LangSamoan" "LangNZSL" "LangOther" "MaoriDes" "NoMaoriDes"
## [56] "Smoker" "ExSmoker" "NevSmoker" "Married" "ExMarried"
## [61] "NevMarried" "PShip" "Single" "geometry"
I will be focusing on:
smoking status based on race, gender and age
Married status based on age and race.
Arranging the gender, age, race of smokers.
smoker_gender= sfa %>%
select(AUName, PopUR13, Male_UR13,Female_UR1,Smoker) %>%
mutate(male_numbers = Male_UR13/PopUR13*Smoker,female_numbers = Female_UR1/PopUR13*Smoker,male_percentage = Male_UR13/PopUR13,female_percentage = Female_UR1/PopUR13) %>%
select(AUName,male_numbers,female_numbers,male_percentage,female_percentage,Smoker)
smoker_race= sfa %>%
select(AUName, European, Maori,Pacific,Asian, Smoker) %>%
mutate(European = European/(European+Maori+Pacific+Asian)*Smoker,
Maori = Maori/(European+Maori+Pacific+Asian)*Smoker,
Pacific = Pacific/(European+Maori+Pacific+Asian) *Smoker,
Asian = Asian/(European+Maori+Pacific+Asian) *Smoker) %>%
select(AUName,European,Maori,Pacific,Asian,Smoker)
smoker_YrUR=sfa %>%
select(AUName, YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Smoker) %>%
mutate(YrUR0 = YrUR0/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker,
YrUR1_4_9 = (YrUR1_4 +YrUR5_9)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker,
YrUR10_14_29 = (YrUR10_14+YrUR15_29)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus) *Smoker,
YrUR30plus = YrUR30plus/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus) *Smoker) %>%
select(AUName,YrUR0,YrUR1_4_9,YrUR10_14_29,YrUR30plus,Smoker)
Selecting marriage status based on age, race, and gender.
marrried_age= sfa %>%
select(AUName, Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54, Age55_59,Age60_64,Age65plus,Married) %>%
mutate( Age0_14 = (Age0_4 + Age5_9 + Age10_14) /(a= Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus)*Married,
Age15_19= Age15_19/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married,
Age20_39 = (Age20_24 + Age25_29 + Age30_34 + Age35_39)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married,
Age40_59 = (Age40_44 + Age45_49+ Age50_54 + Age55_59)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married,
Age60plus=(Age60_64+ Age65plus)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married) %>%
select(AUName,Married,Age0_14,Age15_19,Age20_39,Age40_59,Age60plus)
married_race= sfa %>%
select(AUName, Married, European, Maori,Pacific,Asian) %>%
mutate(European = European/(European+Maori+Pacific+Asian)*Married,
Maori = Maori/(European+Maori+Pacific+Asian)*Married,
Pacific = Pacific/(European+Maori+Pacific+Asian) *Married,
Asian = Asian/(European+Maori+Pacific+Asian) *Married) %>%
select(AUName, Married, European, Maori,Pacific,Asian)
single_gender= sfa %>%
select(AUName, Single, Male_UR13, Female_UR1) %>%
mutate(male = Male_UR13/(Male_UR13+ Female_UR1)*Single,
female = Female_UR1/(Male_UR13+ Female_UR1)*Single) %>%
select(AUName, Single, male, female)
Displaying smokers and marriage status in assending order
cau.sorted <- smoker_gender %>%
arrange(desc(Smoker))
head(st_drop_geometry(cau.sorted), n=10)
## AUName male_numbers female_numbers male_percentage
## 1 Auckland Central West 698.7954 600.2046 0.5379487
## 2 Papakura East 536.1179 582.8821 0.4791045
## 3 Pukekohe North 506.3188 540.3305 0.4835901
## 4 Auckland Central East 500.5849 489.4151 0.5056413
## 5 Waiheke Island 481.3001 508.6999 0.4861617
## 6 Homai East 464.3596 501.1685 0.4807035
## 7 Leabank 406.1761 439.3432 0.4801136
## 8 Wattle Farm 408.9343 437.0657 0.4833739
## 9 Otahuhu West 411.7742 383.2258 0.5179550
## 10 Glen Eden East 382.1772 406.8228 0.4843817
## female_percentage Smoker
## 1 0.4620513 1299
## 2 0.5208955 1119
## 3 0.5160750 1047
## 4 0.4943587 990
## 5 0.5138383 990
## 6 0.5188080 966
## 7 0.5193182 846
## 8 0.5166261 846
## 9 0.4820450 795
## 10 0.5156183 789
# the desc(Smoker) sorts information in descending order
# you can sort multiple variables by listing them arrange()
smoker_race.sorted <- smoker_race %>%
arrange(desc(Smoker))
head(st_drop_geometry(smoker_race.sorted), n=10)
## AUName European Maori Pacific Asian Smoker
## 1 Auckland Central West 413.4209 85.56326 59.03723 1192.4175 1299
## 2 Papakura East 397.4114 593.08499 542.41133 247.3515 1119
## 3 Pukekohe North 629.1512 519.96675 373.07943 290.6681 1047
## 4 Auckland Central East 378.2736 58.73895 35.15415 903.6467 990
## 5 Waiheke Island 828.6719 398.59003 132.21170 154.8148 990
## 6 Homai East 244.6733 326.73670 571.83020 441.6676 966
## 7 Leabank 239.9361 327.17760 476.68353 360.2184 846
## 8 Wattle Farm 461.2875 335.45651 413.07679 304.6963 846
## 9 Otahuhu West 143.2134 147.10481 485.48707 470.5243 795
## 10 Glen Eden East 421.7357 163.58221 282.24599 488.8515 789
smoker_YrUR.sorted <- smoker_YrUR %>%
arrange(desc(Smoker))
head(st_drop_geometry(smoker_YrUR.sorted), n=10)
## AUName YrUR0 YrUR1_4_9 YrUR10_14_29 YrUR30plus Smoker
## 1 Auckland Central West 714.8784 1048.5244 72.63440 6.472373 1299
## 2 Papakura East 310.8688 759.8223 220.59356 53.922870 1119
## 3 Pukekohe North 246.9926 803.4557 174.96177 28.345762 1047
## 4 Auckland Central East 586.2883 812.6697 39.72742 1.393945 990
## 5 Waiheke Island 238.4841 625.6522 293.44315 31.895994 990
## 6 Homai East 218.6484 649.8004 218.51694 51.115073 966
## 7 Leabank 207.5194 524.6880 223.44128 51.870298 846
## 8 Wattle Farm 160.7979 578.7553 219.58787 24.350338 846
## 9 Otahuhu West 201.9311 542.1833 184.91449 20.933716 795
## 10 Glen Eden East 163.2414 519.4147 200.20068 44.284021 789
marriage age
marrried_age.sorted <- marrried_age %>%
arrange(desc(Married))
head(st_drop_geometry(marrried_age.sorted), n=10)
## AUName Married Age0_14 Age15_19 Age20_39 Age40_59 Age60plus
## 1 Orewa 3822 470.8553 185.6515 507.3175 897.1364 1848.1840
## 2 Greenhithe 3585 875.8226 284.7062 837.4224 1252.1204 450.8168
## 3 Sturges North 2967 663.5871 213.1135 813.2823 867.8561 499.1503
## 4 Baverstock Oaks 2958 675.5521 229.5565 953.3003 880.6026 312.7374
## 5 Hillsborough West 2928 485.8063 244.0997 956.1108 883.0483 474.9062
## 6 Beachlands-Maraetai 2904 716.6717 191.7129 601.6840 973.3910 498.7292
## 7 Wattle Farm 2757 666.3309 178.8808 733.0628 728.4010 531.4414
## 8 Manly 2718 521.4478 190.4093 515.1881 814.9576 761.7517
## 9 Kingseat 2676 555.3475 189.8514 510.0902 971.7285 530.2784
## 10 Pukekohe North 2634 736.1783 194.1957 726.3042 682.5398 388.2934
Marriage race
married_race.sorted <- married_race %>%
arrange(desc(Married))
head(st_drop_geometry(married_race.sorted), n=10)
## AUName Married European Maori Pacific Asian
## 1 Orewa 3822 3324.1149 391.9899 191.0082 392.3725
## 2 Greenhithe 3585 2918.5329 267.0984 95.2498 900.7741
## 3 Sturges North 2967 1447.7222 312.3954 333.8489 1614.3233
## 4 Baverstock Oaks 2958 849.8339 144.1186 173.2153 2316.5331
## 5 Hillsborough West 2928 907.0963 147.5652 295.1622 2204.1151
## 6 Beachlands-Maraetai 2904 2582.4455 454.6369 124.7194 93.3390
## 7 Wattle Farm 2757 1503.2737 877.1104 869.3885 477.6458
## 8 Manly 2718 2391.5973 415.5265 140.8955 136.5997
## 9 Kingseat 2676 2227.0491 514.5452 182.1329 184.0975
## 10 Pukekohe North 2634 1582.7930 1071.5251 602.1549 401.1196
Single gender
single_gender.sorted <- single_gender %>%
arrange(desc(Single))
head(st_drop_geometry(single_gender.sorted), n=10)
## AUName Single male female
## 1 Auckland Central West 5880 3163.1385 2716.862
## 2 Auckland Central East 5382 2721.3616 2660.638
## 3 Orewa 2709 1203.3643 1505.636
## 4 Waiheke Island 2463 1197.4162 1265.584
## 5 Hillsborough West 2409 1195.6434 1213.357
## 6 Pukekohe North 2364 1143.5899 1220.410
## 7 Glen Eden East 2232 1081.1399 1150.860
## 8 Akarana 2058 1012.3675 1045.633
## 9 Ellerslie North 2055 994.0202 1060.980
## 10 Mangere South 2025 981.3527 1043.647
From our analysis, we can see that some datasets have a relationship with one another. Now, we will have to combined the variables of interest into one dataframe in order for us to run multiple or add Weights to the analysis.
married_total=
sfa %>%
select(AUName, PopUR13, Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,Age55_59,Age60_64,Age65plus,European, Maori,Pacific,Asian,Single) %>%
mutate(Age0_14_married = ((Age0_4 + Age5_9 + Age10_14) /(a= Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus)*Married)/Married,
Age15_19_married= (Age15_19/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
Age20_39_married = ((Age20_24 + Age25_29 + Age30_34 + Age35_39)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
Age40_59_married = ((Age40_44 + Age45_49+ Age50_54 + Age55_59)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
Age60plus_married=((Age60_64+ Age65plus)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
European_married = (European/(European+Maori+Pacific+Asian)*Married)/Married,
Maori_married = (Maori/(European+Maori+Pacific+Asian)*Married)/Married,
Pacific_married = (Pacific/(European+Maori+Pacific+Asian) *Married)/Married,
Asian_married = (Asian/(European+Maori+Pacific+Asian) *Married)/Married,
#single_male = (Male_UR13/(Male_UR13+ Female_UR1)*Single)/Single,
#single_female = (Female_UR1/(Male_UR13+ Female_UR1)*Single)/Single
) %>%
select(Age0_14_married,Age15_19_married,Age20_39_married,Age40_59_married,Age60plus_married,European_married, Maori_married,Pacific_married,Asian_married)
head(st_drop_geometry(married_total), n=10)
## Age0_14_married Age15_19_married Age20_39_married Age40_59_married
## 1 0.1658986 0.05990783 0.1716590 0.2361751
## 2 0.2256604 0.07169811 0.1811321 0.3049057
## 3 0.2273263 0.07538280 0.2473498 0.2932862
## 4 0.1934605 0.07084469 0.2179837 0.3433243
## 5 0.1864407 0.03389831 0.1355932 0.2881356
## 6 0.1803279 0.06557377 0.1536885 0.3565574
## 7 0.1926007 0.05440696 0.3220892 0.2763874
## 8 0.2301136 0.05113636 0.2301136 0.2528409
## 9 0.2282609 0.06763285 0.1811594 0.3623188
## 10 0.2105263 0.07218045 0.1879699 0.3609023
## Age60plus_married European_married Maori_married Pacific_married
## 1 0.3663594 0.8834499 0.05244755 0.01864802
## 2 0.2166038 0.8714499 0.06427504 0.02167414
## 3 0.1566549 0.8547297 0.09346847 0.01914414
## 4 0.1743869 0.8963585 0.04201681 0.01120448
## 5 0.3559322 0.9210526 0.03508772 0.03508772
## 6 0.2438525 0.8092243 0.14884696 0.02515723
## 7 0.1545158 0.3563596 0.09100877 0.13815789
## 8 0.2357955 0.6849315 0.22739726 0.05205479
## 9 0.1606280 0.9026217 0.06991261 0.01498127
## 10 0.1684211 0.8437979 0.11026034 0.02297090
## Asian_married
## 1 0.04545455
## 2 0.04260090
## 3 0.03265766
## 4 0.05042017
## 5 0.00877193
## 6 0.01677149
## 7 0.41447368
## 8 0.03561644
## 9 0.01248439
## 10 0.02297090
smoker_total=
sfa %>%
select(AUName, PopUR13, Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54, Age55_59,Age60_64,Age65plus,European, Maori,Pacific,Asian,Single) %>%
mutate(YrUR0_smoker = (YrUR0/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
YrUR1_4_9_smoker = ((YrUR1_4 +YrUR5_9)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
YrUR10_14_29_smoker = ((YrUR10_14+YrUR15_29)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
YrUR30plus_smoker = (YrUR30plus/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
male_smoker = (Male_UR13/PopUR13*Smoker)/Smoker,
female_smoker = (Female_UR1/PopUR13*Smoker)/Smoker,
European_smoker = (European/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
Maori_smoker = (Maori/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
Pacific_smoker = (Pacific/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
Asian_smoker = (Asian/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
single_female = (Female_UR1/(Male_UR13+ Female_UR1)*Single)/Single
) %>%
select(male_smoker,female_smoker,European_smoker, Maori_smoker,Pacific_smoker,Asian_smoker,YrUR0_smoker,YrUR1_4_9_smoker,YrUR10_14_29_smoker,YrUR30plus_smoker,Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,Age55_59,Age60_64,Age65plus,Single)
head(st_drop_geometry(smoker_total), n=10)
## male_smoker female_smoker European_smoker Maori_smoker Pacific_smoker
## 1 0.4643678 0.5356322 0.8834499 0.05244755 0.01864802
## 2 0.4799698 0.5200302 0.8714499 0.06427504 0.02167414
## 3 0.4882353 0.5117647 0.8547297 0.09346847 0.01914414
## 4 0.5054348 0.4918478 0.8963585 0.04201681 0.01120448
## 5 0.5000000 0.5000000 0.9210526 0.03508772 0.03508772
## 6 0.5071575 0.4928425 0.8092243 0.14884696 0.02515723
## 7 0.4689204 0.5310796 0.3563596 0.09100877 0.13815789
## 8 0.4899713 0.5100287 0.6849315 0.22739726 0.05205479
## 9 0.5012107 0.5000000 0.9026217 0.06991261 0.01498127
## 10 0.5022556 0.4992481 0.8437979 0.11026034 0.02297090
## Asian_smoker YrUR0_smoker YrUR1_4_9_smoker YrUR10_14_29_smoker
## 1 0.04545455 0.2261614 0.5317848 0.2151589
## 2 0.04260090 0.2259348 0.5799523 0.1837709
## 3 0.03265766 0.2383292 0.5270270 0.2272727
## 4 0.05042017 0.1735294 0.5323529 0.2735294
## 5 0.00877193 0.2293578 0.4495413 0.2660550
## 6 0.01677149 0.2121212 0.5058275 0.2331002
## 7 0.41447368 0.2342995 0.5277778 0.2089372
## 8 0.03561644 0.2364217 0.6102236 0.1405751
## 9 0.01248439 0.1534392 0.5767196 0.2473545
## 10 0.02297090 0.1281198 0.4891847 0.3044925
## YrUR30plus_smoker Male_UR13 Female_UR1 Smoker European Maori Pacific Asian
## 1 0.026894866 1212 1398 183 2274 135 48 117
## 2 0.010342084 1905 2064 315 3498 258 87 171
## 3 0.007371007 1245 1305 261 2277 249 51 87
## 4 0.020588235 558 543 66 960 45 12 54
## 5 0.055045872 177 177 21 315 12 12 3
## 6 0.048951049 744 723 153 1158 213 36 24
## 7 0.028985507 1290 1461 276 975 249 378 1134
## 8 0.012779553 513 534 189 750 249 57 39
## 9 0.022486772 1242 1239 183 2169 168 36 30
## 10 0.078202995 1002 996 153 1653 216 45 45
## YrUR0 YrUR1_4 YrUR5_9 YrUR10_14 YrUR15_29 YrUR30plus Married Age0_4 Age5_9
## 1 555 813 492 276 252 66 1059 147 141
## 2 852 1236 951 411 282 39 1776 243 321
## 3 582 747 540 330 225 18 933 195 195
## 4 177 318 225 153 126 21 489 51 90
## 5 75 81 66 33 54 18 186 21 21
## 6 273 369 282 165 135 63 555 69 99
## 7 582 732 579 276 243 72 897 225 171
## 8 222 333 240 54 78 12 261 90 96
## 9 348 660 648 282 279 51 990 159 198
## 10 231 459 423 255 294 141 747 108 138
## Age10_14 Age15_19 Age20_24 Age25_29 Age30_34 Age35_39 Age40_44 Age45_49
## 1 144 156 105 93 111 138 171 171
## 2 333 285 147 132 192 249 345 378
## 3 189 192 156 129 174 171 219 192
## 4 72 78 78 63 45 54 93 102
## 5 24 12 9 12 12 15 21 24
## 6 96 96 63 33 48 81 108 147
## 7 135 150 216 234 243 195 216 180
## 8 57 54 51 63 69 60 78 66
## 9 210 168 111 84 93 162 258 261
## 10 174 144 108 75 78 114 186 222
## Age50_54 Age55_59 Age60_64 Age65plus Single
## 1 156 117 132 822 759
## 2 306 183 204 657 810
## 3 192 144 126 273 696
## 4 105 78 75 117 252
## 5 24 33 36 90 60
## 6 153 114 120 237 333
## 7 189 177 144 282 879
## 8 69 54 48 201 321
## 9 216 165 138 261 528
## 10 177 135 96 240 465
married_smoker_total_1=
smoker_total %>%
select(male_smoker,female_smoker,European_smoker, Maori_smoker,Pacific_smoker,Asian_smoker,YrUR0_smoker,YrUR1_4_9_smoker,YrUR10_14_29_smoker,YrUR30plus_smoker,Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,Age55_59,Age60_64,Age65plus,Single) %>%
mutate(Age0_14_married = ((Age0_4 + Age5_9 + Age10_14) /(a= Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus)*Married)/Married,
Age15_19_married= (Age15_19/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
Age20_39_married = ((Age20_24 + Age25_29 + Age30_34 + Age35_39)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
Age40_59_married = ((Age40_44 + Age45_49+ Age50_54 + Age55_59)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
Age60plus_married=((Age60_64+ Age65plus)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
European_married = (European/(European+Maori+Pacific+Asian)*Married)/Married,
Maori_married = (Maori/(European+Maori+Pacific+Asian)*Married)/Married,
Pacific_married = (Pacific/(European+Maori+Pacific+Asian) *Married)/Married,
Asian_married = (Asian/(European+Maori+Pacific+Asian) *Married)/Married,
single_male = (Male_UR13/(Male_UR13+ Female_UR1)*Single)/Single,
single_female = (Female_UR1/(Male_UR13+ Female_UR1)*Single)/Single) %>%
select(Age0_14_married,Age15_19_married,Age20_39_married,Age40_59_married,Age60plus_married,European_married, Maori_married,Pacific_married,Asian_married,single_male,single_female,male_smoker,female_smoker,European_smoker, Maori_smoker,Pacific_smoker,Asian_smoker,YrUR0_smoker,YrUR1_4_9_smoker,YrUR10_14_29_smoker,YrUR30plus_smoker)
head(st_drop_geometry(married_smoker_total_1), n=10)
## Age0_14_married Age15_19_married Age20_39_married Age40_59_married
## 1 0.1658986 0.05990783 0.1716590 0.2361751
## 2 0.2256604 0.07169811 0.1811321 0.3049057
## 3 0.2273263 0.07538280 0.2473498 0.2932862
## 4 0.1934605 0.07084469 0.2179837 0.3433243
## 5 0.1864407 0.03389831 0.1355932 0.2881356
## 6 0.1803279 0.06557377 0.1536885 0.3565574
## 7 0.1926007 0.05440696 0.3220892 0.2763874
## 8 0.2301136 0.05113636 0.2301136 0.2528409
## 9 0.2282609 0.06763285 0.1811594 0.3623188
## 10 0.2105263 0.07218045 0.1879699 0.3609023
## Age60plus_married European_married Maori_married Pacific_married
## 1 0.3663594 0.8834499 0.05244755 0.01864802
## 2 0.2166038 0.8714499 0.06427504 0.02167414
## 3 0.1566549 0.8547297 0.09346847 0.01914414
## 4 0.1743869 0.8963585 0.04201681 0.01120448
## 5 0.3559322 0.9210526 0.03508772 0.03508772
## 6 0.2438525 0.8092243 0.14884696 0.02515723
## 7 0.1545158 0.3563596 0.09100877 0.13815789
## 8 0.2357955 0.6849315 0.22739726 0.05205479
## 9 0.1606280 0.9026217 0.06991261 0.01498127
## 10 0.1684211 0.8437979 0.11026034 0.02297090
## Asian_married single_male single_female male_smoker female_smoker
## 1 0.04545455 0.4643678 0.5356322 0.4643678 0.5356322
## 2 0.04260090 0.4799698 0.5200302 0.4799698 0.5200302
## 3 0.03265766 0.4882353 0.5117647 0.4882353 0.5117647
## 4 0.05042017 0.5068120 0.4931880 0.5054348 0.4918478
## 5 0.00877193 0.5000000 0.5000000 0.5000000 0.5000000
## 6 0.01677149 0.5071575 0.4928425 0.5071575 0.4928425
## 7 0.41447368 0.4689204 0.5310796 0.4689204 0.5310796
## 8 0.03561644 0.4899713 0.5100287 0.4899713 0.5100287
## 9 0.01248439 0.5006046 0.4993954 0.5012107 0.5000000
## 10 0.02297090 0.5015015 0.4984985 0.5022556 0.4992481
## European_smoker Maori_smoker Pacific_smoker Asian_smoker YrUR0_smoker
## 1 0.8834499 0.05244755 0.01864802 0.04545455 0.2261614
## 2 0.8714499 0.06427504 0.02167414 0.04260090 0.2259348
## 3 0.8547297 0.09346847 0.01914414 0.03265766 0.2383292
## 4 0.8963585 0.04201681 0.01120448 0.05042017 0.1735294
## 5 0.9210526 0.03508772 0.03508772 0.00877193 0.2293578
## 6 0.8092243 0.14884696 0.02515723 0.01677149 0.2121212
## 7 0.3563596 0.09100877 0.13815789 0.41447368 0.2342995
## 8 0.6849315 0.22739726 0.05205479 0.03561644 0.2364217
## 9 0.9026217 0.06991261 0.01498127 0.01248439 0.1534392
## 10 0.8437979 0.11026034 0.02297090 0.02297090 0.1281198
## YrUR1_4_9_smoker YrUR10_14_29_smoker YrUR30plus_smoker
## 1 0.5317848 0.2151589 0.026894866
## 2 0.5799523 0.1837709 0.010342084
## 3 0.5270270 0.2272727 0.007371007
## 4 0.5323529 0.2735294 0.020588235
## 5 0.4495413 0.2660550 0.055045872
## 6 0.5058275 0.2331002 0.048951049
## 7 0.5277778 0.2089372 0.028985507
## 8 0.6102236 0.1405751 0.012779553
## 9 0.5767196 0.2473545 0.022486772
## 10 0.4891847 0.3044925 0.078202995
Note: The marrage as this seems unrealistic -- It could prevalent in some cultures or could be as a reslut of peer pressure/negative unfluences.
I will be using the Dimension Reduction Method (DRM) and the Principal Component Analysis (PCA), to linearly map of the information to a reduced dimensional space. It will allow the maximization of the variance of the data in reduced dimensions.
I will be running the DRM and PCA as the variables have similar distributions with each other. This means the variables are non-independent and are likely to be weighted together.
I will run the PCA though princomp
Note: na values and geometry columns needs to be removed as the PCA will not work.
library(sf)
library(tmap)
library(tidyr)
smoker_married_total <- drop_na(married_smoker_total_1) # removing the na values
smoker_married_total.d <- st_drop_geometry(smoker_married_total) # removing the geometry columns
smoker_married_total.pca <- princomp(smoker_married_total.d, cor=TRUE) # the cor=TRUE indcate that there is a correlation.
The summary displays 21 components along with the Standard deviation, Proportion of Variance and Cumulative Proportion. These components are ordered from the highest Proportion of Variance to the lowest. For example, the Proportion of Variance for component 1 is 0.30 (30%), followed by 2.0 for component 2 (etc..).
From this dataset, about 80% is accounted by the first 5 principal components; is not all of the variance but it is a big portion of it.
Therefore, it is effective to select 5 attributes as it is easier to analyse than 21 attributes as we add more of these components there will be extra bits of additional information adding on. It better to add less.
summary(smoker_married_total.pca) # Summarises the PCA of individuals who are smokers and married.
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.5174842 2.0538368 1.9836965 1.33486235 1.24244596
## Proportion of Variance 0.3017965 0.2008688 0.1873834 0.08485036 0.07350819
## Cumulative Proportion 0.3017965 0.5026654 0.6900488 0.77489913 0.84840732
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.95209596 0.83052486 0.7961795 0.64467722 0.56647303
## Proportion of Variance 0.04316603 0.03284626 0.0301858 0.01979089 0.01528056
## Cumulative Proportion 0.89157335 0.92441961 0.9546054 0.97439631 0.98967687
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 0.453030116 0.1072603132 6.690120e-03 3.100448e-08
## Proportion of Variance 0.009773156 0.0005478464 2.131319e-06 4.577513e-17
## Cumulative Proportion 0.999450022 0.9999978687 1.000000e+00 1.000000e+00
## Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## Standard deviation 0 0 0 0 0 0 0
## Proportion of Variance 0 0 0 0 0 0 0
## Cumulative Proportion 1 1 1 1 1 1 1
We can find the principal components of each weighted amount of the original variables by usingloadings.
smoker_married_total.pca$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Age0_14_married 0.233 0.231 0.193 0.262 0.208 0.195
## Age15_19_married 0.171 0.274 -0.186 0.727 -0.347
## Age20_39_married 0.174 -0.327 -0.168 -0.277 -0.172 0.338 -0.137
## Age40_59_married -0.302 0.252 0.271 0.199 -0.169
## Age60plus_married -0.236 0.169 -0.116 -0.396 -0.618 0.290
## European_married -0.369 -0.197 0.125
## Maori_married 0.257 0.154 0.185 -0.327 -0.302 -0.358
## Pacific_married 0.330 0.181 0.102 -0.110 0.119 0.365
## Asian_married 0.111 -0.309 -0.241 0.337 0.140 -0.121 -0.239
## single_male -0.126 -0.256 0.392
## single_female 0.126 0.256 -0.392
## male_smoker -0.120 -0.258 0.392
## female_smoker 0.130 0.252 -0.389
## European_smoker -0.369 -0.197 0.125
## Maori_smoker 0.257 0.154 0.185 -0.327 -0.302 -0.358
## Pacific_smoker 0.330 0.181 0.102 -0.110 0.119 0.365
## Asian_smoker 0.111 -0.309 -0.241 0.337 0.140 -0.121 -0.239
## YrUR0_smoker -0.349 -0.328 -0.287 0.140 0.151
## YrUR1_4_9_smoker -0.142 -0.146 0.698
## YrUR10_14_29_smoker -0.133 0.298 0.147 0.384 -0.115
## YrUR30plus_smoker 0.240 0.170 0.257 -0.296 -0.271 0.164 -0.449
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Age0_14_married 0.734 0.361
## Age15_19_married 0.196 -0.291 -0.235 0.156
## Age20_39_married -0.191 -0.318 -0.194 0.587
## Age40_59_married 0.635 -0.370 0.346
## Age60plus_married 0.147 0.458
## European_married -0.167 -0.253
## Maori_married -0.141 -0.363
## Pacific_married 0.136 -0.181
## Asian_married 0.119 0.120 -0.278
## single_male 0.497 0.416
## single_female -0.497 0.416
## male_smoker -0.746 -0.443
## female_smoker -0.661 0.555
## European_smoker -0.167 0.313
## Maori_smoker -0.141 0.379
## Pacific_smoker 0.136 -0.181
## Asian_smoker 0.119 0.316
## YrUR0_smoker 0.182 0.375 0.183 -0.235 -0.118
## YrUR1_4_9_smoker 0.247 -0.353 -0.275 -0.161
## YrUR10_14_29_smoker -0.572 -0.144 -0.218 -0.110
## YrUR30plus_smoker 0.649
## Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## Age0_14_married 0.164
## Age15_19_married
## Age20_39_married 0.267
## Age40_59_married 0.157
## Age60plus_married 0.208
## European_married 0.541 -0.201 -0.404 -0.157 -0.406
## Maori_married 0.168 0.116 0.241 0.437 -0.303
## Pacific_married 0.522 0.389 -0.399 -0.184
## Asian_married 0.320 -0.354 -0.376 -0.147 -0.265 -0.218
## single_male 0.329 0.190 -0.302 0.257 -0.132
## single_female 0.329 0.190 -0.302 0.257 -0.132
## male_smoker
## female_smoker
## European_smoker -0.567 0.270 -0.183 -0.467
## Maori_smoker -0.163 -0.123 -0.223 -0.525
## Pacific_smoker -0.509 -0.341 0.164 -0.178 -0.427
## Asian_smoker -0.309 0.338 0.419 -0.329
## YrUR0_smoker -0.134 0.518 -0.216
## YrUR1_4_9_smoker 0.354 -0.147
## YrUR10_14_29_smoker -0.124 0.481 -0.200
## YrUR30plus_smoker 0.144
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
## Cumulative Var 0.048 0.095 0.143 0.190 0.238 0.286 0.333 0.381 0.429
## Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048
## Cumulative Var 0.476 0.524 0.571 0.619 0.667 0.714 0.762 0.810
## Comp.18 Comp.19 Comp.20 Comp.21
## SS loadings 1.000 1.000 1.000 1.000
## Proportion Var 0.048 0.048 0.048 0.048
## Cumulative Var 0.857 0.905 0.952 1.000
The table of loadings helps us find the interpretation of individual component from the variable weights.
The components work by multiplying and adding each of attributes within that component. For example, in component 1, Age0_14_married x 0.233 + Age15_19_married x 0.171 + Age20_39_married etc...
For component 1:
The highest weight was both pacific married (0.330) and pacific smokers (0.330).
The lowest weights were European smoker and European married (-0.369).
This means that most smokers are pacific and the lowest is European smokers within that neighborhood.
For component 2:
Single female smokers (0.252- 0.256) had the highest weight.
The low weights were for zero-year smoking residence in New Zealand who are married Asian smokers.
The neighborhoods could possibly have more females.
Here, I will create a biplot
biplot(smoker_married_total.pca, pc.biplot=TRUE, cex=0.8)
A biplot enables us to view which observations (e.g., census territories) score have high or low variable and which variable weight are similar to the components (that point in similar directions) in space by examining which ones are related.
Each of these points relates to the observation of these data that shows which of these observations have high weights on original values.
For example, principal component 1 weight depends highly on Maori smokers who are females while component 2 relies highly on smokers. Component 2 dimension focuses on smoker who has residency while the component emphasis on race, marriage, and smoking status.
Note: It is hard to read but it gives us a rough idea.
Both loadings and biplot allows us to figure out the analysis of principal components Finally, we can take the components from the PCA analysis scores result which will allow me to create a map (shown below)
#install.packages("webshot")
#webshot::install_phantomjs()
smoker_married_total$PC1 <- smoker_married_total.pca$scores[,1] # the [,1] means taking the first principle component of the first column
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(smoker_married_total) +
tm_polygons(col='PC1') +
tm_legend(legend.outside=T)
## Variable(s) "PC1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Results:
Component 1 seems to relate to Maori and pacific smokers who are married (component values over 0.2).
The values ascend from dark green (5-10) to red (-10 to -5). The red indicates a significant amount of Pasifika while the opposite for dark green.
There are less Pacific smokers in the urban areas of Auckland (e.g., Auckland city Central district).
Kawau Island had the highest number of pacific smokers.
Areas towards the beach (e.g., Muriwai, Piha) tend to have more Mauri and Pacifica's.
PCA is popular when first looking at data dimensional reduction techniques.
Pacifica's and Maoris tend to be married towards rural areas. Couples tend to be benefited from factors such as low property prices, more space and privacy.
The dimensional reduction methods focus on the variables in a dataset.
The clustering methods focuses on the observations of the variables by breaking the dataset into groups of observations. These groups allow us to find the differences and similarities between the variables in the dataset.
To understand differences and similarities between the variables in the dataset, it is important to comprehend the distance in Euclidean (two dimensional) space. It is the square root of the number of the squared difference of the individual coordinate to upper dimensions.
Note: all attributes can be rescaled. The changes in a single large, valued attribute do not mask out the changes in other variables. A the strongly correlated variables maybe exclude in the analysis. Therefore, clustering is typically used on principal component scores.
Here, the K-means clustering method is used by 1) finding the number of k clusters and the centers of the cluster, 2) Allocate each outcome to the closest cluster center and compute the average center of individual cluster.
km <- kmeans(smoker_married_total.d, 5) # we use the kmans function to our smoker_married_total.d dataset
# the number of clusters is set to 5
clusters <- fitted(km, method='classes') # setting the cluster method to "classes"
smoker_married_total$c5 <- clusters # we can add the new variables to our dataset
tmap_mode('view') # maping it
## tmap mode set to interactive viewing
tm_shape(smoker_married_total) +
tm_polygons(col='c5', palette='cat') + # setting the colour palette to cat
tm_legend(legend.outside=T)
Here, we are using 5 clusters, so we are clustering alike variables to one another - a paartitioning of the areas into clusters.
Before we run the kmeans function, I will need to choose the number of clusters, I picked 5 because it seemed reasonable for the variable in our dataset. We got the cluster assignments viafitted function, changed palette to cat as we are dealing with categorical variable and then we map it.
Thsi method is not determanistic which means I will have diiferent outcomes each time we run it, but it can be quite similar sometimes.
We can improve the quality of this method from the kmeansfunction; to measure the variance within and between cluster.
we can see the the rural areas tend to be grouped together which is shown as yellow with at value of 1. As the clusters approtch towards the center/high/ main part of the urban areas; it changes colours. For example, from the yellow urban cluster, the cluster 3 orange and red cluster 2 aproches to cluster 4 and 5 shown as a light and dark purple colour which somewhat represents high urban areas or even high populated areas. I know this, as the dark and light purple custered areas are at west and city center of auckland district.
K-means does not tell us about the structure of the clusters it creates. The Hierarchical methods tells us how the outcomes have been gathered into the clusters. The algorithm works by 'agglomerative' approach as we work with each observation.
Hierarchical clustering works by:
computing the (multivariate) distances from every set of observations
locating the closest pair/set of observations and combined them into a cluster
Finding the recently generated cluster which will tell us the distances from all the remaining observations.
This method is popular with network data when cluster detection (which is known as community detection).
To do this, we will need to use the hclust function
hc <- hclust(dist(smoker_married_total.d))
We will use the tmap to display the outcome. It is better than using the cluster dendrogram (from 'plot(hc)'). The height of the joined pair clusters and merging branches may make the dendrogram look messy.
plot(hc)
smoker_married_total$hc5 <- cutree(hc, k=5) # can use the cutree function to focus on Hierarchical clustering of 5 clusters.
tm_shape(smoker_married_total) +
tm_polygons(col='hc5', palette='cat') +
tm_legend(legend.outside=T)
There is a clear similarity between the Hierarchical clustering and k-means one. output
Once clusters have been assigned, we can do further analysis by comparing the characteristics of different clusters. For example:
boxplot(smoker_married_total$Pacific_smoker ~ smoker_married_total$hc5, xlab='Cluster', ylab='Pacific_smoker') # here we are ploting the orginal vaiables to the Hierarchical clusters of 5.
Boxplot can interpret the Hierarchical cluster map. For example, both Hierarchical clustering map and boxplot show 5 cluster with high amounts of pacific smokers.
A special thanks to David O'Sullivan for teaching the process and method of running a Multivariate Analysis.