Multivariate Analysis

By pawaneet Kaur (Pauline)

In this document, we will be using the demographic census data of New Zealand (2013) to conduct a multivariate data exploration. I will include:

The process of displaying and cleaning up data
Issues of Multivariate data
Dimensional reduction
Classification and clustering

Libraries

library(sf) # spatial data
library(tmap) # simple mapping
library(MASS) # multivariate methods
library(scatterplot3d) # 3D plots
library(tidyr) # Tidying up data 
library(ggplot2) # ploting
library(dplyr)

Data

My Area of interest in Auckland, New Zealand. I will be unzipping, clipping, and opening the nz-2193-cau-2013.zip file on ArcGIS pro.

sfa <- st_read('Auckland.shp')

sfa <- drop_na(sfa)
sfa.d <- st_drop_geometry(sfa)

Most of the variables display the number of people participating. Plotting and mapping is a way of viewing the data.

tm_shape(sfa) +
  tm_polygons(col='ExSmoker') +
  tm_legend(legend.outside=T)

Cleaning the Dataset

I will be cleaning up our dataset.

we have 64 list of columns in our dataset from the 'names()'

names(sfa)

##  [1] "OBJECTID"   "AU2013"     "AUName"     "PopUR13"    "PopUR06"   
##  [6] "PopUR01"    "Male_UR13"  "Female_UR1" "Median_Age" "Age0_4"    
## [11] "Age5_9"     "Age10_14"   "Age15_19"   "Age20_24"   "Age25_29"  
## [16] "Age30_34"   "Age35_39"   "Age40_44"   "Age45_49"   "Age50_54"  
## [21] "Age55_59"   "Age60_64"   "Age65plus"  "YrUR0"      "YrUR1_4"   
## [26] "YrUR5_9"    "YrUR10_14"  "YrUR15_29"  "YrUR30plus" "URSame5"   
## [31] "UR5inNZ"    "UR5notborn" "UR5oversea" "UR5nfa"     "NZborn"    
## [36] "NotNZborn"  "Arr0_9"     "Arr10_19"   "Arr20_29"   "Arr30_39"  
## [41] "Arr40_49"   "Arr50plus"  "European"   "Maori"      "Pacific"   
## [46] "Asian"      "MELAA"      "Other"      "LangEng"    "LangMaori" 
## [51] "LangSamoan" "LangNZSL"   "LangOther"  "MaoriDes"   "NoMaoriDes"
## [56] "Smoker"     "ExSmoker"   "NevSmoker"  "Married"    "ExMarried" 
## [61] "NevMarried" "PShip"      "Single"     "geometry"

I will be focusing on:

smoking status based on race, gender and age
Married status based on age and race.

Arranging the gender, age, race of smokers.

smoker_gender= sfa %>% 
  select(AUName, PopUR13, Male_UR13,Female_UR1,Smoker) %>% 
  mutate(male_numbers = Male_UR13/PopUR13*Smoker,female_numbers = Female_UR1/PopUR13*Smoker,male_percentage = Male_UR13/PopUR13,female_percentage = Female_UR1/PopUR13) %>% 
 select(AUName,male_numbers,female_numbers,male_percentage,female_percentage,Smoker)


smoker_race= sfa %>%
  select(AUName, European, Maori,Pacific,Asian, Smoker) %>% 
  mutate(European = European/(European+Maori+Pacific+Asian)*Smoker,
         Maori = Maori/(European+Maori+Pacific+Asian)*Smoker,
         Pacific =  Pacific/(European+Maori+Pacific+Asian) *Smoker,
         Asian = Asian/(European+Maori+Pacific+Asian) *Smoker) %>% 
 select(AUName,European,Maori,Pacific,Asian,Smoker)

  
  
smoker_YrUR=sfa %>%
  select(AUName, YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Smoker) %>% 
  mutate(YrUR0 = YrUR0/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker,
         YrUR1_4_9 = (YrUR1_4 +YrUR5_9)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker,
         YrUR10_14_29 =  (YrUR10_14+YrUR15_29)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus) *Smoker,
         YrUR30plus = YrUR30plus/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus) *Smoker) %>% 
         select(AUName,YrUR0,YrUR1_4_9,YrUR10_14_29,YrUR30plus,Smoker)

Selecting marriage status based on age, race, and gender.

marrried_age= sfa %>% 
  select(AUName, Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,   Age55_59,Age60_64,Age65plus,Married) %>%   

mutate( Age0_14 = (Age0_4 + Age5_9 + Age10_14) /(a= Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus)*Married,
        
        Age15_19=  Age15_19/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married,
        Age20_39 = (Age20_24 + Age25_29 + Age30_34 + Age35_39)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married,
        Age40_59 = (Age40_44 + Age45_49+ Age50_54 + Age55_59)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married,
        Age60plus=(Age60_64+ Age65plus)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married) %>% 
  
 select(AUName,Married,Age0_14,Age15_19,Age20_39,Age40_59,Age60plus)


  
married_race= sfa %>% 
  select(AUName, Married, European, Maori,Pacific,Asian) %>% 
  mutate(European = European/(European+Maori+Pacific+Asian)*Married,
         Maori = Maori/(European+Maori+Pacific+Asian)*Married,
         Pacific =  Pacific/(European+Maori+Pacific+Asian) *Married,
         Asian = Asian/(European+Maori+Pacific+Asian) *Married) %>% 
 select(AUName, Married, European, Maori,Pacific,Asian)



single_gender=  sfa %>%
  select(AUName, Single, Male_UR13, Female_UR1) %>% 
  mutate(male = Male_UR13/(Male_UR13+ Female_UR1)*Single,
         female = Female_UR1/(Male_UR13+ Female_UR1)*Single) %>% 
 select(AUName, Single, male, female)

Displaying smokers and marriage status in assending order

smokers

cau.sorted <- smoker_gender %>%
  arrange(desc(Smoker))


head(st_drop_geometry(cau.sorted), n=10)

##                   AUName male_numbers female_numbers male_percentage
## 1  Auckland Central West     698.7954       600.2046       0.5379487
## 2          Papakura East     536.1179       582.8821       0.4791045
## 3         Pukekohe North     506.3188       540.3305       0.4835901
## 4  Auckland Central East     500.5849       489.4151       0.5056413
## 5         Waiheke Island     481.3001       508.6999       0.4861617
## 6             Homai East     464.3596       501.1685       0.4807035
## 7                Leabank     406.1761       439.3432       0.4801136
## 8            Wattle Farm     408.9343       437.0657       0.4833739
## 9           Otahuhu West     411.7742       383.2258       0.5179550
## 10        Glen Eden East     382.1772       406.8228       0.4843817
##    female_percentage Smoker
## 1          0.4620513   1299
## 2          0.5208955   1119
## 3          0.5160750   1047
## 4          0.4943587    990
## 5          0.5138383    990
## 6          0.5188080    966
## 7          0.5193182    846
## 8          0.5166261    846
## 9          0.4820450    795
## 10         0.5156183    789

# the desc(Smoker) sorts information in descending order
# you can sort multiple variables by listing them arrange()

smoker_race.sorted <- smoker_race %>%
  arrange(desc(Smoker))

head(st_drop_geometry(smoker_race.sorted), n=10)

##                   AUName European     Maori   Pacific     Asian Smoker
## 1  Auckland Central West 413.4209  85.56326  59.03723 1192.4175   1299
## 2          Papakura East 397.4114 593.08499 542.41133  247.3515   1119
## 3         Pukekohe North 629.1512 519.96675 373.07943  290.6681   1047
## 4  Auckland Central East 378.2736  58.73895  35.15415  903.6467    990
## 5         Waiheke Island 828.6719 398.59003 132.21170  154.8148    990
## 6             Homai East 244.6733 326.73670 571.83020  441.6676    966
## 7                Leabank 239.9361 327.17760 476.68353  360.2184    846
## 8            Wattle Farm 461.2875 335.45651 413.07679  304.6963    846
## 9           Otahuhu West 143.2134 147.10481 485.48707  470.5243    795
## 10        Glen Eden East 421.7357 163.58221 282.24599  488.8515    789

smoker_YrUR.sorted <- smoker_YrUR %>%
  arrange(desc(Smoker))

head(st_drop_geometry(smoker_YrUR.sorted), n=10)

##                   AUName    YrUR0 YrUR1_4_9 YrUR10_14_29 YrUR30plus Smoker
## 1  Auckland Central West 714.8784 1048.5244     72.63440   6.472373   1299
## 2          Papakura East 310.8688  759.8223    220.59356  53.922870   1119
## 3         Pukekohe North 246.9926  803.4557    174.96177  28.345762   1047
## 4  Auckland Central East 586.2883  812.6697     39.72742   1.393945    990
## 5         Waiheke Island 238.4841  625.6522    293.44315  31.895994    990
## 6             Homai East 218.6484  649.8004    218.51694  51.115073    966
## 7                Leabank 207.5194  524.6880    223.44128  51.870298    846
## 8            Wattle Farm 160.7979  578.7553    219.58787  24.350338    846
## 9           Otahuhu West 201.9311  542.1833    184.91449  20.933716    795
## 10        Glen Eden East 163.2414  519.4147    200.20068  44.284021    789

marriage status

marriage age

marrried_age.sorted <- marrried_age %>%
  arrange(desc(Married))

head(st_drop_geometry(marrried_age.sorted), n=10)

##                 AUName Married  Age0_14 Age15_19 Age20_39  Age40_59 Age60plus
## 1                Orewa    3822 470.8553 185.6515 507.3175  897.1364 1848.1840
## 2           Greenhithe    3585 875.8226 284.7062 837.4224 1252.1204  450.8168
## 3        Sturges North    2967 663.5871 213.1135 813.2823  867.8561  499.1503
## 4      Baverstock Oaks    2958 675.5521 229.5565 953.3003  880.6026  312.7374
## 5    Hillsborough West    2928 485.8063 244.0997 956.1108  883.0483  474.9062
## 6  Beachlands-Maraetai    2904 716.6717 191.7129 601.6840  973.3910  498.7292
## 7          Wattle Farm    2757 666.3309 178.8808 733.0628  728.4010  531.4414
## 8                Manly    2718 521.4478 190.4093 515.1881  814.9576  761.7517
## 9             Kingseat    2676 555.3475 189.8514 510.0902  971.7285  530.2784
## 10      Pukekohe North    2634 736.1783 194.1957 726.3042  682.5398  388.2934

Marriage race

married_race.sorted <- married_race %>%
  arrange(desc(Married))

head(st_drop_geometry(married_race.sorted), n=10)

##                 AUName Married  European     Maori  Pacific     Asian
## 1                Orewa    3822 3324.1149  391.9899 191.0082  392.3725
## 2           Greenhithe    3585 2918.5329  267.0984  95.2498  900.7741
## 3        Sturges North    2967 1447.7222  312.3954 333.8489 1614.3233
## 4      Baverstock Oaks    2958  849.8339  144.1186 173.2153 2316.5331
## 5    Hillsborough West    2928  907.0963  147.5652 295.1622 2204.1151
## 6  Beachlands-Maraetai    2904 2582.4455  454.6369 124.7194   93.3390
## 7          Wattle Farm    2757 1503.2737  877.1104 869.3885  477.6458
## 8                Manly    2718 2391.5973  415.5265 140.8955  136.5997
## 9             Kingseat    2676 2227.0491  514.5452 182.1329  184.0975
## 10      Pukekohe North    2634 1582.7930 1071.5251 602.1549  401.1196

Single gender

single_gender.sorted <- single_gender %>%
  arrange(desc(Single))

head(st_drop_geometry(single_gender.sorted), n=10)

##                   AUName Single      male   female
## 1  Auckland Central West   5880 3163.1385 2716.862
## 2  Auckland Central East   5382 2721.3616 2660.638
## 3                  Orewa   2709 1203.3643 1505.636
## 4         Waiheke Island   2463 1197.4162 1265.584
## 5      Hillsborough West   2409 1195.6434 1213.357
## 6         Pukekohe North   2364 1143.5899 1220.410
## 7         Glen Eden East   2232 1081.1399 1150.860
## 8                Akarana   2058 1012.3675 1045.633
## 9        Ellerslie North   2055  994.0202 1060.980
## 10         Mangere South   2025  981.3527 1043.647

From our analysis, we can see that some datasets have a relationship with one another. Now, we will have to combined the variables of interest into one dataframe in order for us to run multiple or add Weights to the analysis.

married_total= 
sfa %>% 
  select(AUName, PopUR13, Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,Age55_59,Age60_64,Age65plus,European, Maori,Pacific,Asian,Single) %>% 
   mutate(Age0_14_married = ((Age0_4 + Age5_9 + Age10_14) /(a= Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus)*Married)/Married,
        
        Age15_19_married=  (Age15_19/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        Age20_39_married = ((Age20_24 + Age25_29 + Age30_34 + Age35_39)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        Age40_59_married = ((Age40_44 + Age45_49+ Age50_54 + Age55_59)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        Age60plus_married=((Age60_64+ Age65plus)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        European_married = (European/(European+Maori+Pacific+Asian)*Married)/Married,
         Maori_married = (Maori/(European+Maori+Pacific+Asian)*Married)/Married,
         Pacific_married =  (Pacific/(European+Maori+Pacific+Asian) *Married)/Married,
         Asian_married = (Asian/(European+Maori+Pacific+Asian) *Married)/Married,
         #single_male = (Male_UR13/(Male_UR13+ Female_UR1)*Single)/Single,
         #single_female = (Female_UR1/(Male_UR13+ Female_UR1)*Single)/Single
        ) %>%
  
    select(Age0_14_married,Age15_19_married,Age20_39_married,Age40_59_married,Age60plus_married,European_married, Maori_married,Pacific_married,Asian_married)

head(st_drop_geometry(married_total), n=10)

##    Age0_14_married Age15_19_married Age20_39_married Age40_59_married
## 1        0.1658986       0.05990783        0.1716590        0.2361751
## 2        0.2256604       0.07169811        0.1811321        0.3049057
## 3        0.2273263       0.07538280        0.2473498        0.2932862
## 4        0.1934605       0.07084469        0.2179837        0.3433243
## 5        0.1864407       0.03389831        0.1355932        0.2881356
## 6        0.1803279       0.06557377        0.1536885        0.3565574
## 7        0.1926007       0.05440696        0.3220892        0.2763874
## 8        0.2301136       0.05113636        0.2301136        0.2528409
## 9        0.2282609       0.06763285        0.1811594        0.3623188
## 10       0.2105263       0.07218045        0.1879699        0.3609023
##    Age60plus_married European_married Maori_married Pacific_married
## 1          0.3663594        0.8834499    0.05244755      0.01864802
## 2          0.2166038        0.8714499    0.06427504      0.02167414
## 3          0.1566549        0.8547297    0.09346847      0.01914414
## 4          0.1743869        0.8963585    0.04201681      0.01120448
## 5          0.3559322        0.9210526    0.03508772      0.03508772
## 6          0.2438525        0.8092243    0.14884696      0.02515723
## 7          0.1545158        0.3563596    0.09100877      0.13815789
## 8          0.2357955        0.6849315    0.22739726      0.05205479
## 9          0.1606280        0.9026217    0.06991261      0.01498127
## 10         0.1684211        0.8437979    0.11026034      0.02297090
##    Asian_married
## 1     0.04545455
## 2     0.04260090
## 3     0.03265766
## 4     0.05042017
## 5     0.00877193
## 6     0.01677149
## 7     0.41447368
## 8     0.03561644
## 9     0.01248439
## 10    0.02297090

smoker_total= 
sfa %>% 
  select(AUName, PopUR13, Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54, Age55_59,Age60_64,Age65plus,European, Maori,Pacific,Asian,Single) %>% 
   mutate(YrUR0_smoker = (YrUR0/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
         YrUR1_4_9_smoker = ((YrUR1_4 +YrUR5_9)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
         YrUR10_14_29_smoker =  ((YrUR10_14+YrUR15_29)/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
         YrUR30plus_smoker = (YrUR30plus/(YrUR0 +YrUR1_4 + YrUR5_9 + YrUR10_14 +YrUR15_29 +YrUR30plus)*Smoker)/Smoker,
         male_smoker = (Male_UR13/PopUR13*Smoker)/Smoker,
         female_smoker = (Female_UR1/PopUR13*Smoker)/Smoker,
         European_smoker = (European/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
         Maori_smoker = (Maori/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
         Pacific_smoker =  (Pacific/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
         Asian_smoker = (Asian/(European+Maori+Pacific+Asian)*Smoker)/Smoker,
         single_female = (Female_UR1/(Male_UR13+ Female_UR1)*Single)/Single
         ) %>%
  
 
    select(male_smoker,female_smoker,European_smoker, Maori_smoker,Pacific_smoker,Asian_smoker,YrUR0_smoker,YrUR1_4_9_smoker,YrUR10_14_29_smoker,YrUR30plus_smoker,Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,Age55_59,Age60_64,Age65plus,Single)

head(st_drop_geometry(smoker_total), n=10)

##    male_smoker female_smoker European_smoker Maori_smoker Pacific_smoker
## 1    0.4643678     0.5356322       0.8834499   0.05244755     0.01864802
## 2    0.4799698     0.5200302       0.8714499   0.06427504     0.02167414
## 3    0.4882353     0.5117647       0.8547297   0.09346847     0.01914414
## 4    0.5054348     0.4918478       0.8963585   0.04201681     0.01120448
## 5    0.5000000     0.5000000       0.9210526   0.03508772     0.03508772
## 6    0.5071575     0.4928425       0.8092243   0.14884696     0.02515723
## 7    0.4689204     0.5310796       0.3563596   0.09100877     0.13815789
## 8    0.4899713     0.5100287       0.6849315   0.22739726     0.05205479
## 9    0.5012107     0.5000000       0.9026217   0.06991261     0.01498127
## 10   0.5022556     0.4992481       0.8437979   0.11026034     0.02297090
##    Asian_smoker YrUR0_smoker YrUR1_4_9_smoker YrUR10_14_29_smoker
## 1    0.04545455    0.2261614        0.5317848           0.2151589
## 2    0.04260090    0.2259348        0.5799523           0.1837709
## 3    0.03265766    0.2383292        0.5270270           0.2272727
## 4    0.05042017    0.1735294        0.5323529           0.2735294
## 5    0.00877193    0.2293578        0.4495413           0.2660550
## 6    0.01677149    0.2121212        0.5058275           0.2331002
## 7    0.41447368    0.2342995        0.5277778           0.2089372
## 8    0.03561644    0.2364217        0.6102236           0.1405751
## 9    0.01248439    0.1534392        0.5767196           0.2473545
## 10   0.02297090    0.1281198        0.4891847           0.3044925
##    YrUR30plus_smoker Male_UR13 Female_UR1 Smoker European Maori Pacific Asian
## 1        0.026894866      1212       1398    183     2274   135      48   117
## 2        0.010342084      1905       2064    315     3498   258      87   171
## 3        0.007371007      1245       1305    261     2277   249      51    87
## 4        0.020588235       558        543     66      960    45      12    54
## 5        0.055045872       177        177     21      315    12      12     3
## 6        0.048951049       744        723    153     1158   213      36    24
## 7        0.028985507      1290       1461    276      975   249     378  1134
## 8        0.012779553       513        534    189      750   249      57    39
## 9        0.022486772      1242       1239    183     2169   168      36    30
## 10       0.078202995      1002        996    153     1653   216      45    45
##    YrUR0 YrUR1_4 YrUR5_9 YrUR10_14 YrUR15_29 YrUR30plus Married Age0_4 Age5_9
## 1    555     813     492       276       252         66    1059    147    141
## 2    852    1236     951       411       282         39    1776    243    321
## 3    582     747     540       330       225         18     933    195    195
## 4    177     318     225       153       126         21     489     51     90
## 5     75      81      66        33        54         18     186     21     21
## 6    273     369     282       165       135         63     555     69     99
## 7    582     732     579       276       243         72     897    225    171
## 8    222     333     240        54        78         12     261     90     96
## 9    348     660     648       282       279         51     990    159    198
## 10   231     459     423       255       294        141     747    108    138
##    Age10_14 Age15_19 Age20_24 Age25_29 Age30_34 Age35_39 Age40_44 Age45_49
## 1       144      156      105       93      111      138      171      171
## 2       333      285      147      132      192      249      345      378
## 3       189      192      156      129      174      171      219      192
## 4        72       78       78       63       45       54       93      102
## 5        24       12        9       12       12       15       21       24
## 6        96       96       63       33       48       81      108      147
## 7       135      150      216      234      243      195      216      180
## 8        57       54       51       63       69       60       78       66
## 9       210      168      111       84       93      162      258      261
## 10      174      144      108       75       78      114      186      222
##    Age50_54 Age55_59 Age60_64 Age65plus Single
## 1       156      117      132       822    759
## 2       306      183      204       657    810
## 3       192      144      126       273    696
## 4       105       78       75       117    252
## 5        24       33       36        90     60
## 6       153      114      120       237    333
## 7       189      177      144       282    879
## 8        69       54       48       201    321
## 9       216      165      138       261    528
## 10      177      135       96       240    465

married_smoker_total_1= 
smoker_total %>% 
  select(male_smoker,female_smoker,European_smoker, Maori_smoker,Pacific_smoker,Asian_smoker,YrUR0_smoker,YrUR1_4_9_smoker,YrUR10_14_29_smoker,YrUR30plus_smoker,Male_UR13,Female_UR1,Smoker,European, Maori,Pacific,Asian,YrUR0, YrUR1_4,YrUR5_9,YrUR10_14,YrUR15_29,YrUR30plus,Married,Age0_4, Age5_9, Age10_14, Age15_19 ,Age20_24,Age25_29,Age30_34,Age35_39,Age40_44,Age45_49,Age50_54,Age55_59,Age60_64,Age65plus,Single) %>% 
   mutate(Age0_14_married = ((Age0_4 + Age5_9 + Age10_14) /(a= Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus)*Married)/Married,
        
        Age15_19_married=  (Age15_19/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        Age20_39_married = ((Age20_24 + Age25_29 + Age30_34 + Age35_39)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        Age40_59_married = ((Age40_44 + Age45_49+ Age50_54 + Age55_59)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        Age60plus_married=((Age60_64+ Age65plus)/(Age0_4 + Age5_9 +Age10_14 + Age15_19 + Age20_24 + Age25_29 + Age30_34 + Age35_39 + Age40_44 + Age45_49+ Age50_54 + Age55_59 +Age60_64+ Age65plus) *Married)/Married,
        European_married = (European/(European+Maori+Pacific+Asian)*Married)/Married,
         Maori_married = (Maori/(European+Maori+Pacific+Asian)*Married)/Married,
         Pacific_married =  (Pacific/(European+Maori+Pacific+Asian) *Married)/Married,
         Asian_married = (Asian/(European+Maori+Pacific+Asian) *Married)/Married,
         single_male = (Male_UR13/(Male_UR13+ Female_UR1)*Single)/Single,
         single_female = (Female_UR1/(Male_UR13+ Female_UR1)*Single)/Single) %>%
  
    select(Age0_14_married,Age15_19_married,Age20_39_married,Age40_59_married,Age60plus_married,European_married, Maori_married,Pacific_married,Asian_married,single_male,single_female,male_smoker,female_smoker,European_smoker, Maori_smoker,Pacific_smoker,Asian_smoker,YrUR0_smoker,YrUR1_4_9_smoker,YrUR10_14_29_smoker,YrUR30plus_smoker)

head(st_drop_geometry(married_smoker_total_1), n=10)

##    Age0_14_married Age15_19_married Age20_39_married Age40_59_married
## 1        0.1658986       0.05990783        0.1716590        0.2361751
## 2        0.2256604       0.07169811        0.1811321        0.3049057
## 3        0.2273263       0.07538280        0.2473498        0.2932862
## 4        0.1934605       0.07084469        0.2179837        0.3433243
## 5        0.1864407       0.03389831        0.1355932        0.2881356
## 6        0.1803279       0.06557377        0.1536885        0.3565574
## 7        0.1926007       0.05440696        0.3220892        0.2763874
## 8        0.2301136       0.05113636        0.2301136        0.2528409
## 9        0.2282609       0.06763285        0.1811594        0.3623188
## 10       0.2105263       0.07218045        0.1879699        0.3609023
##    Age60plus_married European_married Maori_married Pacific_married
## 1          0.3663594        0.8834499    0.05244755      0.01864802
## 2          0.2166038        0.8714499    0.06427504      0.02167414
## 3          0.1566549        0.8547297    0.09346847      0.01914414
## 4          0.1743869        0.8963585    0.04201681      0.01120448
## 5          0.3559322        0.9210526    0.03508772      0.03508772
## 6          0.2438525        0.8092243    0.14884696      0.02515723
## 7          0.1545158        0.3563596    0.09100877      0.13815789
## 8          0.2357955        0.6849315    0.22739726      0.05205479
## 9          0.1606280        0.9026217    0.06991261      0.01498127
## 10         0.1684211        0.8437979    0.11026034      0.02297090
##    Asian_married single_male single_female male_smoker female_smoker
## 1     0.04545455   0.4643678     0.5356322   0.4643678     0.5356322
## 2     0.04260090   0.4799698     0.5200302   0.4799698     0.5200302
## 3     0.03265766   0.4882353     0.5117647   0.4882353     0.5117647
## 4     0.05042017   0.5068120     0.4931880   0.5054348     0.4918478
## 5     0.00877193   0.5000000     0.5000000   0.5000000     0.5000000
## 6     0.01677149   0.5071575     0.4928425   0.5071575     0.4928425
## 7     0.41447368   0.4689204     0.5310796   0.4689204     0.5310796
## 8     0.03561644   0.4899713     0.5100287   0.4899713     0.5100287
## 9     0.01248439   0.5006046     0.4993954   0.5012107     0.5000000
## 10    0.02297090   0.5015015     0.4984985   0.5022556     0.4992481
##    European_smoker Maori_smoker Pacific_smoker Asian_smoker YrUR0_smoker
## 1        0.8834499   0.05244755     0.01864802   0.04545455    0.2261614
## 2        0.8714499   0.06427504     0.02167414   0.04260090    0.2259348
## 3        0.8547297   0.09346847     0.01914414   0.03265766    0.2383292
## 4        0.8963585   0.04201681     0.01120448   0.05042017    0.1735294
## 5        0.9210526   0.03508772     0.03508772   0.00877193    0.2293578
## 6        0.8092243   0.14884696     0.02515723   0.01677149    0.2121212
## 7        0.3563596   0.09100877     0.13815789   0.41447368    0.2342995
## 8        0.6849315   0.22739726     0.05205479   0.03561644    0.2364217
## 9        0.9026217   0.06991261     0.01498127   0.01248439    0.1534392
## 10       0.8437979   0.11026034     0.02297090   0.02297090    0.1281198
##    YrUR1_4_9_smoker YrUR10_14_29_smoker YrUR30plus_smoker
## 1         0.5317848           0.2151589       0.026894866
## 2         0.5799523           0.1837709       0.010342084
## 3         0.5270270           0.2272727       0.007371007
## 4         0.5323529           0.2735294       0.020588235
## 5         0.4495413           0.2660550       0.055045872
## 6         0.5058275           0.2331002       0.048951049
## 7         0.5277778           0.2089372       0.028985507
## 8         0.6102236           0.1405751       0.012779553
## 9         0.5767196           0.2473545       0.022486772
## 10        0.4891847           0.3044925       0.078202995

Note: The marrage as this seems unrealistic -- It could prevalent in some cultures or could be as a reslut of peer pressure/negative unfluences.

Dimension reduction methods

I will be using the Dimension Reduction Method (DRM) and the Principal Component Analysis (PCA), to linearly map of the information to a reduced dimensional space. It will allow the maximization of the variance of the data in reduced dimensions.

I will be running the DRM and PCA as the variables have similar distributions with each other. This means the variables are non-independent and are likely to be weighted together.

I will run the PCA though princomp

Note: na values and geometry columns needs to be removed as the PCA will not work.

library(sf)
library(tmap)
library(tidyr)

smoker_married_total <- drop_na(married_smoker_total_1) # removing the na values
smoker_married_total.d <- st_drop_geometry(smoker_married_total) # removing the geometry columns

smoker_married_total.pca <- princomp(smoker_married_total.d, cor=TRUE)  # the cor=TRUE indcate that there is a correlation.

The summary displays 21 components along with the Standard deviation, Proportion of Variance and Cumulative Proportion. These components are ordered from the highest Proportion of Variance to the lowest. For example, the Proportion of Variance for component 1 is 0.30 (30%), followed by 2.0 for component 2 (etc..).

From this dataset, about 80% is accounted by the first 5 principal components; is not all of the variance but it is a big portion of it.

Therefore, it is effective to select 5 attributes as it is easier to analyse than 21 attributes as we add more of these components there will be extra bits of additional information adding on. It better to add less.

summary(smoker_married_total.pca) # Summarises the PCA of individuals who are smokers and married.

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.5174842 2.0538368 1.9836965 1.33486235 1.24244596
## Proportion of Variance 0.3017965 0.2008688 0.1873834 0.08485036 0.07350819
## Cumulative Proportion  0.3017965 0.5026654 0.6900488 0.77489913 0.84840732
##                            Comp.6     Comp.7    Comp.8     Comp.9    Comp.10
## Standard deviation     0.95209596 0.83052486 0.7961795 0.64467722 0.56647303
## Proportion of Variance 0.04316603 0.03284626 0.0301858 0.01979089 0.01528056
## Cumulative Proportion  0.89157335 0.92441961 0.9546054 0.97439631 0.98967687
##                            Comp.11      Comp.12      Comp.13      Comp.14
## Standard deviation     0.453030116 0.1072603132 6.690120e-03 3.100448e-08
## Proportion of Variance 0.009773156 0.0005478464 2.131319e-06 4.577513e-17
## Cumulative Proportion  0.999450022 0.9999978687 1.000000e+00 1.000000e+00
##                        Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## Standard deviation           0       0       0       0       0       0       0
## Proportion of Variance       0       0       0       0       0       0       0
## Cumulative Proportion        1       1       1       1       1       1       1

We can find the principal components of each weighted amount of the original variables by usingloadings.

smoker_married_total.pca$loadings

## 
## Loadings:
##                     Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Age0_14_married      0.233  0.231  0.193         0.262  0.208  0.195       
## Age15_19_married     0.171                0.274 -0.186  0.727 -0.347       
## Age20_39_married     0.174 -0.327        -0.168 -0.277 -0.172  0.338 -0.137
## Age40_59_married    -0.302                       0.252  0.271  0.199 -0.169
## Age60plus_married   -0.236  0.169 -0.116               -0.396 -0.618  0.290
## European_married    -0.369               -0.197         0.125              
## Maori_married        0.257  0.154  0.185 -0.327               -0.302 -0.358
## Pacific_married      0.330         0.181  0.102        -0.110  0.119  0.365
## Asian_married        0.111 -0.309 -0.241  0.337  0.140        -0.121 -0.239
## single_male         -0.126 -0.256  0.392                                   
## single_female        0.126  0.256 -0.392                                   
## male_smoker         -0.120 -0.258  0.392                                   
## female_smoker        0.130  0.252 -0.389                                   
## European_smoker     -0.369               -0.197         0.125              
## Maori_smoker         0.257  0.154  0.185 -0.327               -0.302 -0.358
## Pacific_smoker       0.330         0.181  0.102        -0.110  0.119  0.365
## Asian_smoker         0.111 -0.309 -0.241  0.337  0.140        -0.121 -0.239
## YrUR0_smoker               -0.349        -0.328 -0.287  0.140         0.151
## YrUR1_4_9_smoker                  -0.142 -0.146  0.698                     
## YrUR10_14_29_smoker -0.133  0.298  0.147  0.384 -0.115                     
## YrUR30plus_smoker           0.240  0.170  0.257 -0.296 -0.271  0.164 -0.449
##                     Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Age0_14_married                     0.734                   0.361         
## Age15_19_married     0.196 -0.291  -0.235                   0.156         
## Age20_39_married    -0.191 -0.318  -0.194                   0.587         
## Age40_59_married            0.635  -0.370                   0.346         
## Age60plus_married    0.147                                  0.458         
## European_married           -0.167                                  -0.253 
## Maori_married       -0.141                                         -0.363 
## Pacific_married             0.136  -0.181                                 
## Asian_married                       0.119                   0.120  -0.278 
## single_male                                         0.497           0.416 
## single_female                                      -0.497           0.416 
## male_smoker                                -0.746  -0.443                 
## female_smoker                              -0.661   0.555                 
## European_smoker            -0.167                                   0.313 
## Maori_smoker        -0.141                                          0.379 
## Pacific_smoker              0.136  -0.181                                 
## Asian_smoker                        0.119                           0.316 
## YrUR0_smoker         0.182  0.375   0.183                  -0.235  -0.118 
## YrUR1_4_9_smoker     0.247 -0.353  -0.275                  -0.161         
## YrUR10_14_29_smoker -0.572 -0.144                          -0.218  -0.110 
## YrUR30plus_smoker    0.649                                                
##                     Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## Age0_14_married                                      0.164         
## Age15_19_married                                                   
## Age20_39_married                                     0.267         
## Age40_59_married                                     0.157         
## Age60plus_married                                    0.208         
## European_married             0.541  -0.201  -0.404  -0.157  -0.406 
## Maori_married        0.168   0.116   0.241   0.437          -0.303 
## Pacific_married      0.522           0.389  -0.399          -0.184 
## Asian_married        0.320  -0.354  -0.376  -0.147  -0.265  -0.218 
## single_male          0.329   0.190  -0.302   0.257          -0.132 
## single_female        0.329   0.190  -0.302   0.257          -0.132 
## male_smoker                                                        
## female_smoker                                                      
## European_smoker             -0.567   0.270          -0.183  -0.467 
## Maori_smoker        -0.163  -0.123  -0.223  -0.525                 
## Pacific_smoker      -0.509          -0.341   0.164  -0.178  -0.427 
## Asian_smoker        -0.309   0.338   0.419                  -0.329 
## YrUR0_smoker                -0.134                   0.518  -0.216 
## YrUR1_4_9_smoker                                     0.354  -0.147 
## YrUR10_14_29_smoker         -0.124                   0.481  -0.200 
## YrUR30plus_smoker                                    0.144         
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.048  0.048  0.048  0.048  0.048  0.048  0.048  0.048  0.048
## Cumulative Var  0.048  0.095  0.143  0.190  0.238  0.286  0.333  0.381  0.429
##                Comp.10 Comp.11 Comp.12 Comp.13 Comp.14 Comp.15 Comp.16 Comp.17
## SS loadings      1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000
## Proportion Var   0.048   0.048   0.048   0.048   0.048   0.048   0.048   0.048
## Cumulative Var   0.476   0.524   0.571   0.619   0.667   0.714   0.762   0.810
##                Comp.18 Comp.19 Comp.20 Comp.21
## SS loadings      1.000   1.000   1.000   1.000
## Proportion Var   0.048   0.048   0.048   0.048
## Cumulative Var   0.857   0.905   0.952   1.000

The table of loadings helps us find the interpretation of individual component from the variable weights.

The components work by multiplying and adding each of attributes within that component. For example, in component 1, Age0_14_married x 0.233 + Age15_19_married x 0.171 + Age20_39_married etc...

For component 1:

The highest weight was both pacific married (0.330) and pacific smokers (0.330).
The lowest weights were European smoker and European married (-0.369).
This means that most smokers are pacific and the lowest is European smokers within that neighborhood.

For component 2:

Single female smokers (0.252- 0.256) had the highest weight.
The low weights were for zero-year smoking residence in New Zealand who are married Asian smokers.
The neighborhoods could possibly have more females.

Here, I will create a biplot

biplot(smoker_married_total.pca, pc.biplot=TRUE, cex=0.8)

A biplot enables us to view which observations (e.g., census territories) score have high or low variable and which variable weight are similar to the components (that point in similar directions) in space by examining which ones are related.

Each of these points relates to the observation of these data that shows which of these observations have high weights on original values.

For example, principal component 1 weight depends highly on Maori smokers who are females while component 2 relies highly on smokers. Component 2 dimension focuses on smoker who has residency while the component emphasis on race, marriage, and smoking status.

Note: It is hard to read but it gives us a rough idea.

Both loadings and biplot allows us to figure out the analysis of principal components Finally, we can take the components from the PCA analysis scores result which will allow me to create a map (shown below)

#install.packages("webshot")
#webshot::install_phantomjs()

smoker_married_total$PC1 <- smoker_married_total.pca$scores[,1] # the [,1] means taking the first principle component of the first column 
tmap_mode('view')

## tmap mode set to interactive viewing

tm_shape(smoker_married_total) +
  tm_polygons(col='PC1') +
  tm_legend(legend.outside=T)

## Variable(s) "PC1" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Results:

Component 1 seems to relate to Maori and pacific smokers who are married (component values over 0.2).
The values ascend from dark green (5-10) to red (-10 to -5). The red indicates a significant amount of Pasifika while the opposite for dark green.
There are less Pacific smokers in the urban areas of Auckland (e.g., Auckland city Central district).
Kawau Island had the highest number of pacific smokers.
Areas towards the beach (e.g., Muriwai, Piha) tend to have more Mauri and Pacifica's.
PCA is popular when first looking at data dimensional reduction techniques.
Pacifica's and Maoris tend to be married towards rural areas. Couples tend to be benefited from factors such as low property prices, more space and privacy.

Clustering

The dimensional reduction methods focus on the variables in a dataset.

The clustering methods focuses on the observations of the variables by breaking the dataset into groups of observations. These groups allow us to find the differences and similarities between the variables in the dataset.

To understand differences and similarities between the variables in the dataset, it is important to comprehend the distance in Euclidean (two dimensional) space. It is the square root of the number of the squared difference of the individual coordinate to upper dimensions.

Note: all attributes can be rescaled. The changes in a single large, valued attribute do not mask out the changes in other variables. A the strongly correlated variables maybe exclude in the analysis. Therefore, clustering is typically used on principal component scores.

Here, the K-means clustering method is used by 1) finding the number of k clusters and the centers of the cluster, 2) Allocate each outcome to the closest cluster center and compute the average center of individual cluster.

km <- kmeans(smoker_married_total.d, 5) # we use the kmans function  to our smoker_married_total.d dataset 
                                        # the number of clusters is set to 5
clusters <- fitted(km, method='classes') # setting the cluster method to "classes"
smoker_married_total$c5 <- clusters # we can add the new variables to our dataset
tmap_mode('view') # maping it

## tmap mode set to interactive viewing

tm_shape(smoker_married_total) +
  tm_polygons(col='c5', palette='cat') + # setting the colour palette to cat
  tm_legend(legend.outside=T)

Here, we are using 5 clusters, so we are clustering alike variables to one another - a paartitioning of the areas into clusters.

Before we run the kmeans function, I will need to choose the number of clusters, I picked 5 because it seemed reasonable for the variable in our dataset. We got the cluster assignments viafitted function, changed palette to cat as we are dealing with categorical variable and then we map it.

Thsi method is not determanistic which means I will have diiferent outcomes each time we run it, but it can be quite similar sometimes.

We can improve the quality of this method from the kmeansfunction; to measure the variance within and between cluster.

we can see the the rural areas tend to be grouped together which is shown as yellow with at value of 1. As the clusters approtch towards the center/high/ main part of the urban areas; it changes colours. For example, from the yellow urban cluster, the cluster 3 orange and red cluster 2 aproches to cluster 4 and 5 shown as a light and dark purple colour which somewhat represents high urban areas or even high populated areas. I know this, as the dark and light purple custered areas are at west and city center of auckland district.

Hierarchical clustering

K-means does not tell us about the structure of the clusters it creates. The Hierarchical methods tells us how the outcomes have been gathered into the clusters. The algorithm works by 'agglomerative' approach as we work with each observation.
Hierarchical clustering works by:

computing the (multivariate) distances from every set of observations
locating the closest pair/set of observations and combined them into a cluster
Finding the recently generated cluster which will tell us the distances from all the remaining observations.

This method is popular with network data when cluster detection (which is known as community detection).

To do this, we will need to use the hclust function

hc <- hclust(dist(smoker_married_total.d))

We will use the tmap to display the outcome. It is better than using the cluster dendrogram (from 'plot(hc)'). The height of the joined pair clusters and merging branches may make the dendrogram look messy.

plot(hc)

smoker_married_total$hc5 <- cutree(hc, k=5) # can use the cutree function  to focus on Hierarchical clustering of 5 clusters. 

tm_shape(smoker_married_total) +
  tm_polygons(col='hc5', palette='cat') +
  tm_legend(legend.outside=T)

There is a clear similarity between the Hierarchical clustering and k-means one. output

Once clusters have been assigned, we can do further analysis by comparing the characteristics of different clusters. For example:

boxplot(smoker_married_total$Pacific_smoker  ~ smoker_married_total$hc5, xlab='Cluster', ylab='Pacific_smoker') # here we are ploting the orginal vaiables to the Hierarchical clusters of 5.

Boxplot can interpret the Hierarchical cluster map. For example, both Hierarchical clustering map and boxplot show 5 cluster with high amounts of pacific smokers.

A special thanks to David O'Sullivan for teaching the process and method of running a Multivariate Analysis.