12.2
dist = as.dist(matrix(c(0, 0.3, 0.4, 0.7,
0.3, 0, 0.5, 0.8,
0.4, 0.5, 0.0, 0.45,
0.7, 0.8, 0.45, 0.0), nrow=4))
plot(hclust(dist, method="complete"))
part a)
Break 1: 0.8
Break Between 1 & 2: 0.3
Break Between 3 & 4: around 0.45
plot(hclust(dist, method="single"))
part b)
Break #1 = 0.45
Break #2 = 0.4
Break #3 = 0.3
part c) Cutting complete hierarchical dendogram between [0.45 & 0.8] gives us two groups: (1,2) & (3,4)
part d) Cutting single hierarchical dendogram between [0.4,0.45] gives us two groups: (4) & (3,1,2)
part e)
plot(hclust(dist, method="complete"), labels=c(2,1,4,3))
| ## Question 2: | Track re | cords data - PCA |
|---|---|---|
| Argentina Australia Austria Belgium Bermuda Brazil Burma Canada Chile China Columbia CookIs Costa_Rica Czech Denmark DomRep Finland France GDR FRG GB_NI Greece Guatemala Hungary India Indonesia Ireland Israel Italy Japan Kenya Korea DPRKorea Luxembourg Malaysia Mauritius Mexico Netherlands NZ Norway PNG Philippines Poland Portugal Rumania Singapore Spain Sweden Switzerland Taipei Thailand Turkey USA USSR WSamoa | 39 13 24 21 34 33 46 5 35 31 30 53 36 22 12 50 16 18 20 4 7 42 51 19 45 49 6 26 11 8 41 28 40 37 43 54 23 14 2 3 52 47 25 10 29 44 27 17 15 38 32 48 1 9 55 | 40 2 36 11 46 31 43 12 34 32 19 55 38 33 16 52 15 27 10 25 6 35 42 30 26 48 28 39 17 3 9 37 21 44 51 50 8 5 7 23 47 45 20 4 29 53 22 14 18 41 49 24 1 13 54 |
| (ii) Does this ANS: | ranking | correspond with your intuition of athletic excellence (when it comes to running) for the various countries? |
| My initial imp | ression a | bout the data is that it seems to fit the assumption of temperate nations having faster marathon times; due to their fair weather, they are able to train for longer races over multiple seasons year-round. By comparison, Samoa and the Philippines are small island nations with hot climates year-round; making it very difficult to train for long distances overall. |
| (iii) Is the r | anking di | fferent for men and women? |
| ANS: For t | he most p | art the ranking is different for men and women, but it is important to notice that when they do differ it seems to be very drastic in those circumstances (Norway, Kenya, FRG, Turkey) |
| #————- ## Question 3: | ——— Track re | ————————————————————————————————- cords - Clustering |
| Consider again Here the objec *Hint: you can | the trac tive is t get W(Ck | k records data for men and women from problem 1. o examine how countries cluster together based on the track record data. ) directly in R by: <kmeans.fit>$tot.withinss* |
| #part (a) Perform K-Mean (i) Plot the t | s cluster otal with | ing with k = 2,…15 for both data sets: in-group sum-of-squares W(Ck) against k. |
| ```r Wtrack_k.means #clusters 2 - for (i in 2:15 km.out <- km Wtrack_k.mea } par(mfrow= c(2 plot(2:15, Wtr main = “W ylab =”T xlab = “N | = c() 15 ){ eans(Wtra ns[i-1] = | |
| Mtrack_k.means for (i in 2:15 km.out <- km Mtrack_k.mea } plot(2:15, Mtr main = “M ylab =”T xlab = “N ``` | = c() ){ eans(Mtra ns[i-1] = | |
| <img src=“jdk1 | 60_HW3_ST | AT326_files/figure-html/3a K means-1.png” width=“672” /> |
| (ii) While the | W(Ck) ge | nerally decreases with increasing k, is there a point where the differences become small (i.e. does the graph show an “elbow”)? |
| ANS: There | appears | to be an elbow point at k=5, as it tends to taper off pretty quickly after hitting k=6. As a result, I would recommend using k=5. |
| #part (b) For both data | sets, mak | e a table that lists the countries in each cluster according to K-Means clustering with k = 4. |
| ```r k4_Women <- km k4_Women.table Cluster.frame | eans(Wtra <- table <- data.f | ck[,1:7], center = 4, nstart = 20) (Wtrack\(Country, k4_Women\)cluster) rame(k4_Women$cluster) |
| k4_Men <- kmea k4_Men.table < | ns(Mtrack - table(M | [,1:8], center = 4, nstart = 20) track\(Country, k4_Men\)cluster) |
| kable(list(k4_ ``` | Women.tab | le, k4_Men.table), caption = “Women’s vs Men’s Clusters (k=4)”) |
| <table class=” | kable_wra ’s vs Men | pper”> ’s Clusters (k=4) |
| | | |:———–| |Argentina | |Australia | |Austria | |Belgium | |Bermuda | |Brazil | |Burma | |Canada | |Chile | |China | |Columbia | |CookIs | |Costa_Rica | |Czech | |Denmark | |DomRep | |DPRKorea | |Finland | |France | |FRG | |GB_NI | |GDR | |Greece | |Guatemala | |Hungary | |India | |Indonesia | |Ireland | |Israel | |Italy | |Japan | |Kenya | |Korea | |Luxembourg | |Malaysia | |Mauritius | |Mexico | |Netherlands | |Norway | |NZ | |Philippines | |PNG | |Poland | |Portugal | |Rumania | |Singapore | |Spain | |Sweden | |Switzerland | |Taipei | |Thailand | |Turkey | |USA | |USSR | |WSamoa | | 1| 2| –:|–:|- 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 1| 0| | 3| 4| -:|–:| 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 1| 1| 0| 0| 1| 0| 0| 0| 0| 1| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 1| 0| 0| 0| 0| 1| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 1| 0| 0| 0| 0| 0| 0| 0| |
| | | |:———–| |Argentina | |Australia | |Austria | |Belgium | |Bermuda | |Brazil | |Burma | |Canada | |Chile | |China | |Columbia | |CookIs | |Costa_Rica | |Czech | |Denmark | |DomRep | |DPRKorea | |Finland | |France | |FRG | |GB_NI | |GDR | |Greece | |Guatemala | |Hungary | |India | |Indonesia | |Ireland | |Israel | |Italy | |Japan | |Kenya | |Korea | |Luxembourg | |Malaysia | |Mauritius | |Mexico | |Netherlands | |Norway | |NZ | |Philippines | |PNG | |Poland | |Portugal | |Rumania | |Singapore | |Spain | |Sweden | |Switzerland | |Taipei | |Thailand | |Turkey | |USA | |USSR | |WSamoa | | 1| 2| –:|–:|- 1| 0| 0| 1| 1| 0| 0| 1| 0| 0| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 1| 1| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| | 3| 4| -:|–:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0| |
| #part (c) For both data | sets, vis | ualize the clustering according to K-Means clustering with k = 4 in two ways. |
| (i) First, plo (ii) Second, p (iii) To get a | t the fir lot the 1 better v | st two principal component scores from Problem 1 and color them according to the clusters. 00 m records versus the marathon records and color according to clusters. iew of cluster groups plot the names of the countries instead of points. |
| ```r par(mfrow=c(1, plot(Wtrack_pc xlab = “P main =”C text(Wtrack_pc col = alp | 2)) $x[,1], W C1”, ylab lustering $x[,1], W ha(k4_Wom | |
| plot(Mtrack_pc xlab = “P main =”C text(Mtrack_pc col = alp ``` | $x[,1], M C1”, ylab lustering $x[,1], M ha(k4_Men | track_pc\(x[,2], = "PC2", type = "n", for Men at k = 4", cex.main = 0.7) track_pc\)x[,2], labels = Mtrack$Country, $cluster, 0.7), cex = 0.3) |
| <img src=“jdk1 Shows us the g | 60_HW3_ST eneral tr | AT326_files/figure-html/cluster names-1.png” width=“672” /> end when observing the spread between Men and Women relative to their respective countries. This gives us an idea of how far outliers are between Men’s & Women’s teams for the same clusters & countries. |
``r par(mfrow = c( plot(Wtrack$1 col = k4_ xlab = ’1 main = “W text(Wtrack$`1 col = k4_ |
1,2)) 00m, Wtr Women$clu 00m',ylab omen\'s 1 00m, Wtr Women$clu |
|
| plot(Mtrack\(`1 col = k4_ xlab = '1 main = "M text(Mtrack\)`1 col = k4_ ``` | 00m, Mtr Men$clust 00m',ylab en\'s 100 00m, Mtr Men$clust |
ack\(Marathon, type = "n", er, = 'Marathon', m vs Marathon", cex.main = 0.85) ack\)Marathon, labels = Mtrack$Country, er, cex = 0.3) |
| <img src=“jdk1 (iv) Comment o | 60_HW3_ST n how the | AT326_files/figure-html/clusterplot-1.png” width=“672” /> clusters relate to the PC scores and to the two chose predictors (100 m and marathon). Hint: You can use the type = “n” option in plot() and then use the text() function to put the country names on the plot |
| ANS: Based | on the c | luster locations, we can determine that the bottom left corner is where the strongest performers are (USA is a good reference point since they are 1st for both men & women), and the top right corner is where the weaker performers are. Scores of women lean more towards the 100m race for determining rank, while for men it relies more on the marathon. |
| #part (d) Comment on the | clusters | , e.g. are there any surprises or do you think that the clusters represent a reasonable split? |
| d.) | ||
r Country_Total. Country_Total. Country_Total. |
Rank.Fram Rank.Fram Rank.Fram | e\(MenCluster <-k4_Men\)cluster e\(WomenCluster <-k4_Women\)cluster e |
| <div data-page <script data {“columns”:[{” | dtable=“f -pagedtab label”:[” | alse”> le-source type=“application/json”> Country”],“name”:[1],“type”:[“chr”],“align”:[“left”]},{“label”:[“Women”],“name”:[2],“type”:[“dbl”],“align”:[“right”]},{“label”:[“Men”],“name”:[3],“type”:[“dbl”],“align”:[“right”]},{“label”:[“MenCluster”],“name”:[4],“type”:[“int”],“align”:[“right”]},{“label”:[“WomenCluster”],“name”:[5],“type”:[“int”],“align”:[“right”]}],“data”:[{“1”:“Argentina”,“2”:“39”,“3”:“40”,“4”:“1”,“5”:“4”},{“1”:“Australia”,“2”:“13”,“3”:“2”,“4”:“2”,“5”:“2”},{“1”:“Austria”,“2”:“24”,“3”:“36”,“4”:“1”,“5”:“2”},{“1”:“Belgium”,“2”:“21”,“3”:“11”,“4”:“2”,“5”:“2”},{“1”:“Bermuda”,“2”:“34”,“3”:“46”,“4”:“4”,“5”:“4”},{“1”:“Brazil”,“2”:“33”,“3”:“31”,“4”:“2”,“5”:“4”},{“1”:“Burma”,“2”:“46”,“3”:“43”,“4”:“1”,“5”:“4”},{“1”:“Canada”,“2”:“5”,“3”:“12”,“4”:“2”,“5”:“2”},{“1”:“Chile”,“2”:“35”,“3”:“34”,“4”:“2”,“5”:“4”},{“1”:“China”,“2”:“31”,“3”:“32”,“4”:“2”,“5”:“4”},{“1”:“Columbia”,“2”:“30”,“3”:“19”,“4”:“2”,“5”:“4”},{“1”:“CookIs”,“2”:“53”,“3”:“55”,“4”:“3”,“5”:“3”},{“1”:“Costa_Rica”,“2”:“36”,“3”:“38”,“4”:“1”,“5”:“4”},{“1”:“Czech”,“2”:“22”,“3”:“33”,“4”:“2”,“5”:“2”},{“1”:“Denmark”,“2”:“12”,“3”:“16”,“4”:“2”,“5”:“2”},{“1”:“DomRep”,“2”:“50”,“3”:“52”,“4”:“4”,“5”:“3”},{“1”:“Finland”,“2”:“16”,“3”:“15”,“4”:“2”,“5”:“2”},{“1”:“France”,“2”:“18”,“3”:“27”,“4”:“2”,“5”:“2”},{“1”:“GDR”,“2”:“20”,“3”:“10”,“4”:“2”,“5”:“2”},{“1”:“FRG”,“2”:“4”,“3”:“25”,“4”:“2”,“5”:“2”},{“1”:“GB_NI”,“2”:“7”,“3”:“6”,“4”:“2”,“5”:“2”},{“1”:“Greece”,“2”:“42”,“3”:“35”,“4”:“1”,“5”:“4”},{“1”:“Guatemala”,“2”:“51”,“3”:“42”,“4”:“1”,“5”:“3”},{“1”:“Hungary”,“2”:“19”,“3”:“30”,“4”:“2”,“5”:“2”},{“1”:“India”,“2”:“45”,“3”:“26”,“4”:“2”,“5”:“4”},{“1”:“Indonesia”,“2”:“49”,“3”:“48”,“4”:“4”,“5”:“3”},{“1”:“Ireland”,“2”:“6”,“3”:“28”,“4”:“2”,“5”:“2”},{“1”:“Israel”,“2”:“26”,“3”:“39”,“4”:“1”,“5”:“2”},{“1”:“Italy”,“2”:“11”,“3”:“17”,“4”:“2”,“5”:“2”},{“1”:“Japan”,“2”:“8”,“3”:“3”,“4”:“2”,“5”:“2”},{“1”:“Kenya”,“2”:“41”,“3”:“9”,“4”:“2”,“5”:“4”},{“1”:“Korea”,“2”:“28”,“3”:“37”,“4”:“1”,“5”:“4”},{“1”:“DPRKorea”,“2”:“40”,“3”:“21”,“4”:“2”,“5”:“4”},{“1”:“Luxembourg”,“2”:“37”,“3”:“44”,“4”:“1”,“5”:“4”},{“1”:“Malaysia”,“2”:“43”,“3”:“51”,“4”:“4”,“5”:“4”},{“1”:“Mauritius”,“2”:“54”,“3”:“50”,“4”:“4”,“5”:“1”},{“1”:“Mexico”,“2”:“23”,“3”:“8”,“4”:“2”,“5”:“2”},{“1”:“Netherlands”,“2”:“14”,“3”:“5”,“4”:“2”,“5”:“2”},{“1”:“NZ”,“2”:“2”,“3”:“7”,“4”:“2”,“5”:“2”},{“1”:“Norway”,“2”:“3”,“3”:“23”,“4”:“2”,“5”:“2”},{“1”:“PNG”,“2”:“52”,“3”:“47”,“4”:“4”,“5”:“3”},{“1”:“Philippines”,“2”:“47”,“3”:“45”,“4”:“4”,“5”:“3”},{“1”:“Poland”,“2”:“25”,“3”:“20”,“4”:“2”,“5”:“2”},{“1”:“Portugal”,“2”:“10”,“3”:“4”,“4”:“2”,“5”:“2”},{“1”:“Rumania”,“2”:“29”,“3”:“29”,“4”:“2”,“5”:“4”},{“1”:“Singapore”,“2”:“44”,“3”:“53”,“4”:“3”,“5”:“4”},{“1”:“Spain”,“2”:“27”,“3”:“22”,“4”:“2”,“5”:“2”},{“1”:“Sweden”,“2”:“17”,“3”:“14”,“4”:“2”,“5”:“2”},{“1”:“Switzerland”,“2”:“15”,“3”:“18”,“4”:“2”,“5”:“2”},{“1”:“Taipei”,“2”:“38”,“3”:“41”,“4”:“1”,“5”:“4”},{“1”:“Thailand”,“2”:“32”,“3”:“49”,“4”:“4”,“5”:“4”},{“1”:“Turkey”,“2”:“48”,“3”:“24”,“4”:“2”,“5”:“3”},{“1”:“USA”,“2”:“1”,“3”:“1”,“4”:“2”,“5”:“2”},{“1”:“USSR”,“2”:“9”,“3”:“13”,“4”:“2”,“5”:“2”},{“1”:“WSamoa”,“2”:“55”,“3”:“54”,“4”:“3”,“5”:“1”}],“options”:{“columns”:{“min”:{},“max”:[10]},“rows”:{“min”:[10],“max”:[10]},“pages”:{}}} |
| ANS: The s | plits app | ear to be reasonable since the table shows each countries’ corresponding group based on their performances. |
| e.) | ||
| #part (e) Perform hierar (i) Euclidean (ii) Remember (iii) Cut the (iv) Compare t | chical cl distance to put co dendogram o the clu | ustering on both men’s and women’s track record data using: and complete linkage and plot the dendograms. untry names on the dendograms, not just numbers 1-55. s so that you get 4 clusters and make a table that shows what countries are clustered together. sters you got with K-Means clustering with k = 4. |
| ```r W.dist <- dist M.dist <- dist | (Wtrack[, (Mtrack[, | 1:7]) 1:8]) |
| hc.Women <- h hc.Men <- hcl | clust(d = ust(d = M | W.dist, method = “complete”) .dist, method = “complete”) |
| #par(mfrow = c plot(hc.Women, label = W ``` | (2,1)) main = ” track$Cou | Women’s Complete linakge”, ntry, cex = 0.5, xlab = “Country”) |
| <img src=“jdk1 | 60_HW3_ST | AT326_files/figure-html/unnamed-chunk-3-1.png” width=“672” /> |
r plot(hc.Men, m label = M |
ain = “Me track$Cou | n’s Complete linakge”, ntry, cex = 0.5, xlab = “Country”) |
| <img src=“jdk1 | 60_HW3_ST | AT326_files/figure-html/unnamed-chunk-3-2.png” width=“672” /> |
| ```r Wtrack\(hc4 = c t3 <- table(Wt Mtrack\)hc4 = c t4 <- table(Mt | utree(hc. rack\(Coun utree(hc. rack\)Coun | Women, k = 4) try, Wtrack\(hc4) Men, k = 4) try, Mtrack\)hc4) |
| kable(list(t3, ``` | t4), capt | ion = “Women’s vs Men’s Clusters”) |
| <table class=” | kable_wra ’s vs Men | pper”> ’s Clusters |
| | | |:———–| |Argentina | |Australia | |Austria | |Belgium | |Bermuda | |Brazil | |Burma | |Canada | |Chile | |China | |Columbia | |CookIs | |Costa_Rica | |Czech | |Denmark | |DomRep | |DPRKorea | |Finland | |France | |FRG | |GB_NI | |GDR | |Greece | |Guatemala | |Hungary | |India | |Indonesia | |Ireland | |Israel | |Italy | |Japan | |Kenya | |Korea | |Luxembourg | |Malaysia | |Mauritius | |Mexico | |Netherlands | |Norway | |NZ | |Philippines | |PNG | |Poland | |Portugal | |Rumania | |Singapore | |Spain | |Sweden | |Switzerland | |Taipei | |Thailand | |Turkey | |USA | |USSR | |WSamoa | | 1| 2| –:|–:|- 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0| 0| 1| 1| 0| 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 1| 0| 0| 1| 0| 1| 1| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 1| 0| 0| 0| 0| 1| 0| 1| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 1| 0| 0| 1| 1| 0| 0| 1| 0| 1| 0| 0| | 3| 4| -:|–:| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| |
| | | |:———–| |Argentina | |Australia | |Austria | |Belgium | |Bermuda | |Brazil | |Burma | |Canada | |Chile | |China | |Columbia | |CookIs | |Costa_Rica | |Czech | |Denmark | |DomRep | |DPRKorea | |Finland | |France | |FRG | |GB_NI | |GDR | |Greece | |Guatemala | |Hungary | |India | |Indonesia | |Ireland | |Israel | |Italy | |Japan | |Kenya | |Korea | |Luxembourg | |Malaysia | |Mauritius | |Mexico | |Netherlands | |Norway | |NZ | |Philippines | |PNG | |Poland | |Portugal | |Rumania | |Singapore | |Spain | |Sweden | |Switzerland | |Taipei | |Thailand | |Turkey | |USA | |USSR | |WSamoa | | 1| 2| –:|–:|- 1| 0| 0| 1| 1| 0| 0| 1| 0| 0| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 0| 1| 1| 0| 0| 1| 0| 1| 0| 0| 0| 1| 1| 0| 0| 1| 0| 1| 0| 1| 1| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 1| 0| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| 0| 1| 0| 1| 0| 1| 1| 0| 0| 0| 0| 1| 0| 1| 0| 1| 0| 0| | 3| 4| -:|–:| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 1| |
522 home sales in 2002 from a city in the mid-west US. Objective: to predict sales price (dollars) using the following variables: sq feet, bed, bath,… highway
Note: Use log(price) as response - otherwise the normal linear model is not adequate.
#part (a) (i) Start by setting aside the last 100 observations as test data and keep the first 422 observations as training data.
For a range of \(\lambda\) values, fit a ridge and a lasso regression to the training data and determine the value of \(\lambda\) in each case using 10-fold cross validation.
Also obtain the ordinary least squares (OLS) estimates of the regression parameters.
Report the best \(\lambda\) values for both ridge and lasso.
# -------------Load Data-------------
RealEstate <- read.csv("RealEstate.csv")
RealEstate$id = NULL
RealEstate$price = log(RealEstate$price)
#-------------Train Data-------------
X <- model.matrix(price ~ . , data = RealEstate)[ , -1]
Xtrain <- X[1:422,]
Xtest <- X[423:522,]
Ytrain <- RealEstate$price[1:422]
Ytest <- RealEstate$price[423:522]
#dim(RealEstate)
#------------- FIT: Ridge & Lasso-------------
# Ridge regression for a set of lambdas: alpha=0
Fit.ridge <- cv.glmnet(Xtrain, Ytrain, alpha=0, nfolds=10)
best.lambda_Ridge <- Fit.ridge$lambda.min
best.lambda_Ridge
## [1] 0.03633256
# Lasso regression for a set of lambdas: alpha=1
Fit.lasso <- cv.glmnet(Xtrain, Ytrain, alpha=1, nfolds=10)
best.lambda_Lasso <- Fit.lasso$lambda.min
best.lambda_Lasso
## [1] 0.00180828
#-------------OBTAIN: OLS Estimates-------------
Fit.lm = lm(price ~ . , data = RealEstate[1:422,])
Fit.lm$coefficients
## (Intercept) sqfeet bed bath acyes
## 4.409174e+00 2.550016e-04 4.489280e-03 3.846894e-02 6.476310e-02
## cars poolyes year qualitylow qualitymedium
## 3.627883e-02 5.547058e-02 3.745838e-03 -4.088939e-01 -3.165030e-01
## lot highwayyes
## 5.509310e-06 -9.273726e-02
Ridge regression best \(\lambda\): 0.03633256
Lasso regression best \(\lambda\): 0.0009428388
#part (b) (i) Plot the solution paths for both ridge and lasso and include the OLS estimates. Add a horizontal line to indicate the selected \(\lambda\) value
## lambda values
grid <- 10^seq(3,-4,length=100)
## Ridge regression for a set of lambdas:
Fit.ridge <- glmnet(Xtrain, Ytrain, alpha = 0, lambda = grid) #ridge: alpha=0
## Lasso for a set of lambdas:
Fit.lasso <- glmnet(Xtrain, Ytrain, alpha = 1, lambda = grid) #lasso: alpha=1
p <- ncol(X)
MyColors <- rainbow(p)
## Calculate the effective degrees of freedom for each lamda
Xc <- scale(Xtrain, center = TRUE, scale = FALSE)
# center the data matrix
Xc.svd <- svd(Xc) # singular value decomposition of Xc
edf.ridge <- c()
for(lam in 1:length(grid)){
edf.ridge[lam] <- sum(Xc.svd$d^2/(Xc.svd$d^2 + grid[lam]))
}
edf.lasso <- c()
for(lam in 1:length(grid)){
edf.lasso[lam] <- sum(Xc.svd$d^2/(Xc.svd$d^2 + grid[lam]))
}
## Plot solution paths against effective df
par(mfrow =c(1,2))
matplot(log(Fit.ridge$lambda), t(Fit.ridge$beta),
type='l', col = MyColors, lty = 1,
xlab=expression(log(lambda)),
ylab="Regression coefficients",
main = "Ridge regression")
points(rep(-5, times = p), coef(Fit.lm)[-1],
col = MyColors, pch = 20)
legend("bottomright", legend=rownames(Fit.ridge$beta),
col=MyColors, lty=1, cex=0.5, bg = "white")
abline(v = log(best.lambda_Ridge), col = 'black')
matplot(log(Fit.lasso$lambda), t(Fit.lasso$beta), type='l',
col = MyColors, lty = 1,
xlab=expression(log(lambda)),
ylab="Regression coefficients",
main = "Lasso regression")
points(rep(-5, times = p), coef(Fit.lm)[-1], col = MyColors ,
pch = 20)
legend("bottomright", legend=rownames(Fit.ridge$beta),
col=MyColors, lty=1, cex=0.5, bg = "white")
abline(v = log(best.lambda_Lasso), col = 'black')
#part (c) (i) Provide a table with the estimated regression coefficients for ridge, lasso, and OLS.
Comment on the amount of shrinkage is applied in ridge and lasso for the selected \(\lambda\).
Did lasso exclude any of the predictors?
coef.ridge <- predict(Fit.ridge, s = best.lambda_Ridge, type = "coefficients")
coef.lasso <- predict(Fit.lasso, s = best.lambda_Lasso, type = "coefficients")
data.frame(Full = coef(Fit.lm), Ridge = as.numeric(coef.ridge),
Lasso = as.numeric(coef.lasso))
ANS:
Lasso did not exclude any predictors
Ridge regression has less shrinkage with “cars” & “bath”
Lasso has more shrinkage with “bed”.
#part (d) (i) Calculate the mean squared prediction errors (MSPE) for the training data you set aside in part (a). (ii) Which method performs best?
## Compare MSPE for Ridge, Lasso and OLS:
Pred.ridge <- predict(Fit.ridge, s = best.lambda_Ridge, newx=Xtest)
mean((Pred.ridge - Ytest)^2 )
## [1] 0.03009848
Pred.lasso <- predict(Fit.lasso, s = best.lambda_Lasso, newx=Xtest)
mean((Pred.lasso - Ytest)^2 )
## [1] 0.03035896
Pred.lm <- predict(Fit.lm, newdata = RealEstate[423:522,])
mean((Pred.lm - Ytest)^2 )
## [1] 0.03083113
ANS: When calculating the MSPE for the training data, of the three methods, Ridge regression does best (0.03009852).
End of Assignment Jared //