STAT 326: HW 3

Question 1: ISLR

12.2

dist = as.dist(matrix(c(0, 0.3, 0.4, 0.7, 
                     0.3, 0, 0.5, 0.8,
                     0.4, 0.5, 0.0, 0.45,
                     0.7, 0.8, 0.45, 0.0), nrow=4))
plot(hclust(dist, method="complete"))

part a)
Break 1: 0.8
Break Between 1 & 2: 0.3
Break Between 3 & 4: around 0.45

plot(hclust(dist, method="single"))

part b)
Break #1 = 0.45
Break #2 = 0.4
Break #3 = 0.3

part c) Cutting complete hierarchical dendogram between [0.45 & 0.8] gives us two groups: (1,2) & (3,4)
part d) Cutting single hierarchical dendogram between [0.4,0.45] gives us two groups: (4) & (3,1,2)

part e)

plot(hclust(dist, method="complete"), labels=c(2,1,4,3))

## Question 2:	Track re	cords data - PCA
Argentina Australia Austria Belgium Bermuda Brazil Burma Canada Chile China Columbia CookIs Costa_Rica Czech Denmark DomRep Finland France GDR FRG GB_NI Greece Guatemala Hungary India Indonesia Ireland Israel Italy Japan Kenya Korea DPRKorea Luxembourg Malaysia Mauritius Mexico Netherlands NZ Norway PNG Philippines Poland Portugal Rumania Singapore Spain Sweden Switzerland Taipei Thailand Turkey USA USSR WSamoa	39 13 24 21 34 33 46 5 35 31 30 53 36 22 12 50 16 18 20 4 7 42 51 19 45 49 6 26 11 8 41 28 40 37 43 54 23 14 2 3 52 47 25 10 29 44 27 17 15 38 32 48 1 9 55	40 2 36 11 46 31 43 12 34 32 19 55 38 33 16 52 15 27 10 25 6 35 42 30 26 48 28 39 17 3 9 37 21 44 51 50 8 5 7 23 47 45 20 4 29 53 22 14 18 41 49 24 1 13 54
(ii) Does this ANS:	ranking	correspond with your intuition of athletic excellence (when it comes to running) for the various countries?
My initial imp	ression a	bout the data is that it seems to fit the assumption of temperate nations having faster marathon times; due to their fair weather, they are able to train for longer races over multiple seasons year-round. By comparison, Samoa and the Philippines are small island nations with hot climates year-round; making it very difficult to train for long distances overall.
(iii) Is the r	anking di	fferent for men and women?
ANS: For t	he most p	art the ranking is different for men and women, but it is important to notice that when they do differ it seems to be very drastic in those circumstances (Norway, Kenya, FRG, Turkey)
#————- ## Question 3:	——— Track re	————————————————————————————————- cords - Clustering
Consider again Here the objec *Hint: you can	the trac tive is t get W(Ck	k records data for men and women from problem 1. o examine how countries cluster together based on the track record data. ) directly in R by: <kmeans.fit>$tot.withinss*
#part (a) Perform K-Mean (i) Plot the t	s cluster otal with	ing with k = 2,…15 for both data sets: in-group sum-of-squares W(Ck) against k.
```r Wtrack_k.means #clusters 2 - for (i in 2:15 km.out <- km Wtrack_k.mea } par(mfrow= c(2 plot(2:15, Wtr main = “W ylab =”T xlab = “N	= c() 15 ){ eans(Wtra ns[i-1] =
Mtrack_k.means for (i in 2:15 km.out <- km Mtrack_k.mea } plot(2:15, Mtr main = “M ylab =”T xlab = “N ```	= c() ){ eans(Mtra ns[i-1] =
<img src=“jdk1	60_HW3_ST	AT326_files/figure-html/3a K means-1.png” width=“672” />
(ii) While the	W(Ck) ge	nerally decreases with increasing k, is there a point where the differences become small (i.e. does the graph show an “elbow”)?
ANS: There	appears	to be an elbow point at k=5, as it tends to taper off pretty quickly after hitting k=6. As a result, I would recommend using k=5.
#part (b) For both data	sets, mak	e a table that lists the countries in each cluster according to K-Means clustering with k = 4.
```r k4_Women <- km k4_Women.table Cluster.frame	eans(Wtra <- table <- data.f	ck[,1:7], center = 4, nstart = 20) (Wtrack$Country, k4_Women$cluster) rame(k4_Women$cluster)
k4_Men <- kmea k4_Men.table <	ns(Mtrack - table(M	[,1:8], center = 4, nstart = 20) track$Country, k4_Men$cluster)
kable(list(k4_ ```	Women.tab	le, k4_Men.table), caption = “Women’s vs Men’s Clusters (k=4)”)
<table class=”	kable_wra ’s vs Men	pper”> ’s Clusters (k=4)
\| \| \|:———–\| \|Argentina \| \|Australia \| \|Austria \| \|Belgium \| \|Bermuda \| \|Brazil \| \|Burma \| \|Canada \| \|Chile \| \|China \| \|Columbia \| \|CookIs \| \|Costa_Rica \| \|Czech \| \|Denmark \| \|DomRep \| \|DPRKorea \| \|Finland \| \|France \| \|FRG \| \|GB_NI \| \|GDR \| \|Greece \| \|Guatemala \| \|Hungary \| \|India \| \|Indonesia \| \|Ireland \| \|Israel \| \|Italy \| \|Japan \| \|Kenya \| \|Korea \| \|Luxembourg \| \|Malaysia \| \|Mauritius \| \|Mexico \| \|Netherlands \| \|Norway \| \|NZ \| \|Philippines \| \|PNG \| \|Poland \| \|Portugal \| \|Rumania \| \|Singapore \| \|Spain \| \|Sweden \| \|Switzerland \| \|Taipei \| \|Thailand \| \|Turkey \| \|USA \| \|USSR \| \|WSamoa \|	1\| 2\| –:\|–:\|- 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 1\| 0\|	3\| 4\| -:\|–:\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\|

\| \| \|:———–\| \|Argentina \| \|Australia \| \|Austria \| \|Belgium \| \|Bermuda \| \|Brazil \| \|Burma \| \|Canada \| \|Chile \| \|China \| \|Columbia \| \|CookIs \| \|Costa_Rica \| \|Czech \| \|Denmark \| \|DomRep \| \|DPRKorea \| \|Finland \| \|France \| \|FRG \| \|GB_NI \| \|GDR \| \|Greece \| \|Guatemala \| \|Hungary \| \|India \| \|Indonesia \| \|Ireland \| \|Israel \| \|Italy \| \|Japan \| \|Kenya \| \|Korea \| \|Luxembourg \| \|Malaysia \| \|Mauritius \| \|Mexico \| \|Netherlands \| \|Norway \| \|NZ \| \|Philippines \| \|PNG \| \|Poland \| \|Portugal \| \|Rumania \| \|Singapore \| \|Spain \| \|Sweden \| \|Switzerland \| \|Taipei \| \|Thailand \| \|Turkey \| \|USA \| \|USSR \| \|WSamoa \|	1\| 2\| –:\|–:\|- 1\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\|	3\| 4\| -:\|–:\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\|

#part (c) For both data	sets, vis	ualize the clustering according to K-Means clustering with k = 4 in two ways.
(i) First, plo (ii) Second, p (iii) To get a	t the fir lot the 1 better v	st two principal component scores from Problem 1 and color them according to the clusters. 00 m records versus the marathon records and color according to clusters. iew of cluster groups plot the names of the countries instead of points.
```r par(mfrow=c(1, plot(Wtrack_pc xlab = “P main =”C text(Wtrack_pc col = alp	2)) $x[,1], W C1”, ylab lustering $x[,1], W ha(k4_Wom
plot(Mtrack_pc xlab = “P main =”C text(Mtrack_pc col = alp ```	$x[,1], M C1”, ylab lustering $x[,1], M ha(k4_Men	track_pc$x[,2], = "PC2", type = "n", for Men at k = 4", cex.main = 0.7) track_pc$x[,2], labels = Mtrack$Country, $cluster, 0.7), cex = 0.3)
<img src=“jdk1 Shows us the g	60_HW3_ST eneral tr	AT326_files/figure-html/cluster names-1.png” width=“672” /> end when observing the spread between Men and Women relative to their respective countries. This gives us an idea of how far outliers are between Men’s & Women’s teams for the same clusters & countries.
```r par(mfrow = c( plot(Wtrack$`1 col = k4_ xlab = ’1 main = “W text(Wtrack$`1 col = k4_	1,2)) 00m`, Wtr Women$clu 00m',ylab omen\'s 1 00m`, Wtr Women$clu
plot(Mtrack$`1 col = k4_ xlab = '1 main = "M text(Mtrack$`1 col = k4_ ```	00m`, Mtr Men$clust 00m',ylab en\'s 100 00m`, Mtr Men$clust	ack$Marathon, type = "n", er, = 'Marathon', m vs Marathon", cex.main = 0.85) ack$Marathon, labels = Mtrack$Country, er, cex = 0.3)
<img src=“jdk1 (iv) Comment o	60_HW3_ST n how the	AT326_files/figure-html/clusterplot-1.png” width=“672” /> clusters relate to the PC scores and to the two chose predictors (100 m and marathon). Hint: You can use the type = “n” option in plot() and then use the text() function to put the country names on the plot
ANS: Based	on the c	luster locations, we can determine that the bottom left corner is where the strongest performers are (USA is a good reference point since they are 1st for both men & women), and the top right corner is where the weaker performers are. Scores of women lean more towards the 100m race for determining rank, while for men it relies more on the marathon.
#part (d) Comment on the	clusters	, e.g. are there any surprises or do you think that the clusters represent a reasonable split?
d.)
`r Country_Total. Country_Total. Country_Total.`	Rank.Fram Rank.Fram Rank.Fram	e$MenCluster <-k4_Men$cluster e$WomenCluster <-k4_Women$cluster e
<div data-page <script data {“columns”:[{”	dtable=“f -pagedtab label”:[”	alse”> le-source type=“application/json”> Country”],“name”:[1],“type”:[“chr”],“align”:[“left”]},{“label”:[“Women”],“name”:[2],“type”:[“dbl”],“align”:[“right”]},{“label”:[“Men”],“name”:[3],“type”:[“dbl”],“align”:[“right”]},{“label”:[“MenCluster”],“name”:[4],“type”:[“int”],“align”:[“right”]},{“label”:[“WomenCluster”],“name”:[5],“type”:[“int”],“align”:[“right”]}],“data”:[{“1”:“Argentina”,“2”:“39”,“3”:“40”,“4”:“1”,“5”:“4”},{“1”:“Australia”,“2”:“13”,“3”:“2”,“4”:“2”,“5”:“2”},{“1”:“Austria”,“2”:“24”,“3”:“36”,“4”:“1”,“5”:“2”},{“1”:“Belgium”,“2”:“21”,“3”:“11”,“4”:“2”,“5”:“2”},{“1”:“Bermuda”,“2”:“34”,“3”:“46”,“4”:“4”,“5”:“4”},{“1”:“Brazil”,“2”:“33”,“3”:“31”,“4”:“2”,“5”:“4”},{“1”:“Burma”,“2”:“46”,“3”:“43”,“4”:“1”,“5”:“4”},{“1”:“Canada”,“2”:“5”,“3”:“12”,“4”:“2”,“5”:“2”},{“1”:“Chile”,“2”:“35”,“3”:“34”,“4”:“2”,“5”:“4”},{“1”:“China”,“2”:“31”,“3”:“32”,“4”:“2”,“5”:“4”},{“1”:“Columbia”,“2”:“30”,“3”:“19”,“4”:“2”,“5”:“4”},{“1”:“CookIs”,“2”:“53”,“3”:“55”,“4”:“3”,“5”:“3”},{“1”:“Costa_Rica”,“2”:“36”,“3”:“38”,“4”:“1”,“5”:“4”},{“1”:“Czech”,“2”:“22”,“3”:“33”,“4”:“2”,“5”:“2”},{“1”:“Denmark”,“2”:“12”,“3”:“16”,“4”:“2”,“5”:“2”},{“1”:“DomRep”,“2”:“50”,“3”:“52”,“4”:“4”,“5”:“3”},{“1”:“Finland”,“2”:“16”,“3”:“15”,“4”:“2”,“5”:“2”},{“1”:“France”,“2”:“18”,“3”:“27”,“4”:“2”,“5”:“2”},{“1”:“GDR”,“2”:“20”,“3”:“10”,“4”:“2”,“5”:“2”},{“1”:“FRG”,“2”:“4”,“3”:“25”,“4”:“2”,“5”:“2”},{“1”:“GB_NI”,“2”:“7”,“3”:“6”,“4”:“2”,“5”:“2”},{“1”:“Greece”,“2”:“42”,“3”:“35”,“4”:“1”,“5”:“4”},{“1”:“Guatemala”,“2”:“51”,“3”:“42”,“4”:“1”,“5”:“3”},{“1”:“Hungary”,“2”:“19”,“3”:“30”,“4”:“2”,“5”:“2”},{“1”:“India”,“2”:“45”,“3”:“26”,“4”:“2”,“5”:“4”},{“1”:“Indonesia”,“2”:“49”,“3”:“48”,“4”:“4”,“5”:“3”},{“1”:“Ireland”,“2”:“6”,“3”:“28”,“4”:“2”,“5”:“2”},{“1”:“Israel”,“2”:“26”,“3”:“39”,“4”:“1”,“5”:“2”},{“1”:“Italy”,“2”:“11”,“3”:“17”,“4”:“2”,“5”:“2”},{“1”:“Japan”,“2”:“8”,“3”:“3”,“4”:“2”,“5”:“2”},{“1”:“Kenya”,“2”:“41”,“3”:“9”,“4”:“2”,“5”:“4”},{“1”:“Korea”,“2”:“28”,“3”:“37”,“4”:“1”,“5”:“4”},{“1”:“DPRKorea”,“2”:“40”,“3”:“21”,“4”:“2”,“5”:“4”},{“1”:“Luxembourg”,“2”:“37”,“3”:“44”,“4”:“1”,“5”:“4”},{“1”:“Malaysia”,“2”:“43”,“3”:“51”,“4”:“4”,“5”:“4”},{“1”:“Mauritius”,“2”:“54”,“3”:“50”,“4”:“4”,“5”:“1”},{“1”:“Mexico”,“2”:“23”,“3”:“8”,“4”:“2”,“5”:“2”},{“1”:“Netherlands”,“2”:“14”,“3”:“5”,“4”:“2”,“5”:“2”},{“1”:“NZ”,“2”:“2”,“3”:“7”,“4”:“2”,“5”:“2”},{“1”:“Norway”,“2”:“3”,“3”:“23”,“4”:“2”,“5”:“2”},{“1”:“PNG”,“2”:“52”,“3”:“47”,“4”:“4”,“5”:“3”},{“1”:“Philippines”,“2”:“47”,“3”:“45”,“4”:“4”,“5”:“3”},{“1”:“Poland”,“2”:“25”,“3”:“20”,“4”:“2”,“5”:“2”},{“1”:“Portugal”,“2”:“10”,“3”:“4”,“4”:“2”,“5”:“2”},{“1”:“Rumania”,“2”:“29”,“3”:“29”,“4”:“2”,“5”:“4”},{“1”:“Singapore”,“2”:“44”,“3”:“53”,“4”:“3”,“5”:“4”},{“1”:“Spain”,“2”:“27”,“3”:“22”,“4”:“2”,“5”:“2”},{“1”:“Sweden”,“2”:“17”,“3”:“14”,“4”:“2”,“5”:“2”},{“1”:“Switzerland”,“2”:“15”,“3”:“18”,“4”:“2”,“5”:“2”},{“1”:“Taipei”,“2”:“38”,“3”:“41”,“4”:“1”,“5”:“4”},{“1”:“Thailand”,“2”:“32”,“3”:“49”,“4”:“4”,“5”:“4”},{“1”:“Turkey”,“2”:“48”,“3”:“24”,“4”:“2”,“5”:“3”},{“1”:“USA”,“2”:“1”,“3”:“1”,“4”:“2”,“5”:“2”},{“1”:“USSR”,“2”:“9”,“3”:“13”,“4”:“2”,“5”:“2”},{“1”:“WSamoa”,“2”:“55”,“3”:“54”,“4”:“3”,“5”:“1”}],“options”:{“columns”:{“min”:{},“max”:[10]},“rows”:{“min”:[10],“max”:[10]},“pages”:{}}}
ANS: The s	plits app	ear to be reasonable since the table shows each countries’ corresponding group based on their performances.
e.)
#part (e) Perform hierar (i) Euclidean (ii) Remember (iii) Cut the (iv) Compare t	chical cl distance to put co dendogram o the clu	ustering on both men’s and women’s track record data using: and complete linkage and plot the dendograms. untry names on the dendograms, not just numbers 1-55. s so that you get 4 clusters and make a table that shows what countries are clustered together. sters you got with K-Means clustering with k = 4.
```r W.dist <- dist M.dist <- dist	(Wtrack[, (Mtrack[,	1:7]) 1:8])
hc.Women <- h hc.Men <- hcl	clust(d = ust(d = M	W.dist, method = “complete”) .dist, method = “complete”)
#par(mfrow = c plot(hc.Women, label = W ```	(2,1)) main = ” track$Cou	Women’s Complete linakge”, ntry, cex = 0.5, xlab = “Country”)
<img src=“jdk1	60_HW3_ST	AT326_files/figure-html/unnamed-chunk-3-1.png” width=“672” />
`r plot(hc.Men, m label = M`	ain = “Me track$Cou	n’s Complete linakge”, ntry, cex = 0.5, xlab = “Country”)
<img src=“jdk1	60_HW3_ST	AT326_files/figure-html/unnamed-chunk-3-2.png” width=“672” />
```r Wtrack$hc4 = c t3 <- table(Wt Mtrack$hc4 = c t4 <- table(Mt	utree(hc. rack$Coun utree(hc. rack$Coun	Women, k = 4) try, Wtrack$hc4) Men, k = 4) try, Mtrack$hc4)
kable(list(t3, ```	t4), capt	ion = “Women’s vs Men’s Clusters”)
<table class=”	kable_wra ’s vs Men	pper”> ’s Clusters
\| \| \|:———–\| \|Argentina \| \|Australia \| \|Austria \| \|Belgium \| \|Bermuda \| \|Brazil \| \|Burma \| \|Canada \| \|Chile \| \|China \| \|Columbia \| \|CookIs \| \|Costa_Rica \| \|Czech \| \|Denmark \| \|DomRep \| \|DPRKorea \| \|Finland \| \|France \| \|FRG \| \|GB_NI \| \|GDR \| \|Greece \| \|Guatemala \| \|Hungary \| \|India \| \|Indonesia \| \|Ireland \| \|Israel \| \|Italy \| \|Japan \| \|Kenya \| \|Korea \| \|Luxembourg \| \|Malaysia \| \|Mauritius \| \|Mexico \| \|Netherlands \| \|Norway \| \|NZ \| \|Philippines \| \|PNG \| \|Poland \| \|Portugal \| \|Rumania \| \|Singapore \| \|Spain \| \|Sweden \| \|Switzerland \| \|Taipei \| \|Thailand \| \|Turkey \| \|USA \| \|USSR \| \|WSamoa \|	1\| 2\| –:\|–:\|- 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 1\| 0\| 0\| 1\| 1\| 0\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 0\|	3\| 4\| -:\|–:\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\|

\| \| \|:———–\| \|Argentina \| \|Australia \| \|Austria \| \|Belgium \| \|Bermuda \| \|Brazil \| \|Burma \| \|Canada \| \|Chile \| \|China \| \|Columbia \| \|CookIs \| \|Costa_Rica \| \|Czech \| \|Denmark \| \|DomRep \| \|DPRKorea \| \|Finland \| \|France \| \|FRG \| \|GB_NI \| \|GDR \| \|Greece \| \|Guatemala \| \|Hungary \| \|India \| \|Indonesia \| \|Ireland \| \|Israel \| \|Italy \| \|Japan \| \|Kenya \| \|Korea \| \|Luxembourg \| \|Malaysia \| \|Mauritius \| \|Mexico \| \|Netherlands \| \|Norway \| \|NZ \| \|Philippines \| \|PNG \| \|Poland \| \|Portugal \| \|Rumania \| \|Singapore \| \|Spain \| \|Sweden \| \|Switzerland \| \|Taipei \| \|Thailand \| \|Turkey \| \|USA \| \|USSR \| \|WSamoa \|	1\| 2\| –:\|–:\|- 1\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 1\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 1\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 1\| 0\| 0\|	3\| 4\| -:\|–:\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 0\| 1\|

Question 4: Real estate data - Regression

522 home sales in 2002 from a city in the mid-west US. Objective: to predict sales price (dollars) using the following variables: sq feet, bed, bath,… highway

Note: Use log(price) as response - otherwise the normal linear model is not adequate.

#part (a) (i) Start by setting aside the last 100 observations as test data and keep the first 422 observations as training data.

For a range of $\lambda$ values, fit a ridge and a lasso regression to the training data and determine the value of $\lambda$ in each case using 10-fold cross validation.
Also obtain the ordinary least squares (OLS) estimates of the regression parameters.
Report the best $\lambda$ values for both ridge and lasso.

# -------------Load Data-------------
RealEstate <- read.csv("RealEstate.csv")
RealEstate$id = NULL
RealEstate$price = log(RealEstate$price)

#-------------Train Data-------------
X <- model.matrix(price ~ . , data = RealEstate)[ , -1]
Xtrain <- X[1:422,]
Xtest <- X[423:522,]
Ytrain <- RealEstate$price[1:422]
Ytest <- RealEstate$price[423:522]
#dim(RealEstate)

#------------- FIT: Ridge & Lasso-------------

# Ridge regression for a set of lambdas: alpha=0
Fit.ridge <- cv.glmnet(Xtrain, Ytrain, alpha=0, nfolds=10) 
best.lambda_Ridge <- Fit.ridge$lambda.min
best.lambda_Ridge

## [1] 0.03633256

# Lasso regression for a set of lambdas: alpha=1
Fit.lasso <- cv.glmnet(Xtrain, Ytrain, alpha=1, nfolds=10)  
best.lambda_Lasso <- Fit.lasso$lambda.min
best.lambda_Lasso

## [1] 0.00180828

#-------------OBTAIN: OLS Estimates-------------
Fit.lm = lm(price ~ . , data = RealEstate[1:422,])
Fit.lm$coefficients

##   (Intercept)        sqfeet           bed          bath         acyes 
##  4.409174e+00  2.550016e-04  4.489280e-03  3.846894e-02  6.476310e-02 
##          cars       poolyes          year    qualitylow qualitymedium 
##  3.627883e-02  5.547058e-02  3.745838e-03 -4.088939e-01 -3.165030e-01 
##           lot    highwayyes 
##  5.509310e-06 -9.273726e-02

Ridge regression best $\lambda$: 0.03633256
Lasso regression best $\lambda$: 0.0009428388

#part (b) (i) Plot the solution paths for both ridge and lasso and include the OLS estimates. Add a horizontal line to indicate the selected $\lambda$ value

## lambda values
grid <- 10^seq(3,-4,length=100)

## Ridge regression for a set of lambdas:
Fit.ridge <- glmnet(Xtrain, Ytrain, alpha = 0, lambda = grid) #ridge: alpha=0 

## Lasso for a set of lambdas:
Fit.lasso <- glmnet(Xtrain, Ytrain, alpha = 1, lambda = grid) #lasso: alpha=1 

p <- ncol(X)
MyColors <- rainbow(p)
## Calculate the effective degrees of freedom for each lamda
Xc <- scale(Xtrain, center = TRUE, scale = FALSE) 
# center the data matrix
Xc.svd <- svd(Xc) # singular value decomposition of Xc
edf.ridge <- c()
for(lam in 1:length(grid)){ 
  edf.ridge[lam] <- sum(Xc.svd$d^2/(Xc.svd$d^2 + grid[lam]))
}
edf.lasso <- c()
for(lam in 1:length(grid)){ 
  edf.lasso[lam] <- sum(Xc.svd$d^2/(Xc.svd$d^2 + grid[lam]))
}

## Plot solution paths against effective df
par(mfrow =c(1,2))
matplot(log(Fit.ridge$lambda), t(Fit.ridge$beta), 
        type='l', col = MyColors, lty = 1,
        xlab=expression(log(lambda)), 
        ylab="Regression coefficients", 
        main = "Ridge regression")
points(rep(-5, times = p), coef(Fit.lm)[-1], 
       col = MyColors, pch = 20)
legend("bottomright", legend=rownames(Fit.ridge$beta),
       col=MyColors, lty=1, cex=0.5, bg = "white")
abline(v = log(best.lambda_Ridge), col = 'black')

matplot(log(Fit.lasso$lambda), t(Fit.lasso$beta), type='l', 
        col = MyColors, lty = 1, 
        xlab=expression(log(lambda)), 
        ylab="Regression coefficients", 
        main = "Lasso regression")
points(rep(-5, times = p), coef(Fit.lm)[-1], col = MyColors , 
       pch = 20)
legend("bottomright", legend=rownames(Fit.ridge$beta),
       col=MyColors, lty=1, cex=0.5, bg = "white")
abline(v = log(best.lambda_Lasso), col = 'black')

#part (c) (i) Provide a table with the estimated regression coefficients for ridge, lasso, and OLS.

Comment on the amount of shrinkage is applied in ridge and lasso for the selected $\lambda$.
Did lasso exclude any of the predictors?

coef.ridge <- predict(Fit.ridge, s = best.lambda_Ridge, type = "coefficients")
coef.lasso <- predict(Fit.lasso, s = best.lambda_Lasso, type = "coefficients")
data.frame(Full = coef(Fit.lm), Ridge = as.numeric(coef.ridge), 
           Lasso = as.numeric(coef.lasso))

ANS:

Lasso did not exclude any predictors
Ridge regression has less shrinkage with “cars” & “bath”
Lasso has more shrinkage with “bed”.

#part (d) (i) Calculate the mean squared prediction errors (MSPE) for the training data you set aside in part (a). (ii) Which method performs best?

## Compare MSPE for Ridge, Lasso and OLS:
Pred.ridge <- predict(Fit.ridge, s = best.lambda_Ridge, newx=Xtest)
mean((Pred.ridge - Ytest)^2 )

## [1] 0.03009848

Pred.lasso <- predict(Fit.lasso, s = best.lambda_Lasso, newx=Xtest)
mean((Pred.lasso - Ytest)^2 )

## [1] 0.03035896

Pred.lm <- predict(Fit.lm, newdata = RealEstate[423:522,])
mean((Pred.lm - Ytest)^2 )

## [1] 0.03083113

ANS: When calculating the MSPE for the training data, of the three methods, Ridge regression does best (0.03009852).

End of Assignment Jared //

STAT 326: HW 3

Jared Kokinos

4/19/2022

Question 1: ISLR

Question 4: Real estate data - Regression