1 - Clustering

Exercise 1

i. Question 9

9a)

data(USArrests)
distances <- dist(USArrests)
allArrests <- hclust(distances, method = "complete")  # complete linkage (the hclust default)
plot(allArrests)

9b)

clusters <- cutree(allArrests, 3)
cluster1 <- USArrests[clusters == 1,]
cluster2 <- USArrests[clusters == 2,]
cluster3 <- USArrests[clusters == 3,]
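
As a quick sanity check (not shown in the original output), the cluster sizes can be tabulated before printing each group:

table(clusters)  # number of states assigned to each of the three clusters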

Cluster 1:

print(cluster1)
##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## California        9.0     276       91 40.6
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Illinois         10.4     249       83 24.0
## Louisiana        15.4     249       66 22.2
## Maryland         11.3     300       67 27.8
## Michigan         12.1     255       74 35.1
## Mississippi      16.1     259       44 17.1
## Nevada           12.2     252       81 46.0
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## South Carolina   14.4     279       48 22.5

Cluster 2:

print(cluster2)
##               Murder Assault UrbanPop Rape
## Arkansas         8.8     190       50 19.5
## Colorado         7.9     204       78 38.7
## Georgia         17.4     211       60 25.8
## Massachusetts    4.4     149       85 16.3
## Missouri         9.0     178       70 28.2
## New Jersey       7.4     159       89 18.8
## Oklahoma         6.6     151       68 20.0
## Oregon           4.9     159       67 29.3
## Rhode Island     3.4     174       87  8.3
## Tennessee       13.2     188       59 26.9
## Texas           12.7     201       80 25.5
## Virginia         8.5     156       63 20.7
## Washington       4.0     145       73 26.2
## Wyoming          6.8     161       60 15.6

Cluster 3:

print(cluster3)
##               Murder Assault UrbanPop Rape
## Connecticut      3.3     110       77 11.1
## Hawaii           5.3      46       83 20.2
## Idaho            2.6     120       54 14.2
## Indiana          7.2     113       65 21.0
## Iowa             2.2      56       57 11.3
## Kansas           6.0     115       66 18.0
## Kentucky         9.7     109       52 16.3
## Maine            2.1      83       51  7.8
## Minnesota        2.7      72       66 14.9
## Montana          6.0     109       53 16.4
## Nebraska         4.3     102       62 16.5
## New Hampshire    2.1      57       56  9.5
## North Dakota     0.8      45       44  7.3
## Ohio             7.3     120       75 21.4
## Pennsylvania     6.3     106       72 14.9
## South Dakota     3.8      86       45 12.8
## Utah             3.2     120       80 22.9
## Vermont          2.2      48       32 11.2
## West Virginia    5.7      81       39  9.3
## Wisconsin        2.6      53       66 10.8

9c)

scaledDistances <- dist(scale(USArrests))
scaledArrests <- hclust(scaledDistances)
plot(scaledArrests)

9d)

Scaling should be done before the dissimilarities are computed. The four variables are measured on very different scales (Assault runs into the hundreds while Murder stays below 20), so without standardization Assault dominates the Euclidean distances and effectively determines the clusters by itself. Scaling each variable to unit standard deviation first gives all four variables comparable influence; trying to normalize after the dissimilarities are computed is too late, since the distances have already been distorted.
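
To see how much the scaling matters here, the two sets of assignments can be cross-tabulated (a quick check using the allArrests and scaledArrests fits from above; states off the diagonal changed cluster):

scaledClusters <- cutree(scaledArrests, 3)
table(unscaled = clusters, scaled = scaledClusters)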

ii. Question 10

10a)

set.seed(12)
clusters <- c(rep(1, 20), rep(2, 20), rep(3, 20))  # true class labels
df <- matrix(rnorm(60 * 50, mean = 0, sd = 0.001), ncol = 50)
df[1:20, ]  <- df[1:20, ]  + 10  # shift class 1 up by 10
df[41:60, ] <- df[41:60, ] - 10  # shift class 3 down by 10; class 2 stays at 0

10b)

pca <- prcomp(df)
plot(pca$x[, 1:2], col = clusters, pch = 19,
     xlab = "First principal component", ylab = "Second principal component")
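
The three classes appear as three widely separated groups in the plot, as expected given the mean shifts of +10 and -10 relative to the tiny within-class noise.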

2 - Linear Regression

Exercise 3

i. Question 4

4a)

We would expect the training RSS of the cubic model to be lower. The linear model is a special case of the cubic model (with the quadratic and cubic coefficients set to zero), so the more flexible cubic fit can always achieve a training RSS at least as small, even if the extra flexibility only chases noise.

4b)

For the test RSS we would expect the linear model to be lower. Since the true relationship is linear, the cubic model's extra terms mostly fit noise in the training data, which tends to increase the error on new observations.

4c)

As in (a), we would expect the cubic training RSS to be lower: regardless of the true relationship, the cubic model is more flexible and can always fit the training data at least as well as the linear model.

4d)

There is not enough information to tell; it depends on how far the true relationship is from linear. If it is only mildly non-linear, the linear model may still achieve the lower test RSS, whereas if it is strongly non-linear, the cubic model is likely to perform better. A small simulation of the linear-truth case is sketched below.
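
The following sketch illustrates parts (a) and (b) under a truly linear relationship (the sample size, coefficients, and noise level are arbitrary choices, not from the exercise): the cubic fit always attains the lower training RSS, while the linear fit typically wins on fresh test data.

set.seed(1)
x_train <- runif(100, 0, 10)
y_train <- 2 + 5 * x_train + rnorm(100)
x_test <- runif(100, 0, 10)
y_test <- 2 + 5 * x_test + rnorm(100)
fit_lin <- lm(y_train ~ x_train)
fit_cub <- lm(y_train ~ poly(x_train, 3))
sum(resid(fit_lin)^2)  # training RSS, linear
sum(resid(fit_cub)^2)  # training RSS, cubic (never larger than the linear one)
sum((y_test - predict(fit_lin, data.frame(x_train = x_test)))^2)  # test RSS, linear
sum((y_test - predict(fit_cub, data.frame(x_train = x_test)))^2)  # test RSS, cubic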

ii. Question 10

10a)

Carseats <- read.csv(file = 'Downloads/Carseats.csv')
carseatsdata <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseatsdata)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10b)

Only Price and USYes have statistically significant p-values. The Price coefficient is negative: each one-unit increase in price is associated with roughly 0.054 fewer unit sales, holding the other predictors fixed. The positive USYes coefficient means that, on average, stores in the US sell about 1.2 more units than stores outside the US. UrbanYes is not significant.

10c)

Sales = 13.043 - 0.054(Price) - 0.022(UrbanYes) + 1.201(USYes), where UrbanYes and USYes are dummy variables equal to 1 if the store is in an urban area or in the US, respectively, and 0 otherwise.
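
The dummy coding can be verified directly with predict(); for example, the fitted sales for a hypothetical urban US store charging a price of 120 (these input values are made up for illustration):

predict(carseatsdata, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))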

10d)

We reject the null hypothesis for price and USYes as they have very low p-values.

10e)

carseatsSignificant <- lm(Sales ~ Price + US, data = Carseats)
summary(carseatsSignificant)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10f)

Both models fit the data about equally well: the R-squared is roughly 0.239 for each, and dropping the non-significant Urban term slightly improves the adjusted R-squared (0.2354 versus 0.2335).

10g)

confint(carseatsSignificant)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
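
Neither interval contains zero, consistent with both coefficients being significantly different from zero at the 5% level.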

10h)

plot(carseatsSignificant)

Based on the residual plots, there is no evidence of extreme outliers (the standardized residuals all fall within about ±3), but the residuals-versus-leverage plot shows several points with leverage well above average.
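
This can be quantified with the hat values (a quick check, not part of the original answer); for this model the average leverage is (p + 1)/n = 3/400, and points far above it are the high-leverage ones:

lev <- hatvalues(carseatsSignificant)
sum(lev > 2 * 3 / 400)  # observations with more than twice the average leverage
head(sort(lev, decreasing = TRUE))  # the highest-leverage observations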

Exercise 4

4a)

set.seed(1000)  # fix the seed before generating the data; passing set.seed() to lm's data argument does nothing
x <- runif(1000, min = 0, max = 10)
e <- rnorm(1000, 0, 1)
y <- 1 + 3 * x + e
graph1 <- lm(y ~ x)

4b)

graph1
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##       1.068        2.998
x <- runif(1000, min = 0, max = 10)
e <- rnorm(1000, -sqrt(3), sqrt(3))
y <- 1 + 3 * x + e 
graph2 <- lm(y ~ x)
summary(graph2)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8212 -1.1553  0.0168  1.1627  4.8055 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.78766    0.11009  -7.155 1.62e-12 ***
## x            3.00860    0.01911 157.439  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.731 on 998 degrees of freedom
## Multiple R-squared:  0.9613, Adjusted R-squared:  0.9613 
## F-statistic: 2.479e+04 on 1 and 998 DF,  p-value: < 2.2e-16
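
Note that the intercept estimate is pulled well below the true value of 1 because the errors no longer have mean zero: with E[e] = -sqrt(3) ≈ -1.73, the fitted intercept estimates 1 + E[e] ≈ -0.73, consistent with the -0.788 reported above, while the slope estimate stays close to the true value of 3.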