data(USArrests)
distances <- dist(USArrests)     # Euclidean distances on the raw (unscaled) variables
allArrests <- hclust(distances)  # hierarchical clustering with complete linkage (hclust's default)
plot(allArrests)
clusters <- cutree(allArrests, 3)
cluster1 <- USArrests[clusters == 1,]
cluster2 <- USArrests[clusters == 2,]
cluster3 <- USArrests[clusters == 3,]
print(cluster1)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## California 9.0 276 91 40.6
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Illinois 10.4 249 83 24.0
## Louisiana 15.4 249 66 22.2
## Maryland 11.3 300 67 27.8
## Michigan 12.1 255 74 35.1
## Mississippi 16.1 259 44 17.1
## Nevada 12.2 252 81 46.0
## New Mexico 11.4 285 70 32.1
## New York 11.1 254 86 26.1
## North Carolina 13.0 337 45 16.1
## South Carolina 14.4 279 48 22.5
print(cluster3)
## Murder Assault UrbanPop Rape
## Connecticut 3.3 110 77 11.1
## Hawaii 5.3 46 83 20.2
## Idaho 2.6 120 54 14.2
## Indiana 7.2 113 65 21.0
## Iowa 2.2 56 57 11.3
## Kansas 6.0 115 66 18.0
## Kentucky 9.7 109 52 16.3
## Maine 2.1 83 51 7.8
## Minnesota 2.7 72 66 14.9
## Montana 6.0 109 53 16.4
## Nebraska 4.3 102 62 16.5
## New Hampshire 2.1 57 56 9.5
## North Dakota 0.8 45 44 7.3
## Ohio 7.3 120 75 21.4
## Pennsylvania 6.3 106 72 14.9
## South Dakota 3.8 86 45 12.8
## Utah 3.2 120 80 22.9
## Vermont 2.2 48 32 11.2
## West Virginia 5.7 81 39 9.3
## Wisconsin 2.6 53 66 10.8
scaledDistances <- dist(scale(USArrests))
scaledArrests <- hclust(scaledDistances)
plot(scaledArrests)
Scaling should be done before computing the dissimilarities. The variables are measured on very different scales (Assault is in the hundreds while Murder is in single digits), so if we compute Euclidean distances on the raw data, the high-variance variables dominate and skew the clusters; standardizing the variables first gives each one comparable influence.
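As a quick check (a sketch reusing the objects fitted above), the two sets of cluster assignments can be cross-tabulated to see how much scaling changes the memberships:
unscaledClusters <- cutree(allArrests, 3)
scaledClusters <- cutree(scaledArrests, 3)
table(unscaled = unscaledClusters, scaled = scaledClusters)  # how memberships shift after scaling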
set.seed(12)
clusters <- c(rep(1, 20), rep(2, 20), rep(3, 20))
df <- matrix(rnorm(60*50, mean = 0, sd = 0.001), ncol = 50)
df[1:20, ] <- df[1:20, ] + 10    # class 1: shift the mean up
df[41:60, ] <- df[41:60, ] - 10  # class 3: shift the mean down (class 2 stays near 0)
pca <- prcomp(df)
plot(pca$x[, 1:2], col = clusters, pch = 19, xlab = "First principal component", ylab = "Second principal component")
For the training data we would expect the cubic RSS to be lower (or at worst equal): the cubic model contains the linear model as a special case, so its extra flexibility lets it chase noise in the training points and can only reduce the training RSS.
For the test data we would expect the linear RSS to be lower: since the true relationship is linear, the cubic model's extra terms mostly fit noise in the training set, which hurts its predictions on new data.
For the training data we would again expect the cubic RSS to be lower, for the same reason as before: the more flexible model fits the training points at least as closely as the linear one.
For the test data there is not enough information to tell. It depends on how far the true relationship is from linear: if it is only mildly nonlinear, the linear model may still have the lower test RSS, while the cubic model is likely to do better if the relationship is strongly nonlinear.
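The training-set answers above can be illustrated with a small simulation (a sketch with made-up variables, not part of the exercise): because the linear model is nested in the cubic one, the cubic training RSS can never exceed the linear training RSS.
set.seed(42)
simX <- runif(100, 0, 10)
simY <- 2 + 5 * simX + rnorm(100)   # truly linear relationship plus noise
linFit <- lm(simY ~ simX)
cubFit <- lm(simY ~ poly(simX, 3))
c(linear = sum(resid(linFit)^2), cubic = sum(resid(cubFit)^2))  # cubic training RSS <= linear training RSS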
Carseats <- read.csv(file = 'Downloads/Carseats.csv')  # Carseats comes from the CSV, so no data() call is needed
carseatsdata <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseatsdata)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Only Price and USYes have statistically significant p-values. There is a negative relationship between Price and car seat sales, and stores located in the US sell more than stores outside the US, holding the other predictors fixed.
Sales = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.20(USYes)
We reject the null hypothesis that the coefficient is zero for Price and USYes, since their p-values are very small; we fail to reject it for UrbanYes.
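As a sketch (using a hypothetical store with Price = 120, Urban = "Yes", US = "Yes" as the example), the fitted equation above is just the coefficient vector applied to the predictor values, and it matches what predict() returns:
cf <- coef(carseatsdata)
cf["(Intercept)"] + cf["Price"] * 120 + cf["UrbanYes"] * 1 + cf["USYes"] * 1  # prediction by hand
predict(carseatsdata, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))  # same prediction via predict()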
carseatsSignificant <- lm(Sales ~ Price + US, data = Carseats)
summary(carseatsSignificant)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models fit the data about equally well: the R-squared is roughly 0.24 for each, and dropping Urban slightly increases the adjusted R-squared (0.2354 versus 0.2335).
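The comparison can also be made formally with a nested-model F-test (a sketch, not part of the original answer); a large p-value for the extra Urban term would confirm that dropping it does not hurt the fit:
anova(carseatsSignificant, carseatsdata)  # compare the reduced model against the model with Urban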
confint(carseatsSignificant)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(carseatsSignificant)
Based on the diagnostic plots there is no clear evidence of outliers (the standardized residuals stay roughly within +/- 3), but the residuals-versus-leverage plot shows a few high-leverage points.
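A sketch of how those points could be flagged numerically, using the common rule of thumb of comparing each hat value to the average hat value (p + 1)/n:
hv <- hatvalues(carseatsSignificant)
which(hv > 3 * mean(hv))               # observations with unusually high leverage
summary(rstudent(carseatsSignificant)) # check for studentized residuals beyond about +/- 3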
4a)
set.seed(1000)  # set the seed before generating the data so the simulation is reproducible
x <- runif(1000, min = 0, max = 10)
e <- rnorm(1000, 0, 1)
y <- 1 + 3 * x + e
graph1 <- lm(y ~ x)
4b)
graph1
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 1.068 2.998
set.seed(1000)
x <- runif(1000, min = 0, max = 10)
e <- rnorm(1000, -sqrt(3), sqrt(3))  # error term with nonzero mean and larger variance
y <- 1 + 3 * x + e
graph2 <- lm(y ~ x)
summary(graph2)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8212 -1.1553 0.0168 1.1627 4.8055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.78766 0.11009 -7.155 1.62e-12 ***
## x 3.00860 0.01911 157.439 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.731 on 998 degrees of freedom
## Multiple R-squared: 0.9613, Adjusted R-squared: 0.9613
## F-statistic: 2.479e+04 on 1 and 998 DF, p-value: < 2.2e-16