The objectives of this problem set are to perform various activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes/answers the activity or question requested.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
library(datasets)
data <- anscombe
data
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
fBasics() package!)
## Mean of each variable
mu <- lapply(data, mean)
MU <- unlist(mu)
MU
##       x1       x2       x3       x4       y1       y2       y3       y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
## variance of each variable
sigma.Sq <- lapply(data, var)
sigma.Sqr <- unlist(sigma.Sq)
sigma.Sqr
##        x1        x2        x3        x4        y1        y2        y3
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620
##        y4
##  4.123249
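The mean and variance steps above can also be collected into a single table, which makes the near-identical values easier to compare side by side. This is a minimal sketch using base R only; the name `summary_tbl` is introduced here for illustration and is not part of the original solution.

```r
library(datasets)
data <- anscombe

# One row per statistic, one column per variable.
# sapply() simplifies the per-column results into a numeric matrix.
summary_tbl <- sapply(data, function(v) c(mean = mean(v), variance = var(v)))
round(summary_tbl, 3)
```

Individual values can then be read by name, for example `summary_tbl["variance", "y2"]`.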
## Correlation between $x, $y
cor.func <- function(x){
  # correlation between each x column (1 to 4) and its matching y column (5 to 8)
  cordata <- vector("numeric", 4L)
  for(i in 1:4) {
    cordata[i] <- cor(x[,i], x[,i+4])
  }
  return(cordata)
}
cor.func(data)
## [1] 0.8164205 0.8162365 0.8162867 0.8165214
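The loop in `cor.func` can also be written without explicit indexing. The sketch below is one possible alternative, not the original approach: it pairs the columns by name and uses `mapply()` to call `cor()` on each pair. The name `pair_cor` and the labels assigned to it are illustrative.

```r
library(datasets)
data <- anscombe

# Pair each x column with its matching y column and correlate them.
# mapply() walks the two sets of columns in parallel.
pair_cor <- mapply(cor, data[paste0("x", 1:4)], data[paste0("y", 1:4)])
names(pair_cor) <- paste0("x", 1:4, ",y", 1:4)
pair_cor
```

Selecting columns by name rather than by position keeps the pairing correct even if the column order of the data frame changes.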
plot(data$x1,data$y1, xlab ="x1", ylab="y1", main = "plot of (x1,y1)")
plot(data$x2,data$y2, xlab ="x2", ylab="y2", main = "plot of (x2,y2)")
plot(data$x3,data$y3, xlab ="x3", ylab="y3", main = "plot of (x3,y3)")
plot(data$x4,data$y4, xlab ="x4", ylab="y4", main = "plot of (x4,y4)")
par(mfrow = c(2,2))
plot(data$x1,data$y1, xlab ="x1", ylab="y1", main = "plot of (x1,y1)", pch =16, cex =1.4)
plot(data$x2,data$y2, xlab ="x2", ylab="y2", main = "plot of (x2,y2)", pch =16, cex =1.4)
plot(data$x3,data$y3, xlab ="x3", ylab="y3", main = "plot of (x3,y3)", pch =16, cex =1.4)
plot(data$x4,data$y4, xlab ="x4", ylab="y4", main = "plot of (x4,y4)", pch =16, cex =1.4)
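The four near-identical `plot()` calls above could be generated in a loop. This is a sketch under the assumption that the same titles and point styles are wanted for every panel; `old_par`, `x_name`, and `y_name` are names introduced here for illustration.

```r
library(datasets)
data <- anscombe

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid; remember the previous settings
for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  plot(data[[x_name]], data[[y_name]],
       xlab = x_name, ylab = y_name,
       main = paste0("plot of (", x_name, ",", y_name, ")"),
       pch = 16, cex = 1.4)
}
par(old_par)                      # restore the previous layout
```

Saving and restoring the graphics parameters keeps the 2 x 2 layout from leaking into later plots.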
lm() function.
lm1 <- lm(y1 ~ x1, data)
lm2 <- lm(y2 ~ x2, data)
lm3 <- lm(y3 ~ x3, data)
lm4 <- lm(y4 ~ x4, data)
par(mfrow = c(2,2))
plot(data$x1,data$y1, xlab ="x1", ylab="y1", main = "model fit of (x1,y1)", pch =16, cex =1.4)
abline(lm1)
plot(data$x2,data$y2, xlab ="x2", ylab="y2", main = "model fit of (x2,y2)", pch =16, cex =1.4)
abline(lm2)
plot(data$x3,data$y3, xlab ="x3", ylab="y3", main = "model fit of (x3,y3)", pch =16, cex =1.4)
abline(lm3)
plot(data$x4,data$y4, xlab ="x4", ylab="y4", main = "model fit of (x4,y4)", pch =16, cex =1.4)
abline(lm4)
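The fitted lines in the four panels look alike because the estimated coefficients are nearly the same (an intercept of about 3 and a slope of about 0.5 in every case). The sketch below refits the models in a list and tabulates the coefficients to make that explicit; `fits` and `coefs` are illustrative names, and building the formulas with `reformulate()` is a choice made here, not part of the original solution.

```r
library(datasets)
data <- anscombe

# Refit the four models in a list so they can be handled together.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# Each column holds one model's intercept and slope.
coefs <- sapply(fits, function(m) unname(coef(m)))
dimnames(coefs) <- list(c("intercept", "slope"), paste0("lm", 1:4))
round(coefs, 3)
```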
anova(lm1)
## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm2)
## Analysis of Variance Table
##
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm3)
## Analysis of Variance Table
##
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm4)
## Analysis of Variance Table
##
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
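The four ANOVA tables are easier to compare when the key statistics sit in one table. The following sketch is one way to do that with `summary()` on each fitted model; `fit_stats` is a name introduced here, and the models are refit inside the function so that the example runs on its own.

```r
library(datasets)
data <- anscombe

# For each x/y pair, fit the model and keep R-squared and the F statistic.
fit_stats <- sapply(1:4, function(i) {
  fit <- lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
  s <- summary(fit)
  c(r_squared = s$r.squared, f_value = unname(s$fstatistic["value"]))
})
colnames(fit_stats) <- paste0("lm", 1:4)
round(fit_stats, 3)
```

The F values in this table should agree with the `F value` column of the ANOVA output above.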
At first glance the four datasets look identical, and the summary statistics (means, variances, and correlations) also lead us to believe that the data are identical. Only after visualizing the data does it become clear that the datasets are not at all identical. Hence, visualization helps to avoid false conclusions.