The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes/answers the activity or question requested. Finally, upon completion, name your final output .html file as YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignment on Moodle.
Load the anscombe data that is part of the library(datasets) in R, and assign that data to a new object called data.
data("anscombe")
data = anscombe
summary(data)
##        x1             x2             x3             x4
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19
##        y1               y2              y3              y4
##  Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 5.250
##  1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170
##  Median : 7.580   Median :8.140   Median : 7.11   Median : 7.040
##  Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 7.501
##  3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190
##  Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :12.500
(fBasics package)
library("fBasics")
# calculate mean
colStats(data, FUN = 'mean')
##       x1       x2       x3       x4       y1       y2       y3       y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
# calculate variance
colStats(data, FUN = 'var')
##        x1        x2        x3        x4        y1        y2        y3
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620
##        y4
##  4.123249
# calculate correlation for all 4 sets
colStats(data$x1, data$y1, FUN = 'cor')
## [1] 0.8164205
colStats(data$x2, data$y2, FUN = 'cor')
## [1] 0.8162365
colStats(data$x3, data$y3, FUN = 'cor')
## [1] 0.8162867
colStats(data$x4, data$y4, FUN = 'cor')
## [1] 0.8165214
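The same column statistics and pairwise correlations can also be computed without fBasics. The following is a minimal base R sketch, not part of the original answer; the names `x_cols`, `y_cols`, `col_means`, `col_vars`, and `pair_cor` are introduced here purely for illustration.

```r
# Assumption: the built-in anscombe data frame, loaded as in the steps above.
data <- datasets::anscombe

# Column names of the four x/y pairs (illustrative helper vectors)
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Mean and variance of every column, using base R only
col_means <- sapply(data, mean)
col_vars  <- sapply(data, var)

# Pearson correlation for each (x_i, y_i) pair
pair_cor <- mapply(function(x, y) cor(data[[x]], data[[y]]), x_cols, y_cols)
names(pair_cor) <- paste0("set", 1:4)

print(round(col_means, 3))
print(round(col_vars, 3))
print(round(pair_cor, 4))
```

Computing all four correlations in one call makes it easier to see at a glance that they agree to about three decimal places.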
plot(data$x1,data$y1,xlab="x1",ylab="y1",main="Dataset 1")
plot(data$x2,data$y2,xlab="x2",ylab="y2",main="Dataset 2")
plot(data$x3,data$y3,xlab="x3",ylab="y3",main="Dataset 3")
plot(data$x4,data$y4,xlab="x4",ylab="y4",main="Dataset 4")
par(mfrow=c(2,2))
plot(data$x1,data$y1,xlab="x1",ylab="y1",main="Dataset 1",pch=19)
plot(data$x2,data$y2,xlab="x2",ylab="y2",main="Dataset 2",pch=19)
plot(data$x3,data$y3,xlab="x3",ylab="y3",main="Dataset 3",pch=19)
plot(data$x4,data$y4,xlab="x4",ylab="y4",main="Dataset 4",pch=19)
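The four panels above each pick their own axis limits, which can make the datasets look more alike or more different than they are. As a hedged alternative, the sketch below draws the same 2x2 grid in a loop with shared axis limits and restores the plotting layout afterwards. The names `x_range`, `y_range`, and `old_par` are illustrative and not part of the original answer.

```r
# Assumption: the built-in anscombe data frame, loaded as in the steps above.
data <- datasets::anscombe

# Shared axis limits so that all four panels are directly comparable
x_range <- range(unlist(data[, paste0("x", 1:4)]))
y_range <- range(unlist(data[, paste0("y", 1:4)]))

old_par <- par(mfrow = c(2, 2))  # switch to a 2x2 grid, remembering the old layout
for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste("Dataset", i), pch = 19,
       xlim = x_range, ylim = y_range)
}
par(old_par)                     # restore the previous layout
```

Restoring `par()` keeps later plots in the same session from silently inheriting the 2x2 layout.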
Fit a linear model to each data set using the lm() function.
model1 = lm(y1 ~ x1, data = data)
summary(model1)
##
## Call:
## lm(formula = y1 ~ x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
model2 = lm(y2 ~ x2, data = data)
summary(model2)
##
## Call:
## lm(formula = y2 ~ x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
model3 = lm(y3 ~ x3, data = data)
summary(model3)
##
## Call:
## lm(formula = y3 ~ x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
model4 = lm(y4 ~ x4, data = data)
summary(model4)
##
## Call:
## lm(formula = y4 ~ x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
par(mfrow=c(2,2))
plot(data$x1,data$y1,pch=19)
abline(model1)
plot(data$x2,data$y2,pch=19)
abline(model2)
plot(data$x3,data$y3,pch=19)
abline(model3)
plot(data$x4,data$y4,pch=19)
abline(model4)
We can compare the models by checking their adjusted R-squared values; all four fit at approximately the same level.
summary(model1)$adj.r.squared
## [1] 0.6294916
summary(model2)$adj.r.squared
## [1] 0.6291578
summary(model3)$adj.r.squared
## [1] 0.6292489
summary(model4)$adj.r.squared
## [1] 0.6296747
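To compare the four fits side by side rather than one call at a time, the models can be fitted in a loop and their key statistics collected into one table. This is a sketch under the assumption that the same simple regressions as above are wanted; `models` and `comparison` are illustrative names.

```r
# Assumption: the built-in anscombe data frame, loaded as in the steps above.
data <- datasets::anscombe

# Fit y_i ~ x_i for i = 1..4; reformulate() builds each formula from strings
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# Collect intercept, slope, and (adjusted) R-squared into one matrix
comparison <- t(sapply(models, function(m) {
  s <- summary(m)
  c(intercept     = unname(coef(m)[1]),
    slope         = unname(coef(m)[2]),
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared)
}))
rownames(comparison) <- paste0("model", 1:4)

print(round(comparison, 3))
```

Laid out this way, the near-identical intercepts (about 3.0), slopes (about 0.5), and adjusted R-squared values (about 0.63) are visible in a single view.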
Anscombe’s Quartet consists of four different datasets (x, y pairs) with nearly identical descriptive statistics but completely different graphs when visualized. The first graph shows a simple, standard linear relationship. The second is not linear, but the relationship between the two variables is obvious. In the third graph, the relationship is linear, but a single outlier pulls the regression line away from the line the other points follow. The fourth graph shows that one outlier is enough to produce a relatively high correlation coefficient, even though the other data points do not indicate any relationship. The quartet illustrates why data visualization is important: summary statistics alone can hide very different structures.
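The claims about the outliers in datasets 3 and 4 can be checked numerically. The sketch below is one possible check, not part of the original answer: it drops the largest `y3` value and recomputes the correlation, then uses `hatvalues()` to measure the leverage of each point in the fourth model. The names `outlier3`, `cor_all`, `cor_trimmed`, and `leverage4` are illustrative.

```r
# Assumption: the built-in anscombe data frame, loaded as in the steps above.
data <- datasets::anscombe

# Dataset 3: correlation with and without the single largest y3 value
outlier3    <- which.max(data$y3)
cor_all     <- cor(data$x3, data$y3)
cor_trimmed <- cor(data$x3[-outlier3], data$y3[-outlier3])
print(c(all_points = cor_all, without_outlier = cor_trimmed))

# Dataset 4: leverage of each observation in the fitted line.
# A leverage close to 1 means the fit is forced through that point.
model4    <- lm(y4 ~ x4, data = data)
leverage4 <- hatvalues(model4)
print(round(leverage4, 3))
print(data[which.max(leverage4), c("x4", "y4")])
```

In dataset 4 every other observation has the same `x4` value, so without the high-leverage point there would be no variation in x and no slope could be estimated at all.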