The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, upload it to the RPubs site, and submit the link to the hosted file via Moodle.
anscombe data that is part of the library(datasets) in R. And assign that data to a new object called data.
library(datasets)
library(fBasics)
## Warning: package 'fBasics' was built under R version 3.2.5
## Loading required package: timeDate
## Loading required package: timeSeries
data <- anscombe
fBasics() package!)
colMeans(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colVars(data)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
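The variances above can also be reproduced without the fBasics dependency. The following is a minimal sketch using only base R, assuming the built-in `anscombe` data frame; the name `col_vars` is chosen here for illustration and is not part of the original solution.

```r
library(datasets)

data <- anscombe

# Base R alternative to fBasics::colVars(): apply var() to each column.
# sapply() returns a named numeric vector, one variance per column.
col_vars <- sapply(data, var)

round(col_vars, 6)
```

Because a data frame is a list of columns, `sapply(data, var)` computes the sample variance column by column, which should match the `colVars(data)` output shown above.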
cor(data$x1,data$y1)
## [1] 0.8164205
cor(data$x2,data$y2)
## [1] 0.8162365
cor(data$x3,data$y3)
## [1] 0.8162867
cor(data$x4,data$y4)
## [1] 0.8165214
plot(data$x1,data$y1,main="Scatter Plot of x1 & y1")
plot(data$x2,data$y2,main="Scatter Plot of x2 & y2")
plot(data$x3,data$y3,main="Scatter Plot of x3 & y3")
plot(data$x4,data$y4,main="Scatter Plot of x4 & y4")
4. Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic
par(mfrow=c(2,2))
plot(data$x1,data$y1,main="Scatter Plot of x1 & y1",pch=19)
plot(data$x2,data$y2,main="Scatter Plot of x2 & y2",pch=19)
plot(data$x3,data$y3,main="Scatter Plot of x3 & y3",pch=19)
plot(data$x4,data$y4,main="Scatter Plot of x4 & y4",pch=19)
lm() function.
summary(lm(y1~x1,data=data))
##
## Call:
## lm(formula = y1 ~ x1, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
summary(lm(y2~x2,data=data))
##
## Call:
## lm(formula = y2 ~ x2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
summary(lm(y3~x3,data=data))
##
## Call:
## lm(formula = y3 ~ x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
summary(lm(y4~x4,data=data))
##
## Call:
## lm(formula = y4 ~ x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
Regression1<-lm(y1~x1,data=data)
Regression2<-lm(y2~x2,data=data)
Regression3<-lm(y3~x3,data=data)
Regression4<-lm(y4~x4,data=data)
par(mfrow=c(2,2))
plot(data$x1,data$y1,main="Regression x1 and y1",pch=19)
abline(Regression1)
plot(data$x2,data$y2,main="Regression x2 and y2",pch=19)
abline(Regression2)
plot(data$x3,data$y3,main="Regression x3 and y3",pch=19)
abline(Regression3)
plot(data$x4,data$y4,main="Regression x4 and y4",pch=19)
abline(Regression4)
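The four fits and panels above repeat the same steps for each pair, so they can be written as a loop. This is only a sketch of one possible approach, assuming the built-in `anscombe` data frame; `fits` and `old_par` are illustrative names, not part of the original solution.

```r
library(datasets)

# Save the current layout so it can be restored afterwards.
old_par <- par(mfrow = c(2, 2))

# Fit and plot each x/y pair; keep the fitted models in a list.
fits <- lapply(1:4, function(i) {
  f <- as.formula(paste0("y", i, " ~ x", i))  # e.g. y1 ~ x1
  fit <- lm(f, data = anscombe)
  plot(f, data = anscombe, pch = 19,
       main = paste0("Regression x", i, " and y", i))
  abline(fit)  # overlay the fitted regression line
  fit
})

# Restore the previous plotting layout.
par(old_par)
```

Storing the models in a list means later steps such as `summary()` or `anova()` can be applied with `lapply(fits, summary)` rather than four separate calls.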
CompareTable<-data.frame(c(summary(Regression1)$adj.r.squared,summary(Regression2)$adj.r.squared,summary(Regression3)$adj.r.squared,summary(Regression4)$adj.r.squared))
rownames(CompareTable)<-c("x1&y1","x2&y2","x3&y3","x4&y4")
colnames(CompareTable)<-c("Adjusted R-square")
CompareTable
##       Adjusted R-square
## x1&y1         0.6294916
## x2&y2         0.6291578
## x3&y3         0.6292489
## x4&y4         0.6296747
anova(Regression1)
## Analysis of Variance Table
##
## Response: y1
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Regression2)
## Analysis of Variance Table
##
## Response: y2
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 27.500 27.5000 17.966 0.002179 **
## Residuals 9 13.776 1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Regression3)
## Analysis of Variance Table
##
## Response: y3
## Df Sum Sq Mean Sq F value Pr(>F)
## x3 1 27.470 27.4700 17.972 0.002176 **
## Residuals 9 13.756 1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Regression4)
## Analysis of Variance Table
##
## Response: y4
## Df Sum Sq Mean Sq F value Pr(>F)
## x4 1 27.490 27.4900 18.003 0.002165 **
## Residuals 9 13.742 1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
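The results from the four models above are very close: each fit has an intercept of about 3.00, a slope of about 0.50, and an adjusted R-squared of about 0.63, and the ANOVA F-statistics are all about 18.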
The lesson of Anscombe’s Quartet is that even though summary statistics give a general idea of the data, relying on them alone can be misleading. The quartet is a classic example of why data visualization is necessary alongside numerical summaries: the four datasets above are nearly identical statistically, yet they tell quite different stories when plotted.
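To make the statistical similarity concrete, the summary statistics can be gathered into one table. This is a rough sketch in base R, assuming the built-in `anscombe` data frame; `quartet_stats` and the column labels are names introduced here for illustration.

```r
library(datasets)

# Compute the same summary statistics for each of the four x/y pairs.
quartet_stats <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    var_x     = var(x),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
}))
rownames(quartet_stats) <- paste0("set", 1:4)

# One row per dataset, one column per statistic.
round(quartet_stats, 3)
```

Each row of the table is close to the others, which is exactly why the scatter plots are needed to tell the four datasets apart.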