The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, post your assignment on RPubs and upload a link to it to the “Problem Set 2” assignment on Moodle.
anscombe data that is part of the library(datasets) in R. And assign that data to a new object called data.
library(datasets)
head(anscombe)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
data = anscombe
fBasics() package!)
library("coefplot", lib.loc = "~/R/win-library/3.3")
## Warning: package 'coefplot' was built under R version 3.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
## Warning: Installed Rcpp (0.12.7) different from Rcpp used to build dplyr (0.12.12).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
library("dplyr",lib.loc = "~/R/win-library/3.3")
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sapply(data,mean)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
sapply(data,var)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
cor(data)
## x1 x2 x3 x4 y1 y2
## x1 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x2 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x3 1.0000000 1.0000000 1.0000000 -0.5000000 0.8164205 0.8162365
## x4 -0.5000000 -0.5000000 -0.5000000 1.0000000 -0.5290927 -0.7184365
## y1 0.8164205 0.8164205 0.8164205 -0.5290927 1.0000000 0.7500054
## y2 0.8162365 0.8162365 0.8162365 -0.7184365 0.7500054 1.0000000
## y3 0.8162867 0.8162867 0.8162867 -0.3446610 0.4687167 0.5879193
## y4 -0.3140467 -0.3140467 -0.3140467 0.8165214 -0.4891162 -0.4780949
## y3 y4
## x1 0.8162867 -0.3140467
## x2 0.8162867 -0.3140467
## x3 0.8162867 -0.3140467
## x4 -0.3446610 0.8165214
## y1 0.4687167 -0.4891162
## y2 0.5879193 -0.4780949
## y3 1.0000000 -0.1554718
## y4 -0.1554718 1.0000000
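The full matrix above mixes all eight columns, but the pairings of interest are each x with its own y. A minimal sketch, assuming the built-in anscombe data frame with its usual x1..x4, y1..y4 column order (pair_cor is a name introduced here for illustration):

```r
library(datasets)

# Correlate each x column with its matching y column:
# x1 with y1, x2 with y2, x3 with y3, x4 with y4.
pair_cor <- mapply(cor, anscombe[, 1:4], anscombe[, 5:8])

# The four values agree to about three decimal places.
round(pair_cor, 3)
```

The four matched pairs share nearly the same correlation, which is why the summary statistics alone cannot tell the data sets apart.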
plot(data$x1, data$y1)
plot(data$x2, data$y2)
plot(data$x3, data$y3)
plot(data$x4, data$y4)
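# Base graphics calls cannot be combined with `+`; use a 2x2 panel layout instead
par(mfrow = c(2, 2))
plot(data$x1, data$y1, pch = 20)
plot(data$x2, data$y2, pch = 20)
plot(data$x3, data$y3, pch = 20)
plot(data$x4, data$y4, pch = 20)
par(mfrow = c(1, 1))  # restore the single-panel layout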
lm() function.
plot(data$x1, data$y1)
abline(lm(y1 ~ x1, data = data))  # regress y on x so the line matches the plot axes
plot(data$x2, data$y2)
abline(lm(y2 ~ x2, data = data))
plot(data$x3, data$y3)
abline(lm(y3 ~ x3, data = data))
plot(data$x4, data$y4)
abline(lm(y4 ~ x4, data = data))
plot(data$x1, data$y1, pch = 20)
abline(lm(y1 ~ x1, data = data))  # regress y on x so the line matches the plot axes
plot(data$x2, data$y2, pch = 20)
abline(lm(y2 ~ x2, data = data))
plot(data$x3, data$y3, pch = 20)
abline(lm(y3 ~ x3, data = data))
plot(data$x4, data$y4, pch = 20)
abline(lm(y4 ~ x4, data = data))
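(1) The x1-y1 plot shows a positive, roughly linear relationship (r ≈ 0.82 in the correlation matrix above); the points are scattered around the trend rather than lying exactly on a line.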
(2) The points in the x2-y2 plot follow a smooth curve rather than a straight line. The relationship is clearly nonlinear and could be quadratic.
(3) If we ignore the outlier in the x3-y3 plot, the remaining points lie almost exactly on a straight line, though not on the regression line, which the outlier pulls away from them. y3 still increases as x3 increases.
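(4) In the x4-y4 plot, x4 takes the same value for all but one observation, so y4 varies while x4 stays fixed and the plot shows no relationship within that group. The correlation of about 0.82 reported above is driven entirely by the single point with a different x4 value.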
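The four observations above can be set against the fitted lines themselves. The following is a minimal sketch, assuming the built-in anscombe data frame; fits and coefs are names introduced here for illustration, and each model regresses y on its matching x:

```r
library(datasets)

# Fit y_i ~ x_i for each of the four data sets.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect intercepts and slopes into one small table, one row per data set.
coefs <- t(sapply(fits, coef))
rownames(coefs) <- paste0("set", 1:4)
colnames(coefs) <- c("intercept", "slope")
round(coefs, 2)
```

All four fits give nearly the same intercept and slope even though the scatterplots look very different, which is the point of the exercise.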
Anscombe was concerned that many key factors are ignored when an analysis relies only on the theoretical description (A). He notes that we should sometimes also examine the relation between residuals and fitted values, and he lists what to look for in a plot of residuals against fitted values or against x_i. He argues that outliers sometimes yield valuable information, and that it is a good idea to remove the outliers, analyze the remaining data, and study the outliers later. He goes on to discuss more general regression analysis. Incidentally, Anscombe’s quartet (Figs. 1-4) is a wonderful data visualization that helps one understand x-y relations and outliers and facilitates understanding of regression analysis.
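The residuals-against-fitted-values plot that Anscombe recommends can be sketched for the quartet as follows. This is an illustration under the assumption that a simple lm() of each y on its matching x is the model of interest; it is not taken from the paper itself.

```r
library(datasets)

op <- par(mfrow = c(2, 2))  # 2x2 panel, one plot per data set
for (i in 1:4) {
  fit <- lm(reformulate(paste0("x", i), response = paste0("y", i)),
            data = anscombe)
  plot(fitted(fit), resid(fit), pch = 20,
       xlab = "Fitted values", ylab = "Residuals",
       main = paste("Data set", i))
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(op)  # restore the previous graphics settings
```

Curvature, a lone large residual, or a single high-leverage point shows up far more clearly in these panels than in the fitted coefficients.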