The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, upload your document to rpubs.com and share the link to the “Problem Set 2” assignment on Moodle.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.

data <- anscombe
Calculate the mean and variance of each column (hint: explore the fBasics() package!).

library(fBasics)
colStats(anscombe,FUN = mean)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colStats(anscombe,FUN = var)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
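If fBasics is not available, the same column means and variances can be computed with base R alone; a minimal alternative:

sapply(anscombe, mean)  # column means, same values as colStats above
sapply(anscombe, var)   # column variances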
cor(anscombe[,1:4],anscombe[,5:8])
## y1 y2 y3 y4
## x1 0.8164205 0.8162365 0.8162867 -0.3140467
## x2 0.8164205 0.8162365 0.8162867 -0.3140467
## x3 0.8164205 0.8162365 0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610 0.8165214
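Only the diagonal of this matrix matters here: the paired correlations of x1 with y1, x2 with y2, and so on. One compact way to extract just those pairs:

# Apply cor() to each (x_i, y_i) column pair of the anscombe data frame
mapply(cor, anscombe[1:4], anscombe[5:8])

All four paired correlations come out at roughly 0.816, matching the diagonal above.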
attach(anscombe)
plot(x1,y1)
plot(x2,y2)
plot(x3,y3)
plot(x4,y4)
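A note on attach(): it works here, but it puts the data frame’s columns on the search path, where they can mask other objects. A safer equivalent scopes the data explicitly:

with(anscombe, plot(x1, y1))  # same plot, no attach() needed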
par(mfrow=c(2,2))
plot(x1,y1,pch=19)
plot(x2,y2,pch=19)
plot(x3,y3,pch=19)
plot(x4,y4,pch=19)
par(mfrow=c(1,1))
Fit a linear model to each x-y pair using the lm() function.

fit1 <- lm(y1~x1)
fit2 <- lm(y2~x2)
fit3 <- lm(y3~x3)
fit4 <- lm(y4~x4)
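The four fits can also be built in one pass. A short sketch, assuming only the x1-x4/y1-y4 naming pattern of the anscombe columns:

# reformulate("x1", "y1") builds the formula y1 ~ x1, and so on for i = 2..4
fits <- lapply(1:4, function(i)
  lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe))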
par(mfrow=c(2,2))
plot(x1,y1,pch=19)
abline(fit1)
plot(x2,y2,pch=19)
abline(fit2)
plot(x3,y3,pch=19)
abline(fit3)
plot(x4,y4,pch=19)
abline(fit4)
summary(fit1)
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
summary(fit2)
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
summary(fit3)
## Call:
## lm(formula = y3 ~ x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
summary(fit4)
## Call:
## lm(formula = y4 ~ x4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
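To see the similarity at a glance, the key statistics can be pulled from the four fit objects into one comparison table. A small sketch using only the objects created above:

# One column per fit: intercept, slope, adjusted R-squared, residual std. error
sapply(list(fit1, fit2, fit3, fit4), function(f) {
  s <- summary(f)
  c(intercept = coef(f)[[1]], slope = coef(f)[[2]],
    adj.r.sq = s$adj.r.squared, sigma = s$sigma)
})

Each column should come out nearly identical: roughly 3.00, 0.50, 0.63, and 1.24.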
The model fits for the four datasets are nearly identical: in each, the independent variable is significant at the 1% level, the adjusted R^2 is about 0.63, and the residual standard error is about 1.24.
In short, Anscombe argued that although research tends to value numerical calculation over visualization, graphs can reveal specific data features, trends, and problems that cannot be seen in the results of numerical calculation alone.
He used the linear regression model as the topic of exploration, identifying four types of data features that may not be detected by numerical calculation but are clear in graphs:

1) Outliers
Solve: do detailed research on the outliers and create a subset of them for further exploration, instead of simply deleting them, because they may sometimes carry interesting and useful insights.

2) Solve: transform y (such as by taking a log), or transform x by adding an extra term to the formula (such as a squared or higher-power term); see the sketch after this list.

3) Progressive change in the variability of the residuals as the fitted values increase
Solve: transform y.

4) Solve: transform y.
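As an illustration of the transform-x remedy in item 2: the second Anscombe pair is visibly curved, and adding a quadratic term to the formula captures that curvature. A minimal sketch (fit2q is a name introduced here for illustration):

fit2q <- lm(y2 ~ x2 + I(x2^2), data = anscombe)  # I() protects the squared term in the formula
summary(fit2q)$r.squared  # should rise to nearly 1 once the curvature is modeled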
Finally, the four-dataset example shows that even when the summary statistics of the fitted models are all similar, the actual shapes of the data can vary widely. This restates that a statistical analysis needs both numerical calculation and visualization.