The objectives of this problem set are to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, name your final output .html file as YourName_ANLY512-Section-Year-Semester.html, upload it to RPubs, and submit the link to the hosted file via Moodle.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.

library(datasets)
data <- anscombe
Compute the descriptive statistics (means and variances) for each column of the data (hint: use the fBasics package!).

library(fBasics)
## Loading required package: timeDate
## Loading required package: timeSeries
colMeans(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colVars(data)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
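As a cross-check (not required by the assignment), the same summaries can be reproduced in base R without fBasics; a minimal sketch using sapply() over the columns:

# Column means and variances computed one column at a time.
sapply(data, mean)  # all x means are 9, all y means are ~7.5
sapply(data, var)   # all x variances are 11, all y variances are ~4.12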
attach(data)  # expose the columns x1..y4 by name (detach(data) when done)
correlationTest(x1,y1, method = "pearson")
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Mon May 21 22:16:18 2018
correlationTest(x2,y2, method = "pearson")
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Mon May 21 22:16:18 2018
correlationTest(x3,y3, method = "pearson")
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Mon May 21 22:16:18 2018
correlationTest(x4,y4, method = "pearson")
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Mon May 21 22:16:18 2018
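For compactness, the same four tests can also be run with base R's cor.test() in a loop; a sketch (the column names x1..y4 come from the anscombe data):

# Collect the Pearson estimate and two-sided p-value for each x/y pair.
sapply(1:4, function(i) {
  res <- cor.test(data[[paste0("x", i)]], data[[paste0("y", i)]],
                  method = "pearson")
  c(correlation = unname(res$estimate), p.value = res$p.value)
})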
plot(x1,y1, main="Relationship between x1 and y1")
plot(x2,y2, main="Relationship between x2 and y2")
plot(x3,y3, main="Relationship between x3 and y3")
plot(x4,y4, main="Relationship between x4 and y4")
par(mfrow=c(2,2))
plot(x1,y1, pch=19, main="Relationship between x1 and y1")
plot(x2,y2, pch=19, main="Relationship between x2 and y2")
plot(x3,y3, pch=19, main="Relationship between x3 and y3")
plot(x4,y4, pch=19, main="Relationship between x4 and y4")
Fit four linear models, one for each x/y pair, using the lm() function.

fit1 <- lm(y1 ~ x1, data=data)
fit2 <- lm(y2 ~ x2, data=data)
fit3 <- lm(y3 ~ x3, data=data)
fit4 <- lm(y4 ~ x4, data=data)
par(mfrow=c(2,2))
plot(x1,y1, pch=19, main="Relationship between x1 and y1")
abline(fit1, col="red")
plot(x2,y2, pch=19, main="Relationship between x2 and y2")
abline(fit2, col="red")
plot(x3,y3, pch=19, main="Relationship between x3 and y3")
abline(fit3, col="red")
plot(x4,y4, pch=19, main="Relationship between x4 and y4")
abline(fit4, col="red")
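The repeated plot/abline calls can also be written as a loop; a sketch (not part of the original assignment) that reproduces the same 2-by-2 panel:

par(mfrow = c(2, 2))
fits <- list(fit1, fit2, fit3, fit4)
for (i in 1:4) {
  # Pull the i-th x/y pair by name and overlay its fitted line.
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]], pch = 19,
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("Relationship between x", i, " and y", i))
  abline(fits[[i]], col = "red")
}
par(mfrow = c(1, 1))  # reset the plotting layout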
summary(fit1)
## 
## Call:
## lm(formula = y1 ~ x1, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
summary(fit2)
## 
## Call:
## lm(formula = y2 ~ x2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9009 -0.7609  0.1291  0.9491  1.2691 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125   2.667  0.02576 * 
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
summary(fit3)
## 
## Call:
## lm(formula = y3 ~ x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1586 -0.6146 -0.2303  0.1540  3.2411 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0025     1.1245   2.670  0.02562 * 
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
summary(fit4)
## 
## Call:
## lm(formula = y4 ~ x4, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0017     1.1239   2.671  0.02559 * 
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165
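To make the comparison explicit, a sketch (not required by the assignment) that tabulates the key statistics from all four fits side by side:

fits <- list(fit1, fit2, fit3, fit4)
sapply(fits, function(f) {
  s <- summary(f)
  c(intercept = unname(coef(f)[1]),  # ~3.00 in every model
    slope     = unname(coef(f)[2]),  # ~0.50 in every model
    adj.r.sq  = s$adj.r.squared)     # ~0.63 in every model
})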
From the summaries of the four linear regression models, we can see that the x variables are all statistically significant, the estimated coefficients are nearly identical, and the adjusted R-squared values are close to one another.
Anscombe's Quartet illustrates why data visualization is important. The descriptive statistics show that the means and variances of the x variables are nearly identical (and likewise for the y variables), and the correlation coefficients of the four x-y pairs are almost the same. The regression summaries tell the same story: every slope is significant, the coefficients are nearly identical, and the adjusted R-squared values are close to one another. Judged by these statistics alone, the four models are almost indistinguishable. The plots, however, show that the actual relationships between x and y are very different: one is roughly linear, one is clearly curved, and two are dominated by a single outlying or influential point. Data visualization can therefore tell us things that summary calculations cannot, which demonstrates the importance of graphing data before analyzing it.
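The same point can be made with residual plots; a sketch (assuming the four fits from above) in which the "identical" models show very different residual patterns:

par(mfrow = c(2, 2))
for (f in list(fit1, fit2, fit3, fit4)) {
  # Residuals vs. fitted values for each model.
  plot(fitted(f), resid(f), pch = 19,
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
}
par(mfrow = c(1, 1))  # reset the plotting layout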