The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the requested activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, publish it to your RPubs account, and submit the link to the “Problem Set 2” assignment on Moodle. Points will be deducted for uploading in an improper format.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
library(datasets)
str(anscombe)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
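data("anscombe")   # loads the anscombe data frame into the workspace
data <- anscombe   # data() returns only the dataset name, so assign the data frame itself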
x1 <- anscombe[,1]
x2 <- anscombe[,2]
x3 <- anscombe[,3]
x4 <- anscombe[,4]
y1 <- anscombe[,5]
y2 <- anscombe[,6]
y3 <- anscombe[,7]
y4 <- anscombe[,8]
fBasics() package!)
mean(x1)
## [1] 9
var(x1)
## [1] 11
mean(x2)
## [1] 9
var(x2)
## [1] 11
mean(x3)
## [1] 9
var(x3)
## [1] 11
mean(x4)
## [1] 9
var(x4)
## [1] 11
mean(y1)
## [1] 7.500909
var(y1)
## [1] 4.127269
mean(y2)
## [1] 7.500909
var(y2)
## [1] 4.127629
mean(y3)
## [1] 7.5
var(y3)
## [1] 4.12262
mean(y4)
## [1] 7.500909
var(y4)
## [1] 4.123249
cor(x1,y1)
## [1] 0.8164205
cor(x2,y2)
## [1] 0.8162365
cor(x3,y3)
## [1] 0.8162867
cor(x4,y4)
## [1] 0.8165214
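The repeated calls above can also be collapsed into a single pass over the data frame. The following is a minimal sketch rather than part of the original solution; the object names col_stats and pair_cor are illustrative assumptions.

```r
library(datasets)

# Mean and variance of every column, one row per column
col_stats <- data.frame(
  mean = sapply(anscombe, mean),
  var  = sapply(anscombe, var)
)
print(round(col_stats, 3))

# Pearson correlation for each matched pair (x1/y1, ..., x4/y4)
pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pair_cor) <- paste0("pair", 1:4)
print(round(pair_cor, 4))
```

Working on the data frame directly avoids creating eight separate vectors and makes it harder to mismatch an x column with the wrong y column.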
library(ggplot2)
plot(x1,y1, main = "Scatter plot - Pair1 (x1 & y1)")
plot(x2,y2,main = "Scatter plot - Pair2 (x2 & y2)")
plot(x3,y3,main = "Scatter plot - Pair3 (x3 & y3)")
plot(x4,y4,main = "Scatter plot - Pair4 (x4 & y4)")
par(mfrow = c(2,2))
plot(x1,y1, main = "Scatter plot - Pair1 (x1 & y1)", pch = 20)
plot(x2,y2, main = "Scatter plot - Pair2 (x2 & y2)", pch = 20)
plot(x3,y3, main = "Scatter plot - Pair3 (x3 & y3)", pch = 20)
plot(x4,y4, main = "Scatter plot - Pair4 (x4 & y4)", pch = 20)
lm() function.
LM1 <- lm(y1~x1)
summary(LM1)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.92127 -0.45577 -0.04136  0.70941  1.83882
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
LM2 <- lm(y2~x2)
summary(LM2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9009 -0.7609  0.1291  0.9491  1.2691
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.001      1.125   2.667  0.02576 *
## x2             0.500      0.118   4.239  0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
LM3 <- lm(y3~x3)
summary(LM3)
##
## Call:
## lm(formula = y3 ~ x3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.1586 -0.6146 -0.2303  0.1540  3.2411
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0025     1.1245   2.670  0.02562 *
## x3            0.4997     0.1179   4.239  0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
LM4 <- lm(y4~x4)
summary(LM4)
##
## Call:
## lm(formula = y4 ~ x4)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -1.751 -0.831  0.000  0.809  1.839
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0017     1.1239   2.671  0.02559 *
## x4            0.4999     0.1178   4.243  0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
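To compare the four fits side by side without scanning four separate summaries, the key numbers can be gathered into one table. This is a sketch under the assumption that only the intercept, slope, and R-squared are of interest; fits and fit_table are hypothetical names introduced here.

```r
library(datasets)

# Fit y_i ~ x_i for each of the four pairs
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# One row per pair: intercept, slope, and R-squared
fit_table <- t(sapply(fits, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))
rownames(fit_table) <- paste0("pair", 1:4)
print(round(fit_table, 3))
```

The table makes the point of the exercise visible at a glance: the rows are nearly indistinguishable even though the underlying data are not.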
par(mfrow = c(2,2))
plot(x1,y1, main = "Scatter plot - Pair1 (x1 & y1)", pch = 20)
abline(LM1, col="red")
plot(x2,y2, main = "Scatter plot - Pair2 (x2 & y2)", pch = 20)
abline(LM2, col="red")
plot(x3,y3, main = "Scatter plot - Pair3 (x3 & y3)", pch = 20)
abline(LM3, col="red")
plot(x4,y4, main = "Scatter plot - Pair4 (x4 & y4)", pch = 20)
abline(LM4, col="red")
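The four panels can also be drawn in a loop, which keeps the plotting options identical across pairs. The sketch below assumes base graphics, as used above, and additionally restores the previous par() settings afterwards; old_par is an illustrative name.

```r
library(datasets)

old_par <- par(mfrow = c(2, 2))   # 2 x 2 panel layout; remember the old setting
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 20,
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("Scatter plot - Pair", i))
  abline(lm(y ~ x), col = "red")  # least-squares line for this pair
}
par(old_par)                      # restore the previous layout
```

Restoring the layout prevents later plots in the same session from being squeezed into a quarter of the device.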
anova(LM1)
## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(LM2)
## Analysis of Variance Table
##
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(LM3)
## Analysis of Variance Table
##
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(LM4)
## Analysis of Variance Table
##
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
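Anscombe’s Quartet consists of four datasets that are nearly identical in their summary statistics: the means, variances, correlations, and fitted regression lines all agree closely. Once plotted, however, the four datasets look very different. Pair 1 shows a roughly linear relationship with moderate scatter around the fitted line. Pair 2 follows a smooth curve rather than a straight line, so a linear model is the wrong description of it. Pair 3 is almost perfectly linear except for one outlier, which pulls the fitted line away from the remaining points. In pair 4, x takes the same value for every observation but one, and that single point alone determines the slope of the fitted line.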
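What the plots show for pairs 2, 3, and 4 can also be checked numerically. The sketch below is one possible check, not part of the original assignment: it assumes a quadratic term is a reasonable stand-in for the curvature in pair 2, and it treats the point with the largest y3 as the outlier in pair 3. All object names are illustrative.

```r
library(datasets)

# Pair 2: add a squared term and see how much of the variance it explains
quad_fit <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
r2_quadratic <- summary(quad_fit)$r.squared

# Pair 3: drop the single point with the largest y3 and recompute the correlation
outlier_row <- which.max(anscombe$y3)
cor_without_outlier <- with(anscombe[-outlier_row, ], cor(x3, y3))

# Pair 4: count how many distinct x values occur
n_distinct_x4 <- length(unique(anscombe$x4))

print(c(r2_quadratic        = r2_quadratic,
        cor_without_outlier = cor_without_outlier,
        n_distinct_x4       = n_distinct_x4))
```

If the plots are read correctly, the quadratic fit for pair 2 and the outlier-free correlation for pair 3 should both come out very close to 1, and pair 4 should have only two distinct x values.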
From Anscombe’s Quartet we can conclude that statistical analysis should be accompanied by data visualization, because plots reveal structure in the data that summary statistics alone can hide.