Q1. Anscombe's quartet is a set of four \(x, y\) data sets published by Francis Anscombe in his 1973 paper "Graphs in Statistical Analysis". For this first question, examine the built-in R data set `anscombe`.
str(anscombe)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
The `anscombe` data set has 8 variables (x1 to x4 and y1 to y4), each with 11 observations.
data("anscombe")  # loads the data frame; data() returns only the name string, so do not assign it
x1 <- anscombe[,1]
x2 <- anscombe[,2]
x3 <- anscombe[,3]
x4 <- anscombe[,4]
y1 <- anscombe[,5]
y2 <- anscombe[,6]
y3 <- anscombe[,7]
y4 <- anscombe[,8]
mean(x1)
## [1] 9
var(x1)
## [1] 11
mean(x2)
## [1] 9
var(x2)
## [1] 11
mean(x3)
## [1] 9
var(x3)
## [1] 11
mean(x4)
## [1] 9
var(x4)
## [1] 11
mean(y1)
## [1] 7.500909
var(y1)
## [1] 4.127269
mean(y2)
## [1] 7.500909
var(y2)
## [1] 4.127629
mean(y3)
## [1] 7.5
var(y3)
## [1] 4.12262
mean(y4)
## [1] 7.500909
var(y4)
## [1] 4.123249
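The sixteen separate calls above can be condensed into one table. The following is a minimal sketch using base R only; the name `stats` is introduced here for illustration and is not part of the original analysis.

```r
# Mean and variance of every column of the built-in anscombe data frame.
# sapply() returns a 2 x 8 matrix: rows "mean" and "var", one column per variable.
stats <- sapply(anscombe, function(col) c(mean = mean(col), var = var(col)))
round(stats, 3)
```

Reading across the rows makes the point of the quartet easier to see: the four x columns share one mean and variance, and the four y columns agree to about two decimal places.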
library(fBasics)
## Warning: package 'fBasics' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.4
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.4.4
correlationTest(x1,y1)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Thu Sep 13 22:39:14 2018
correlationTest(x2,y2)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Thu Sep 13 22:39:14 2018
correlationTest(x3,y3)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Thu Sep 13 22:39:14 2018
correlationTest(x4,y4)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Thu Sep 13 22:39:14 2018
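If only the correlation coefficients are needed, base R's `cor()` gives them without loading fBasics. This is a sketch under that assumption; `x_cols`, `y_cols` and `r` are illustrative names, not part of the original analysis.

```r
# Pearson correlation for each (x_i, y_i) pair of the quartet, using base R only.
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)
r <- mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]), x_cols, y_cols)
round(r, 4)
```

All four values round to about 0.816, matching the `Correlation` lines in the test output above.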
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
plot(x1,y1, main = "Scatter plot between x1 & y1")
plot(x2,y2,main = "Scatter plot between x2 & y2")
plot(x3,y3, main = "Scatter plot between x3 & y3")
plot(x4,y4, main = "Scatter plot between x4 & y4")
par(mfrow = c(2,2))
plot(x1,y1, main = "Scatter plot between x1 & y1", pch = 19)
plot(x2,y2,main = "Scatter plot between x2 & y2", pch = 19)
plot(x3,y3, main = "Scatter plot between x3 & y3", pch = 19)
plot(x4,y4, main = "Scatter plot between x4 & y4", pch = 19)
Lm1 <- lm( x1~y1)
summary(Lm1)
##
## Call:
## lm(formula = x1 ~ y1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6522 -1.5117 -0.2657 1.2341 3.8946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9975 2.4344 -0.410 0.69156
## y1 1.3328 0.3142 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
Lm2 <- lm(x2~y2)
summary(Lm2)
##
## Call:
## lm(formula = x2 ~ y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8516 -1.4315 -0.3440 0.8467 4.2017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9948 2.4354 -0.408 0.69246
## y2 1.3325 0.3144 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
Lm3 <- lm(x3~y3)
summary(Lm3)
##
## Call:
## lm(formula = x3 ~ y3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9869 -1.3733 -0.0266 1.3200 3.2133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0003 2.4362 -0.411 0.69097
## y3 1.3334 0.3145 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
Lm4 <- lm(x4~y4)
summary(Lm4)
##
## Call:
## lm(formula = x4 ~ y4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7859 -1.4122 -0.1853 1.4551 3.3329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0036 2.4349 -0.412 0.68985
## y4 1.3337 0.3143 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
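The four models above treat x as the response and y as the predictor. The quartet is more commonly presented the other way round, with y regressed on x. As a hedged sketch of that conventional direction (the names `fits` and `coefs` are illustrative assumptions):

```r
# Fit y_i ~ x_i for each of the four data sets and collect the coefficients.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
rownames(coefs) <- paste0("set", 1:4)
round(coefs, 3)
```

In this direction every data set gives an intercept close to 3 and a slope close to 0.5, so the fitted lines are nearly indistinguishable even though the scatter plots are not.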
anova(Lm1)  # test = "Chisq" has no effect on a single lm fit; the table reports an F test
## Analysis of Variance Table
##
## Response: x1
## Df Sum Sq Mean Sq F value Pr(>F)
## y1 1 73.32 73.320 17.99 0.00217 **
## Residuals 9 36.68 4.076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Lm2)
## Analysis of Variance Table
##
## Response: x2
## Df Sum Sq Mean Sq F value Pr(>F)
## y2 1 73.287 73.287 17.966 0.002179 **
## Residuals 9 36.713 4.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Lm3)
## Analysis of Variance Table
##
## Response: x3
## Df Sum Sq Mean Sq F value Pr(>F)
## y3 1 73.296 73.296 17.972 0.002176 **
## Residuals 9 36.704 4.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Lm4)
## Analysis of Variance Table
##
## Response: x4
## Df Sum Sq Mean Sq F value Pr(>F)
## y4 1 73.338 73.338 18.003 0.002165 **
## Residuals 9 36.662 4.074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
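For a regression with a single predictor, the ANOVA F statistic is the square of the slope's t statistic, and R-squared is the square of the Pearson correlation. That is why the correlation tests, the `summary()` output and the ANOVA tables all report the same p-values. A small check on the first data set, with illustrative variable names:

```r
# Verify F = t^2 and R^2 = r^2 for the first Anscombe pair.
fit <- lm(x1 ~ y1, data = anscombe)
s <- summary(fit)
t_slope <- s$coefficients["y1", "t value"]
f_stat <- unname(s$fstatistic["value"])
r <- cor(anscombe$x1, anscombe$y1)
c(t_squared = t_slope^2, F = f_stat, r_squared = r^2, R2 = s$r.squared)
```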
The data frame we will be working with today is called datasaurus_dozen and it’s in the datasauRus package. This single data frame contains 13 datasets, designed to show us why data visualisation is important and how summary statistics alone can be misleading.
To find out more about the dataset, load the datasauRus package and then type `?datasaurus_dozen` in your Console or in R Markdown. A question mark before the name of an object brings up its help file, provided the package that documents it is loaded.
library(datasauRus)      # load the package first so the help page and the data can be found
?datasaurus_dozen
str(datasaurus_dozen)    # pass the object itself, not its name in quotes
tail(datasaurus_dozen)
names(datasaurus_dozen)  # column names
We will plot the x-y values of all 13 datasets to see their visual patterns.
library(ggplot2)
library(datasauRus)
# One facet per dataset; axes and legend are hidden so only the point patterns show.
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  theme_void() +
  theme(legend.position = "none") +
  facet_wrap(~dataset, ncol = 3)
## Warning: package 'datasauRus' was built under R version 3.4.4
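# Use == to filter rows; with a single = the condition is ignored and all 13 datasets are plotted.
plot(y ~ x, data = subset(datasaurus_dozen, dataset == "dino"),
     main = "The Datasaurus", xlab = "x", ylab = "y",
     pch = 19, las = 1)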
par(mar = c(3.5, 3.5, 2, 2))
dataset_names <- unique(datasaurus_dozen$dataset)
par(mfrow = c(4, 4))  # 16 panels, enough for the 13 datasets
for (i in dataset_names) {
  plot(y ~ x, data = subset(datasaurus_dozen, dataset == i), main = i)
}
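To back up the claim that summary statistics alone can be misleading, the same statistics computed for the Anscombe data can be computed per dataset here. This is a sketch that assumes the datasauRus package is installed; it is guarded so that it does nothing otherwise. The names `dd` and `by_set` are illustrative.

```r
# Per-dataset means, standard deviations and correlation for datasaurus_dozen.
if (requireNamespace("datasauRus", quietly = TRUE)) {
  dd <- datasauRus::datasaurus_dozen
  by_set <- do.call(rbind, lapply(split(dd, dd$dataset), function(d) {
    data.frame(mean_x = mean(d$x), mean_y = mean(d$y),
               sd_x = sd(d$x), sd_y = sd(d$y),
               cor_xy = cor(d$x, d$y))
  }))
  print(round(by_set, 2))
}
```

The rows of `by_set` should be nearly identical across all 13 datasets, even though the scatter plots above range from a dinosaur to a star to parallel lines.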