We use the anscombe data set that is part of the datasets package in R, and assign it to a new object called data.

library(datasets)
data(anscombe)   #load 'anscombe' with data()
data <- anscombe #assign it to a new object named data
data
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
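Before splitting the columns, it can help to confirm the structure of this copy; a quick sketch using base R:

str(data)  #11 observations of 8 numeric variables (x1-x4, y1-y4)
dim(data)  #11 rows, 8 columns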
For the correlation tests below we use the correlationTest() function from the fBasics package.

library(fBasics)
## Loading required package: timeDate
## Loading required package: timeSeries
library(timeDate)
library(timeSeries)
x1<-data[,1];x2<-data[,2];x3<-data[,3];x4<-data[,4] #extract each x column into its own vector
y1<-data[,5];y2<-data[,6];y3<-data[,7];y4<-data[,8] #extract each y column into its own vector
mean(x1);var(x1)
## [1] 9
## [1] 11
mean(x2);var(x2)
## [1] 9
## [1] 11
mean(x3);var(x3)
## [1] 9
## [1] 11
mean(x4);var(x4)
## [1] 9
## [1] 11
mean(y1);var(y1)
## [1] 7.500909
## [1] 4.127269
mean(y2);var(y2)
## [1] 7.500909
## [1] 4.127629
mean(y3);var(y3)
## [1] 7.5
## [1] 4.12262
mean(y4);var(y4) #calculate the mean and variance of each column
## [1] 7.500909
## [1] 4.123249
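The column-by-column calls above can also be written more compactly; a minimal sketch, assuming data still holds the anscombe copy made earlier:

#mean and variance of every column in one pass
round(sapply(data, mean), 3) #every x column: 9; every y column: about 7.5
round(sapply(data, var), 3)  #every x column: 11; every y column: about 4.12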
correlationTest(x1,y1);correlationTest(x2,y2);correlationTest(x3,y3);correlationTest(x4,y4)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Thu Aug 23 00:07:40 2018
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Thu Aug 23 00:07:40 2018
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Thu Aug 23 00:07:40 2018
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Thu Aug 23 00:07:40 2018
#find the correlation for each (x, y) pair
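The same Pearson correlations are available in base R without fBasics; a small sketch for comparison:

#correlation of each (x, y) pair with base R
sapply(1:4, function(i) cor(data[, i], data[, i + 4])) #all four are roughly 0.816
cor.test(x1, y1) #full hypothesis test for the first pair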
library(ggplot2)
plot(x1,y1, main = "Scatter plot for (x1, y1)")
plot(x2,y2, main = "Scatter plot for (x2, y2)")
plot(x3,y3, main = "Scatter plot for (x3, y3)")
plot(x4,y4, main = "Scatter plot for (x4, y4)")
par(mfrow = c(2,2))
plot(x1, y1, main = "Scatter plot for (x1, y1)", type = 'p', col = 'purple', pch = 16)
plot(x2, y2,main = "Scatter plot for (x2, y2)", type = 'p', col = 'red', pch = 16)
plot(x3, y3, main = "Scatter plot for (x3, y3)", type = 'p', col = 'blue', pch = 16)
plot(x4, y4, main = "Scatter plot for (x4, y4)", type = 'p', col = 'green', pch = 16)
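ggplot2 is loaded above but not yet used; as an alternative to the base plots, the quartet can be drawn as facets from a long-format copy of the data. A sketch (the reshaping uses only base R; the names long and set are introduced here for illustration):

#reshape the quartet into long format and facet by data set
long <- do.call(rbind, lapply(1:4, function(i)
  data.frame(set = paste0("Data set ", i), x = data[, i], y = data[, i + 4])))
ggplot(long, aes(x, y)) +
  geom_point(colour = "steelblue") +
  facet_wrap(~ set) +
  labs(title = "Anscombe's quartet")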
Next, we fit a simple linear regression to each (x, y) pair with the lm() function.

fit1<-lm(y1~x1)
summary(fit1)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
fit2<-lm(y2~x2)
summary(fit2)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
fit3<-lm(y3~x3)
summary(fit3)
##
## Call:
## lm(formula = y3 ~ x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
fit4<-lm(y4~x4)
summary(fit4)
##
## Call:
## lm(formula = y4 ~ x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
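To compare the four fits at a glance, the coefficients and R-squared values can be collected in one place; a short sketch, assuming fit1 through fit4 from above:

#collect intercepts, slopes and R-squared values of the four fits
fits <- list(fit1, fit2, fit3, fit4)
sapply(fits, coef)                             #every intercept is about 3.00, every slope about 0.50
sapply(fits, function(f) summary(f)$r.squared) #all four R-squared values are about 0.67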
par(mfrow=c(2,2))
plot(x1,y1, main="Scatter plot for (x1,y1)",pch=19)
abline(fit1, col="purple")
plot(x2,y2, main="Scatter plot for (x2,y2)",pch=19)
abline(fit2, col="orange")
plot(x3,y3, main="Scatter plot for (x3,y3)",pch=19)
abline(fit3, col="green")
plot(x4,y4, main="Scatter plot for (x4,y4)",pch=19)
abline(fit4, col="blue")
anova(fit1);anova(fit2);anova(fit3);anova(fit4)
## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
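Although the ANOVA tables are nearly identical, residual plots expose how differently each model fits; a minimal sketch that plots residuals against fitted values for each fit:

#residuals vs. fitted values for each fit; the patterns differ sharply
par(mfrow = c(2, 2))
plot(fitted(fit1), resid(fit1), main = "Residuals: fit1", pch = 16); abline(h = 0, lty = 2)
plot(fitted(fit2), resid(fit2), main = "Residuals: fit2", pch = 16); abline(h = 0, lty = 2)
plot(fitted(fit3), resid(fit3), main = "Residuals: fit3", pch = 16); abline(h = 0, lty = 2)
plot(fitted(fit4), resid(fit4), main = "Residuals: fit4", pch = 16); abline(h = 0, lty = 2)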
Anscombe's quartet contains eight columns, x1 through x4 and y1 through y4. One interesting fact is that, after summarizing the mean and variance of each column, the four data sets turn out to have nearly identical summary statistics, so we might guess that their graphs look similar as well. However, the scatter plots show that the four data sets are very different. In conclusion, it is not a good idea to compare data sets by summary statistics alone; we should also visualize the data to get a fuller view in data analysis.