The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes or answers the activity or question requested. Upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, upload it to your RPubs account, and submit the link to the “Problem Set 2” assignment on Moodle. Points will be deducted for uploading in the improper format.
anscombe data that is part of the library(datasets) in R. And assign that data to a new object called data.
library(datasets)
data<-anscombe
data #Display the anscombe dataset
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
fBasics() package!)
# Computing the mean and variance of each column:
mean(data$x1) # mean of x1
## [1] 9
mean(data$x2) # mean of x2
## [1] 9
mean(data$x3) # mean of x3
## [1] 9
mean(data$x4) # mean of x4
## [1] 9
var(data$x1) # variance of x1
## [1] 11
var(data$x2) # variance of x2
## [1] 11
var(data$x3) # variance of x3
## [1] 11
var(data$x4) # variance of x4
## [1] 11
mean(data$y1) # mean of y1
## [1] 7.500909
mean(data$y2) # mean of y2
## [1] 7.500909
mean(data$y3) # mean of y3
## [1] 7.5
mean(data$y4) # mean of y4
## [1] 7.500909
var(data$y1) # variance of y1
## [1] 4.127269
var(data$y2) # variance of y2
## [1] 4.127629
var(data$y3) # variance of y3
## [1] 4.12262
var(data$y4) # variance of y4
## [1] 4.123249
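The column-by-column calls above can be condensed. The following is a minimal sketch, assuming the same `anscombe` data frame from `datasets`; it uses `sapply()` to apply `mean()` and `var()` to every column in one step. The object name `summary_stats` is chosen for illustration and is not part of the assignment.

```r
library(datasets)

# Apply mean() and var() to every column of anscombe at once.
# 'summary_stats' is an illustrative name, not something the assignment defines.
summary_stats <- sapply(anscombe, function(col) {
  c(mean = mean(col), variance = var(col))
})

# Rows are the two statistics; columns are x1..x4 and y1..y4.
round(summary_stats, 3)
```

Laying the statistics side by side in one matrix makes it easier to see how closely the four x columns and the four y columns agree.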
library("fBasics")
## Warning: package 'fBasics' was built under R version 3.6.1
## Loading required package: timeDate
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.6.1
# Correlation between each pair:
correlationTest(data$x1, data$y1)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Tue Sep 24 20:48:24 2019
correlationTest(data$x2, data$y2)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Tue Sep 24 20:48:24 2019
correlationTest(data$x3, data$y3)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Tue Sep 24 20:48:24 2019
correlationTest(data$x4, data$y4)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Tue Sep 24 20:48:24 2019
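If only the correlation coefficients are needed, base R's `cor()` may be enough. The sketch below assumes the `anscombe` data frame and pairs each x column with its matching y column through `mapply()`; the names `x_cols`, `y_cols`, and `correlations` are illustrative.

```r
library(datasets)

# Column names for the four x-y pairs.
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Pearson correlation for each pair, computed with base R's cor().
correlations <- mapply(
  function(x, y) cor(anscombe[[x]], anscombe[[y]]),
  x_cols, y_cols
)
names(correlations) <- paste(x_cols, y_cols, sep = "-")

round(correlations, 4)
```

Unlike `correlationTest()`, this returns only the estimates, without p-values or confidence intervals.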
plot(data$x1, data$y1, main = "Scatter Plot of x1,y1",xlab="x1",ylab="y1")
plot(data$x2, data$y2, main = "Scatter Plot of x2,y2",xlab="x2",ylab="y2")
plot(data$x3, data$y3, main = "Scatter Plot of x3,y3",xlab="x3",ylab="y3")
plot(data$x4, data$y4, main = "Scatter Plot of x4,y4",xlab="x4",ylab="y4")
par(mfrow=c(2,2))
plot(data$x1, data$y1, main = "Scatter Plot of x1,y1",xlab="x1",ylab="y1", pch=20)
plot(data$x2, data$y2, main = "Scatter Plot of x2,y2",xlab="x2",ylab="y2", pch=20)
plot(data$x3, data$y3, main = "Scatter Plot of x3,y3",xlab="x3",ylab="y3", pch=20)
plot(data$x4, data$y4, main = "Scatter Plot of x4,y4",xlab="x4",ylab="y4", pch=20)
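The four panel calls above differ only in the column index, so they can be generated in a loop. This is a sketch under the assumption that shared axis limits are wanted for easier comparison across panels, which the original plots do not use; `plot_quartet` is a hypothetical helper written for this example.

```r
library(datasets)

# Draw the four x-y pairs in a 2x2 grid with common axis limits.
# 'plot_quartet' is a hypothetical helper, not a function from any package.
plot_quartet <- function(df = anscombe) {
  old_par <- par(mfrow = c(2, 2))  # switch to a 2x2 layout, remembering the old one
  on.exit(par(old_par))            # restore the previous layout on exit
  x_range <- range(unlist(df[paste0("x", 1:4)]))
  y_range <- range(unlist(df[paste0("y", 1:4)]))
  for (i in 1:4) {
    plot(df[[paste0("x", i)]], df[[paste0("y", i)]],
         main = paste0("Scatter Plot of x", i, ",y", i),
         xlab = paste0("x", i), ylab = paste0("y", i),
         xlim = x_range, ylim = y_range, pch = 20)
  }
  # Return the limits invisibly so they can be inspected or reused.
  invisible(list(xlim = x_range, ylim = y_range))
}

limits <- plot_quartet()
```

With common limits, the isolated point at x4 = 19 and the high point in the third data set stand out against the same scale in every panel.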
lm() function.
model1<-lm(data$y1~data$x1)
summary(model1)
##
## Call:
## lm(formula = data$y1 ~ data$x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## data$x1 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
model2<-lm(data$y2~data$x2)
summary(model2)
##
## Call:
## lm(formula = data$y2 ~ data$x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## data$x2 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
model3<-lm(data$y3~data$x3)
summary(model3)
##
## Call:
## lm(formula = data$y3 ~ data$x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## data$x3 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
model4<-lm(data$y4~data$x4)
summary(model4)
##
## Call:
## lm(formula = data$y4 ~ data$x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## data$x4 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
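The four summaries above can be condensed into one comparison table. The sketch below assumes the `anscombe` data frame and passes it through the `data` argument of `lm()`, so coefficients are labelled `x1` rather than `data$x1`; the names `models` and `fit_table` are illustrative.

```r
library(datasets)

# Fit y_i ~ x_i for i = 1..4, building each formula with reformulate().
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect intercept, slope, and R-squared into one row per model.
fit_table <- t(sapply(models, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))
rownames(fit_table) <- paste0("model", 1:4)

round(fit_table, 4)
```

Each row should show an intercept near 3, a slope near 0.5, and an R-squared near 0.667, matching the individual summaries.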
par(mfrow=c(2,2))
plot(data$x1, data$y1, main = "Scatter Plot of x1,y1",xlab="x1",ylab="y1", pch=20)
abline(model1,col="green")
plot(data$x2, data$y2, main = "Scatter Plot of x2,y2",xlab="x2",ylab="y2", pch=20)
abline(model2,col="green")
plot(data$x3, data$y3, main = "Scatter Plot of x3,y3",xlab="x3",ylab="y3", pch=20)
abline(model3,col="green")
plot(data$x4, data$y4, main = "Scatter Plot of x4,y4",xlab="x4",ylab="y4", pch=20)
abline(model4,col="green")
# Using anova() to produce the analysis of variance table for each model
anova(model1)
## Analysis of Variance Table
##
## Response: data$y1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## data$x1    1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2)
## Analysis of Variance Table
##
## Response: data$y2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## data$x2    1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model3)
## Analysis of Variance Table
##
## Response: data$y3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## data$x3    1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model4)
## Analysis of Variance Table
##
## Response: data$y4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## data$x4    1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anscombe’s Quartet consists of four data sets that look alike numerically. Each set consists of 11 pairs of x and y values, and the summary statistics are nearly identical across the sets. However, this exercise shows that summary statistics alone are not sufficient to analyze the quartet, because they do not fully reflect the structure of the data, as explained below.
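In terms of summary statistics, the four data sets appear to be identical: the mean and variance of every x column are 9 and 11, respectively, and the mean and variance of every y column are approximately 7.5 and 4.12, respectively. The correlations (about 0.816) and the fitted regression lines (intercept about 3.00, slope about 0.50) also agree closely.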
However, scatter plots of the x, y pairs show that the data sets are not similar:
- Dataset1 roughly follows a linear relationship with a positive slope.
- Dataset2 does not show a linear relationship; the points follow a curve.
- Dataset3 shows a strong linear relationship with tightly clustered points, except for one outlier.
- Dataset4 shows that x stays constant across most values of y, except for one outlier.
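These visual impressions can also be checked numerically. The following is a rough sketch, assuming the `anscombe` data frame; the choice of a quadratic term for the second data set and of the largest absolute residual as the outlier criterion for the third are assumptions made for this example, not part of the original analysis.

```r
library(datasets)

# Dataset 2: adding a squared term captures the curvature a straight line misses.
quad_fit <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
summary(quad_fit)$r.squared

# Dataset 3: locate the observation with the largest absolute residual
# from the straight-line fit.
linear_fit3 <- lm(y3 ~ x3, data = anscombe)
outlier_row <- unname(which.max(abs(residuals(linear_fit3))))
anscombe[outlier_row, c("x3", "y3")]

# Dataset 4: count how often each distinct x4 value occurs.
table(anscombe$x4)
```

The quadratic fit explains far more of the variation in y2 than the straight line does, the flagged row in the third data set is the point that sits well above the others, and x4 takes only two distinct values.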
Hence, this exercise on Anscombe’s Quartet shows that summary statistics alone may not tell the complete story about the data. Visualization is an extremely important component of analyzing and understanding data.