The objective of this problem set is to orient you to a number of
activities in R and to conduct a thoughtful exercise in
appreciating the importance of data visualization. For each question,
enter your code or text response in the code chunk that
completes/answers the activity or question requested. To submit this
homework, you will create the document in RStudio, using the knitr
package (button included in RStudio), and then submit the document to
your RPubs account. Once uploaded, you
will submit the link to that document on Canvas. Please make sure that
this link is hyperlinked and that I can see the visualization and the
code required to create it. Each question is worth 5 points.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
str(anscombe)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
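(Hint: use the dplyr package!)
data("anscombe")   # loads the data set; data() itself returns only the name "anscombe"
data <- anscombe   # assign the data frame, not the name, to the new object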
x1 <- anscombe[,1]
x2 <- anscombe[,2]
x3 <- anscombe[,3]
x4 <- anscombe[,4]
y1 <- anscombe[,5]
y2 <- anscombe[,6]
y3 <- anscombe[,7]
y4 <- anscombe[,8]
mean(x1)
## [1] 9
var(x1)
## [1] 11
mean(x2)
## [1] 9
var(x2)
## [1] 11
mean(x3)
## [1] 9
var(x3)
## [1] 11
mean(x4)
## [1] 9
var(x4)
## [1] 11
mean(y1)
## [1] 7.500909
var(y1)
## [1] 4.127269
mean(y2)
## [1] 7.500909
var(y2)
## [1] 4.127629
mean(y3)
## [1] 7.5
var(y3)
## [1] 4.12262
mean(y4)
## [1] 7.500909
var(y4)
## [1] 4.123249
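The sixteen separate mean() and var() calls above can also be collapsed into a single pass over the columns. The following is a minimal sketch, assuming only base R and the built-in anscombe data frame; the name summary_stats is introduced here for illustration and is not part of the original answer.

```r
# Apply mean() and var() to every column of anscombe in one call.
# sapply() returns a 2 x 8 matrix: rows "mean" and "var", one column per variable.
summary_stats <- sapply(anscombe, function(col) c(mean = mean(col), var = var(col)))

# Round for display so the near-identical values are easy to compare side by side.
print(round(summary_stats, 3))
```

Putting the values in one table makes it easier to see that the x columns share the same mean and variance and that the y columns differ only in the third or fourth significant digit.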
library(fBasics)
correlationTest(x1,y1)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Sun Mar 12 16:34:38 2023
correlationTest(x2,y2)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Sun Mar 12 16:34:38 2023
correlationTest(x3,y3)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Sun Mar 12 16:34:38 2023
correlationTest(x4,y4)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Sun Mar 12 16:34:38 2023
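If only the correlation coefficients are needed, the four pairs can be compared in one short vector without the full test output. This is a sketch using base R's cor() rather than fBasics::correlationTest(); the names x_cols, y_cols, and correlations are illustrative assumptions.

```r
# Column names for the four x/y pairs.
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Pearson correlation for each pair (x1 with y1, x2 with y2, and so on).
correlations <- mapply(
  function(x, y) cor(anscombe[[x]], anscombe[[y]]),
  x_cols, y_cols
)

print(round(correlations, 4))
```

All four values should agree with the "Correlation" entries reported by correlationTest() above, at roughly 0.816.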
library(ggplot2)
plot(x1,y1, main = "Scatter Plot: x1 & y1")
plot(x2,y2,main = "Scatter Plot: x2 & y2")
plot(x3,y3, main = "Scatter Plot: x3 & y3")
plot(x4,y4, main = "Scatter Plot: x4 & y4")
par(mfrow = c(2,2))
plot(x1,y1, main = "Scatter Plot: x1 & y1", pch = 19)
plot(x2,y2,main = "Scatter Plot: x2 & y2", pch = 19)
plot(x3,y3, main = "Scatter Plot: x3 & y3", pch = 19)
plot(x4,y4, main = "Scatter Plot: x4 & y4", pch = 19)
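The plots above use base graphics even though ggplot2 is loaded. As a hedged alternative, the sketch below draws the same four panels with ggplot2 by first stacking the data into a long format. The data frame long and its columns set, x, and y are names assumed here for illustration, and the fitted line is an addition that the original plots do not include.

```r
library(ggplot2)

# Stack the four x/y pairs into one long data frame with a 'set' label.
# unlist() concatenates the columns in order (x1, x2, x3, x4), matching rep().
long <- data.frame(
  set = factor(rep(1:4, each = nrow(anscombe))),
  x   = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y   = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# One scatter plot per set, each with its own least-squares line.
p <- ggplot(long, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ set)

print(p)
```

Because every panel shares the same axes, the curved pattern in set 2 and the outliers in sets 3 and 4 stand out against fitted lines that look almost identical.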
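Fit a linear model to each data set using the lm() function.
# Note: in these formulas x is the response and y is the predictor.
Lm1 <- lm(x1 ~ y1)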
summary(Lm1)
##
## Call:
## lm(formula = x1 ~ y1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6522 -1.5117 -0.2657 1.2341 3.8946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9975 2.4344 -0.410 0.69156
## y1 1.3328 0.3142 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
Lm2 <- lm(x2~y2)
summary(Lm2)
##
## Call:
## lm(formula = x2 ~ y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8516 -1.4315 -0.3440 0.8467 4.2017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9948 2.4354 -0.408 0.69246
## y2 1.3325 0.3144 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
Lm3 <- lm(x3~y3)
summary(Lm3)
##
## Call:
## lm(formula = x3 ~ y3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9869 -1.3733 -0.0266 1.3200 3.2133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0003 2.4362 -0.411 0.69097
## y3 1.3334 0.3145 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
Lm4 <- lm(x4~y4)
summary(Lm4)
##
## Call:
## lm(formula = x4 ~ y4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7859 -1.4122 -0.1853 1.4551 3.3329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0036 2.4349 -0.412 0.68985
## y4 1.3337 0.3143 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
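The models above regress x on y. For comparison, the sketch below fits the more conventional orientation, y as the response and x as the predictor, and tabulates the coefficients. This is an alternative to the fits shown above, not a replacement for them; fits and coef_table are names assumed for illustration.

```r
# Fit y_i ~ x_i for each of the four sets.
# reformulate() builds the formula from column names, e.g. y1 ~ x1.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect intercepts and slopes into a 4 x 2 matrix, one row per set.
coef_table <- t(sapply(fits, coef))
rownames(coef_table) <- paste0("set", 1:4)
colnames(coef_table) <- c("intercept", "slope")

print(round(coef_table, 3))
```

In this orientation each set yields an intercept close to 3 and a slope close to 0.5, so the regression lines are nearly indistinguishable, just as the x-on-y fits above are.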
par(mfrow = c(2,2))
plot(Lm1)
plot(Lm2)
plot(Lm3)
plot(Lm4)
anova(Lm1, test ="Chisq")
## Analysis of Variance Table
##
## Response: x1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## y1         1  73.32  73.320   17.99 0.00217 **
## Residuals  9  36.68   4.076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Lm2, test ="Chisq")
## Analysis of Variance Table
##
## Response: x2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## y2         1 73.287  73.287  17.966 0.002179 **
## Residuals  9 36.713   4.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Lm3, test ="Chisq")
## Analysis of Variance Table
##
## Response: x3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## y3         1 73.296  73.296  17.972 0.002176 **
## Residuals  9 36.704   4.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(Lm4, test ="Chisq")
## Analysis of Variance Table
##
## Response: x4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## y4         1 73.338  73.338  18.003 0.002165 **
## Residuals  9 36.662   4.074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anscombe's quartet is a set of four datasets that have nearly identical summary statistics (means, variances, correlations, and regression fits) but look very different when plotted. It demonstrates the importance of data visualization and the disadvantage of relying solely on numerical summaries.
The lesson of Anscombe's quartet is that statistical measures can be misleading if used in isolation, and that visualizing the data is crucial for exploring and analyzing it.