The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the requested activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignment on Moodle.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
library(datasets)
data = anscombe
fBasics() package!)
#install.packages("fBasics")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
dataA=select(data,x=x1,y=y1)
dataB=select(data,x=x2,y=y2)
dataC=select(data,x=x3,y=y3)
dataD=select(data,x=x4,y=y4)
dataA$group='DataA'
dataB$group='DataB'
dataC$group='DataC'
dataD$group='DataD'
data_all=rbind(dataA,dataB,dataC,dataD)
library("fBasics")
## Loading required package: timeDate
## Loading required package: timeSeries
##
## Rmetrics Package fBasics
## Analysing Markets and calculating Basic Statistics
## Copyright (C) 2005-2014 Rmetrics Association Zurich
## Educational Software for Financial Engineering and Computational Science
## Rmetrics is free software and comes with ABSOLUTELY NO WARRANTY.
## https://www.rmetrics.org --- Mail to: info@rmetrics.org
correlationTest(dataA$x, dataA$y)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Tue Sep 12 08:38:22 2017
correlationTest(dataB$x, dataB$y)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Tue Sep 12 08:38:22 2017
correlationTest(dataC$x, dataC$y)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Tue Sep 12 08:38:22 2017
correlationTest(dataD$x, dataD$y)
##
## Title:
## Pearson's Correlation Test
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Tue Sep 12 08:38:22 2017
stats_summ=data_all%>%group_by(group)%>%summarise("Mean X"=mean(x),
"Sample Variance X"=var(x),
"Mean Y" = mean(y),
"Sample Variance Y"=var(y),
"Correlation Between X and Y"=cor(x,y))
plot(dataA$x, dataA$y, main = "Scatter Plot 1 - y1, x1")
plot(dataB$x, dataB$y, main = "Scatter Plot 2 - y2, x2")
plot(dataC$x, dataC$y, main = "Scatter Plot 3 - y3, x3")
plot(dataD$x, dataD$y, main = "Scatter Plot 4 - y4, x4")
par(mfrow= c(2,2))
plot(dataA$x, dataA$y, main = "Scatter Plot 1 - y1, x1", pch = 20)
plot(dataB$x, dataB$y, main = "Scatter Plot 2 - y2, x2", pch = 20)
plot(dataC$x, dataC$y, main = "Scatter Plot 3 - y3, x3", pch = 20)
plot(dataD$x, dataD$y, main = "Scatter Plot 4 - y4, x4", pch = 20)
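The four-panel layout can also be drawn in a loop. The following is a minimal sketch, not the assignment's required solution: it assumes only base R graphics and the built-in anscombe data frame, and the name `old_par` is introduced here for illustration. It saves the previous graphical parameters and restores them afterwards, so later plots are not forced into the 2 x 2 grid, and it uses the same axis limits in every panel so the panels can be compared directly.

```r
library(datasets)

# par() returns the previous settings invisibly, so they can be restored later.
old_par <- par(mfrow = c(2, 2))

# Common axis limits across all four x columns and all four y columns.
x_limits <- range(unlist(anscombe[, 1:4]))
y_limits <- range(unlist(anscombe[, 5:8]))

for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 20,               # pch = 20 draws small solid circles
       xlim = x_limits, ylim = y_limits,
       main = paste("Scatter Plot", i),
       xlab = paste0("x", i), ylab = paste0("y", i))
}

# Restore the layout that was active before this chunk.
par(old_par)
```

Sharing the axis limits is a design choice: with free scales each panel fills its own frame, which can hide how differently the points are spread.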
lm() function.
model1 = lm(dataA$y ~ dataA$x)
model2 = lm(dataB$y ~ dataB$x)
model3 = lm(dataC$y ~ dataC$x)
model4 = lm(dataD$y ~ dataD$x)
summary(model1)
##
## Call:
## lm(formula = dataA$y ~ dataA$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## dataA$x 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
summary(model2)
##
## Call:
## lm(formula = dataB$y ~ dataB$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## dataB$x 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
summary(model3)
##
## Call:
## lm(formula = dataC$y ~ dataC$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1586 -0.6146 -0.2303 0.1540 3.2411
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0025 1.1245 2.670 0.02562 *
## dataC$x 0.4997 0.1179 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
summary(model4)
##
## Call:
## lm(formula = dataD$y ~ dataD$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## dataD$x 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
linear_model = data_all %>% group_by(group) %>%
do(mod=lm(y~x,data=.)) %>%
do(data.frame(var=names(coef(.$mod)),coef=round(coef(.$mod),2),group=.$group)) %>%
dcast(.,group~var,value.var="coef")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
reg_summ=tibble("Linear Regression"=paste0("y=",linear_model$"(Intercept)","+",linear_model$x,"x"))
stats_and_linear_model_summ = cbind(stats_summ,reg_summ)
stats_and_linear_model_summ
## group Mean X Sample Variance X Mean Y Sample Variance Y
## 1 DataA 9 11 7.500909 4.127269
## 2 DataB 9 11 7.500909 4.127629
## 3 DataC 9 11 7.500000 4.122620
## 4 DataD 9 11 7.500909 4.123249
## Correlation Between X and Y Linear Regression
## 1 0.8164205 y=3+0.5x
## 2 0.8162365 y=3+0.5x
## 3 0.8162867 y=3+0.5x
## 4 0.8165214 y=3+0.5x
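The regression coefficients can also be collected without `do()` and `dcast()`. This is a sketch under stated assumptions rather than a replacement for the code above: it uses only base R and the built-in anscombe data frame, and the names `coef_table`, `intercept` and `slope` are chosen here for illustration. Passing `data =` to `lm()` keeps the coefficient names short (`x` instead of `dataA$x`).

```r
library(datasets)

# Fit y ~ x for each of the four pairs and keep the two coefficients.
coef_table <- t(sapply(1:4, function(i) {
  pair <- data.frame(x = anscombe[[paste0("x", i)]],
                     y = anscombe[[paste0("y", i)]])
  coef(lm(y ~ x, data = pair))
}))

# Label rows to match the group names used in this report.
rownames(coef_table) <- paste0("Data", LETTERS[1:4])
colnames(coef_table) <- c("intercept", "slope")

print(round(coef_table, 2))
```

Rounded to two decimals, every row should agree with the equation y = 3 + 0.5x reported in the table above.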
ggplot(data_all, aes(x = x, y = y)) +
  geom_point(shape = 21, color = "blue", fill = "purple", size = 2) +
  ggtitle("Anscombe's Datasets") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~group, scales = "free")
anova(model1)
## Analysis of Variance Table
##
## Response: dataA$y
## Df Sum Sq Mean Sq F value Pr(>F)
## dataA$x 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2)
## Analysis of Variance Table
##
## Response: dataB$y
## Df Sum Sq Mean Sq F value Pr(>F)
## dataB$x 1 27.500 27.5000 17.966 0.002179 **
## Residuals 9 13.776 1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model3)
## Analysis of Variance Table
##
## Response: dataC$y
## Df Sum Sq Mean Sq F value Pr(>F)
## dataC$x 1 27.470 27.4700 17.972 0.002176 **
## Residuals 9 13.756 1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model4)
## Analysis of Variance Table
##
## Response: dataD$y
## Df Sum Sq Mean Sq F value Pr(>F)
## dataD$x 1 27.490 27.4900 18.003 0.002165 **
## Residuals 9 13.742 1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The four pairs of x and y values, as well as the fitted models, appear to be nearly identical: in every pair the mean of x is 9, the variance of x is 11, the mean of y is about 7.5 and the variance of y is about 4.12. The correlation between x and y in all four pairs is about 0.816, and the linear regression equations are all y = 3 + 0.5x after rounding.
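How close the four datasets are can be quantified by the spread of each statistic across them. The following is a minimal sketch, assuming only base R and the built-in anscombe data frame; the names `quartet_stats` and `max_spread` are introduced here for illustration.

```r
library(datasets)

# One row per dataset, one column per summary statistic.
quartet_stats <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y))
}))

# For each statistic, the largest difference between any two datasets.
max_spread <- apply(quartet_stats, 2, function(col) diff(range(col)))

print(signif(max_spread, 2))
```

A small value in every column of `max_spread` means the summary statistics alone cannot tell the four datasets apart.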
However, plotting x against y shows that the four datasets are very different. Dataset 1 shows a roughly linear relationship with some scatter; dataset 2 shows a clear but curved, non-linear relationship; dataset 3 shows a very strong linear relationship with one outlier; and dataset 4 shows a constant x value with varying y values, except for one observation.
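The differences seen in the plots can also be checked numerically. This is a hedged sketch rather than a full diagnostic workflow: it assumes only base R and the built-in anscombe data frame, and all object names (`fit2_linear`, `fit2_quadratic`, `fit3`, `outlier_row`, `n_distinct_x4`) are introduced here for illustration.

```r
library(datasets)

# Dataset 2: compare a straight-line fit with a quadratic fit.
# A large jump in R-squared suggests the relationship is curved.
fit2_linear    <- lm(y2 ~ x2, data = anscombe)
fit2_quadratic <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
r2_linear      <- summary(fit2_linear)$r.squared
r2_quadratic   <- summary(fit2_quadratic)$r.squared

# Dataset 3: locate the observation with the largest absolute residual,
# a simple way to flag a candidate outlier.
fit3        <- lm(y3 ~ x3, data = anscombe)
outlier_row <- which.max(abs(residuals(fit3)))

# Dataset 4: count how many distinct x values there are.
n_distinct_x4 <- length(unique(anscombe$x4))

print(c(r2_linear = r2_linear, r2_quadratic = r2_quadratic))
print(anscombe[outlier_row, c("x3", "y3")])
print(n_distinct_x4)
```

These checks are a complement to the plots, not a substitute: each one was chosen only after the plot showed what to look for.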
Anscombe’s Quartet shows how important data visualization is in data analysis and how much it helps us make initial judgments about relationships between variables before carrying out statistical calculations. Without visualization, summary statistics alone can be misleading and lead us to a wrong conclusion.