The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
# load library
library(datasets)
#load data set and store it in data
data <- anscombe
(Hint: use the fBasics package!)
# install the package fBasics (only needed once)
# install.packages("fBasics")
#load library fBasics
library(fBasics)
#calculating the mean for each column
colMeans(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
#calculating the variance for each column
colVars(data)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
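The same column summaries can also be computed with base R alone. The sketch below is an alternative to `colVars()` from fBasics, not the approach used above; it assumes only the built-in datasets package, and `summary_table` is an illustrative name.

```r
# Column means and variances using base R only (no fBasics required)
library(datasets)
data <- anscombe

# sapply() applies mean() and var() to each column of the data frame
summary_table <- data.frame(
  mean     = sapply(data, mean),
  variance = sapply(data, var)
)
print(round(summary_table, 3))
```

Putting both statistics side by side makes it easier to see that the four x columns share one mean and variance, and the four y columns nearly do.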
xItems <- c("x1", "x2", "x3", "x4")
yItems <- c("y1", "y2", "y3", "y4")
# Pearson correlation test for each x-y pair
for (i in 1:4) {
  print(correlationTest(data[[xItems[i]]], data[[yItems[i]]], method = "pearson",
    title = paste0("correlation test between x", i, " and y", i)))
  print("####################################################################")
}
##
## Title:
## correlation test between x1 and y1
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8164
## STATISTIC:
## t: 4.2415
## P VALUE:
## Alternative Two-Sided: 0.00217
## Alternative Less: 0.9989
## Alternative Greater: 0.001085
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4244, 0.9507
## Less: -1, 0.9388
## Greater: 0.5113, 1
##
## Description:
## Wed Feb 07 20:12:53 2018
##
## [1] "####################################################################"
##
## Title:
## correlation test between x2 and y2
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8162
## STATISTIC:
## t: 4.2386
## P VALUE:
## Alternative Two-Sided: 0.002179
## Alternative Less: 0.9989
## Alternative Greater: 0.001089
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4239, 0.9506
## Less: -1, 0.9387
## Greater: 0.5109, 1
##
## Description:
## Wed Feb 07 20:12:53 2018
##
## [1] "####################################################################"
##
## Title:
## correlation test between x3 and y3
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8163
## STATISTIC:
## t: 4.2394
## P VALUE:
## Alternative Two-Sided: 0.002176
## Alternative Less: 0.9989
## Alternative Greater: 0.001088
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4241, 0.9507
## Less: -1, 0.9387
## Greater: 0.511, 1
##
## Description:
## Wed Feb 07 20:12:53 2018
##
## [1] "####################################################################"
##
## Title:
## correlation test between x4 and y4
##
## Test Results:
## PARAMETER:
## Degrees of Freedom: 9
## SAMPLE ESTIMATES:
## Correlation: 0.8165
## STATISTIC:
## t: 4.243
## P VALUE:
## Alternative Two-Sided: 0.002165
## Alternative Less: 0.9989
## Alternative Greater: 0.001082
## CONFIDENCE INTERVAL:
## Two-Sided: 0.4246, 0.9507
## Less: -1, 0.9388
## Greater: 0.5115, 1
##
## Description:
## Wed Feb 07 20:12:53 2018
##
## [1] "####################################################################"
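If only the correlation coefficients are needed, a more compact form is possible. The following is a minimal sketch using base R's `cor()` (Pearson by default) in place of `correlationTest()`; the names `x_cols`, `y_cols`, and `pair_cor` are illustrative and not part of the original solution.

```r
# Pearson correlation for each matched x-y pair, collected in one vector
library(datasets)
data <- anscombe

x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# mapply() walks through both name vectors in parallel
pair_cor <- mapply(function(x, y) cor(data[[x]], data[[y]]), x_cols, y_cols)
names(pair_cor) <- paste(x_cols, y_cols, sep = "-")
print(round(pair_cor, 4))
```

Collecting the four coefficients in a single vector makes their near equality visible at a glance, without scanning four separate test reports.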
# plotting each scatterplot separately
for (i in 1:4) {
  plot(data[[xItems[i]]], data[[yItems[i]]], main = "Scatterplot",
    xlab = "X axis", ylab = "Y axis", pch = 4)
}
# plotting all four on the same panel
par(mfrow = c(2, 2))
# plotting the scatterplots with solid circles
for (i in 1:4) {
  plot(data[[xItems[i]]], data[[yItems[i]]], main = paste0("Scatterplot X", i, " and Y", i),
    xlab = "X axis", ylab = "Y axis", pch = 19)
}
Fit a linear model to each data set using the lm() function.
# fit linear model
fit1 <- lm(data[[yItems[1]]]~data[[xItems[1]]], data)
summary(fit1)
##
## Call:
## lm(formula = data[[yItems[1]]] ~ data[[xItems[1]]], data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92127 -0.45577 -0.04136 0.70941 1.83882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0001 1.1247 2.667 0.02573 *
## data[[xItems[1]]] 0.5001 0.1179 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
# fit linear model
fit2 <- lm(data[[yItems[2]]]~data[[xItems[2]]], data)
summary(fit2)
##
## Call:
## lm(formula = data[[yItems[2]]] ~ data[[xItems[2]]], data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9009 -0.7609 0.1291 0.9491 1.2691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001 1.125 2.667 0.02576 *
## data[[xItems[2]]] 0.500 0.118 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
# fit linear model
fit3 <- lm(data[[yItems[3]]]~data[[xItems[3]]], data)
summary(fit3)
# fit linear model
fit4 <- lm(data[[yItems[4]]]~data[[xItems[4]]], data)
summary(fit4)
##
## Call:
## lm(formula = data[[yItems[4]]] ~ data[[xItems[4]]], data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.751 -0.831 0.000 0.809 1.839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0017 1.1239 2.671 0.02559 *
## data[[xItems[4]]] 0.4999 0.1178 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.236 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
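The four fits can also be produced in a loop, which avoids copy-and-paste slips between models. This is a hedged sketch rather than the method used above: it builds each formula with `reformulate()` so that the coefficient names refer to the actual columns, and `fits` and `coef_table` are illustrative names.

```r
# Fit y_i ~ x_i for i = 1..4 and gather the coefficients in one table
library(datasets)
data <- anscombe

fits <- lapply(1:4, function(i) {
  # reformulate() builds the formula y_i ~ x_i from column names
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# One row per model: intercept and slope
coef_table <- t(sapply(fits, coef))
rownames(coef_table) <- paste0("fit", 1:4)
colnames(coef_table) <- c("intercept", "slope")
print(round(coef_table, 3))
```

Each element of `fits` is an ordinary `lm` object, so `summary(fits[[3]])` or `anova(fits[[3]])` can be called on it as usual.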
# plotting the four diagnostic plots of each model on the same panel
par(mfrow=c(2,2))
plot(fit1)
par(mfrow=c(2,2))
plot(fit2)
par(mfrow=c(2,2))
plot(fit3)
par(mfrow=c(2,2))
plot(fit4)
anova(fit1)
## Analysis of Variance Table
##
## Response: data[[yItems[1]]]
## Df Sum Sq Mean Sq F value Pr(>F)
## data[[xItems[1]]] 1 27.510 27.5100 17.99 0.00217 **
## Residuals 9 13.763 1.5292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit2)
## Analysis of Variance Table
##
## Response: data[[yItems[2]]]
## Df Sum Sq Mean Sq F value Pr(>F)
## data[[xItems[2]]] 1 27.500 27.5000 17.966 0.002179 **
## Residuals 9 13.776 1.5307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit3)
## Analysis of Variance Table
##
## Response: data[[yItems[3]]]
## Df Sum Sq Mean Sq F value Pr(>F)
## data[[xItems[3]]] 1 27.470 27.4700 17.972 0.002176 **
## Residuals 9 13.756 1.5285
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit4)
## Analysis of Variance Table
##
## Response: data[[yItems[4]]]
## Df Sum Sq Mean Sq F value Pr(>F)
## data[[xItems[4]]] 1 27.490 27.4900 18.003 0.002165 **
## Residuals 9 13.742 1.5269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anscombe’s Quartet demonstrates why visualization is essential for understanding a dataset. It contains four datasets whose summary statistics (means, variances, correlations, and fitted regression lines) are nearly identical, yet whose scatterplots look very different. Plotting the data is therefore necessary to tell the datasets apart.
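One way to make this point visually is to draw each scatterplot together with its fitted regression line. The sketch below assumes shared axis limits across panels and a red line color; both are illustrative choices rather than part of the original exercise.

```r
# Scatterplots of the four datasets, each with its least-squares line
library(datasets)
data <- anscombe

op <- par(mfrow = c(2, 2))               # 2 x 2 panel; keep old settings
for (i in 1:4) {
  x <- data[[paste0("x", i)]]
  y <- data[[paste0("y", i)]]
  plot(x, y, pch = 19, main = paste("Dataset", i),
       xlim = range(unlist(data[1:4])),  # same x limits in every panel
       ylim = range(unlist(data[5:8])))  # same y limits in every panel
  abline(lm(y ~ x), col = "red")         # fitted regression line
}
par(op)                                  # restore the previous layout
```

With common axes, the nearly identical lines sit over four clearly different point patterns.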