Inclass Visualization-Tanya

Q1. Anscombe's quartet is a set of four \(x, y\) data sets published by Francis Anscombe in the 1973 paper "Graphs in Statistical Analysis". For this first question, examine the built-in R data set `anscombe`.

str(anscombe)

## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...

The `anscombe` data set has 11 observations of 8 variables: x1 to x4 and y1 to y4.

data("anscombe")  # load the data set; data() returns only the name, so the result is not assigned
x1 <- anscombe[,1]
x2 <- anscombe[,2]
x3 <- anscombe[,3]
x4 <- anscombe[,4]
y1 <- anscombe[,5]
y2 <- anscombe[,6]
y3 <- anscombe[,7]
y4 <- anscombe[,8]

Summarize the data by calculating the mean and variance of each column, and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.).

mean(x1)

## [1] 9

var(x1)

## [1] 11

mean(x2)

## [1] 9

var(x2)

## [1] 11

mean(x3)

## [1] 9

var(x3)

## [1] 11

mean(x4)

## [1] 9

var(x4)

## [1] 11

mean(y1)

## [1] 7.500909

var(y1)

## [1] 4.127269

mean(y2)

## [1] 7.500909

var(y2)

## [1] 4.127629

mean(y3)

## [1] 7.5

var(y3)

## [1] 4.12262

mean(y4)

## [1] 7.500909

var(y4)

## [1] 4.123249
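
The separate calls above can be condensed into one table. This is a minimal sketch using only base R and the built-in `anscombe` data frame; the name `summary_table` is introduced here for illustration and is not part of the original exercise.

```r
# Column-wise mean and variance for all eight variables at once.
# `summary_table` is an illustrative name, not part of the original exercise.
summary_table <- data.frame(
  mean     = sapply(anscombe, mean),
  variance = sapply(anscombe, var)
)
print(round(summary_table, 3))
```

Each row is one column of `anscombe`, which makes it easy to see that the four x columns and the four y columns share nearly the same mean and variance.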

library(fBasics)

## Warning: package 'fBasics' was built under R version 3.4.4

## Loading required package: timeDate

## Warning: package 'timeDate' was built under R version 3.4.4

## Loading required package: timeSeries

## Warning: package 'timeSeries' was built under R version 3.4.4

correlationTest(x1,y1)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8164
##   STATISTIC:
##     t: 4.2415
##   P VALUE:
##     Alternative Two-Sided: 0.00217 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001085 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4244, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5113, 1
## 
## Description:
##  Thu Sep 13 22:39:14 2018

correlationTest(x2,y2)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8162
##   STATISTIC:
##     t: 4.2386
##   P VALUE:
##     Alternative Two-Sided: 0.002179 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001089 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4239, 0.9506
##          Less: -1, 0.9387
##       Greater: 0.5109, 1
## 
## Description:
##  Thu Sep 13 22:39:14 2018

correlationTest(x3,y3)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8163
##   STATISTIC:
##     t: 4.2394
##   P VALUE:
##     Alternative Two-Sided: 0.002176 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001088 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4241, 0.9507
##          Less: -1, 0.9387
##       Greater: 0.511, 1
## 
## Description:
##  Thu Sep 13 22:39:14 2018

correlationTest(x4,y4)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8165
##   STATISTIC:
##     t: 4.243
##   P VALUE:
##     Alternative Two-Sided: 0.002165 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001082 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4246, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5115, 1
## 
## Description:
##  Thu Sep 13 22:39:14 2018
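
If only the correlation estimates are needed, base R's `cor()` is enough and avoids the extra package. The following is a sketch under that assumption; `pair_cor` is an illustrative name.

```r
# Pearson correlation for each (x_i, y_i) pair using base R's cor().
# `pair_cor` is an illustrative name, not part of the original exercise.
pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pair_cor) <- paste0("pair", 1:4)
print(round(pair_cor, 4))
```

The four values should agree with the `Correlation` lines reported by `correlationTest()` above.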

Create scatter plots for each \(x, y\) pair of data.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.4

plot(x1,y1, main = "Scatter plot between x1 & y1")

plot(x2,y2,main = "Scatter plot between x2 & y2")

plot(x3,y3, main = "Scatter plot between x3 & y3")

plot(x4,y4, main = "Scatter plot between x4 & y4")

Place the scatter plots in a four-panel graphic and fit a linear model to each data set using the `lm` function.

par(mfrow = c(2,2))
plot(x1,y1, main = "Scatter plot between x1 & y1", pch = 19)
plot(x2,y2,main = "Scatter plot between x2 & y2", pch = 19)
plot(x3,y3, main = "Scatter plot between x3 & y3", pch = 19)
plot(x4,y4, main = "Scatter plot between x4 & y4", pch = 19)

# Note: x1 ~ y1 regresses x on y; the conventional direction would be y1 ~ x1
Lm1 <- lm(x1 ~ y1)
summary(Lm1)

## 
## Call:
## lm(formula = x1 ~ y1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6522 -1.5117 -0.2657  1.2341  3.8946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9975     2.4344  -0.410  0.69156   
## y1            1.3328     0.3142   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

Lm2 <- lm(x2~y2)
summary(Lm2)

## 
## Call:
## lm(formula = x2 ~ y2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8516 -1.4315 -0.3440  0.8467  4.2017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9948     2.4354  -0.408  0.69246   
## y2            1.3325     0.3144   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179

Lm3 <- lm(x3~y3)
summary(Lm3)

## 
## Call:
## lm(formula = x3 ~ y3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9869 -1.3733 -0.0266  1.3200  3.2133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0003     2.4362  -0.411  0.69097   
## y3            1.3334     0.3145   4.239  0.00218 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176

Lm4 <- lm(x4~y4)
summary(Lm4)

## 
## Call:
## lm(formula = x4 ~ y4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -1.4122 -0.1853  1.4551  3.3329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0036     2.4349  -0.412  0.68985   
## y4            1.3337     0.3143   4.243  0.00216 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic:    18 on 1 and 9 DF,  p-value: 0.002165
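
The models above regress each x on its y. As an alternative sketch, the conventional direction (y on x) can be fitted in a loop and the fitted line drawn on each panel. The names `fits` and `coef_table` and the panel titles are assumptions made for illustration.

```r
# Fit y_i ~ x_i for each pair and draw the data with the fitted line
# in a 2 x 2 grid. `fits` and `coef_table` are illustrative names.
old_par <- par(mfrow = c(2, 2))
fits <- lapply(1:4, function(i) {
  f <- as.formula(paste0("y", i, " ~ x", i))
  fit <- lm(f, data = anscombe)
  plot(f, data = anscombe, pch = 19, main = paste("Set", i))
  abline(fit, col = "red")  # overlay the least-squares line
  fit
})
par(old_par)  # restore the previous graphics settings

# Collect intercepts and slopes side by side.
coef_table <- t(sapply(fits, coef))
colnames(coef_table) <- c("intercept", "slope")
rownames(coef_table) <- paste0("set", 1:4)
print(round(coef_table, 3))
```

Drawing the line on each panel shows how four very different point patterns can lead to nearly the same fitted line.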

Now compare the model fits for each model object.

anova(Lm1, test ="Chisq")

## Analysis of Variance Table
## 
## Response: x1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## y1         1  73.32  73.320   17.99 0.00217 **
## Residuals  9  36.68   4.076                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(Lm2, test ="Chisq")

## Analysis of Variance Table
## 
## Response: x2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## y2         1 73.287  73.287  17.966 0.002179 **
## Residuals  9 36.713   4.079                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(Lm3, test ="Chisq")

## Analysis of Variance Table
## 
## Response: x3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## y3         1 73.296  73.296  17.972 0.002176 **
## Residuals  9 36.704   4.078                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(Lm4, test ="Chisq")

## Analysis of Variance Table
## 
## Response: x4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## y4         1 73.338  73.338  18.003 0.002165 **
## Residuals  9 36.662   4.074                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
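
Each `anova()` call above reports the F table for a single model, so the comparison has to be done by eye. One way to line the fits up is sketched below: it refits the four models in the same direction as the text (x on y) and collects a few fit statistics in one table. The names `models` and `fit_comparison` and the choice of statistics are assumptions made for illustration.

```r
# Refit the four models in the same direction as the text (x_i ~ y_i)
# and collect a few fit statistics side by side.
# `models` and `fit_comparison` are illustrative names.
models <- lapply(1:4, function(i) {
  lm(as.formula(paste0("x", i, " ~ y", i)), data = anscombe)
})
fit_comparison <- data.frame(
  model     = paste0("Lm", 1:4),
  r_squared = sapply(models, function(m) summary(m)$r.squared),
  sigma     = sapply(models, function(m) summary(m)$sigma),  # residual standard error
  aic       = sapply(models, AIC)
)
print(fit_comparison, digits = 4)
```

The near-identical rows show that the numerical fit measures do not separate the four data sets; only the plots do.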

Explore Datasaurus Dozen

The data frame we will be working with today is called datasaurus_dozen and it’s in the datasauRus package. This single data frame contains 13 datasets, designed to show us why data visualisation is important and how summary statistics alone can be misleading.

To find out more about the dataset, type the following in your Console or in R Markdown: ?datasaurus_dozen. A question mark before the name of a documented object brings up its help file, provided the package that contains it (here datasauRus) is loaded.

library(datasauRus)  # load the package first so the help page and the data are found
?datasaurus_dozen

From the help file, how many rows and columns does the datasaurus_dozen data frame have? 1846 rows and 3 columns.

str(datasaurus_dozen)  # pass the object itself, not its name as a string

Use the tail function to obtain the last several rows of data.

tail(datasaurus_dozen)

names(datasaurus_dozen) # column names

Plot datasaurus_dozen x-y plots.

We will plot the x-y values of all 13 data sets, including dino, to see their visual patterns.

if (require(ggplot2)) {
  library(datasauRus)
  ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
    geom_point() +
    theme_void() +
    theme(legend.position = "none") +
    facet_wrap(~dataset, ncol = 3)
}

## Warning: package 'datasauRus' was built under R version 3.4.4

# Use == to filter rows; with a single = the condition is ignored and all 13 sets are plotted
plot(y ~ x, data = subset(datasaurus_dozen, dataset == "dino"),
     main = "The Datasaurus", xlab = "x", ylab = "y",
     pch = 19, las = 1)

par(mar = c(3.5, 3.5, 2, 2))
columns <- unique(datasaurus_dozen$dataset)
par(mfrow = c(4, 4))
for (i in columns) {
    # one panel per data set, titled with the data set name
    plot(y ~ x, data = subset(datasaurus_dozen, dataset == i), main = i)
}
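
The panels look very different, which is the point of the data set: the summary statistics do not reveal it. A minimal sketch of that check follows, assuming the datasauRus package is installed; `dozen` and `dozen_stats` are illustrative names, and the code is skipped if the package is missing.

```r
# Per-dataset summary statistics for the Datasaurus Dozen.
# Assumes the datasauRus package is installed; `dozen` and `dozen_stats`
# are illustrative names, not part of the original exercise.
if (requireNamespace("datasauRus", quietly = TRUE)) {
  dozen <- as.data.frame(datasauRus::datasaurus_dozen)
  dozen_stats <- do.call(rbind, lapply(split(dozen, dozen$dataset), function(d) {
    data.frame(
      mean_x = mean(d$x), mean_y = mean(d$y),
      sd_x   = sd(d$x),   sd_y   = sd(d$y),
      cor_xy = cor(d$x, d$y)
    )
  }))
  print(round(dozen_stats, 2))
}
```

Each row is one of the 13 data sets; comparing the rows with the plots shows why the summaries alone can be misleading.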

Tanya Mohte

September 13, 2018