ANLY 512-90- O-2018/Spring - Data Visualization

library(datasets)
View(anscombe)

1. Anscombe’s quartet is a set of four x,y data sets that were published by Francis Anscombe in a 1973 paper, "Graphs in Statistical Analysis". For this first question, load the anscombe data that is part of library(datasets) in R and assign it to a new object called data.

data <- anscombe
data

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

2. Summarise the data by calculating the mean and variance for each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

mx1 <- mean(data$x1)
mx1

## [1] 9

mx2 <- mean(data$x2)
mx2

## [1] 9

mx3 <- mean(data$x3)
mx3

## [1] 9

mx4 <- mean(data$x4)
mx4

## [1] 9

my1 <- mean(data$y1)
my1

## [1] 7.500909

my2 <- mean(data$y2)
my2

## [1] 7.500909

my3 <- mean(data$y3)
my3

## [1] 7.5

my4 <- mean(data$y4)
my4

## [1] 7.500909

varx1 <- var(data$x1)
varx1

## [1] 11

varx2 <- var(data$x2)
varx2

## [1] 11

varx3 <- var(data$x3)
varx3

## [1] 11

varx4 <- var(data$x4)
varx4

## [1] 11

vary1 <- var(data$y1)
vary1

## [1] 4.127269

vary2 <- var(data$y2)
vary2

## [1] 4.127629

vary3 <- var(data$y3)
vary3

## [1] 4.12262

vary4 <- var(data$y4)
vary4

## [1] 4.123249
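
The sixteen separate calls above can be condensed. The following is a minimal sketch, assuming the same anscombe data; the names col_means, col_vars and summary_table are illustrative and are not part of the original solution.

```r
library(datasets)
data <- anscombe

# sapply() treats the data frame as a list of columns, so each call
# returns one value per column as a named numeric vector.
col_means <- sapply(data, mean)
col_vars  <- sapply(data, var)   # sample variance (n - 1 denominator)

# Stack the two vectors into a 2 x 8 table for side-by-side comparison.
summary_table <- rbind(mean = col_means, variance = col_vars)
print(round(summary_table, 3))
```

Putting the values in one table makes it easier to see that the x columns agree exactly and the y columns agree to about two decimal places.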

library(timeDate)
library(timeSeries)
library(fBasics)

correlationTest(data$x1,data$y1)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8164
##   STATISTIC:
##     t: 4.2415
##   P VALUE:
##     Alternative Two-Sided: 0.00217 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001085 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4244, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5113, 1
## 
## Description:
##  Mon Feb 19 16:19:42 2018

cor(data$x1,data$y1)

## [1] 0.8164205

correlationTest(data$x2,data$y2)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8162
##   STATISTIC:
##     t: 4.2386
##   P VALUE:
##     Alternative Two-Sided: 0.002179 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001089 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4239, 0.9506
##          Less: -1, 0.9387
##       Greater: 0.5109, 1
## 
## Description:
##  Mon Feb 19 16:19:42 2018

cor(data$x2,data$y2)

## [1] 0.8162365

correlationTest(data$x3,data$y3)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8163
##   STATISTIC:
##     t: 4.2394
##   P VALUE:
##     Alternative Two-Sided: 0.002176 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001088 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4241, 0.9507
##          Less: -1, 0.9387
##       Greater: 0.511, 1
## 
## Description:
##  Mon Feb 19 16:19:42 2018

cor(data$x3,data$y3)

## [1] 0.8162867

correlationTest(data$x4,data$y4)

## 
## Title:
##  Pearson's Correlation Test
## 
## Test Results:
##   PARAMETER:
##     Degrees of Freedom: 9
##   SAMPLE ESTIMATES:
##     Correlation: 0.8165
##   STATISTIC:
##     t: 4.243
##   P VALUE:
##     Alternative Two-Sided: 0.002165 
##     Alternative      Less: 0.9989 
##     Alternative   Greater: 0.001082 
##   CONFIDENCE INTERVAL:
##     Two-Sided: 0.4246, 0.9507
##          Less: -1, 0.9388
##       Greater: 0.5115, 1
## 
## Description:
##  Mon Feb 19 16:19:42 2018

cor(data$x4,data$y4)

## [1] 0.8165214
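
The four correlations can also be computed in a single call. This is a sketch using base R only (it does not use fBasics); pair_cor is a hypothetical name introduced for illustration.

```r
library(datasets)
data <- anscombe

# mapply() walks the x columns and y columns in parallel,
# pairing x1 with y1, x2 with y2, and so on.
pair_cor <- mapply(function(x, y) cor(x, y),
                   data[paste0("x", 1:4)],
                   data[paste0("y", 1:4)])
names(pair_cor) <- paste0("x", 1:4, "~y", 1:4)
print(round(pair_cor, 4))
```

cor() uses Pearson's method by default, which matches the estimate reported by correlationTest().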

3. Create scatter plots for each x,y pair of data.

plot(data$y1 ~ data$x1 , xlab="x1" , ylab="y1" , main="scatter plot of x1,y1" , col="purple")

plot(data$y2 ~ data$x2 , xlab="x2" , ylab="y2" , main="scatter plot of x2,y2" , col="green")

plot(data$y3 ~ data$x3 , xlab="x3" , ylab="y3" , main="scatter plot of x3,y3" , col="blue")

plot(data$y4 ~ data$x4 , xlab="x4" , ylab="y4" , main="scatter plot of x4,y4" , col="red")

4. Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic

First I changed the symbols on the scatter plots to solid circles, and in the next step I plotted them together as a 4-panel graphic.

plot(data$y1 ~ data$x1 , xlab="x1" , ylab="y1" , main="scatter plot of x1,y1" ,pch=16 , col="purple")

plot(data$y2 ~ data$x2 , xlab="x2" , ylab="y2" , main="scatter plot of x2,y2" ,pch=16 , col="green")

plot(data$y3 ~ data$x3 , xlab="x3" , ylab="y3" , main="scatter plot of x3,y3" ,pch=16 , col="blue")

plot(data$y4 ~ data$x4 , xlab="x4" , ylab="y4" , main="scatter plot of x4,y4" ,pch=16 , col="red")

par(mfrow=c(2,2))
plot(data$y1 ~ data$x1 , xlab="x1" , ylab="y1" , main="scatter plot of x1,y1" ,pch=16 , col="purple")
plot(data$y2 ~ data$x2 , xlab="x2" , ylab="y2" , main="scatter plot of x2,y2" ,pch=16 , col="green")
plot(data$y3 ~ data$x3 , xlab="x3" , ylab="y3" , main="scatter plot of x3,y3" ,pch=16 , col="blue")
plot(data$y4 ~ data$x4 , xlab="x4" , ylab="y4" , main="scatter plot of x4,y4" ,pch=16 , col="red")

5. Now fit a linear model to each data set using the lm() function.

plot(data$y1 ~ data$x1)
lmx1y1 <- lm(y1 ~ x1 , data = anscombe)
abline(lmx1y1 , col="green")

plot(data$y2 ~ data$x2)
lmx2y2 <- lm(y2 ~ x2 , data = anscombe)
abline(lmx2y2 , col="red")

plot(data$y3 ~ data$x3)
lmx3y3 <- lm(y3 ~x3 , data = anscombe)
abline(lmx3y3 , col="blue")

plot(data$y4 ~ data$x4)
lmx4y4 <- lm(y4 ~ x4 , data = anscombe)
abline(lmx4y4 , col="purple")

6. Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)

par(mfrow = c(2, 2))  # 2 x 2 panel layout
# The model objects fitted in question 5 carry over, so they are reused here.
plot(data$y1 ~ data$x1, xlab = "x1", ylab = "y1", main = "scatter plot of x1,y1", pch = 16)
abline(lmx1y1, col = "green")

plot(data$y2 ~ data$x2, xlab = "x2", ylab = "y2", main = "scatter plot of x2,y2", pch = 16)
abline(lmx2y2, col = "red")

plot(data$y3 ~ data$x3, xlab = "x3", ylab = "y3", main = "scatter plot of x3,y3", pch = 16)
abline(lmx3y3, col = "blue")

plot(data$y4 ~ data$x4, xlab = "x4", ylab = "y4", main = "scatter plot of x4,y4", pch = 16)
abline(lmx4y4, col = "purple")
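
The four-panel figure above repeats the same three steps per data set, so it can also be written as a loop. This is a sketch under the same assumptions; the names models, line_cols and old_par are illustrative and not part of the original solution.

```r
library(datasets)
data <- anscombe

# Fit y_i ~ x_i for i = 1..4 and keep the models in a named list.
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})
names(models) <- paste0("lmx", 1:4, "y", 1:4)

line_cols <- c("green", "red", "blue", "purple")

old_par <- par(mfrow = c(2, 2))   # 2 x 2 layout; remember the old settings
for (i in 1:4) {
  x <- data[[paste0("x", i)]]
  y <- data[[paste0("y", i)]]
  plot(x, y, pch = 16,
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("scatter plot of x", i, ",y", i))
  abline(models[[i]], col = line_cols[i])   # add the fitted regression line
}
par(old_par)                      # restore the previous layout
```

Saving and restoring par() keeps the 2 x 2 layout from leaking into later plots.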

7. Now compare the model fits for each model object.

anova(lmx1y1)

## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lmx2y2)

## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lmx3y3)

## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lmx4y4)

## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

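The four ANOVA tables are almost identical. Each model has a regression sum of squares of about 27.5, a residual sum of squares of about 13.75, an F value close to 18 and a p-value of about 0.002, which is below alpha = 0.05. Judged by these numbers alone, every slope is significant and the four models fit equally well. The plots show that this is not the case. Only the first data set is reasonably described by a straight line. The second is clearly curved, the third follows a straight line except for one outlier that pulls the fit away, and in the fourth the slope is determined entirely by the single point at x4 = 19. For the last three data sets we would need a different model or a closer look at the unusual points.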
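
The ANOVA comparison can be complemented by putting the fitted coefficients and R-squared values side by side. This is a minimal sketch, assuming the same four models; fits and fit_table are hypothetical names used only here.

```r
library(datasets)

# Refit the four models so the block runs on its own.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# One row per data set: intercept, slope, R-squared and residual std. error.
fit_table <- t(sapply(fits, function(m) {
  s <- summary(m)
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = s$r.squared,
    sigma     = s$sigma)
}))
rownames(fit_table) <- paste0("set", 1:4)
print(round(fit_table, 3))
```

The rows come out nearly identical, which is why the numerical summaries cannot tell the four data sets apart and the plots are needed.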

8. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

Anscombe’s quartet helped me understand that data sets can be nearly identical in their summary statistics, such as mean, variance and correlation, and still look completely different when plotted. I expected the four plots to look alike, but they were extremely different. The lesson is that summary statistics alone can hide the structure of the data, so the data should be visualized before a model is trusted.

ANLY 512-90- O-2018/Spring - Data Visualization - Problem set2

Nazanin Yousefzadeh

February 19, 2018
