Objectives

The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the requested activity or answers the question. Upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, publish it to your RPubs account, and submit the link to the “Problem Set 2” assignment on Moodle. Points will be deducted for uploading in an improper format.

Questions

  1. Anscombe’s quartet is a set of four \(x, y\) data sets published by Francis Anscombe in the 1973 paper “Graphs in Statistical Analysis”. For this first question, load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
library(datasets)
data("anscombe")
data <- anscombe # Assign anscombe data to a new object called "data"
  1. Summarise the data by calculating the mean and variance of each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)
library("fBasics")
## Warning: package 'fBasics' was built under R version 3.5.3
## Loading required package: timeDate
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.5.3
colMeans(data) # compute mean for each column
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colVars(data) # compute variance for each column
##        x1        x2        x3        x4        y1        y2        y3 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620 
##        y4 
##  4.123249
# Store each correlationTest() result in p, then print the estimated
# correlation for each pair (x1 and y1, x2 and y2, etc.) in a for loop
for (i in 1:4) {
  p <- correlationTest(data[[paste0("x", i)]], data[[paste0("y", i)]], "pearson")
  print(paste0("The correlation between x", i, " and y", i, " is ", p@test[["estimate"]][["Correlation"]]))
}
## [1] "The correlation between x1 and y1 is 0.81642051634484"
## [1] "The correlation between x2 and y2 is 0.816236506000243"
## [1] "The correlation between x3 and y3 is 0.816286739489598"
## [1] "The correlation between x4 and y4 is 0.816521436888503"
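The same summary can also be sketched with base R alone, which avoids depending on fBasics. This is an illustrative alternative, not the method used above; the names `pairs_idx` and `pair_summary` are made up for the example.

```r
library(datasets)

# Illustrative names (pairs_idx, pair_summary); only base R and datasets are used.
pairs_idx <- 1:4
pair_summary <- data.frame(
  pair   = paste0("x", pairs_idx, "/y", pairs_idx),
  mean_x = sapply(pairs_idx, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(pairs_idx, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x  = sapply(pairs_idx, function(i) var(anscombe[[paste0("x", i)]])),
  var_y  = sapply(pairs_idx, function(i) var(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(pairs_idx, function(i)
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))  # Pearson by default
)
print(pair_summary, digits = 4)
```

Putting one pair per row makes it easy to see at a glance that the four data sets agree on every summary statistic to about two decimal places.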
  1. Create scatter plots for each \(x, y\) pair of data.
library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:fBasics':
## 
##     densityPlot
# Using for loop to plot scatter plots for each x, y pair of data
for (i in 1:4) {
  scatterplot(data[[paste0("y", i)]] ~ data[[paste0("x", i)]], data = data, 
            xlab = paste0("x", i), ylab = paste0("y", i), 
            main = paste0("Scatter Plots for x", i, " and y", i))
}

## Warning in smoother(.x, .y, col = col[1], log.x = logged("x"), log.y =
## logged("y"), : could not fit negative part of the spread

  1. Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic
# with par(), the option mfrow = c(nrows, ncols) creates an nrows x ncols grid of plots
par(mfrow = c(2, 2)) 
for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]], 
            xlab = paste0("x", i), ylab = paste0("y", i), 
            main = paste0("Scatter Plots for x", i, " and y", i), pch=19)
}

  1. Now fit a linear model to each data set using the lm() function.
# use a for loop to assign the linear model of each pair to M1, M2, M3, M4 respectively
for (i in 1:4) {
  assign(paste0("M", i), lm(data[[paste0("y", i)]] ~ data[[paste0("x", i)]]))
}
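An alternative to assign() is to keep the four fits in a named list, which is usually easier to loop over later. The sketch below assumes nothing beyond base R; `fit_pair`, `models`, and `coefs` are illustrative names, and the formulas are built with reformulate() so that each model records a readable formula such as y1 ~ x1.

```r
library(datasets)

# Fit y_i ~ x_i for i = 1..4 and keep the fits in a named list.
# fit_pair and models are illustrative names, not part of the solution above.
fit_pair <- function(i) {
  f <- reformulate(paste0("x", i), response = paste0("y", i))  # e.g. y1 ~ x1
  lm(f, data = anscombe)
}
models <- lapply(1:4, fit_pair)
names(models) <- paste0("M", 1:4)

# Coefficients side by side: one column per model
coefs <- sapply(models, coef)
rownames(coefs) <- c("intercept", "slope")
print(round(coefs, 3))
```

Because every model is stored under its own formula, printed summaries name the actual variables instead of the expression data[[paste0("y", i)]].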
  1. Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)
# with par(), the option mfrow = c(nrows, ncols) creates an nrows x ncols grid of plots
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]], 
            xlab = paste0("x", i), ylab = paste0("y", i), 
            main = paste0("Scatter Plots for x", i, " and y", i), pch=19)
  abline(lm(data[[paste0("y", i)]] ~ data[[paste0("x", i)]]), col = "red") # regression line (y ~ x)
}
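The hint about model objects carrying over suggests drawing the lines from fits that already exist rather than refitting inside the loop. A minimal sketch of that idea follows, with two optional additions that are assumptions of this example rather than requirements of the question: shared axis limits, so the panels are directly comparable, and restoring the previous par() settings afterwards. The name `fits` is illustrative.

```r
library(datasets)

# Fit once, then reuse the stored fits when drawing (illustrative name: fits).
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

old_par <- par(mfrow = c(2, 2))    # remember the previous layout
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       xlim = c(3, 20), ylim = c(2, 14),   # shared axes make panels comparable
       main = paste0("Data set ", i), pch = 19)
  abline(fits[[i]], col = "red")   # reuse the stored model for the regression line
}
par(old_par)                       # restore the previous layout
```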

  1. Now compare the model fits for each model object.
# use anova() to compare the model fits for the model objects M1, M2, M3, M4
anova(M1, M2, M3, M4)

Analysis of Variance Table

Model 1: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
Model 2: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
Model 3: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
Model 4: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
  Res.Df    RSS Df Sum of Sq F Pr(>F)
1      9 13.763
2      9 13.776  0 -0.013601
3      9 13.756  0  0.020099
4      9 13.742  0  0.013702
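Note that anova() is designed for nested models fitted to the same response, which is why the table above reports no F statistic or p-value and only lists each model's residual sum of squares. One way to compare four separate fits is to tabulate their coefficients and goodness-of-fit statistics side by side. The following is a sketch under that assumption; `fit_stats` and its column names are illustrative.

```r
library(datasets)

# One row of fit statistics per data set (illustrative names throughout).
fit_stats <- do.call(rbind, lapply(1:4, function(i) {
  m <- lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
  s <- summary(m)
  data.frame(
    model     = paste0("M", i),
    intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = s$r.squared,
    sigma     = s$sigma,           # residual standard error
    rss       = deviance(m)        # residual sum of squares, as in the anova table
  )
}))
print(fit_stats, digits = 4)
```

The four rows come out nearly identical (intercept about 3, slope about 0.5), which is exactly why the numbers alone cannot distinguish the data sets.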

  1. In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

The ANOVA table comparing the model objects M1, M2, M3, and M4 shows nearly identical residual sums of squares (about 13.74 to 13.78), and the column means, variances, and pairwise correlations computed earlier are also almost the same across the four data sets. The scatter plots, however, show that the four pairs are quite different, particularly the fourth pair (x4 and y4). Data visualization therefore gives us a more accurate picture of what the data and the relationships between variables really are than summary statistics alone.