Shridhar_ProblemSet

---
title: "ANLY 512 - Problem Set 2"
subtitle: "Anscombe's quartet"
author: "Shridhar Kulkarni"
date: "11/01/2019"
output: html_document
---

Objectives

The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes or answers the activity or question requested. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, upload it to your RPubs account, and submit the link to the “Problem Set 2” assignment on Moodle. Points will be deducted for uploading in an improper format.

Questions

Anscombe’s quartet is a set of four \(x, y\) data sets published by Francis Anscombe in his 1973 paper “Graphs in Statistical Analysis”. For this first question, load the anscombe data, which is part of library(datasets) in R, and assign it to a new object called data.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(datasets)
data <- anscombe
print(data)

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Summarise the data by calculating the mean and variance of each column and the correlation between each pair (e.g. x1 and y1, x2 and y2, etc.). (Hint: use the fBasics package!)

Mean <- apply(data, 2, mean)
print(Mean)

##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909

Variance <- apply(data, 2, var)
print(Variance)

##        x1        x2        x3        x4        y1        y2        y3        y4 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620  4.123249

# Correlate every x column with every y column, then keep the matched pairs
# (x1 with y1, x2 with y2, ...), which sit on the diagonal of the 4 x 4 matrix
Correlation <- diag(cor(data[, 1:4], data[, 5:8]))
print(Correlation)

## [1] 0.8164205 0.8162365 0.8162867 0.8165214
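
The three summaries above can also be gathered into one small table, which makes the near-identical values easier to compare row by row. This is a minimal base-R sketch rather than the fBasics approach the hint suggests; the names `pairs_idx` and `summary_tbl` are illustrative and not part of the assignment.

```r
library(datasets)
data <- anscombe

pairs_idx <- 1:4  # hypothetical helper: index of each x/y pair

# One row per data set: mean and variance of x and y, plus cor(x, y)
summary_tbl <- data.frame(
  set    = pairs_idx,
  mean_x = sapply(pairs_idx, function(i) mean(data[[paste0("x", i)]])),
  mean_y = sapply(pairs_idx, function(i) mean(data[[paste0("y", i)]])),
  var_x  = sapply(pairs_idx, function(i) var(data[[paste0("x", i)]])),
  var_y  = sapply(pairs_idx, function(i) var(data[[paste0("y", i)]])),
  cor_xy = sapply(pairs_idx, function(i) {
    cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
  })
)

print(round(summary_tbl, 3))
```

Rounded to three decimals, the four rows are nearly indistinguishable, which is the point the rest of the exercise builds on.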

Create scatter plots for each \(x, y\) pair of data.

plot(data$x1, data$y1, xlab="x1", ylab= "y1")

plot(data$x2, data$y2, xlab="x2", ylab= "y2")

plot(data$x3, data$y3, xlab="x3", ylab= "y3")

plot(data$x4, data$y4, xlab="x4", ylab= "y4")
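
As an alternative to four separate calls, the plots can be drawn in a loop. Building both column names from a single index keeps each x paired with its matching y and keeps the axis labels in sync with the data actually plotted. This is only a sketch; `x_name` and `y_name` are illustrative variable names.

```r
library(datasets)
data <- anscombe

# Draw one scatter plot per x/y pair, deriving the column names from the index
for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  plot(data[[x_name]], data[[y_name]], xlab = x_name, ylab = y_name)
}
```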


Now change the symbols on the scatter plots to solid circles and plot them together as a 4 panel graphic.


```r
par(mfrow = c(2, 2))  # 2 x 2 grid of panels
plot(data$x1, data$y1, xlab="x1", ylab= "y1", pch = 16)
plot(data$x2, data$y2, xlab="x2", ylab= "y2", pch = 16)
plot(data$x3, data$y3, xlab="x3", ylab= "y3", pch = 16)
plot(data$x4, data$y4, xlab="x4", ylab= "y4", pch = 16)
```

Now fit a linear model to each data set using the lm() function.

lm1 <- lm(y1 ~ x1, data = data)
lm2 <- lm(y2 ~ x2, data = data)
lm3 <- lm(y3 ~ x3, data = data)
lm4 <- lm(y4 ~ x4, data = data)
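
To see what the four fits have in common, the estimated coefficients can be put side by side. The sketch below refits the models in a list so that it runs on its own; `models` and `coefs` are illustrative names, and `reformulate()` is used only to build each formula from the column names.

```r
library(datasets)
data <- anscombe

# Fit y_i ~ x_i for each of the four pairs and keep the models in a list
models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# Rows: intercept and slope; columns: data sets 1 to 4
coefs <- sapply(models, coef)
rownames(coefs) <- c("intercept", "slope")
colnames(coefs) <- paste0("set", 1:4)
print(round(coefs, 3))
```

All four fitted lines come out close to y = 3 + 0.5x, so the coefficients alone give no hint that the underlying data sets differ.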

Now combine the last two tasks. Create a four panel scatter plot matrix that has both the data points and the regression lines. (hint: the model objects will carry over chunks!)

par(mfrow = c(2, 2))
with(data, plot(x1, y1, pch = 16))
abline(lm1)
with(data, plot(x2, y2, pch = 16))
abline(lm2)
with(data, plot(x3, y3, pch = 16))
abline(lm3)
with(data, plot(x4, y4, pch = 16))
abline(lm4)
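
A variation on the four-panel figure is to give every panel the same axis limits, so the shapes can be compared directly rather than through each panel's own scaling. This is a sketch under that design assumption; `x_lim`, `y_lim`, and `old_par` are illustrative names, and the models are refit inside the loop so the block runs on its own.

```r
library(datasets)
data <- anscombe

# Shared axis limits so the four panels are directly comparable
x_lim <- range(unlist(data[, 1:4]))
y_lim <- range(unlist(data[, 5:8]))

old_par <- par(mfrow = c(2, 2))  # remember the previous layout
for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  fit <- lm(reformulate(x_name, response = y_name), data = data)
  plot(data[[x_name]], data[[y_name]], pch = 16,
       xlab = x_name, ylab = y_name, xlim = x_lim, ylim = y_lim)
  abline(fit)  # overlay the fitted regression line
}
par(old_par)  # restore the previous layout
```

With common limits, the single point at x4 = 19 and the high y3 value stand out against otherwise similar panels.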

Now compare the model fits for each model object.

summary(lm1)$adj.r.squared

## [1] 0.6294916

summary(lm2)$adj.r.squared

## [1] 0.6291578

summary(lm3)$adj.r.squared

## [1] 0.6292489

summary(lm4)$adj.r.squared

## [1] 0.6296747
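
Adjusted R-squared alone cannot tell the four fits apart, so it can help to look at the residuals as well. The sketch below is one possible way to do that, not part of the original assignment; `models` and `fit_tbl` are illustrative names, and the models are refit so the block is self-contained.

```r
library(datasets)
data <- anscombe

models <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# Collect fit statistics side by side
fit_tbl <- data.frame(
  set           = 1:4,
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  sigma         = sapply(models, function(m) summary(m)$sigma),
  max_abs_resid = sapply(models, function(m) max(abs(residuals(m))))
)
print(round(fit_tbl, 3))

# Residuals against fitted values, one panel per model
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(fitted(models[[i]]), residuals(models[[i]]), pch = 16,
       xlab = "fitted", ylab = "residual", main = paste("Set", i))
  abline(h = 0, lty = 2)  # reference line at zero residual
}
par(old_par)
```

The summary statistics stay close across the four models, while the residual panels show different patterns, such as the curvature in the second data set and the single large residual in the third.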

In text, summarize the lesson of Anscombe’s Quartet and what it says about the value of data visualization.

Anscombe’s Quartet shows that data sets can share nearly identical summary statistics (here the means, variances, correlations, and adjusted R-squared values of the linear fits) and still differ vastly in their actual structure. Visualization tools let us identify these differences and gain a better understanding when comparing data sets.