The objectives of this problem set are to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignment on Moodle.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
data("anscombe")
data <- anscombe
data
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
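Summarize the data by calculating the mean and variance of each column and the correlation between each x-y pair. (Hint: use the fBasics package!)
library(fBasics)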
## Loading required package: timeDate
## Loading required package: timeSeries
##
## Rmetrics Package fBasics
## Analysing Markets and calculating Basic Statistics
## Copyright (C) 2005-2014 Rmetrics Association Zurich
## Educational Software for Financial Engineering and Computational Science
## Rmetrics is free software and comes with ABSOLUTELY NO WARRANTY.
## https://www.rmetrics.org --- Mail to: info@rmetrics.org
# Summary Data - Means
colMeans(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
# Summary Data - Variances
colVars(data)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
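The same summary can be reproduced without fBasics. The following is a minimal base-R sketch, not part of the original answer; the object names `col_means` and `col_vars` are illustrative assumptions.

```r
# Base-R sketch: column means and sample variances without fBasics.
# col_means and col_vars are illustrative names, not from the original answer.
data("anscombe")  # ships with the datasets package

col_means <- sapply(anscombe, mean)
col_vars  <- sapply(anscombe, var)  # sample variance (n - 1 denominator)

# One row per statistic, one column per variable
round(rbind(mean = col_means, variance = col_vars), 3)
```

Stacking the two statistics in one table makes it easy to see that the four x columns match exactly while the y columns differ only in the later decimal places.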
# Correlation between x1 and y1
cor(data$x1, data$y1)
## [1] 0.8164205
# Correlation between x2 and y2
cor(data$x2, data$y2)
## [1] 0.8162365
# Correlation between x3 and y3
cor(data$x3, data$y3)
## [1] 0.8162867
# Correlation between x4 and y4
cor(data$x4, data$y4)
## [1] 0.8165214
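The four correlations can also be computed in a single pass. This is a hedged sketch rather than the original approach; `pair_cor` is a hypothetical name, and `cor()` uses its default Pearson method.

```r
# Sketch: Pearson correlation for each x-y pair in one call.
# pair_cor is an illustrative name, not from the original answer.
data("anscombe")

pair_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "_y", 1:4)

round(pair_cor, 3)
```

Looping over the pair index avoids repeating the same call four times and keeps the results together in one named vector.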
# Base plot() draws as a side effect and returns NULL, so the result is not assigned
# Scatterplot x1_y1
plot(data$x1, data$y1, main = "Scatterplot x1_y1", xlab = "x1", ylab = "y1")
# Scatterplot x2_y2
plot(data$x2, data$y2, main = "Scatterplot x2_y2", xlab = "x2", ylab = "y2")
# Scatterplot x3_y3
plot(data$x3, data$y3, main = "Scatterplot x3_y3", xlab = "x3", ylab = "y3")
# Scatterplot x4_y4
plot(data$x4, data$y4, main = "Scatterplot x4_y4", xlab = "x4", ylab = "y4")
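For a side-by-side comparison in base graphics, the four scatterplots can be drawn in one 2 x 2 panel with shared axis limits. This is an illustrative sketch under the assumption that common limits are wanted; `x_lim`, `y_lim`, and `old_par` are hypothetical names.

```r
# Sketch: all four base-R scatterplots in one 2 x 2 panel with shared axes.
# x_lim, y_lim and old_par are illustrative names.
data("anscombe")

x_lim <- range(unlist(anscombe[paste0("x", 1:4)]))
y_lim <- range(unlist(anscombe[paste0("y", 1:4)]))

old_par <- par(mfrow = c(2, 2))  # remember the previous layout
for (i in 1:4) {
  x_name <- paste0("x", i)
  y_name <- paste0("y", i)
  plot(anscombe[[x_name]], anscombe[[y_name]],
       xlim = x_lim, ylim = y_lim,
       main = paste("Scatterplot", x_name, "vs", y_name),
       xlab = x_name, ylab = y_name)
}
par(old_par)  # restore the previous layout
```

Shared limits keep the panels visually comparable, so differences in shape are not hidden by differences in axis scaling.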
library(ggplot2)
library(ggthemes)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.4.1
plot1 <- ggplot(data, aes(x1, y1)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x1 vs y1")
plot2 <- ggplot(data, aes(x2, y2)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x2 vs y2")
plot3 <- ggplot(data, aes(x3, y3)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x3 vs y3")
plot4 <- ggplot(data, aes(x4, y4)) + geom_point() + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("Scatterplot x4 vs y4")
grid.arrange(plot1, plot2, plot3, plot4)
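Fit a linear model to each data set using the lm() function.
# geom_smooth(method = "lm") fits y ~ x with lm() and overlays the fitted line
plot1 <- ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x1_y1 with lm")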
plot1
plot2 <- ggplot(data, aes(x2, y2)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x2_y2 with lm")
plot2
plot3 <- ggplot(data, aes(x3, y3)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x3_y3 with lm")
plot3
plot4 <- ggplot(data, aes(x4, y4)) + geom_point() + geom_smooth(se = FALSE, method = "lm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x4_y4 with lm")
plot4
grid.arrange(plot1, plot2, plot3, plot4)
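The plots above draw the fitted lines through geom_smooth(). To inspect the fitted models themselves, lm() can be called directly on each pair. The sketch below is one way to do that; `fits` and `coefs` are illustrative names, and the row labels `set1` to `set4` are assumptions made for readability.

```r
# Sketch: fit y_i ~ x_i with lm() for each of the four data sets
# and collect the coefficients. fits and coefs are illustrative names.
data("anscombe")

fits <- lapply(1:4, function(i) {
  # reformulate() builds the formula y_i ~ x_i from the column names
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

coefs <- t(sapply(fits, coef))
dimnames(coefs) <- list(paste0("set", 1:4), c("intercept", "slope"))

round(coefs, 3)
```

All four regressions come out with an intercept close to 3 and a slope close to 0.5, which is why the four fitted lines look the same even though the point clouds do not.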
plot1_glm <- ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x1_y1 with glm")
plot2_glm <- ggplot(data, aes(x2, y2)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x2_y2 with glm")
plot3_glm <- ggplot(data, aes(x3, y3)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x3_y3 with glm")
plot4_glm <- ggplot(data, aes(x4, y4)) + geom_point() + geom_smooth(se = FALSE, method = "glm") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x4_y4 with glm")
plot1_loess <- ggplot(data, aes(x1, y1)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x1_y1 with loess")
plot2_loess <- ggplot(data, aes(x2, y2)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x2_y2 with loess")
plot3_loess <- ggplot(data, aes(x3, y3)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x3_y3 with loess")
plot4_loess <- ggplot(data, aes(x4, y4)) + geom_point() + geom_smooth(se = FALSE, method = "loess") + theme(panel.background = element_rect(fill = "white")) + theme(axis.line = element_line(colour = "black")) + ggtitle("x4_y4 with loess")
grid.arrange(plot1, plot1_glm, plot1_loess, plot2, plot2_glm, plot2_loess, plot3, plot3_glm, plot3_loess, plot4, plot4_glm, plot4_loess)
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : at 7.945
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : radius 0.003025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : all data on boundary of neighborhood. make span bigger
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 7.945
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 0.055
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 1
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 122.21
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : zero-width neighborhood. make span bigger
## Warning: Computation failed in `stat_smooth()`:
## NA/NaN/Inf in foreign function call (arg 5)
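The loess warnings and the failed `stat_smooth()` computation most likely come from the fourth data set, since the reported positions sit next to x = 8. A quick check of how many distinct x4 values exist helps explain this; the sketch below is illustrative, and `n_distinct_x4` is a hypothetical name.

```r
# Sketch: count the distinct x values in the fourth data set.
# n_distinct_x4 is an illustrative name.
data("anscombe")

table(anscombe$x4)  # frequency of each distinct x4 value

n_distinct_x4 <- length(unique(anscombe$x4))
n_distinct_x4
```

With ten of the eleven points stacked at x4 = 8 and a single point at x4 = 19, a local regression has no spread of x values to work with inside its neighborhood, which matches the "zero-width neighborhood" warning.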
I fit each data set with three methods: lm, glm, and loess. Because glm defaults to a Gaussian family with an identity link, its fitted line coincides with the lm line in every panel. For x1_y1, the lm and glm trendlines are very close to the data points, while for x2_y2 it is clear that loess provides the better fit to the data. x3_y3 also gets a very good fit with loess, although the fits from the linear model and the generalized linear model are also acceptable. x4_y4 only gets a model fit with lm and glm; no fit is possible with loess, because x4 takes only two distinct values (ten points at 8 and one at 19).
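The agreement between the lm and glm trendlines can be checked numerically. This is a small sketch for the first data set only, relying on the default `family = gaussian` of glm(); `fit_lm`, `fit_glm`, and `same_coefs` are illustrative names.

```r
# Sketch: lm() and glm() with the default gaussian family give the same line.
# fit_lm, fit_glm and same_coefs are illustrative names.
data("anscombe")

fit_lm  <- lm(y1 ~ x1, data = anscombe)
fit_glm <- glm(y1 ~ x1, data = anscombe)  # family = gaussian by default

same_coefs <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
same_coefs
```

The glm panels therefore add no new information here; they would only differ from lm if a non-default family or link were supplied.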
The summary statistics of Anscombe’s Quartet are nearly identical across the four data sets. The x columns share the same mean and variance, the y columns have almost the same mean and variance, and each [x ; y] pair has a correlation of about 0.816. This would lead us to believe that the [x ; y] combinations are very similar and could even be interchangeable. However, the graphical representation shows that they are in fact very different data sets. These differences would be very hard to detect from the summary statistics alone. Anscombe’s Quartet is therefore a strong argument for data visualization, especially in response to claims that numbers and summary statistics are more accurate than graphs. It shows that data visualization is an important tool for data analysis, to be used in conjunction with summary statistics and other tools in order to get a more complete understanding of the data and what they represent.