The objectives of this problem set are to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes/answers the activity or question requested. Finally, upon completion, name your final output .html file as: YourName_ANLY512-Section-Year-Semester.html
Upload it to the “Problem Set 2” assignment on your R Pubs account and submit the link to Moodle. Points will be deducted for uploading in an improper format.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
library(datasets)
data <- anscombe
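A quick check that the data loaded as expected can be sketched as follows. This is a minimal illustration that assumes only base R and the built-in datasets package; the inspection calls are not part of the original answer.

```r
library(datasets)

# Copy the built-in data set into a new object, as in the step above
data <- anscombe

# Inspect the structure: 11 rows and 8 numeric columns (x1..x4, y1..y4)
str(data)
dim(data)
names(data)
```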
fBasics() package!)
library(fBasics)
## Warning: package 'fBasics' was built under R version 3.5.3
## Loading required package: timeDate
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.5.3
fBasics::basicStats(data)
##                    x1        x2        x3        x4        y1        y2
## nobs        11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
## NAs          0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## Minimum      4.000000  4.000000  4.000000  8.000000  4.260000  3.100000
## Maximum     14.000000 14.000000 14.000000 19.000000 10.840000  9.260000
## 1. Quartile  6.500000  6.500000  6.500000  8.000000  6.315000  6.695000
## 3. Quartile 11.500000 11.500000 11.500000  8.000000  8.570000  8.950000
## Mean         9.000000  9.000000  9.000000  9.000000  7.500909  7.500909
## Median       9.000000  9.000000  9.000000  8.000000  7.580000  8.140000
## Sum         99.000000 99.000000 99.000000 99.000000 82.510000 82.510000
## SE Mean      1.000000  1.000000  1.000000  1.000000  0.612541  0.612568
## LCL Mean     6.771861  6.771861  6.771861  6.771861  6.136083  6.136024
## UCL Mean    11.228139 11.228139 11.228139 11.228139  8.865735  8.865795
## Variance    11.000000 11.000000 11.000000 11.000000  4.127269  4.127629
## Stdev        3.316625  3.316625  3.316625  3.316625  2.031568  2.031657
## Skewness     0.000000  0.000000  0.000000  2.466911 -0.048374 -0.978693
## Kurtosis    -1.528926 -1.528926 -1.528926  4.520661 -1.199123 -0.514319
##                    y3        y4
## nobs        11.000000 11.000000
## NAs          0.000000  0.000000
## Minimum      5.390000  5.250000
## Maximum     12.740000 12.500000
## 1. Quartile  6.250000  6.170000
## 3. Quartile  7.980000  8.190000
## Mean         7.500000  7.500909
## Median       7.110000  7.040000
## Sum         82.500000 82.510000
## SE Mean      0.612196  0.612242
## LCL Mean     6.135943  6.136748
## UCL Mean     8.864057  8.865070
## Variance     4.122620  4.123249
## Stdev        2.030424  2.030579
## Skewness     1.380120  1.120774
## Kurtosis     1.240044  0.628751
sapply(1:4, function(x) cor(data[ , x], data[ , x+4]))
## [1] 0.8164205 0.8162365 0.8162867 0.8165214
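The same summary can also be computed with base R alone, which may help if fBasics is not installed. The sketch below is one possible approach, not the method the assignment requires; the object names (col_means, col_vars, pair_cor) and the pair labels are made up for illustration.

```r
library(datasets)
data <- anscombe

# Column-wise mean and variance for all eight columns
col_means <- sapply(data, mean)
col_vars  <- sapply(data, var)

# Pearson correlation for each (x_i, y_i) pair, looked up by column name
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "~y", 1:4)  # hypothetical labels

round(rbind(mean = col_means, variance = col_vars), 3)
round(pair_cor, 3)
```

Looking columns up by name instead of by position (x and x+4) keeps the pairing correct even if the column order of the data frame changes.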
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
p1 <- ggplot(data) +
  geom_point(aes(x1, y1), color = "black", size = 1.5) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  expand_limits(x = 0, y = 0) +
  labs(x = "x1", y = "y1",
       title = "Dataset 1") +
  theme_bw()
p1
p2 <- ggplot(data) +
  geom_point(aes(x2, y2), color = "black", size = 1.5) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  expand_limits(x = 0, y = 0) +
  labs(x = "x2", y = "y2",
       title = "Dataset 2") +
  theme_bw()
p2
p3 <- ggplot(data) +
  geom_point(aes(x3, y3), color = "black", size = 1.5) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  expand_limits(x = 0, y = 0) +
  labs(x = "x3", y = "y3",
       title = "Dataset 3") +
  theme_bw()
p3
p4 <- ggplot(data) +
  geom_point(aes(x4, y4), color = "black", size = 1.5) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  expand_limits(x = 0, y = 0) +
  labs(x = "x4", y = "y4",
       title = "Dataset 4") +
  theme_bw()
p4
library(grid)
library(gridExtra)
grid.arrange(grobs = list(p1, p2, p3, p4),
             ncol = 2,
             top = "Anscombe's Quartet")
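Because the four panels differ only in their column names, they can also be generated from a small helper. The following is a sketch under stated assumptions: make_panel() is a hypothetical function introduced here for illustration, and the .data pronoun inside aes() assumes ggplot2 3.0.0 or later.

```r
library(datasets)
library(ggplot2)
library(gridExtra)

data <- anscombe

# make_panel() is a hypothetical helper, not part of the original answer.
# It builds the scatter plot for dataset i from columns x<i> and y<i>.
make_panel <- function(i) {
  xvar <- paste0("x", i)
  yvar <- paste0("y", i)
  ggplot(data, aes(.data[[xvar]], .data[[yvar]])) +
    geom_point(color = "black", size = 1.5) +
    scale_x_continuous(breaks = seq(0, 20, 2)) +
    scale_y_continuous(breaks = seq(0, 12, 2)) +
    expand_limits(x = 0, y = 0) +
    labs(x = xvar, y = yvar, title = paste("Dataset", i)) +
    theme_bw()
}

# Build all four panels and arrange them in a 2 x 2 grid
panels <- lapply(1:4, make_panel)
grid.arrange(grobs = panels, ncol = 2, top = "Anscombe's Quartet")
```

Keeping the plot specification in one place means a change to the theme or axis breaks applies to all four panels at once.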
lm() function.
lm1 <- lm(y1 ~ x1, data = data)
lm1
##
## Call:
## lm(formula = y1 ~ x1, data = data)
##
## Coefficients:
## (Intercept)           x1
##      3.0001       0.5001
lm2 <- lm(y2 ~ x2, data = data)
lm2
##
## Call:
## lm(formula = y2 ~ x2, data = data)
##
## Coefficients:
## (Intercept)           x2
##       3.001        0.500
lm3 <- lm(y3 ~ x3, data = data)
lm3
##
## Call:
## lm(formula = y3 ~ x3, data = data)
##
## Coefficients:
## (Intercept)           x3
##      3.0025       0.4997
lm4 <- lm(y4 ~ x4, data = data)
lm4
##
## Call:
## lm(formula = y4 ~ x4, data = data)
##
## Coefficients:
## (Intercept)           x4
##      3.0017       0.4999
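To compare the four fits side by side, the coefficients can be collected into one table. This is a minimal sketch using base R only; the names fits and fit_table are illustrative, and the R-squared column is an addition that the original answer does not report.

```r
library(datasets)
data <- anscombe

# Fit y_i ~ x_i for i = 1..4 and keep the models in a list
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# One row per dataset: intercept, slope, and R-squared
fit_table <- t(sapply(fits, function(m) {
  c(coef(m), r.squared = summary(m)$r.squared)
}))
colnames(fit_table) <- c("intercept", "slope", "r.squared")
rownames(fit_table) <- paste("Dataset", 1:4)

round(fit_table, 4)
```

The rows should agree with the printed model output above: an intercept near 3 and a slope near 0.5 in every dataset.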
p1_fitted <- p1 + geom_abline(intercept = 3.0001, slope = 0.5001, color = "red")
p2_fitted <- p2 + geom_abline(intercept = 3.001, slope = 0.500, color = "red")
p3_fitted <- p3 + geom_abline(intercept = 3.0025, slope = 0.4997, color = "red")
p4_fitted <- p4 + geom_abline(intercept = 3.0017, slope = 0.4999, color = "red")
grid.arrange(grobs = list(p1_fitted, p2_fitted,
                          p3_fitted, p4_fitted),
             ncol = 2,
             top = "Anscombe's Quartet")
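Typing the coefficients by hand invites transcription slips, so one alternative is to read them from the fitted model with coef(). The sketch below shows this for dataset 1 only, with a simplified version of the plot so that the block runs on its own; the variable b is introduced here for illustration.

```r
library(datasets)
library(ggplot2)

data <- anscombe
lm1 <- lm(y1 ~ x1, data = data)

# Simplified version of the dataset 1 scatter plot
p1 <- ggplot(data) +
  geom_point(aes(x1, y1), color = "black", size = 1.5) +
  labs(x = "x1", y = "y1", title = "Dataset 1") +
  theme_bw()

# Pull the intercept and slope from the model instead of retyping them
b <- coef(lm1)
p1_fitted <- p1 +
  geom_abline(intercept = b[["(Intercept)"]], slope = b[["x1"]], color = "red")
p1_fitted
```

The same pattern applies to the other three datasets by swapping in lm2, lm3, and lm4 with their own predictor names.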
# Dataset 1 has a reasonably good positive linear fit, with a correlation coefficient of 0.82, which indicates that knowing the value of x predicts the value of y with relatively little noise. In dataset 2, the relationship between x and y is clearly not linear; although the correlation coefficient is again 0.82, a linear model is not appropriate for this dataset. In dataset 3, the relationship is linear, but the model does not fit the data well: a single outlier pulls the regression line away from the remaining points, so almost none of them fall on it. Dataset 4 also does not have a well-fitting model, as the plot above shows: every x value but one is identical, so the fitted line is determined by a single point.
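The observations about datasets 3 and 4 can be checked numerically. The following sketch uses base R diagnostics (resid() and hatvalues()); it is an illustrative addition rather than part of the original answer, and the object names are hypothetical.

```r
library(datasets)
data <- anscombe

lm3 <- lm(y3 ~ x3, data = data)
lm4 <- lm(y4 ~ x4, data = data)

# Dataset 3: the observation with the largest absolute residual
i3 <- which.max(abs(resid(lm3)))
outlier3 <- data[i3, c("x3", "y3")]
outlier3

# Dataset 4: leverage (hat value) of each observation; a value near 1
# means the fitted line is forced through that point
lev4 <- hatvalues(lm4)
i4 <- which.max(lev4)
high_leverage4 <- cbind(data[i4, c("x4", "y4")], leverage = lev4[i4])
high_leverage4
```

A single extreme residual in dataset 3 and a single point with leverage close to 1 in dataset 4 are the numerical counterparts of what the plots show.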
# Looking at the datasets in Anscombe's Quartet, it was interesting to learn how datasets with the same summary statistics (mean, standard deviation, and correlation) can look very different when plotted. Interpreting data by reporting summary statistics alone can therefore be erroneous if data visualization is overlooked.