The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes or answers the activity or question requested. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignment on Moodle.
anscombe data that is part of the library(datasets) in R. And assign that data to a new object called data.
summary(anscombe)
## x1 x2 x3 x4
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19
## y1 y2 y3 y4
## Min. : 4.260 Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.: 6.315 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median : 7.580 Median :8.140 Median : 7.11 Median : 7.040
## Mean : 7.501 Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.: 8.570 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :10.840 Max. :9.260 Max. :12.74 Max. :12.500
data <- anscombe
str(data)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
fBasics() package!)
summary(data)
## x1 x2 x3 x4
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19
## y1 y2 y3 y4
## Min. : 4.260 Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.: 6.315 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median : 7.580 Median :8.140 Median : 7.11 Median : 7.040
## Mean : 7.501 Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.: 8.570 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :10.840 Max. :9.260 Max. :12.74 Max. :12.500
fBasics::ghMean(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0
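The all-zero output above suggests that `ghMean()` is not computing column means: in fBasics it returns the mean of a generalized hyperbolic distribution from its parameters, not a summary of a data frame. A minimal base-R sketch of the per-column mean and variance and the per-pair correlation follows. It assumes `data` is the copy of `anscombe` created earlier; the names `col_means`, `col_vars`, and `pair_cor` are illustrative and not part of the original answer.

```r
data <- datasets::anscombe  # same contents as the `data` object created earlier

col_means <- sapply(data, mean)  # mean of each of the eight columns
col_vars  <- sapply(data, var)   # sample variance of each column

# Correlation within each pair: (x1, y1), (x2, y2), (x3, y3), (x4, y4)
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("set", 1:4)

round(col_means, 3)
round(col_vars, 3)
round(pair_cor, 3)
```

The four sets should come out nearly identical on all three statistics, which is the point of the quartet.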
library(ggplot2)
plot(data$x1,data$y1, main = "x1 & y1 Scatter Plot")
plot(data$x2,data$y2, main = "x2 & y2 Scatter Plot")
plot(data$x3,data$y3, main = "x3 & y3 Scatter Plot")
plot(data$x4,data$y4, main = "x4 & y4 Scatter Plot")
# ggplot2 alternative, left commented out; the base-graphics version below is used.
# P1 <- ggplot(data, aes(x1, y1)) + geom_point(shape = 19) + ggtitle("x1 & y1 Scatter Plot")
# P2 <- ggplot(data, aes(x2, y2)) + geom_point(shape = 19) + ggtitle("x2 & y2 Scatter Plot")
# P3 <- ggplot(data, aes(x3, y3)) + geom_point(shape = 19) + ggtitle("x3 & y3 Scatter Plot")
# P4 <- ggplot(data, aes(x4, y4)) + geom_point(shape = 19) + ggtitle("x4 & y4 Scatter Plot")
#
# install.packages("gridExtra")
# Arrange the four ggplot objects in a 2 x 2 grid
# gridExtra::grid.arrange(P1, P2, P3, P4, nrow = 2, ncol = 2)
par(mfrow = c(2,2))
# plot() draws as a side effect and returns NULL, so the result is not assigned
plot(data$x1,data$y1, main = "x1 & y1 Scatter Plot", pch = 19)
plot(data$x2,data$y2, main = "x2 & y2 Scatter Plot", pch = 19)
plot(data$x3,data$y3, main = "x3 & y3 Scatter Plot", pch = 19)
plot(data$x4,data$y4, main = "x4 & y4 Scatter Plot", pch = 19)
lm() function.
L1 <- lm(data$x1~data$y1)
L2 <- lm(data$x2~data$y2)
L3 <- lm(data$x3~data$y3)
L4 <- lm(data$x4~data$y4)
L1
##
## Call:
## lm(formula = data$x1 ~ data$y1)
##
## Coefficients:
## (Intercept) data$y1
## -0.9975 1.3328
summary(L1)
##
## Call:
## lm(formula = data$x1 ~ data$y1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6522 -1.5117 -0.2657 1.2341 3.8946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9975 2.4344 -0.410 0.69156
## data$y1 1.3328 0.3142 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
L2
##
## Call:
## lm(formula = data$x2 ~ data$y2)
##
## Coefficients:
## (Intercept) data$y2
## -0.9948 1.3325
summary(L2)
##
## Call:
## lm(formula = data$x2 ~ data$y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8516 -1.4315 -0.3440 0.8467 4.2017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9948 2.4354 -0.408 0.69246
## data$y2 1.3325 0.3144 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
L3
##
## Call:
## lm(formula = data$x3 ~ data$y3)
##
## Coefficients:
## (Intercept) data$y3
## -1.000 1.333
summary(L3)
##
## Call:
## lm(formula = data$x3 ~ data$y3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9869 -1.3733 -0.0266 1.3200 3.2133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0003 2.4362 -0.411 0.69097
## data$y3 1.3334 0.3145 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
L4
##
## Call:
## lm(formula = data$x4 ~ data$y4)
##
## Coefficients:
## (Intercept) data$y4
## -1.004 1.334
summary(L4)
##
## Call:
## lm(formula = data$x4 ~ data$y4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7859 -1.4122 -0.1853 1.4551 3.3329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0036 2.4349 -0.412 0.68985
## data$y4 1.3337 0.3143 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
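The models above regress each x on its paired y. If the intended fit is the more usual regression of y on x, a sketch using a formula together with the `data` argument might look like the following. The names `fits` and `coef_table` are illustrative and not part of the original answer.

```r
data <- datasets::anscombe  # same contents as the `data` object created earlier

# Fit y_i ~ x_i for each of the four sets; reformulate() builds the formula
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})
names(fits) <- paste0("set", 1:4)

# One row per set: intercept, slope, and R-squared
coef_table <- t(sapply(fits, function(m) {
  c(coef(m), r.squared = summary(m)$r.squared)
}))
colnames(coef_table) <- c("intercept", "slope", "r.squared")

round(coef_table, 3)
```

Passing `data = data` instead of writing `data$x1 ~ data$y1` keeps the coefficient labels short (`x1` rather than `data$x1`) and lets `predict()` work with new data.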
par(mfrow = c(2,2))
plot(L1)
# par(mfrow = c(2,2))
plot(L2)
# par(mfrow = c(2,2))
plot(L3)
# par(mfrow = c(2,2))
plot(L4)
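The calls above draw the standard diagnostic plots for each model. To see how a fitted line sits on the raw points, one option is to overlay it on each scatter plot. This is a sketch only: it fits y on x inside the loop (the opposite direction from `L1` to `L4`), so that `abline()` matches the plot axes.

```r
data <- datasets::anscombe  # same contents as the `data` object created earlier

op <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels; keep the old settings
for (i in 1:4) {
  x <- data[[paste0("x", i)]]
  y <- data[[paste0("y", i)]]
  plot(x, y, pch = 19,
       main = paste0("x", i, " & y", i, " Scatter Plot"),
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x), col = "red")  # least-squares line of y on x
}
par(op)  # restore the previous graphics settings
```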
str(L1)
## List of 12
##  $ coefficients : Named num [1:2] -0.998 1.333
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "data$y1"
##  $ residuals    : Named num [1:11] 0.281 -0.266 3.895 -1.745 0.895 ...
##   ..- attr(*, "names")= chr [1:11] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:11] -29.85 8.563 3.832 -1.865 0.797 ...
##   ..- attr(*, "names")= chr [1:11] "(Intercept)" "data$y1" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:11] 9.72 8.27 9.11 10.74 10.11 ...
##   ..- attr(*, "names")= chr [1:11] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:11] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "data$y1"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.3 1.11
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 9
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = data$x1 ~ data$y1)
##  $ terms        :Classes 'terms', 'formula' language data$x1 ~ data$y1
##   .. ..- attr(*, "variables")= language list(data$x1, data$y1)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "data$x1" "data$y1"
##   .. .. .. ..$ : chr "data$y1"
##   .. ..- attr(*, "term.labels")= chr "data$y1"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##   .. ..- attr(*, "predvars")= language list(data$x1, data$y1)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "data$x1" "data$y1"
##  $ model        :'data.frame': 11 obs. of 2 variables:
##   ..$ data$x1: num [1:11] 10 8 13 9 11 14 6 4 12 7 ...
##   ..$ data$y1: num [1:11] 8.04 6.95 7.58 8.81 8.33 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' language data$x1 ~ data$y1
##   .. .. ..- attr(*, "variables")= language list(data$x1, data$y1)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "data$x1" "data$y1"
##   .. .. .. .. ..$ : chr "data$y1"
##   .. .. ..- attr(*, "term.labels")= chr "data$y1"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##   .. .. ..- attr(*, "predvars")= language list(data$x1, data$y1)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "data$x1" "data$y1"
##  - attr(*, "class")= chr "lm"
anova(L1)
## Analysis of Variance Table
##
## Response: data$x1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## data$y1    1  73.32  73.320   17.99 0.00217 **
## Residuals  9  36.68   4.076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(L2)
## Analysis of Variance Table
##
## Response: data$x2
##           Df Sum Sq Mean Sq F value   Pr(>F)
## data$y2    1 73.287  73.287  17.966 0.002179 **
## Residuals  9 36.713   4.079
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(L3)
## Analysis of Variance Table
##
## Response: data$x3
##           Df Sum Sq Mean Sq F value   Pr(>F)
## data$y3    1 73.296  73.296  17.972 0.002176 **
## Residuals  9 36.704   4.078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(L4)
## Analysis of Variance Table
##
## Response: data$x4
##           Df Sum Sq Mean Sq F value   Pr(>F)
## data$y4    1 73.338  73.338  18.003 0.002165 **
## Residuals  9 36.662   4.074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anscombe’s Quartet consists of four sets of data; by juxtaposing them, we can compare and contrast each set. When I used the summary function to examine the individual columns, I noticed that x1, x2, and x3 have similar characteristics, and my initial assumption was that these datasets were identical. Further along in the process, however, I came to realize that they are rather different. P1 appears to be randomly scattered. P2 follows a curved trend, which could likely be fitted with a regression. P3 has one outlier, while the rest of the points appear to follow a linear relationship. P4 is rather interesting: the majority of the data points are stacked together, with one outlier. I find that visualization is a good way of examining data, because summary statistics sometimes do not tell the whole story. It is also fairly quick to examine data graphically.