03-13-2023The objectives of this problem set is to orient you to a number of
activities in R and to conduct a thoughtful exercise in
appreciating the importance of data visualization. For each question
enter your code or text response in the code chunk that
completes/answers the activity or question requested. To submit this
homework you will create the document in Rstudio, using the knitr
package (button included in Rstudio) and then submit the document to
your Rpubs account. Once uploaded you
will submit the link to that document on Canvas. Please make sure that
this link is hyper linked and that I can see the visualization and the
code required to create it. Each question is worth 5 points.
anscombe data that is part of the
library(datasets) in R. And assign that data
to a new object called data.library(datasets)
data(anscombe)
data <- anscombe
data
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 10 10 10 8 8.04 9.14 7.46 6.58
## 2 8 8 8 8 6.95 8.14 6.77 5.76
## 3 13 13 13 8 7.58 8.74 12.74 7.71
## 4 9 9 9 8 8.81 8.77 7.11 8.84
## 5 11 11 11 8 8.33 9.26 7.81 8.47
## 6 14 14 14 8 9.96 8.10 8.84 7.04
## 7 6 6 6 8 7.24 6.13 6.08 5.25
## 8 4 4 4 19 4.26 3.10 5.39 12.50
## 9 12 12 12 8 10.84 9.13 8.15 5.56
## 10 7 7 7 8 4.82 7.26 6.42 7.91
## 11 5 5 5 8 5.68 4.74 5.73 6.89
str(data)
## 'data.frame': 11 obs. of 8 variables:
## $ x1: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x2: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x3: num 10 8 13 9 11 14 6 4 12 7 ...
## $ x4: num 8 8 8 8 8 8 8 19 8 8 ...
## $ y1: num 8.04 6.95 7.58 8.81 8.33 ...
## $ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
## $ y3: num 7.46 6.77 12.74 7.11 7.81 ...
## $ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
dplyr package!)library(dplyr)
summary(data)
## x1 x2 x3 x4 y1
## Min. : 4.0 Min. : 4.0 Min. : 4.0 Min. : 8 Min. : 4.260
## 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 6.5 1st Qu.: 8 1st Qu.: 6.315
## Median : 9.0 Median : 9.0 Median : 9.0 Median : 8 Median : 7.580
## Mean : 9.0 Mean : 9.0 Mean : 9.0 Mean : 9 Mean : 7.501
## 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.:11.5 3rd Qu.: 8 3rd Qu.: 8.570
## Max. :14.0 Max. :14.0 Max. :14.0 Max. :19 Max. :10.840
## y2 y3 y4
## Min. :3.100 Min. : 5.39 Min. : 5.250
## 1st Qu.:6.695 1st Qu.: 6.25 1st Qu.: 6.170
## Median :8.140 Median : 7.11 Median : 7.040
## Mean :7.501 Mean : 7.50 Mean : 7.501
## 3rd Qu.:8.950 3rd Qu.: 7.98 3rd Qu.: 8.190
## Max. :9.260 Max. :12.74 Max. :12.500
colMeans(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
apply(data,2,var)
## x1 x2 x3 x4 y1 y2 y3 y4
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620 4.123249
cor(data[,1:4],data[,5:8])
## y1 y2 y3 y4
## x1 0.8164205 0.8162365 0.8162867 -0.3140467
## x2 0.8164205 0.8162365 0.8162867 -0.3140467
## x3 0.8164205 0.8162365 0.8162867 -0.3140467
## x4 -0.5290927 -0.7184365 -0.3446610 0.8165214
library(ggplot2)
par(mfrow = c(2, 2))
plot(data$x1, data$y1, main = "Scatterplot for Pair x1 and y1", xlab = "x1", ylab = "y1")
plot(data$x2, data$y2, main = "Scatterplot for Pair x2 and y2", xlab = "x2", ylab = "y2")
plot(data$x3, data$y3, main = "Scatterplot for Pair x3 and y3", xlab = "x3", ylab = "y3")
plot(data$x4, data$y4, main = "Scatterplot for Pair x4 and y4", xlab = "x4", ylab = "y4")
par(mfrow = c(2, 2))
plot(data$x1, data$y1, main = "Scatterplot for Pair x1 and y1", xlab = "x1", ylab = "y1", pch = 19, col='Blue')
plot(data$x2, data$y2, main = "Scatterplot for Pair x2 and y2", xlab = "x2", ylab = "y2", pch = 19, col='Blue')
plot(data$x3, data$y3, main = "Scatterplot for Pair x3 and y3", xlab = "x3", ylab = "y3", pch = 19, col='Blue')
plot(data$x4, data$y4, main = "Scatterplot for Pair x4 and y4", xlab = "x4", ylab = "y4", pch = 19, col='Blue')
lm()
function.attach(data)
Lm1 <- lm(x1~y1)
summary(Lm1)
##
## Call:
## lm(formula = x1 ~ y1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6522 -1.5117 -0.2657 1.2341 3.8946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9975 2.4344 -0.410 0.69156
## y1 1.3328 0.3142 4.241 0.00217 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6295
## F-statistic: 17.99 on 1 and 9 DF, p-value: 0.00217
Lm2 <- lm(x2~y2)
summary(Lm2)
##
## Call:
## lm(formula = x2 ~ y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8516 -1.4315 -0.3440 0.8467 4.2017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9948 2.4354 -0.408 0.69246
## y2 1.3325 0.3144 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared: 0.6662, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002179
Lm3 <- lm(x3~y3)
summary(Lm3)
##
## Call:
## lm(formula = x3 ~ y3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9869 -1.3733 -0.0266 1.3200 3.2133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0003 2.4362 -0.411 0.69097
## y3 1.3334 0.3145 4.239 0.00218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6292
## F-statistic: 17.97 on 1 and 9 DF, p-value: 0.002176
Lm4 <- lm(x4~y4)
summary(Lm4)
##
## Call:
## lm(formula = x4 ~ y4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7859 -1.4122 -0.1853 1.4551 3.3329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0036 2.4349 -0.412 0.68985
## y4 1.3337 0.3143 4.243 0.00216 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared: 0.6667, Adjusted R-squared: 0.6297
## F-statistic: 18 on 1 and 9 DF, p-value: 0.002165
attach(data)
## The following objects are masked from data (pos = 3):
##
## x1, x2, x3, x4, y1, y2, y3, y4
par(mfrow=c(2,2))
plot(x1,y1,main="Scatterplot for x1 vs y1",xlab="x1",ylab="y1",pch=20)
abline(lm(x1 ~ y1))
plot(x2,y2,main="Scatterplot for x2 vs y2",xlab="x2",ylab="y2",pch=20)
abline(lm(x2 ~ y2))
plot(x3,y3,main="Scatterplot for x3 vs y3 ",xlab="x3",ylab="y3",pch=20)
abline(lm(x3 ~ y3))
plot(x4,y4,main="Scatterplot for x4 vs y4",xlab="x4",ylab="y4",pch=20)
abline(lm(x4 ~ y4))
Looking at the illustration for Dataset 1, it appears that the linear regression model is a good fit for the data. However, the data in the figure from Dataset 2 appears to be non-linear, perhaps quadratic, rendering the linear model fitting inappropriate. Similarly, only one data point passes through the fitted line in the figure based on Dataset 3, and another point is very far from the regression fitted line, indicating that the linear model is incorrect. The points in Dataset 4 tend to cluster around the same x value, with one obvious outlier. As a result, verifying the accuracy of the information is essential. The linear model fit should also be reported without further adjustment if the data are reliable. But it’s important to note that one of the data points was crucial in fitting the linear regression model.
Despite their visual differences, all four datasets in Anscombe’s Quartet share the same statistical characteristics. This demonstrates that relying solely on numerical summaries may not be fruitful in discerning the true patterns present in data. In order to discover, comprehend, and convey data relationships and patterns, data visualization is crucial. It’s useful for picking out anomalies, trends, and patterns that would otherwise be obscured by raw data. As a result, Anscombe’s Quartet stresses the significance of data visualization in analysis and decision-making.