The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html and upload it to the “Problem Set 2” assignment on Moodle.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
data <- anscombe
(Hint: use the fBasics() package!)
library(fBasics)
## Warning: package 'fBasics' was built under R version 3.4.4
## Loading required package: timeDate
## Warning: package 'timeDate' was built under R version 3.4.3
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.4.4
colMeans(data)
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colVars(data)
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
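The same column summaries can also be computed without fBasics. The following is a minimal sketch using only base R (`sapply()` with `mean()` and `var()`); the object names `col_means` and `col_vars` are illustrative and are not part of the original solution.

```r
# Base R alternative to colMeans()/colVars() (assumption: fBasics is not installed)
data <- anscombe                     # built-in data set from the datasets package

col_means <- sapply(data, mean)      # mean of each column
col_vars  <- sapply(data, var)       # sample variance of each column

# Side-by-side table: one row per column of the data
round(cbind(mean = col_means, variance = col_vars), 3)
```

Putting both statistics in one table makes it easier to see that the x columns share a mean and variance, and that the y columns agree to about two decimal places.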
cor_x1_y1 <- cor(data$x1,data$y1)
cat("The correlation between x1 and y1 is",cor_x1_y1,"\n")
## The correlation between x1 and y1 is 0.8164205
cor_x2_y2 <- cor(data$x2,data$y2)
cat("The correlation between x2 and y2 is",cor_x2_y2,"\n")
## The correlation between x2 and y2 is 0.8162365
cor_x3_y3 <- cor(data$x3,data$y3)
cat("The correlation between x3 and y3 is",cor_x3_y3,"\n")
## The correlation between x3 and y3 is 0.8162867
cor_x4_y4 <- cor(data$x4,data$y4)
cat("The correlation between x4 and y4 is",cor_x4_y4,"\n")
## The correlation between x4 and y4 is 0.8165214
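The four correlation calls above repeat the same pattern, so they can be collapsed into one loop. This is a sketch only; `pair_cor` is a hypothetical name, and the columns are assumed to be named `x1`..`x4` and `y1`..`y4` as in the built-in `anscombe` data.

```r
data <- anscombe

# Correlation for each matched pair (x1, y1), ..., (x4, y4)
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "_y", 1:4)

round(pair_cor, 4)
```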
plot(data$x1,data$y1,main="Scatter plot of x1 and y1")
plot(data$x2,data$y2,main="Scatter plot of x2 and y2")
plot(data$x3,data$y3,main="Scatter plot of x3 and y3")
plot(data$x4,data$y4,main="Scatter plot of x4 and y4")
par(mfrow=c(2,2))
plot(data$x1,data$y1,main="Scatter plot of x1 and y1",pch=19)
plot(data$x2,data$y2,main="Scatter plot of x2 and y2",pch=19)
plot(data$x3,data$y3,main="Scatter plot of x3 and y3",pch=19)
plot(data$x4,data$y4,main="Scatter plot of x4 and y4",pch=19)
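When the four panels are compared side by side, it can help to give them identical axis limits, since by default each panel gets its own scale. The sketch below is one possible way to do that in base graphics; the shared `xlim` and `ylim` values are a design choice added here, not something the problem set requires.

```r
data <- anscombe

# Common axis limits so the four panels are directly comparable
xlim <- range(unlist(data[paste0("x", 1:4)]))
ylim <- range(unlist(data[paste0("y", 1:4)]))

op <- par(mfrow = c(2, 2))            # 2 x 2 panel layout
for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]],
       xlim = xlim, ylim = ylim, pch = 19,
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("Scatter plot of x", i, " and y", i))
}
par(op)                               # restore the previous layout
```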
Fit a linear model to each pair using the lm() function.
model_x1y1 <- lm(data$x1~data$y1)
model_x2y2 <- lm(data$x2~data$y2)
model_x3y3 <- lm(data$x3~data$y3)
model_x4y4 <- lm(data$x4~data$y4)
par(mfrow=c(2,2))
# The models regress x on y, so y goes on the horizontal axis; abline() is called after plot()
plot(data$y1,data$x1,xlab="y1",ylab="x1",main="x1 regressed on y1",pch=19)
abline(model_x1y1)
plot(data$y2,data$x2,xlab="y2",ylab="x2",main="x2 regressed on y2",pch=19)
abline(model_x2y2)
plot(data$y3,data$x3,xlab="y3",ylab="x3",main="x3 regressed on y3",pch=19)
abline(model_x3y3)
plot(data$y4,data$x4,xlab="y4",ylab="x4",main="x4 regressed on y4",pch=19)
abline(model_x4y4)
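The models above regress x on y. Anscombe's quartet is more often shown with y regressed on x, so for comparison here is a hedged sketch of that alternative fit. The names `fits_yx` and `coef_yx` are hypothetical, and this is not the model used in the rest of this report.

```r
data <- anscombe

# Fit y_i ~ x_i for each of the four pairs (hypothetical object name: fits_yx)
fits_yx <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = data)
})

# One row per data set: intercept and slope of the fitted line
coef_yx <- t(sapply(fits_yx, coef))
dimnames(coef_yx) <- list(paste0("set", 1:4), c("intercept", "slope"))

round(coef_yx, 3)
```

With this orientation, `abline(fits_yx[[i]])` draws the fitted line directly on a plot that has x on the horizontal axis.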
summary(model_x1y1)
## 
## Call:
## lm(formula = data$x1 ~ data$y1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6522 -1.5117 -0.2657  1.2341  3.8946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9975     2.4344  -0.410  0.69156   
## data$y1       1.3328     0.3142   4.241  0.00217 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
summary(model_x2y2)
## 
## Call:
## lm(formula = data$x2 ~ data$y2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8516 -1.4315 -0.3440  0.8467  4.2017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -0.9948     2.4354  -0.408  0.69246   
## data$y2       1.3325     0.3144   4.239  0.00218 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.02 on 9 degrees of freedom
## Multiple R-squared:  0.6662, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002179
summary(model_x3y3)
## 
## Call:
## lm(formula = data$x3 ~ data$y3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9869 -1.3733 -0.0266  1.3200  3.2133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0003     2.4362  -0.411  0.69097   
## data$y3       1.3334     0.3145   4.239  0.00218 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.019 on 9 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6292 
## F-statistic: 17.97 on 1 and 9 DF,  p-value: 0.002176
summary(model_x4y4)
## 
## Call:
## lm(formula = data$x4 ~ data$y4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7859 -1.4122 -0.1853  1.4551  3.3329 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -1.0036     2.4349  -0.412  0.68985   
## data$y4       1.3337     0.3143   4.243  0.00216 **
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 2.018 on 9 degrees of freedom
## Multiple R-squared:  0.6667, Adjusted R-squared:  0.6297 
## F-statistic: 18 on 1 and 9 DF,  p-value: 0.002165
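Reading four full summaries side by side is tedious, so the key numbers can be pulled into one small table. The sketch below refits the same four models (x regressed on y) and extracts the intercept, slope, and R-squared from each; `models` and `fit_table` are hypothetical names introduced only for this illustration.

```r
data <- anscombe

# Refit the four models with the same formulas as above (x regressed on y)
models <- list(
  lm(data$x1 ~ data$y1), lm(data$x2 ~ data$y2),
  lm(data$x3 ~ data$y3), lm(data$x4 ~ data$y4)
)

# Pull intercept, slope, and R-squared out of each fit
fit_table <- t(sapply(models, function(m) {
  c(intercept = unname(coef(m)[1]),
    slope     = unname(coef(m)[2]),
    r_squared = summary(m)$r.squared)
}))
rownames(fit_table) <- paste0("x", 1:4, " ~ y", 1:4)

round(fit_table, 4)
```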
As the outputs of question 2 show, the four pairs of data have nearly identical summary statistics: their means, variances, and correlations are all very close. The scatter plots, however, show that the pairs have completely different distributions. The fitted regression models are also nearly identical (intercept ≈ -1.00, slope ≈ 1.33, R-squared ≈ 0.67 in all four cases), yet the points do not follow the same pattern around those lines. Summary statistics alone can therefore be very misleading, and it is dangerous to summarize a data set with them without also visualizing the data. In short, data visualization is essential for understanding the whole story of a data set: we need to combine summary statistics with visualization in order to draw accurate conclusions.