The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the requested activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, upload it to your RPubs account, and submit the link to the “Problem Set 2” assignment on Moodle. Points will be deducted for uploading in an improper format.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
library(datasets)
data("anscombe")
data <- anscombe # Assign anscombe data to a new object called "data"
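Summarize the data by calculating the mean and variance of each column and the correlation between each pair (e.g., x1 and y1, x2 and y2), using the fBasics package.
library("fBasics")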
## Warning: package 'fBasics' was built under R version 3.5.3
## Loading required package: timeDate
## Loading required package: timeSeries
## Warning: package 'timeSeries' was built under R version 3.5.3
colMeans(data) # compute mean for each column
## x1 x2 x3 x4 y1 y2 y3 y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
colVars(data) # compute variance for each column
## x1 x2 x3 x4 y1 y2 y3
## 11.000000 11.000000 11.000000 11.000000 4.127269 4.127629 4.122620
## y4
## 4.123249
# Run correlationTest() on each pair and store the result in p
# Use a for loop to print the estimated correlation for each pair (e.g., x1 and y1, x2 and y2)
for (i in 1:4) {
  p <- correlationTest(data[[paste0("x", i)]], data[[paste0("y", i)]], "pearson")
  print(paste0("The correlation between x", i, " and y", i, " is ", p@test[["estimate"]][["Correlation"]]))
}
## [1] "The correlation between x1 and y1 is 0.81642051634484"
## [1] "The correlation between x2 and y2 is 0.816236506000243"
## [1] "The correlation between x3 and y3 is 0.816286739489598"
## [1] "The correlation between x4 and y4 is 0.816521436888503"
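The summary statistics above can be cross-checked without fBasics. The following is a minimal sketch using only base R; the object names `means`, `vars`, and `cors` are illustrative and do not appear in the original solution.

```r
library(datasets)  # ships with base R and provides the anscombe data frame

# Column means and variances using base R only
means <- colMeans(anscombe)
vars  <- sapply(anscombe, var)

# Pearson correlation for each (x_i, y_i) pair
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(cors) <- paste0("x", 1:4, "~y", 1:4)

print(round(means, 3))
print(round(vars, 3))
print(round(cors, 3))
```

All four correlations should agree with the correlationTest() estimates printed above, since `cor()` defaults to the Pearson method.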
library(car)
## Warning: package 'car' was built under R version 3.5.3
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:fBasics':
##
## densityPlot
# Use a for loop to draw a scatter plot for each x, y pair
for (i in 1:4) {
  scatterplot(data[[paste0("y", i)]] ~ data[[paste0("x", i)]], data = data,
              xlab = paste0("x", i), ylab = paste0("y", i),
              main = paste0("Scatter Plots for x", i, " and y", i))
}
## Warning in smoother(.x, .y, col = col[1], log.x = logged("x"), log.y =
## logged("y"), : could not fit negative part of the spread
# With the par() function, the option mfrow = c(nrows, ncols) creates a matrix of nrows x ncols plots
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("Scatter Plots for x", i, " and y", i), pch = 19)
}
Fit a linear model to each data set using the lm() function.
# Use a for loop to assign the linear model of each pair to M1, M2, M3, M4, respectively
for (i in 1:4) {
  assign(paste0("M", i), lm(data[[paste0("y", i)]] ~ data[[paste0("x", i)]]))
}
# With the par() function, the option mfrow = c(nrows, ncols) creates a matrix of nrows x ncols plots
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(data[[paste0("x", i)]], data[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       main = paste0("Scatter Plots for x", i, " and y", i), pch = 19)
  abline(lm(data[[paste0("y", i)]] ~ data[[paste0("x", i)]]), col = "red") # regression line (y ~ x)
}
# Use anova() to compare the fits of model objects M1, M2, M3, M4
anova(M1, M2, M3, M4)
## Analysis of Variance Table
##
## Model 1: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
## Model 2: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
## Model 3: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
## Model 4: data[[paste0("y", i)]] ~ data[[paste0("x", i)]]
##   Res.Df    RSS Df Sum of Sq F Pr(>F)
## 1      9 13.763
## 2      9 13.776  0 -0.013601
## 3      9 13.756  0  0.020099
## 4      9 13.742  0  0.013702
The ANOVA above compares the fits of model objects M1, M2, M3, and M4 and shows no meaningful difference between them: each has 9 residual degrees of freedom and a residual sum of squares between 13.74 and 13.78. The scatter plots of the four pairs, however, are quite different, particularly for the fourth pair (x4 and y4). Data visualization therefore gives us a more accurate picture of what the data and the relationships between variables really are.
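To put numbers on how similar the four fits are, the sketch below tabulates the intercept, slope, R-squared, and residual sum of squares of each model. It is an illustration rather than part of the original solution: the names `fits` and `fit_summary` are assumptions, and the models are refit here with `reformulate()` so the block runs on its own.

```r
library(datasets)  # provides the anscombe data frame

# Fit y_i ~ x_i for each pair; reformulate() builds the formula from strings
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect the key fit statistics side by side
fit_summary <- data.frame(
  pair      = paste0("x", 1:4, ", y", 1:4),
  intercept = sapply(fits, function(m) unname(coef(m)[1])),
  slope     = sapply(fits, function(m) unname(coef(m)[2])),
  r_squared = sapply(fits, function(m) summary(m)$r.squared),
  rss       = sapply(fits, deviance)  # deviance() of an lm is its residual sum of squares
)
print(fit_summary, digits = 4)
```

The `rss` column should reproduce the RSS values in the ANOVA table, and the near-identical slopes and R-squared values show why numeric summaries alone cannot tell the four data sets apart.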