The objective of this problem set is to orient you to a number of activities in R and to conduct a thoughtful exercise in appreciating the importance of data visualization. For each question, create a code chunk or text response that completes the activity or answers the question. Finally, upon completion, name your final output .html file YourName_ANLY512-Section-Year-Semester.html, publish it to your RPubs account as the “Problem Set 2” assignment, and submit the link to Moodle. Points will be deducted for uploading in an improper format.
Load the anscombe data that is part of library(datasets) in R, and assign that data to a new object called data.
#view data set anscombe
View(anscombe)
#assign anscombe to new object data
data<-anscombe
View(data)
fBasics() package!)
#summarize data
summary(data)
##        x1             x2             x3             x4
##  Min.   : 4.0   Min.   : 4.0   Min.   : 4.0   Min.   : 8
##  1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 6.5   1st Qu.: 8
##  Median : 9.0   Median : 9.0   Median : 9.0   Median : 8
##  Mean   : 9.0   Mean   : 9.0   Mean   : 9.0   Mean   : 9
##  3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.:11.5   3rd Qu.: 8
##  Max.   :14.0   Max.   :14.0   Max.   :14.0   Max.   :19
##        y1               y2              y3              y4
##  Min.   : 4.260   Min.   :3.100   Min.   : 5.39   Min.   : 5.250
##  1st Qu.: 6.315   1st Qu.:6.695   1st Qu.: 6.25   1st Qu.: 6.170
##  Median : 7.580   Median :8.140   Median : 7.11   Median : 7.040
##  Mean   : 7.501   Mean   :7.501   Mean   : 7.50   Mean   : 7.501
##  3rd Qu.: 8.570   3rd Qu.:8.950   3rd Qu.: 7.98   3rd Qu.: 8.190
##  Max.   :10.840   Max.   :9.260   Max.   :12.74   Max.   :12.500
chooseCRANmirror(graphics=FALSE, ind=1)
knitr::opts_chunk$set(echo = TRUE)
#Install the R package fBasics and load it to use its functions
install.packages("fBasics")
## Installing package into 'C:/Users/sowmyapk/Documents/R/win-library/3.5'
## (as 'lib' is unspecified)
## package 'fBasics' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'fBasics'
##
## The downloaded binary packages are in
## C:\Users\sowmyapk\AppData\Local\Temp\Rtmp4AHq5U\downloaded_packages
require(fBasics)
## Loading required package: fBasics
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'fBasics'
#calculate the means of all columns in data
colMeans(data)
##       x1       x2       x3       x4       y1       y2       y3       y4
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
#calculate the variance of all columns in data
#colVars() needs fBasics, which failed to load above, so use base R var() instead
sapply(data, var)
#Correlation test
cor(data$x1,data$y1)
## [1] 0.8164205
cor(data$x2,data$y2)
## [1] 0.8162365
cor(data$x3,data$y3)
## [1] 0.8162867
cor(data$x4,data$y4)
## [1] 0.8165214
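The four cor() calls above can also be written as a single pass over the pair index. This is a minimal sketch, assuming the columns are named x1..x4 and y1..y4 as in the built-in anscombe data; pair_cor is a name introduced here for illustration only.

```r
# pair_cor is a hypothetical name, not part of the original analysis.
data <- datasets::anscombe

# Correlation of each x_i with its matching y_i, for i = 1..4
pair_cor <- sapply(1:4, function(i) {
  cor(data[[paste0("x", i)]], data[[paste0("y", i)]])
})
names(pair_cor) <- paste0("x", 1:4, "~y", 1:4)

round(pair_cor, 4)
```

Keeping the results in one named vector makes it easier to see at a glance that all four pairs share nearly the same correlation.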
#Scatter plot for each pair of x,y
plot(x = data$x1,y = data$y1,
xlab = "x1",
ylab = "y1",
main = "x1 vs y1"
)
plot(x = data$x2,y = data$y2,
xlab = "x2",
ylab = "y2",
main = "x2 vs y2"
)
plot(x = data$x3,y = data$y3,
xlab = "x3",
ylab = "y3",
main = "x3 vs y3"
)
plot(x = data$x4,y = data$y4,
xlab = "x4",
ylab = "y4",
main = "x4 vs y4"
)
par(mfrow = c(2,2))
plot(data$x1, data$y1, pch=19, col=c("green", "red"), xlab="x1",ylab="y1", main ="x1 vs y1")
plot(data$x2, data$y2, pch=19, col=c("blue", "brown"), xlab="x2",ylab="y2", main ="x2 vs y2")
plot(data$x3, data$y3, pch=19, col=c("yellow", "pink"), xlab="x3",ylab="y3", main ="x3 vs y3")
plot(data$x4, data$y4, pch=19, col=c("orange", "purple"), xlab="x4",ylab="y4", main ="x4 vs y4")
lm() function.
#Linear regression of y using x
lm1 <- lm(data$y1 ~ data$x1, data = anscombe)
lm2 <- lm(data$y2 ~ data$x2, data = anscombe)
lm3 <- lm(data$y3 ~ data$x3, data = anscombe)
lm4 <- lm(data$y4 ~ data$x4, data = anscombe)
par(mfrow = c(2,2))
plot(data$x1, data$y1, pch=19, col=c("green", "red"), xlab="x1",ylab="y1", main ="x1 vs y1")
abline(lm1, col='black')
plot(data$x2, data$y2, pch=19, col=c("blue", "brown"), xlab="x2",ylab="y2", main ="x2 vs y2")
abline(lm2, col='black')
plot(data$x3, data$y3, pch=19, col=c("yellow", "pink"), xlab="x3",ylab="y3", main ="x3 vs y3")
abline(lm3, col='black')
plot(data$x4, data$y4, pch=19, col=c("orange", "purple"), xlab="x4",ylab="y4", main ="x4 vs y4")
abline(lm4, col='black')
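One way to see why the four fitted lines look alike is to compare their coefficients directly. The sketch below refits the four models with the formula interface (y1 ~ x1 with a data argument, rather than data$ inside the formula) and collects the intercept and slope into one matrix. The names fits and coefs are illustrative and are not used elsewhere in this document.

```r
# fits and coefs are hypothetical names introduced for this sketch.
data <- datasets::anscombe

# Fit y_i ~ x_i for i = 1..4, building each formula from the column names
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = data)
})

# One row per data set: intercept and slope of the fitted line
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
rownames(coefs) <- paste0("set", 1:4)

round(coefs, 3)
```

All four rows should come out close to an intercept of 3 and a slope of 0.5, which is why the regression lines drawn with abline() are almost indistinguishable across the panels.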
anova(lm1, test="Chisq")
Analysis of Variance Table

Response: data$y1
          Df Sum Sq Mean Sq F value  Pr(>F)
data$x1    1 27.510 27.5100   17.99 0.00217 **
Residuals  9 13.763  1.5292
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm2, test="Chisq")
Analysis of Variance Table

Response: data$y2
          Df Sum Sq Mean Sq F value   Pr(>F)
data$x2    1 27.500 27.5000  17.966 0.002179 **
Residuals  9 13.776  1.5307
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm3, test="Chisq")
Analysis of Variance Table

Response: data$y3
          Df Sum Sq Mean Sq F value   Pr(>F)
data$x3    1 27.470 27.4700  17.972 0.002176 **
Residuals  9 13.756  1.5285
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm4, test="Chisq")
Analysis of Variance Table

Response: data$y4
          Df Sum Sq Mean Sq F value   Pr(>F)
data$x4    1 27.490 27.4900  18.003 0.002165 **
Residuals  9 13.742  1.5269
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
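The four ANOVA tables are easier to compare when the F statistic and p-value for the x term are pulled into one small table. This is a sketch only: anova_rows is a hypothetical name, and the models are refit here with the y1 ~ x1 formula form so the block runs on its own.

```r
# anova_rows is a hypothetical name introduced for this sketch.
data <- datasets::anscombe

anova_rows <- t(sapply(1:4, function(i) {
  fit <- lm(as.formula(paste0("y", i, " ~ x", i)), data = data)
  tab <- anova(fit)
  # First row of the ANOVA table is the x term; second row is Residuals
  c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1])
}))
rownames(anova_rows) <- paste0("set", 1:4)

signif(anova_rows, 4)
```

Each row should show an F value near 18 and a p-value near 0.002, matching the individual tables above.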
From the anscombe data, we can see that summary statistics alone suggest the four x,y pairs are nearly the same: their means and variances are almost identical, their medians are broadly similar, and the correlation between x and y is about 0.816 in every pair. The fitted regression lines and ANOVA tables are also nearly indistinguishable. But when we plot the data sets, the differences are obvious. In data set 1, the relationship between x and y is roughly linear with ordinary scatter. In data set 2, the relationship is clearly not linear; the points follow a smooth curve that looks like an inverted parabola. Data set 3 has a near-perfect linear relationship, but a single outlier pulls its correlation down. Data set 4 has one outlier that produces a high correlation even though x4 is otherwise constant and x4 and y4 are not linearly related. The exercise shows the importance of graphical visualization for drawing insights from data sets whose summary statistics are nearly identical.
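The observations about data sets 3 and 4 can be checked numerically. This is a small sketch, assuming the outlier in data set 3 is the point with the largest absolute residual from the fitted line; that heuristic is chosen here for illustration and is not a method from the assignment. The names fit3, out3, and cor3_trimmed are likewise illustrative.

```r
# fit3, out3 and cor3_trimmed are hypothetical names for this sketch.
data <- datasets::anscombe

# Data set 3: flag the point farthest from the fitted line
fit3 <- lm(y3 ~ x3, data = data)
out3 <- which.max(abs(residuals(fit3)))

# Correlation with all points, then with the flagged point removed
cor(data$x3, data$y3)
cor3_trimmed <- cor(data$x3[-out3], data$y3[-out3])
cor3_trimmed

# Data set 4: count how often each x4 value occurs
table(data$x4)
```

With the flagged point removed, the correlation in data set 3 rises to nearly 1, and the table for x4 shows that only one observation differs from the rest, so that single point determines the fitted slope.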