The objective of this problem set is to orient you to a number of
activities in R and to conduct a thoughtful exercise in
appreciating the importance of data visualization. For each question,
enter your code or text response in the code chunk that
completes/answers the activity or question requested. To submit this
homework, you will create the document in RStudio using the knitr
package (button included in RStudio) and then submit the document to
your RPubs account. Once it is uploaded, you
will submit the link to that document on Canvas. Please make sure that
this link is hyperlinked and that I can see the visualization and the
code required to create it. Each question is worth 5 points.
The anscombe data is part of library(datasets) in R. Assign that data
to a new object called data.
data = datasets::anscombe
dplyr package!)
library(dplyr)  # provides %>%, summarise() and across()
data %>%
  summarise(across(x1:y4, mean))
## x1 x2 x3 x4 y1 y2 y3 y4
## 1 9 9 9 9 7.500909 7.500909 7.5 7.500909
data %>% summarise(cor(x1,y1))
## cor(x1, y1)
## 1 0.8164205
data %>% summarise(cor(x2,y2))
## cor(x2, y2)
## 1 0.8162365
data %>% summarise(cor(x3,y3))
## cor(x3, y3)
## 1 0.8162867
data %>% summarise(cor(x4,y4))
## cor(x4, y4)
## 1 0.8165214
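The four separate calls above can also be collapsed into a single pass. What follows is a minimal sketch in base R, assuming only the built-in anscombe data; the names anscombe_df and pair_cor are introduced here for illustration and are not part of the assignment.

```r
# Illustrative only: compute cor(x_i, y_i) for the four Anscombe pairs at once.
anscombe_df <- datasets::anscombe

pair_cor <- sapply(1:4, function(i) {
  x <- anscombe_df[[paste0("x", i)]]  # column x1, x2, x3 or x4
  y <- anscombe_df[[paste0("y", i)]]  # matching column y1, y2, y3 or y4
  cor(x, y)
})
names(pair_cor) <- paste0("pair", 1:4)

print(round(pair_cor, 3))
```

Looping over the index keeps the pairing of x and y columns explicit and avoids repeating the same call four times.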
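library(ggplot2)    # ggplot(), aes(), geom_point()
library(gridExtra)  # grid.arrange()
ggplot1 =
  ggplot(data, aes(x1, y1)) +
  geom_point()
ggplot2 =
  ggplot(data, aes(x2, y2)) +
  geom_point()
ggplot3 =
  ggplot(data, aes(x3, y3)) +
  geom_point()
ggplot4 =
  ggplot(data, aes(x4, y4)) +
  geom_point()
# Arrange the four scatter plots in a 2 x 2 grid
grid.arrange(ggplot1, ggplot2, ggplot3, ggplot4, ncol = 2)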
ggplot1 =
  ggplot(data, aes(x1, y1)) +
  geom_point(color = "blue")
ggplot2 =
  ggplot(data, aes(x2, y2)) +
  geom_point(color = "blue")
ggplot3 =
  ggplot(data, aes(x3, y3)) +
  geom_point(color = "blue")
ggplot4 =
  ggplot(data, aes(x4, y4)) +
  geom_point(color = "blue")
grid.arrange(ggplot1, ggplot2, ggplot3, ggplot4, ncol = 2)
lm() function.
ggplot1 =
  ggplot(data, aes(x1, y1)) +
  geom_point(color = "blue") +
  geom_smooth(method = 'lm', se = FALSE, color = "black")
ggplot2 =
  ggplot(data, aes(x2, y2)) +
  geom_point(color = "blue") +
  geom_smooth(method = 'lm', se = FALSE, color = "black")
ggplot3 =
  ggplot(data, aes(x3, y3)) +
  geom_point(color = "blue") +
  geom_smooth(method = 'lm', se = FALSE, color = "black")
ggplot4 =
  ggplot(data, aes(x4, y4)) +
  geom_point(color = "blue") +
  geom_smooth(method = 'lm', se = FALSE, color = "black")
grid.arrange(ggplot1, ggplot2, ggplot3, ggplot4, ncol = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
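The plots above draw the fitted lines through geom_smooth(method = 'lm'). To look at the fitted models themselves, lm() can also be called directly. This is a minimal sketch, assuming only the built-in anscombe data; the names anscombe_df, fits and coef_table are illustrative and not part of the assignment.

```r
# Illustrative only: fit y_i ~ x_i for each Anscombe pair with lm().
anscombe_df <- datasets::anscombe

fits <- lapply(1:4, function(i) {
  # Build the formula y1 ~ x1, y2 ~ x2, ... from the column names.
  model_formula <- as.formula(paste0("y", i, " ~ x", i))
  lm(model_formula, data = anscombe_df)
})

# One row per model, holding its intercept and slope.
coef_table <- t(sapply(fits, coef))
colnames(coef_table) <- c("intercept", "slope")
rownames(coef_table) <- paste0("model", 1:4)

print(round(coef_table, 2))
```

All four intercepts come out close to 3 and all four slopes close to 0.5, which is why the black regression lines look alike even though the point patterns do not.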
#previous task does this
#For the first model, we can clearly see from the graph that the data is mostly linear. The second model shows a clear inverted parabola, indicating a quadratic rather than linear relationship between x2 and y2. Model 3 also looks linear, but there is a clear outlier that skews our linear regression line. Graph 4 is not linear at all: all x values are 8 except for one.
#This was a very interesting exercise. If we look only at the means and correlations of the different sets of x and y values, we could easily draw the incorrect conclusion that the datasets show similar patterns. However, after plotting the data we can clearly see this is not the case. This teaches us that simply looking at some basic statistics about the data does not tell us the entire story, which reinforces the importance of data visualization in understanding the big picture of a data set.
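The same point holds for variance, another basic statistic that is nearly identical across the four sets. A short sketch in base R, assuming only the built-in anscombe data; the names anscombe_df and col_var are illustrative.

```r
# Illustrative only: per-column variance of the Anscombe data.
anscombe_df <- datasets::anscombe

# sapply() applies var() to each of the eight columns and returns a named vector.
col_var <- sapply(anscombe_df, var)

print(round(col_var, 2))
```

The x columns share one variance and the y columns agree to about two decimal places, so this statistic does not separate the four datasets any better than the mean or the correlation does.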