Exercises: 1-5 (Pgs. 6-7); 1-2, 5 (Pg. 12); 1-5 (Pgs. 20-21); Open Response
Submission: Submit via an electronic document on Sakai. Must be submitted as an HTML file generated in RStudio. All assigned problems are chosen according to the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display R code. If the question requires a figure, make sure you display a figure. Many of the questions can be answered in a written response, but require R code and/or figures for understanding and explaining.
ggplot(data=mpg)
The plot is blank. ggplot(data = mpg) only creates an empty canvas, because no geom layer or aesthetic mappings have been added yet, so there is nothing to draw.
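A minimal sketch of why the plot above is blank, assuming ggplot2 is loaded. The first call supplies only the data; adding a geom layer gives ggplot2 something to draw. The choice of geom_point() with displ and hwy is only for illustration, and the object names p_empty and p_points are my own.

```r
library(ggplot2)

# Only data supplied: the plot has no layers, so nothing is drawn
p_empty <- ggplot(data = mpg)

# Adding a geom with aesthetic mappings gives ggplot2 something to draw
p_points <- p_empty +
  geom_point(mapping = aes(x = displ, y = hwy))

length(p_empty$layers)   # number of layers in the blank plot
length(p_points$layers)  # number of layers after adding geom_point()
```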
dim(mpg)
## [1] 234 11
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
There are 234 rows and 11 columns in the dataset mpg.
?mpg
unique(mpg$drv)
## [1] "f" "4" "r"
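The variable drv is a categorical variable (stored as character) that describes the type of drive train. It takes the following values: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.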
ggplot(data=mpg,aes(x=hwy,y=cyl)) +
geom_point() +
xlab("Highway Miles Per Gallon") +
ylab("Number of Cylinders")
ggplot(data=mpg,aes(x=class,y=drv)) +
geom_point() +
xlab("Type of Car") +
ylab("Type of Drive")
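Scatter plots are not meant to visualize the relationship between two categorical/qualitative variables. Every observation with the same class and drv lands on exactly the same point, so the plot only shows which combinations occur, not how many cars fall in each one.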
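One possible alternative, sketched here under the assumption that ggplot2 is loaded, is geom_count(), which sizes each point by the number of observations at that position. A plain table of counts gives the same information. The names p_count and class_by_drv are my own.

```r
library(ggplot2)

# geom_count() sizes each point by the number of observations at that position
p_count <- ggplot(data = mpg, aes(x = class, y = drv)) +
  geom_count() +
  xlab("Type of Car") +
  ylab("Type of Drive")

# The same information as a table of counts per class/drv combination
class_by_drv <- table(mpg$class, mpg$drv)
class_by_drv
```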
ggplot(data = mpg) +
geom_point(
mapping = aes(x = displ, y = hwy, color = "blue")
)
The points are not blue because color = "blue" is inside aes(). Inside aes(), ggplot2 treats "blue" as a categorical variable with a single value, so it assigns the default color for that one category and shows a legend labeled "blue". To make the points actually blue, the color argument needs to be outside aes(), because it is a fixed setting and not mapped to data.
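A sketch of the corrected call, assuming ggplot2 is loaded: the color argument moves out of aes() and becomes an argument of geom_point(). The object name p_blue is my own.

```r
library(ggplot2)

# color is set outside aes(), so it is a fixed property of the layer
p_blue <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    color = "blue"
  )
p_blue
```

With this version there is no legend, since no variable is mapped to color.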
str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
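There are 6 categorical variables in the dataset mpg (manufacturer, model, trans, drv, fl, and class), which are stored as character (chr) columns. There are 5 numeric variables (displ, year, cyl, cty, and hwy), which I treat as continuous, although only displ takes non-integer values. This information can be seen in the str() and summary() output above.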
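The counts above can also be checked in code rather than read off the output. This is a small sketch, assuming ggplot2 is loaded so that mpg is available; is_categorical and is_numeric are names I made up.

```r
library(ggplot2)

# TRUE for each character (categorical) column
is_categorical <- sapply(mpg, is.character)
# TRUE for each numeric column (integer or double)
is_numeric <- sapply(mpg, is.numeric)

names(mpg)[is_categorical]  # names of the categorical variables
names(mpg)[is_numeric]      # names of the numeric variables
sum(is_categorical)         # how many categorical variables
sum(is_numeric)             # how many numeric variables
```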
ggplot(mtcars, aes(wt, mpg)) +
geom_point(shape = 2, colour = "black", fill = "white", size = 2, stroke = 5)
ggplot(mtcars, aes(wt, mpg)) +
geom_point(shape = 2, colour = "black", fill = "white", size = 2, stroke = 1)
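The stroke aesthetic modifies the width of a shape's border. In the code and output above, the first plot sets stroke to 5 and the second sets it to 1, so the triangle outlines in the second plot are much thinner.
It works for shapes that are drawn with an outline, such as the hollow triangle used here (shape = 2). The fill argument has no visible effect in these plots, because fill only applies to shapes 21-25.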
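To see stroke and fill working together, here is a sketch that repeats the two plots with shape = 21 (a circle with a separate border and interior) instead of shape = 2. This shape choice is my own substitution, not part of the exercise, and p_thick and p_thin are illustrative names.

```r
library(ggplot2)

# shape 21 has a border (colour, stroke) and an interior (fill)
p_thick <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 2, stroke = 5)

# Same plot with a thinner border
p_thin <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 2, stroke = 1)

p_thick
p_thin
```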
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_line()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_boxplot()
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
ggplot(mpg, aes(x = displ)) +
geom_histogram(binwidth = 1)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_area()
For a line chart: geom_line()
For a box plot: geom_boxplot()
For a histogram: geom_histogram()
For an area chart: geom_area()
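The warning from geom_boxplot() above appears because displ is continuous, so ggplot2 draws a single box for all cars. Two possible ways around it are sketched below, assuming ggplot2 is loaded: bin displ with cut_width() as the group, or put a categorical variable such as class on the x axis. The bin width of 1 and the names p_binned and p_by_class are my own choices.

```r
library(ggplot2)

# Option 1: bin the continuous x variable so each bin gets its own box
p_binned <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_boxplot(aes(group = cut_width(displ, 1)))

# Option 2: use a categorical variable on the x axis instead
p_by_class <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

p_binned
p_by_class
```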
ggplot(
data = mpg,
mapping = aes(x = displ, y = hwy, color = drv)
) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The graph matches my prediction: a scatterplot of hwy against displ with the points colored by drv, plus one smooth line per drv value in the matching color. Because se = FALSE, no shaded confidence band is drawn around the smooth lines, so the plot looks clean.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv),
show.legend = FALSE
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, color = drv)
)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
show.legend = FALSE hides the legend (i.e., the key) for that specific geom. The drawback is that the viewer can no longer tell which color corresponds to which value of drv. If you remove the argument, the legend appears again, as shown in the second graph. It was probably used earlier in the chapter because that plot was displayed next to other plots that had no legend, so hiding it kept the plots consistent.
ggplot(
data = mpg,
mapping = aes(x = displ, y = hwy, color = drv)
) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(
data = mpg,
mapping = aes(x = displ, y = hwy, color = drv)
) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The se argument controls whether the gray shaded area around the smooth line is drawn (this shaded area is the confidence band for the fitted line). With se = FALSE, no gray band appears, which can make the graph cleaner and easier to read.
I am not sure whether these two graphs will look different, so I run both to check.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
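They do not look different. Both plots use the same data and mappings for every layer: the first declares them once in ggplot(), where both geoms inherit them, while the second repeats them inside each geom.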
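Beyond comparing the pictures by eye, the two plots can be compared layer by layer. This is a sketch, assuming ggplot2 is loaded; it uses layer_data() to pull out the computed data for each layer, and both comparisons should come back TRUE if the plots are equivalent. The names p_global, p_local, same_points, and same_smooth are my own.

```r
library(ggplot2)

# Data and mappings declared once, inherited by both geoms
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Data and mappings repeated inside each geom
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# layer_data() returns the computed data ggplot2 uses to draw layer i
same_points <- all.equal(layer_data(p_global, 1), layer_data(p_local, 1))
same_smooth <- all.equal(layer_data(p_global, 2), layer_data(p_local, 2))

same_points
same_smooth
```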
For this exercise, use the diamonds dataset in the tidyverse. Use ?diamonds to get more information about the dataset.
geom_boxplot() and facet_wrap to illustrate the empirical distributions of the sample.
ggplot(diamonds, aes(x = cut, y = price, fill = color)) +
geom_boxplot(outlier.size = 0.5, alpha = 0.8) +
facet_wrap(~ clarity) +
labs(
title = "Boxplot of Diamond Prices by Cut Faceted by Clarity",
x = "Cut",
y = "Price"
)
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
geom_point() +
facet_grid(color ~ clarity) +
labs(
title = "Relationship Between Carat, Price, Cut, Color, and Clarity",
x = "Carat",
y = "Price",
color = "Cut"
)