Assignment 1 - Carlos Rivera (car808)

Step 2: Workbook exercises

Problem Section 3.2.4 First steps

Exercise 1.

Question:

Run ggplot(data = mpg). What do you see?

Answer:

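The code produces an empty plot: ggplot(data = mpg) creates the canvas in the plot panel, but no layers (geoms) have been added yet, so nothing is drawn.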

Exercise 2.

Question:

How many rows are in mpg? How many columns?

Answer:

The mpg dataset from ggplot2 has 234 rows and 11 columns.
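
As a quick check, the dimensions can be read directly from the data. This is a minimal sketch that assumes only that ggplot2 is installed; the name mpg_dims is introduced here for illustration.

```r
library(ggplot2)

# dim() returns c(number of rows, number of columns)
mpg_dims <- dim(mpg)
mpg_dims      # 234 11

# The same information, one dimension at a time
nrow(mpg)
ncol(mpg)
```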

Exercise 3.

Question:

What does the drv variable describe? Read the help for ?mpg to find out.

Answer:

drv is a categorical variable describing the type of drive train. It has three categories: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd).

Exercise 4.

Question:

Make a scatterplot of hwy vs cyl.

Answer:

ggplot(mpg)+
  aes(cyl, hwy)+
  geom_point()

Exercise 5.

Question:

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

Answer:

Both variables are categorical, so points can only appear at the intersections of the categories. The plot is not useful because all observations with the same combination of class and drv are drawn on top of each other, which hides how many observations each point represents.

ggplot(mpg)+
  aes(class, drv)+
  geom_point()
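
One possible way to recover the hidden information is to show how many observations sit at each intersection. The sketch below uses geom_count() for that; the object name p_count is only illustrative.

```r
library(ggplot2)

# geom_count() sizes each point by the number of observations at that position
p_count <- ggplot(mpg) +
  aes(class, drv) +
  geom_count()

p_count
```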

Problem Section 3.3.1 Aesthetic mappings

Exercise 1.

Question:

What’s gone wrong with this code? Why are the points not blue?

Answer:

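Because color = "blue" is placed inside aes(), ggplot2 treats "blue" as the value of a new categorical variable rather than as a color. Every observation is assigned to that single category, and the points are drawn in the first color of the default discrete scale instead of blue.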

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
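
A corrected version, as a minimal sketch, sets the color outside aes() so that it is a fixed property of the layer rather than a mapped variable. The name p_blue is an illustrative choice.

```r
library(ggplot2)

# color is set as a constant for the layer, outside aes()
p_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

p_blue
```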

Exercise 2.

Question:

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

Answer:

The str() function shows the structure of the object. This data frame has three types of variables: character (chr), which are qualitative variables (manufacturer, model, trans, drv, fl, class); numeric (num), which are quantitative variables with decimals (displ); and integer (int), which are quantitative variables with whole-number values (year, cyl, cty, hwy). Printing mpg shows the same information, because a tibble lists each column's type (<chr>, <dbl>, <int>) under its name.

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Exercise 3.

Question:

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

Answer:

Mapping a continuous variable to shape produces an error (a continuous variable cannot be mapped to the shape aesthetic), because a distinct shape is assigned to each category of a variable and shapes have no natural ordering.

For color and size, the difference between continuous and categorical variables lies in the scale used. With a categorical variable, each category gets its own color; with a continuous variable, color follows a gradient. Similarly, size varies continuously with a quantitative variable, whereas for a qualitative variable one size is assigned per category, in alphabetical order.

ggplot(mpg)+
  aes(displ, hwy, color = year, size = cty)+
  geom_point()

Exercise 4.

Question:

What happens if you map the same variable to multiple aesthetics?

Answer:

Mapping the same variable to multiple aesthetics increases the visual complexity of the graph, but because the mappings are redundant, it does not convey any additional information.

ggplot(mpg)+
  aes(displ, hwy, color = hwy, size = hwy)+
  geom_point()

Exercise 5.

Question:

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

Answer:

When geom_point() is used with a shape that has both a border and a fill (e.g., shape = 21; shapes 21 to 25 are of this kind), the stroke aesthetic controls the thickness of the shape's border.

ggplot(mpg)+
  aes(displ, hwy, stroke=cyl/2)+
  geom_point(fill='red', shape = 21)

Exercise 6.

Question:

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

Answer:

When an aesthetic is mapped to an expression rather than a variable name, ggplot2 evaluates the expression and maps its result, so the aesthetic represents $f(x)$ instead of $x$. In this case, displ < 5 produces a logical variable, and the points are colored according to whether it is TRUE or FALSE.

ggplot(mpg)+
  aes(displ, hwy, color = displ<5)+
  geom_point()

Problem Section 3.5.1 Facets

Exercise 1.

Question:

What happens if you facet on a continuous variable?

Answer:

ggplot2 treats each unique value of the variable (here, unique(mpg$displ)) as a separate category and draws one panel per value, in effect converting it into a categorical variable.

ggplot(mpg)+
  aes(displ, hwy)+
  geom_point()+
  facet_wrap(~displ)
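
If faceting on a continuous variable is really wanted, one option is to bin it first. The sketch below uses cut_width() from ggplot2 with a bin width of 1, which is an arbitrary choice made here for illustration, as is the name p_binned.

```r
library(ggplot2)

# cut_width() converts displ into intervals of width 1,
# so there is one panel per interval instead of one per unique value
p_binned <- ggplot(mpg) +
  aes(displ, hwy) +
  geom_point() +
  facet_wrap(vars(cut_width(displ, 1)))

p_binned
```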

Exercise 2.

Question:

What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl))

Answer:

Each panel shows only the observations that belong to one combination of categories, so a panel is empty when no observations have that combination. As the point plot with aes(x = drv, y = cyl) shows, there are no observations for the (cyl, drv) combinations (4, r), (5, 4), and (5, r).

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)
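
The empty combinations can also be confirmed numerically. This is a small sketch using base R's table(); the name drv_by_cyl is introduced here for illustration.

```r
library(ggplot2)

# Cross-tabulate drive train against number of cylinders;
# cells equal to 0 correspond to the empty facets
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_by_cyl
```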

Exercise 3.

Question:

What plots does the following code make? What does . do?

Answer:

facet_grid() expects a formula of the form rows ~ columns. A . (dot) in place of a variable acts as a placeholder meaning that no faceting is done on that dimension. Thus drv ~ . splits the plot into rows by drv with a single column, and . ~ cyl splits it into columns by cyl with a single row.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Exercise 4.

Question:

Take the first faceted plot in this section:

```{r}
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)
```

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Answer:

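When there are many categories, using color to distinguish them can make it difficult for the eye to identify and interpret patterns. Faceting splits the plot into one panel per category, which makes the relationship within each category easier to see. The disadvantage is that comparisons between categories become harder, because the points are no longer on the same panel. With a larger dataset, overlapping points make color harder to read, so faceting becomes more attractive.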
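
For comparison, the colour-based alternative to the faceted plot might look like the sketch below, which puts all classes on a single panel. The name p_colour is illustrative.

```r
library(ggplot2)

# Same data on one panel, with class mapped to colour instead of facets
p_colour <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class))

p_colour
```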

Exercise 5.

Question:

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

Answer:

nrow sets the number of rows in the panel layout, and the panels are distributed across as many columns as needed. Likewise, ncol sets the number of columns, and the panels are distributed across as many rows as needed. Other arguments that control the layout of the panels include dir and as.table.

facet_grid(), on the other hand, does not have ncol or nrow arguments because the number of categories of each variable in the formula determines the number of rows and columns.

Exercise 6.

Question:

When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Answer:

This is usually done because plots tend to be displayed in landscape format, wider than they are tall, so there is more room for panels across columns. It depends on the situation, though, as sometimes it is preferable to have more rows than columns.

Problems Section 3.6.1 Geometric objects

Exercise 1.

Question:

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

Answer:

line chart: geom_line()
boxplot: geom_boxplot()
histogram: geom_histogram()
area chart: geom_area()

Exercise 2.

Question:

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

Answer:

This graph is a scatter plot with displ on the x-axis and hwy on the y-axis. It assigns a different color to each drv category and adds a smooth trend line for each drv category (color). Each line is fitted with the loess method using the formula $y\sim x$.

ggplot(mpg) +
  aes(x = displ, y = hwy, color = drv)+
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Exercise 3.

Question:

What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

Answer:

In ggplot2 the legend is built from mapped aesthetics such as color, shape, and size. Setting show.legend = FALSE removes the legend for that layer while keeping the aesthetic mapping itself; if the argument is removed, the legend is displayed again.

Exercise 4.

Question:

What does the se argument to geom_smooth() do?

Answer:

The se argument controls whether the confidence band around the trend line is displayed. It is TRUE by default, and the band uses a 95% confidence level unless the level argument is changed.
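
As a brief sketch of how se and level interact, the two plots below draw the band at a 99% level and suppress it entirely. The method and formula are spelled out only to silence the informational message; the object names are illustrative.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Confidence band drawn at the 99% level instead of the default 95%
p_band <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, level = 0.99)

# No confidence band, only the smooth line
p_line <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_band
p_line
```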

Exercise 5.

Question:

Will these two graphs look different? Why/why not?

Answer:

They won't look different; they are the same plot. The only difference is that the first declares the dataset and the aesthetics globally in the ggplot() function, while the second declares them individually for each geom.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +   
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Exercise 6.

Question:

Recreate the R code necessary to generate the following graphs.

Answer:

p1 <- ggplot(mpg) + aes(displ, hwy)

p1 + geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

p1 + geom_point() +
  geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'



ggplot(mpg) + 
  aes(displ, hwy, color = drv) +
  geom_smooth(se = FALSE) +
  geom_point() 
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


p1 + 
  geom_point(aes(color = drv)) + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


p1 + 
  geom_point(aes(color = drv)) + 
  geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'


p1 + 
  geom_point(size = 5, color='white')+
  geom_point(aes(color = drv))

Problems Section 3.7.1 Statistical transformations

Exercise 1.

Question:

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

Answer:

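By default, stat_summary() uses geom_pointrange(). The code below computes the median, minimum, and maximum of depth for each cut and passes them to geom_pointrange().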

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

diamonds %>% 
  group_by(cut) %>% 
  summarise(
    y_med = median(depth),
    y_min = min(depth),
    y_max = max(depth)
  ) %>% 
  ggplot()+
  aes(cut, y_med, ymin=y_min, ymax=y_max)+
  geom_pointrange()+
  ylab('depth')
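
An alternative that avoids the manual summary is to keep the raw data and ask geom_pointrange() to use the summary stat. This is a sketch of that approach; the name p_range is illustrative.

```r
library(ggplot2)

# geom_pointrange() with stat = "summary": the median is the point,
# and the minimum and maximum of depth define the range for each cut
p_range <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun = median,
    fun.min = min,
    fun.max = max
  )

p_range
```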

Exercise 2.

Question:

What does geom_col() do? How is it different to geom_bar()?

Answer:

geom_col() draws a bar chart in which the height of each bar is given by the value of y for each category in x. When a category has two or more y values, the bars are stacked by default, so the total bar height is the sum of the values.

In contrast, geom_bar() does not require y: it computes it as the count of observations in each category of x.
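
The relationship between the two geoms can be illustrated by computing the counts by hand and passing them to geom_col(). The sketch below uses the diamonds data; cut_counts, p_bar, and p_col are names chosen here for illustration.

```r
library(ggplot2)

# Pre-compute the number of diamonds per cut (columns: cut, Freq)
cut_counts <- as.data.frame(table(cut = diamonds$cut))

# geom_bar() counts the rows itself ...
p_bar <- ggplot(diamonds) +
  geom_bar(aes(x = cut))

# ... while geom_col() plots the heights it is given
p_col <- ggplot(cut_counts) +
  geom_col(aes(x = cut, y = Freq))

p_bar
p_col
```

Both plots should show bars of the same height, since geom_col() receives exactly the counts that geom_bar() computes internally.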

Exercise 3.

Question:

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Answer:

geom_violin() and stat_ydensity()
geom_histogram() and stat_bin()
geom_contour() and stat_contour()
geom_function() and stat_function()
geom_bin_2d() and stat_bin_2d()
geom_boxplot() and stat_boxplot()
geom_count() and stat_sum()
geom_density() and stat_density()
geom_density_2d() and stat_density_2d()
geom_hex() and stat_binhex()
geom_quantile() and stat_quantile()
geom_smooth() and stat_smooth()

Exercise 4.

Question:

What variables does stat_smooth() compute? What parameters control its behaviour?

Answer:

stat_smooth() computes:

y (or x): predicted value.
ymin (or xmin): lower pointwise confidence interval around the mean.
ymax (or xmax): upper pointwise confidence interval around the mean.
se: standard error.

Parameters:

method: determines the function used to perform the smoothing. By default ggplot2 chooses it from the size of the data: loess for fewer than 1,000 observations and gam otherwise. Other values include lm and glm.
formula: the formula passed to the smoothing function; for loess the default is y ~ x.
se: a logical argument that determines whether the confidence interval is drawn around the smoothing line; the default is TRUE.

Exercise 5.

Question:

In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

Answer:

Without group = 1, geom_bar() treats each category of x as its own group and computes the proportion within that group, so every bar has a proportion of 1. Adding group = 1 puts all observations in a single group, so the proportions are computed over the total. The same problem occurs when the fill aesthetic is added: each combination of x and fill becomes its own group with a proportion of 1.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))


ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
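
For reference, a version of the first plot with the grouping fixed might look like the sketch below, where group = 1 makes the proportions sum to 1 across all cuts. The name p_prop is illustrative.

```r
library(ggplot2)

# group = 1 places every observation in one group,
# so prop is each cut's share of the whole dataset
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

p_prop
```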

Problems Section 3.8.1 Position adjustments

Exercise 1.

Question:

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

Answer:

The problem with this plot is overplotting: several observations share the same x and y values, so their points are drawn on top of each other and information is lost. One solution is geom_jitter(), although the random noise it adds can be misleading. Another is to show the number of observations at each coordinate through color or size.

ggplot(mpg)+
 aes(cty, hwy)+
 geom_count()

Exercise 2.

Question:

What parameters to geom_jitter() control the amount of jittering?

Answer:

width, height: the amount of horizontal (width) and vertical (height) jitter. The jitter is added in both positive and negative directions, so the total spread is twice the value specified here.
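
A minimal sketch of setting both parameters explicitly is shown below. The value 0.3 and the alpha transparency are arbitrary choices made for illustration, not recommended defaults.

```r
library(ggplot2)

# Each point is moved by at most 0.3 units horizontally and vertically
p_jit <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5)

p_jit
```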

Exercise 3.

Question:

Compare and contrast geom_jitter() with geom_count().

Answer:

While geom_jitter() adds a small random offset to each point, geom_count() counts the points at each position and maps that count to the size aesthetic.

Exercise 4.

Question:

What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

Answer:

By default, geom_boxplot() uses position = "dodge2", which places the boxes side by side within the same x-axis category.

ggplot(mpg)+
  aes(class, displ, fill = as.factor(cyl))+
  geom_boxplot(position = 'dodge2')

Problems Section 3.9.1 Coordinate systems

Exercise 1.

Question:

Turn a stacked bar chart into a pie chart using coord_polar().

Answer:

ggplot(diamonds) +
  aes('x', fill=cut)+
  geom_bar()+
  coord_polar('y')

Exercise 2.

Question:

What does labs() do? Read the documentation.

Answer:

labs() modifies the plot labels, e.g. the axis names, title, subtitle, and the title of each legend (one per mapped aesthetic).

Exercise 3.

Question:

What’s the difference between coord_quickmap() and coord_map()?

Answer:

coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using any projection defined by the mapproj package. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines. It works best for smaller areas closer to the equator.

Exercise 4.

Question:

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

Answer:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()

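The graph shows a positive linear relationship between city and highway miles per gallon. The points lie above the reference line, so cars get more miles per gallon on the highway than in the city.
coord_fixed() fixes the aspect ratio so that one unit on the x-axis has the same length as one unit on the y-axis, which makes the two axes directly comparable.
geom_abline() adds a reference line with a given intercept and slope (defaults: intercept = 0, slope = 1); it is not fitted to the data.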
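
The position of the points relative to the line with intercept 0 and slope 1 can also be checked numerically. The sketch below computes the share of cars whose highway mileage exceeds their city mileage; share_above is a name introduced here.

```r
library(ggplot2)

# Proportion of cars that lie above the line hwy = cty
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```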

Step 3: Generate your own exercises

Are we breathing cleaner air?

Few realize it, but Mississippi breathes cleaner air today than it did two decades ago. Data from the EPA’s Toxics Release Inventory show that air emissions of toxic substances in the state decreased dramatically between 2004 and 2023.

Since 1987, the EPA has maintained the Toxics Release Inventory (TRI), which contains the 100 most commonly used data fields of the TRI Form R and the Form A Certification Statement. Among the variables measured are toxic pollutants released to the air in two forms: (1) fugitive or non-point air emissions, which are gases and vapors that escape from industrial equipment through leaks, vents, or spills, and (2) stack or point air emissions, which are direct emissions from industrial processes, released primarily through stacks.

Toxic substances emitted into the air can be carcinogenic or non-carcinogenic. Non-carcinogenic substances are released in far greater quantities, exceeding carcinogenic releases by more than a factor of 10 per year. In both cases, a decreasing trend has been observed over the last 20 years, from about 13,000 metric tons of pollutants in 2004 to about half that amount, roughly 6,500 metric tons, in 2023 (Figure 1).

Figure 1: Total emissions of toxic pollutants released into the air per year in the state of Mississippi, US, over the last two decades.

Looking in more detail at the county level, seven counties (Harrison, Jackson, Lawrence, Lowndes, Perry, Warren, and Yazoo) rank among the 10 highest emitters in every year of the last two decades (Figure 2). Harrison County alone produced around 30% of the state’s total emissions in 2013 and 2014. The top 10 counties have led the state’s total emissions throughout the twenty years and account for a growing share of the total: above 70% in 2004, a peak of 90% in 2014, and around 80% of total emissions of toxic air pollutants in the last 5 years (Figure 3).

Figure 2: Top 10 counties releasing the highest emissions of toxic pollutants into the air in the state of Mississippi by year.

Figure 3: Percentage of toxic air pollutant emissions from the top 10 counties by year.
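
The percentage shown in Figure 3 can be thought of as the emissions of the n largest emitters divided by the yearly total. The sketch below illustrates that calculation in base R; the data frame, its column names, and the helper top_n_share() are hypothetical, and the numbers are made up for illustration and are not TRI values.

```r
# Toy data: made-up numbers for illustration only, NOT values from the TRI
emissions <- data.frame(
  year   = c(2004, 2004, 2004, 2005, 2005, 2005),
  county = c("A", "B", "C", "A", "B", "C"),
  tons   = c(50, 30, 20, 10, 60, 30)
)

# Hypothetical helper: percentage of each year's total emitted by the
# n counties with the highest emissions in that year
top_n_share <- function(df, n) {
  sapply(split(df$tons, df$year), function(tons) {
    top <- sort(tons, decreasing = TRUE)[seq_len(min(n, length(tons)))]
    100 * sum(top) / sum(tons)
  })
}

top_n_share(emissions, n = 2)   # 2004: 80, 2005: 90
```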

Finally, most counties in the state of Mississippi maintain low levels of air emissions, whether fugitive or stack, and only a few counties, mainly in the southern part of the state, account for most emissions. Note also that for the period 2004 to 2023 the EPA has no air emission records for 7 counties, which does not necessarily imply zero emissions in those locations (Figure 4).

Figure 4: Map of average annual emissions of toxic contaminants for the period 2004 - 2023 for each county in the state of Mississippi.