Week 2 Assignment

library(ggplot2)

####Sections: Introduction, Prerequisites, First Steps, The mpg Data Frame, Creating a ggplot, A Graphing Template

#####Exercise 1 - Run ggplot(data = mpg). What do you see?

ggplot(data = mpg) #I see a blank chart: an empty gray panel, because no geom layer has been added yet
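
One way to confirm why the chart is blank is to inspect the plot object itself. The sketch below is only illustrative: the names `p_empty` and `p_points` are made up here, and `$layers` is an internal field of the plot object rather than something the exercise asks for.

```r
library(ggplot2)

# ggplot() alone stores the data and sets up the plot, but draws nothing
p_empty <- ggplot(data = mpg)
length(p_empty$layers)   # number of geom/stat layers attached so far

# Adding a geom supplies the layer that actually draws points
p_points <- p_empty + geom_point(mapping = aes(x = displ, y = hwy))
length(p_points$layers)
```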

#####Exercise 2 - Run ?mpg How many rows are in mpg? How many columns?

?mpg #234 rows and 11 columns
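
The same counts can probably be checked directly from the console instead of reading the help page. A minimal sketch using base R functions on the mpg data frame:

```r
library(ggplot2)

# dim() returns c(rows, columns); nrow() and ncol() return each one separately
dim(mpg)
nrow(mpg)
ncol(mpg)
```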

## starting httpd help server ... done

#####Exercise 3 - What does the drv variable describe? Read the help for ?mpg to find out.

#drv = the type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = 4wd

#####Exercise 4 - Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) + geom_point(mapping = aes(x = cyl, y = hwy)) #hwy goes on the y-axis because "hwy vs cyl" is read as y vs x. The plot shows a negative relationship: as the number of cylinders goes up, the fuel efficiency goes down. It also shows some vehicles hover around 25 mpg regardless of cylinders, so there are other factors besides cylinders at play.
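
The direction of the relationship can also be checked numerically. This is a rough sketch, not part of the exercise: it uses the Pearson correlation and per-group medians as simple summaries of what the scatterplot shows.

```r
library(ggplot2)

# Pearson correlation between cylinder count and highway mpg;
# a negative value matches a downward trend in the scatterplot
cor(mpg$cyl, mpg$hwy)

# Median highway mpg within each cylinder count
tapply(mpg$hwy, mpg$cyl, median)
```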

#####Exercise 5 - What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg) + geom_point(mapping = aes(x = class, y = drv)) #The plot is not useful because both variables are categorical: the points fall on a small grid of class/drv combinations and sit on top of each other, so we see which combinations exist but not how many cars are in each. Either we need more layers to uncover a relationship or an alternative visual.
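
To see what the overlapping points hide, the two variables can be cross-tabulated or plotted with a geom that encodes the count. A sketch under the assumption that counts per combination are what we want to see; `class_drv_counts` and `p_count` are illustrative names.

```r
library(ggplot2)

# Cross-tabulate the two categorical variables: each cell is the number
# of cars behind a single overlapping point in the scatterplot
class_drv_counts <- table(mpg$class, mpg$drv)
class_drv_counts

# geom_count() sizes each point by the number of observations at that spot
p_count <- ggplot(data = mpg) + geom_count(mapping = aes(x = class, y = drv))
p_count
```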

####Sections: Aesthetic Mappings

#####Exercise 1 - What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
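  geom_point(mapping = aes(x = displ, y = hwy), color = "blue") #The points were not blue because, in the original code, color = "blue" was placed *inside* aes(), so ggplot2 treated "blue" as a one-value variable to map to a color scale instead of as a literal color. Setting color outside aes() makes the points blue. See corrected chart.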
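
The difference between mapping and setting a color can be seen by building both versions and looking at where "blue" ends up. This is a sketch only: `p_mapped` and `p_set` are illustrative names, and `$mapping` and `$aes_params` are internal fields of a layer that may differ between ggplot2 versions.

```r
library(ggplot2)

# Inside aes(): "blue" becomes a constant variable mapped through the color scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): "blue" is a fixed parameter of the layer
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

names(p_mapped$layers[[1]]$mapping)  # the mapping includes the colour aesthetic
p_set$layers[[1]]$aes_params         # the fixed colour is stored as a layer parameter
```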

#####Exercise 2 - Which variables in mpg are categorical? Which variables are continuous? # (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

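#Categorical variables in mpg are: manufacturer, model, trans, drv, fl, and class. Continuous variables in mpg include: displ, year, cyl, cty, and hwy (year and cyl are stored as integers and take only a few distinct values).
mpg #when you run mpg, the type of each column is printed under its name: <chr> marks a character (categorical) column, and <dbl> and <int> mark numeric columns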

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
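
The column types shown in the printout can also be pulled out programmatically. A minimal sketch in base R; `col_types` is an illustrative name.

```r
library(ggplot2)

# First class of every column: "character" columns are categorical,
# "numeric" and "integer" columns hold numbers
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Names of the character (categorical) columns
names(col_types)[col_types == "character"]
```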

#####Exercise 3 - Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = cyl, size = hwy, shape = drv)) #A continuous variable mapped to color appears on a gradient, and mapped to size it gives points whose size scales with the value, which makes it easier to spot trends and outliers. A categorical variable gets one distinct color or shape per level instead. Shape has no natural order, so ggplot2 raises an error if a continuous variable is mapped to it; that is why shape uses the categorical drv here.
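
What happens when a continuous variable is mapped to shape can be tried safely by catching the error. A sketch, assuming cty as the continuous variable; `p_shape` and `shape_result` are illustrative names, and the exact wording of the error message depends on the ggplot2 version.

```r
library(ggplot2)

# Map a continuous variable (cty) to shape
p_shape <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))

# The error is raised when the plot is built, so catch it at that step
shape_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```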

#####Exercise 5 - What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), stroke = 3, shape = 25) #The stroke aesthetic sets the width of a shape's border. The shape aesthetic can be specified with an integer (between 0 and 25), and stroke is most useful with the filled shapes 21 to 25 (such as 25 here), whose border color can differ from the fill.

#####Exercise 6 - What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5)) #The expression displ < 5 is evaluated for each row, giving TRUE or FALSE, and ggplot2 maps that temporary logical variable to color, so the points are colored by whether displ is below 5.
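
The variable that the color aesthetic receives here can be reproduced outside the plot. A small sketch; `is_small_engine` is an illustrative name.

```r
library(ggplot2)

# The expression is evaluated row by row and yields TRUE or FALSE
is_small_engine <- mpg$displ < 5
class(is_small_engine)

# How many cars fall into each of the two color groups
table(is_small_engine)
```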

####Sections: Common Problems, Facets

#####Exercise 3 - What plots does the following code make? What does . do?

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .) #Here we faceted the plot on drv in rows only: each value of drv gets its own row of panels. The . is a placeholder used instead of a variable name, meaning no variable is used for the columns.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(. ~ cyl) #Here we faceted the plot on cyl in columns only: each value of cyl gets its own column of panels. The . is a placeholder used instead of a variable name, meaning no variable is used for the rows.
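
The same two layouts can also be written without the formula and its `.` placeholder, using the `rows` and `cols` arguments of facet_grid() with vars(). A sketch; `base_plot`, `p_rows` and `p_cols` are illustrative names.

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

# Same layout as facet_grid(drv ~ .): one row of panels per drv value
p_rows <- base_plot + facet_grid(rows = vars(drv))

# Same layout as facet_grid(. ~ cyl): one column of panels per cyl value
p_cols <- base_plot + facet_grid(cols = vars(cyl))

# The number of panels equals the number of distinct values
length(unique(mpg$drv))
length(unique(mpg$cyl))
```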

#####Exercise 4 - Take the first faceted plot in this section: What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2) #Advantages: faceting splits your plot into facets (subplots), so points from different classes do not overlap, and it allows for rapid review during exploratory data analysis. The color aesthetic becomes more confusing as the number of categories increases. Disadvantages: the points are spread across panels, so differences between classes are harder to compare than when they share one plot, and with only a few categories color may be better suited. With a larger dataset, overplotting within a single plot increases, so faceting becomes more useful.

#####Exercise 5 - Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

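#nrow sets the number of rows and ncol sets the number of columns of panels. Other options that control the layout of the individual panels include as.table and dir. facet_grid() does not have nrow and ncol arguments because its number of rows and columns is already equal to the number of unique levels in the row/column variables.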

####Sections: Geometric Objects

#####Exercise 2 - Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE) #I didn't know the confidence interval would be removed by se = FALSE, so that was definitely not expected.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#####Exercise 5 - Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot() + geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy)) #No, the graphs look the same, because both layers use the same data and mappings in each version. The only difference is the duplicated code in the second graph, and that is something we should avoid.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#####Exercise 6 - Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(size=class),show.legend = FALSE) + geom_smooth(se=FALSE)

## Warning: Using size for a discrete variable is not advised.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(size=class),show.legend = FALSE) + geom_smooth(mapping = aes(group=drv),se=FALSE)

## Warning: Using size for a discrete variable is not advised.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color=drv,size=class),show.legend = TRUE) + geom_smooth(mapping = aes(color=drv),show.legend = TRUE,se=FALSE)

## Warning: Using size for a discrete variable is not advised.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color=drv,size=class),show.legend = TRUE) + geom_smooth(show.legend = FALSE,se=FALSE)

## Warning: Using size for a discrete variable is not advised.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color=drv,size=class),show.legend = TRUE) + geom_smooth(mapping = aes(linetype=drv),show.legend = TRUE,se=FALSE)

## Warning: Using size for a discrete variable is not advised.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy,size=class)) + geom_point(mapping = aes(color=drv),show.legend = TRUE)

## Warning: Using size for a discrete variable is not advised.

####Sections: Statistical Transformations

#####Exercise 1 - What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

#The default geom associated with stat_summary() is geom_pointrange()
ggplot(data = diamonds) + geom_pointrange(mapping = aes(x = cut, y = depth), stat = "summary", fun.min = min, fun.max = max, fun = median) #To rewrite the previous plot with the geom, stat = "summary" has to be named because the default stat of geom_pointrange() is "identity", and the values for fun.min, fun.max and fun need to be defined.

#####Exercise 2 - What does geom_col() do? How is it different to geom_bar()?

#geom_bar() uses stat_count() by default, so the height of each bar is proportional to the number of cases in each group, whereas geom_col() uses stat_identity(), so the height of each bar represents a value already in the data.
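
The difference can be illustrated by drawing the same bar chart both ways. This is a sketch, assuming the diamonds data used elsewhere in this section; `cut_counts`, `p_bar` and `p_col` are illustrative names, and the counting step uses base R's table().

```r
library(ggplot2)

# geom_bar() counts the rows in each cut for us
p_bar <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

# For geom_col() we count first, then map the precomputed value to y
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")
p_col <- ggplot(data = cut_counts) + geom_col(mapping = aes(x = cut, y = n))

cut_counts
```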

#####Exercise 5 - In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) #The problem is that all the bars have the same height: without group = 1, the proportion is calculated within each cut, so every bar has prop = 1. Setting group = 1 tells ggplot2 to use the whole data set when calculating proportions.
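
The two ways of computing the proportion can be reproduced by hand. A sketch in base R that mimics, rather than calls, what the stat does; `cut_props` and `within_group_props` are illustrative names.

```r
library(ggplot2)

# Proportion of all diamonds in each cut: what the group = 1 chart plots
cut_props <- prop.table(table(diamonds$cut))
cut_props

# When each cut is its own group, every proportion is n / n = 1
within_group_props <- table(diamonds$cut) / table(diamonds$cut)
within_group_props
```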

####Sections: Position Adjustments

#####Exercise 1 - What is the problem with this plot? How could you improve it?

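ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point(position = "jitter") #The problem is overplotting: cty and hwy are whole numbers, so many data points share the same coordinates and hide each other. position = "jitter" adds a small amount of random noise to each point, which reveals the overlapping points and makes the plot easier to read.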

#####Exercise 2 - What parameters to geom_jitter() control the amount of jittering?

#The width and height arguments control how much horizontal and vertical jittering is applied; if omitted, each defaults to 40% of the resolution of the data.

#####Exercise 3 - Compare and contrast geom_jitter() with geom_count().

#geom_jitter() is a shortcut for geom_point(position = "jitter"). It adds a small amount of random noise to each point and is a useful way of handling overplotting caused by discreteness in smaller datasets, whereas geom_count() is a variant of geom_point() that counts the number of observations at each location, then maps the count to point area. Both are useful when you have discrete data and overplotting; geom_jitter() moves points away from their true positions, while geom_count() keeps the positions exact.
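
The comparison can be sketched on the cty/hwy data from Exercise 1. The jitter amounts of 0.3 are arbitrary values chosen for illustration, and `n_distinct_pairs`, `p_jitter` and `p_count` are illustrative names.

```r
library(ggplot2)

# Many cars share the same (cty, hwy) pair, which is why points overlap
n_distinct_pairs <- nrow(unique(mpg[, c("cty", "hwy")]))
n_distinct_pairs
nrow(mpg)

# geom_jitter(): moves points randomly; width and height set how far
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)

# geom_count(): keeps exact positions and sizes points by the count
p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()
```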

####Sections: Coordinate Systems

#####Exercise 1 - Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), width = 1) #stacked bar chart

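ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), width = 1) + coord_polar() #stacked bar chart in polar coordinates; with the default theta = "x" each cut becomes a wedge (a coxcomb chart), whereas a true pie chart needs a single stacked bar and coord_polar(theta = "y")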
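
A pie chart in the usual sense can be sketched by stacking everything into one bar and mapping its height to the angle. This is one possible approach, not the only one; using the constant `x = 1` and the names `p_stacked` and `p_pie` are choices made here for illustration.

```r
library(ggplot2)

# A single stacked bar: every diamond shares the same x position
p_stacked <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = 1, fill = clarity), width = 1)

# theta = "y" maps the bar's height to the angle, giving a pie chart
p_pie <- p_stacked + coord_polar(theta = "y")
p_pie
```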

#####Exercise 3 - What’s the difference between coord_quickmap() and coord_map()?

#coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using any projection defined by the mapproj package. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines. It works best for smaller areas closer to the equator.
#Note: Both coord_map() and coord_quickmap() are superseded by coord_sf(), and should no longer be used in new code. All regular (non-sf) geoms can be used with coord_sf() by setting the default coordinate system via the default_crs argument. See also the examples for annotation_map() and geom_map().

#####Exercise 4 - What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

#City and highway mpg have a strong positive, roughly linear relationship, and the points lie above the reference line, so highway mpg is higher than city mpg. coord_fixed() forces a specified ratio between the physical representation of data units on the axes; the ratio is the number of units on the y-axis equivalent to one unit on the x-axis, and with the default of 1 the reference line is drawn at 45 degrees, so the comparison is not distorted. geom_abline() adds a diagonal reference line specified by slope and intercept; with no arguments it draws the line with intercept 0 and slope 1.
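
The plot is not reproduced in this write-up, so the sketch below assumes it is a scatterplot of cty against hwy with geom_abline() and coord_fixed() added, as the exercise describes; `p_fixed` is an illustrative name.

```r
library(ggplot2)

# Share of cars whose highway mpg exceeds their city mpg
mean(mpg$hwy > mpg$cty)

# geom_abline() defaults to intercept 0 and slope 1 (the line hwy = cty);
# coord_fixed() defaults to ratio = 1, so that line is drawn at 45 degrees
p_fixed <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline() +
  coord_fixed()
p_fixed
```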

Carie Moreau

2023-05-30