Homework

Task 1

install.packages("ggplot2") #install ggplot
library(ggplot2) #load ggplot2

data(mpg) #load in data

ggplot(mpg, aes(x=hwy))+ #specify data and x axis
  geom_histogram(binwidth= 2)+ #create histogram and set binwidth
  labs( #change axes and title names
    x= "Highway Miles per Gallon", #name x axis
    title = "Highway Fuel Efficiency" #create title
  ) +
  theme_minimal() #minimal theme

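Questions

  1. What range of highway fuel efficiency values is most common? Most vehicles fall between about 15 and 30 mpg. The tallest bars sit around 26 mpg, with a second cluster near 17 mpg.

  2. Does the distribution appear symmetric, skewed, or multimodal? The distribution is bimodal, with peaks near 17 and 26 mpg, and has a right tail from a few cars above 35 mpg.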
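The answers above can be checked numerically rather than read off the plot. The following is a minimal sketch, assuming 2-mpg bins with edges chosen for illustration (`hwy_bins`, `bin_counts`, and `hwy_mode` are names introduced here, and the bin edges need not match the ones geom_histogram picks).

```r
library(ggplot2) # provides the mpg dataset

# Count highway mpg values in 2-mpg-wide bins.
# The break points are an assumption made for this sketch.
hwy_bins <- cut(mpg$hwy, breaks = seq(10, 46, by = 2), right = FALSE)
bin_counts <- table(hwy_bins)

# List the bins from most to least populated
sort(bin_counts, decreasing = TRUE)

# The single most frequent highway mpg value
hwy_mode <- as.integer(names(which.max(table(mpg$hwy))))
hwy_mode
```

Two separate groups of well-populated bins in the sorted output would support reading the histogram as bimodal.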

Task 2

ggplot(mpg, aes(x= displ))+ #create plot with mpg dataset
  geom_histogram(binwidth=0.5, fill= "purple1", col="purple4")+ #shorten binwidth, color purple & darker purple outline
  labs(
    x= "Displacement", #name x axis
    title = "Overall Engine Displacement" #create title
  )+
  theme_minimal() #minimal theme

Question

  1. What does the histogram suggest about the types of engines represented in the dataset? Most engines are small to mid-sized, roughly 2 to 5 litres, with progressively fewer large engines up to 7 litres, so the distribution is skewed right.

Task 3

ggplot(mpg, aes(x= displ))+ #create plot with mpg data
  geom_density(fill= "skyblue", col= "blue4", alpha=0.5)+ #fill blue, dark blue outline, lower opacity
  labs(
    x = "Displacement", #name x axis
    y = "Density", #name y axis
    title = "Engine Displacement" #add title
  )+
  theme_minimal() #minimal theme

Question

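  1. How does the density plot compare to the histogram you created earlier? It represents the same data, but as a smooth curve that does not depend on where the bin edges fall, so the overall shape of the distribution is easier to see. The y axis also changes from counts to density.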
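One way to make the comparison concrete is to draw both on the same axes. This is a sketch, not part of the assigned task: it rescales the histogram to density with after_stat(density) so the bars and the curve share a y axis. The name `overlay_plot` and the grey colors are choices made here.

```r
library(ggplot2)

overlay_plot <- ggplot(mpg, aes(x = displ)) +
  # histogram drawn on the density scale instead of counts
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5,
                 fill = "grey80", colour = "grey40") +
  # kernel density estimate of the same variable
  geom_density(colour = "blue4") +
  labs(x = "Displacement", y = "Density") +
  theme_minimal()

overlay_plot
```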

Task 4

ggplot(mpg, aes(x= displ, fill = class))+ #create plot using mpg, fill by vehicle class
  geom_density(alpha= 0.5)+ #lower opacity
  guides(fill= guide_legend(title="Vehicle Class")) #change legend title

Question

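  1. Which vehicle class tends to have the largest engines? Two-seaters sit furthest to the right on the displacement axis (about 5.7 to 7 litres), although the class contains only five cars. Among the larger classes, SUVs and pickups have the largest engines.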
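The ranking of classes can be checked with a per-class summary. A minimal sketch, using the median as the measure of "tends to" (an assumption; the mean would also work) and introducing the names `median_displ` and `class_counts`:

```r
library(ggplot2) # provides the mpg dataset

# Median engine displacement for each vehicle class
median_displ <- tapply(mpg$displ, mpg$class, median)

# Number of cars in each class, to judge how reliable each median is
class_counts <- table(mpg$class)

sort(median_displ, decreasing = TRUE)
class_counts
```

Showing the counts next to the medians helps flag classes whose density curve is based on very few cars.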

Task 5

ggplot(mpg, aes(x=hwy, color=class))+ #plot mpg data & color by vehicle class
  geom_freqpoly(binwidth=2) #change binwidth 

Question

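  1. Why might frequency polygons be easier to interpret than overlapping histograms when comparing groups? Lines do not cover each other the way filled bars do, so every group's shape stays visible even where the groups overlap, and the groups can be compared at the same x value.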
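To see the difference directly, the same data can be drawn both ways. This is an illustrative sketch: the overlapping histogram uses position = "identity" with partial transparency, and the names `overlap_hist` and `freq_poly` and the alpha value are choices made here.

```r
library(ggplot2)

# Overlapping histograms: bars for all classes drawn on top of each other
overlap_hist <- ggplot(mpg, aes(x = hwy, fill = class)) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.4)

# Frequency polygons: the same binned counts drawn as lines
freq_poly <- ggplot(mpg, aes(x = hwy, colour = class)) +
  geom_freqpoly(binwidth = 2)

overlap_hist
freq_poly
```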

Task 6

ggplot(mpg, aes(x= displ, y=hwy))+ #create mpg plot with displ & hwy
  stat_density_2d(aes(fill= after_stat(level)), geom= "polygon")+ #map fill to density level, draw contours as filled polygons
  theme_bw()+ #change theme
  labs(
    title = "Displacement & Fuel Efficiency", #add title
    x = "Displacement", #title x axis
    y = "Fuel Efficiency", #title y axis
    fill = "Density Level" #title legend
  )

Question

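  1. What additional information does the filled density plot provide compared to the contour plot? The fill color encodes the density level, so the regions where cars are most concentrated stand out directly instead of having to be inferred from how tightly the contour lines are spaced.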
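For reference, the unfilled contour version that the question compares against could be sketched as follows. It uses geom_density_2d() with its default line geometry; the name `contour_plot` and the line color are choices made here.

```r
library(ggplot2)

contour_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_density_2d(colour = "blue4") + # contour lines only, no fill
  theme_bw() +
  labs(
    title = "Displacement & Fuel Efficiency",
    x = "Displacement",
    y = "Fuel Efficiency"
  )

contour_plot
```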

Task 7

data(diamonds) #load in diamonds data
install.packages("ggExtra") #install ggExtra
install.packages("ggthemes") #install ggthemes
library(ggExtra) #load ggExtra
library(ggthemes) #load ggthemes

diamond_plot <- ggplot(diamonds, aes(x= carat, y= price, color = color))+ #create diamond_plot
  geom_point(size=2, alpha= 0.2)+ #scatterplot w lowered opacity
  theme_minimal()+ #minimal theme
  labs(
    x = "Carat", #name x axis
    y = "Price", #name y axis
    title = "Diamond Carats versus Price" #add title
  )

ggMarginal(diamond_plot, type = "density", groupColour= TRUE, groupFill= TRUE)

Question

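  1. Why might marginal plots be useful when analyzing relationships between variables? They show each variable's own distribution alongside the scatterplot, so the joint relationship and the individual distributions can be read together. Here the points overlap heavily, and the marginal densities show where carat and price values concentrate for each color grade.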