setwd("C:/Users/Sam/Documents/DSCI_605/Module_5/")
library(ggplot2)
library(dplyr)

Follow along with me to plot the bubble graph and the boxplot (start at the beginning and stop at 38 minutes 5 seconds). The bar graph and pie chart are optional.

Bubble

library(gapminder)
data <- gapminder %>% filter(year == 2007) %>% dplyr::select(-year)
names(data)
## [1] "country"   "continent" "lifeExp"   "pop"       "gdpPercap"
ggplot(data, aes(x=gdpPercap, y=lifeExp, size=pop)) +
  geom_point(alpha=0.7)

data %>%
  arrange(desc(pop)) %>%
  mutate(pop = pop / 1e6, country = factor(country, country)) %>%  # pop in millions, to match the "(M)" legend title
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop)) +
  geom_point(alpha=0.5) +
  scale_size(range = c(.1, 24), name = "Population (M)")

data %>%
  arrange(desc(pop)) %>%
  mutate(pop = pop / 1e6, country = factor(country, country)) %>%  # pop in millions, to match the "(M)" legend title
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, color = continent)) +
  geom_point(alpha=0.5) +
  scale_size(range = c(.1, 24), name = "Population (M)")

library(hrbrthemes)
library(viridis)
data1 <- gapminder %>% filter(year == 2007) %>% dplyr::select(-year)
data1 %>%
  arrange(desc(pop)) %>%
  mutate(pop = pop / 1e6, country = factor(country, country)) %>%  # pop in millions, to match the "(M)" legend title
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, fill = continent)) +
  geom_point(alpha=0.5, shape = 21, color = "black") +
  scale_size(range = c(.1, 24), name = "Population (M)") +
  scale_fill_viridis(discrete = TRUE, guide= "none", option = "A") +
  #theme_ipsum() +
  theme(legend.position = "bottom", legend.box = "vertical", legend.margin = margin(), legend.key = element_rect(fill = "white", color = "white")) +
  ylab("Life Expectancy") +
  xlab("GDP per Capita")

Box

names(airquality)
## [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
boxplot(airquality$Ozone,
        main = "Mean ozone in parts per billion at Roosevelt Island",
        xlab = "Parts Per Billion",
        ylab = "Ozone",
        col =  "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

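ozone <- airquality$Ozone
temp <- airquality$Temp
# Simulate normal samples with the same mean and sd as the observed data;
# set.seed() (seed value chosen arbitrarily) makes the random draws reproducible
set.seed(605)
ozone_norm <- rnorm(200, mean = mean(ozone, na.rm = TRUE), sd=sd(ozone, na.rm = TRUE))
temp_norm <- rnorm(200, mean = mean(temp, na.rm = TRUE), sd=sd(temp, na.rm = TRUE))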
boxplot(ozone, ozone_norm, temp, temp_norm,
        main = "Multiple Boxplots for Comparison",
        at = c(1,2,4,5),
        names = c("ozone", "normal", "temp", "normal"),
        las = 2,
        col =  c("orange", "red"),
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)

boxplot(Temp~Month,
        data = airquality,
        main = "Different Box Plots For Each Month",
        xlab = "Month Number",
        ylab = "Degrees Fahrenheit",
        col = "orange",
        border = "brown")

After coding with me step by step, please make a bubble graph and a boxplot by yourself. You can use your own data or a built-in dataset. Please use a different dataset from my example and make your data visualization unique by making some changes to my code. You can earn 5 bonus points.

library(help = "datasets")
df1 <- stackloss
summary(stackloss)
##     Air.Flow       Water.Temp     Acid.Conc.      stack.loss   
##  Min.   :50.00   Min.   :17.0   Min.   :72.00   Min.   : 7.00  
##  1st Qu.:56.00   1st Qu.:18.0   1st Qu.:82.00   1st Qu.:11.00  
##  Median :58.00   Median :20.0   Median :87.00   Median :15.00  
##  Mean   :60.43   Mean   :21.1   Mean   :86.29   Mean   :17.52  
##  3rd Qu.:62.00   3rd Qu.:24.0   3rd Qu.:89.00   3rd Qu.:19.00  
##  Max.   :80.00   Max.   :27.0   Max.   :93.00   Max.   :42.00

I chose this dataset because I am interested in environmental analytics. It also consists entirely of continuous numeric variables, which suits a bubble plot. Since I rarely know the specifics of a dataset I pick at random, my goal is to build a chart and see what I can discern without knowing the dataset's purpose. In other words, my goal is discovery through visualization.

Bubble

# Bin acid concentration into three ordered groups in one step; assigning "Low"
# first would turn the column into character and make the later comparisons string-based
df1$Acid.Conc. <- cut(df1$Acid.Conc.,
                      breaks = c(70, 80, 90, 100),
                      right = FALSE,
                      labels = c("Low", "Med", "High"))
df1 %>%
  arrange(desc(stack.loss)) %>%
  ggplot(aes(x=Air.Flow, y=Water.Temp, size=stack.loss, color = Acid.Conc.,  fill = Acid.Conc.)) +
  geom_point(alpha=0.7, shape = 22, color = "black") +
  scale_size(range = c(.1, 24), name = "Stack Loss") +
  scale_fill_viridis(discrete = TRUE, guide= "none", option = "H") +
  theme(legend.position = "bottom", legend.box = "vertical", legend.margin = margin(), legend.key = element_rect(fill = "white", color = "white")) +
  ylab("Water Temperature") +
  xlab("Air Flow") +
  ggtitle("Plant Oxidation of Ammonia to Nitric Acid")

For my chart, I added a title to the plot. I also changed the fill color, point shape, and point size to show differences more clearly, since points overlap. For the fill color scale, I switched to option H because its colors seemed easiest to tell apart. I set alpha to 0.7 to adjust opacity, and I changed the x and y labels. I tried to go further and add a color based on my factored variable, but I didn’t have any luck. I had also planned a legend showing the colors for the Low, Med, and High levels of that variable. Fill already encodes these levels, but I was hoping to display a legend so that I would not need to explain them. The chart shows that higher air flow and water temperature are associated with greater stack loss in most cases.
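
The missing legend can likely be traced to two settings in the code: `guide = "none"` on the fill scale suppresses the fill legend, and the constant `color = "black"` in `geom_point()` overrides the `color` mapping in `aes()`. Below is a minimal sketch of one way to show the Low/Med/High key. It assumes ggplot2's built-in `scale_fill_viridis_d()` in place of `viridis::scale_fill_viridis()`, and the names `df_legend`, `Acid.Group`, and `p_legend` are hypothetical, used only for illustration.

```r
library(ggplot2)

# Work on a fresh copy so the original columns stay numeric
df_legend <- stackloss

# Bin acid concentration with the same cut points as above (70-79, 80-89, 90-99)
df_legend$Acid.Group <- cut(df_legend$Acid.Conc.,
                            breaks = c(70, 80, 90, 100),
                            right = FALSE,
                            labels = c("Low", "Med", "High"))

# Draw the largest squares first so smaller ones stay visible on top
df_legend <- df_legend[order(-df_legend$stack.loss), ]

p_legend <- ggplot(df_legend,
                   aes(x = Air.Flow, y = Water.Temp,
                       size = stack.loss, fill = Acid.Group)) +
  geom_point(alpha = 0.7, shape = 22, color = "black") +
  scale_size(range = c(.1, 24), name = "Stack Loss") +
  # No guide = "none" here, so the fill legend is drawn
  scale_fill_viridis_d(option = "H", name = "Acid Concentration") +
  # Enlarge the legend keys so the fill colors are easy to read
  guides(fill = guide_legend(override.aes = list(size = 6))) +
  theme(legend.position = "bottom", legend.box = "vertical") +
  ylab("Water Temperature") +
  xlab("Air Flow") +
  ggtitle("Plant Oxidation of Ammonia to Nitric Acid")

p_legend
```

Because shape 22 is a filled square, its interior is controlled by `fill` and its outline by `color`, so a single fill legend is enough to identify the three groups.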

Boxplot

df2 <- stackloss
boxplot(df2$Air.Flow, df2$Water.Temp, df2$Acid.Conc., df2$stack.loss,
        main = "Plant Oxidation of Ammonia to Nitric Acid",
        names = c("Air Flow", "Water Temperature", "Acid Concentration", "Stack Loss"),
        las = 2,
        col =  c("blue", "yellow", "green", "red"),
        border = "black",
        horizontal = TRUE,
        notch = FALSE)

For my chart, I removed “at = c(1,2,4,5)” so that the boxes are not scrunched together, and I let the plot function decide where to place them. I renamed each box and assigned each one its own color to differentiate them. I changed the border color to black so that the borders are more distinguishable. I also removed the notches and added a title to the plot. In this chart we can see that Stack Loss and Air Flow have outlier points, whereas Acid Concentration and Water Temperature do not. The quartiles for Water Temperature also sit much closer to its minimum and maximum than those of the other variables, whose boxes sit a little closer to the median.
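
The outlier reading can be checked numerically. Below is a minimal sketch using base R's `boxplot.stats()`, which applies the same default 1.5 × IQR rule that `boxplot()` uses to decide which points to draw beyond the whiskers. The name `outliers` is only for illustration.

```r
# For each stackloss column, list the values that boxplot() would draw as outlier points
outliers <- lapply(stackloss, function(x) boxplot.stats(x)$out)
outliers

# Number of flagged points per variable
sapply(outliers, length)
```

A column that returns an empty vector has no points beyond its whiskers.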