This is late because it took me a week- A WEEK- to figure out how to update rlang. Coding is a PRISON.

knitr::opts_chunk$set(echo = TRUE)
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("hexbin")
#install.packages("mgcv")
#install.packages("MASS")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(hexbin)
library(mgcv)

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:dplyr':
## 
##     collapse

## This is mgcv 1.9-4. For overview type '?mgcv'.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

#This might be unnecessary, but I'm getting frustrated
#I keep getting error messages about some component of stringr being corrupted, so I'm just going to comment it out for now
#install.packages("stringr")
#library(stringr)

2.2.1 Exercises

1.) group_by(x) could be used to group the data by values of a specific variable. head(mpg) gives the first few rows of the table. dim(mpg) gives the dimensions of the table. str(mpg) shows the title, type, and number of values of every variable. glimpse(mpg) shows the title, type, and first few values of every variable, as well as the dimensions of the table.

2.) You can find every dataset included with ggplot2 in its help page index. (If you’re having trouble finding it, search “ggplot2” in the Help window, then scroll to the bottom of the page and click “Index”.)

3.) Well, miles-per-gallon is a ratio (miles/gallon), and fuel consumption is also a ratio (liters/100km). So in order to convert mpg$cty and mpg$hwy to fuel consumption… I’m going to just insert a picture of my math.

[Image: a picture of my math.]
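For exercise 3, the conversion can also be written out in R. This is a minimal sketch, assuming US liquid gallons (3.785411784 L) and statute miles (1.609344 km); the helper name mpg_to_l100km and the new column names are made up for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# L/100 km = (litres per gallon * 100) / (km per mile * miles per gallon)
#          = 378.5411784 / (1.609344 * mpg), roughly 235.21 / mpg
mpg_to_l100km <- function(miles_per_gallon) {
  litres_per_gallon <- 3.785411784  # US liquid gallon (assumption)
  km_per_mile <- 1.609344
  (litres_per_gallon * 100) / (km_per_mile * miles_per_gallon)
}

# Add the converted columns next to the originals
fuel <- transform(
  as.data.frame(mpg),
  cty_l100km = mpg_to_l100km(cty),
  hwy_l100km = mpg_to_l100km(hwy)
)
head(fuel[, c("model", "cty", "cty_l100km", "hwy", "hwy_l100km")])
```

Because the two ratios are reciprocal, a higher mpg always gives a lower L/100 km value.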

#Which manufacturer has the most models in the dataset?
mpg %>%
  group_by(manufacturer) %>%
  summarise(manufacturerN = n())

## # A tibble: 15 × 2
##    manufacturer manufacturerN
##    <chr>                <int>
##  1 audi                    18
##  2 chevrolet               19
##  3 dodge                   37
##  4 ford                    25
##  5 honda                    9
##  6 hyundai                 14
##  7 jeep                     8
##  8 land rover               4
##  9 lincoln                  3
## 10 mercury                  4
## 11 nissan                  13
## 12 pontiac                  5
## 13 subaru                  14
## 14 toyota                  34
## 15 volkswagen              27

# Dodge does.
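
# A sketch of a shorter way to read off the answer: sort the counts so the top manufacturer comes first.
# The "models" column is an addition here, on the assumption that "models" could also mean distinct model names rather than rows.

```r
library(dplyr)
library(ggplot2)

manufacturer_summary <- mpg %>%
  group_by(manufacturer) %>%
  summarise(rows = n(), models = n_distinct(model)) %>%  # rows vs. distinct model names
  arrange(desc(rows))                                    # largest row count first
manufacturer_summary
```

# Sorting by desc(models) instead would rank manufacturers by distinct model names.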

head(mpg)

## # A tibble: 6 × 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

#How many models are there in the dataset? We need to find which model has the most variations, but first we need to figure out how many models there are.
mpg %>%
  group_by(model) %>%
  summarise(modelN = n())

## # A tibble: 38 × 2
##    model              modelN
##    <chr>               <int>
##  1 4runner 4wd             6
##  2 a4                      7
##  3 a4 quattro              8
##  4 a6 quattro              3
##  5 altima                  6
##  6 c1500 suburban 2wd      5
##  7 camry                   7
##  8 camry solara            7
##  9 caravan 2wd            11
## 10 civic                   9
## # ℹ 28 more rows

#There are 38 models in the dataset... IF you don't remove the redundant drive train specifiers, that is.
#In order to do that, you need to find every model value with a drive-train specification, then mutate it to remove that sequence of characters from each value.
mpg %>%
  #group_by(model) %>%
  #select(contains("2wd")) %>%
#I was going to use stringr for this but I guess it's corrupt SO I GUESS WE'RE DOING THIS PIECE BY PIECE HAHA SORRY
  mutate(model = recode(model, "c1500 suburban 2wd" = "c1500 suburban")) %>%
  mutate(model = recode(model, "a4 quattro" = "quattro" )) %>%
  mutate(model = recode(model, "a6 quattro" = "quattro" )) %>%
  mutate(model = recode(model, "k1500 tahoe 4wd" = "k1500 tahoe" )) %>%
  mutate(model = recode(model, "caravan 2wd" = "caravan" )) %>%
  mutate(model = recode(model, "dakota pickup 4wd" = "dakota pickup" )) %>%
  mutate(model = recode(model, "durango 4wd" = "durango" )) %>%
  mutate(model = recode(model, "ram 1500 pickup 4wd" = "ram 1500 pickup" )) %>%
  mutate(model = recode(model, "expedition 2wd" = "expedition" )) %>%
  mutate(model = recode(model, "explorer 4wd" = "explorer" )) %>%
  mutate(model = recode(model, "f150 pickup 4wd" = "f150 pickup" )) %>%
  mutate(model = recode(model, "grand cherokee 4wd" = "grand cherokee" )) %>%
  mutate(model = recode(model, "navigator 2wd" = "navigator" )) %>%
  mutate(model = recode(model, "mountaineer 4wd" = "mountaineer" )) %>%
  mutate(model = recode(model, "pathfinder 4wd" = "pathfinder" )) %>%
  mutate(model = recode(model, "forester awd" = "forester" )) %>%
  mutate(model = recode(model, "impreza awd" = "impreza" )) %>%
  mutate(model = recode(model, "4runner 4wd" = "4runner" )) %>%
  mutate(model = recode(model, "land cruiser wagon 4wd" = "land cruiser wagon" )) %>%
  mutate(model = recode(model, "toyota tacoma 4wd" = "toyota tacoma" )) %>%
  group_by(model) %>%
  summarize(modelN = n())

## # A tibble: 37 × 2
##    model          modelN
##    <chr>           <int>
##  1 4runner             6
##  2 a4                  7
##  3 altima              6
##  4 c1500 suburban      5
##  5 camry               7
##  6 camry solara        7
##  7 caravan            11
##  8 civic               9
##  9 corolla             5
## 10 corvette            5
## # ℹ 27 more rows

#Removing the drive strings from the model (in the least efficient way possible, I am SO SORRY) only reduced the total model count by... 1. That's a lot of work for nothing.
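
# A possible shortcut that avoids stringr: base R's sub() with a regular expression.
# This is only a sketch. The pattern assumes the specifier is always the last word, and it treats "quattro" as a drive-train specifier too, so "a4 quattro" becomes "a4" and "a6 quattro" becomes "a6" (not "quattro", as in the recode chain above).

```r
library(dplyr)
library(ggplot2)

# Strip a trailing " 2wd", " 4wd", " awd", or " quattro" from each model name
drive_suffix <- " (2wd|4wd|awd|quattro)$"

model_counts <- mpg %>%
  mutate(model = sub(drive_suffix, "", model)) %>%
  dplyr::count(model, name = "modelN")
model_counts
```

# dplyr::count() is written out in full because MASS and nlme mask some dplyr functions in this session.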

ggplot(mpg, aes(displ,cty))+
  geom_point()

ggplot(mpg, aes(displ,hwy))+
  geom_point()

# 1:
ggplot(mpg,aes(cty,hwy))+
  geom_point()

# As city mpg increases, highway mpg also increases. It doesn't look like a 1:1 correlation though. The problem with drawing conclusions by comparing the two is that they're both dependent variables, so it indicates that SOMETHING is making them different, but doesn't give any clues as to what exactly that is.

#2:
ggplot(mpg,aes(model,manufacturer)) + geom_point()

#This graph shows you which cars are manufactured by which company. The problem is that (as far as I can tell) there is not a single model on there that is manufactured by two different companies, and using a dot-plot for this makes it really hard to read, so it's not useful. Even if you're trying to see how many cars of a certain model are being manufactured, the dotplot is bad for that because it's all categorical variables and all dots just coalesce at the same x-y coordinates.
# You MIGHT get something more useful if you use a heatmap. 
ggplot(mpg,aes(model, manufacturer))+geom_hex()

#This heatmap shows the model-manufacturer intersection AND color-coordinates the frequency of each instance. The x-axis labels are still cluttered and nasty-looking, but it's better than a dotplot!
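
# Another option, sketched here as an alternative to geom_hex(): count each model-manufacturer pair first, then draw the counts with geom_tile().
# Rotating the x-axis labels is one way to deal with the clutter. The name pair_counts is made up for this example.

```r
library(dplyr)
library(ggplot2)

# One row per manufacturer-model pair, with the number of cars in n
pair_counts <- mpg %>% dplyr::count(manufacturer, model)

tile_plot <- ggplot(pair_counts, aes(model, manufacturer, fill = n)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
tile_plot
```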

#3:
##Plot 1 is a plot that maps the city MPG and highway MPG (aesthetics) of each observation to a dotplot (geom_point()) layer.
ggplot(mpg,aes(cty,hwy))+geom_point()

##Plot 2 uses the diamonds dataset, which records the prices of round-cut diamonds. It maps carat to the x-axis and price to the y-axis (aesthetics), then displays each observation as a dot (geom_point()).
ggplot(diamonds,aes(carat,price))+geom_point()

##It also looks like someone spilled pen ink all over a desk.
##Plot 3 is a line graph showing the number of unemployed people (the unemploy column, in thousands) over time. The aesthetics are date (x) and unemploy (y), and geom_line() is the layer that draws the line.
ggplot(economics,aes(date,unemploy))+geom_line()

##Plot 4 is a histogram showing the number of instances of certain cty (miles-per-gallon in a city) values in the mpg dataset.
ggplot(mpg,aes(cty))+geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

mpg

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

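#So if you put color="blue" inside aes() in geom_point(), ggplot2 treats "blue" as a data value: it shows up as a legend entry and the points get the default palette color, not blue. Setting color="blue" outside aes() is what actually colors the points blue.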
ggplot(mpg,aes(displ,hwy))+geom_point(aes(color="blue"))

ggplot(mpg,aes(displ,hwy))+geom_point(color="blue")

# 1:
ggplot(mpg,aes(displ,hwy,shape=fl))+geom_point(color="darkred")

ggplot(mpg,aes(displ,hwy))+geom_point(color="darkred",shape="square")

ggplot(mpg,aes(displ,hwy))+geom_point(aes(shape="square", color="darkred"))

#"Shape" can't be mapped to a continuous variable, only a discrete one. Mapping it in the ggplot() call gives each value of the discrete variable its own shape; setting it as an argument to geom_point() makes every point the same shape; putting a constant inside aes() in geom_point() treats that constant as a data value, so it ends up in the key instead.
#This probably applies to most (if not all) other ggplot aesthetics
ggplot(mpg,aes(displ,hwy,shape=fl,color=cty))+geom_point(size=2)

#It looks like "color" can be mapped to a continuous variable though
ggplot(mpg,aes(displ,hwy,size=cty))+geom_point(color="blue")

#And so can "size"
#It looks like when the same aesthetic is specified twice, a constant set in the geom overrides the mapping from the ggplot() call
ggplot(mpg,aes(displ,hwy,color=cty))+geom_point(color="yellow")

#2: 
#ggplot(mpg,aes(displ,hwy,shape=cty))+geom_point()
#Mapping a continuous variable to the "shape" aesthetic throws the error: "A continuous variable cannot be mapped to the shape aesthetic."

#ggplot(mpg,aes(displ,hwy,shape=trans))+geom_point()
# This displays a plot with some shapes left empty and gives you the following warnings:
   # "Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6 becomes difficult to discriminate ℹ you have requested 10 values. Consider specifying shapes manually if you need that many of them."
   # "Warning: Removed 96 rows containing missing values or values outside the scale range (`geom_point()`)."
#These are warnings, not errors. They show up because trans has 10 discrete values, more than the 6 shapes in the default palette
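
# The warning suggests specifying shapes manually. A minimal sketch of that, assuming the numeric shape codes 0 to 9 are acceptable choices:

```r
library(ggplot2)

n_trans <- length(unique(mpg$trans))  # number of shapes needed

shape_plot <- ggplot(mpg, aes(displ, hwy, shape = trans)) +
  geom_point() +
  scale_shape_manual(values = seq_len(n_trans) - 1)  # shape codes 0, 1, 2, ...
shape_plot
```

# With one shape per trans value, no rows should be dropped, though that many shapes is still hard to read.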

#3:
ggplot(mpg,aes(displ,hwy,shape=drv, color=drv))+geom_point()

#It looks like four-wheel drive vehicles tend to have the lowest highway miles-per-gallon compared to front-wheel and rear-wheel drive vehicles. However, it looks like rear-wheel drive vehicles have higher engine displacement than front-wheel drive ones.
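
# One way to check these impressions against the numbers, as a sketch: average displ and hwy within each drive train.
# The column names mean_displ, mean_hwy and n_cars are made up for this example.

```r
library(dplyr)
library(ggplot2)

drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(
    mean_displ = mean(displ),  # average engine displacement (litres)
    mean_hwy = mean(hwy),      # average highway miles per gallon
    n_cars = n()
  )
drv_summary
```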

ggplot(mpg,aes(displ,hwy,size=displ, color=class))+geom_point()

#There's a loose correlation between fuel economy, engine size, and class. Larger engine displacement seems to correlate to lower miles-per-gallon, and the colors show that vehicles in similar classes have loosely similar engine displacement and fuel economy.

#Install mgcv here!

#install.packages("mgcv")
#library(mgcv)

#2.6.1 exercises

## Who up smoothing they lines???
## "loess" works for small datasets, "gam" works best for large ones; loess needs O(n^2) memory, so it doesn't scale well. "rlm" fits a line like lm does, but uses a robust algorithm so that outliers affect the fit less
#mgcv is required to use the "gam" smooth lines. MASS is required for the "rlm" fit

#There aren't any exercises in 2.6.1 so I'll just try making the graphs

ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Okay, the graph for this one looks EXTREMELY weird, why'd it convert every number to e-notation?
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(span=0.1)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 3.1

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 1.8725e-16

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 0.01

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 3.1

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.2

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 1.8725e-16

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 0.01

#However, this one looks normal. Was the previous one an example of the loess algorithm breaking down at larger datasets? Did I make it too wiggly?
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(span=1)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#It's fine at 0.2
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(span=0.2)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#Let's try 0.19
#I got the same warning message at 0.19 span; so maybe this is an example of the limits of loess?
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(span=0.19)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 2.9291e-16

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at 2

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.2

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 2.9291e-16

#It looks like loess starts to break down and spit out warning messages if you go below 0.2; then if you go below 0.16, it breaks entirely. Weird!
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(span=0.15)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at 2.4

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 0.2

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 4.1964e-17

## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 0.01

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## 2.4

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 0.2

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 4.1964e-17

## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 0.01
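
# A possible explanation for the warnings above (an assumption, not something confirmed here): displ has many repeated values, so a small span leaves too few distinct x values in each local neighborhood.
# The sketch below fits loess() directly, outside of geom_smooth(), and counts the warnings at each span. The helper name loess_warnings is made up.

```r
library(ggplot2)

# Collect the warnings raised while fitting loess at a given span
loess_warnings <- function(span) {
  messages <- character(0)
  withCallingHandlers(
    loess(hwy ~ displ, data = mpg, span = span),
    warning = function(w) {
      messages <<- c(messages, conditionMessage(w))
      invokeRestart("muffleWarning")  # keep going after recording the warning
    }
  )
  messages
}

length(unique(mpg$displ))  # distinct displ values among all rows
sapply(c(0.15, 0.19, 0.2, 1), function(s) length(loess_warnings(s)))
```

# Only the fitting step is checked here; geom_smooth() also predicts along a grid, which is where the predLoess warnings come from.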

#Let's try gam instead now!
#It uses the default formula `y ~ s(x, bs="cs")` if you don't enter a specific formula.
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(method="gam")

## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

#Linear model...
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

#ROBUST linear model (requires MASS package)
ggplot(mpg,aes(displ,hwy))+
  geom_point()+
  geom_smooth(method="rlm")

## `geom_smooth()` using formula = 'y ~ x'

#2.6.6 EXERCISES

#1.) 
ggplot(mpg,aes(cty,hwy))+geom_point()

#The main problem is that it's two dependent variables taken out of context. If we want to compare how city and highway stack up for individual data points, we'd probably need some sort of third dimension to compare them?
#A time series plot doesn't work, a density plot doesn't compare the data points together, so I'd say that a faceted plot is our best bet here?
#Wait, I can't find anything to facet two separate variables, uh...
#Maybe I need to add some jitter to see if certain values occur multiple times?
ggplot(mpg,aes(cty,hwy))+geom_jitter(color="red")
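
# To see how much overlap the jitter is revealing, one option is to count how many cars share each (cty, hwy) pair. This is a sketch; the names overlaps and n_cars are made up.
# geom_count() is a related built-in that sizes each point by that same count.

```r
library(dplyr)
library(ggplot2)

# Number of cars sitting on each exact (cty, hwy) position
overlaps <- mpg %>%
  dplyr::count(cty, hwy, name = "n_cars") %>%
  arrange(desc(n_cars))
head(overlaps)

# Point size shows how many observations share that position
count_plot <- ggplot(mpg, aes(cty, hwy)) + geom_count()
count_plot
```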

#2.)
#You could reorder "class" by the number of values of each class variable?
ggplot(mpg,aes(class,hwy))+geom_boxplot()

# Let me see if the class counts ARE uneven though
classCounts <- table(mpg$class)
classCounts%>%
  barplot()

#They are definitely uneven. You could reorder them by group size OR by mean hwy
ggplot(mpg,aes(reorder(class,hwy), hwy))+geom_boxplot()

##"The "default" method treats its first argument as a categorical variable, and reorders its levels based on the values of a second variable, usually numeric." It reorders class by hwy
## It reorders based on the mean by default (FUN = mean); pass FUN = median to reorder by the median instead.
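
# A quick check of what reorder() sorts by, as a sketch: compare its level order against group means and medians computed directly with tapply().

```r
library(ggplot2)

by_mean <- levels(reorder(mpg$class, mpg$hwy))                  # default FUN
by_median <- levels(reorder(mpg$class, mpg$hwy, FUN = median))  # explicit median

# Class names sorted by their mean hwy, computed by hand
mean_order <- names(sort(tapply(mpg$hwy, mpg$class, mean)))
identical(by_mean, mean_order)

# Boxplots ordered by median hwy instead
ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) + geom_boxplot()
```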

#3
ggplot(diamonds,aes(carat))+
  geom_histogram(binwidth=0.02)

ggplot(diamonds,aes(carat))+
  geom_freqpoly(binwidth=0.05)

#I think that smaller binwidths reveal more interesting patterns in the dataset. It's such a large dataset that using a large binwidth hides a lot of information.

#4
ggplot(diamonds,aes(price, color=cut))+
  geom_freqpoly()+
  theme_dark()

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

#It looks like the prices are all on a skewed normal-curve and converge at around the same point, BUT the higher the cut quality, the more representation it has further along the x-axis. The difference in prices between categories is a bit hard to pick out, but it looks like the higher-quality diamonds are slightly more expensive and recorded in the data more often...

ggplot(diamonds,aes(cut, price))+
  geom_violin()+
  theme_dark()

#5:
##geom_violin() shows the density distribution of a continuous variable (y) for each level of a categorical variable (x). Its main upside is that it lets you compare the shape of the distributions side by side, rather than just frequency across a single variable. The downside is that y HAS to be numerical, and the shape depends on a density estimate, which can be hard to interpret.
##geom_freqpoly() and the color aesthetic shows frequency, grouped by a single variable, on a single grid, separated by a second categorical variable. This representation of frequency only shows the frequency of ONE variable. The biggest downside that I'm seeing here is that if you have a huge dataset, then the trend-lines might overlap each other and make it hard to read, especially if they follow approximately the same curve.
##geom_histogram()+facet_wrap() solves the overlap problem, but it DOES make all of your graphs smaller, so you have to be prepared to compress bins/ranges. 
##If you wanted to solve the size problem with facetting, you COULD make multiple histograms separately, but that would be a pain to code unless you made it a function call...
##...You could also try making multiple geom_point()+geom_smooth() graphs, possibly color-coded point lines?

#6:
##It looks like the weight aesthetic makes the bar height proportional to the sum of the weights of each group? let's try that and see what it does
ggplot(diamonds,aes(cut))+
  geom_bar()

ggplot(diamonds,aes(cut))+
  geom_bar(aes(weight=price))

##I do not understand what this does
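
# A sketch of one way to see what weight does: compare the bar heights from the weighted plot with the total price per cut computed by hand.
# If they match, each bar is the sum of price within that cut rather than the number of diamonds.

```r
library(ggplot2)

p_weighted <- ggplot(diamonds, aes(cut)) + geom_bar(aes(weight = price))

bar_heights <- layer_data(p_weighted)$count  # heights ggplot2 computed
price_totals <- as.numeric(tapply(diamonds$price, diamonds$cut, sum))

all.equal(bar_heights, price_totals)
```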
 
#7:
##model-to-manufacturer distribution. How the hell do I do that?
##This works. Sort of? The biggest problem is that the legend is huge and the colors are way too close together. I would love to fix that at some point.
ggplot(mpg,aes(manufacturer, color=model))+
  geom_bar()

##trans and class distribution. Let me check what the trans distribution is in the first place.
#Okay, there seem to be few enough class types that I could just facet this??
ggplot(mpg,aes(trans, color=class))+
  geom_bar()+
  facet_wrap(~class, ncol=2)

##WAIT, NCOL MEANS "NUMBER OF COLUMNS"? NICE.

##cyl and trans is next. "cyl" is stored as an integer, so ggplot2 treats it as continuous even though it only takes a few values. This means that I could use a freqpoly plot to show how many cylinders the different transmission types have...
ggplot(mpg,aes(cyl, color=trans))+
  geom_freqpoly(binwidth=1)

##It... kind of works. Let's see what happens if it's a facetted histogram?

ggplot(mpg,aes(cyl, color=trans))+
  geom_histogram(binwidth=1)+
  facet_wrap(~trans, ncol=4)

#1:
## geom_point() to make a scatterplot
## geom_line() for a line chart
## geom_histogram() for a histogram
## geom_bar() for a bar chart of counts, or geom_bar(stat="identity") for a bar chart of values already in the data
## I'm really not sure how you draw a pie chart; maybe geom_polygon() while passing the correct parameters?
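
# One common way to get a pie chart in ggplot2, sketched here: draw a single stacked bar and bend it into a circle with coord_polar().
# Using class as the fill is just an example choice.

```r
library(ggplot2)

pie <- ggplot(mpg, aes(x = factor(1), fill = class)) +
  geom_bar(width = 1) +       # one stacked bar, one segment per class
  coord_polar(theta = "y")    # map bar height to angle
pie
```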

#2:
##geom_path() only draws lines between the vertices specified; geom_polygon() fills in the spaces between those vertices.
##geom_line() connects the points in left-to-right order, geom_path() connects them in the order that they appear in the data.

#3:
##geom_smooth() appears to be geom_line() for the fitted curve, plus geom_ribbon() for the shaded confidence band. The line probably uses a whole bunch of vertices though.
##geom_boxplot() uses geom_point(), appears to use geom_tile() or geom_raster(), and definitely uses geom_path().
##geom_violin() appears to use geom_polygon() and geom_path().

DATAVIS Exercises

Griffin Carson

2026-03-01
