# FUN FACT: knitting runs the document in a fresh R session, so EVERY package that you use in the code has to be loaded with library() in the code itself (and already be installed). AUGH.
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("mgcv")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("MASS")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.5     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## This is mgcv 1.9-4. For overview type '?mgcv'.
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
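Because knitting starts from a clean session, one possible way to keep the install step out of the knitted document is to check which packages are missing first. This is only a sketch of a workflow, not part of the assignment; `missing_packages()` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper (not from any package): returns the names in `pkgs`
# that cannot be found in the current library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "dplyr", "tidyverse", "mgcv", "MASS")
to_install <- missing_packages(needed)
print(to_install)

# Run this once in the console rather than on every knit:
# install.packages(to_install)
```

With something like this, the document itself only needs the `library()` calls.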

Part 1: Scatter Plots (Using diamonds)

The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.

Task 1: Basic Scatter Plot

Create a scatter plot with:

  • x-axis: carat

  • y-axis: price

ggplot(diamonds, aes(carat,price))+
geom_point()

Question: What type of relationship appears between carat and price? There’s a positive correlation between carat and price: heavier diamonds tend to cost more. There are also fewer data points at higher carat values, presumably because heavier diamonds are rarer, so there is less data on them.
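The visual impression can be checked with a couple of numbers. This is a minimal sketch, assuming only that ggplot2 is installed (it provides `diamonds`); the 2-carat cutoff is an arbitrary choice for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation between carat and price
# (values near 1 mean a strong positive linear association).
r <- cor(diamonds$carat, diamonds$price)

# Share of diamonds heavier than 2 carats, to see how sparse
# the right-hand side of the plot really is.
share_over_2 <- mean(diamonds$carat > 2)

print(r)
print(share_over_2)
```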

Task 2: Add Aesthetic Mappings

Modify your plot:

  • Color points by cut

  • Add a meaningful title and axis labels

  • Apply theme_minimal()

ggplot(diamonds, aes(carat,price, color=cut))+
geom_point()+
  labs(title="Correlation between diamond carat and price",x="Carats",y="Price (USD)")+
  theme_minimal()

Question: Which cut appears to have higher prices at similar carat values? There are bigger clusters of yellow (the color assigned to Ideal in the legend) towards the top of the graph, so Ideal appears to have higher prices at similar carat values.

Task 3: Add a Trend Line

Add a regression line:

  • geom_smooth(method = "lm")

ggplot(diamonds, aes(carat,price, color=cut))+
geom_point()+
  geom_smooth(method="rlm")+
  labs(title="Correlation between diamond carat and price",x="Carats",y="Price (USD)")+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Question: Does the relationship between carat and price appear linear? Within each cut, it looks roughly linear. Across the whole dataset… eh, not so much.

Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do? “lm” fits an ordinary least-squares linear regression line. The line gets split up by cut because of the “color=” aes, which groups the data; that happens with every method, not just “lm”. “rlm” fits a robust linear model, which down-weights outliers so they pull on the line less. It requires the MASS package though. “loess” fits a local polynomial regression, so the line is… kind of wiggly. It doesn’t work well for huge datasets though: I tried it on this diamonds dataset and it took over a minute to run, because its time and memory needs grow quickly with the number of points. “gam” fits a generalized additive model and draws a smooth, slightly wiggly line.
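To see what "takes outliers into account" means in practice, here is a small sketch comparing `lm()` and `MASS::rlm()` outside of ggplot. The data are made up for illustration (a line with slope 2 plus one extreme point); none of it comes from `diamonds`.

```r
library(MASS)  # for rlm(); MASS ships with standard R installations

# Made-up data: a straight line (slope 2) with a small wobble
# and a single extreme outlier at x = 15.
x <- 1:20
y <- 1 + 2 * x + rep(c(-0.5, 0.5), times = 10)
y[15] <- y[15] + 100

ols_fit    <- lm(y ~ x)   # ordinary least squares
robust_fit <- rlm(y ~ x)  # robust fit (Huber M-estimation by default)

ols_slope    <- unname(coef(ols_fit)["x"])
robust_slope <- unname(coef(robust_fit)["x"])

print(c(ols = ols_slope, robust = robust_slope))
```

The ordinary fit gets dragged towards the outlier, while the robust fit should stay closer to the slope used to generate the data.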

Task 4: Improve Visualization

Because the dataset is large, reduce overplotting by:

  • Adjusting alpha

  • Changing point size

  • Trying geom_jitter()

ggplot(diamonds, aes(carat,price, color=cut))+
geom_jitter(size=0.5, alpha=0.5)+
  geom_smooth(method="rlm")+
  labs(title="Correlation between diamond carat and price",x="Carats",y="Price (USD)")+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

##It doesn't look like geom_jitter() made much of a difference to be honest.

Question: Why is overplotting a concern with large datasets? More points means more visual clutter. Points get drawn on top of each other, which hides how dense the data is and makes individual data points harder to pick out.

Question: What does the alpha command do and how does it help with overplotting? The alpha argument adjusts the opacity of the points. In larger datasets, this means that areas where many points overlap look darker, so dense regions are easier to pick out, and no data is removed from the set.

Question: Based on what you see, what are the risks and benefits of using geom_jitter? The benefit is that it spreads out points that sit on top of each other, which reduces visual clutter. However, jitter adds random noise to the plotted positions, so if your dataset has very small differences between individual data points, it can make your graph less accurate to your data.
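Another option the task does not list, shown here only as a sketch, is to stop drawing individual points and plot counts per grid cell with `geom_bin2d()`. The choice of 60 bins is an arbitrary assumption.

```r
library(ggplot2)

# Bin the points into a grid and color each cell by how many diamonds fall in it.
p_binned <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 60) +
  labs(title = "Diamond carat vs. price (binned counts)",
       x = "Carats", y = "Price (USD)", fill = "Diamonds") +
  theme_minimal()

p_binned
```

Unlike alpha, this trades away the individual points entirely in exchange for a direct view of density.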

Task 5: Challenge Scatter Plot

Create a scatter plot:

  • table vs price

  • Points colored by clarity

  • Facet by cut (we learn a lot more about this later, but just give it a try!)

ggplot(diamonds, aes(table,price, color=clarity))+
geom_jitter(size=0.1, alpha=0.25)+
  geom_smooth(method="rlm", linewidth=0.2)+
  labs(title="Correlation between diamond table and price")+
  theme_minimal()+
  facet_wrap(vars(cut))
## `geom_smooth()` using formula = 'y ~ x'
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

Question: Does the relationship differ by cut? Yeah, but I can barely see HOW it differs because the PLOT IS SO CLUTTERED: with eight clarity colors and a trend line for each one in every panel, everything overlaps. AUGH.
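One possible way to make the panels easier to compare, sketched here as an assumption rather than the expected answer, is to map color only in the point layer so that each facet gets a single trend line. This version swaps in plain "lm" instead of "rlm", and the axis labels are my own wording.

```r
library(ggplot2)

# Color only the points; the smooth layer has no color mapping,
# so it fits one line per facet instead of one per clarity level.
p_facets <- ggplot(diamonds, aes(table, price)) +
  geom_point(aes(color = clarity), size = 0.1, alpha = 0.25) +
  geom_smooth(method = "lm", formula = y ~ x,
              color = "black", linewidth = 0.4) +
  facet_wrap(vars(cut)) +
  labs(title = "Diamond table vs. price, by cut",
       x = "Table (%)", y = "Price (USD)") +
  theme_minimal()

p_facets
```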

Part 2: Line Plots (Using economics Dataset)

The economics dataset contains monthly US economic data over time.

Task 6: Basic Line Plot

Create a line plot:

  • x-axis: date

  • y-axis: unemploy

ggplot(economics,aes(date,unemploy))+
  geom_line()

Question: Describe the overall trend over time.

Task 7: Multiple Lines on One Plot

Reshape the data using pivot_longer() to plot:

  • uempmed

  • psavert

Then create a multi-line plot with:

  • color = variable

economics
## # A tibble: 574 × 6
##    date         pce    pop psavert uempmed unemploy
##    <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
##  1 1967-07-01  507. 198712    12.6     4.5     2944
##  2 1967-08-01  510. 198911    12.6     4.7     2945
##  3 1967-09-01  516. 199113    11.9     4.6     2958
##  4 1967-10-01  512. 199311    12.9     4.9     3143
##  5 1967-11-01  517. 199498    12.8     4.7     3066
##  6 1967-12-01  525. 199657    11.8     4.8     3018
##  7 1968-01-01  531. 199808    11.7     5.1     2878
##  8 1968-02-01  534. 199920    12.3     4.5     3001
##  9 1968-03-01  544. 200056    11.7     4.1     2877
## 10 1968-04-01  544  200208    12.3     4.6     2709
## # ℹ 564 more rows
#How on earth do I reshape this data? Why do I need to reshape it?
#Do I need to reshape the data so that uempmed and psavert are in separate rows/observations? And then I can use geom_line() to connect the points by category?

#Okay so I want to: 
##1.) remove all unnecessary columns (pce, pop, unemploy) from the dataset (skipped below: the extra columns don't affect the plot)
##2.) Use pivot_longer() to put psavert and uempmed into a single column, then indicate whether an observation is uempmed or psavert by a categorical variable in a separate column
##3.) Call the tidied dataset up 

economicsLong <- economics |> pivot_longer(cols = c(uempmed, psavert), names_to="newStatType",values_to="newStats",values_drop_na=TRUE)

economicsLong
## # A tibble: 1,148 × 6
##    date         pce    pop unemploy newStatType newStats
##    <date>     <dbl>  <dbl>    <dbl> <chr>          <dbl>
##  1 1967-07-01  507. 198712     2944 uempmed          4.5
##  2 1967-07-01  507. 198712     2944 psavert         12.6
##  3 1967-08-01  510. 198911     2945 uempmed          4.7
##  4 1967-08-01  510. 198911     2945 psavert         12.6
##  5 1967-09-01  516. 199113     2958 uempmed          4.6
##  6 1967-09-01  516. 199113     2958 psavert         11.9
##  7 1967-10-01  512. 199311     3143 uempmed          4.9
##  8 1967-10-01  512. 199311     3143 psavert         12.9
##  9 1967-11-01  517. 199498     3066 uempmed          4.7
## 10 1967-11-01  517. 199498     3066 psavert         12.8
## # ℹ 1,138 more rows
ggplot(economicsLong, aes(date,newStats, color=newStatType))+
  geom_line()
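For the column-dropping step in the plan, a minimal sketch could look like the following. It assumes dplyr and tidyr are installed; `economics_long`, `series`, and `value` are names picked for this example. The `dplyr::` prefix matters in this document because loading MASS masks `select()`.

```r
library(ggplot2)  # economics dataset
library(tidyr)    # pivot_longer()

economics_long <- economics |>
  # Namespaced call, because MASS::select() masks dplyr::select() here.
  dplyr::select(date, uempmed, psavert) |>
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "series", values_to = "value")

economics_long
```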

Question: Do these variables appear to move together over time? It looks like they have a LOOSE inverse correlation. So as the average personal savings rate went down, the median duration of unemployment went up. Uhm. Uh. That’s worrying?
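A quick numeric check of that impression, offered as a sketch: the Pearson correlation between the two columns. Both series also drift over the decades, so a correlation like this says nothing about one causing the other.

```r
library(ggplot2)  # economics dataset

# A negative value is consistent with the two series
# tending to move in opposite directions.
r_econ <- cor(economics$psavert, economics$uempmed)
print(r_econ)
```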

Task 8: Customize Your Line Plot

Enhance your plot by:

  • Changing line width

  • Customizing colors

  • Formatting the date axis

  • Adding title, subtitle, and caption

  • Applying a theme (theme_bw() or theme_classic())

ggplot(economicsLong, aes(date,newStats, color=newStatType))+
  theme_dark()+
  geom_line(linewidth=0.5)+
  theme(axis.text.x=element_text(angle=0,hjust=2))+
  scale_color_viridis_d()+
  labs(title="Personal Savings Rate & Median Unemployment Duration in the US, 1967-2015", x="Year",y="Percent (psavert) or Weeks (uempmed)", subtitle="psavert=Personal Savings Rate (%), uempmed=Median Duration of Unemployment (Weeks)", caption="Source: economics dataset (ggplot2)")
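For the "formatting the date axis" and "customizing colors" items, one hedged alternative is `scale_x_date()` plus `scale_color_manual()`. The 10-year breaks, the two colors, and the legend labels below are arbitrary choices for illustration, and `economics_long`, `series`, and `value` are names made up for this sketch.

```r
library(ggplot2)
library(tidyr)

economics_long <- economics |>
  pivot_longer(cols = c(uempmed, psavert),
               names_to = "series", values_to = "value")

p_econ <- ggplot(economics_long, aes(date, value, color = series)) +
  geom_line(linewidth = 0.5) +
  # One tick per decade, labelled with the four-digit year.
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  # Hand-picked colors and readable legend labels for the two series.
  scale_color_manual(
    values = c(psavert = "steelblue", uempmed = "darkorange"),
    labels = c(psavert = "Personal savings rate (%)",
               uempmed = "Median unemployment duration (weeks)")
  ) +
  labs(x = "Year", y = "Percent or weeks", color = NULL) +
  theme_bw()

p_econ
```

Putting the units into the legend labels avoids having to explain the variable names in a subtitle.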