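# FUN FACT: R won't knit a document unless EVERY package that you use in the code is loaded with library() in the document itself, because knitting runs in a fresh R session. (Installing only has to happen once per machine; I'm doing it here anyway.) AUGH.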
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("mgcv")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("MASS")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
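Re-installing every package on each knit is slow. A minimal sketch of one way around it, assuming it is fine to check for each package first; `missing_packages()` and `install_if_missing()` are hypothetical helper names that are not part of the lab:

```r
# Hypothetical helper (not part of the original lab): returns the packages
# from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

# Install only what is missing; invisibly returns the names it tried to install.
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Usage (commented out so that knitting never triggers a download by accident):
# install_if_missing(c("ggplot2", "dplyr", "tidyverse", "mgcv", "MASS"))
```

The library() calls still have to stay in the document either way, since those are what the knit actually needs.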
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.5 ✔ tibble 3.3.1
## ✔ purrr 1.2.1 ✔ tidyr 1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
## This is mgcv 1.9-4. For overview type '?mgcv'.
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
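The masking messages above matter later: because MASS is loaded after dplyr, a bare select() now means MASS::select(), which does not work on data frames the way the dplyr one does. A small sketch of the usual workaround, assuming the built-in mtcars data just for illustration:

```r
library(dplyr)
library(MASS)  # attached after dplyr, so MASS::select() masks dplyr::select()

# Writing the namespace explicitly always reaches the intended function,
# no matter which package was attached last.
cars_small <- dplyr::select(mtcars, mpg, cyl)
head(cars_small)
```

Another option would be to load MASS before the tidyverse, so that dplyr's version is the one that wins.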
The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.
Create a scatter plot with:
x-axis: carat
y-axis: price
ggplot(diamonds, aes(carat,price))+
geom_point()
Question: What type of relationship appears between carat and price? There’s a positive correlation between carat and price. There are also fewer data points at higher-carat values, presumably because heavier diamonds are rarer, so there is less data on them.
Modify your plot:
Color points by cut
Add a meaningful title and axis labels
Apply theme_minimal()
ggplot(diamonds, aes(carat,price, color=cut))+
geom_point()+
labs(title="Correlation between diamond carat and price",x="Carats",y="Price (USD)")+
theme_minimal()
Question: Which cut appears to have higher prices at similar carat values? There are bigger clusters of yellow towards the top of the graph, so Ideal appears to have higher prices at similar carat values.
Add a regression line:
ggplot(diamonds, aes(carat,price, color=cut))+
geom_point()+
geom_smooth(method="rlm")+
labs(title="Correlation between diamond carat and price",x="Carats",y="Price (USD)")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Question: Does the relationship between carat and price appear linear? Within each cut category, it looks like it is. Across the whole dataset… eh, not so much.
Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?
“lm” fits a linear regression line. The line gets split up by cut because the “color=” aes groups the data; that is not something “lm” does on its own.
“rlm” fits a robust linear model, which down-weights outliers instead of letting them drag the line around. It requires the MASS package though.
“loess” fits local polynomial regressions, so the line is… kind of wiggly. It doesn’t work well for huge datasets though. I tried it on this diamond dataset and it took over a minute to load, because its memory use grows roughly with the square of the number of points.
“gam” uses a generalized additive model (from the mgcv package) and draws a slightly wiggly line.
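A small check of the “rlm” claim, as a sketch on made-up data rather than the diamonds set: the true slope is 2, and one point is deliberately turned into a gross outlier. The data, the seed, and the variable names are all assumptions for illustration.

```r
library(MASS)  # provides rlm()

set.seed(42)
toy <- data.frame(x = 1:50)
toy$y <- 2 * toy$x + rnorm(50)  # true slope is 2
toy$y[50] <- 500                # one gross outlier (would be about 100)

# Ordinary least squares: the outlier pulls the slope upwards.
slope_lm <- coef(lm(y ~ x, data = toy))[["x"]]

# Robust regression: large residuals get down-weighted.
slope_rlm <- coef(rlm(y ~ x, data = toy))[["x"]]

c(lm = slope_lm, rlm = slope_rlm)
```

The rlm slope should land noticeably closer to 2 than the lm slope, which is the same behaviour geom_smooth(method="rlm") relies on.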
Because the dataset is large, reduce overplotting by:
Adjusting alpha
Changing point size
Trying geom_jitter()
ggplot(diamonds, aes(carat,price, color=cut))+
geom_jitter(size=0.5, alpha=0.5)+
geom_smooth(method="rlm")+
labs(title="Correlation between diamond carat and price",x="Carats",y="Price (USD)")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
##It doesn't look like geom_jitter() made much of a difference to be honest.
Question: Why is overplotting a concern with large datasets? More points means more visual clutter. Points stack on top of each other, so it gets hard to tell how dense a region is or to pick out individual data points.
Question: What does the alpha command do and how does it help with overplotting? The alpha argument adjusts the opacity of the points. In larger datasets, overlapping points build up into darker areas, so dense regions are easier to pick out, and it doesn’t actually remove any data from the set.
Question: Based on what you see, what are the risks and benefits of using geom_jitter? The benefit is that it spreads out points that sit on the same spot, which helps reduce visual clutter. The risk is that jitter adds random noise to the plotted positions, so if your dataset has very small differences between individual data points, jitter can make your graph less accurate to your data.
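Two other ways to tame overplotting, sketched here as options rather than as what the assignment asked for: plot a random sample of the rows, or bin the points and show counts. The sample size of 5000, the seed, and the bin count are arbitrary assumptions.

```r
library(ggplot2)

set.seed(123)
# Option 1: plot a random subset. Fewer points means less stacking, and it
# also makes slow smoothers such as "loess" practical again.
diamonds_sample <- diamonds[sample(nrow(diamonds), 5000), ]
p_sample <- ggplot(diamonds_sample, aes(carat, price)) +
  geom_point(alpha = 0.3, size = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x)

# Option 2: keep every row but draw counts per 2D bin instead of points.
p_bins <- ggplot(diamonds, aes(carat, price)) +
  geom_bin_2d(bins = 60)

p_sample
p_bins
```

Sampling does throw data away in the picture, so it is worth checking that the overall shape matches the full plot.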
Create a scatter plot:
table vs price
Points colored by clarity
Facet by cut (we learn a lot more about this later, but just give it a try!)
ggplot(diamonds, aes(table,price, color=clarity))+
geom_jitter(size=0.1, alpha=0.25)+
geom_smooth(method="rlm", linewidth=0.2)+
labs(title="Correlation between diamond table and price")+
theme_minimal()+
facet_wrap(vars(cut))
## `geom_smooth()` using formula = 'y ~ x'
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
Question: Does the relationship differ by cut? Yeah, but I can barely see HOW it differs because the TABLE IS SO BADLY FORMATTED. AUGH.
The economics dataset contains monthly US economic data over time.
Create a line plot:
x-axis: date
y-axis: unemploy
ggplot(economics,aes(date,unemploy))+
geom_line()
Question: Describe the overall trend over time.
Reshape the data using pivot_longer() to plot:
uempmed
psavert
Then create a multi-line plot with:
economics
## # A tibble: 574 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018
## 7 1968-01-01 531. 199808 11.7 5.1 2878
## 8 1968-02-01 534. 199920 12.3 4.5 3001
## 9 1968-03-01 544. 200056 11.7 4.1 2877
## 10 1968-04-01 544 200208 12.3 4.6 2709
## # ℹ 564 more rows
#How on earth do I reshape this data? Why do I need to reshape it?
#Do I need to reshape the data so that uempmed and psavert are in separate rows/observations? And then I can use the line command to connect the points by category?
#Okay so I want to:
##1.) remove all unnecessary columns (pce, pop, unemploy) from the dataset
##2.) Use pivot_longer() to put psavert and uempmed into a single column, then indicate whether an observation is uempmed or psavert by a categorical variable in a separate column
##3.) Call the tidied dataset up
economicsLong <- economics |> pivot_longer(cols = c(uempmed, psavert), names_to="newStatType",values_to="newStats",values_drop_na=TRUE)
economicsLong
## # A tibble: 1,148 × 6
## date pce pop unemploy newStatType newStats
## <date> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1967-07-01 507. 198712 2944 uempmed 4.5
## 2 1967-07-01 507. 198712 2944 psavert 12.6
## 3 1967-08-01 510. 198911 2945 uempmed 4.7
## 4 1967-08-01 510. 198911 2945 psavert 12.6
## 5 1967-09-01 516. 199113 2958 uempmed 4.6
## 6 1967-09-01 516. 199113 2958 psavert 11.9
## 7 1967-10-01 512. 199311 3143 uempmed 4.9
## 8 1967-10-01 512. 199311 3143 psavert 12.9
## 9 1967-11-01 517. 199498 3066 uempmed 4.7
## 10 1967-11-01 517. 199498 3066 psavert 12.8
## # ℹ 1,138 more rows
ggplot(economicsLong, aes(date,newStats, color=newStatType))+
geom_line()
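The pivot above keeps pce, pop, and unemploy, so step 1 of the plan never actually happened. A sketch of the full three-step version, assuming the column names `series` and `value` (my own placeholder names) and using dplyr::select() explicitly because MASS masks select():

```r
library(ggplot2)  # provides the economics dataset
library(dplyr)
library(tidyr)

economics_tidy <- economics |>
  # Step 1: keep only the columns needed for the plot.
  dplyr::select(date, uempmed, psavert) |>
  # Step 2: stack the two measures into one column, labelled by `series`.
  pivot_longer(
    cols = c(uempmed, psavert),
    names_to = "series",
    values_to = "value"
  )

# Step 3: call the tidied dataset up.
economics_tidy
```

Each date now has two rows, one per series, which is what lets color= draw one line per measure.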
Question: Do these variables appear to move together over time? It looks like they have a LOOSE inverse correlation. So as the average personal savings rate went down, the median duration of unemployment went up. Uhm. Uh. That’s worrying?
Enhance your plot by:
Changing line width
Customizing colors
Formatting the date axis
Adding title, subtitle, and caption
Applying a theme (theme_bw() or theme_classic())
ggplot(economicsLong, aes(date,newStats, color=newStatType))+
theme_dark()+
geom_line(linewidth=0.5)+
theme(axis.text.x=element_text(angle=0,hjust=2))+
scale_color_viridis_d()+
labs(title="Personal Savings Rate & Median Unemployment Duration in the US, 1967-2015", x="Year",y="Percent (psavert) or weeks (uempmed)", subtitle="psavert=Personal Savings Rate (%), uempmed=Median Duration of Unemployment (weeks)")
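The task list also asks for a formatted date axis, custom colors, and a caption. One possible version is sketched below; the two hex colors, the 10-year breaks, the caption wording, and the names `econ_long` and `p_econ` are my own assumptions, not part of the assignment.

```r
library(ggplot2)
library(tidyr)

# Rebuild the long data so this chunk runs on its own.
econ_long <- pivot_longer(
  economics[, c("date", "uempmed", "psavert")],
  cols = c(uempmed, psavert),
  names_to = "series",
  values_to = "value"
)

p_econ <- ggplot(econ_long, aes(date, value, color = series)) +
  geom_line(linewidth = 0.5) +
  # Date axis: one tick per decade, labelled with the four-digit year.
  scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
  # Custom colors plus readable legend labels.
  scale_color_manual(
    values = c(psavert = "#1b9e77", uempmed = "#d95f02"),
    labels = c(psavert = "Personal savings rate (%)",
               uempmed = "Median unemployment duration (weeks)")
  ) +
  labs(
    title = "Personal Savings Rate & Median Unemployment Duration in the US",
    subtitle = "Monthly values, 1967-2015",
    caption = "Data: ggplot2::economics",
    x = "Year", y = "Percent or weeks", color = NULL
  ) +
  theme_bw()

p_econ
```

Putting the units in the legend labels avoids having to explain the variable names in the subtitle.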