The diamonds dataset contains information about ~54,000 diamonds, including price, carat, cut, clarity, and color.
Create a scatter plot with:
x-axis: carat
y-axis: price
install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(diamonds)
ggplot(data = diamonds, aes(x = carat, y = price))+
geom_point(size=1)
Question: What type of relationship appears between carat and price?
Answer: There is a positive correlation between carat and price.
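A quick numeric check can back up the visual impression. The sketch below assumes ggplot2 is installed (it ships the diamonds data) and uses base R's cor(); the variable names carat_price_cor and carat_price_rank_cor are illustrative only and do not come from the assignment.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation between carat and price (ranges from -1 to 1)
carat_price_cor <- cor(diamonds$carat, diamonds$price)
print(carat_price_cor)

# Spearman rank correlation is less sensitive to curvature and outliers
carat_price_rank_cor <- cor(diamonds$carat, diamonds$price, method = "spearman")
print(carat_price_rank_cor)
```

A value close to 1 in either measure would agree with the positive relationship seen in the scatter plot.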
Modify your plot:
Color points by cut
Add a meaningful title and axis labels
Apply theme_minimal()
ggplot(diamonds,
aes(x=carat,y=price,color=cut))+
geom_point(alpha=0.7)+
theme_minimal()+
labs(title = "Price of Diamonds compared to Carat",
y="Price (USD)",
x="Carat of Diamond")
Question: Which cut appears to have higher prices at similar carat values?
Answer: The “Ideal” diamonds have the highest prices at similar carat values.
## Task 3: Add a Trend Line
Add a regression line:
ggplot(diamonds,
aes(x=carat, y=price)) +
geom_point(aes(color=cut), alpha=0.7) +
geom_smooth(method ="lm",color="black", linewidth=1) +
theme_minimal()+
labs(title = "Price of Diamonds Compared to Carat",
x="Carat of Diamond",
y="Price (USD)",
color="Cut Quality"
)
## `geom_smooth()` using formula = 'y ~ x'
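With method = "lm" and the formula y ~ x, the black line in the plot is an ordinary least squares fit of price on carat. As a minimal sketch, the same model can be fit directly with base R's lm() to inspect its numbers; the name price_fit is chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Fit the same straight-line model that geom_smooth(method = "lm") draws
price_fit <- lm(price ~ carat, data = diamonds)

# Intercept and slope of the fitted line
print(coef(price_fit))

# Share of the variation in price explained by carat alone
print(summary(price_fit)$r.squared)
```

Looking at the slope and R-squared alongside the plot gives a rough sense of how well a straight line describes the data.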
Question: Does the relationship between carat and price appear linear?
Answer: Yes. After adding the regression line, there is a clear positive relationship: price increases as carat increases.
Question: What does the “lm” option do in the geom_smooth command? What are the other options and what do they do?
Answer: The “lm” option tells geom_smooth to fit a linear model, so the smoother is a straight line. The color argument makes the line black so it stands out from the colored points, and linewidth controls how thick the line is. Other method values include “loess” (a locally weighted smooth curve), “gam” (a generalized additive model), and “glm” (a generalized linear model).
## Task 4: Improve Visualization
Because the dataset is large, reduce overplotting by:
Adjusting alpha
Changing point size
Trying geom_jitter()
ggplot(diamonds,
aes(x=carat, y=price)) +
geom_jitter(aes(color=cut), alpha=0.5, size=1.5) +
geom_smooth(method ="lm",color="black", linewidth=.5) +
theme_minimal()+
labs(title = "Price of Diamonds Compared to Carat",
x="Carat of Diamond",
y="Price (USD)",
color="Cut Quality"
)
## `geom_smooth()` using formula = 'y ~ x'
Question: Why is overplotting a concern with large datasets?
Answer: When many points overlap, the figure becomes overwhelming and harder to read.
Question: What does the alpha command do and how does it help with overplotting?
Answer: alpha sets the transparency of each data point. Overlapping points then build up into darker areas, which makes dense regions easier to see and reduces clutter.
Question: Based on what you see, what are the risks and benefits of using geom_jitter?
Answer: Jitter can help declutter figures, but too much of it can make the plot look weird. It could also have the opposite effect and make it seem like you don't have enough data.
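One way to keep jitter under control is to set its width and height explicitly and to fix a seed so the plot is reproducible. The sketch below also plots a random subset of the data, which is another common way to reduce overplotting. The sample size of 5000, the jitter width of 0.01, and the names diamonds_sample and p_jitter are arbitrary choices made for illustration, not part of the assignment.

```r
library(ggplot2)
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Plot a random subset to thin out the point cloud (5000 is an arbitrary choice)
diamonds_sample <- diamonds %>% slice_sample(n = 5000)

p_jitter <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  # Small horizontal jitter only; height = 0 leaves prices unchanged
  geom_point(
    aes(color = cut), alpha = 0.3, size = 1,
    position = position_jitter(width = 0.01, height = 0, seed = 42)
  ) +
  theme_minimal()

print(p_jitter)
```

Keeping height = 0 means the plotted prices are still the real prices, which limits the risk of jitter distorting the data.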
Create a scatter plot:
table vs price
Points colored by clarity
Facet by cut (we learn a lot more about this later, but just give it a try!)
ggplot(diamonds,
aes(x=table,y=price,color=clarity))+
geom_point(alpha=0.7)+
facet_grid(~cut)+
theme_minimal()+
labs(title = "Price of Diamonds compared to table",
y="Price (USD)",
x="Table of Diamond")
Question: Does the relationship differ by cut?
Answer: The better the cut, the tighter the spread of price. This can most easily be seen when comparing the Fair diamonds with the Ideal diamonds.
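A visual impression of spread can be checked with a per-group summary. The following is a sketch only, using dplyr's group_by() and summarise(); the name price_spread_by_cut and the choice of the interquartile range as the measure of spread are assumptions made here.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Summarise how many diamonds each cut has and how spread out their prices are
price_spread_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(
    n = n(),                       # number of diamonds in this cut
    median_price = median(price),  # typical price
    iqr_price = IQR(price),        # spread of the middle 50% of prices
    .groups = "drop"
  )

print(price_spread_by_cut)
```

The n column matters when reading the facets, because the cuts differ widely in size: there are far fewer Fair diamonds than Ideal ones, which by itself can make a panel look sparser.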
The economics dataset contains monthly US economic data over time.
Create a line plot:
x-axis: date
y-axis: unemploy
data(economics)
ggplot(economics,aes(x=date,y=unemploy))+
geom_line()
Question: Describe the overall trend over time.
Answer: The trend rises overall but ends up falling slightly at the end.
Reshape the data using pivot_longer() to plot:
uempmed
psavert
Then create a multi-line plot with:
install.packages("tidyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(tidyr)
econ_long <- economics %>%
pivot_longer(
cols=c(uempmed, psavert),
names_to="variable",
values_to="value")
ggplot(econ_long, aes(x=date,y=value,color=variable))+
geom_line()
Question: Do these variables appear to move together over time?
Answer: Yes, but in opposite directions: they appear to be inverses of each other.
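Two quick checks can support this step: confirming that the reshape did what was intended, and putting a number on how the two series move together. This is a hedged sketch; econ_long_check and econ_cor are names made up for illustration, and pivot_longer() is called without the pipe so that only ggplot2 and tidyr are needed.

```r
library(ggplot2)  # provides the economics dataset
library(tidyr)

# Reshape the two columns into long format: one row per date and variable
econ_long_check <- pivot_longer(
  economics,
  cols = c(uempmed, psavert),
  names_to = "variable",
  values_to = "value"
)

# Each original row should now appear twice, once per variable
print(nrow(economics))
print(nrow(econ_long_check))

# Correlation between the two series in the original wide data;
# a negative value would indicate that they tend to move in opposite directions
econ_cor <- cor(economics$uempmed, economics$psavert)
print(econ_cor)
```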
Enhance your plot by:
Changing line width
Customizing colors
Formatting the date axis
Adding title, subtitle, and caption
Applying a theme (theme_bw() or theme_classic())
ggplot(econ_long, aes(x = date, y = value, color = variable))+
geom_line(linewidth = .5)+
scale_color_manual(values = c(
"uempmed" = "darkred",
"psavert" = "darkblue"))+
scale_x_date(
date_breaks = "5 years",
date_labels = "%Y")+
theme_classic()+
labs(title = "Unemployment Time and Personal Savings Over Time",
subtitle = "Two important indicators plotted together",
caption = "A multi line graph showing Unemployment Time and Personal Savings compared with each other over time",
x="Date",
y="Value",
color="Variables")
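The date_labels argument of scale_x_date() takes strftime-style format codes, the same ones used by base R's format() for dates. The short sketch below tries a few codes on a single example date to preview how axis labels would look; the date and the variable names are arbitrary and chosen here for illustration.

```r
# An arbitrary example date used only to preview label formats
example_date <- as.Date("2008-01-01")

# "%Y" gives the four-digit year, as used in the plot above
year_label <- format(example_date, "%Y")
print(year_label)

# "%y" gives the two-digit year
short_year_label <- format(example_date, "%y")
print(short_year_label)

# "%b %Y" gives an abbreviated month name plus the year (month names depend on locale)
month_year_label <- format(example_date, "%b %Y")
print(month_year_label)
```

With breaks every 5 years, the year-only format keeps the axis compact; a month-and-year format suits shorter time spans.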