The “tidyverse” is a meta-package. Loading it also loads 8 other packages, including “dplyr” and “ggplot2”.
The “dplyr” package is for data wrangling. It provides a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables.
select() picks variables (i.e. columns) based on their names.
filter() picks cases (i.e. rows) based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
The summarise() function should be used after grouping your data rows with the group_by() function, since a summary is usually done “by group”.
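The verbs listed above are usually chained with the pipe operator. The following is a minimal sketch, assuming the dplyr package is installed; the threshold of 25 and the object names `efficient` and `by_cyl_summary` are made up for illustration.

```r
library(dplyr)

# filter() keeps rows, select() keeps columns
efficient <- mtcars %>%
  filter(mpg > 25) %>%      # only cars with more than 25 miles per gallon
  select(mpg, cyl, wt)      # only these three columns

# Group first, then summarise: one row per value of cyl
by_cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_wt = mean(wt), n_cars = n())

efficient
by_cyl_summary
```

Each verb takes a data frame and returns a new one, which is why the steps can be piped one after another.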
arrange(mtcars, cyl, -disp) # Arrange data in ascending order by cyl, then in descending order by disp
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
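The minus sign works in the call above because disp is numeric. As a hedged alternative, the desc() helper from dplyr expresses the same descending sort and also works for character or factor columns. The object names below are illustrative.

```r
library(dplyr)

sorted_minus <- arrange(mtcars, cyl, -disp)       # numeric trick used above
sorted_desc  <- arrange(mtcars, cyl, desc(disp))  # same ordering, stated explicitly

# Both approaches should give the same row order
identical(rownames(sorted_minus), rownames(sorted_desc))
```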
D = mtcars # Make a copy
mutate(D,
fake = mpg + disp, # A meaningless column called "fake" added to the result (D itself stays unchanged unless you reassign it)
mpg_cat = (mpg>20), # Add another column called "mpg_cat". The column indicates whether mpg > 20
log_disp = log(disp), # Add still another column called "log_disp": A log-transformation to disp
mpg = log(mpg) # Replace the mpg column by its log value
)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 3.044522 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 3.044522 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 3.126761 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 3.063391 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 2.928524 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 2.895912 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 2.660260 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 3.194583 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 3.126761 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 2.954910 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 2.879198 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 2.797281 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 2.850707 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 2.721295 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 2.341806 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 2.341806 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 2.687847 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 3.478158 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 3.414443 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 3.523415 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 3.068053 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 2.740840 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 2.721295 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 2.587764 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 2.954910 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 3.306887 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 3.258097 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 3.414443 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 2.760010 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 2.980619 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 2.708050 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 3.063391 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## fake mpg_cat log_disp
## Mazda RX4 181.0 TRUE 5.075174
## Mazda RX4 Wag 181.0 TRUE 5.075174
## Datsun 710 130.8 TRUE 4.682131
## Hornet 4 Drive 279.4 TRUE 5.552960
## Hornet Sportabout 378.7 FALSE 5.886104
## Valiant 243.1 FALSE 5.416100
## Duster 360 374.3 FALSE 5.886104
## Merc 240D 171.1 TRUE 4.988390
## Merc 230 163.6 TRUE 4.947340
## Merc 280 186.8 FALSE 5.121580
## Merc 280C 185.4 FALSE 5.121580
## Merc 450SE 292.2 FALSE 5.619676
## Merc 450SL 293.1 FALSE 5.619676
## Merc 450SLC 291.0 FALSE 5.619676
## Cadillac Fleetwood 482.4 FALSE 6.156979
## Lincoln Continental 470.4 FALSE 6.131226
## Chrysler Imperial 454.7 FALSE 6.086775
## Fiat 128 111.1 TRUE 4.365643
## Honda Civic 106.1 TRUE 4.326778
## Toyota Corolla 105.0 TRUE 4.264087
## Toyota Corona 141.6 TRUE 4.788325
## Dodge Challenger 333.5 FALSE 5.762051
## AMC Javelin 319.2 FALSE 5.717028
## Camaro Z28 363.3 FALSE 5.857933
## Pontiac Firebird 419.2 FALSE 5.991465
## Fiat X1-9 106.3 TRUE 4.369448
## Porsche 914-2 146.3 TRUE 4.789989
## Lotus Europa 125.5 TRUE 4.554929
## Ford Pantera L 366.8 FALSE 5.860786
## Ferrari Dino 164.7 FALSE 4.976734
## Maserati Bora 316.0 FALSE 5.707110
## Volvo 142E 142.4 TRUE 4.795791
by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl
## # A tibble: 32 × 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # … with 22 more rows
class(mtcars)
## [1] "data.frame"
class(by_cyl) # The structure of the new data frame is richer
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
by_cyl %>% summarise(
mpg_mean = mean(mpg), # Mean of mpg for each cyl group
hp_mean = mean(hp) # Mean of hp for each cyl group
)
## # A tibble: 3 × 3
## cyl mpg_mean hp_mean
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
by_cyl %>% summarise(freq = n()) # Frequency distribution of "cyl";
## # A tibble: 3 × 2
## cyl freq
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
# The resulting data frame has two columns: cyl and freq
Here are articles that show why data visualization is important:
https://www.mdpi.com/2072-6643/11/3/684 Can you reproduce Figures 2, 4, and 5?
https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(22)00327-0/fulltext Can you reproduce Figure 1?
The “ggplot2” package allows you to create different kinds of plots using a more generalized approach. It is based on the graphic concept that a graph is a combination of 3 independent components: data, aesthetics, and geometry, where
data is a data frame (maybe NULL),
aesthetics is used to set the x- and y-axes or to choose the shape, size, or color of the plotting character.
geometry refers to the type of graphics (such as scatter plot, line plot, histogram, boxplot, density curve, barplot, pie chart, violin plot).
We use examples to demonstrate the use of the ggplot2 package when plotting data in the following situations:
One continuous variable (histogram, boxplot, density curve)
One categorical variable (barplot, pie chart)
Two continuous variables (scatter plot, line plot)
One continuous variable and one categorical variable (side-by-side boxplots, overlaid density curves, overlaid histograms, violin plots)
Two categorical variables (side-by-side barplots, mosaic plots)
More than 3 variables
A warning: For each aesthetic, you use the aes() function to associate the name of the aesthetic with a variable to display. The aes() function gathers together each of the aesthetic mappings used by a layer and passes them to the layer’s mapping argument. You can also set the aesthetic properties of your geom manually. To set an aesthetic manually, set the aesthetic by name as an argument of your geom function; i.e. it goes outside of aes()!
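The difference between mapping and setting an aesthetic can be sketched as follows, assuming ggplot2 is installed. The object names `p_mapped`, `p_set`, and `p_mistake` are illustrative only.

```r
library(ggplot2)

# Mapping: the color depends on a variable, so it goes inside aes() and gets a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(cyl)))

# Setting: every point gets the same fixed color, so it goes outside aes()
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "blue")

# A common mistake: inside aes(), "blue" is treated as a one-level variable,
# so the points get a default palette color and a legend entry labelled "blue"
p_mistake <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = "blue"))
```

Printing the three objects one after another shows the difference: only `p_set` draws blue points without a legend.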
In addition to the 3 components of a graph, there are more than 30 theme elements that control the appearance of the graph. They can be roughly grouped into five categories: plot, axis, legend, panel and facet. Check out the theme() function for details.
head(iris) # Check the data
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
ggplot(iris) # Does not plot anything, since no x or y coordinate is specified
ggplot(iris, aes(x = Sepal.Length)) # Only plot the coordinate system
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 6,
color = "blue",
fill = "red"
) # Add a layer
ggplot(iris, aes(x = Sepal.Length)) + geom_density(color = "blue")
ggplot(iris, aes(x = Sepal.Length)) + geom_boxplot(color = "blue")
ggplot(iris, aes(x = Species)) + geom_bar(color = "blue", fill = "red")
ggplot(iris, aes(y = Species)) + geom_bar(color = "blue", fill = "red") # horizontal bars
p = ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(col = "blue", size = 2) # shape = pch in base plot
p
plotly::ggplotly(p) # Use the plotly package (must be installed) to make the plot interactive
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_line(col = "blue", linewidth = 2) # Since ggplot2 3.4.0, line thickness is set with linewidth instead of size
ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
geom_boxplot() +
theme(legend.position = "none") +
labs(x = "Cylinder", y = "Weight", title = "My Plots", subtitle = "Made by XYZ", caption = "Data courtesy")
ggplot(mtcars, aes(x = wt, color = factor(cyl))) +
geom_density() +
labs(x = "Weight", color = "Cylinder") # Legend's title is set via color here
ggplot(mtcars, aes(x = factor(cyl), y = wt, fill = factor(cyl))) +
geom_violin() +
labs(x = "Cylinder", y = "Weight", fill = "Cylinder") # Legend's title is set via fill here
More
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
ggplot(mtcars, mapping=aes(x = hp, y = mpg)) +
geom_point(color = "red")
## Want to do other plots?
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_line(color = "red")
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(color = "blue", fill = "red", bins=6)
ggplot(mtcars, aes(x = mpg)) +
geom_density(color = "red")
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
geom_histogram(bins = 6) # bad: the groups are stacked by default, which makes them hard to compare
ggplot(mtcars, aes(x = mpg, colour = factor(cyl))) +
geom_density() +
labs(colour = "Cylinder") # rename legend title
ggplot(mtcars, aes(x = mpg, color= as.factor(cyl))) +
geom_density() +
labs(title = "My Plot Title", x="Miles per Gallon", y= "Density", color="Cylinder") + # rename legend title
theme(plot.title = element_text(hjust = 0.5)) # Center the title
ggplot(mtcars, aes(y = mpg)) +
geom_boxplot(fill = "red")
# More to learn about the theme()
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
geom_boxplot() +
labs(title = "Mpg vs Cyl", subtitle = "Hello World!", x="Cyl", caption = "Data courtesy: xyz") +
theme(plot.background = element_rect(fill = "gray"),
panel.background = element_rect(fill = "yellow"),
plot.margin = unit(c(t=1,r=2,b=3,l=4), "cm"),
axis.title.x = element_text(hjust = 0.5, size = 20, color = "pink", angle = 45, vjust = -1),
plot.title = element_text(hjust = 0.5, size = 28, color = "green"),
plot.subtitle = element_text(hjust = 0.5, size = 15, color = "red"),
plot.caption = element_text(hjust = 0, size = 10, color = "green"),
legend.position = "none",
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
myData = data.frame(
Student = 1:24,
Major = c("Accounting", "IS", "CS", "Statistics", "EE", "Math", "Accounting", "CS", "CS", "Statistics", "EE", "Math","Accounting", "IS", "IS", "Statistics", "Math", "Math", "Accounting", "IS", "Statistics", "Statistics", "EE", "CS"),
Sex = c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male", "Female", "Male", "Male", "Male", "Male"),
Score = c(87, 92, 69, 90, 88, 79, 94, 86, 92, 84, 78, 67, 95, 91, 93, 83, 82, 94, 86, 74, 95, 55, 93, 85)
)
myData
## Student Major Sex Score
## 1 1 Accounting Male 87
## 2 2 IS Female 92
## 3 3 CS Male 69
## 4 4 Statistics Male 90
## 5 5 EE Female 88
## 6 6 Math Male 79
## 7 7 Accounting Female 94
## 8 8 CS Male 86
## 9 9 CS Male 92
## 10 10 Statistics Female 84
## 11 11 EE Female 78
## 12 12 Math Female 67
## 13 13 Accounting Male 95
## 14 14 IS Male 91
## 15 15 IS Female 93
## 16 16 Statistics Male 83
## 17 17 Math Female 82
## 18 18 Math Female 94
## 19 19 Accounting Male 86
## 20 20 IS Female 74
## 21 21 Statistics Male 95
## 22 22 Statistics Male 55
## 23 23 EE Male 93
## 24 24 CS Male 85
ggplot(myData, aes(x = Score, color= Sex)) +
geom_density()
# Bar plots with 3 possible positions
ggplot(myData, aes(x = Major, fill = Sex)) +
geom_bar(position = "stack") # Each bar consists of two pieces representing the counts of each sex
ggplot(myData, aes(x = Major, fill = Sex)) +
geom_bar(position = "dodge") # Side-by-side
ggplot(myData, aes(x = Major, fill = Sex)) +
geom_bar(position = "fill") # Each bar has the same height of 1 and consists of two sexes in proportion
# Name-value kind of data
popData = data.frame(City = c("St Cloud", "Sartell", "Waite Park", "St Joseph"),
Population = c(66169, 18428, 7718, 7147))
popData
## City Population
## 1 St Cloud 66169
## 2 Sartell 18428
## 3 Waite Park 7718
## 4 St Joseph 7147
ggplot(popData, aes(x = City, y = Population)) +
geom_col() + # for Name-value kind of data
theme(axis.text.x = element_text(angle = 45))
# The Titanic Data
D=as.data.frame(Titanic)
D1=D %>% group_by(Class)
D2=D1 %>% summarise(Freq=sum(Freq))
D2
## # A tibble: 4 × 2
## Class Freq
## <fct> <dbl>
## 1 1st 325
## 2 2nd 285
## 3 3rd 706
## 4 Crew 885
ggplot(D2, aes(x=Class, y = Freq)) +
geom_col() +
geom_text(aes(label=Freq), vjust=2.5, col ="red", size=13)
ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
geom_col(position = "stack") +
labs(fill = "Survival Status") # Change title of legend
ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
geom_col(position = "dodge")
# Which one to use: geom_bar() or geom_col()?
# Both functions create bar graphs. In general, for individual data, such as myData above, use geom_bar();
# for grouped data, such as D above, use geom_col().
# The function geom_bar() can also be applied to grouped data, if you use it this way: geom_bar(stat="identity").
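The rule of thumb above can be sketched with the mtcars data, assuming dplyr and ggplot2 are installed. The object names are illustrative; all three plots should show the same bars.

```r
library(dplyr)
library(ggplot2)

# Individual data (one row per car): geom_bar() does the counting itself
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Grouped data (one row per category with its count): use geom_col()
cyl_counts <- mtcars %>% count(cyl)   # columns: cyl and n
p_col <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_col()

# geom_bar() on grouped data requires stat = "identity"
p_identity <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_bar(stat = "identity")
```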
Quick quiz 1: How many aesthetics are used in the following plot? How would you create such a plot?
## Warning: Using size for a discrete variable is not advised.
Quick quiz 2: How many aesthetics are used in the following plot? How would you create such a plot?
D = mtcars %>% mutate(cyl = factor(cyl), am = factor(am), gear = factor(gear))
p <- ggplot(data = D, aes(x=wt, y=mpg)) +
geom_line(aes(group=gear, color=gear)) + # group is redundant here: mapping color to a factor already groups the lines
labs(title = "Demonstrate the use of aesthetics", subtitle = 'Using the "mtcars" data') +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))
p
Sometimes, you need to split your plot into facets, subplots that each display one subset of the data.
# Prepare data
D = mtcars %>% mutate(am = factor(am,
levels = c(0,1),
labels = c("Automatic", "Manual")
),
vs = factor(vs,
levels = c(0,1),
labels = c("V-shaped", "straight")
)
)
# Use one categorical variable for faceting
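ggplot(data = D, aes(x=wt, y=mpg)) + # Use the prepared data D so the facets carry the labels
geom_line() +
facet_grid(am~.) # One row of the grid per level of am
# Or
ggplot(data = D, aes(x=wt, y=mpg)) +
geom_line() +
facet_grid(.~am) # One column of the grid per level of am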
# Use two categorical variables for faceting
ggplot(data = D, aes(x=wt, y=mpg)) +
geom_line() +
facet_grid(am~vs)
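Besides facet_grid(), ggplot2 also provides facet_wrap(), which lays out the panels of a single variable in a wrapped sequence rather than a strict grid. The following is a minimal sketch; the choice of cyl as the faceting variable and of two columns is arbitrary.

```r
library(ggplot2)

# One panel per number of cylinders, wrapped into two columns
p_wrap <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl, ncol = 2)

p_wrap
```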
# data
D=data.frame(state=rep(LETTERS[1:4], each = 2), year = rep(2020:2019, 4), prop=c(0.6, 0.3, 0.4, 0.7, 0.2,0.7, 0.1, 0.6))
D
## state year prop
## 1 A 2020 0.6
## 2 A 2019 0.3
## 3 B 2020 0.4
## 4 B 2019 0.7
## 5 C 2020 0.2
## 6 C 2019 0.7
## 7 D 2020 0.1
## 8 D 2019 0.6
# Dodged Barplots with Text Labels
ggplot(D, aes(x=state, y=prop, fill=factor(year))) +
geom_col(position = "dodge") +
geom_text(aes(label=prop),
position = position_dodge(width=0.9), # match the default bar width (0.9) so each text is centered on its bar
color="white",
vjust = 1.5, # texts inside bars
hjust = 0.5 # centered texts
) +
labs(fill = "Year")
This example will use the data frame “diamonds” from the ggplot2 package. The dataset contains the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)
Reference: https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2
head(diamonds, n = 10)
## # A tibble: 10 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
df <- diamonds %>%
group_by(cut, clarity) %>% ## Must do grouping first in order to do the following operations
summarise(count = n()) %>%
group_by(cut) %>% # Optional: after summarise(), the result is still grouped by cut, so this only makes the grouping explicit
mutate(cut.count = sum(count),
prop = count/sum(count)) %>%
ungroup() # Remove grouping after we have added the two new columns
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
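The message above can be avoided by stating the output grouping explicitly through the `.groups` argument of summarise(). The following is a sketch of that variant, with a sanity check that the proportions within each cut add up to 1. The names `df_props` and `totals` are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

df_props <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(count = n(), .groups = "drop_last") %>%  # stay grouped by cut, no message
  mutate(cut.count = sum(count),                     # total number of diamonds per cut
         prop = count / cut.count) %>%               # share of each clarity within a cut
  ungroup()

# Sanity check: within each cut, the proportions should add up to 1
totals <- df_props %>%
  group_by(cut) %>%
  summarise(total = sum(prop))
totals
```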
ggplot(df, aes(x = cut, y = prop, width = cut.count, fill = clarity)) +
geom_col(colour = "black") +
facet_grid(.~cut, scales = "free_x", space = "free_x") +
scale_fill_brewer(palette = "RdYlGn") +
theme_void()
# A reordered barplot
df <- diamonds %>%
group_by(cut) %>% ## Must do grouping first in order to do the following operations
summarise(count = n())
ggplot(df,
aes(x = reorder(cut, -count), y = count)) + # "-" means reorder "cut" by "count" from high to low
geom_col() +
theme_bw() + # Apply the complete theme first; placed later, it would reset the axis-text settings below
theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
ggtitle("Distribution of Diamonds by Cut") +xlab("Cut")
Some useful references:
https://r-graph-gallery.com/297-circular-barplot-with-groups.html
https://www.azandisresearch.com/2019/07/19/create-a-radial-mirrored-barplot-with-ggplot/
D=USArrests
D$State = rownames(USArrests)
D=arrange(D, Murder)
angle <- 90-360 * (0:49) /50
hjust <- ifelse( angle < -90, 1, 0)
angle <- ifelse(angle < -90, angle+180, angle)
ggplot(D, aes(x = reorder(State, Murder))) +
geom_col(aes(y = Murder, fill = "Murder")) +
geom_col(aes(y = -Assault/10, fill = "Assault")) +
coord_polar() +
theme_void() + # Void all themes
theme(panel.grid.major.x = element_line(colour = "lightgrey"),
panel.grid.major.y = element_blank()) +
geom_text(aes(y = 20, label = State), angle = angle-3, hjust = hjust, size = 3) +
geom_text(aes(y = Murder, label = Murder), size = 2, vjust = 0.5) +
geom_text(aes(y = -Assault/10, label = Assault), size = 2, vjust = 0.5) +
labs(title = "A Mirrored, Radial Barplot",
subtitle = "Murder & Assault Rates for States",
fill = "Crime Type") +
theme(legend.position = c(1.1, 0.9))
Challenging problems:
If states are classified as Northeast, Southwest, West, Southeast, and Midwest, how can you make plots like the ones in https://r-graph-gallery.com/297-circular-barplot-with-groups.html?
Can you make a plot like Figure 1 in this article: https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(22)00327-0/fulltext
We demonstrate the use of the “BasketballAnalyzeR” package for creating nice plots.
D = USArrests[1:10,] # Only use the first 10 rows of the USArrests data from the datasets package
D$group = rep(1:2, c(5,5)) # Add an arbitrary group just for demo
BasketballAnalyzeR::radialprofile(data=D,
std = TRUE, # variables are standardized.
title = rownames(D))
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Install the “ggradar” package using this code:
devtools::install_github("ricardo-bion/ggradar", dependencies = TRUE)
# Prepare data:
# the first column must be a character and other columns numeric between 0 and 1
df = data.frame(year = as.character(c(2019, 2020)),
intimidation = c(0.33, 0.29),
destruction = c(0.25, 0.34),
assault = c(0.12, 0.15),
other = c(0.3, 0.22)
)
# plot
ggradar::ggradar(df)
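If your variables are not already between 0 and 1, one possible approach is to rescale them first. The following is a sketch using min-max rescaling on the USArrests data. The helper `rescale01()` and the object names are hypothetical, and the final call is commented out because it needs ggradar to be installed.

```r
# Min-max rescaling: maps the smallest value to 0 and the largest to 1
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Rescale every numeric column over all 50 states
scaled <- as.data.frame(lapply(USArrests, rescale01))

# First column: a character identifier; remaining columns: values in [0, 1]
radar_df <- data.frame(State = rownames(USArrests), scaled,
                       stringsAsFactors = FALSE)[1:3, ]
radar_df

# ggradar::ggradar(radar_df)  # uncomment if ggradar is installed
```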
Reference: https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html
SQL is a database query language, that is, a language designed specifically for interacting with a database. You might find it a great tool for data wrangling. Better than tidyverse?
library(sqldf) # Load the package
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
sqldf('SELECT mpg FROM mtcars')
## mpg
## 1 21.0
## 2 21.0
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## 7 14.3
## 8 24.4
## 9 22.8
## 10 19.2
## 11 17.8
## 12 16.4
## 13 17.3
## 14 15.2
## 15 10.4
## 16 10.4
## 17 14.7
## 18 32.4
## 19 30.4
## 20 33.9
## 21 21.5
## 22 15.5
## 23 15.2
## 24 13.3
## 25 19.2
## 26 27.3
## 27 26.0
## 28 30.4
## 29 15.8
## 30 19.7
## 31 15.0
## 32 21.4
sqldf('SELECT mpg, disp FROM mtcars')
## mpg disp
## 1 21.0 160.0
## 2 21.0 160.0
## 3 22.8 108.0
## 4 21.4 258.0
## 5 18.7 360.0
## 6 18.1 225.0
## 7 14.3 360.0
## 8 24.4 146.7
## 9 22.8 140.8
## 10 19.2 167.6
## 11 17.8 167.6
## 12 16.4 275.8
## 13 17.3 275.8
## 14 15.2 275.8
## 15 10.4 472.0
## 16 10.4 460.0
## 17 14.7 440.0
## 18 32.4 78.7
## 19 30.4 75.7
## 20 33.9 71.1
## 21 21.5 120.1
## 22 15.5 318.0
## 23 15.2 304.0
## 24 13.3 350.0
## 25 19.2 400.0
## 26 27.3 79.0
## 27 26.0 120.3
## 28 30.4 95.1
## 29 15.8 351.0
## 30 19.7 145.0
## 31 15.0 301.0
## 32 21.4 121.0
sqldf('SELECT mpg, disp FROM mtcars WHERE am = 1 ORDER BY mpg ASC')
## mpg disp
## 1 15.0 301.0
## 2 15.8 351.0
## 3 19.7 145.0
## 4 21.0 160.0
## 5 21.0 160.0
## 6 21.4 121.0
## 7 22.8 108.0
## 8 26.0 120.3
## 9 27.3 79.0
## 10 30.4 75.7
## 11 30.4 95.1
## 12 32.4 78.7
## 13 33.9 71.1
sqldf('SELECT * FROM mtcars WHERE (mpg > 20 AND disp < 95) OR carb > 1000')
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 4 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
sqldf('SELECT * FROM mtcars WHERE carb NOT IN (1,2)')
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 4 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## 5 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## 6 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## 7 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## 8 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## 9 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## 10 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## 11 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 12 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## 13 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## 14 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## 15 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
sqldf("SELECT am, AVG(mpg) AS Avgmpg FROM mtcars GROUP BY am")
## am Avgmpg
## 1 0 17.14737
## 2 1 24.39231
sqldf("SELECT am, COUNT(*) AS count FROM mtcars GROUP BY am")
## am count
## 1 0 19
## 2 1 13
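To compare the two styles, here is a sketch of how some of the queries above might be written with dplyr verbs: WHERE corresponds to filter(), ORDER BY to arrange(), SELECT to select(), and GROUP BY with aggregate functions to group_by() plus summarise(). The object names are illustrative.

```r
library(dplyr)

# SELECT mpg, disp FROM mtcars WHERE am = 1 ORDER BY mpg ASC
manual_sorted <- mtcars %>%
  filter(am == 1) %>%
  arrange(mpg) %>%
  select(mpg, disp)

# SELECT am, AVG(mpg) AS Avgmpg, COUNT(*) AS count FROM mtcars GROUP BY am
avg_by_am <- mtcars %>%
  group_by(am) %>%
  summarise(Avgmpg = mean(mpg), count = n())

manual_sorted
avg_by_am
```

Which style is better is largely a matter of taste and background; the results should agree.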
Another useful package for running SQL queries is RSQLite.
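A minimal sketch of the RSQLite route, assuming the DBI and RSQLite packages are installed: the data frame is first copied into a temporary in-memory SQLite database, which is then queried through the DBI interface. The connection name `con` and the result name `avg_mpg` are illustrative.

```r
library(DBI)

# Create a temporary in-memory SQLite database and copy mtcars into it
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Run the same kind of query as with sqldf
avg_mpg <- dbGetQuery(
  con,
  "SELECT am, AVG(mpg) AS Avgmpg FROM mtcars GROUP BY am ORDER BY am"
)
avg_mpg

# Always close the connection when done
dbDisconnect(con)
```

Unlike sqldf, this approach requires an explicit connection, but it works the same way with a database file on disk.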