There are a number of datasets in this package to use to practice creating visualizations
# install.packages("dslabs") # these are data science labs
library("dslabs")
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-death_prob.R"
## [5] "make-divorce_margarine.R"
## [6] "make-gapminder-rdas.R"
## [7] "make-greenhouse_gases.R"
## [8] "make-historic_co2.R"
## [9] "make-mnist_27.R"
## [10] "make-movielens.R"
## [11] "make-murders-rda.R"
## [12] "make-na_example-rda.R"
## [13] "make-nyc_regents_scores.R"
## [14] "make-olive.R"
## [15] "make-outlier_example.R"
## [16] "make-polls_2008.R"
## [17] "make-polls_us_election_2016.R"
## [18] "make-reported_heights-rda.R"
## [19] "make-research_funding_rates.R"
## [20] "make-stars.R"
## [21] "make-temp_carbon.R"
## [22] "make-tissue-gene-expression.R"
## [23] "make-trump_tweets.R"
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"
Note that the package dslabs also includes some of the scripts used to wrangle the data from their original source:
This dataset includes gun murder data for US states in 2010. I use this dataset to introduce the basics of R program.
data("murders")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
library(ggrepel)
library(dplyr)
library(readr)
library(ggplot2)
write_csv(murders, "murders.csv", na="")
view(murders)
Once we determine the per million rate to be r, this line is defined by the formula: y=rx, with y and x our axes: total murders and population in millions respectively.
In the log-scale this line turns into: log(y)=log(r)+log(x). So in our plot it’s a line with slope 1 and intercept log(r). To compute r, we use dplyr:
r <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^6) %>%
pull(rate)
Use the data science theme. Plot the murders with the x-axis as population for each state per million, the y-axis as the total murders for each state.
Color by region, add a linear regression line based on your calculation for r above, where we only need the intercept: geom_abline(intercept = log10(r))
Scale the x- and y-axes by a factor of log 10, add axes labels and a title.
You can use the command nudge_x argument, if you want to move the text slightly to the right or to the left:
ds_theme_set()
murders %>% ggplot(aes(x = population/10^6, y = total, label = abb)) +
geom_abline(intercept = log10(r), lty=2, col="darkgrey") +
geom_point(aes(color=region), size = 3) +
geom_text_repel(nudge_x = 0.005) +
scale_x_log10("Populations in millions (log scale)") +
scale_y_log10("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010") +
scale_color_discrete(name="Region")
#{r} #regions <- murders %>% # group_by(abb,region) %>% # summarize(gdp_tn = sum(gdp_tn, na.rm = TRUE)) %>% #arrange(abb,region) #
On your own: