\(~\)
\(~\)
RStudio. The chunk section is incredibly useful.
\(~\)
\(~\)
Knit function: This function takes an input file, extracts the R code in it according to a list of patterns, evaluates the code and writes the output in another file.
Chunks: A code chunk is a runnable piece of code. It can be R code, but also Python or Stata code (or others).
\(~\)
| Chunk option | Use |
|---|---|
| echo | Determines whether to display the source code in the output document. |
| warning | Determines whether to preserve warnings in the output. |
| message | Determines whether to preserve messages in the output. |
library(tidyverse) on one of the chunks.\(~\)
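To make the knit step and the chunk options concrete, here is a minimal sketch that knits a throwaway file with `knitr::knit()`. The file contents and the temporary paths are invented for illustration; in day-to-day work you would normally just press the Knit button in RStudio.

```r
library(knitr)

# Three backticks, built programmatically so this listing stays readable
fence <- strrep("`", 3)

# A tiny R Markdown document with one chunk (hypothetical content)
rmd <- c(
  "A tiny document.",
  "",
  paste0(fence, "{r, echo=FALSE, message=FALSE}"),
  "message('this message is hidden')",
  "1 + 1",
  fence
)

input <- tempfile(fileext = ".Rmd")
writeLines(rmd, input)

# knit() extracts the chunk, evaluates it, and writes the result to another file
output <- knit(input, output = tempfile(fileext = ".md"), quiet = TRUE)
result <- readLines(output)
print(result)
```

With `echo=FALSE` the source line `1 + 1` is left out of the output while its result is kept, and `message=FALSE` drops the message.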
A short list of useful shortcuts to work more efficiently (from RMarkdown for Scientists).
| Action | Windows/Linux | Mac |
|---|---|---|
| Show Keyboard Shortcut Reference | Alt+Shift+K | Option+Shift+K |
| Knit document | Ctrl + Shift + K | Cmd + Shift + K |
| Insert Chunk | Ctrl + Alt + I | Cmd + Option + I |
| Run Current Chunk | Ctrl + Alt + C | Cmd + Option + C |
| Jump to | Shift+Alt+J | Cmd+Shift+Option+J |
| Create multiple cursors | Ctrl + Alt + Up/Down | Option + Control + Up/Down |
| Delete the current line | Ctrl + D | Cmd + D |
| Un/Comment out a line | Ctrl + Shift + C | Cmd + Shift + C |
| Reformat Section | Ctrl + Shift + A | Cmd + Shift + A |
\(~\)
\(~\)
You need to be able to do, at least, the following actions on GitHub:
| Action |
|---|
| 1) Create and use a repository |
| 2) Clone and use an existing repository |
| 3) Start and manage a new branch |
| 4) Make changes to a file, and push them to GitHub as commits |
| 5) Open and merge a pull request |
I strongly encourage you to do the exercise in the Hello World GitHub guide.
Here we are going to clone a repository, make changes to a file, and push them to the main branch.
Cloning a repository creates a local copy of an existing repository on your computer.
Committing a change: After modifying a file in the cloned repository, you can commit the change. This saves the change locally to a branch (the main branch or another one), with a name and a description of what the change does.
Pushing sends your commits to the remote repository. You can only push a branch to GitHub if you have write access to the repository. Nevertheless, you can always create a branch in GitHub Desktop if you have read access to a repository.
\(~\)
\(~\)
A minimum list of commands and things you need to master now.
| Commands | Description |
|---|---|
| Everything in the Introduction to R in DataCamp | Check the outline of the DataCamp course and do whatever you don’t know |
| filter() | subsets rows (observations) |
| select() | keeps or drops columns (variables; tip: call it as dplyr::select) |
| arrange() | sorts rows |
| mutate() | creates a column |
| group_by() and summarise() | group_by() tells later operations to work within subsets; summarise() collapses the data using a function |
| n() | counts the rows in the current group |
| Everything about factors (Chapter 15) | categorical variables with a fixed set of values |
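The verbs in the table can be chained together. The following is a minimal sketch on a toy data set; the `sales` table and its column names are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
sales <- tibble(
  state = c("A", "A", "B", "B", "B"),
  year  = c(2015, 2016, 2015, 2016, 2016),
  units = c(10, 20, 5, 15, 25)
)

recent <- sales %>%
  filter(year == 2016) %>%             # keep rows (observations) from 2016
  dplyr::select(state, units) %>%      # keep only these columns (variables)
  mutate(units_k = units / 1000) %>%   # create a new column
  arrange(desc(units))                 # sort rows, largest first

by_state <- sales %>%
  group_by(state) %>%                  # operate within each state
  summarise(n_rows = n(),              # n() counts the rows in each group
            total  = sum(units))       # collapse each group to one row

print(recent)
print(by_state)
```

Writing `dplyr::select` explicitly avoids clashes with `select()` functions from other loaded packages.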
\(~\)
\(~\)
Very quick review based on Exercise 2. Check the following table with commands.
| Commands | Description |
|---|---|
| str() | Check the structure of the data |
| glimpse() | Check the structure of the data (dplyr) |
| head() | Check the first few rows |
| tail() | Check the last few rows |
| view() | View the data 1 |
| just run the name of the object | View the data 2 |
| names(), as.data.frame(names()) | Names of the variables |
| summary() | General description of all the variables |
| unique() | Base R function that returns the unique values or rows; can be used to check for duplicates (crucial for merging). The dplyr counterpart is distinct() |
| table() | Frequency table or two way table (use exclude = NULL for showing NA) |
# Data-source 1: Gunsales - NYTimes
gun_sales <- analysis()
## [1] "Increase in monthly gun sales in Missouri = 8773.09"
# Note that if we hadn't used message = F and warning = F, the final output would have been messier.
Course data often comes from clean, tidy data sets. Don’t forget to check whether your own data has substantial problems, such as missing values or unwanted duplicates, that will ruin your analysis and data merging.
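A quick way to run both checks is sketched below with base R only. The data frame `df` is a hypothetical toy example containing missing values and one duplicated row; it is not part of the course data.

```r
# Hypothetical toy data with missing values and one duplicated row
df <- data.frame(
  id    = c(1, 2, 2, 3),
  value = c(10, NA, NA, 7)
)

na_per_column <- colSums(is.na(df))     # number of NA values in each column
n_duplicates  <- sum(duplicated(df))    # rows that exactly repeat an earlier row
deduplicated  <- unique(df)             # same data with repeated rows dropped
id_counts     <- table(df$id, exclude = NULL)  # frequency table, NA shown if present

print(na_per_column)
print(n_duplicates)
print(deduplicated)
print(id_counts)
```

An `id` that appears more than once in `id_counts` is a warning sign before merging on that key.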
summary(gun_sales)
## year month guns_total guns_total_seas
## Min. :2000 Min. : 1.000 Min. : 409742 Min. : 573689
## 1st Qu.:2004 1st Qu.: 3.000 1st Qu.: 610066 1st Qu.: 637499
## Median :2008 Median : 6.000 Median : 761089 Median : 748970
## Mean :2008 Mean : 6.448 Mean : 838552 Mean : 835160
## 3rd Qu.:2012 3rd Qu.: 9.000 3rd Qu.: 987970 3rd Qu.:1053472
## Max. :2016 Max. :12.000 Max. :2388128 Max. :2015226
## guns_total_per_1000 guns_total_per_1000_scaled handgun
## Min. :1.985 Min. :2.044 Min. :148350
## 1st Qu.:2.230 1st Qu.:2.271 1st Qu.:178376
## Median :2.477 Median :2.668 Median :237259
## Mean :2.739 Mean :2.975 Mean :305079
## 3rd Qu.:3.353 3rd Qu.:3.752 3rd Qu.:412394
## Max. :6.393 Max. :7.179 Max. :886548
## longgun other multiple longgun_share
## Min. :356925 Min. : 0 Min. : 7421 Min. :0.3379
## 1st Qu.:377985 1st Qu.: 0 1st Qu.: 8464 1st Qu.:0.5471
## Median :400418 Median : 0 Median :14802 Median :0.6042
## Mean :427909 Mean : 6168 Mean :14674 Mean :0.5971
## 3rd Qu.:455140 3rd Qu.: 8798 3rd Qu.:18652 3rd Qu.:0.6815
## Max. :893224 Max. :50889 Max. :50669 Max. :0.7034
## handgun_share new_jersey maryland georgia
## Min. :0.2899 Min. :0.4150 Min. :0.6370 Min. :2.059
## 1st Qu.:0.3115 1st Qu.:0.5190 1st Qu.:0.7518 1st Qu.:2.341
## Median :0.3805 Median :0.5915 Median :0.8425 Median :2.571
## Mean :0.3871 Mean :0.6081 Mean :0.9179 Mean :2.683
## 3rd Qu.:0.4340 3rd Qu.:0.6760 3rd Qu.:1.0055 3rd Qu.:3.012
## Max. :0.6187 Max. :1.1320 Max. :5.4100 Max. :4.174
## louisiana mississippi missouri
## Min. :1.849 Min. :1.263 Min. :2.437
## 1st Qu.:2.276 1st Qu.:1.471 1st Qu.:2.737
## Median :2.397 Median :1.738 Median :3.191
## Mean :2.418 Mean :1.746 Mean :3.124
## 3rd Qu.:2.519 3rd Qu.:1.871 3rd Qu.:3.484
## Max. :3.471 Max. :2.480 Max. :4.034
## dc_handguns_per_100k_national_sales
## Min. : 0.000
## 1st Qu.: 0.500
## Median : 2.200
## Mean : 4.172
## 3rd Qu.: 7.950
## Max. :15.600
names(gun_sales)
## [1] "year" "month"
## [3] "guns_total" "guns_total_seas"
## [5] "guns_total_per_1000" "guns_total_per_1000_scaled"
## [7] "handgun" "longgun"
## [9] "other" "multiple"
## [11] "longgun_share" "handgun_share"
## [13] "new_jersey" "maryland"
## [15] "georgia" "louisiana"
## [17] "mississippi" "missouri"
## [19] "dc_handguns_per_100k_national_sales"
as.data.frame(names(gun_sales))
## names(gun_sales)
## 1 year
## 2 month
## 3 guns_total
## 4 guns_total_seas
## 5 guns_total_per_1000
## 6 guns_total_per_1000_scaled
## 7 handgun
## 8 longgun
## 9 other
## 10 multiple
## 11 longgun_share
## 12 handgun_share
## 13 new_jersey
## 14 maryland
## 15 georgia
## 16 louisiana
## 17 mississippi
## 18 missouri
## 19 dc_handguns_per_100k_national_sales
On your own: Try all these commands on the gun_sales data set.
\(~\)
\(~\)
Run the following code to download an individual-level data set of gun-related deaths in the US:
gun_inddeaths <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/guns-data/master/full_data.csv")
gun_inddeaths <- gun_inddeaths[, -1] # Drop the first column and keep the rest
Answer the following question:
What are the individual level patterns of gun deaths?
\(~\)
glimpse(gun_inddeaths)
## Rows: 100,798
## Columns: 10
## $ year <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 201...
## $ month <chr> "01", "01", "01", "02", "02", "02", "02", "03", "02", "02...
## $ intent <chr> "Suicide", "Suicide", "Suicide", "Suicide", "Suicide", "S...
## $ police <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ sex <chr> "M", "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M...
## $ age <dbl> 34, 21, 60, 64, 31, 17, 48, 41, 50, NA, 30, 21, 43, 34, 2...
## $ race <chr> "Asian/Pacific Islander", "White", "White", "White", "Whi...
## $ hispanic <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 998, 100, 10...
## $ place <chr> "Home", "Street", "Other specified", "Home", "Other speci...
## $ education <chr> "BA+", "Some college", "BA+", "BA+", "HS/GED", "Less than...
Let’s say we are interested in exploring the relationship between race and intent. We can take a closer look at both variables.
ftable(gun_inddeaths$intent, exclude = NULL) # One missing value
## Accidental Homicide Suicide Undetermined NA
##
## 1639 35176 63175 807 1
ftable(gun_inddeaths$race, exclude = NULL) # No missing values
## Asian/Pacific Islander Black Hispanic Native American/Native Alaskan White
##
## 1326 23296 9022 917 66237
table(gun_inddeaths$race, gun_inddeaths$intent , exclude = NULL)
##
## Accidental Homicide Suicide Undetermined <NA>
## Asian/Pacific Islander 12 559 745 10 0
## Black 328 19510 3332 126 0
## Hispanic 145 5634 3171 72 0
## Native American/Native Alaskan 22 326 555 14 0
## White 1132 9147 55372 585 1
\(~\)
It looks like a very open question. What is the most suitable geom available in ggplot2?
Let’s look at the ggplot cheat sheet.
Another way to check for the alternatives available is by visiting graph galleries.
Graph galleries also provide code that you can easily adapt to your purposes.
\(~\)
From the ggplot cheat sheet we know that geom_count is one option for visualizing two discrete variables. Let’s try it:
ggplot(gun_inddeaths, aes(intent, race)) +
geom_count()
It is difficult to compare, especially because the frequencies differ so much across races.
What about the bar plots from class with position = "fill"?
ggplot(gun_inddeaths) +
geom_bar(aes(x = race, fill = intent), position = "fill")
It looks much better: it is easier to compare across races.
Among gun-related deaths, those of Black and Hispanic victims are less likely to be suicides than those of the other groups.
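The shares that the filled bars display can also be computed directly. The sketch below re-enters the counts from the race-by-intent table printed earlier (leaving out the single NA) and uses `prop.table()` to turn each row into proportions; the object names `counts` and `shares` are only illustrative.

```r
# Counts copied from the race-by-intent table above (the single NA is left out)
counts <- matrix(
  c(  12,   559,   745,  10,
     328, 19510,  3332, 126,
     145,  5634,  3171,  72,
      22,   326,   555,  14,
    1132,  9147, 55372, 585),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    race   = c("Asian/Pacific Islander", "Black", "Hispanic",
               "Native American/Native Alaskan", "White"),
    intent = c("Accidental", "Homicide", "Suicide", "Undetermined")
  )
)

# margin = 1 divides every cell by its row total, so each row sums to 1
shares <- prop.table(counts, margin = 1)

# Share of suicides within each race, which is what position = "fill" shows
print(round(shares[, "Suicide"], 2))
```

On the real data, `prop.table(table(gun_inddeaths$race, gun_inddeaths$intent), margin = 1)` should give the same row shares without retyping the counts.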
\(~\)
Let’s fix all of this:
# How many NA?
ftable(gun_inddeaths$intent, exclude = NULL) # Only one!
## Accidental Homicide Suicide Undetermined NA
##
## 1639 35176 63175 807 1
ftable(gun_inddeaths$race, exclude = NULL)
## Asian/Pacific Islander Black Hispanic Native American/Native Alaskan White
##
## 1326 23296 9022 917 66237
# Let's fix this:
ggplot(gun_inddeaths %>% drop_na(intent)) + # We drop the only NA in intent
geom_bar(aes(x = race, fill = intent), position = "fill")
\(~\)
How can we work with discrete variables? With factors.
gun_inddeaths$race <- factor(gun_inddeaths$race, levels = c("Black", "Hispanic", "Asian/Pacific Islander", "Native American/Native Alaskan", "White"))
gun_inddeaths$intent <- factor(gun_inddeaths$intent, levels = c("Undetermined", "Accidental", "Homicide", "Suicide"))
ggplot(gun_inddeaths %>% drop_na(intent)) + # We drop the only NA in intent
geom_bar(aes(x = race, fill = intent), position = "fill")
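Why the `levels` argument matters is easier to see on a small vector. This is a minimal sketch with a hypothetical toy vector, not the course data: by default the levels are sorted alphabetically, and tables and plots follow whatever level order the factor carries.

```r
# Hypothetical toy vector, invented for illustration only
intent <- c("Suicide", "Homicide", "Suicide", "Accidental")

default_order <- factor(intent)   # levels are sorted alphabetically
custom_order  <- factor(intent, levels = c("Suicide", "Homicide", "Accidental"))

print(levels(default_order))
print(levels(custom_order))
print(table(custom_order))        # counts are listed in the level order

# Rename one level; every observation with that level is relabelled at once
levels(custom_order)[levels(custom_order) == "Accidental"] <- "Accident"
print(custom_order)
```

The same two steps, setting the order and then renaming levels, are what control the order and the labels of the bars in ggplot2.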
\(~\)
ggplot(gun_inddeaths %>% drop_na(intent)) + # We drop the only NA in intent
geom_bar(aes(x = factor(race, levels = c("Black", "Hispanic", "Asian/Pacific Islander", "Native American/Native Alaskan", "White")), fill = intent), position = "fill") +
xlab(NULL) + ylab(NULL) + theme(legend.title = element_blank())
\(~\)
levels(gun_inddeaths$race)[levels(gun_inddeaths$race)=="Asian/Pacific Islander"] <- "Asian/P. Islander"
levels(gun_inddeaths$race)[levels(gun_inddeaths$race)=="Native American/Native Alaskan"] <- "N. American/N. Alaskan"
ggplot(gun_inddeaths %>% drop_na(intent)) + # We drop the only NA in intent
geom_bar(aes(x = factor(race, levels = c("Black", "Hispanic", "Asian/P. Islander", "N. American/N. Alaskan", "White")), fill = intent), position = "fill") +
xlab(NULL) + ylab(NULL) + theme(axis.text.x = element_text(size = 9), legend.title = element_blank(), legend.position = "bottom")
\(~\)
Beautiful?
Functional?
library(ggthemes) # provides theme_fivethirtyeight()
library(scales) # provides percent_format()
levels(gun_inddeaths$race)[levels(gun_inddeaths$race)=="Asian/Pacific Islander"] <- "Asian/P. Islander"
levels(gun_inddeaths$race)[levels(gun_inddeaths$race)=="Native American/Native Alaskan"] <- "N. American/N. Alaskan"
ggplot(gun_inddeaths %>% drop_na(intent)) + # We drop the only NA in intent
  geom_bar(aes(x = factor(race, levels = c("Black", "Hispanic", "Asian/P. Islander", "N. American/N. Alaskan", "White")), fill = intent), position = "fill") +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black", size = 1) + # reference line at 50%
  labs(x = NULL, y = NULL,
       title = "Types of gun-related deaths by race \n", caption = "Source: FiveThirtyEight data.") +
  scale_fill_manual("", values = c("#6666FF", "#99CCFF", "#CCCCFF", "red")) +
  scale_y_continuous(labels = percent_format()) + # show the y axis as percentages
  theme_fivethirtyeight() + # complete theme goes first so the tweaks below are not overwritten
  theme(legend.title = element_blank(),
        axis.text.x = element_text(size = 9, angle = 45, vjust = 0.9, hjust = 1))