In the following notes, you will load data directly from a URL, directly from pre-build datasets in R, and finally from a file you save in your own folder.
You can load data from a folder or you can load data directly from a URL. The next example loads the dataset, “Test Scores”, directly from the URL where it resides.
library(tidyverse) # you will use the readr package in tidyverse to read in this data
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
allscores <- read_csv("https://goo.gl/MJyzNs")
## Rows: 22 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (4): group, pre, post, diff
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(allscores)
## [1] 22 4
Notice R interprets the variable “group” as continuous values (col_double). We will fix this later. The command “dim” provides the dimensions of the data, which are 22 observations (rows) by 4 variables (columns).
Here is some easy code to create 3 groups of boxplots with some easy-to-access data, filled by group. Since the groups are discrete, you can get rid of the shading.
boxpl <- allscores %>% ggplot() + geom_boxplot(aes(y=diff, group=group,
fill=group))
boxpl
Notice that the legend give a continuous range of values for the scores, even though the scores are only 1, 2, or 3. The code guides(fill = FALSE) will get rid of the legend.
boxpl2 <- boxpl + guides(fill=FALSE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
boxpl2
Ensure that the groups are considered as factors, rather than numbers. Then manually fill with the 3 colors: white, light gray, and dark gray. Make the boxplots orient horizontally.
allscores %>%
mutate(group=factor(group, levels=c("1","2","3"), ordered=TRUE)) %>%
ggplot() + geom_boxplot(aes(y=diff, group=group, fill=group)) +
scale_fill_manual(values=c("white","lightgray","darkgray")) +
theme(axis.text.y=element_blank()) +
ggtitle("Score Improvements Across Three Groups") +
coord_flip()
Some data frames are built in to R, such as mpg. Load the data, then use str and head to look at the data.
{r mpg} loads the data. Alternatively, you can use the command: load(“mpg”)
You will look at the data using the command “str” (gives the structure of the data), “head” (lists the first 6 rows of observations in the dataset), and “describe” from the “psych” package (gives quite detailed summary statistics on the continuous variables).
# install.packages("psych")
library(tidyverse)
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa~
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa~
describe(mpg)
## vars n mean sd median trimmed mad min max range
## manufacturer* 1 234 7.76 5.13 6.0 7.68 5.93 1.0 15 14.0
## model* 2 234 19.09 11.15 18.5 18.98 14.08 1.0 38 37.0
## displ 3 234 3.47 1.29 3.3 3.39 1.33 1.6 7 5.4
## year 4 234 2003.50 4.51 2003.5 2003.50 6.67 1999.0 2008 9.0
## cyl 5 234 5.89 1.61 6.0 5.86 2.97 4.0 8 4.0
## trans* 6 234 5.65 2.88 4.0 5.53 1.48 1.0 10 9.0
## drv* 7 234 1.67 0.66 2.0 1.59 1.48 1.0 3 2.0
## cty 8 234 16.86 4.26 17.0 16.61 4.45 9.0 35 26.0
## hwy 9 234 23.44 5.95 24.0 23.23 7.41 12.0 44 32.0
## fl* 10 234 4.63 0.70 5.0 4.77 0.00 1.0 5 4.0
## class* 11 234 4.59 1.99 5.0 4.64 2.97 1.0 7 6.0
## skew kurtosis se
## manufacturer* 0.21 -1.63 0.34
## model* 0.11 -1.23 0.73
## displ 0.44 -0.91 0.08
## year 0.00 -2.01 0.29
## cyl 0.11 -1.46 0.11
## trans* 0.29 -1.65 0.19
## drv* 0.48 -0.76 0.04
## cty 0.79 1.43 0.28
## hwy 0.36 0.14 0.39
## fl* -2.25 5.76 0.05
## class* -0.14 -1.52 0.13
It is essential to recognize that variables may be: int (integer), num (numeric), or double vs char (character) and factor (for categories)
Typically, chr or factor are used for discrete variables and int, dbl, or num for continuous variables.
The “gg” in ggplot2 stands for “grammar of graphics,” an approach to drawing charts devised by the statistician Leland Wilkinson. Rather than thinking in terms of finished charts like a scatter plot or a column chart, it starts by defining the coordinate system (usually the X and Y axes of a cartesian system), maps data onto those coordinates, and then adds layers such as points, bars and so on. This is the logic behind ggplot2 code.
Some key things to understand about ggplot2:
. ggplot This is the master function that creates a ggplot2 chart.
. aes This function, named for “aesthetic mapping,” is used whenever data values are mapped onto a chart. So it is used when you define which variables are plotted onto the X and Y axes, and also if you want to change the size or color of parts of the chart according to values for a variable.
. geom All of the functions that add layers to a chart start with geom, followed by an underscore, for example geom_point() or geom_bar(). The code in the brackets for any geom layer styles the items in that layer, and can include aes mappings of values from data.
. theme This function modifies the appearance of elements of a plot, used, for example, to set size and font face for text, the position of a legend, and so on.
. scale Functions that begin with scale, followed by an underscore, are used to modify the way an aes mapping of data appears on a chart. They can change the axis range, for example, or specify a color palette to be used to encode values in the data.
. + is used each time you add a layer, a scale, a theme, or elements like axis labels and a title After a + you can continue on the same line of code or move the next line. I usually write a new line after each +, which makes the code easier to follow.
Make a scatterplot of city vs highway miles per gallon, but sort/color points by either 4-wheel, front-wheel, or rear-wheel drive
Here is how we will code:
name the plot: “plot1” <-
call back the name of the dataset “mpg” and “pipe it” (more on that later) to create the frame for your plot
call “ggplot” to make a set of axes, with the aesthetics (aes) for city and highway mpg, but color points by the factors for drv
add geom_point to see the points
call plot1 to see the entire plot
plot1 <- mpg %>% ggplot(aes(cty, hwy, color = drv))+
geom_point()
plot1
Notice that the blue points for rear-wheel drive are only at the lower left side of the plot (i.e., not great mpg). Red points for 4-wheel drive have a wider spread of points, but they are also mainly at the lower left corner of the plot. The green points for front-wheel drive are mostly at the upper right, for the higher mpg.
Although there are already axes labels, we can do better. We should also add a title
plot1 <- mpg %>% ggplot(aes(cty, hwy, color = drv))+
geom_point()+
xlab("City miles per gallon") +
ylab("Highway miles per gallon") +
ggtitle("Scatterplot of City versus Highway Miles per Gallon")
plot1
Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check: 1. Be sure the format is .csv 2. Be sure there are no spaces between variable names (headers). 3. Set all variable names to lowercase so you do not have to keep track of capitalizing.
# install.packages("zoo")
library(tidyverse)
library(zoo) # this package will help us re-format the period to be a useable date.
## Warning: package 'zoo' was built under R version 4.1.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
You may get data from a url, from a pre-built dataset, or you may load data from a working directory. A working directory is simply the file location on your computer.
The easiest way to find out what your current workinig directory is, use the command getwd().
getwd()
## [1] "C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes"
This command shows you (in your console below) the path to your directory. My current path is: [1] “C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes”
If you want to change the path, there are several ways to do so. I find the easiest way to change it is to click the “Session” tab at the top of R Studio. Select “Set Working Directory”, and then arrow over to “Choose Directory”. At this point, it will take you to your computer folders, and you need to select where your data is held. I suggest you create a folder called “Datasets” and keep all the data you load for this class in that folder.
Notice that down in the console below, it will show the new path you have chosen: setwd(“C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets”). At this point, I copy that command and put it directly into a new chunk.
The following data comes from New York Fed (https://www.newyorkfed.org/microeconomics/hhdc.html) regarding household debt for housing and non-housing expenses.
Download this dataset, Household_debt, from http://bit.ly/2P3084E and save it in your dataset folder. Change your working directory to load the dataset from YOUR folder. Then run this code.
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
household <- read_csv("household_debt.csv")
## Rows: 64 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Period
## dbl (7): Mortgage, HE Revolving, Auto Loan, Credit Card, Student Loan, Other...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check: 1. Be sure the format is .csv 2. Be sure there are no spaces between variable names (headers). 3. Set all variable names to lowercase so you do not have to keep track of capitalizing.
Make all headings (column names) lowercase. Remove all spaces between words in headings and replace them with underscores with the gsub command. Then look at it with “head” and “tail”.
names(household) <- tolower(names(household))
names(household) <- gsub(" ","_",names(household))
head(household)
## # A tibble: 6 x 8
## period mortgage he_revolving auto_loan credit_card student_loan other total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 03:Q1 4.94 0.24 0.64 0.69 0.24 0.48 7.23
## 2 03:Q2 5.08 0.26 0.62 0.69 0.24 0.49 7.38
## 3 03:Q3 5.18 0.27 0.68 0.69 0.25 0.48 7.56
## 4 03:Q4 5.66 0.3 0.7 0.7 0.25 0.45 8.07
## 5 04:Q1 5.84 0.33 0.72 0.7 0.26 0.45 8.29
## 6 04:Q2 5.97 0.37 0.74 0.7 0.26 0.42 8.46
tail(household)
## # A tibble: 6 x 8
## period mortgage he_revolving auto_loan credit_card student_loan other total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 17:Q3 8.74 0.45 1.21 0.81 1.36 0.39 13.0
## 2 17:Q4 8.88 0.44 1.22 0.83 1.38 0.39 13.2
## 3 18:Q1 8.94 0.44 1.23 0.82 1.41 0.39 13.2
## 4 18:Q2 9 0.43 1.24 0.83 1.41 0.39 13.3
## 5 18:Q3 9.14 0.42 1.27 0.84 1.44 0.4 13.5
## 6 18:Q4 9.12 0.41 1.27 0.87 1.46 0.41 13.5
Look at the dimensions and the structure of the data. Note that it will be listed as a tibble (to be discussed later in these notes).
dim(household)
## [1] 64 8
Mutate is a powerful command in tidyverse. It creates a new variable (column) in your dataset. In our dataset, “period” is not anything useful if we want to plot chronological data. So we will use mutate from “tidyverse” with the package “zoo” to create a useable date format.
You should see that there are 64 observations and 8 variables. All variables are “col_double” (continuous values) except “period”, which is interpreted as characters.We need to use the library “zoo” package to fix the unusual format of the “period”. We will mutate it to create a new variable, date.
household_debt_perc <- household %>%
mutate(date = as.Date(as.yearqtr(period, format = "%y:Q%q")))
household_debt_perc
## # A tibble: 64 x 9
## period mortgage he_revolving auto_loan credit_card student_loan other total
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 03:Q1 4.94 0.24 0.64 0.69 0.24 0.48 7.23
## 2 03:Q2 5.08 0.26 0.62 0.69 0.24 0.49 7.38
## 3 03:Q3 5.18 0.27 0.68 0.69 0.25 0.48 7.56
## 4 03:Q4 5.66 0.3 0.7 0.7 0.25 0.45 8.07
## 5 04:Q1 5.84 0.33 0.72 0.7 0.26 0.45 8.29
## 6 04:Q2 5.97 0.37 0.74 0.7 0.26 0.42 8.46
## 7 04:Q3 6.21 0.43 0.75 0.71 0.33 0.41 8.83
## 8 04:Q4 6.36 0.47 0.73 0.72 0.35 0.42 9.04
## 9 05:Q1 6.51 0.5 0.73 0.71 0.36 0.39 9.21
## 10 05:Q2 6.7 0.53 0.77 0.72 0.37 0.4 9.49
## # ... with 54 more rows, and 1 more variable: date <date>
plot1 <- household_debt_perc %>%
ggplot(aes(date, mortgage)) +
geom_point() +
ggtitle("Mortgage Debt Between 2003 and 2018")
plot1
plot2 <- household_debt_perc %>%
ggplot(aes(date, credit_card)) +
geom_point() +
ggtitle("Credit Card Debt Between 2003 to 2018")
plot2
Facet_wrap allows you to plot all variables together for comparison.
In order to do this, you have to “reshape the”data from a wide format to a long format. Use gather from tidyr package to do this.
Finally, color each variable with a different color.
household_debt_perc2 <- household_debt_perc %>% tidyr::gather("id", "debt", 2:7) %>%
ggplot(., aes(date, debt))+
geom_point()+
aes(color = as.factor(id)) +
facet_wrap(~id)
household_debt_perc2