Load Data from Three Different Sources

In the following notes, you will load data directly from a URL, directly from pre-build datasets in R, and finally from a file you save in your own folder.

Load Data from a URL

You can load data from a folder or you can load data directly from a URL. The next example loads the dataset, “Test Scores”, directly from the URL where it resides.

library(tidyverse)  # you will use the readr package in tidyverse to read in this data
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.2
## Warning: package 'dplyr' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## Warning: package 'forcats' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
allscores <- read_csv("https://goo.gl/MJyzNs")
## Rows: 22 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (4): group, pre, post, diff
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(allscores)
## [1] 22  4

Notice R interprets the variable “group” as continuous values (col_double). We will fix this later. The command “dim” provides the dimensions of the data, which are 22 observations (rows) by 4 variables (columns).

Use Side-by-Side Boxplots

Here is some easy code to create 3 groups of boxplots with some easy-to-access data, filled by group. Since the groups are discrete, you can get rid of the shading.

boxpl <- allscores %>% ggplot() + geom_boxplot(aes(y=diff, group=group,
                                                fill=group))
boxpl

Notice that the legend give a continuous range of values for the scores, even though the scores are only 1, 2, or 3. The code guides(fill = FALSE) will get rid of the legend.

boxpl2 <- boxpl + guides(fill=FALSE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
boxpl2

Ensure that the groups are considered as factors, rather than numbers. Then manually fill with the 3 colors: white, light gray, and dark gray. Make the boxplots orient horizontally.

allscores %>%
  mutate(group=factor(group, levels=c("1","2","3"), ordered=TRUE)) %>%
  ggplot() + geom_boxplot(aes(y=diff, group=group, fill=group)) +
  scale_fill_manual(values=c("white","lightgray","darkgray")) +
  theme(axis.text.y=element_blank()) +
  ggtitle("Score Improvements Across Three Groups") +
  coord_flip()

Load Built in Data from R

Some data frames are built in to R, such as mpg. Load the data, then use str and head to look at the data.

{r mpg} loads the data. Alternatively, you can use the command: load(“mpg”)

You will look at the data using the command “str” (gives the structure of the data), “head” (lists the first 6 rows of observations in the dataset), and “describe” from the “psych” package (gives quite detailed summary statistics on the continuous variables).

# install.packages("psych")
library(tidyverse)
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
describe(mpg)
##               vars   n    mean    sd median trimmed   mad    min  max range
## manufacturer*    1 234    7.76  5.13    6.0    7.68  5.93    1.0   15  14.0
## model*           2 234   19.09 11.15   18.5   18.98 14.08    1.0   38  37.0
## displ            3 234    3.47  1.29    3.3    3.39  1.33    1.6    7   5.4
## year             4 234 2003.50  4.51 2003.5 2003.50  6.67 1999.0 2008   9.0
## cyl              5 234    5.89  1.61    6.0    5.86  2.97    4.0    8   4.0
## trans*           6 234    5.65  2.88    4.0    5.53  1.48    1.0   10   9.0
## drv*             7 234    1.67  0.66    2.0    1.59  1.48    1.0    3   2.0
## cty              8 234   16.86  4.26   17.0   16.61  4.45    9.0   35  26.0
## hwy              9 234   23.44  5.95   24.0   23.23  7.41   12.0   44  32.0
## fl*             10 234    4.63  0.70    5.0    4.77  0.00    1.0    5   4.0
## class*          11 234    4.59  1.99    5.0    4.64  2.97    1.0    7   6.0
##                skew kurtosis   se
## manufacturer*  0.21    -1.63 0.34
## model*         0.11    -1.23 0.73
## displ          0.44    -0.91 0.08
## year           0.00    -2.01 0.29
## cyl            0.11    -1.46 0.11
## trans*         0.29    -1.65 0.19
## drv*           0.48    -0.76 0.04
## cty            0.79     1.43 0.28
## hwy            0.36     0.14 0.39
## fl*           -2.25     5.76 0.05
## class*        -0.14    -1.52 0.13

It is essential to recognize that variables may be: int (integer), num (numeric), or double vs char (character) and factor (for categories)

Typically, chr or factor are used for discrete variables and int, dbl, or num for continuous variables.

Introducing ggplot2 and the grammar of graphics

The “gg” in ggplot2 stands for “grammar of graphics,” an approach to drawing charts devised by the statistician Leland Wilkinson. Rather than thinking in terms of finished charts like a scatter plot or a column chart, it starts by defining the coordinate system (usually the X and Y axes of a cartesian system), maps data onto those coordinates, and then adds layers such as points, bars and so on. This is the logic behind ggplot2 code.

Some key things to understand about ggplot2:

. ggplot This is the master function that creates a ggplot2 chart.

. aes This function, named for “aesthetic mapping,” is used whenever data values are mapped onto a chart. So it is used when you define which variables are plotted onto the X and Y axes, and also if you want to change the size or color of parts of the chart according to values for a variable.

. geom All of the functions that add layers to a chart start with geom, followed by an underscore, for example geom_point() or geom_bar(). The code in the brackets for any geom layer styles the items in that layer, and can include aes mappings of values from data.

. theme This function modifies the appearance of elements of a plot, used, for example, to set size and font face for text, the position of a legend, and so on.

. scale Functions that begin with scale, followed by an underscore, are used to modify the way an aes mapping of data appears on a chart. They can change the axis range, for example, or specify a color palette to be used to encode values in the data.

. + is used each time you add a layer, a scale, a theme, or elements like axis labels and a title After a + you can continue on the same line of code or move the next line. I usually write a new line after each +, which makes the code easier to follow.

Now make a scatterplot using ggplot2

Make a scatterplot of city vs highway miles per gallon, but sort/color points by either 4-wheel, front-wheel, or rear-wheel drive

Here is how we will code:

  1. name the plot: “plot1” <-

  2. call back the name of the dataset “mpg” and “pipe it” (more on that later) to create the frame for your plot

  3. call “ggplot” to make a set of axes, with the aesthetics (aes) for city and highway mpg, but color points by the factors for drv

  4. add geom_point to see the points

  5. call plot1 to see the entire plot

plot1 <- mpg %>% ggplot(aes(cty, hwy, color = drv))+ 
  geom_point()
plot1

Notice that the blue points for rear-wheel drive are only at the lower left side of the plot (i.e., not great mpg). Red points for 4-wheel drive have a wider spread of points, but they are also mainly at the lower left corner of the plot. The green points for front-wheel drive are mostly at the upper right, for the higher mpg.

Add a title and labels

Although there are already axes labels, we can do better. We should also add a title

plot1 <- mpg %>% ggplot(aes(cty, hwy, color = drv))+ 
  geom_point()+
  xlab("City miles per gallon") +
  ylab("Highway miles per gallon") +
  ggtitle("Scatterplot of City versus Highway Miles per Gallon")
plot1

Clean data headings and variable names

Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check: 1. Be sure the format is .csv 2. Be sure there are no spaces between variable names (headers). 3. Set all variable names to lowercase so you do not have to keep track of capitalizing.

Load the libraries

# install.packages("zoo")
library(tidyverse)
library(zoo)  # this package will help us re-format the period to be a useable date.
## Warning: package 'zoo' was built under R version 4.1.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Loading Data from a Working Directory

You may get data from a url, from a pre-built dataset, or you may load data from a working directory. A working directory is simply the file location on your computer.

The easiest way to find out what your current workinig directory is, use the command getwd().

getwd()
## [1] "C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes"

This command shows you (in your console below) the path to your directory. My current path is: [1] “C:/Users/rsaidi/Dropbox/Rachel/MontColl/DATA110/Notes”

If you want to change the path, there are several ways to do so. I find the easiest way to change it is to click the “Session” tab at the top of R Studio. Select “Set Working Directory”, and then arrow over to “Choose Directory”. At this point, it will take you to your computer folders, and you need to select where your data is held. I suggest you create a folder called “Datasets” and keep all the data you load for this class in that folder.

Notice that down in the console below, it will show the new path you have chosen: setwd(“C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets”). At this point, I copy that command and put it directly into a new chunk.

Load the data

The following data comes from New York Fed (https://www.newyorkfed.org/microeconomics/hhdc.html) regarding household debt for housing and non-housing expenses.

Download this dataset, Household_debt, from http://bit.ly/2P3084E and save it in your dataset folder. Change your working directory to load the dataset from YOUR folder. Then run this code.

setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
household <- read_csv("household_debt.csv")
## Rows: 64 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): Period
## dbl (7): Mortgage, HE Revolving, Auto Loan, Credit Card, Student Loan, Other...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean data headings and variable names

Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check: 1. Be sure the format is .csv 2. Be sure there are no spaces between variable names (headers). 3. Set all variable names to lowercase so you do not have to keep track of capitalizing.

Here are some useful cleaning commands:

Make all headings (column names) lowercase. Remove all spaces between words in headings and replace them with underscores with the gsub command. Then look at it with “head” and “tail”.

names(household) <- tolower(names(household))
names(household) <- gsub(" ","_",names(household))
head(household)
## # A tibble: 6 x 8
##   period mortgage he_revolving auto_loan credit_card student_loan other total
##   <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
## 1 03:Q1      4.94         0.24      0.64        0.69         0.24  0.48  7.23
## 2 03:Q2      5.08         0.26      0.62        0.69         0.24  0.49  7.38
## 3 03:Q3      5.18         0.27      0.68        0.69         0.25  0.48  7.56
## 4 03:Q4      5.66         0.3       0.7         0.7          0.25  0.45  8.07
## 5 04:Q1      5.84         0.33      0.72        0.7          0.26  0.45  8.29
## 6 04:Q2      5.97         0.37      0.74        0.7          0.26  0.42  8.46
tail(household)
## # A tibble: 6 x 8
##   period mortgage he_revolving auto_loan credit_card student_loan other total
##   <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
## 1 17:Q3      8.74         0.45      1.21        0.81         1.36  0.39  13.0
## 2 17:Q4      8.88         0.44      1.22        0.83         1.38  0.39  13.2
## 3 18:Q1      8.94         0.44      1.23        0.82         1.41  0.39  13.2
## 4 18:Q2      9            0.43      1.24        0.83         1.41  0.39  13.3
## 5 18:Q3      9.14         0.42      1.27        0.84         1.44  0.4   13.5
## 6 18:Q4      9.12         0.41      1.27        0.87         1.46  0.41  13.5

Look at the dimensions and the structure of the data. Note that it will be listed as a tibble (to be discussed later in these notes).

dim(household)
## [1] 64  8

Mutate

Mutate is a powerful command in tidyverse. It creates a new variable (column) in your dataset. In our dataset, “period” is not anything useful if we want to plot chronological data. So we will use mutate from “tidyverse” with the package “zoo” to create a useable date format.

Create a new variable to use instead of “period”

You should see that there are 64 observations and 8 variables. All variables are “col_double” (continuous values) except “period”, which is interpreted as characters.We need to use the library “zoo” package to fix the unusual format of the “period”. We will mutate it to create a new variable, date.

household_debt_perc <- household %>%
   mutate(date = as.Date(as.yearqtr(period, format = "%y:Q%q")))
household_debt_perc
## # A tibble: 64 x 9
##    period mortgage he_revolving auto_loan credit_card student_loan other total
##    <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
##  1 03:Q1      4.94         0.24      0.64        0.69         0.24  0.48  7.23
##  2 03:Q2      5.08         0.26      0.62        0.69         0.24  0.49  7.38
##  3 03:Q3      5.18         0.27      0.68        0.69         0.25  0.48  7.56
##  4 03:Q4      5.66         0.3       0.7         0.7          0.25  0.45  8.07
##  5 04:Q1      5.84         0.33      0.72        0.7          0.26  0.45  8.29
##  6 04:Q2      5.97         0.37      0.74        0.7          0.26  0.42  8.46
##  7 04:Q3      6.21         0.43      0.75        0.71         0.33  0.41  8.83
##  8 04:Q4      6.36         0.47      0.73        0.72         0.35  0.42  9.04
##  9 05:Q1      6.51         0.5       0.73        0.71         0.36  0.39  9.21
## 10 05:Q2      6.7          0.53      0.77        0.72         0.37  0.4   9.49
## # ... with 54 more rows, and 1 more variable: date <date>

Finally plot various loan types

plot1 <- household_debt_perc %>% 
  ggplot(aes(date, mortgage)) +
  geom_point() +
  ggtitle("Mortgage Debt Between 2003 and 2018")
plot1

plot2 <- household_debt_perc %>% 
  ggplot(aes(date, credit_card)) +
  geom_point() + 
  ggtitle("Credit Card Debt Between 2003 to 2018")
plot2

Use “facet_wrap” to show all types of debt together

Facet_wrap allows you to plot all variables together for comparison.

In order to do this, you have to “reshape the”data from a wide format to a long format. Use gather from tidyr package to do this.

Finally, color each variable with a different color.

household_debt_perc2 <- household_debt_perc %>% tidyr::gather("id", "debt", 2:7) %>% 
  ggplot(., aes(date, debt))+
  geom_point()+
  aes(color = as.factor(id)) +
  facet_wrap(~id)
household_debt_perc2