Bar Charts (from Nicole Radziwill)

Summary of exercise with small data sets: The data were generated by opening one package of regular M&Ms and counting the colors to look at their distribution. The data are stored as an R vector, that is, an ordered, one-dimensional collection of values of a single type. This is unrelated to "vector data" in the spatial sense, where vector data (points, lines, and polygons built from vertices and paths) is contrasted with raster data (grids of pixels); see gisgeography.com/spatial-data-types-vector-raster/ for that distinction. This example deals with R vectors only.

The counts of colors from the M&M bag are: 12 blue, 6 brown, 8 green, 10 orange, 6 red, and 7 yellow.

We create a vector of the counts, name its elements with the respective colors, and create a vector of the palette of colors to be used (these match the “names”). Then we call barplot on the counts, passing the vector of colors to the col argument.

mm.counts <- c(12, 6, 8, 10, 6, 7)
names(mm.counts) <- c("blue", "brown", "green", "orange", "red", "yellow")
# palette of colors for the bar chart (these match the M&M colors)
mm.colors <- c("blue", "brown", "green", "orange", "red", "yellow")
barplot(mm.counts, col = mm.colors)
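The same chart can be made a little easier to read with a title, axis labels, and the counts printed above the bars. The following is a minimal sketch in base R, assuming the six counts listed above; the title and axis label text are my own placeholders, not part of the original exercise.

```r
# Named vector: the names double as the bar labels and the fill colors
mm.counts <- c(blue = 12, brown = 6, green = 8, orange = 10, red = 6, yellow = 7)
mm.colors <- names(mm.counts)

# Check that every name is a color R recognizes before plotting
stopifnot(all(mm.colors %in% colors()))

# barplot() returns the x positions of the bar midpoints
mids <- barplot(mm.counts,
                col  = mm.colors,
                main = "M&M colors in one bag",   # placeholder title
                xlab = "Color",
                ylab = "Count",
                ylim = c(0, max(mm.counts) + 2))  # headroom for the labels

# Print each count just above its bar
text(mids, mm.counts, labels = mm.counts, pos = 3)
```

Building the color vector from names(mm.counts) keeps the labels and fill colors in step if a count is added or reordered.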

Now we will load data and create boxplots using the tidyverse.

Load Data from Three Different Sources

In the following section, data has been loaded from three different sources: 1. directly from a URL, 2. directly from pre-built datasets in R, and 3. from a file saved on my own computer.

1. Load data from a URL

# install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
allscores <- readr::read_csv("https://goo.gl/MJyzNs")
## Parsed with column specification:
## cols(
##   group = col_double(),
##   pre = col_double(),
##   post = col_double(),
##   diff = col_double()
## )
dim(allscores)
## [1] 22  4

Note: readr parses the variable “group” as continuous values (col_double), even though it is really a group label. This will be fixed later. The command “dim” provides the dimensions of the data, which are 22 observations (rows) by 4 variables (columns).
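To illustrate the point about “group” being read as a number, here is a small sketch using base R only. The inline CSV below is made-up data with the same column names, not the real allscores file; it only shows how a numeric group column can be turned into a factor and how dim reports rows and columns.

```r
# Tiny inline CSV with MADE-UP values (not the real allscores data)
csv_text <- "group,pre,post,diff
1,10,14,4
2,12,13,1
3,9,15,6"

demo <- read.csv(text = csv_text)

class(demo$group)   # read as a number by default

# Treat group as a discrete label instead
demo$group <- factor(demo$group)
levels(demo$group)

dim(demo)           # rows, then columns
```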

Introducing ggplot2 and the grammar of graphics

ggplot2 is a package that will load when tidyverse is loaded, or it can be loaded on its own.

The “gg” in ggplot2 stands for “grammar of graphics,” an approach to drawing charts devised by the statistician Leland Wilkinson. Rather than thinking in terms of finished charts like a scatter plot or a column chart, it starts by defining the coordinate system (usually the X and Y axes of a Cartesian system), maps data onto those coordinates, and then adds layers such as points, bars and so on. This is the logic behind ggplot2 code.

Some key things to understand about ggplot2:

* ggplot = This is the master function that creates a ggplot2 chart

* aes = This function, named for “aesthetic mapping,” is used whenever data values are mapped onto a chart. It is used when you define which variables are plotted on the X and Y axes, and also when you want the size or color of parts of the chart to vary according to the values of a variable.

* geom = All of the functions that add layers to a chart start with geom, followed by an underscore, for example geom_point() or geom_bar(). The code in the parentheses for any geom layer styles the items in that layer, and can include aes mappings of values from data.

* theme = This function modifies the appearance of elements of a plot. It is used, for example, to set the size and font face of text, the position of a legend, and so on.

* scale = Functions that begin with scale, followed by an underscore, are used to modify the way an aes mapping of data appears on a chart. They can change the axis range, for example, or specify a color palette to be used to encode values in the data.
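The pieces listed above can be seen working together in one short chart. This is a sketch only, using the mpg data that ships with ggplot2; the choice of variables, the "Dark2" palette, and the legend position are my own illustrative choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +  # ggplot + aes: data and mappings
  geom_point() +                                            # geom: a layer of points
  scale_color_brewer(palette = "Dark2") +                   # scale: how color encodes class
  theme(legend.position = "bottom")                         # theme: non-data appearance

p
```

Each line after the first is added with +, which is how ggplot2 stacks layers and settings on top of the base plot.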

Use Side-by-Side Boxplots

Below is some simple code to create boxplots for the 3 groups, filled by group. Because “group” is still numeric at this point, the fill is drawn as a continuous shading scale. Since the groups are really discrete, you can get rid of that shading legend.

boxpl <- allscores %>%
  ggplot() +
  geom_boxplot(aes(y=diff, group=group, fill=group))
boxpl

Notice that the legend gives a continuous range of values for group even though the groups are only 1, 2, or 3. Adding guides(fill = "none") will get rid of the legend. (Older code uses guides(fill = FALSE), which recent versions of ggplot2 deprecate.)

boxpl2 <- boxpl + guides(fill = "none")
boxpl2

Below, group is converted to a factor so that it is treated as a category rather than a number. The boxes are manually filled with 3 colors: white, light gray, and dark gray. The call to coord_flip() orients the boxplots horizontally.

allscores %>%
  mutate(group=factor(group, levels=c("1","2","3"), ordered=TRUE)) %>%
  ggplot() + geom_boxplot(aes(y=diff, group=group, fill=group)) +
  scale_fill_manual(values=c("white","lightgray","darkgray")) +
  theme(axis.text.y=element_blank()) +
  ggtitle("Score Improvements Across Three Groups") +
  coord_flip()
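The same numeric-versus-factor issue can be tried without downloading anything. Here is a sketch using the mpg data from ggplot2, where cyl is stored as an integer but really labels four categories (4, 5, 6, 8). Mapping the factor to x as well as fill is my own choice here, so that each box gets its own axis label; the gray shades are likewise illustrative.

```r
library(ggplot2)

# Copy mpg and treat the number of cylinders as a category
mpg_f <- mpg
mpg_f$cyl <- factor(mpg_f$cyl)

p_cyl <- ggplot(mpg_f, aes(x = cyl, y = hwy, fill = cyl)) +
  geom_boxplot() +
  # one fill color per factor level, in level order
  scale_fill_manual(values = c("white", "lightgray", "gray", "darkgray")) +
  guides(fill = "none") +   # the axis already labels the groups
  ggtitle("Highway Miles per Gallon by Number of Cylinders") +
  coord_flip()              # horizontal boxes

p_cyl
```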

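Load Built-in Data from R

Some data frames ship with R or with its packages. The mpg data, for example, comes with ggplot2 and is available as soon as tidyverse (or ggplot2) is loaded. Use str and head to look at the data.

In the code below, {r mpg} is only the label of the R Markdown chunk; it does not load anything. To load the data explicitly, the command data(mpg) can be used after ggplot2 is loaded. (The load function is for .RData files saved to disk, so load(“mpg”) would not work here.)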

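The data is examined using the commands “str” (gives the structure of the data), “head” (lists the first 6 rows of the dataset), and “describe” from the “psych” package (gives quite detailed summary statistics on the continuous variables). Because mpg also contains character columns, describe produces coercion warnings and NA or Inf entries for those rows, as seen in the output below.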

# {r mpg, warning = FALSE}
# install.packages("tidyverse")
writeLines('PATH="${RTOOLS40_HOME}\\usr\\bin;${PATH}"', con = "~/.Renviron")
Sys.which("make")
##                               make 
## "C:\\rtools40\\usr\\bin\\make.exe"
install.packages("psych")
## Installing package into 'C:/Users/Valued Customer/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## Warning: unable to access index for repository YOUR FAVORITE MIRROR/src/contrib:
##   scheme not supported in URL 'YOUR FAVORITE MIRROR/src/contrib/PACKAGES'
## Warning: package 'psych' is not available (for R version 4.0.1)
## Warning: unable to access index for repository YOUR FAVORITE MIRROR/bin/windows/contrib/4.0:
##   scheme not supported in URL 'YOUR FAVORITE MIRROR/bin/windows/contrib/4.0/PACKAGES'
local({r <- getOption("repos")
       r["CRAN"] <- "http://cran.r-project.org"
       options(repos=r)})
file.edit(file.path("~", ".Rprofile")) # edit .Rprofile in HOME
install.packages("installr")
## Installing package into 'C:/Users/Valued Customer/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'installr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Valued Customer\AppData\Local\Temp\RtmpQxfn3u\downloaded_packages
library(installr)
## 
## Welcome to installr version 0.22.0
## 
## More information is available on the installr project website:
## https://github.com/talgalili/installr/
## 
## Contact: <tal.galili@gmail.com>
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/installr/issues
## 
##          To suppress this message use:
##          suppressPackageStartupMessages(library(installr))
install.packages("psych")
## Installing package into 'C:/Users/Valued Customer/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'psych' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Valued Customer\AppData\Local\Temp\RtmpQxfn3u\downloaded_packages
library(tidyverse)
library(psych) # used for the "describe" command below
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
str(mpg)
## tibble [234 x 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...
head(mpg)
## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa~
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa~
describe(mpg)
## Warning in describe(mpg): NAs introduced by coercion

## Warning in describe(mpg): NAs introduced by coercion

## Warning in describe(mpg): NAs introduced by coercion

## Warning in describe(mpg): NAs introduced by coercion

## Warning in describe(mpg): NAs introduced by coercion

## Warning in describe(mpg): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
##               vars   n    mean   sd median trimmed  mad    min  max range skew
## manufacturer*    1 234     NaN   NA     NA     NaN   NA    Inf -Inf  -Inf   NA
## model*           2 234     NaN   NA     NA     NaN   NA    Inf -Inf  -Inf   NA
## displ            3 234    3.47 1.29    3.3    3.39 1.33    1.6    7   5.4 0.44
## year             4 234 2003.50 4.51 2003.5 2003.50 6.67 1999.0 2008   9.0 0.00
## cyl              5 234    5.89 1.61    6.0    5.86 2.97    4.0    8   4.0 0.11
## trans*           6 234     NaN   NA     NA     NaN   NA    Inf -Inf  -Inf   NA
## drv*             7 234    4.00 0.00    4.0    4.00 0.00    4.0    4   0.0  NaN
## cty              8 234   16.86 4.26   17.0   16.61 4.45    9.0   35  26.0 0.79
## hwy              9 234   23.44 5.95   24.0   23.23 7.41   12.0   44  32.0 0.36
## fl*             10 234     NaN   NA     NA     NaN   NA    Inf -Inf  -Inf   NA
## class*          11 234     NaN   NA     NA     NaN   NA    Inf -Inf  -Inf   NA
##               kurtosis   se
## manufacturer*       NA   NA
## model*              NA   NA
## displ            -0.91 0.08
## year             -2.01 0.29
## cyl              -1.46 0.11
## trans*              NA   NA
## drv*               NaN 0.00
## cty               1.43 0.28
## hwy               0.14 0.39
## fl*                 NA   NA
## class*              NA   NA

It is essential to recognize the type of each variable: int (integer), num or dbl (double-precision numeric), chr (character), or factor (for categories). Typically, chr or factor is used for discrete variables and int, dbl, or num for continuous variables.
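One way to act on this distinction is to summarize the numeric columns and the character columns separately, which avoids the coercion warnings that describe printed above. The sketch below uses only base R functions plus the mpg data from ggplot2; it is an alternative I am suggesting, not what the original notes did. It also hints at why drv shows a mean of 4 in the describe output: one of its three codes is the character "4", which can be coerced to a number, while "f" and "r" cannot.

```r
library(ggplot2)  # provides the mpg data

# TRUE for integer and double columns, FALSE for character columns
is_num <- vapply(mpg, is.numeric, logical(1))

names(mpg)[is_num]    # continuous variables
names(mpg)[!is_num]   # discrete variables stored as text

# Numeric summaries only for the numeric columns
summary(mpg[is_num])

# Counts are the natural summary for a discrete variable
table(mpg$drv)
```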

Now make a scatterplot using ggplot2

Below is a scatterplot of city vs. highway miles per gallon. Points are colored by drive type: 4-wheel, front-wheel, or rear-wheel drive.

The code is below:

1. name the plot by assigning it: plot1 <-

2. give the name of the dataset, “mpg”, and “pipe it” (more on that later) into the plotting code

3. call “ggplot” to set up the plot and map cty to x, hwy to y, and drv to color

4. add geom_point to draw the points

5. call plot1 to display the entire plot

plot1 <- mpg %>%
  ggplot(aes(cty, hwy, color = drv))+
  geom_point()
plot1

Notice that the blue points for rear-wheel drive are only at the lower left side of the plot (i.e., not great mpg). Red points for 4-wheel drive have a wider spread of points, but they are also mainly at the lower left corner of the plot. The green points for front-wheel drive are mostly at the upper right, for the higher mpg.

Add a title and labels

Although there are already axis labels, we can do better. We should also add a title.

plot1 <- mpg %>%
  ggplot(aes(cty, hwy, color = drv))+
  geom_point()+
  xlab("City miles per gallon")+
  ylab("Highway miles per gallon")+
  ggtitle("Scatterplot of City versus Highway Miles per Gallon")
plot1
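As an alternative to separate xlab, ylab, and ggtitle calls, ggplot2's labs function can set all of the labels in one place, including the legend title. This is a sketch of that variation; the name plot2 and the legend title "Drive type" are my own choices.

```r
library(ggplot2)

plot2 <- ggplot(mpg, aes(cty, hwy, color = drv)) +
  geom_point() +
  labs(x     = "City miles per gallon",
       y     = "Highway miles per gallon",
       color = "Drive type",   # legend title for the color mapping
       title = "Scatterplot of City versus Highway Miles per Gallon")

plot2
```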

R practice from Week 1 Notes pages 16 - 17

"%>%" <- function(x,f) do.call(f,list(x))
pi %>% sin
## [1] 1.224606e-16
pi %>% sin %>% cos
## [1] 1
cos(sin(pi))
## [1] 1
# Initialize `x`
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
#Compute the logarithm of `x`, return suitably lagged and iterated differences, compute the exponential function and round the result
round(exp(diff(log(x))), 1)
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1
# Import `magrittr`
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked _by_ '.GlobalEnv':
## 
##     %>%
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
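# Perform the same computations on `x` as above.
# The hand-written `%>%` defined earlier lives in the global environment and
# masks magrittr's pipe. It only accepts a bare function such as `sin`, so a
# call such as `log()` fails with "argument "x" is missing, with no default".
# Removing it lets magrittr's `%>%` take over.
rm("%>%")
x %>% log() %>%
  diff() %>%
  exp() %>%
  round(1)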
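The error in this practice section comes from the difference between passing a function and passing a call. The sketch below reproduces it in isolation. The name naive_pipe is hypothetical; it has the same body as the hand-written pipe above but a different name, so it does not mask magrittr's operator.

```r
library(magrittr)

# Same body as the hand-written pipe above, under a hypothetical name
naive_pipe <- function(x, f) do.call(f, list(x))

naive_pipe(pi, sin)   # works: `sin` is a function object

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# `log()` is a call, not a function: R evaluates it with no arguments and fails
msg <- tryCatch(naive_pipe(x, log()),
                error = function(e) conditionMessage(e))
msg

# Remove any leftover global `%>%` so that magrittr's version is the one found
if (exists("%>%", envir = globalenv(), inherits = FALSE)) {
  rm("%>%", envir = globalenv())
}

# magrittr rewrites `x %>% log()` as `log(x)`, so calls with parentheses work
piped  <- x %>% log() %>% diff() %>% exp() %>% round(1)
nested <- round(exp(diff(log(x))), 1)

identical(piped, nested)
```

The piped and nested versions perform the same computation; the pipe only changes the order in which the steps are written.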