Articles to Read
Introduction to R Studio
What is R Markdown
Basic Data Structures in R
- Vectors in R
- Matrices in R
- Data Frames in R
  - The Package DT: Display data frames in a way that is searchable
  - Creating Data Frames from CSV files
- Lists in R
Remove R Objects from Memory
Reserved Words in R
Basics in R Programming
- Logical Expressions
- If-else Conditionals
- The For Loop: Example 1- Sum of the first 100 positive integers.
- The For Loop: Example 2- Fibbonacci Sequence.
- The While Loop
Some Useful Built-in R Functions
- The paste() Function
- The factor() Function
Numeric Summary of Data
Graphical Display of Data
Overlay Curves
User-defined Functions
The Pipe Operator: A Modern Way of Using R Functions
Debugging
An Introduction to the ggplot2 package
- Plotting data with the qplot() function from ggplot2 Package
- Plotting Functions with ggplot2
- More Examples with ggplot2
  - Custom Colors
  - Overlay curves from different sources of data
- Other Nice Plots
- Using arrange() to arrange observations in order
- Using mutate() to change or create a column
- Using group_by() to group data
- Using summarize() to summarize grouped data
- A Comprehensive Example
Data Wrangling with the “tidyr” Package
Query Using the sqldf Package
Relational database management system (RDBMS) in R
Regular Expression
Handling Dates in R
Some Interesting Topics
- Making Nice Tables through R Markdown
- Making a Fern
- Making a Word Cloud
A comparison between R and Python as a Data Science Programming Tool
Reference

Articles to Read

NHS backlog data analysis. https://www.bma.org.uk/advice-and-support/nhs-delivery-and-workforce/pressures/nhs-backlog-data-analysis

Introduction to R Studio

An integrated development environment (IDE) for R and Python, with a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management (From https://rstudio.com/).

Watch the video: https://www.youtube.com/watch?v=PviVimazpz8

Once R and RStudio have been installed, open RStudio. Type “version” in the console window (the lower-left window) to check the version of R..

What is R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

x = 4
y = 7
z = 67
a = x*y + x/y - sqrt(x) - 2*abs(y) + exp(-0.3*x) +3*z
a

## [1] 213.8726

For Python fans, the following is simple python code. To run python code within RStudio, the package “reticulate” must be loaded in the “setup” code chunk (at the beginning of this document, below YAML).


fruits = ["apple", "banana", "cherry"]
for x in fruits:
  print(x)

## apple
## banana
## cherry

Basic Data Structures in R

Basic data structures in R include the vector, matrix, data frame, and list. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. data frames, lists).

Vectors in R

x = c(3, 6, 7, 10, 12, 18, 20) # This is a numeric vector
class(x)

## [1] "numeric"

y = c("Tom", "Jerry", "Donald", "Bob") # This is a character vector
class(y)

## [1] "character"

z1 = c(4, 6, 9)
z2 = c(a = 4, b = 6, c = 9) # A named vector, looking like a table

print(x)

## [1]  3  6  7 10 12 18 20

## [1] "Tom"    "Jerry"  "Donald" "Bob"

z1

## [1] 4 6 9

z2

## a b c 
## 4 6 9

letters # a constant vector built in base R

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

state.name # another constant vector built in base R

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

Matrices in R

M = matrix(data = 1:12, nrow = 3, byrow = TRUE)
M # Print the matrix.

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(M) # Dimension of matrix M

## [1] 3 4

M[2,3] # the element in row 2 and column 3, which is a scalar

## [1] 7

M[2,] # the second row, which is a vector

## [1] 5 6 7 8

M[,3] # the third column, which is a vector

## [1]  3  7 11

M2 = matrix(0, nrow = 3, ncol = 4)
M2

##      [,1] [,2] [,3] [,4]
## [1,]    0    0    0    0
## [2,]    0    0    0    0
## [3,]    0    0    0    0

class(M2)

## [1] "matrix" "array"

Data Frames in R

DF = data.frame(Name = c("John", "Alice", "Tod", "Megan", "Jessica"),
                Major = c("Software Engineering", "Statistics", "Accounting", "Computer Science", "Statistics"),
                `Score of Stat` = c(89, 92, 87, 90, 88), # When a column name has spaces, use quotation marks
                'Score of Calc' = c(77, 87, 92, 93, 79)
               )

DF # Print

##      Name                Major Score.of.Stat Score.of.Calc
## 1    John Software Engineering            89            77
## 2   Alice           Statistics            92            87
## 3     Tod           Accounting            87            92
## 4   Megan     Computer Science            90            93
## 5 Jessica           Statistics            88            79

# Subsetting the data
DF[2,3] # Extract the element on the second row and the third column and Print it.

## [1] 92

DF[2,] # Extract the second row

##    Name      Major Score.of.Stat Score.of.Calc
## 2 Alice Statistics            92            87

DF[, 3] # Extract the third column

## [1] 89 92 87 90 88

DF[, c(1,3)]

##      Name Score.of.Stat
## 1    John            89
## 2   Alice            92
## 3     Tod            87
## 4   Megan            90
## 5 Jessica            88

# Check the dimensions
dim(DF)

## [1] 5 4

# Get the row names. By default, row names are 1, 2, 3, ...
rownames(DF)

## [1] "1" "2" "3" "4" "5"

# Get the column names in two ways
colnames(DF)

## [1] "Name"          "Major"         "Score.of.Stat" "Score.of.Calc"

names(DF)

## [1] "Name"          "Major"         "Score.of.Stat" "Score.of.Calc"

# Check the class of an object
class(DF)

## [1] "data.frame"

# Rename the second column
colnames(DF)[2] = "Specialty"
DF

##      Name            Specialty Score.of.Stat Score.of.Calc
## 1    John Software Engineering            89            77
## 2   Alice           Statistics            92            87
## 3     Tod           Accounting            87            92
## 4   Megan     Computer Science            90            93
## 5 Jessica           Statistics            88            79

R has some built in data frames that we can use. Here are examples:

# 1. The pressure data
pressure

##    temperature pressure
## 1            0   0.0002
## 2           20   0.0012
## 3           40   0.0060
## 4           60   0.0300
## 5           80   0.0900
## 6          100   0.2700
## 7          120   0.7500
## 8          140   1.8500
## 9          160   4.2000
## 10         180   8.8000
## 11         200  17.3000
## 12         220  32.1000
## 13         240  57.0000
## 14         260  96.0000
## 15         280 157.0000
## 16         300 247.0000
## 17         320 376.0000
## 18         340 558.0000
## 19         360 806.0000

# 2. The growth data
PlantGrowth

##    weight group
## 1    4.17  ctrl
## 2    5.58  ctrl
## 3    5.18  ctrl
## 4    6.11  ctrl
## 5    4.50  ctrl
## 6    4.61  ctrl
## 7    5.17  ctrl
## 8    4.53  ctrl
## 9    5.33  ctrl
## 10   5.14  ctrl
## 11   4.81  trt1
## 12   4.17  trt1
## 13   4.41  trt1
## 14   3.59  trt1
## 15   5.87  trt1
## 16   3.83  trt1
## 17   6.03  trt1
## 18   4.89  trt1
## 19   4.32  trt1
## 20   4.69  trt1
## 21   6.31  trt2
## 22   5.12  trt2
## 23   5.54  trt2
## 24   5.50  trt2
## 25   5.37  trt2
## 26   5.29  trt2
## 27   4.92  trt2
## 28   6.15  trt2
## 29   5.80  trt2
## 30   5.26  trt2

# 3. The cars data
cars

##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

# 4. Motor Trend Car Road Tests
mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

# 5. Fisher's iris data
iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

# 6. Crime rates data
USArrests

##                Murder Assault UrbanPop Rape
## Alabama          13.2     236       58 21.2
## Alaska           10.0     263       48 44.5
## Arizona           8.1     294       80 31.0
## Arkansas          8.8     190       50 19.5
## California        9.0     276       91 40.6
## Colorado          7.9     204       78 38.7
## Connecticut       3.3     110       77 11.1
## Delaware          5.9     238       72 15.8
## Florida          15.4     335       80 31.9
## Georgia          17.4     211       60 25.8
## Hawaii            5.3      46       83 20.2
## Idaho             2.6     120       54 14.2
## Illinois         10.4     249       83 24.0
## Indiana           7.2     113       65 21.0
## Iowa              2.2      56       57 11.3
## Kansas            6.0     115       66 18.0
## Kentucky          9.7     109       52 16.3
## Louisiana        15.4     249       66 22.2
## Maine             2.1      83       51  7.8
## Maryland         11.3     300       67 27.8
## Massachusetts     4.4     149       85 16.3
## Michigan         12.1     255       74 35.1
## Minnesota         2.7      72       66 14.9
## Mississippi      16.1     259       44 17.1
## Missouri          9.0     178       70 28.2
## Montana           6.0     109       53 16.4
## Nebraska          4.3     102       62 16.5
## Nevada           12.2     252       81 46.0
## New Hampshire     2.1      57       56  9.5
## New Jersey        7.4     159       89 18.8
## New Mexico       11.4     285       70 32.1
## New York         11.1     254       86 26.1
## North Carolina   13.0     337       45 16.1
## North Dakota      0.8      45       44  7.3
## Ohio              7.3     120       75 21.4
## Oklahoma          6.6     151       68 20.0
## Oregon            4.9     159       67 29.3
## Pennsylvania      6.3     106       72 14.9
## Rhode Island      3.4     174       87  8.3
## South Carolina   14.4     279       48 22.5
## South Dakota      3.8      86       45 12.8
## Tennessee        13.2     188       59 26.9
## Texas            12.7     201       80 25.5
## Utah              3.2     120       80 22.9
## Vermont           2.2      48       32 11.2
## Virginia          8.5     156       63 20.7
## Washington        4.0     145       73 26.2
## West Virginia     5.7      81       39  9.3
## Wisconsin         2.6      53       66 10.8
## Wyoming           6.8     161       60 15.6

The Package DT: Display data frames in a way that is searchable

library(DT)
datatable(iris)

Creating Data Frames from CSV files

Creating data frames with the data.frame() function is only useful for small data. If data are saved in a CSV (Comma-Separated Values) file on a local computer or on internet, we can read them via the read.table(), read.csv(), or other R functions. For help. type ?read.table in the console.

Reading data from a data file is demonstrated below.

# Read data from a your own computer
filePath = "/Users/home/Downloads/daily.csv"
myData = read.table(file = filePath, header = TRUE, sep = ",", skip = 0)
head(myData) # Display the first 6 rows by default

##       date state positive negative pending hospitalizedCurrently
## 1 20200409    AK      235     6988      NA                    NA
## 2 20200409    AL     2769    18058      NA                    NA
## 3 20200409    AR     1119    13832      NA                    73
## 4 20200409    AS        0       20      11                    NA
## 5 20200409    AZ     3018    34160      NA                    NA
## 6 20200409    CA    18309   145191   14100                  2825
##   hospitalizedCumulative inIcuCurrently inIcuCumulative onVentilatorCurrently
## 1                     27             NA              NA                    NA
## 2                    333             NA              NA                    NA
## 3                    130             NA              43                    31
## 4                     NA             NA              NA                    NA
## 5                     NA             NA              NA                    NA
## 6                     NA           1132              NA                    NA
##   onVentilatorCumulative recovered                                     hash
## 1                     NA        49 b25c4a1e5bb964d05142022307f57e41e0872f26
## 2                     NA        NA 34ec8a688f0984d69a58ddef29d640ace6ee65bf
## 3                     39       288 d21f4f58502c23fd66e3842b4b6eb7c9c18ccdf0
## 4                     NA        NA b74c56fd55694ebfcdaf5fe53c9b3fd3ffa11195
## 5                     NA        NA a706debeb0a3336ce380a04afeca43b1f77d44c9
## 6                     NA        NA 415e9a2ee1c4debcc75db294bc356b28a291a185
##            dateChecked death hospitalized  total totalTestResults posNeg fips
## 1 2020-04-09T20:00:00Z     7           27   7223             7223   7223    2
## 2 2020-04-09T20:00:00Z    74          333  20827            20827  20827    1
## 3 2020-04-09T20:00:00Z    21          130  14951            14951  14951    5
## 4 2020-04-09T20:00:00Z     0           NA     31               20     20   60
## 5 2020-04-09T20:00:00Z    89           NA  37178            37178  37178    4
## 6 2020-04-09T20:00:00Z   492           NA 177600           163500 163500    6
##   deathIncrease hospitalizedIncrease negativeIncrease positiveIncrease
## 1             0                    0              146                9
## 2             8                   19             1305              400
## 3             3                    0              302              119
## 4             0                    0                0                0
## 5             9                    0             2322              292
## 6            50                    0            17884             1352
##   totalTestResultsIncrease
## 1                      155
## 2                     1705
## 3                      421
## 4                        0
## 5                     2614
## 6                    19236

# The following is a dataset from http://data.un.org/
url = "http://data.un.org/_Docs/SYB/CSV/SYB62_246_201907_Population%20Growth,%20Fertility%20and%20Mortality%20Indicators.csv"

D = read.table(file = url, header = TRUE, sep = ",", skip = 1, fill = TRUE)
dim(D) # To display the numbers of rows and columns of the data

## [1] 4966    7

D$Footnotes[1] # Notes on the Year column

## [1] "Data refers to a 5-year period preceding the reference year."

D = D[-c(1:564), 2:5] # Remove the first 564 rows (non-countries) and keep the 2nd to 5th columns

head(D) # Display the first 6 rows by default

##               X Year                                                   Series
## 565 Afghanistan 2005             Population annual rate of increase (percent)
## 566 Afghanistan 2005                Total fertility rate (children per women)
## 567 Afghanistan 2005  Infant mortality for both sexes (per 1,000 live births)
## 568 Afghanistan 2005 Maternal mortality ratio (deaths per 100,000 population)
## 569 Afghanistan 2005          Life expectancy at birth for both sexes (years)
## 570 Afghanistan 2005               Life expectancy at birth for males (years)
##        Value
## 565   4.2140
## 566   7.1816
## 567  84.6470
## 568 820.9010
## 569  56.9960
## 570  55.7900

head(D, n = 10) # Display the first n (10) rows

##               X Year                                                   Series
## 565 Afghanistan 2005             Population annual rate of increase (percent)
## 566 Afghanistan 2005                Total fertility rate (children per women)
## 567 Afghanistan 2005  Infant mortality for both sexes (per 1,000 live births)
## 568 Afghanistan 2005 Maternal mortality ratio (deaths per 100,000 population)
## 569 Afghanistan 2005          Life expectancy at birth for both sexes (years)
## 570 Afghanistan 2005               Life expectancy at birth for males (years)
## 571 Afghanistan 2005             Life expectancy at birth for females (years)
## 572 Afghanistan 2010             Population annual rate of increase (percent)
## 573 Afghanistan 2010                Total fertility rate (children per women)
## 574 Afghanistan 2010  Infant mortality for both sexes (per 1,000 live births)
##        Value
## 565   4.2140
## 566   7.1816
## 567  84.6470
## 568 820.9010
## 569  56.9960
## 570  55.7900
## 571  58.2900
## 572   2.5790
## 573   6.4784
## 574  72.1930

tail(D) # Display the last 6 rows by default

##             X Year                                                   Series
## 4961 Zimbabwe 2015                Total fertility rate (children per women)
## 4962 Zimbabwe 2015  Infant mortality for both sexes (per 1,000 live births)
## 4963 Zimbabwe 2015 Maternal mortality ratio (deaths per 100,000 population)
## 4964 Zimbabwe 2015          Life expectancy at birth for both sexes (years)
## 4965 Zimbabwe 2015               Life expectancy at birth for males (years)
## 4966 Zimbabwe 2015             Life expectancy at birth for females (years)
##         Value
## 4961   4.0897
## 4962  51.2210
## 4963 443.3138
## 4964  56.7100
## 4965  54.8800
## 4966  58.2600

tail(D, n = 10) # Display the last n (10) rows by default

##             X Year                                                   Series
## 4957 Zimbabwe 2010          Life expectancy at birth for both sexes (years)
## 4958 Zimbabwe 2010               Life expectancy at birth for males (years)
## 4959 Zimbabwe 2010             Life expectancy at birth for females (years)
## 4960 Zimbabwe 2015             Population annual rate of increase (percent)
## 4961 Zimbabwe 2015                Total fertility rate (children per women)
## 4962 Zimbabwe 2015  Infant mortality for both sexes (per 1,000 live births)
## 4963 Zimbabwe 2015 Maternal mortality ratio (deaths per 100,000 population)
## 4964 Zimbabwe 2015          Life expectancy at birth for both sexes (years)
## 4965 Zimbabwe 2015               Life expectancy at birth for males (years)
## 4966 Zimbabwe 2015             Life expectancy at birth for females (years)
##         Value
## 4957  45.0170
## 4958  43.3400
## 4959  46.6800
## 4960   1.6860
## 4961   4.0897
## 4962  51.2210
## 4963 443.3138
## 4964  56.7100
## 4965  54.8800
## 4966  58.2600

names(D) # Display the column (variable) names.

## [1] "X"      "Year"   "Series" "Value"

names(D)[1] = "Country" # Change the column name from "X" to "Country"

# Can you answer this question?
# During the 5-year period ending at 2005, what was the population annual rate of increase (percent)?
# During the 5-year period ending at 2010, what was the life expectancy at birth for females (years)?
# Your answer to previous questions should equal 2.0100 + 2.2040 and 65.1230 + 13.417, respectively.

# If your data are in the xls or xlsx format, install and load the package "readxl" and 
# then call read_excel() function to read the data.
# If you read an xls file, issue the code
# my_data <- read_excel("path to my_file.xls")
# If you read an xlsx file, issue the code
# my_data <- read_excel("path to my_file.xlsx")

Lists in R

L = list(Name = "Tom", Age = 19, Sex = "Male", Major = "Computer Science")
L

## $Name
## [1] "Tom"
## 
## $Age
## [1] 19
## 
## $Sex
## [1] "Male"
## 
## $Major
## [1] "Computer Science"

L[[3]] ## The third object (z) in the list L

## [1] "Male"

L$Age # Extract the Age element in the list L.

## [1] 19

length(L) # How many elements are in L

## [1] 4

class(L)

## [1] "list"

Remove R Objects from Memory

x=5:9
x

## [1] 5 6 7 8 9

rm(x) # Remove the R object x

## print(x) would result in an error!
ls() ## A vector that holds all R objects in memory

##  [1] "a"        "D"        "DF"       "filePath" "L"        "M"       
##  [7] "M2"       "myData"   "url"      "y"        "z"        "z1"      
## [13] "z2"

rm(list = ls()) ## Remove all

Reserved Words in R

Reserved words in R programming are a set of words that have special meaning and cannot be used as an identifier (variable name, function name etc.).

Here is a list of reserved words in the R’s parser.

if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA

This list can be viewed by typing help(reserved) or ?reserved at the R command prompt.

Basics in R Programming

In any programming language, logical expressions, if-else conditions, and for- or while- Loops are fundamental.

Logical Expressions

x = 2
X = 3
y = 4
a = (x == X) # Check if x equals X
a # the value of a is FALSE

## [1] FALSE

b = (x > y)
b

## [1] FALSE

m = (x != X) # Check if x and X are NOT the same
m # The value of m is TRUE

## [1] TRUE

If-else Conditionals

A simple if-else structure is for the situation that there are only two cases to consider. It may look like:

x = 3
if (x < 0) {
  print("Negative")
} else {
  print("Non-negative")
}  ## Note: the "else" branch should follow an "{" immediately!

## [1] "Non-negative"

A more complicated if-else structure is for the situation that there are more than two cases to consider. An example is:

x = 3
if (x < 0) {
  print("Negative")
} else if (x<= 1){
  print("Between 0 and 1")
} else if (x<=5){
  print("greater than 1 and no more than 5")
} else {
  print("Greater than 5")
}

## [1] "greater than 1 and no more than 5"

The For Loop: Example 1- Sum of the first 100 positive integers.

## Find the sum of a given sequence
x = 1:100

s = 0 ## Initialize s to 0
for (a in x) {
  s = s + a ## Update s as x is retrieved
}

print(s)

## [1] 5050

The For Loop: Example 2- Fibbonacci Sequence.

# Print the first a few terms of a Fibbonacci Sequence
# A Fibbonacci sequence is the sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
# The next term equals the sum of the previous two terms

a = 1 # the first term of the Fibbonacci sequence
b = 1 # the second term

for (i in 1:10){
  cat(i, ": ", a, "\n") # Outputs the objects, concatenating the representations.
  temp <- a # Save "a" temporarily
  a <- b
  b <- temp + b # Update "b"
}

## 1 :  1 
## 2 :  1 
## 3 :  2 
## 4 :  3 
## 5 :  5 
## 6 :  8 
## 7 :  13 
## 8 :  21 
## 9 :  34 
## 10 :  55

The While Loop

i <- 10
while (i > 5) {
  print(i)
  i = i-1  # This is not an equation. It means to update "i"
}

## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6

Some Useful Built-in R Functions

We demonstrate the use of the following built-in functions in R: summary(), table(), names(), sample(), apply(), sapply(), toupper(), substr(), substring(), and aggregate().

x = c(6, 5, 2, 1, 6, 1, 4, 3, 2, 1, 3, 6, 2, 6, 4, 1, 5, 1, 4, 4, 3, 6, 1, 4, 6, 5, 6, 5, 5, 5)

summary(x) # Summarize the data to give a 5-number summary plus the eman

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   4.000   3.767   5.000   6.000

T = table(x) # Tablulate the values in x. The result is a one-way table with names.
T # Print the table

## x
## 1 2 3 4 5 6 
## 6 3 3 5 6 7

names(T)

## [1] "1" "2" "3" "4" "5" "6"

# You can change the names of the table to what you want.
names(T)[2] = "Two"
T # Print the table

##   1 Two   3   4   5   6 
##   6   3   3   5   6   7

names(T)

## [1] "1"   "Two" "3"   "4"   "5"   "6"

Names = c("Abel", "Bernoulli", "Catherine", "Dick", "Alice", "Issac") # A vector of individuals (can be viewed as a population)

sample(x = Names, size = 2, replace = FALSE)  ## Randomly select two individuals without replacement from the population

## [1] "Catherine" "Abel"

# Create a data frame with 3 columns
M = data.frame(x = c(8, 6, 7, 5, 9, 8),
               y = c(23, 43, 22, 17, 56, 48),
               z = c(45, 78, 90, 34, 67, 23)
              )

names(M)

## [1] "x" "y" "z"

apply(M, 2, mean) # Apply the function "mean" to each column of M. The result is a vector

##         x         y         z 
##  7.166667 34.833333 56.166667

apply(M, 2, summary) # Apply the function "summary" to each column of M. The result is a vector

##                x        y        z
## Min.    5.000000 17.00000 23.00000
## 1st Qu. 6.250000 22.25000 36.75000
## Median  7.500000 33.00000 56.00000
## Mean    7.166667 34.83333 56.16667
## 3rd Qu. 8.000000 46.75000 75.25000
## Max.    9.000000 56.00000 90.00000

x=c(9, 16, 25, 100)
sapply(x, sqrt) # Apply the function "sqrt" to each element of vector x. The result is a vector

## [1]  3  4  5 10

s1 = "Happy new year!"
s2 = c("tom robinson", "Alice", "jessica")

toupper(s1) # convert s1 in uppercase

## [1] "HAPPY NEW YEAR!"

toupper(s2)

## [1] "TOM ROBINSON" "ALICE"        "JESSICA"

tolower(s1)

## [1] "happy new year!"

tolower(s2)

## [1] "tom robinson" "alice"        "jessica"

substr(s1, start = 3, stop = 8) # substring of s1 from third char to 8th char; must provide a stop.

## [1] "ppy ne"

substring(s2, first = 2, last = 6)

## [1] "om ro" "lice"  "essic"

substring(s2, first = 2) # basically to the end

## [1] "om robinson" "lice"        "essica"

# Get to know the data frame "iris"
iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

# Find means of sub-groups: Group the plants in the iris data according to species. 
# And within each group, find the mean of each numeric variable by applying the mean() function.
# One of the R functions for doing this is aggregate().
aggregate(. ~ Species, data = iris, FUN = mean)

##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

# The function can be used to groups formed with 2 or more grouping variables
aggregate(mpg ~ cyl + am, data = mtcars, mean) # Find mean of mpg within each group determined by cyl & am

##   cyl am      mpg
## 1   4  0 22.90000
## 2   6  0 19.12500
## 3   8  0 15.05000
## 4   4  1 28.07500
## 5   6  1 20.56667
## 6   8  1 15.40000

There are many more built-in functions.

The paste() Function

The paste() or paste0() function is useful when pasting texts and numeric values to send messages.

n = 30000000
paste("The number of COVID-19 cases is ", n, ".", ifelse(n>1000000, " This is huge!", ""), 
      sep = ""
     )

## [1] "The number of COVID-19 cases is 3e+07. This is huge!"

paste0("The number of COVID-19 cases is ", n, ".", ifelse(n>1000000, " This is huge!", "")
     )

## [1] "The number of COVID-19 cases is 3e+07. This is huge!"

paste0("The number of COVID-19 cases is ", sprintf("%i", n), ".", ifelse(n>1000000, " This is huge!", "")
     )

## [1] "The number of COVID-19 cases is 30000000. This is huge!"

The factor() Function

In statistics, variables can be classified as numerical (quantitative) or categorical (qualitative). A categorical variable can further by classified as ordinal or nominal. Examples of ordinal variables are grade (A, B, C, D, F) and income level (low, medium, high). Examples of nominal variables are color (red, blue, …) and political affiliation (Democratic, Republican, Independent).

In situations where we want to visualize data, we need to first convert a variable to a categorical one with levels and we need to use the R function “factor()”. Here is an example.

x = c("Democratic", "Independent", "Independent", "Republican", "Democratic", "Republican", "Republican", "Independent", "Democratic", "Democratic", "Democratic", "Republican")

T1 = table(x)
T1 # The order of the names of the table is the alphabetical order of the values in x

## x
##  Democratic Independent  Republican 
##           5           3           4

barplot(T1)

F = factor(x, levels = c("Democratic", "Republican", "Independent")) # F is a factor
F # The levels themselves are not ordered.

##  [1] Democratic  Independent Independent Republican  Democratic  Republican 
##  [7] Republican  Independent Democratic  Democratic  Democratic  Republican 
## Levels: Democratic Republican Independent

T2 = table(F)
T2 # The order of the names of the table has been changed!! The order is the same as in the levels of F

## F
##  Democratic  Republican Independent 
##           5           4           3

barplot(T2)

# The following makes a factor with ordered levels
F = factor(x, levels = c("Democratic", "Republican", "Independent"), ordered = TRUE)  
F # The levels themselves are now ordered as specified.

##  [1] Democratic  Independent Independent Republican  Democratic  Republican 
##  [7] Republican  Independent Democratic  Democratic  Democratic  Republican 
## Levels: Democratic < Republican < Independent

F[3]>F[5] # The comparison of internal values in F now makes sense

## [1] TRUE

as.numeric(x) # Not working

## Warning: NAs introduced by coercion

##  [1] NA NA NA NA NA NA NA NA NA NA NA NA

as.numeric(F) # Factors can be converted to numeric

##  [1] 1 3 3 2 1 2 2 3 1 1 1 2

When creating a graph (scatterplot or barplot) with a legend that is numerical but actually categorical, we need to convert the numerical variable to a factor, as shown later when we introduce the ggplot2 package.

Numeric Summary of Data

y = c(4, 8,  6,  4,  9,  3, 14, 13, 12)
sum(y)

## [1] 73

mean(y)

## [1] 8.111111

median(y)

## [1] 8

sd(y)

## [1] 4.166667

summary(y) # 5-number summary plus the mean

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   8.000   8.111  12.000  14.000

c(summary(y), sd=sd(y)) # Add standard deviation to the standard summary of data

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.        sd 
##  3.000000  4.000000  8.000000  8.111111 12.000000 14.000000  4.166667

x = c(30, 45, 34, 27, 51, 18, 67, 56, 48)

cor(x, y) # Correlation between x and y variables

## [1] 0.9540732

# Handling missing data
z = c(12, 23, 41, 34, 10, NA, 18, 22, NA, 52)
mean(z)

## [1] NA

mean(z, na.rm = TRUE) # Remove missing values first before taking the mean.

## [1] 26.5

sd(z)

## [1] NA

sd(z, na.rm = TRUE) # Remove missing values first before taking the standard deviation.

## [1] 14.65801

Graphical Display of Data

When graphing data, we need to know 3 regions: the plot region, the figure region, and the device region. Here are pages showing the 3 regions: https://img-blog.csdn.net/20130709222854796 and https://r-charts.com/base-r/margins/

To reset the plot margin, use the “mar” parameter“. To reset the figure margin, use the”oma" parameter. These are done as this:

par(mar = c(1,2,3,6), oma = c(5,2,2,1))

Type ?par on the console to see the default settings for “mar” and “oma”.

x = c(3, 6, 7, 10, 12, 18, 20)
y = c(5, 8, 11, 15, 20, 26, 31)

par(mar = c(1,4,3,6), oma = c(5,2,2,1)) 
plot(x, y) # A scatterplot of y vs x
box("figure", col = "blue") # Show the figure region only for demonstration purpose; "figure" is the same as "inner"
box("outer", col = "red") # Show the device region only for demonstration purpose

# Suppose you flipped a die of 6-faces 25 times and got the following points:
a = c(2, 3, 1, 6, 3, 1, 4, 3, 1, 3, 1, 2, 1, 4, 5, 2, 2, 5, 4, 3, 1, 5, 5, 3, 1)

# To plot the frequencies (how many times each number occurs), 
# you must get the frequency table first. The R function table() does it.
T = table(a) # A frequency table: count the numbers of different values

# Now, create a bar plot based on the frequency table you got
barplot(T, 
        xlab="Number",     # Name the x-axis
        ylab="Frequency",  # Name the y-axis
        col = "red",       # Color each bar red
        main = "")         # Empty title

# Add a title using the title() function and annotate it.
title("Do You Know What This Graph Says?", 
      col.main="blue",   # Color the title
      font.main = 2,     # Change the font of the title. The default corresponds to value 1
      cex.main = 2       # Double the size of the words in the title. Default is 1
     )

# To use percentages for a bar plot, we must convert frequencies to relative frequencies.
# To do this, we divide frequencies by the total frequency, which can be obtained by the
# function length(). Multiplying by 100 gives the percentages.
barplot(T/length(a)*100, xlab="Number", ylab="Percent", col = "red")

# Can I change x-labels (not the same as title)? Yes.
barplot(T/length(a)*100, 
        xlab="Number", 
        ylab="Percent", 
        col = "red",
        names = c("One", "Two", "Three", "Four", "Five", "Six")
       )

# Display labels at an angle
barplot(T/length(a)*100, 
        xlab="Number", 
        ylab="Percent", 
        col = "red",
        names = c("One", "Two", "Three", "Four", "Five", "Six"),
        las = 2   # Vertical label
       )

hist(a, main = "This is a Histogram", col = "blue")

# Make a probability histogram: the height of each bar equals that of the frequency histogram, 
# divided by the total frequency, and then divided by the width of the bar. 
# Doing so allows you to add a density curve to the histogram if needed
hist(a, probability = TRUE, xlab = "Age", main = "Histogram of Ages of Kids in a Big Family")

Overlay Curves

A function can be plotted using the curve() function in R. Sometimes, we may want to plot a few functions in the same graph. The following plots the square-root function and the exponential function in the same graph. Note that the option “add = TRUE” is used.

par(mar=c(3.1, 3.1, 2.1, 5.1), oma=c(4.1, 4.1, 3.1, 6.1), xpd=TRUE)

curve(sqrt, 0, 5, col = "red", lty = 1)
legend("topright", inset=c(-0.5,0), legend=c("sqrt","exp"), title="Function", lty = c(1,2), col = c("red", "blue"), text.col = c("red", "blue"), title.col = "black", bty = "n")

box("plot", col = "green")
box("figure", col = "blue") # Show the figure region only for demonstration purpose; "figure" is the same as "inner"
box("outer", col = "red") # Show the device region only for demonstration purpose


par(xpd=FALSE) # Without this, the following added curve will go out the plotting region
curve(exp, add = TRUE, col = "blue", lty = 2)

User-defined Functions

# 1. A mathematical function y = xe^x
f = function(x){ # The corresponding R function is named "f", which takes x as
                 # an input and outputs something. Something good to read:               
                 # https://en.wikipedia.org/wiki/Lambert_W_function 
  y = x*exp(x)
  return(y)
}

# Plot the curve of the f function
curve(f, from = -7, to = 1)
abline(h=0, col = "red") # Add a real x-axis

# Can you find where the function has its minimum? Add a vertical line there.

# 2. My summary function for an input that is a vector
Summary = function(x){
  Mean=mean(x)
  Sd = sd(x)
  Median = median(x)
  Summary=c(Mean = Mean, "    Standard Deviation" = Sd, Median = Median)
  return(Summary) # A function must return something
}

x=c(2, 4, 6, 1, 0, -3, 7)
Summary(x) # Apply the function just defined

##                   Mean     Standard Deviation                 Median 
##               2.428571               3.505098               2.000000

# 2. Write an R function that implement the mathematical function 
# f(x) = 2x-1 for x < 1, x^2 for 1<=x<=5, and 5x for x >5.
f = function(x){
  n = length(x)
  y = numeric(n)
  
  for (i in 1:n){
    if (x[i] < 1){
      y[i] = 2 * x[i] - 1
    } else if (x[i] <= 5) {
      y[i] = x[i]^2
    } else {
      y[i] = 5 * x[i]
    }
  }
  
  return(y)
}

f(-3)

## [1] -7

f(4)

## [1] 16

f(6)

## [1] 30

curve(f, xlim = c(-2, 10), col = "red", main = "Plot of a Piece-wise function")

# 3. Write a function that print the first n terms of the sequence defined by
# f(n) = 2*f(n-1)+1, with f(1) = 10.
f = function(n){
  a = numeric(n)
  a[1] = 10
  for (i in 2:n){
    a[i] = 2*a[i-1] + 1
  }
  
  return(a)
}

f(10)

##  [1]   10   21   43   87  175  351  703 1407 2815 5631

# Note: the above function does not apply to a vector

The third R function above can be written using the idea of recursion - self calling. It is a very powerful method and meany programming problems can be solved using this idea. Note that not every programming language allows the realization of this idea.

# Write a function that output the first n terms according to 
# f(n) = 2*f(n-1)+1, with f(1) = 10.
f = function(n){
  if(n == 1){
    return(10)
  } else {
    return(2*f(n-1) + 1)
  }
}

# Use the function
f(5)

## [1] 175

f(12)

## [1] 22527

The next example shows a function that finds the mean (i.e., average) of a set of values.

# Find the mean of a set of values
Mean = function(v){
  n = length(v)
  if (n==1){
    return(v)
  } else {
    return((Mean(v[-n])*(n-1) + v[n])/n)
  }
}

v = c(2,8,0,9,3)
Mean(v)

## [1] 4.4

mean(v)

## [1] 4.4

The next example shows a function that finds the greatest common divisor of two positive integers.

## Find Greatest Common Divisor of two numbers (say a & b) using the Euclidean algorithm.
## Set x = min(a,b) and y = max(a,b). Set x = min(x, y-x) and y = y-x. 
## Repeat until x = y. The GCD(a,b) = x.
gcd=function(v=c(45,75)){
  x=min(v)
  y=max(v)
  if (x==y) {
    return(x)
  } else{
    x=min(x,y-x)
    y = y-x
    return(gcd(c(x, y)))
  }
}

gcd()

## [1] 15

gcd(c(63,54))

## [1] 9

gcd(c(5723, 1261))

## [1] 97

gcd(c(59446125, 81124893))

## [1] 8973

Exercises:

Write a function that calculates $\frac{1}{1}-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\cdots+\frac{1}{n}$ for a given $n$. For large $n$, your answer should be close to $ln(2)$.
Write a function that calculates $\frac{1}{1}-\frac{1}{3}+\frac{1}{5}-\frac{1}{7}+\cdots+(-1)^n\frac{1}{2n+1}$ for a given $n$. For large $n$, your answer should be close to $\frac{\pi}{4}$.
Write a function that calculates $\frac{1}{1^2}+\frac{1}{2^2}+\frac{1}{3^2}+\frac{1}{4^2}+\cdots+\frac{1}{n^2}$ for a given $n$. For large $n$, your answer should be close to $\frac{\pi^2}{6}$, which equals the value of the Riemann zeta function evaluated at 2. Interestingly, the probability of 2 randomly chosen positive integers being co-prime equals $\frac{6}{\pi^2}$, around $61\%$.
Write a function that calculates $\frac{1}{1^2}-\frac{1}{2^2}+\frac{1}{3^2}-\frac{1}{4^2}+\cdots+(-1)^{n+1}\frac{1}{n^2}$ for a given $n$. For large $n$, your answer should be close to $\frac{\pi^2}{12}$.
Write two R functions, one for f(x) = x^2 and one for g(x) = 2^x. Plot them for x between -2 and 5 in the same graph. At how many points do they intercept? Can you find all of those points?

The Pipe Operator: A Modern Way of Using R Functions

The idea:

In mathematics, we can create a composite function such as

\[f(g(h(x)))\] Since each function is just an operator, we can instruct a machine to do this:

\[x \rightarrow h \rightarrow g \rightarrow f\] So, in R, if $f$, $g$, and $h$ are functions, and $x$ is an input, we can instruct the machine by “x%>%h%>%g%>%f.”

To use the pipe operator, we need to install the “dplyr” package or the more general “tidyverse” package.

x = c(2, 3, 7, 1, 9, 6, 5)
x %>% mean() # Normally you do: mean(x)

## [1] 4.714286

mtcars %>% plot() # Normally you do: plot(mtcars)

mtcars %>% apply(2, var) %>% sqrt() # Normally you do: sqrt(apply(mtcars, 2, var))

##         mpg         cyl        disp          hp        drat          wt 
##   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
##        qsec          vs          am        gear        carb 
##   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000

sqrt(apply(mtcars, 2, var))

##         mpg         cyl        disp          hp        drat          wt 
##   6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574 
##        qsec          vs          am        gear        carb 
##   1.7869432   0.5040161   0.4989909   0.7378041   1.6152000

# If you feel uncomfortable, don't use the pipe operator
mtcars %>% ggplot(mapping=aes(x = hp, y = mpg)) +
  geom_point(color = "red")

Debugging

Many times, we may run into errors when running code we write. Here are some common mistakes. Copy each line of code on console and run it.

cat + 1

dog + 1

Mean(c(23, 45, 32, 18))

How can you figure out why the errors occur?

An Introduction to the ggplot2 package

The package ggplot2 is the major package for plotting data. You may watch a video https://www.youtube.com/watch?v=HeqHMM4ziXA before you read and practice the following.

Plotting data with the qplot() function from ggplot2 Package

library(ggplot2) # Load the package

qplot(mpg, wt, data = mtcars)

qplot(mpg, wt, data = mtcars, colour = cyl)

qplot(mpg, wt, data = mtcars, size = cyl)

qplot(mpg, wt, data = mtcars, facets = vs ~ am)

ggplot(mtcars, aes(x=mpg, y=wt)) +
  geom_point() +
  facet_grid(vs ~ am, labeller = label_both)

## The previous result is not ideal, since the facet labels don't make a good sense.

mtcars$vs=factor(mtcars$vs) # Make "vs" a factor

levels(mtcars$vs)=c("high", "low") # Set custom levels to "vs"

## Now, let's redo the plot
ggplot(mtcars, aes(x=mpg, y=wt)) +
  geom_point() +
  facet_grid(vs ~ am, labeller = label_both)

# Or using
qplot(mpg, wt, data = mtcars, facets = vs ~ am)

Another video to watch: https://www.youtube.com/watch?v=HPJn1CMvtmI. Now, practice the following:

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  low  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  low  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02 high  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  low  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  low  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  low  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00 high  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98 high  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  low  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  low  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  low  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  low  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87 high  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30 high  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05 high  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  low  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70 high  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  low  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60 high  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  low  1    4    2

ggplot(mtcars, mapping=aes(x = hp, y = mpg)) +
  geom_point(color = "red")

## Want to do other plots?

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_line(color = "red")

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(color = "blue", fill = "red", bins=6)

ggplot(mtcars, aes(x = mpg)) +
  geom_density(color = "red")

ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_histogram(bins = 6) # bad

ggplot(mtcars, aes(x = mpg, colour = factor(cyl))) +
  geom_density() +
  labs(colour = "Cylinder") # rename legend title

ggplot(mtcars, aes(x = mpg, color= as.factor(cyl))) +
  geom_density() +
  labs(title = "gggggg", x="Miles per Gallon", y= "Density", color="Cylinder") # rename legend title

ggplot(mtcars, aes(y = mpg)) +
  geom_boxplot(fill = "red")

# More to learn about the theme()
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
  geom_boxplot() + 
  labs(title = "Mpg vs Cyl", subtitle = "Hello World!", x="Cyl", caption = "Data courtesy: xyz") +
  theme(plot.background = element_rect(fill = "gray"),
        panel.background = element_rect(fill = "yellow"),
        plot.margin = unit(c(t=1,r=2,b=3,l=4), "cm"),
        axis.title.x = element_text(hjust = 0.5, size = 20, color = "pink", angle = 45, vjust = -1),
        plot.title = element_text(hjust = 0.5, size = 28, color = "green"),
        plot.subtitle = element_text(hjust = 0.5, size = 15, color = "red"),
        plot.caption = element_text(hjust = 0, size = 10, color = "green"),
        legend.position = "none",
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()
        )

myData = data.frame(
  Student = 1:24,
  Major = c("Accounting", "IS", "CS", "Statistics", "EE", "Math", "Accounting", "CS", "CS", "Statistics", "EE", "Math","Accounting", "IS", "IS", "Statistics", "Math", "Math", "Accounting", "IS", "Statistics", "Statistics", "EE", "CS"),
  Sex = c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male", "Female", "Male", "Male", "Male", "Male"),
  Score = c(87, 92, 69, 90, 88, 79, 94, 86, 92, 84, 78, 67, 95, 91, 93, 83, 82, 94, 86, 74, 95, 55, 93, 85)
)

myData

##    Student      Major    Sex Score
## 1        1 Accounting   Male    87
## 2        2         IS Female    92
## 3        3         CS   Male    69
## 4        4 Statistics   Male    90
## 5        5         EE Female    88
## 6        6       Math   Male    79
## 7        7 Accounting Female    94
## 8        8         CS   Male    86
## 9        9         CS   Male    92
## 10      10 Statistics Female    84
## 11      11         EE Female    78
## 12      12       Math Female    67
## 13      13 Accounting   Male    95
## 14      14         IS   Male    91
## 15      15         IS Female    93
## 16      16 Statistics   Male    83
## 17      17       Math Female    82
## 18      18       Math Female    94
## 19      19 Accounting   Male    86
## 20      20         IS Female    74
## 21      21 Statistics   Male    95
## 22      22 Statistics   Male    55
## 23      23         EE   Male    93
## 24      24         CS   Male    85

ggplot(myData, aes(x = Score, color= Sex)) +
  geom_density()

# Bar plots with 3 possible positions
ggplot(myData, aes(x = Major, fill = Sex)) +
  geom_bar(position = "stack") # Each bar consists of two pieces representing the counts of each sex

ggplot(myData, aes(x = Major, fill = Sex)) +
  geom_bar(position = "dodge") # Side-by-side

ggplot(myData, aes(x = Major, fill = Sex)) +
  geom_bar(position = "fill") # Each bar has the same height of 1 and consists of two sexes in proportion

# Name-value kind of data
popData = data.frame(City = c("St Cloud", "Sartell", "Waite Park", "St Joseph"), 
                        Population = c(66169, 18428, 7718, 7147))
popData

##         City Population
## 1   St Cloud      66169
## 2    Sartell      18428
## 3 Waite Park       7718
## 4  St Joseph       7147

ggplot(popData, aes(x = City, y = Population)) +
  geom_col() + # for Name-value kind of data 
  theme(axis.text.x = element_text(angle = 45))

# The Titanic Data

D=as.data.frame(Titanic)
D1=D %>% group_by(Class)
D2=D1 %>% summarise(Freq=sum(Freq))
D2

## # A tibble: 4 x 2
##   Class  Freq
##   <fct> <dbl>
## 1 1st     325
## 2 2nd     285
## 3 3rd     706
## 4 Crew    885

ggplot(D2, aes(x=Class, y = Freq)) +
  geom_col() +
  geom_text(aes(label=Freq), vjust=2.5, col ="red", size=13)

ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
  geom_col(position = "stack") +
  labs(fill = "Survival Status") # Change title of legend

ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
  geom_col(position = "dodge")

ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
  geom_col(position = "dodge")

# Which one to use: geom_bar() or geom_col? 
# Both functions create bar graphs. In general, for individual data, such as myData above, use geom_bar(); 
# for grouped data, such as D above, use geom_col(). 
# The function geom_bar() can also be applied to grouped data, if you use it this way: geom_bar(stat="identity").

Plotting Functions with ggplot2

We covered the plotting of data. How can we plot curves? Here are examples:

# Define the R function for f(x) = x
f1 = function(x){
  return(x)
}

# Define the R function for f(x) = x^2
f2 = function(x){
  return(x^2)
}

# Define the R function for f(x) = x^3
f3 = function(x){
  return(x^3)
}

# Plot all 3 curves in one graph
ggplot() +
  xlim(-3, 3) +
  geom_function(aes(colour = "Diagonal line"), fun = f1) +   # Add the graph of function f1
  geom_function(aes(colour = "Parabola"), fun = f2) +  # Add the graph of function f2
  geom_function(aes(colour = "Cubic"), fun = f3) +  # Add the graph of function f3
  labs(title = "Overlay Three Curves in One Graph", color = "Curves") # Set the title of the legend: labs = labels

The next example plot the standard normal density curve along with a few t-distribution curves with different degrees of freedom.

ggplot() +    # Ready to plot
  xlim(-5, 5) +  # Set the x limits. In the following, fun = function
  geom_function(aes(colour = "normal"), fun = dnorm) +   # Add a standard normal density curve: d = density, norm = "Normal"
  geom_function(aes(colour = "t, df = 1"), fun = dt, args = list(df = 1)) +  # Add a t-distribution (with df = 1) density curve
  geom_function(aes(colour = "t, df = 10"), fun = dt, args = list(df = 10)) +
  geom_function(aes(colour = "t, df = 30"), fun = dt, args = list(df = 30)) +
  geom_function(aes(colour = "t, df = 50"), fun = dt, args = list(df = 50)) +
  theme(legend.title = element_blank())

More Examples with ggplot2

Custom Colors

## custom colors
library(ggplot2)

ggplot(mtcars, aes(x=mpg, y=disp, color = factor(cyl))) +
  geom_line() + 
  scale_color_manual(values=c("blue", "green", "red")) # You can set your favorite colors

Overlay curves from different sources of data

x= 1:10

y0 = x

y1 = x + 1

y2 = x + 2

y3 = x + 3

ggplot() + geom_line(aes(x=x, y=y0, color="0")) +
  geom_line(aes(x=x, y=y1, color="1")) +
  geom_line(aes(x=x, y=y2, color="2")) +
  geom_line(aes(x=x, y=y3, color="3")) + 
  scale_color_manual(values =c("black", "red", "blue", "green"))

Other Nice Plots

We demonstrate the use of the “BasketballAnalyzeR” package for creating nice plots.

D = USArrests[1:10,] # Only use the first 10 rows of the USArrests data from the base package

BasketballAnalyzeR::radialprofile(data=D, title = rownames(D))

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Using arrange() to arrange observations in order

arrange(mtcars, cyl) # By default, arrange the data in ascending order according to the values of "cyl"

##                      mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  low  1    4    1
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  low  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  low  0    4    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  low  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  low  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  low  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  low  0    3    1
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  low  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70 high  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  low  1    5    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  low  1    4    2
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  low  0    3    1
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  low  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02 high  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00 high  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98 high  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87 high  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30 high  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05 high  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60 high  1    5    8

arrange(mtcars, cyl, disp) # By default, arrange data in ascending order, by "cyl" and then by "disp"

##                      mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  low  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  low  1    4    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  low  1    4    1
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  low  1    4    1
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  low  1    5    2
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  low  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  low  0    3    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70 high  1    5    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  low  1    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  low  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  low  0    4    2
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  low  0    3    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  low  0    3    1
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00 high  0    3    3
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60 high  1    5    8
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30 high  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87 high  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02 high  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05 high  0    3    2
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98 high  0    3    4

arrange(mtcars, cyl, desc(disp)) # cyl in ascending order and disp in descending order

##                      mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  low  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  low  0    4    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  low  1    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70 high  1    5    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  low  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  low  1    4    1
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  low  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  low  1    4    1
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  low  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  low  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  low  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  low  0    3    1
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  low  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98 high  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05 high  0    3    2
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02 high  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87 high  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30 high  0    3    2
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60 high  1    5    8
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00 high  0    3    3

arrange(mtcars, cyl, -disp) # cyl in ascending order and disp in descending order; the same as above

##                      mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  low  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  low  0    4    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  low  1    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70 high  1    5    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  low  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  low  1    4    1
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  low  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  low  1    4    1
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  low  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  low  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  low  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  low  0    3    1
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  low  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98 high  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05 high  0    3    2
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02 high  0    3    2
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87 high  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30 high  0    3    2
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60 high  1    5    8
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00 high  0    3    3

Using mutate() to change or create a column

D = mtcars # Make a copy
mutate(D, 
       fake = mpg + disp, # A meaningless column called "fake" added to the D data frame
       mpg_cat = (mpg>20), # Add another column called "mpg_cat". The column indicates whether mpg > 20
       log_disp = log(disp), # Add still another column called "log_disp": A log-transformation to disp
       mpg = log(mpg) # Replace the mpg column by its log value
      )

##                          mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## Mazda RX4           3.044522   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## Mazda RX4 Wag       3.044522   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## Datsun 710          3.126761   4 108.0  93 3.85 2.320 18.61  low  1    4    1
## Hornet 4 Drive      3.063391   6 258.0 110 3.08 3.215 19.44  low  0    3    1
## Hornet Sportabout   2.928524   8 360.0 175 3.15 3.440 17.02 high  0    3    2
## Valiant             2.895912   6 225.0 105 2.76 3.460 20.22  low  0    3    1
## Duster 360          2.660260   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## Merc 240D           3.194583   4 146.7  62 3.69 3.190 20.00  low  0    4    2
## Merc 230            3.126761   4 140.8  95 3.92 3.150 22.90  low  0    4    2
## Merc 280            2.954910   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## Merc 280C           2.879198   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## Merc 450SE          2.797281   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## Merc 450SL          2.850707   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## Merc 450SLC         2.721295   8 275.8 180 3.07 3.780 18.00 high  0    3    3
## Cadillac Fleetwood  2.341806   8 472.0 205 2.93 5.250 17.98 high  0    3    4
## Lincoln Continental 2.341806   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## Chrysler Imperial   2.687847   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## Fiat 128            3.478158   4  78.7  66 4.08 2.200 19.47  low  1    4    1
## Honda Civic         3.414443   4  75.7  52 4.93 1.615 18.52  low  1    4    2
## Toyota Corolla      3.523415   4  71.1  65 4.22 1.835 19.90  low  1    4    1
## Toyota Corona       3.068053   4 120.1  97 3.70 2.465 20.01  low  0    3    1
## Dodge Challenger    2.740840   8 318.0 150 2.76 3.520 16.87 high  0    3    2
## AMC Javelin         2.721295   8 304.0 150 3.15 3.435 17.30 high  0    3    2
## Camaro Z28          2.587764   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## Pontiac Firebird    2.954910   8 400.0 175 3.08 3.845 17.05 high  0    3    2
## Fiat X1-9           3.306887   4  79.0  66 4.08 1.935 18.90  low  1    4    1
## Porsche 914-2       3.258097   4 120.3  91 4.43 2.140 16.70 high  1    5    2
## Lotus Europa        3.414443   4  95.1 113 3.77 1.513 16.90  low  1    5    2
## Ford Pantera L      2.760010   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## Ferrari Dino        2.980619   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## Maserati Bora       2.708050   8 301.0 335 3.54 3.570 14.60 high  1    5    8
## Volvo 142E          3.063391   4 121.0 109 4.11 2.780 18.60  low  1    4    2
##                      fake mpg_cat log_disp
## Mazda RX4           181.0    TRUE 5.075174
## Mazda RX4 Wag       181.0    TRUE 5.075174
## Datsun 710          130.8    TRUE 4.682131
## Hornet 4 Drive      279.4    TRUE 5.552960
## Hornet Sportabout   378.7   FALSE 5.886104
## Valiant             243.1   FALSE 5.416100
## Duster 360          374.3   FALSE 5.886104
## Merc 240D           171.1    TRUE 4.988390
## Merc 230            163.6    TRUE 4.947340
## Merc 280            186.8   FALSE 5.121580
## Merc 280C           185.4   FALSE 5.121580
## Merc 450SE          292.2   FALSE 5.619676
## Merc 450SL          293.1   FALSE 5.619676
## Merc 450SLC         291.0   FALSE 5.619676
## Cadillac Fleetwood  482.4   FALSE 6.156979
## Lincoln Continental 470.4   FALSE 6.131226
## Chrysler Imperial   454.7   FALSE 6.086775
## Fiat 128            111.1    TRUE 4.365643
## Honda Civic         106.1    TRUE 4.326778
## Toyota Corolla      105.0    TRUE 4.264087
## Toyota Corona       141.6    TRUE 4.788325
## Dodge Challenger    333.5   FALSE 5.762051
## AMC Javelin         319.2   FALSE 5.717028
## Camaro Z28          363.3   FALSE 5.857933
## Pontiac Firebird    419.2   FALSE 5.991465
## Fiat X1-9           106.3    TRUE 4.369448
## Porsche 914-2       146.3    TRUE 4.789989
## Lotus Europa        125.5    TRUE 4.554929
## Ford Pantera L      366.8   FALSE 5.860786
## Ferrari Dino        164.7   FALSE 4.976734
## Maserati Bora       316.0   FALSE 5.707110
## Volvo 142E          142.4    TRUE 4.795791

Using group_by() to group data

by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl

## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec vs       am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5 high      1     4     4
##  2  21       6  160    110  3.9   2.88  17.0 high      1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6 low       1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4 low       0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0 high      0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2 low       0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8 high      0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20   low       0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9 low       0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3 low       0     4     4
## # … with 22 more rows

class(mtcars)

## [1] "data.frame"

class(by_cyl) # The structure of the new data frame is richer

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Using summarize() to summarize grouped data

by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl # Now, the mtcars data frame has been grouped and called by_cyl.

## # A tibble: 32 x 11
## # Groups:   cyl [3]
##      mpg   cyl  disp    hp  drat    wt  qsec vs       am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5 high      1     4     4
##  2  21       6  160    110  3.9   2.88  17.0 high      1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6 low       1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4 low       0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0 high      0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2 low       0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8 high      0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20   low       0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9 low       0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3 low       0     4     4
## # … with 22 more rows

by_cyl %>% summarise(
  mpg_mean = mean(mpg), # Mean of disp for each cyl group
  hp_mean = mean(hp) # Mean of hp for each cyl group
)

## # A tibble: 3 x 3
##     cyl mpg_mean hp_mean
##   <dbl>    <dbl>   <dbl>
## 1     4     26.7    82.6
## 2     6     19.7   122. 
## 3     8     15.1   209.

by_cyl %>% summarise(freq = n()) # Frequency distribution of "cyl";

## # A tibble: 3 x 2
##     cyl  freq
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14

                                 # The resulting data frame has two columns: cyl and freq

A Comprehensive Example

This example will use the data frame “diamonds” from the ggplot2 package. The dataset contains the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

price: price in US dollars ($326–$18,823)

carat: weight of the diamond (0.2–5.01)

cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color: diamond colour, from D (best) to J (worst)

clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x: length in mm (0–10.74)

y: width in mm (0–58.9)

z: depth in mm (0–31.8)

depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

table: width of top of diamond relative to widest point (43–95)

Reference: https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2

head(diamonds, n = 10)

## # A tibble: 10 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39

df <- diamonds %>%
  group_by(cut, clarity) %>% ## Must do grouping first in order to do the following operations
  summarise(count = n()) %>%
  group_by(cut) %>% # If not specified, the following will still be done by default for each category of cut
  mutate(cut.count = sum(count), 
         prop = count/sum(count)) %>%
  ungroup() # Remove grouping after we have added the two new columns

## `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.

ggplot(df, aes(x = cut, y = prop, width = cut.count, fill = clarity)) +
  geom_col(colour = "black") +
  #geom_text(aes(label = scales::percent(prop)), position = position_stack(vjust = 0.5)) + # if labels are desired
  facet_grid(.~cut, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "RdYlGn") +
  theme_void()

Data Wrangling with the “tidyr” Package

The function “gather” from tidyr package is fabulous for reshaping data from columns to rows (wide to long). It is an alternative to the function “melt” from package “reshape2”. You use gather() when you notice that you have columns that are not variables.

The function “spread” from “tidyr” spreads out a single column into multiple columns, so it’s good to use it to summarize standard data as a contingency table.

library(tidyr)
D=data.frame(Gender=c("Male", "Female"), R=c(20, 30), D=c(23, 32), I=c(34, 45))
D # Looks like a two-way table

##   Gender  R  D  I
## 1   Male 20 23 34
## 2 Female 30 32 45

# The following are two ways of converting data from wide format to long format using
# the gather() function from package tidyr. 
# The following uses columns R, D and I to make two columns that represent Keys and Values. 
# Other columns from the data will remain in the new data frame.
x1=D %>% gather(key = "Party", value = "Count", c(R,D,I)) 
# Alternatively: "-" means exclusion; that is, use all columns except Gender to form Keys and Values.
x2=D %>% gather(key = Party, value = Count, -Gender) 
x1

##   Gender Party Count
## 1   Male     R    20
## 2 Female     R    30
## 3   Male     D    23
## 4 Female     D    32
## 5   Male     I    34
## 6 Female     I    45

x2

##   Gender Party Count
## 1   Male     R    20
## 2 Female     R    30
## 3   Male     D    23
## 4 Female     D    32
## 5   Male     I    34
## 6 Female     I    45

# The two data frames x1 and x2 are the same and are in the standard format (long format) that R can easily understand.

# The following is a reversal of the previous procedure.
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

y=D %>% melt(id.vars=c("Gender"), variable.name = "Party", value.name = "Count") # the melt() function is from package reshape2
y # The same as x1 and x2; that is, we can melt a two-way table to produce a standard table.

##   Gender Party Count
## 1   Male     R    20
## 2 Female     R    30
## 3   Male     D    23
## 4 Female     D    32
## 5   Male     I    34
## 6 Female     I    45

z=x1 %>% spread(key=Gender, value=Count) ## Gender on columns
z

##   Party Female Male
## 1     D     32   23
## 2     I     45   34
## 3     R     30   20

w=x1 %>% spread(key=Party, value=Count) ## Party on columns
w

##   Gender  D  I  R
## 1 Female 32 45 30
## 2   Male 23 34 20

u = xtabs(Count~Gender+Party, y) # Another way
u

##         Party
## Gender    R  D  I
##   Female 30 32 45
##   Male   20 23 34

# To have a particular order you want for your two-way table, 
# you need to make character variables factors. Here is how:
y$Party=factor(y$Party, levels = c("D", "I", "R")) # Set the levels of Party in the D-I-R order
y$Gender=factor(y$Gender, levels = c("Male", "Female")) # Set the levels of Gender in Male-Female order
xtabs(Count~Gender+Party, y)

##         Party
## Gender    D  I  R
##   Male   23 34 20
##   Female 32 45 30

# The following are two more examples of data in wide format or long format. You may wish to play with them.

head(who, n = 20) ## In wide format; you can type ?who on the console to study the details of the data

## # A tibble: 20 x 60
##    country  iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>    <chr> <chr> <int>       <int>        <int>        <int>        <int>
##  1 Afghani… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghani… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghani… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghani… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghani… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghani… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghani… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghani… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghani… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghani… AF    AFG    1989          NA           NA           NA           NA
## 11 Afghani… AF    AFG    1990          NA           NA           NA           NA
## 12 Afghani… AF    AFG    1991          NA           NA           NA           NA
## 13 Afghani… AF    AFG    1992          NA           NA           NA           NA
## 14 Afghani… AF    AFG    1993          NA           NA           NA           NA
## 15 Afghani… AF    AFG    1994          NA           NA           NA           NA
## 16 Afghani… AF    AFG    1995          NA           NA           NA           NA
## 17 Afghani… AF    AFG    1996          NA           NA           NA           NA
## 18 Afghani… AF    AFG    1997           0           10            6            3
## 19 Afghani… AF    AFG    1998          30          129          128           90
## 20 Afghani… AF    AFG    1999           8           55           55           47
## # … with 52 more variables: new_sp_m4554 <int>, new_sp_m5564 <int>,
## #   new_sp_m65 <int>, new_sp_f014 <int>, new_sp_f1524 <int>,
## #   new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>,
## #   new_sp_f5564 <int>, new_sp_f65 <int>, new_sn_m014 <int>,
## #   new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>,
## #   new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
## #   new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>,
## #   new_sn_f3544 <int>, new_sn_f4554 <int>, new_sn_f5564 <int>,
## #   new_sn_f65 <int>, new_ep_m014 <int>, new_ep_m1524 <int>,
## #   new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>,
## #   new_ep_m5564 <int>, new_ep_m65 <int>, new_ep_f014 <int>,
## #   new_ep_f1524 <int>, new_ep_f2534 <int>, new_ep_f3544 <int>,
## #   new_ep_f4554 <int>, new_ep_f5564 <int>, new_ep_f65 <int>,
## #   newrel_m014 <int>, newrel_m1524 <int>, newrel_m2534 <int>,
## #   newrel_m3544 <int>, newrel_m4554 <int>, newrel_m5564 <int>,
## #   newrel_m65 <int>, newrel_f014 <int>, newrel_f1524 <int>,
## #   newrel_f2534 <int>, newrel_f3544 <int>, newrel_f4554 <int>,
## #   newrel_f5564 <int>, newrel_f65 <int>

# For simplicity, we only consider the first 10 columns
y = who[, 1:10] %>% melt(id.vars=c("country", "iso2", "iso3", "year"), variable.names=c(new_sp_m014, new_sp_m1524, new_sp_m2534, new_sp_m3544, new_sp_m4554, new_sp_m5564))

head(y, n = 20)

##        country iso2 iso3 year    variable value
## 1  Afghanistan   AF  AFG 1980 new_sp_m014    NA
## 2  Afghanistan   AF  AFG 1981 new_sp_m014    NA
## 3  Afghanistan   AF  AFG 1982 new_sp_m014    NA
## 4  Afghanistan   AF  AFG 1983 new_sp_m014    NA
## 5  Afghanistan   AF  AFG 1984 new_sp_m014    NA
## 6  Afghanistan   AF  AFG 1985 new_sp_m014    NA
## 7  Afghanistan   AF  AFG 1986 new_sp_m014    NA
## 8  Afghanistan   AF  AFG 1987 new_sp_m014    NA
## 9  Afghanistan   AF  AFG 1988 new_sp_m014    NA
## 10 Afghanistan   AF  AFG 1989 new_sp_m014    NA
## 11 Afghanistan   AF  AFG 1990 new_sp_m014    NA
## 12 Afghanistan   AF  AFG 1991 new_sp_m014    NA
## 13 Afghanistan   AF  AFG 1992 new_sp_m014    NA
## 14 Afghanistan   AF  AFG 1993 new_sp_m014    NA
## 15 Afghanistan   AF  AFG 1994 new_sp_m014    NA
## 16 Afghanistan   AF  AFG 1995 new_sp_m014    NA
## 17 Afghanistan   AF  AFG 1996 new_sp_m014    NA
## 18 Afghanistan   AF  AFG 1997 new_sp_m014     0
## 19 Afghanistan   AF  AFG 1998 new_sp_m014    30
## 20 Afghanistan   AF  AFG 1999 new_sp_m014     8

airquality ## In wide format; You convert it to long format;

##     Ozone Solar.R Wind Temp Month Day
## 1      41     190  7.4   67     5   1
## 2      36     118  8.0   72     5   2
## 3      12     149 12.6   74     5   3
## 4      18     313 11.5   62     5   4
## 5      NA      NA 14.3   56     5   5
## 6      28      NA 14.9   66     5   6
## 7      23     299  8.6   65     5   7
## 8      19      99 13.8   59     5   8
## 9       8      19 20.1   61     5   9
## 10     NA     194  8.6   69     5  10
## 11      7      NA  6.9   74     5  11
## 12     16     256  9.7   69     5  12
## 13     11     290  9.2   66     5  13
## 14     14     274 10.9   68     5  14
## 15     18      65 13.2   58     5  15
## 16     14     334 11.5   64     5  16
## 17     34     307 12.0   66     5  17
## 18      6      78 18.4   57     5  18
## 19     30     322 11.5   68     5  19
## 20     11      44  9.7   62     5  20
## 21      1       8  9.7   59     5  21
## 22     11     320 16.6   73     5  22
## 23      4      25  9.7   61     5  23
## 24     32      92 12.0   61     5  24
## 25     NA      66 16.6   57     5  25
## 26     NA     266 14.9   58     5  26
## 27     NA      NA  8.0   57     5  27
## 28     23      13 12.0   67     5  28
## 29     45     252 14.9   81     5  29
## 30    115     223  5.7   79     5  30
## 31     37     279  7.4   76     5  31
## 32     NA     286  8.6   78     6   1
## 33     NA     287  9.7   74     6   2
## 34     NA     242 16.1   67     6   3
## 35     NA     186  9.2   84     6   4
## 36     NA     220  8.6   85     6   5
## 37     NA     264 14.3   79     6   6
## 38     29     127  9.7   82     6   7
## 39     NA     273  6.9   87     6   8
## 40     71     291 13.8   90     6   9
## 41     39     323 11.5   87     6  10
## 42     NA     259 10.9   93     6  11
## 43     NA     250  9.2   92     6  12
## 44     23     148  8.0   82     6  13
## 45     NA     332 13.8   80     6  14
## 46     NA     322 11.5   79     6  15
## 47     21     191 14.9   77     6  16
## 48     37     284 20.7   72     6  17
## 49     20      37  9.2   65     6  18
## 50     12     120 11.5   73     6  19
## 51     13     137 10.3   76     6  20
## 52     NA     150  6.3   77     6  21
## 53     NA      59  1.7   76     6  22
## 54     NA      91  4.6   76     6  23
## 55     NA     250  6.3   76     6  24
## 56     NA     135  8.0   75     6  25
## 57     NA     127  8.0   78     6  26
## 58     NA      47 10.3   73     6  27
## 59     NA      98 11.5   80     6  28
## 60     NA      31 14.9   77     6  29
## 61     NA     138  8.0   83     6  30
## 62    135     269  4.1   84     7   1
## 63     49     248  9.2   85     7   2
## 64     32     236  9.2   81     7   3
## 65     NA     101 10.9   84     7   4
## 66     64     175  4.6   83     7   5
## 67     40     314 10.9   83     7   6
## 68     77     276  5.1   88     7   7
## 69     97     267  6.3   92     7   8
## 70     97     272  5.7   92     7   9
## 71     85     175  7.4   89     7  10
## 72     NA     139  8.6   82     7  11
## 73     10     264 14.3   73     7  12
## 74     27     175 14.9   81     7  13
## 75     NA     291 14.9   91     7  14
## 76      7      48 14.3   80     7  15
## 77     48     260  6.9   81     7  16
## 78     35     274 10.3   82     7  17
## 79     61     285  6.3   84     7  18
## 80     79     187  5.1   87     7  19
## 81     63     220 11.5   85     7  20
## 82     16       7  6.9   74     7  21
## 83     NA     258  9.7   81     7  22
## 84     NA     295 11.5   82     7  23
## 85     80     294  8.6   86     7  24
## 86    108     223  8.0   85     7  25
## 87     20      81  8.6   82     7  26
## 88     52      82 12.0   86     7  27
## 89     82     213  7.4   88     7  28
## 90     50     275  7.4   86     7  29
## 91     64     253  7.4   83     7  30
## 92     59     254  9.2   81     7  31
## 93     39      83  6.9   81     8   1
## 94      9      24 13.8   81     8   2
## 95     16      77  7.4   82     8   3
## 96     78      NA  6.9   86     8   4
## 97     35      NA  7.4   85     8   5
## 98     66      NA  4.6   87     8   6
## 99    122     255  4.0   89     8   7
## 100    89     229 10.3   90     8   8
## 101   110     207  8.0   90     8   9
## 102    NA     222  8.6   92     8  10
## 103    NA     137 11.5   86     8  11
## 104    44     192 11.5   86     8  12
## 105    28     273 11.5   82     8  13
## 106    65     157  9.7   80     8  14
## 107    NA      64 11.5   79     8  15
## 108    22      71 10.3   77     8  16
## 109    59      51  6.3   79     8  17
## 110    23     115  7.4   76     8  18
## 111    31     244 10.9   78     8  19
## 112    44     190 10.3   78     8  20
## 113    21     259 15.5   77     8  21
## 114     9      36 14.3   72     8  22
## 115    NA     255 12.6   75     8  23
## 116    45     212  9.7   79     8  24
## 117   168     238  3.4   81     8  25
## 118    73     215  8.0   86     8  26
## 119    NA     153  5.7   88     8  27
## 120    76     203  9.7   97     8  28
## 121   118     225  2.3   94     8  29
## 122    84     237  6.3   96     8  30
## 123    85     188  6.3   94     8  31
## 124    96     167  6.9   91     9   1
## 125    78     197  5.1   92     9   2
## 126    73     183  2.8   93     9   3
## 127    91     189  4.6   93     9   4
## 128    47      95  7.4   87     9   5
## 129    32      92 15.5   84     9   6
## 130    20     252 10.9   80     9   7
## 131    23     220 10.3   78     9   8
## 132    21     230 10.9   75     9   9
## 133    24     259  9.7   73     9  10
## 134    44     236 14.9   81     9  11
## 135    21     259 15.5   76     9  12
## 136    28     238  6.3   77     9  13
## 137     9      24 10.9   71     9  14
## 138    13     112 11.5   71     9  15
## 139    46     237  6.9   78     9  16
## 140    18     224 13.8   67     9  17
## 141    13      27 10.3   76     9  18
## 142    24     238 10.3   68     9  19
## 143    16     201  8.0   82     9  20
## 144    13     238 12.6   64     9  21
## 145    23      14  9.2   71     9  22
## 146    36     139 10.3   81     9  23
## 147     7      49 10.3   69     9  24
## 148    14      20 16.6   63     9  25
## 149    30     193  6.9   70     9  26
## 150    NA     145 13.2   77     9  27
## 151    14     191 14.3   75     9  28
## 152    18     131  8.0   76     9  29
## 153    20     223 11.5   68     9  30

           ## you can type ?airquality on the console to study the details of the data
y = airquality %>% melt(id.vars=c("Month", "Day"), variable.names=c(Ozone, Solar.R, Wind, Temp))
y

##     Month Day variable value
## 1       5   1    Ozone  41.0
## 2       5   2    Ozone  36.0
## 3       5   3    Ozone  12.0
## 4       5   4    Ozone  18.0
## 5       5   5    Ozone    NA
## 6       5   6    Ozone  28.0
## 7       5   7    Ozone  23.0
## 8       5   8    Ozone  19.0
## 9       5   9    Ozone   8.0
## 10      5  10    Ozone    NA
## 11      5  11    Ozone   7.0
## 12      5  12    Ozone  16.0
## 13      5  13    Ozone  11.0
## 14      5  14    Ozone  14.0
## 15      5  15    Ozone  18.0
## 16      5  16    Ozone  14.0
## 17      5  17    Ozone  34.0
## 18      5  18    Ozone   6.0
## 19      5  19    Ozone  30.0
## 20      5  20    Ozone  11.0
## 21      5  21    Ozone   1.0
## 22      5  22    Ozone  11.0
## 23      5  23    Ozone   4.0
## 24      5  24    Ozone  32.0
## 25      5  25    Ozone    NA
## 26      5  26    Ozone    NA
## 27      5  27    Ozone    NA
## 28      5  28    Ozone  23.0
## 29      5  29    Ozone  45.0
## 30      5  30    Ozone 115.0
## 31      5  31    Ozone  37.0
## 32      6   1    Ozone    NA
## 33      6   2    Ozone    NA
## 34      6   3    Ozone    NA
## 35      6   4    Ozone    NA
## 36      6   5    Ozone    NA
## 37      6   6    Ozone    NA
## 38      6   7    Ozone  29.0
## 39      6   8    Ozone    NA
## 40      6   9    Ozone  71.0
## 41      6  10    Ozone  39.0
## 42      6  11    Ozone    NA
## 43      6  12    Ozone    NA
## 44      6  13    Ozone  23.0
## 45      6  14    Ozone    NA
## 46      6  15    Ozone    NA
## 47      6  16    Ozone  21.0
## 48      6  17    Ozone  37.0
## 49      6  18    Ozone  20.0
## 50      6  19    Ozone  12.0
## 51      6  20    Ozone  13.0
## 52      6  21    Ozone    NA
## 53      6  22    Ozone    NA
## 54      6  23    Ozone    NA
## 55      6  24    Ozone    NA
## 56      6  25    Ozone    NA
## 57      6  26    Ozone    NA
## 58      6  27    Ozone    NA
## 59      6  28    Ozone    NA
## 60      6  29    Ozone    NA
## 61      6  30    Ozone    NA
## 62      7   1    Ozone 135.0
## 63      7   2    Ozone  49.0
## 64      7   3    Ozone  32.0
## 65      7   4    Ozone    NA
## 66      7   5    Ozone  64.0
## 67      7   6    Ozone  40.0
## 68      7   7    Ozone  77.0
## 69      7   8    Ozone  97.0
## 70      7   9    Ozone  97.0
## 71      7  10    Ozone  85.0
## 72      7  11    Ozone    NA
## 73      7  12    Ozone  10.0
## 74      7  13    Ozone  27.0
## 75      7  14    Ozone    NA
## 76      7  15    Ozone   7.0
## 77      7  16    Ozone  48.0
## 78      7  17    Ozone  35.0
## 79      7  18    Ozone  61.0
## 80      7  19    Ozone  79.0
## 81      7  20    Ozone  63.0
## 82      7  21    Ozone  16.0
## 83      7  22    Ozone    NA
## 84      7  23    Ozone    NA
## 85      7  24    Ozone  80.0
## 86      7  25    Ozone 108.0
## 87      7  26    Ozone  20.0
## 88      7  27    Ozone  52.0
## 89      7  28    Ozone  82.0
## 90      7  29    Ozone  50.0
## 91      7  30    Ozone  64.0
## 92      7  31    Ozone  59.0
## 93      8   1    Ozone  39.0
## 94      8   2    Ozone   9.0
## 95      8   3    Ozone  16.0
## 96      8   4    Ozone  78.0
## 97      8   5    Ozone  35.0
## 98      8   6    Ozone  66.0
## 99      8   7    Ozone 122.0
## 100     8   8    Ozone  89.0
## 101     8   9    Ozone 110.0
## 102     8  10    Ozone    NA
## 103     8  11    Ozone    NA
## 104     8  12    Ozone  44.0
## 105     8  13    Ozone  28.0
## 106     8  14    Ozone  65.0
## 107     8  15    Ozone    NA
## 108     8  16    Ozone  22.0
## 109     8  17    Ozone  59.0
## 110     8  18    Ozone  23.0
## 111     8  19    Ozone  31.0
## 112     8  20    Ozone  44.0
## 113     8  21    Ozone  21.0
## 114     8  22    Ozone   9.0
## 115     8  23    Ozone    NA
## 116     8  24    Ozone  45.0
## 117     8  25    Ozone 168.0
## 118     8  26    Ozone  73.0
## 119     8  27    Ozone    NA
## 120     8  28    Ozone  76.0
## 121     8  29    Ozone 118.0
## 122     8  30    Ozone  84.0
## 123     8  31    Ozone  85.0
## 124     9   1    Ozone  96.0
## 125     9   2    Ozone  78.0
## 126     9   3    Ozone  73.0
## 127     9   4    Ozone  91.0
## 128     9   5    Ozone  47.0
## 129     9   6    Ozone  32.0
## 130     9   7    Ozone  20.0
## 131     9   8    Ozone  23.0
## 132     9   9    Ozone  21.0
## 133     9  10    Ozone  24.0
## 134     9  11    Ozone  44.0
## 135     9  12    Ozone  21.0
## 136     9  13    Ozone  28.0
## 137     9  14    Ozone   9.0
## 138     9  15    Ozone  13.0
## 139     9  16    Ozone  46.0
## 140     9  17    Ozone  18.0
## 141     9  18    Ozone  13.0
## 142     9  19    Ozone  24.0
## 143     9  20    Ozone  16.0
## 144     9  21    Ozone  13.0
## 145     9  22    Ozone  23.0
## 146     9  23    Ozone  36.0
## 147     9  24    Ozone   7.0
## 148     9  25    Ozone  14.0
## 149     9  26    Ozone  30.0
## 150     9  27    Ozone    NA
## 151     9  28    Ozone  14.0
## 152     9  29    Ozone  18.0
## 153     9  30    Ozone  20.0
## 154     5   1  Solar.R 190.0
## 155     5   2  Solar.R 118.0
## 156     5   3  Solar.R 149.0
## 157     5   4  Solar.R 313.0
## 158     5   5  Solar.R    NA
## 159     5   6  Solar.R    NA
## 160     5   7  Solar.R 299.0
## 161     5   8  Solar.R  99.0
## 162     5   9  Solar.R  19.0
## 163     5  10  Solar.R 194.0
## 164     5  11  Solar.R    NA
## 165     5  12  Solar.R 256.0
## 166     5  13  Solar.R 290.0
## 167     5  14  Solar.R 274.0
## 168     5  15  Solar.R  65.0
## 169     5  16  Solar.R 334.0
## 170     5  17  Solar.R 307.0
## 171     5  18  Solar.R  78.0
## 172     5  19  Solar.R 322.0
## 173     5  20  Solar.R  44.0
## 174     5  21  Solar.R   8.0
## 175     5  22  Solar.R 320.0
## 176     5  23  Solar.R  25.0
## 177     5  24  Solar.R  92.0
## 178     5  25  Solar.R  66.0
## 179     5  26  Solar.R 266.0
## 180     5  27  Solar.R    NA
## 181     5  28  Solar.R  13.0
## 182     5  29  Solar.R 252.0
## 183     5  30  Solar.R 223.0
## 184     5  31  Solar.R 279.0
## 185     6   1  Solar.R 286.0
## 186     6   2  Solar.R 287.0
## 187     6   3  Solar.R 242.0
## 188     6   4  Solar.R 186.0
## 189     6   5  Solar.R 220.0
## 190     6   6  Solar.R 264.0
## 191     6   7  Solar.R 127.0
## 192     6   8  Solar.R 273.0
## 193     6   9  Solar.R 291.0
## 194     6  10  Solar.R 323.0
## 195     6  11  Solar.R 259.0
## 196     6  12  Solar.R 250.0
## 197     6  13  Solar.R 148.0
## 198     6  14  Solar.R 332.0
## 199     6  15  Solar.R 322.0
## 200     6  16  Solar.R 191.0
## 201     6  17  Solar.R 284.0
## 202     6  18  Solar.R  37.0
## 203     6  19  Solar.R 120.0
## 204     6  20  Solar.R 137.0
## 205     6  21  Solar.R 150.0
## 206     6  22  Solar.R  59.0
## 207     6  23  Solar.R  91.0
## 208     6  24  Solar.R 250.0
## 209     6  25  Solar.R 135.0
## 210     6  26  Solar.R 127.0
## 211     6  27  Solar.R  47.0
## 212     6  28  Solar.R  98.0
## 213     6  29  Solar.R  31.0
## 214     6  30  Solar.R 138.0
## 215     7   1  Solar.R 269.0
## 216     7   2  Solar.R 248.0
## 217     7   3  Solar.R 236.0
## 218     7   4  Solar.R 101.0
## 219     7   5  Solar.R 175.0
## 220     7   6  Solar.R 314.0
## 221     7   7  Solar.R 276.0
## 222     7   8  Solar.R 267.0
## 223     7   9  Solar.R 272.0
## 224     7  10  Solar.R 175.0
## 225     7  11  Solar.R 139.0
## 226     7  12  Solar.R 264.0
## 227     7  13  Solar.R 175.0
## 228     7  14  Solar.R 291.0
## 229     7  15  Solar.R  48.0
## 230     7  16  Solar.R 260.0
## 231     7  17  Solar.R 274.0
## 232     7  18  Solar.R 285.0
## 233     7  19  Solar.R 187.0
## 234     7  20  Solar.R 220.0
## 235     7  21  Solar.R   7.0
## 236     7  22  Solar.R 258.0
## 237     7  23  Solar.R 295.0
## 238     7  24  Solar.R 294.0
## 239     7  25  Solar.R 223.0
## 240     7  26  Solar.R  81.0
## 241     7  27  Solar.R  82.0
## 242     7  28  Solar.R 213.0
## 243     7  29  Solar.R 275.0
## 244     7  30  Solar.R 253.0
## 245     7  31  Solar.R 254.0
## 246     8   1  Solar.R  83.0
## 247     8   2  Solar.R  24.0
## 248     8   3  Solar.R  77.0
## 249     8   4  Solar.R    NA
## 250     8   5  Solar.R    NA
## 251     8   6  Solar.R    NA
## 252     8   7  Solar.R 255.0
## 253     8   8  Solar.R 229.0
## 254     8   9  Solar.R 207.0
## 255     8  10  Solar.R 222.0
## 256     8  11  Solar.R 137.0
## 257     8  12  Solar.R 192.0
## 258     8  13  Solar.R 273.0
## 259     8  14  Solar.R 157.0
## 260     8  15  Solar.R  64.0
## 261     8  16  Solar.R  71.0
## 262     8  17  Solar.R  51.0
## 263     8  18  Solar.R 115.0
## 264     8  19  Solar.R 244.0
## 265     8  20  Solar.R 190.0
## 266     8  21  Solar.R 259.0
## 267     8  22  Solar.R  36.0
## 268     8  23  Solar.R 255.0
## 269     8  24  Solar.R 212.0
## 270     8  25  Solar.R 238.0
## 271     8  26  Solar.R 215.0
## 272     8  27  Solar.R 153.0
## 273     8  28  Solar.R 203.0
## 274     8  29  Solar.R 225.0
## 275     8  30  Solar.R 237.0
## 276     8  31  Solar.R 188.0
## 277     9   1  Solar.R 167.0
## 278     9   2  Solar.R 197.0
## 279     9   3  Solar.R 183.0
## 280     9   4  Solar.R 189.0
## 281     9   5  Solar.R  95.0
## 282     9   6  Solar.R  92.0
## 283     9   7  Solar.R 252.0
## 284     9   8  Solar.R 220.0
## 285     9   9  Solar.R 230.0
## 286     9  10  Solar.R 259.0
## 287     9  11  Solar.R 236.0
## 288     9  12  Solar.R 259.0
## 289     9  13  Solar.R 238.0
## 290     9  14  Solar.R  24.0
## 291     9  15  Solar.R 112.0
## 292     9  16  Solar.R 237.0
## 293     9  17  Solar.R 224.0
## 294     9  18  Solar.R  27.0
## 295     9  19  Solar.R 238.0
## 296     9  20  Solar.R 201.0
## 297     9  21  Solar.R 238.0
## 298     9  22  Solar.R  14.0
## 299     9  23  Solar.R 139.0
## 300     9  24  Solar.R  49.0
## 301     9  25  Solar.R  20.0
## 302     9  26  Solar.R 193.0
## 303     9  27  Solar.R 145.0
## 304     9  28  Solar.R 191.0
## 305     9  29  Solar.R 131.0
## 306     9  30  Solar.R 223.0
## 307     5   1     Wind   7.4
## 308     5   2     Wind   8.0
## 309     5   3     Wind  12.6
## 310     5   4     Wind  11.5
## 311     5   5     Wind  14.3
## 312     5   6     Wind  14.9
## 313     5   7     Wind   8.6
## 314     5   8     Wind  13.8
## 315     5   9     Wind  20.1
## 316     5  10     Wind   8.6
## 317     5  11     Wind   6.9
## 318     5  12     Wind   9.7
## 319     5  13     Wind   9.2
## 320     5  14     Wind  10.9
## 321     5  15     Wind  13.2
## 322     5  16     Wind  11.5
## 323     5  17     Wind  12.0
## 324     5  18     Wind  18.4
## 325     5  19     Wind  11.5
## 326     5  20     Wind   9.7
## 327     5  21     Wind   9.7
## 328     5  22     Wind  16.6
## 329     5  23     Wind   9.7
## 330     5  24     Wind  12.0
## 331     5  25     Wind  16.6
## 332     5  26     Wind  14.9
## 333     5  27     Wind   8.0
## 334     5  28     Wind  12.0
## 335     5  29     Wind  14.9
## 336     5  30     Wind   5.7
## 337     5  31     Wind   7.4
## 338     6   1     Wind   8.6
## 339     6   2     Wind   9.7
## 340     6   3     Wind  16.1
## 341     6   4     Wind   9.2
## 342     6   5     Wind   8.6
## 343     6   6     Wind  14.3
## 344     6   7     Wind   9.7
## 345     6   8     Wind   6.9
## 346     6   9     Wind  13.8
## 347     6  10     Wind  11.5
## 348     6  11     Wind  10.9
## 349     6  12     Wind   9.2
## 350     6  13     Wind   8.0
## 351     6  14     Wind  13.8
## 352     6  15     Wind  11.5
## 353     6  16     Wind  14.9
## 354     6  17     Wind  20.7
## 355     6  18     Wind   9.2
## 356     6  19     Wind  11.5
## 357     6  20     Wind  10.3
## 358     6  21     Wind   6.3
## 359     6  22     Wind   1.7
## 360     6  23     Wind   4.6
## 361     6  24     Wind   6.3
## 362     6  25     Wind   8.0
## 363     6  26     Wind   8.0
## 364     6  27     Wind  10.3
## 365     6  28     Wind  11.5
## 366     6  29     Wind  14.9
## 367     6  30     Wind   8.0
## 368     7   1     Wind   4.1
## 369     7   2     Wind   9.2
## 370     7   3     Wind   9.2
## 371     7   4     Wind  10.9
## 372     7   5     Wind   4.6
## 373     7   6     Wind  10.9
## 374     7   7     Wind   5.1
## 375     7   8     Wind   6.3
## 376     7   9     Wind   5.7
## 377     7  10     Wind   7.4
## 378     7  11     Wind   8.6
## 379     7  12     Wind  14.3
## 380     7  13     Wind  14.9
## 381     7  14     Wind  14.9
## 382     7  15     Wind  14.3
## 383     7  16     Wind   6.9
## 384     7  17     Wind  10.3
## 385     7  18     Wind   6.3
## 386     7  19     Wind   5.1
## 387     7  20     Wind  11.5
## 388     7  21     Wind   6.9
## 389     7  22     Wind   9.7
## 390     7  23     Wind  11.5
## 391     7  24     Wind   8.6
## 392     7  25     Wind   8.0
## 393     7  26     Wind   8.6
## 394     7  27     Wind  12.0
## 395     7  28     Wind   7.4
## 396     7  29     Wind   7.4
## 397     7  30     Wind   7.4
## 398     7  31     Wind   9.2
## 399     8   1     Wind   6.9
## 400     8   2     Wind  13.8
## 401     8   3     Wind   7.4
## 402     8   4     Wind   6.9
## 403     8   5     Wind   7.4
## 404     8   6     Wind   4.6
## 405     8   7     Wind   4.0
## 406     8   8     Wind  10.3
## 407     8   9     Wind   8.0
## 408     8  10     Wind   8.6
## 409     8  11     Wind  11.5
## 410     8  12     Wind  11.5
## 411     8  13     Wind  11.5
## 412     8  14     Wind   9.7
## 413     8  15     Wind  11.5
## 414     8  16     Wind  10.3
## 415     8  17     Wind   6.3
## 416     8  18     Wind   7.4
## 417     8  19     Wind  10.9
## 418     8  20     Wind  10.3
## 419     8  21     Wind  15.5
## 420     8  22     Wind  14.3
## 421     8  23     Wind  12.6
## 422     8  24     Wind   9.7
## 423     8  25     Wind   3.4
## 424     8  26     Wind   8.0
## 425     8  27     Wind   5.7
## 426     8  28     Wind   9.7
## 427     8  29     Wind   2.3
## 428     8  30     Wind   6.3
## 429     8  31     Wind   6.3
## 430     9   1     Wind   6.9
## 431     9   2     Wind   5.1
## 432     9   3     Wind   2.8
## 433     9   4     Wind   4.6
## 434     9   5     Wind   7.4
## 435     9   6     Wind  15.5
## 436     9   7     Wind  10.9
## 437     9   8     Wind  10.3
## 438     9   9     Wind  10.9
## 439     9  10     Wind   9.7
## 440     9  11     Wind  14.9
## 441     9  12     Wind  15.5
## 442     9  13     Wind   6.3
## 443     9  14     Wind  10.9
## 444     9  15     Wind  11.5
## 445     9  16     Wind   6.9
## 446     9  17     Wind  13.8
## 447     9  18     Wind  10.3
## 448     9  19     Wind  10.3
## 449     9  20     Wind   8.0
## 450     9  21     Wind  12.6
## 451     9  22     Wind   9.2
## 452     9  23     Wind  10.3
## 453     9  24     Wind  10.3
## 454     9  25     Wind  16.6
## 455     9  26     Wind   6.9
## 456     9  27     Wind  13.2
## 457     9  28     Wind  14.3
## 458     9  29     Wind   8.0
## 459     9  30     Wind  11.5
## 460     5   1     Temp  67.0
## 461     5   2     Temp  72.0
## 462     5   3     Temp  74.0
## 463     5   4     Temp  62.0
## 464     5   5     Temp  56.0
## 465     5   6     Temp  66.0
## 466     5   7     Temp  65.0
## 467     5   8     Temp  59.0
## 468     5   9     Temp  61.0
## 469     5  10     Temp  69.0
## 470     5  11     Temp  74.0
## 471     5  12     Temp  69.0
## 472     5  13     Temp  66.0
## 473     5  14     Temp  68.0
## 474     5  15     Temp  58.0
## 475     5  16     Temp  64.0
## 476     5  17     Temp  66.0
## 477     5  18     Temp  57.0
## 478     5  19     Temp  68.0
## 479     5  20     Temp  62.0
## 480     5  21     Temp  59.0
## 481     5  22     Temp  73.0
## 482     5  23     Temp  61.0
## 483     5  24     Temp  61.0
## 484     5  25     Temp  57.0
## 485     5  26     Temp  58.0
## 486     5  27     Temp  57.0
## 487     5  28     Temp  67.0
## 488     5  29     Temp  81.0
## 489     5  30     Temp  79.0
## 490     5  31     Temp  76.0
## 491     6   1     Temp  78.0
## 492     6   2     Temp  74.0
## 493     6   3     Temp  67.0
## 494     6   4     Temp  84.0
## 495     6   5     Temp  85.0
## 496     6   6     Temp  79.0
## 497     6   7     Temp  82.0
## 498     6   8     Temp  87.0
## 499     6   9     Temp  90.0
## 500     6  10     Temp  87.0
## 501     6  11     Temp  93.0
## 502     6  12     Temp  92.0
## 503     6  13     Temp  82.0
## 504     6  14     Temp  80.0
## 505     6  15     Temp  79.0
## 506     6  16     Temp  77.0
## 507     6  17     Temp  72.0
## 508     6  18     Temp  65.0
## 509     6  19     Temp  73.0
## 510     6  20     Temp  76.0
## 511     6  21     Temp  77.0
## 512     6  22     Temp  76.0
## 513     6  23     Temp  76.0
## 514     6  24     Temp  76.0
## 515     6  25     Temp  75.0
## 516     6  26     Temp  78.0
## 517     6  27     Temp  73.0
## 518     6  28     Temp  80.0
## 519     6  29     Temp  77.0
## 520     6  30     Temp  83.0
## 521     7   1     Temp  84.0
## 522     7   2     Temp  85.0
## 523     7   3     Temp  81.0
## 524     7   4     Temp  84.0
## 525     7   5     Temp  83.0
## 526     7   6     Temp  83.0
## 527     7   7     Temp  88.0
## 528     7   8     Temp  92.0
## 529     7   9     Temp  92.0
## 530     7  10     Temp  89.0
## 531     7  11     Temp  82.0
## 532     7  12     Temp  73.0
## 533     7  13     Temp  81.0
## 534     7  14     Temp  91.0
## 535     7  15     Temp  80.0
## 536     7  16     Temp  81.0
## 537     7  17     Temp  82.0
## 538     7  18     Temp  84.0
## 539     7  19     Temp  87.0
## 540     7  20     Temp  85.0
## 541     7  21     Temp  74.0
## 542     7  22     Temp  81.0
## 543     7  23     Temp  82.0
## 544     7  24     Temp  86.0
## 545     7  25     Temp  85.0
## 546     7  26     Temp  82.0
## 547     7  27     Temp  86.0
## 548     7  28     Temp  88.0
## 549     7  29     Temp  86.0
## 550     7  30     Temp  83.0
## 551     7  31     Temp  81.0
## 552     8   1     Temp  81.0
## 553     8   2     Temp  81.0
## 554     8   3     Temp  82.0
## 555     8   4     Temp  86.0
## 556     8   5     Temp  85.0
## 557     8   6     Temp  87.0
## 558     8   7     Temp  89.0
## 559     8   8     Temp  90.0
## 560     8   9     Temp  90.0
## 561     8  10     Temp  92.0
## 562     8  11     Temp  86.0
## 563     8  12     Temp  86.0
## 564     8  13     Temp  82.0
## 565     8  14     Temp  80.0
## 566     8  15     Temp  79.0
## 567     8  16     Temp  77.0
## 568     8  17     Temp  79.0
## 569     8  18     Temp  76.0
## 570     8  19     Temp  78.0
## 571     8  20     Temp  78.0
## 572     8  21     Temp  77.0
## 573     8  22     Temp  72.0
## 574     8  23     Temp  75.0
## 575     8  24     Temp  79.0
## 576     8  25     Temp  81.0
## 577     8  26     Temp  86.0
## 578     8  27     Temp  88.0
## 579     8  28     Temp  97.0
## 580     8  29     Temp  94.0
## 581     8  30     Temp  96.0
## 582     8  31     Temp  94.0
## 583     9   1     Temp  91.0
## 584     9   2     Temp  92.0
## 585     9   3     Temp  93.0
## 586     9   4     Temp  93.0
## 587     9   5     Temp  87.0
## 588     9   6     Temp  84.0
## 589     9   7     Temp  80.0
## 590     9   8     Temp  78.0
## 591     9   9     Temp  75.0
## 592     9  10     Temp  73.0
## 593     9  11     Temp  81.0
## 594     9  12     Temp  76.0
## 595     9  13     Temp  77.0
## 596     9  14     Temp  71.0
## 597     9  15     Temp  71.0
## 598     9  16     Temp  78.0
## 599     9  17     Temp  67.0
## 600     9  18     Temp  76.0
## 601     9  19     Temp  68.0
## 602     9  20     Temp  82.0
## 603     9  21     Temp  64.0
## 604     9  22     Temp  71.0
## 605     9  23     Temp  81.0
## 606     9  24     Temp  69.0
## 607     9  25     Temp  63.0
## 608     9  26     Temp  70.0
## 609     9  27     Temp  77.0
## 610     9  28     Temp  75.0
## 611     9  29     Temp  76.0
## 612     9  30     Temp  68.0

Query Using the sqldf Package

Reference: https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html

SQL is a database query language - a language designed specifically for interacting with a database.

library(sqldf) # Load the package

## Loading required package: gsubfn

## Loading required package: proto

## Loading required package: RSQLite

#library(RSQLite)
sqldf('SELECT mpg FROM mtcars')

##     mpg
## 1  21.0
## 2  21.0
## 3  22.8
## 4  21.4
## 5  18.7
## 6  18.1
## 7  14.3
## 8  24.4
## 9  22.8
## 10 19.2
## 11 17.8
## 12 16.4
## 13 17.3
## 14 15.2
## 15 10.4
## 16 10.4
## 17 14.7
## 18 32.4
## 19 30.4
## 20 33.9
## 21 21.5
## 22 15.5
## 23 15.2
## 24 13.3
## 25 19.2
## 26 27.3
## 27 26.0
## 28 30.4
## 29 15.8
## 30 19.7
## 31 15.0
## 32 21.4

sqldf('SELECT mpg, disp FROM mtcars')

##     mpg  disp
## 1  21.0 160.0
## 2  21.0 160.0
## 3  22.8 108.0
## 4  21.4 258.0
## 5  18.7 360.0
## 6  18.1 225.0
## 7  14.3 360.0
## 8  24.4 146.7
## 9  22.8 140.8
## 10 19.2 167.6
## 11 17.8 167.6
## 12 16.4 275.8
## 13 17.3 275.8
## 14 15.2 275.8
## 15 10.4 472.0
## 16 10.4 460.0
## 17 14.7 440.0
## 18 32.4  78.7
## 19 30.4  75.7
## 20 33.9  71.1
## 21 21.5 120.1
## 22 15.5 318.0
## 23 15.2 304.0
## 24 13.3 350.0
## 25 19.2 400.0
## 26 27.3  79.0
## 27 26.0 120.3
## 28 30.4  95.1
## 29 15.8 351.0
## 30 19.7 145.0
## 31 15.0 301.0
## 32 21.4 121.0

sqldf('SELECT mpg, disp FROM mtcars WHERE am = 1 ORDER BY mpg ASC')

##     mpg  disp
## 1  15.0 301.0
## 2  15.8 351.0
## 3  19.7 145.0
## 4  21.0 160.0
## 5  21.0 160.0
## 6  21.4 121.0
## 7  22.8 108.0
## 8  26.0 120.3
## 9  27.3  79.0
## 10 30.4  75.7
## 11 30.4  95.1
## 12 32.4  78.7
## 13 33.9  71.1

sqldf('SELECT * FROM mtcars WHERE (mpg > 20 AND disp < 95) OR carb > 1000')

##    mpg cyl disp hp drat    wt  qsec  vs am gear carb
## 1 32.4   4 78.7 66 4.08 2.200 19.47 low  1    4    1
## 2 30.4   4 75.7 52 4.93 1.615 18.52 low  1    4    2
## 3 33.9   4 71.1 65 4.22 1.835 19.90 low  1    4    1
## 4 27.3   4 79.0 66 4.08 1.935 18.90 low  1    4    1

sqldf('SELECT * FROM mtcars WHERE carb NOT IN (1,2)')

##     mpg cyl  disp  hp drat    wt  qsec   vs am gear carb
## 1  21.0   6 160.0 110 3.90 2.620 16.46 high  1    4    4
## 2  21.0   6 160.0 110 3.90 2.875 17.02 high  1    4    4
## 3  14.3   8 360.0 245 3.21 3.570 15.84 high  0    3    4
## 4  19.2   6 167.6 123 3.92 3.440 18.30  low  0    4    4
## 5  17.8   6 167.6 123 3.92 3.440 18.90  low  0    4    4
## 6  16.4   8 275.8 180 3.07 4.070 17.40 high  0    3    3
## 7  17.3   8 275.8 180 3.07 3.730 17.60 high  0    3    3
## 8  15.2   8 275.8 180 3.07 3.780 18.00 high  0    3    3
## 9  10.4   8 472.0 205 2.93 5.250 17.98 high  0    3    4
## 10 10.4   8 460.0 215 3.00 5.424 17.82 high  0    3    4
## 11 14.7   8 440.0 230 3.23 5.345 17.42 high  0    3    4
## 12 13.3   8 350.0 245 3.73 3.840 15.41 high  0    3    4
## 13 15.8   8 351.0 264 4.22 3.170 14.50 high  1    5    4
## 14 19.7   6 145.0 175 3.62 2.770 15.50 high  1    5    6
## 15 15.0   8 301.0 335 3.54 3.570 14.60 high  1    5    8

sqldf("SELECT am, AVG(mpg) AS Avgmpg FROM mtcars GROUP BY am")

##   am   Avgmpg
## 1  0 17.14737
## 2  1 24.39231

sqldf("SELECT am, COUNT() as count FROM mtcars GROUP BY am")

##   am count
## 1  0    19
## 2  1    13

Relational database management system (RDBMS) in R

We introduce how to use SQLite databases in R using the RSQLite package. You will learn to perform the basic tasks such as sending queries to a SQLite database, parameterised queries, or creating tables using RSQLite. The power of the intersection between R and SQLite can be seen in parameterized queries, which could be used if you need to query a database in order to display information based on user input inside an R Shiny app.

For more details, refer to https://www.datacamp.com/community/tutorials/sqlite-in-r.

library(RSQLite)

# Create a connection to our new database, CarsDB.db
conn <- dbConnect(drv = RSQLite::SQLite(),  # drv is a database driver
                  "CarsDB.db"   # A path to a SQLite database
                 )

# Create a column that names each car in the "mtcars" data frame
D = mtcars
D$car_names <- rownames(D)
rownames(D) <- NULL


# Write the mtcars dataset into a table named mtcars_data
dbWriteTable(conn, "cars_data", D, overwrite=TRUE)
# List all the tables available in the database
dbListTables(conn)

## [1] "cars_data"

# Get the car names and horsepower of the cars with 8 cylinders
dbGetQuery(conn,"SELECT car_names, hp, cyl FROM cars_data
                 WHERE cyl = 8")

##              car_names  hp cyl
## 1    Hornet Sportabout 175   8
## 2           Duster 360 245   8
## 3           Merc 450SE 180   8
## 4           Merc 450SL 180   8
## 5          Merc 450SLC 180   8
## 6   Cadillac Fleetwood 205   8
## 7  Lincoln Continental 215   8
## 8    Chrysler Imperial 230   8
## 9     Dodge Challenger 150   8
## 10         AMC Javelin 150   8
## 11          Camaro Z28 245   8
## 12    Pontiac Firebird 175   8
## 13      Ford Pantera L 264   8
## 14       Maserati Bora 335   8

# Get the car names and horsepower starting with M that have 6 or 8 cylinders
dbGetQuery(conn,"SELECT car_names, hp, cyl FROM cars_data
                 WHERE car_names LIKE 'M%' AND cyl IN (6,8)")

##       car_names  hp cyl
## 1     Mazda RX4 110   6
## 2 Mazda RX4 Wag 110   6
## 3      Merc 280 123   6
## 4     Merc 280C 123   6
## 5    Merc 450SE 180   8
## 6    Merc 450SL 180   8
## 7   Merc 450SLC 180   8
## 8 Maserati Bora 335   8

# Get the average horsepower and mpg by number of cylinder groups
dbGetQuery(conn,"SELECT cyl, AVG(hp) AS 'average_hp', AVG(mpg) AS 'average_mpg' FROM cars_data
                 GROUP BY cyl
                 ORDER BY average_hp")

##   cyl average_hp average_mpg
## 1   4   82.63636    26.66364
## 2   6  122.28571    19.74286
## 3   8  209.21429    15.10000

avg_HpCyl <- dbGetQuery(conn,"SELECT cyl, AVG(hp) AS 'average_hp'FROM cars_data
                 GROUP BY cyl
                 ORDER BY average_hp")
avg_HpCyl

##   cyl average_hp
## 1   4   82.63636
## 2   6  122.28571
## 3   8  209.21429

class(avg_HpCyl)

## [1] "data.frame"

# Lets assume that there is some user input that asks us to look only into cars that have over 18 miles per gallon (mpg)
# and more than 6 cylinders
mpg <-  18
cyl <- 6
Result <- dbGetQuery(conn, 'SELECT car_names, mpg, cyl FROM cars_data WHERE mpg >= ? AND cyl >= ?', params = c(mpg,cyl))
Result

##           car_names  mpg cyl
## 1         Mazda RX4 21.0   6
## 2     Mazda RX4 Wag 21.0   6
## 3    Hornet 4 Drive 21.4   6
## 4 Hornet Sportabout 18.7   8
## 5           Valiant 18.1   6
## 6          Merc 280 19.2   6
## 7  Pontiac Firebird 19.2   8
## 8      Ferrari Dino 19.7   6

# Delete the column belonging to the Mazda RX4. You will see a 1 as the output.
dbExecute(conn, "DELETE FROM cars_data WHERE car_names = 'Mazda RX4'")

## [1] 1

# Insert the data for the Mazda RX4. This will also ouput a 1
dbExecute(conn, "INSERT INTO cars_data VALUES (21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4,'Mazda RX4')")

## [1] 1

Regular Expression

When you scrape webpages to get data, you will need to manipulate the data, since the data may contain some symbols that are not directly suitable for analysis.

Here is ta RStudio cheatsheet for regular expressions: https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

library(stringr)

y = c("2,345", "34,156", "6,571,089", "89,071")
y1 = str_replace_all(string=y, pattern=",", replacement="") ## Replace all commas by ""; i.e., remove all commas
y2 = as.numeric(y1)
mean(y2)

## [1] 1674165

y %>% str_replace_all(pattern=",", replacement="") %>% as.numeric() %>% mean()

## [1] 1674165

z = c("Hello World!", "How are you?", "I'm sorry.")
str_split(z, " ") ## Split each string in the x vector by a single space

## [[1]]
## [1] "Hello"  "World!"
## 
## [[2]]
## [1] "How"  "are"  "you?"
## 
## [[3]]
## [1] "I'm"    "sorry."

strsplit(z, " ") ## Split each string in the x vector by a single space. Available from the base package

## [[1]]
## [1] "Hello"  "World!"
## 
## [[2]]
## [1] "How"  "are"  "you?"
## 
## [[3]]
## [1] "I'm"    "sorry."

x=c("$56,987.00", "f6 8h$7 h", "8j%  #2,%f &*      ^j-5$d")
x

## [1] "$56,987.00"                "f6 8h$7 h"                
## [3] "8j%  #2,%f &*      ^j-5$d"

str_replace_all(string=x, pattern="\\$|\\%", replacement="") ## Replace all "$" or "%" signs by "", i.e., remove all "$" or "%" signs

## [1] "56,987.00"              "f6 8h7 h"               "8j  #2,f &*      ^j-5d"

str_replace_all(string=x, pattern="\\D", replacement="")  ## Replace all non-digits by "", i.e., remove all non-digits

## [1] "5698700" "687"     "825"

str_replace_all(string=x, pattern="\\d", replacement="")  ## Replace all digits by "", i.e., remove all digits

## [1] "$,."                    "f h$ h"                 "j%  #,%f &*      ^j-$d"

str_replace_all(string=x, pattern="\\W", replacement="")  ## Replace all non-word characters by "", i.e., remove non-word characters

## [1] "5698700" "f68h7h"  "8j2fj5d"

str_replace_all(string=x, pattern="\\s", replacement="")  ## Replace all spaces by "", i.e., remove non-word characters

## [1] "$56,987.00"       "f68h$7h"          "8j%#2,%f&*^j-5$d"

## The following replaces all spaces, dollar signs or (|) commas by "", i.e., remove non-word characters
str_replace_all(string=x, pattern="\\s|\\$|,", replacement="")

## [1] "56987.00"       "f68h7h"         "8j%#2%f&*^j-5d"

## \w matches any “word” character, which includes alphabetic characters, marks and decimal numbers. 
## The complement, \W, matches any non-word character.
str_extract_all("Don't eat that 67 m5d!", "\\w")[[1]]

##  [1] "D" "o" "n" "t" "e" "a" "t" "t" "h" "a" "t" "6" "7" "m" "5" "d"

## [abc]: matches a, b, or c. 
## [a-z]: matches every character between a and z (in Unicode code point order). 
## [^abc]: matches anything except a, b, or c.
## [\$\-]: matches $ or -.

# Anchors:
## ^ matches the start of string.
## $ matches the end of the string.


## \d: matches any digit. The complement, \D, matches any character that is not a decimal digit.

## \s: matches any whitespace. This includes tabs, newlines. The complement, \S, matches any non-whitespace character.

## Remove a substring starting with a particular character
str1 <- c("2,099@St Cloud","459,606@DC@USA")
s1<-gsub("@.*|,", "", str1)
s1

## [1] "2099"   "459606"

str2 <- c("2,099[3]","459,606")
s2<-gsub("\\[.*|,", "", str2)
s2

## [1] "2099"   "459606"

# The "." symbol represents any symbol except the new line symbol "\n"
# + means matches one or more and * matches 0 or more
# In the following, the pattern starts with "[", is followed by 
# one or more none new line symbols, and ends with "]".
gsub("\\[.+\\]", "", c("12,908", "450,896,054[12]"))

## [1] "12,908"      "450,896,054"

#if you want to keep the @ character in str1:
s<-gsub("(@).*","\\1",str1)
s

## [1] "2,099@"   "459,606@"

#If what you want to remove everything from the last @ in str1:
s<-gsub("(.*)@.*","\\1",str1)
s

## [1] "2,099"      "459,606@DC"

Handling Dates in R

## Date Formats ##
# Y=year in 4 digits
# M=no such thing
# D=no such thing
# y=tricky
# m=month in two digits
# d=day in a month


x <- c("01-01-1960", "01-02-1960", "1-3-1960", "1-4-1960", "01-05-1960", "01-06-1960", "1-7-1960", "1-8-1960")
z <- as.Date(x = x, format = "%m-%d-%Y") # m=month in number
z

## [1] "1960-01-01" "1960-01-02" "1960-01-03" "1960-01-04" "1960-01-05"
## [6] "1960-01-06" "1960-01-07" "1960-01-08"

stockPrice = c(56, 64, 67, 71, 84, 91, 95, 97) 
plot(stockPrice ~ z) # Does not show x-values as dates

plot(stockPrice ~ z, xaxt = "n", xlab = "") # Setting xaxt = "n" removes x-values; Setting xlab = "" removes x-axis title
axis(side = 1, at = z,labels = x, las = 2) # Adding labels on the x-axis at designated positions and displaying labels perpendicularly

par(mar=c(b=7,l=4,t=4,r=4))

plot(stockPrice ~ z, xaxt = "n", xlab = "") # Setting xaxt = "n" removes x-values; Setting xlab = "" removes x-axis title
axis(1, at = z, labels = x, las = 2) # Adding labels on the x-axis at designated positions and displaying labels perpendicularly

ggplot(NULL, aes(x=z, y = stockPrice)) +
  geom_point() +
  scale_x_date(date_breaks = "day", date_labels = "%a \n %b %d, %Y") + # Formatting x-values
  labs(x="")

x <- c("1960-01-01", "1960-01-02", "1960-03-31", "1960-07-30")
z <- as.Date(x, "%Y-%m-%d") # m=month in number
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

## More examples: can be skipped
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- as.Date(x, format="%d%b%Y") # b=month in abb, d=day in a month, Y=year in 4 digits
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

x <- c("1jan60", "2jan60", "31mar60", "30jul60")
z <- as.Date(x, "%d%b%y") # b=month in abb, Y=year in 4 digits
z

## [1] "2060-01-01" "2060-01-02" "2060-03-31" "2060-07-30"

x <- c("jan11960", "jan21960", "mar311960", "jul301960")
z <- as.Date(x, "%b%d%Y") ## Wrong
z

## [1] "0960-01-11" "0960-01-21" "1960-03-31" "1960-07-30"

x <- c("jan11960", "jan21960", "mar311960", "jul301960")
z <- as.Date(x, "%b%D%Y") ## Wrong
z

## [1] NA NA NA NA

x <- c("jan011960", "jan021960", "mar311960", "jul301960")
z <- as.Date(x, "%b%d%Y")
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

x <- c("1960jan1", "1960jan2", "1960mar31", "1960jul30")
z <- as.Date(x, "%Y%b%d")
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

x <- c("01011960", "01021960", "03311960", "07301960")
z <- as.Date(x, "%m%d%Y") # m=month in number
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

x <- c("01/01/1960", "01/02/1960", "03/31/1960", "07/30/1960")
z <- as.Date(x, "%m/%d/%Y") # m=month in number
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

x <- c("1/1/1960", "1/2/1960", "3/31/1960", "7/30/1960")
z <- as.Date(x, "%m/%d/%Y") # Leading 0 ignored
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

y <- c("01011960", "01021960", "03311960", "07301960")
z <- as.Date(y, "%m%d%y") # Wrong in year
z

## [1] "2019-01-01" "2019-01-02" "2019-03-31" "2019-07-30"

str_replace_all(z, "-", "/") # Replace "-" in all dates by "/"

## [1] "2019/01/01" "2019/01/02" "2019/03/31" "2019/07/30"

## [1] "2019-01-01" "2019-01-02" "2019-03-31" "2019-07-30"

y <- c("196001011", "19600102", "19600331", "19600730")
z <- as.Date(y, "%Y/%m/%d") # m=month in number
z

## [1] NA NA NA NA

y <- c("1960Jan011", "1960Jan02", "1960Mar31", "1960Jul30")
z <- as.Date(y, "%Y%B%d") # m=month in number
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

format(z, "%a %b %d %X %Y %Z") # a=day in a week

## [1] "Fri Jan 01 00:00:00 1960 UTC" "Sat Jan 02 00:00:00 1960 UTC"
## [3] "Thu Mar 31 00:00:00 1960 UTC" "Sat Jul 30 00:00:00 1960 UTC"

x <- c("1/1/1960", "1/2/1960", "3/31/1960", "7/30/1960")
z <- as.Date(x, "%m/%d/%Y") # Leading 0 ignored
z

## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

format(z, "%B %d,%Y")

## [1] "January 01,1960" "January 02,1960" "March 31,1960"   "July 30,1960"

format(z, "%Y-%B-%d")

## [1] "1960-January-01" "1960-January-02" "1960-March-31"   "1960-July-30"

Some Interesting Topics

Making Nice Tables through R Markdown

This is based on a resource here (Right click to open a new tab)

baseline_tbl <- structure(list(var = c("Total", "Age at initial diagnosis", "Age at initial diagnosis", 
"Age at initial diagnosis", "Age at initial diagnosis", "Year of inital diagnosis", 
"Year of inital diagnosis", "Year of inital diagnosis", "Year of inital diagnosis", 
"Year of inital diagnosis", "Year of inital diagnosis", "Year of inital diagnosis", 
"Is advanced NSCLC", "Is advanced NSCLC", "Is advanced NSCLC", 
"Age at advanced diagnosis", "Age at advanced diagnosis", "Age at advanced diagnosis", 
"Age at advanced diagnosis"), level = c("", "<65", ">=65", "Missing", 
"Median", "Prior 2015", "2015", "2016", "2017", "2018", "2019", 
"2020", "Yes", "No", "Missing", "<65", ">=65", "Missing", "Median"
), nsclc = c(1320L, 695L, 620L, 5L, 64L, 305L, 320L, 403L, 253L, 
24L, 0L, 0L, 1026L, 294L, 0L, 553L, 467L, 300L, 64L), nsclc_pct = c(NA, 
0.526515151515151, 0.46969696969697, 0.00378787878787879, NA, 
0.231060606060606, 0.242424242424242, 0.30530303030303, 0.191666666666667, 
0.0181818181818182, 0, 0, 0.777272727272727, 0.222727272727273, 
0, 0.418939393939394, 0.353787878787879, 0.227272727272727, NA
), adv_nsclc = c(1026, 575, 446, 5, 63, 275, 260, 295, 170, 16, 
0, 0, 1026, 0, 0, 553, 467, 6, 63.7), adv_pct = c(NA, 0.560428849902534, 
0.434697855750487, 0.00487329434697856, NA, 0.268031189083821, 
0.253411306042885, 0.287524366471735, 0.165692007797271, 0.0155945419103314, 
0, 0, 1, 0, 0, 0.538986354775828, 0.455165692007797, 0.00584795321637427, 
NA), early_nsclc = c(656, 319, 334, 3, 65, 164, 151, 199, 121, 
10, 0, 0, 364, 292, 0, 178, 183, 295, 65.21), early_pct = c(NA, 
0.486280487804878, 0.509146341463415, 0.00457317073170732, NA, 
0.25, 0.230182926829268, 0.303353658536585, 0.184451219512195, 
0.0152439024390244, 0, 0, 0.554878048780488, 0.445121951219512, 
0, 0.271341463414634, 0.278963414634146, 0.44969512195122, NA
)), row.names = c(NA, -19L), class = c("tbl_df", "tbl", "data.frame"
))


baseline_tbl %>% 
    kbl(align = "llcccccc", escape=TRUE, col.names = c("Variable", "Level", "N", "%", "N", "%", "N", "%")) %>%
    kable_paper(c("striped"),full_width = FALSE) %>%
    column_spec(1, bold = TRUE, background = "#91D1C233",extra_css = "border-bottom: 1px solid;") %>%
    column_spec(2,
                width = "10em", border_right = TRUE) %>% 
    column_spec(c(3,5,7), width = "3em") %>% 
    column_spec(c(4,6,8), italic = TRUE,border_right = TRUE,width = "5em") %>% 
    row_spec(0, bold = TRUE, font_size = 18, extra_css = "border-bottom: 1px solid;") %>% 
    row_spec(1, underline = TRUE, extra_css = "border-bottom: 1px solid;") %>% 
    row_spec(c(5,12,15,19), extra_css = "border-bottom: 1px solid;") %>% 
    collapse_rows(columns = 1, valign = "top") %>% 
    add_header_above(c(" " = 2, "NSCLC" = 2, "Advanced NSCLC" = 2, "Early NSCLC" = 2),
                     bold = TRUE,extra_css ="border-bottom: 1px solid;") %>% 
    kable_classic_2()

		NSCLC		Advanced NSCLC		Early NSCLC
Variable	Level	N	%	N	%	N	%
Total		1320	NA	1026.0	NA	656.00	NA
Age at initial diagnosis	<65	695	0.5265152	575.0	0.5604288	319.00	0.4862805
Age at initial diagnosis	>=65	620	0.4696970	446.0	0.4346979	334.00	0.5091463
Age at initial diagnosis	Missing	5	0.0037879	5.0	0.0048733	3.00	0.0045732
Age at initial diagnosis	Median	64	NA	63.0	NA	65.00	NA
Year of inital diagnosis	Prior 2015	305	0.2310606	275.0	0.2680312	164.00	0.2500000
Year of inital diagnosis	2015	320	0.2424242	260.0	0.2534113	151.00	0.2301829
Year of inital diagnosis	2016	403	0.3053030	295.0	0.2875244	199.00	0.3033537
Year of inital diagnosis	2017	253	0.1916667	170.0	0.1656920	121.00	0.1844512
Year of inital diagnosis	2018	24	0.0181818	16.0	0.0155945	10.00	0.0152439
Year of inital diagnosis	2019	0	0.0000000	0.0	0.0000000	0.00	0.0000000
Year of inital diagnosis	2020	0	0.0000000	0.0	0.0000000	0.00	0.0000000
Is advanced NSCLC	Yes	1026	0.7772727	1026.0	1.0000000	364.00	0.5548780
Is advanced NSCLC	No	294	0.2227273	0.0	0.0000000	292.00	0.4451220
Is advanced NSCLC	Missing	0	0.0000000	0.0	0.0000000	0.00	0.0000000
Age at advanced diagnosis	<65	553	0.4189394	553.0	0.5389864	178.00	0.2713415
Age at advanced diagnosis	>=65	467	0.3537879	467.0	0.4551657	183.00	0.2789634
Age at advanced diagnosis	Missing	300	0.2272727	6.0	0.0058480	295.00	0.4496951
Age at advanced diagnosis	Median	64	NA	63.7	NA	65.21	NA

Making a Fern

We use four affine transformations:

$y = A_1 x + b_1$
$y = A_2 x + b_2$
$y = A_3 x + b_2$
$y = A_4 x + b_4$

where

\[A_1 = \left [\begin{array}{cc} 0.86&0.03 \\ -0.03&0.86 \\ \end{array}\right ], ~~~ b_1 = \left[\begin{array}{cc} 0 \\ 1.5 \\ \end{array}\right]\]

\[A_2 = \left[\begin{array}{cc} 0.2&-0.25 \\ 0.21&0.23 \\ \end{array}\right], ~~~ b_2 = \left[\begin{array}{cc} 0 \\ 1.5 \\ \end{array}\right]\]

\[A_3 = \left[\begin{array}{cc} -0.15&0.27 \\ 0.25&0.26 \\ \end{array}\right], ~~~ b_3 = \left[\begin{array}{cc} 0 \\ 0.45 \\ \end{array}\right]\]

\[A_4 = \left[\begin{array}{cc} 0&0 \\ 0&0.17 \\ \end{array}\right], ~~~ b_4 = \left[\begin{array}{cc} 0 \\ 0 \\ \end{array}\right]\]

To make a fern, start by setting $x = (0, 0)$ (the origin). Choose one of the four affine transformations at random with probabilities 0.83, 0.08, 0.08, and 0.01. Apply the chosen transformation to $x$ and denote the result still as $x$. Plot $x$. The process is repeated until a certain number of points are plotted. Now, you see a fern.

# Make a matrix that contains all elements from the 4 matrices and the 4 probabilities
A = matrix(0, 4, 7)
A[1, ] = c(0.86, 0.03, -0.03, 0.86, 0, 1.5, 0.83)
A[2, ] = c(0.2, -0.25,0.21, 0.23, 0, 1.5, 0.08)
A[3, ] = c(-0.15, 0.27, 0.25, 0.26, 0, 0.45, 0.08)
A[4, ] = c(0, 0, 0, 0.17, 0, 0, 0.01)
  
# Start at the origin
x = c(0, 0)

# Starts a graphics device driver for the X Window System (version 11)
#X11()

# Make a plot template without axes and box
plot(0, 0, 
     type = "n", 
     xaxt = "n", 
     yaxt = "n", 
     bty = "n", 
     xlab = "", 
     ylab = "", 
     main = NULL, 
     xlim = c(-5, 10), 
     ylim = c(-2, 10))

# Add points
for (i in 1:2000){
  a = A[sample(1:4, 1, prob = A[, 7]), 1:6]
  x = matrix(a[1:4], 2, 2, byrow = TRUE)%*%x + a[5:6]

  points(x[1], x[2], col = "green")
}

Making a Word Cloud

A word cloud is a text mining method to find the most frequently used words in a text.

# You need to install the following packages first
# "tm", "SnowballC", "wordcloud", "RColorBrewer", "RCurl", and "XML"

A function from http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need provides a way of creating word clouds. Here is how the function is defined.

#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - http://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package, 
  # or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped
# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud <- function(x, type=c("text", "url", "file"), 
                          lang="english", excludeWords=NULL, 
                          textStemming=FALSE,  colorPalette="Dark2",
                          min.freq=3, max.words=200)
{ 
  library("tm")
  library("SnowballC")
  library("wordcloud")
  library("RColorBrewer") 
  
  if(type[1]=="file") text <- readLines(x)
  else if(type[1]=="url") text <- html_to_text(x)
  else if(type[1]=="text") text <- x
  
  # Load the text as a corpus
  docs <- Corpus(VectorSource(text))
  # Convert the text to lower case
  docs <- tm_map(docs, content_transformer(tolower))
  # Remove numbers
  docs <- tm_map(docs, removeNumbers)
  # Remove stopwords for the language 
  docs <- tm_map(docs, removeWords, stopwords(lang))
  # Remove punctuations
  docs <- tm_map(docs, removePunctuation)
  # Eliminate extra white spaces
  docs <- tm_map(docs, stripWhitespace)
  # Remove your own stopwords
  if(!is.null(excludeWords)) 
    docs <- tm_map(docs, removeWords, excludeWords) 
  # Text stemming
  if(textStemming) docs <- tm_map(docs, stemDocument)
  # Create term-document matrix
  tdm <- TermDocumentMatrix(docs)
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(word = names(v),freq=v)
  # check the color palette name 
  if(!colorPalette %in% rownames(brewer.pal.info)) colors = colorPalette
  else colors = brewer.pal(8, colorPalette) 
  # Plot the word cloud
  set.seed(1234)
  wordcloud(d$word,d$freq, min.freq=min.freq, max.words=max.words,
            random.order=FALSE, rot.per=0.35, 
            use.r.layout=FALSE, colors=colors)
  
  invisible(list(tdm=tdm, freqTable = d))
}
#++++++++++++++++++++++
# Helper function
#++++++++++++++++++++++
# Download and parse webpage
html_to_text<-function(url){
  library(RCurl)
  library(XML)
  # download html
  html.doc <- getURL(url)  
  #convert to plain text
  doc = htmlParse(html.doc, asText=TRUE)
 # "//text()" returns all text outside of HTML tags.
 # We also don’t want text such as style and script codes
  text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
  # Format text vector into one character string
  return(paste(text, collapse = " "))
}

Let’s use the function to process the text from http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need

filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
res<-rquery.wordcloud(filePath, type ="file", lang = "english")

## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents

## Warning in tm_map.SimpleCorpus(docs, removeNumbers): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords(lang)):
## transformation drops documents

## Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents

res

## $tdm
## <<TermDocumentMatrix (terms: 166, documents: 46)>>
## Non-/sparse entries: 278/7358
## Sparsity           : 96%
## Maximal term length: 13
## Weighting          : term frequency (tf)
## 
## $freqTable
##                        word freq
## will                   will   17
## freedom             freedom   13
## ring                   ring   12
## dream                 dream   11
## day                     day   11
## let                     let   11
## every                 every    9
## one                     one    8
## able                   able    8
## together           together    7
## nation               nation    4
## mountain           mountain    4
## shall                 shall    4
## faith                 faith    4
## free                   free    4
## today                 today    3
## men                     men    3
## state                 state    3
## children           children    3
## little               little    3
## black                 black    3
## white                 white    3
## made                   made    3
## god                     god    3
## new                     new    3
## sing                   sing    3
## land                   land    3
## last                   last    3
## even                   even    2
## live                   live    2
## meaning             meaning    2
## true                   true    2
## brotherhood     brotherhood    2
## former               former    2
## georgia             georgia    2
## sons                   sons    2
## heat                   heat    2
## mississippi     mississippi    2
## sweltering       sweltering    2
## alabama             alabama    2
## boys                   boys    2
## girls                 girls    2
## hands                 hands    2
## join                   join    2
## words                 words    2
## hill                   hill    2
## places               places    2
## hope                   hope    2
## stone                 stone    2
## thee                   thee    2
## mountainside   mountainside    2
## american           american    1
## deeply               deeply    1
## difficulties   difficulties    1
## face                   face    1
## rooted               rooted    1
## still                 still    1
## though               though    1
## tomorrow           tomorrow    1
## creed                 creed    1
## rise                   rise    1
## created             created    1
## equal                 equal    1
## hold                   hold    1
## selfevident     selfevident    1
## truths               truths    1
## hills                 hills    1
## owners               owners    1
## red                     red    1
## sit                     sit    1
## slave                 slave    1
## slaves               slaves    1
## table                 table    1
## injustice         injustice    1
## justice             justice    1
## oasis                 oasis    1
## oppression       oppression    1
## transformed     transformed    1
## character         character    1
## color                 color    1
## content             content    1
## four                   four    1
## judged               judged    1
## skin                   skin    1
## brothers           brothers    1
## dripping           dripping    1
## governor           governor    1
## interposition interposition    1
## lips                   lips    1
## nullification nullification    1
## racists             racists    1
## right                 right    1
## sisters             sisters    1
## vicious             vicious    1
## crooked             crooked    1
## exalted             exalted    1
## flesh                 flesh    1
## glory                 glory    1
## lord                   lord    1
## low                     low    1
## plain                 plain    1
## revealed           revealed    1
## rough                 rough    1
## see                     see    1
## straight           straight    1
## valley               valley    1
## back                   back    1
## south                 south    1
## beautiful         beautiful    1
## despair             despair    1
## discords           discords    1
## hew                     hew    1
## jail                   jail    1
## jangling           jangling    1
## knowing             knowing    1
## pray                   pray    1
## stand                 stand    1
## struggle           struggle    1
## symphony           symphony    1
## transform         transform    1
## work                   work    1
## country             country    1
## liberty             liberty    1
## sweet                 sweet    1
## tis                     tis    1
## died                   died    1
## fathers             fathers    1
## pilgrim             pilgrim    1
## pride                 pride    1
## america             america    1
## become               become    1
## great                 great    1
## must                   must    1
## hampshire         hampshire    1
## hilltops           hilltops    1
## prodigious       prodigious    1
## mighty               mighty    1
## mountains         mountains    1
## york                   york    1
## alleghenies     alleghenies    1
## heightening     heightening    1
## pennsylvania   pennsylvania    1
## colorado           colorado    1
## rockies             rockies    1
## snowcapped       snowcapped    1
## california       california    1
## curvaceous       curvaceous    1
## slopes               slopes    1
## lookout             lookout    1
## tennessee         tennessee    1
## molehill           molehill    1
## allow                 allow    1
## catholics         catholics    1
## city                   city    1
## gentiles           gentiles    1
## hamlet               hamlet    1
## happens             happens    1
## jews                   jews    1
## negro                 negro    1
## old                     old    1
## protestants     protestants    1
## speed                 speed    1
## spiritual         spiritual    1
## village             village    1
## almighty           almighty    1
## thank                 thank    1

A comparison between R and Python as a Data Science Programming Tool

You are strongly suggested to read this: https://www.quora.com/?pa_story=MTA5ODAwNzM5NTgzODE1NTE4MXwyMTExMDYyMzQ5ODA4MDh8MA**

An Introduction to R Programming

Shiju Zhang

7/5/2022