An integrated development environment (IDE) for R and Python, with a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging and workspace management (From https://rstudio.com/).
Watch the video: https://www.youtube.com/watch?v=PviVimazpz8
Register on https://rstudio.cloud/ to get an account.
Once R and RStudio have been installed, open RStudio. Type “version” in the console window (the lower-left window) to check the version of R..
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
x = 4
y = 7
z = 67
a = x*y + x/y - sqrt(x) - 2*abs(y) + exp(-0.3*x) +3*z
a
## [1] 213.8726
For Python fans, the following is simple python code. To run python code within RStudio, the package “reticulate” must be loaded in the “setup” code chunk (at the beginning of this document, below YAML).
fruits = ["apple", "banana", "cherry"]
for x in fruits:
print(x)
## apple
## banana
## cherry
Basic data structures in R include the vector, matrix, data frame, and list. Some of these structures require that all members be of the same data type (e.g. vectors, matrices) while others permit multiple data types (e.g. data frames, lists).
x = c(3, 6, 7, 10, 12, 18, 20) # This is a numeric vector
class(x)
## [1] "numeric"
y = c("Tom", "Jerry", "Donald", "Bob") # This is a character vector
class(y)
## [1] "character"
z1 = c(4, 6, 9)
z2 = c(a = 4, b = 6, c = 9) # A named vector, looking like a table
print(x)
## [1] 3 6 7 10 12 18 20
y
## [1] "Tom" "Jerry" "Donald" "Bob"
z1
## [1] 4 6 9
z2
## a b c
## 4 6 9
letters # a constant vector built in base R
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
state.name # another constant vector built in base R
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
M = matrix(data = 1:12, nrow = 3, byrow = TRUE)
M # Print the matrix.
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
dim(M) # Dimension of matrix M
## [1] 3 4
M[2,3] # the element in row 2 and column 3, which is a scalar
## [1] 7
M[2,] # the second row, which is a vector
## [1] 5 6 7 8
M[,3] # the third column, which is a vector
## [1] 3 7 11
M2 = matrix(0, nrow = 3, ncol = 4)
M2
## [,1] [,2] [,3] [,4]
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
class(M2)
## [1] "matrix" "array"
DF = data.frame(Name = c("John", "Alice", "Tod", "Megan", "Jessica"),
Major = c("Software Engineering", "Statistics", "Accounting", "Computer Science", "Statistics"),
`Score of Stat` = c(89, 92, 87, 90, 88), # When a column name has spaces, use quotation marks
'Score of Calc' = c(77, 87, 92, 93, 79)
)
DF # Print
## Name Major Score.of.Stat Score.of.Calc
## 1 John Software Engineering 89 77
## 2 Alice Statistics 92 87
## 3 Tod Accounting 87 92
## 4 Megan Computer Science 90 93
## 5 Jessica Statistics 88 79
# Subsetting the data
DF[2,3] # Extract the element on the second row and the third column and Print it.
## [1] 92
DF[2,] # Extract the second row
## Name Major Score.of.Stat Score.of.Calc
## 2 Alice Statistics 92 87
DF[, 3] # Extract the third column
## [1] 89 92 87 90 88
DF[, c(1,3)]
## Name Score.of.Stat
## 1 John 89
## 2 Alice 92
## 3 Tod 87
## 4 Megan 90
## 5 Jessica 88
# Check the dimensions
dim(DF)
## [1] 5 4
# Get the row names. By default, row names are 1, 2, 3, ...
rownames(DF)
## [1] "1" "2" "3" "4" "5"
# Get the column names in two ways
colnames(DF)
## [1] "Name" "Major" "Score.of.Stat" "Score.of.Calc"
names(DF)
## [1] "Name" "Major" "Score.of.Stat" "Score.of.Calc"
# Check the class of an object
class(DF)
## [1] "data.frame"
# Rename the second column
colnames(DF)[2] = "Specialty"
DF
## Name Specialty Score.of.Stat Score.of.Calc
## 1 John Software Engineering 89 77
## 2 Alice Statistics 92 87
## 3 Tod Accounting 87 92
## 4 Megan Computer Science 90 93
## 5 Jessica Statistics 88 79
R has some built in data frames that we can use. Here are examples:
# 1. The pressure data
pressure
## temperature pressure
## 1 0 0.0002
## 2 20 0.0012
## 3 40 0.0060
## 4 60 0.0300
## 5 80 0.0900
## 6 100 0.2700
## 7 120 0.7500
## 8 140 1.8500
## 9 160 4.2000
## 10 180 8.8000
## 11 200 17.3000
## 12 220 32.1000
## 13 240 57.0000
## 14 260 96.0000
## 15 280 157.0000
## 16 300 247.0000
## 17 320 376.0000
## 18 340 558.0000
## 19 360 806.0000
# 2. The growth data
PlantGrowth
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
## 7 5.17 ctrl
## 8 4.53 ctrl
## 9 5.33 ctrl
## 10 5.14 ctrl
## 11 4.81 trt1
## 12 4.17 trt1
## 13 4.41 trt1
## 14 3.59 trt1
## 15 5.87 trt1
## 16 3.83 trt1
## 17 6.03 trt1
## 18 4.89 trt1
## 19 4.32 trt1
## 20 4.69 trt1
## 21 6.31 trt2
## 22 5.12 trt2
## 23 5.54 trt2
## 24 5.50 trt2
## 25 5.37 trt2
## 26 5.29 trt2
## 27 4.92 trt2
## 28 6.15 trt2
## 29 5.80 trt2
## 30 5.26 trt2
# 3. The cars data
cars
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
## 7 10 18
## 8 10 26
## 9 10 34
## 10 11 17
## 11 11 28
## 12 12 14
## 13 12 20
## 14 12 24
## 15 12 28
## 16 13 26
## 17 13 34
## 18 13 34
## 19 13 46
## 20 14 26
## 21 14 36
## 22 14 60
## 23 14 80
## 24 15 20
## 25 15 26
## 26 15 54
## 27 16 32
## 28 16 40
## 29 17 32
## 30 17 40
## 31 17 50
## 32 18 42
## 33 18 56
## 34 18 76
## 35 18 84
## 36 19 36
## 37 19 46
## 38 19 68
## 39 20 32
## 40 20 48
## 41 20 52
## 42 20 56
## 43 20 64
## 44 22 66
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
# 4. Motor Trend Car Road Tests
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# 5. Fisher's iris data
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
# 6. Crime rates data
USArrests
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## Connecticut 3.3 110 77 11.1
## Delaware 5.9 238 72 15.8
## Florida 15.4 335 80 31.9
## Georgia 17.4 211 60 25.8
## Hawaii 5.3 46 83 20.2
## Idaho 2.6 120 54 14.2
## Illinois 10.4 249 83 24.0
## Indiana 7.2 113 65 21.0
## Iowa 2.2 56 57 11.3
## Kansas 6.0 115 66 18.0
## Kentucky 9.7 109 52 16.3
## Louisiana 15.4 249 66 22.2
## Maine 2.1 83 51 7.8
## Maryland 11.3 300 67 27.8
## Massachusetts 4.4 149 85 16.3
## Michigan 12.1 255 74 35.1
## Minnesota 2.7 72 66 14.9
## Mississippi 16.1 259 44 17.1
## Missouri 9.0 178 70 28.2
## Montana 6.0 109 53 16.4
## Nebraska 4.3 102 62 16.5
## Nevada 12.2 252 81 46.0
## New Hampshire 2.1 57 56 9.5
## New Jersey 7.4 159 89 18.8
## New Mexico 11.4 285 70 32.1
## New York 11.1 254 86 26.1
## North Carolina 13.0 337 45 16.1
## North Dakota 0.8 45 44 7.3
## Ohio 7.3 120 75 21.4
## Oklahoma 6.6 151 68 20.0
## Oregon 4.9 159 67 29.3
## Pennsylvania 6.3 106 72 14.9
## Rhode Island 3.4 174 87 8.3
## South Carolina 14.4 279 48 22.5
## South Dakota 3.8 86 45 12.8
## Tennessee 13.2 188 59 26.9
## Texas 12.7 201 80 25.5
## Utah 3.2 120 80 22.9
## Vermont 2.2 48 32 11.2
## Virginia 8.5 156 63 20.7
## Washington 4.0 145 73 26.2
## West Virginia 5.7 81 39 9.3
## Wisconsin 2.6 53 66 10.8
## Wyoming 6.8 161 60 15.6
library(DT)
datatable(iris)
Creating data frames with the data.frame() function is only useful for small data. If data are saved in a CSV (Comma-Separated Values) file on a local computer or on internet, we can read them via the read.table(), read.csv(), or other R functions. For help. type ?read.table in the console.
Reading data from a data file is demonstrated below.
# Read data from a your own computer
filePath = "/Users/home/Downloads/daily.csv"
myData = read.table(file = filePath, header = TRUE, sep = ",", skip = 0)
head(myData) # Display the first 6 rows by default
## date state positive negative pending hospitalizedCurrently
## 1 20200409 AK 235 6988 NA NA
## 2 20200409 AL 2769 18058 NA NA
## 3 20200409 AR 1119 13832 NA 73
## 4 20200409 AS 0 20 11 NA
## 5 20200409 AZ 3018 34160 NA NA
## 6 20200409 CA 18309 145191 14100 2825
## hospitalizedCumulative inIcuCurrently inIcuCumulative onVentilatorCurrently
## 1 27 NA NA NA
## 2 333 NA NA NA
## 3 130 NA 43 31
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA 1132 NA NA
## onVentilatorCumulative recovered hash
## 1 NA 49 b25c4a1e5bb964d05142022307f57e41e0872f26
## 2 NA NA 34ec8a688f0984d69a58ddef29d640ace6ee65bf
## 3 39 288 d21f4f58502c23fd66e3842b4b6eb7c9c18ccdf0
## 4 NA NA b74c56fd55694ebfcdaf5fe53c9b3fd3ffa11195
## 5 NA NA a706debeb0a3336ce380a04afeca43b1f77d44c9
## 6 NA NA 415e9a2ee1c4debcc75db294bc356b28a291a185
## dateChecked death hospitalized total totalTestResults posNeg fips
## 1 2020-04-09T20:00:00Z 7 27 7223 7223 7223 2
## 2 2020-04-09T20:00:00Z 74 333 20827 20827 20827 1
## 3 2020-04-09T20:00:00Z 21 130 14951 14951 14951 5
## 4 2020-04-09T20:00:00Z 0 NA 31 20 20 60
## 5 2020-04-09T20:00:00Z 89 NA 37178 37178 37178 4
## 6 2020-04-09T20:00:00Z 492 NA 177600 163500 163500 6
## deathIncrease hospitalizedIncrease negativeIncrease positiveIncrease
## 1 0 0 146 9
## 2 8 19 1305 400
## 3 3 0 302 119
## 4 0 0 0 0
## 5 9 0 2322 292
## 6 50 0 17884 1352
## totalTestResultsIncrease
## 1 155
## 2 1705
## 3 421
## 4 0
## 5 2614
## 6 19236
# The following is a dataset from http://data.un.org/
url = "http://data.un.org/_Docs/SYB/CSV/SYB62_246_201907_Population%20Growth,%20Fertility%20and%20Mortality%20Indicators.csv"
D = read.table(file = url, header = TRUE, sep = ",", skip = 1, fill = TRUE)
dim(D) # To display the numbers of rows and columns of the data
## [1] 4966 7
D$Footnotes[1] # Notes on the Year column
## [1] "Data refers to a 5-year period preceding the reference year."
D = D[-c(1:564), 2:5] # Remove the first 564 rows (non-countries) and keep the 2nd to 5th columns
head(D) # Display the first 6 rows by default
## X Year Series
## 565 Afghanistan 2005 Population annual rate of increase (percent)
## 566 Afghanistan 2005 Total fertility rate (children per women)
## 567 Afghanistan 2005 Infant mortality for both sexes (per 1,000 live births)
## 568 Afghanistan 2005 Maternal mortality ratio (deaths per 100,000 population)
## 569 Afghanistan 2005 Life expectancy at birth for both sexes (years)
## 570 Afghanistan 2005 Life expectancy at birth for males (years)
## Value
## 565 4.2140
## 566 7.1816
## 567 84.6470
## 568 820.9010
## 569 56.9960
## 570 55.7900
head(D, n = 10) # Display the first n (10) rows
## X Year Series
## 565 Afghanistan 2005 Population annual rate of increase (percent)
## 566 Afghanistan 2005 Total fertility rate (children per women)
## 567 Afghanistan 2005 Infant mortality for both sexes (per 1,000 live births)
## 568 Afghanistan 2005 Maternal mortality ratio (deaths per 100,000 population)
## 569 Afghanistan 2005 Life expectancy at birth for both sexes (years)
## 570 Afghanistan 2005 Life expectancy at birth for males (years)
## 571 Afghanistan 2005 Life expectancy at birth for females (years)
## 572 Afghanistan 2010 Population annual rate of increase (percent)
## 573 Afghanistan 2010 Total fertility rate (children per women)
## 574 Afghanistan 2010 Infant mortality for both sexes (per 1,000 live births)
## Value
## 565 4.2140
## 566 7.1816
## 567 84.6470
## 568 820.9010
## 569 56.9960
## 570 55.7900
## 571 58.2900
## 572 2.5790
## 573 6.4784
## 574 72.1930
tail(D) # Display the last 6 rows by default
## X Year Series
## 4961 Zimbabwe 2015 Total fertility rate (children per women)
## 4962 Zimbabwe 2015 Infant mortality for both sexes (per 1,000 live births)
## 4963 Zimbabwe 2015 Maternal mortality ratio (deaths per 100,000 population)
## 4964 Zimbabwe 2015 Life expectancy at birth for both sexes (years)
## 4965 Zimbabwe 2015 Life expectancy at birth for males (years)
## 4966 Zimbabwe 2015 Life expectancy at birth for females (years)
## Value
## 4961 4.0897
## 4962 51.2210
## 4963 443.3138
## 4964 56.7100
## 4965 54.8800
## 4966 58.2600
tail(D, n = 10) # Display the last n (10) rows by default
## X Year Series
## 4957 Zimbabwe 2010 Life expectancy at birth for both sexes (years)
## 4958 Zimbabwe 2010 Life expectancy at birth for males (years)
## 4959 Zimbabwe 2010 Life expectancy at birth for females (years)
## 4960 Zimbabwe 2015 Population annual rate of increase (percent)
## 4961 Zimbabwe 2015 Total fertility rate (children per women)
## 4962 Zimbabwe 2015 Infant mortality for both sexes (per 1,000 live births)
## 4963 Zimbabwe 2015 Maternal mortality ratio (deaths per 100,000 population)
## 4964 Zimbabwe 2015 Life expectancy at birth for both sexes (years)
## 4965 Zimbabwe 2015 Life expectancy at birth for males (years)
## 4966 Zimbabwe 2015 Life expectancy at birth for females (years)
## Value
## 4957 45.0170
## 4958 43.3400
## 4959 46.6800
## 4960 1.6860
## 4961 4.0897
## 4962 51.2210
## 4963 443.3138
## 4964 56.7100
## 4965 54.8800
## 4966 58.2600
names(D) # Display the column (variable) names.
## [1] "X" "Year" "Series" "Value"
names(D)[1] = "Country" # Change the column name from "X" to "Country"
# Can you answer this question?
# During the 5-year period ending at 2005, what was the population annual rate of increase (percent)?
# During the 5-year period ending at 2010, what was the life expectancy at birth for females (years)?
# Your answer to previous questions should equal 2.0100 + 2.2040 and 65.1230 + 13.417, respectively.
# If your data are in the xls or xlsx format, install and load the package "readxl" and
# then call read_excel() function to read the data.
# If you read an xls file, issue the code
# my_data <- read_excel("path to my_file.xls")
# If you read an xlsx file, issue the code
# my_data <- read_excel("path to my_file.xlsx")
L = list(Name = "Tom", Age = 19, Sex = "Male", Major = "Computer Science")
L
## $Name
## [1] "Tom"
##
## $Age
## [1] 19
##
## $Sex
## [1] "Male"
##
## $Major
## [1] "Computer Science"
L[[3]] ## The third object (z) in the list L
## [1] "Male"
L$Age # Extract the Age element in the list L.
## [1] 19
length(L) # How many elements are in L
## [1] 4
class(L)
## [1] "list"
x=5:9
x
## [1] 5 6 7 8 9
rm(x) # Remove the R object x
## print(x) would result in an error!
ls() ## A vector that holds all R objects in memory
## [1] "a" "D" "DF" "filePath" "L" "M"
## [7] "M2" "myData" "url" "y" "z" "z1"
## [13] "z2"
rm(list = ls()) ## Remove all
Reserved words in R programming are a set of words that have special meaning and cannot be used as an identifier (variable name, function name etc.).
Here is a list of reserved words in the R’s parser.
if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA
This list can be viewed by typing help(reserved) or ?reserved at the R command prompt.
In any programming language, logical expressions, if-else conditions, and for- or while- Loops are fundamental.
x = 2
X = 3
y = 4
a = (x == X) # Check if x equals X
a # the value of a is FALSE
## [1] FALSE
b = (x > y)
b
## [1] FALSE
m = (x != X) # Check if x and X are NOT the same
m # The value of m is TRUE
## [1] TRUE
A simple if-else structure is for the situation that there are only two cases to consider. It may look like:
x = 3
if (x < 0) {
print("Negative")
} else {
print("Non-negative")
} ## Note: the "else" branch should follow an "{" immediately!
## [1] "Non-negative"
A more complicated if-else structure is for the situation that there are more than two cases to consider. An example is:
x = 3
if (x < 0) {
print("Negative")
} else if (x<= 1){
print("Between 0 and 1")
} else if (x<=5){
print("greater than 1 and no more than 5")
} else {
print("Greater than 5")
}
## [1] "greater than 1 and no more than 5"
## Find the sum of a given sequence
x = 1:100
s = 0 ## Initialize s to 0
for (a in x) {
s = s + a ## Update s as x is retrieved
}
print(s)
## [1] 5050
# Print the first a few terms of a Fibbonacci Sequence
# A Fibbonacci sequence is the sequence: 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
# The next term equals the sum of the previous two terms
a = 1 # the first term of the Fibbonacci sequence
b = 1 # the second term
for (i in 1:10){
cat(i, ": ", a, "\n") # Outputs the objects, concatenating the representations.
temp <- a # Save "a" temporarily
a <- b
b <- temp + b # Update "b"
}
## 1 : 1
## 2 : 1
## 3 : 2
## 4 : 3
## 5 : 5
## 6 : 8
## 7 : 13
## 8 : 21
## 9 : 34
## 10 : 55
i <- 10
while (i > 5) {
print(i)
i = i-1 # This is not an equation. It means to update "i"
}
## [1] 10
## [1] 9
## [1] 8
## [1] 7
## [1] 6
We demonstrate the use of the following built-in functions in R: summary(), table(), names(), sample(), apply(), sapply(), toupper(), substr(), substring(), and aggregate().
x = c(6, 5, 2, 1, 6, 1, 4, 3, 2, 1, 3, 6, 2, 6, 4, 1, 5, 1, 4, 4, 3, 6, 1, 4, 6, 5, 6, 5, 5, 5)
summary(x) # Summarize the data to give a 5-number summary plus the eman
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 4.000 3.767 5.000 6.000
T = table(x) # Tablulate the values in x. The result is a one-way table with names.
T # Print the table
## x
## 1 2 3 4 5 6
## 6 3 3 5 6 7
names(T)
## [1] "1" "2" "3" "4" "5" "6"
# You can change the names of the table to what you want.
names(T)[2] = "Two"
T # Print the table
## 1 Two 3 4 5 6
## 6 3 3 5 6 7
names(T)
## [1] "1" "Two" "3" "4" "5" "6"
Names = c("Abel", "Bernoulli", "Catherine", "Dick", "Alice", "Issac") # A vector of individuals (can be viewed as a population)
sample(x = Names, size = 2, replace = FALSE) ## Randomly select two individuals without replacement from the population
## [1] "Catherine" "Abel"
# Create a data frame with 3 columns
M = data.frame(x = c(8, 6, 7, 5, 9, 8),
y = c(23, 43, 22, 17, 56, 48),
z = c(45, 78, 90, 34, 67, 23)
)
names(M)
## [1] "x" "y" "z"
apply(M, 2, mean) # Apply the function "mean" to each column of M. The result is a vector
## x y z
## 7.166667 34.833333 56.166667
apply(M, 2, summary) # Apply the function "summary" to each column of M. The result is a vector
## x y z
## Min. 5.000000 17.00000 23.00000
## 1st Qu. 6.250000 22.25000 36.75000
## Median 7.500000 33.00000 56.00000
## Mean 7.166667 34.83333 56.16667
## 3rd Qu. 8.000000 46.75000 75.25000
## Max. 9.000000 56.00000 90.00000
x=c(9, 16, 25, 100)
sapply(x, sqrt) # Apply the function "sqrt" to each element of vector x. The result is a vector
## [1] 3 4 5 10
s1 = "Happy new year!"
s2 = c("tom robinson", "Alice", "jessica")
toupper(s1) # convert s1 in uppercase
## [1] "HAPPY NEW YEAR!"
toupper(s2)
## [1] "TOM ROBINSON" "ALICE" "JESSICA"
tolower(s1)
## [1] "happy new year!"
tolower(s2)
## [1] "tom robinson" "alice" "jessica"
substr(s1, start = 3, stop = 8) # substring of s1 from third char to 8th char; must provide a stop.
## [1] "ppy ne"
substring(s2, first = 2, last = 6)
## [1] "om ro" "lice" "essic"
substring(s2, first = 2) # basically to the end
## [1] "om robinson" "lice" "essica"
# Get to know the data frame "iris"
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
# Find means of sub-groups: Group the plants in the iris data according to species.
# And within each group, find the mean of each numeric variable by applying the mean() function.
# One of the R functions for doing this is aggregate().
aggregate(. ~ Species, data = iris, FUN = mean)
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
# The function can be used to groups formed with 2 or more grouping variables
aggregate(mpg ~ cyl + am, data = mtcars, mean) # Find mean of mpg within each group determined by cyl & am
## cyl am mpg
## 1 4 0 22.90000
## 2 6 0 19.12500
## 3 8 0 15.05000
## 4 4 1 28.07500
## 5 6 1 20.56667
## 6 8 1 15.40000
There are many more built-in functions.
The paste() or paste0() function is useful when pasting texts and numeric values to send messages.
n = 30000000
paste("The number of COVID-19 cases is ", n, ".", ifelse(n>1000000, " This is huge!", ""),
sep = ""
)
## [1] "The number of COVID-19 cases is 3e+07. This is huge!"
paste0("The number of COVID-19 cases is ", n, ".", ifelse(n>1000000, " This is huge!", "")
)
## [1] "The number of COVID-19 cases is 3e+07. This is huge!"
paste0("The number of COVID-19 cases is ", sprintf("%i", n), ".", ifelse(n>1000000, " This is huge!", "")
)
## [1] "The number of COVID-19 cases is 30000000. This is huge!"
In statistics, variables can be classified as numerical (quantitative) or categorical (qualitative). A categorical variable can further by classified as ordinal or nominal. Examples of ordinal variables are grade (A, B, C, D, F) and income level (low, medium, high). Examples of nominal variables are color (red, blue, …) and political affiliation (Democratic, Republican, Independent).
In situations where we want to visualize data, we need to first convert a variable to a categorical one with levels and we need to use the R function “factor()”. Here is an example.
x = c("Democratic", "Independent", "Independent", "Republican", "Democratic", "Republican", "Republican", "Independent", "Democratic", "Democratic", "Democratic", "Republican")
T1 = table(x)
T1 # The order of the names of the table is the alphabetical order of the values in x
## x
## Democratic Independent Republican
## 5 3 4
barplot(T1)
F = factor(x, levels = c("Democratic", "Republican", "Independent")) # F is a factor
F # The levels themselves are not ordered.
## [1] Democratic Independent Independent Republican Democratic Republican
## [7] Republican Independent Democratic Democratic Democratic Republican
## Levels: Democratic Republican Independent
T2 = table(F)
T2 # The order of the names of the table has been changed!! The order is the same as in the levels of F
## F
## Democratic Republican Independent
## 5 4 3
barplot(T2)
# The following makes a factor with ordered levels
F = factor(x, levels = c("Democratic", "Republican", "Independent"), ordered = TRUE)
F # The levels themselves are now ordered as specified.
## [1] Democratic Independent Independent Republican Democratic Republican
## [7] Republican Independent Democratic Democratic Democratic Republican
## Levels: Democratic < Republican < Independent
F[3]>F[5] # The comparison of internal values in F now makes sense
## [1] TRUE
as.numeric(x) # Not working
## Warning: NAs introduced by coercion
## [1] NA NA NA NA NA NA NA NA NA NA NA NA
as.numeric(F) # Factors can be converted to numeric
## [1] 1 3 3 2 1 2 2 3 1 1 1 2
When creating a graph (scatterplot or barplot) with a legend that is numerical but actually categorical, we need to convert the numerical variable to a factor, as shown later when we introduce the ggplot2 package.
y = c(4, 8, 6, 4, 9, 3, 14, 13, 12)
sum(y)
## [1] 73
mean(y)
## [1] 8.111111
median(y)
## [1] 8
sd(y)
## [1] 4.166667
summary(y) # 5-number summary plus the mean
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 8.000 8.111 12.000 14.000
c(summary(y), sd=sd(y)) # Add standard deviation to the standard summary of data
## Min. 1st Qu. Median Mean 3rd Qu. Max. sd
## 3.000000 4.000000 8.000000 8.111111 12.000000 14.000000 4.166667
x = c(30, 45, 34, 27, 51, 18, 67, 56, 48)
cor(x, y) # Correlation between x and y variables
## [1] 0.9540732
# Handling missing data
z = c(12, 23, 41, 34, 10, NA, 18, 22, NA, 52)
mean(z)
## [1] NA
mean(z, na.rm = TRUE) # Remove missing values first before taking the mean.
## [1] 26.5
sd(z)
## [1] NA
sd(z, na.rm = TRUE) # Remove missing values first before taking the standard deviation.
## [1] 14.65801
When graphing data, we need to know 3 regions: the plot region, the figure region, and the device region. Here are pages showing the 3 regions: https://img-blog.csdn.net/20130709222854796 and https://r-charts.com/base-r/margins/
To reset the plot margin, use the “mar” parameter“. To reset the figure margin, use the”oma" parameter. These are done as this:
par(mar = c(1,2,3,6), oma = c(5,2,2,1))
Type ?par on the console to see the default settings for “mar” and “oma”.
x = c(3, 6, 7, 10, 12, 18, 20)
y = c(5, 8, 11, 15, 20, 26, 31)
par(mar = c(1,4,3,6), oma = c(5,2,2,1))
plot(x, y) # A scatterplot of y vs x
box("figure", col = "blue") # Show the figure region only for demonstration purpose; "figure" is the same as "inner"
box("outer", col = "red") # Show the device region only for demonstration purpose
# Suppose you flipped a die of 6-faces 25 times and got the following points:
a = c(2, 3, 1, 6, 3, 1, 4, 3, 1, 3, 1, 2, 1, 4, 5, 2, 2, 5, 4, 3, 1, 5, 5, 3, 1)
# To plot the frequencies (how many times each number occurs),
# you must get the frequency table first. The R function table() does it.
T = table(a) # A frequency table: count the numbers of different values
# Now, create a bar plot based on the frequency table you got
barplot(T,
xlab="Number", # Name the x-axis
ylab="Frequency", # Name the y-axis
col = "red", # Color each bar red
main = "") # Empty title
# Add a title using the title() function and annotate it.
title("Do You Know What This Graph Says?",
col.main="blue", # Color the title
font.main = 2, # Change the font of the title. The default corresponds to value 1
cex.main = 2 # Double the size of the words in the title. Default is 1
)
# To use percentages for a bar plot, we must convert frequencies to relative frequencies.
# To do this, we divide frequencies by the total frequency, which can be obtained by the
# function length(). Multiplying by 100 gives the percentages.
barplot(T/length(a)*100, xlab="Number", ylab="Percent", col = "red")
# Can I change x-labels (not the same as title)? Yes.
barplot(T/length(a)*100,
xlab="Number",
ylab="Percent",
col = "red",
names = c("One", "Two", "Three", "Four", "Five", "Six")
)
# Display labels at an angle
barplot(T/length(a)*100,
xlab="Number",
ylab="Percent",
col = "red",
names = c("One", "Two", "Three", "Four", "Five", "Six"),
las = 2 # Vertical label
)
hist(a, main = "This is a Histogram", col = "blue")
# Make a probability histogram: the height of each bar equals that of the frequency histogram,
# divided by the total frequency, and then divided by the width of the bar.
# Doing so allows you to add a density curve to the histogram if needed
hist(a, probability = TRUE, xlab = "Age", main = "Histogram of Ages of Kids in a Big Family")
A function can be plotted using the curve() function in R. Sometimes, we may want to plot a few functions in the same graph. The following plots the square-root function and the exponential function in the same graph. Note that the option “add = TRUE” is used.
par(mar=c(3.1, 3.1, 2.1, 5.1), oma=c(4.1, 4.1, 3.1, 6.1), xpd=TRUE)
curve(sqrt, 0, 5, col = "red", lty = 1)
legend("topright", inset=c(-0.5,0), legend=c("sqrt","exp"), title="Function", lty = c(1,2), col = c("red", "blue"), text.col = c("red", "blue"), title.col = "black", bty = "n")
box("plot", col = "green")
box("figure", col = "blue") # Show the figure region only for demonstration purpose; "figure" is the same as "inner"
box("outer", col = "red") # Show the device region only for demonstration purpose
par(xpd=FALSE) # Without this, the following added curve will go out the plotting region
curve(exp, add = TRUE, col = "blue", lty = 2)
# 1. A mathematical function y = xe^x
f = function(x){ # The corresponding R function is named "f", which takes x as
# an input and outputs something. Something good to read:
# https://en.wikipedia.org/wiki/Lambert_W_function
y = x*exp(x)
return(y)
}
# Plot the curve of the f function
curve(f, from = -7, to = 1)
abline(h=0, col = "red") # Add a real x-axis
# Can you find where the function has its minimum? Add a vertical line there.
# 2. My summary function for an input that is a vector
Summary = function(x){
Mean=mean(x)
Sd = sd(x)
Median = median(x)
Summary=c(Mean = Mean, " Standard Deviation" = Sd, Median = Median)
return(Summary) # A function must return something
}
x=c(2, 4, 6, 1, 0, -3, 7)
Summary(x) # Apply the function just defined
## Mean Standard Deviation Median
## 2.428571 3.505098 2.000000
# 2. Write an R function that implement the mathematical function
# f(x) = 2x-1 for x < 1, x^2 for 1<=x<=5, and 5x for x >5.
f = function(x){
n = length(x)
y = numeric(n)
for (i in 1:n){
if (x[i] < 1){
y[i] = 2 * x[i] - 1
} else if (x[i] <= 5) {
y[i] = x[i]^2
} else {
y[i] = 5 * x[i]
}
}
return(y)
}
f(-3)
## [1] -7
f(4)
## [1] 16
f(6)
## [1] 30
curve(f, xlim = c(-2, 10), col = "red", main = "Plot of a Piece-wise function")
# 3. Write a function that print the first n terms of the sequence defined by
# f(n) = 2*f(n-1)+1, with f(1) = 10.
f = function(n){
a = numeric(n)
a[1] = 10
for (i in 2:n){
a[i] = 2*a[i-1] + 1
}
return(a)
}
f(10)
## [1] 10 21 43 87 175 351 703 1407 2815 5631
# Note: the above function does not apply to a vector
The third R function above can be written using the idea of recursion - self calling. It is a very powerful method and meany programming problems can be solved using this idea. Note that not every programming language allows the realization of this idea.
# Write a function that output the first n terms according to
# f(n) = 2*f(n-1)+1, with f(1) = 10.
f = function(n){
if(n == 1){
return(10)
} else {
return(2*f(n-1) + 1)
}
}
# Use the function
f(5)
## [1] 175
f(12)
## [1] 22527
The next example shows a function that finds the mean (i.e., average) of a set of values.
# Find the mean of a set of values
Mean = function(v){
n = length(v)
if (n==1){
return(v)
} else {
return((Mean(v[-n])*(n-1) + v[n])/n)
}
}
v = c(2,8,0,9,3)
Mean(v)
## [1] 4.4
mean(v)
## [1] 4.4
The next example shows a function that finds the greatest common divisor of two positive integers.
## Find Greatest Common Divisor of two numbers (say a & b) using the Euclidean algorithm.
## Set x = min(a,b) and y = max(a,b). Set x = min(x, y-x) and y = y-x.
## Repeat until x = y. The GCD(a,b) = x.
gcd=function(v=c(45,75)){
x=min(v)
y=max(v)
if (x==y) {
return(x)
} else{
x=min(x,y-x)
y = y-x
return(gcd(c(x, y)))
}
}
gcd()
## [1] 15
gcd(c(63,54))
## [1] 9
gcd(c(5723, 1261))
## [1] 97
gcd(c(59446125, 81124893))
## [1] 8973
Exercises:
Write a function that calculates \(\frac{1}{1}-\frac{1}{2}+\frac{1}{3}-\frac{1}{4}+\cdots+\frac{1}{n}\) for a given \(n\). For large \(n\), your answer should be close to \(ln(2)\).
Write a function that calculates \(\frac{1}{1}-\frac{1}{3}+\frac{1}{5}-\frac{1}{7}+\cdots+(-1)^n\frac{1}{2n+1}\) for a given \(n\). For large \(n\), your answer should be close to \(\frac{\pi}{4}\).
Write a function that calculates \(\frac{1}{1^2}+\frac{1}{2^2}+\frac{1}{3^2}+\frac{1}{4^2}+\cdots+\frac{1}{n^2}\) for a given \(n\). For large \(n\), your answer should be close to \(\frac{\pi^2}{6}\), which equals the value of the Riemann zeta function evaluated at 2. Interestingly, the probability of 2 randomly chosen positive integers being co-prime equals \(\frac{6}{\pi^2}\), around \(61\%\).
Write a function that calculates \(\frac{1}{1^2}-\frac{1}{2^2}+\frac{1}{3^2}-\frac{1}{4^2}+\cdots+(-1)^{n+1}\frac{1}{n^2}\) for a given \(n\). For large \(n\), your answer should be close to \(\frac{\pi^2}{12}\).
Write two R functions, one for f(x) = x^2 and one for g(x) = 2^x. Plot them for x between -2 and 5 in the same graph. At how many points do they intercept? Can you find all of those points?
The idea:
In mathematics, we can create a composite function such as
\[f(g(h(x)))\] Since each function is just an operator, we can instruct a machine to do this:
\[x \rightarrow h \rightarrow g \rightarrow f\] So, in R, if \(f\), \(g\), and \(h\) are functions, and \(x\) is an input, we can instruct the machine by “x%>%h%>%g%>%f.”
To use the pipe operator, we need to install the “dplyr” package or the more general “tidyverse” package.
x = c(2, 3, 7, 1, 9, 6, 5)
x %>% mean() # Normally you do: mean(x)
## [1] 4.714286
mtcars %>% plot() # Normally you do: plot(mtcars)
mtcars %>% apply(2, var) %>% sqrt() # Normally you do: sqrt(apply(mtcars, 2, var))
## mpg cyl disp hp drat wt
## 6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
## qsec vs am gear carb
## 1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
sqrt(apply(mtcars, 2, var))
## mpg cyl disp hp drat wt
## 6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
## qsec vs am gear carb
## 1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
# If you feel uncomfortable, don't use the pipe operator
mtcars %>% ggplot(mapping=aes(x = hp, y = mpg)) +
geom_point(color = "red")
Many times, we may run into errors when running code we write. Here are some common mistakes. Copy each line of code on console and run it.
cat + 1
dog + 1
Mean(c(23, 45, 32, 18))
How can you figure out why the errors occur?
The package ggplot2 is the major package for plotting data. You may watch a video https://www.youtube.com/watch?v=HeqHMM4ziXA before you read and practice the following.
library(ggplot2) # Load the package
qplot(mpg, wt, data = mtcars)
qplot(mpg, wt, data = mtcars, colour = cyl)
qplot(mpg, wt, data = mtcars, size = cyl)
qplot(mpg, wt, data = mtcars, facets = vs ~ am)
ggplot(mtcars, aes(x=mpg, y=wt)) +
geom_point() +
facet_grid(vs ~ am, labeller = label_both)
## The previous result is not ideal, since the facet labels don't make a good sense.
mtcars$vs=factor(mtcars$vs) # Make "vs" a factor
levels(mtcars$vs)=c("high", "low") # Set custom levels to "vs"
## Now, let's redo the plot
ggplot(mtcars, aes(x=mpg, y=wt)) +
geom_point() +
facet_grid(vs ~ am, labeller = label_both)
# Or using
qplot(mpg, wt, data = mtcars, facets = vs ~ am)
Another video to watch: https://www.youtube.com/watch?v=HPJn1CMvtmI. Now, practice the following:
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 low 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 low 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 high 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 low 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 low 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 low 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 high 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 high 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 low 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 high 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 high 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 high 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 low 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 high 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 low 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 high 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 low 1 4 2
ggplot(mtcars, mapping=aes(x = hp, y = mpg)) +
geom_point(color = "red")
## Want to do other plots?
ggplot(mtcars, aes(x = hp, y = mpg)) +
geom_line(color = "red")
ggplot(mtcars, aes(x = mpg)) +
geom_histogram(color = "blue", fill = "red", bins=6)
ggplot(mtcars, aes(x = mpg)) +
geom_density(color = "red")
ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
geom_histogram(bins = 6) # bad
ggplot(mtcars, aes(x = mpg, colour = factor(cyl))) +
geom_density() +
labs(colour = "Cylinder") # rename legend title
ggplot(mtcars, aes(x = mpg, color= as.factor(cyl))) +
geom_density() +
labs(title = "gggggg", x="Miles per Gallon", y= "Density", color="Cylinder") # rename legend title
ggplot(mtcars, aes(y = mpg)) +
geom_boxplot(fill = "red")
# More to learn about the theme()
ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(cyl))) +
geom_boxplot() +
labs(title = "Mpg vs Cyl", subtitle = "Hello World!", x="Cyl", caption = "Data courtesy: xyz") +
theme(plot.background = element_rect(fill = "gray"),
panel.background = element_rect(fill = "yellow"),
plot.margin = unit(c(t=1,r=2,b=3,l=4), "cm"),
axis.title.x = element_text(hjust = 0.5, size = 20, color = "pink", angle = 45, vjust = -1),
plot.title = element_text(hjust = 0.5, size = 28, color = "green"),
plot.subtitle = element_text(hjust = 0.5, size = 15, color = "red"),
plot.caption = element_text(hjust = 0, size = 10, color = "green"),
legend.position = "none",
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
myData = data.frame(
Student = 1:24,
Major = c("Accounting", "IS", "CS", "Statistics", "EE", "Math", "Accounting", "CS", "CS", "Statistics", "EE", "Math","Accounting", "IS", "IS", "Statistics", "Math", "Math", "Accounting", "IS", "Statistics", "Statistics", "EE", "CS"),
Sex = c("Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female", "Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male", "Female", "Male", "Male", "Male", "Male"),
Score = c(87, 92, 69, 90, 88, 79, 94, 86, 92, 84, 78, 67, 95, 91, 93, 83, 82, 94, 86, 74, 95, 55, 93, 85)
)
myData
## Student Major Sex Score
## 1 1 Accounting Male 87
## 2 2 IS Female 92
## 3 3 CS Male 69
## 4 4 Statistics Male 90
## 5 5 EE Female 88
## 6 6 Math Male 79
## 7 7 Accounting Female 94
## 8 8 CS Male 86
## 9 9 CS Male 92
## 10 10 Statistics Female 84
## 11 11 EE Female 78
## 12 12 Math Female 67
## 13 13 Accounting Male 95
## 14 14 IS Male 91
## 15 15 IS Female 93
## 16 16 Statistics Male 83
## 17 17 Math Female 82
## 18 18 Math Female 94
## 19 19 Accounting Male 86
## 20 20 IS Female 74
## 21 21 Statistics Male 95
## 22 22 Statistics Male 55
## 23 23 EE Male 93
## 24 24 CS Male 85
ggplot(myData, aes(x = Score, color= Sex)) +
geom_density()
# Bar plots with 3 possible positions
ggplot(myData, aes(x = Major, fill = Sex)) +
geom_bar(position = "stack") # Each bar consists of two pieces representing the counts of each sex
ggplot(myData, aes(x = Major, fill = Sex)) +
geom_bar(position = "dodge") # Side-by-side
ggplot(myData, aes(x = Major, fill = Sex)) +
geom_bar(position = "fill") # Each bar has the same height of 1 and consists of two sexes in proportion
# Name-value kind of data
popData = data.frame(City = c("St Cloud", "Sartell", "Waite Park", "St Joseph"),
Population = c(66169, 18428, 7718, 7147))
popData
## City Population
## 1 St Cloud 66169
## 2 Sartell 18428
## 3 Waite Park 7718
## 4 St Joseph 7147
ggplot(popData, aes(x = City, y = Population)) +
geom_col() + # for Name-value kind of data
theme(axis.text.x = element_text(angle = 45))
# The Titanic Data
D=as.data.frame(Titanic)
D1=D %>% group_by(Class)
D2=D1 %>% summarise(Freq=sum(Freq))
D2
## # A tibble: 4 x 2
## Class Freq
## <fct> <dbl>
## 1 1st 325
## 2 2nd 285
## 3 3rd 706
## 4 Crew 885
ggplot(D2, aes(x=Class, y = Freq)) +
geom_col() +
geom_text(aes(label=Freq), vjust=2.5, col ="red", size=13)
ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
geom_col(position = "stack") +
labs(fill = "Survival Status") # Change title of legend
ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
geom_col(position = "dodge")
ggplot(D, aes(x=Class, y = Freq, fill = Survived)) +
geom_col(position = "dodge")
# Which one to use: geom_bar() or geom_col?
# Both functions create bar graphs. In general, for individual data, such as myData above, use geom_bar();
# for grouped data, such as D above, use geom_col().
# The function geom_bar() can also be applied to grouped data, if you use it this way: geom_bar(stat="identity").
We covered the plotting of data. How can we plot curves? Here are examples:
# Define the R function for f(x) = x
f1 = function(x){
return(x)
}
# Define the R function for f(x) = x^2
f2 = function(x){
return(x^2)
}
# Define the R function for f(x) = x^3
f3 = function(x){
return(x^3)
}
# Plot all 3 curves in one graph
ggplot() +
xlim(-3, 3) +
geom_function(aes(colour = "Diagonal line"), fun = f1) + # Add the graph of function f1
geom_function(aes(colour = "Parabola"), fun = f2) + # Add the graph of function f2
geom_function(aes(colour = "Cubic"), fun = f3) + # Add the graph of function f3
labs(title = "Overlay Three Curves in One Graph", color = "Curves") # Set the title of the legend: labs = labels
The next example plot the standard normal density curve along with a few t-distribution curves with different degrees of freedom.
ggplot() + # Ready to plot
xlim(-5, 5) + # Set the x limits. In the following, fun = function
geom_function(aes(colour = "normal"), fun = dnorm) + # Add a standard normal density curve: d = density, norm = "Normal"
geom_function(aes(colour = "t, df = 1"), fun = dt, args = list(df = 1)) + # Add a t-distribution (with df = 1) density curve
geom_function(aes(colour = "t, df = 10"), fun = dt, args = list(df = 10)) +
geom_function(aes(colour = "t, df = 30"), fun = dt, args = list(df = 30)) +
geom_function(aes(colour = "t, df = 50"), fun = dt, args = list(df = 50)) +
theme(legend.title = element_blank())
## custom colors
library(ggplot2)
ggplot(mtcars, aes(x=mpg, y=disp, color = factor(cyl))) +
geom_line() +
scale_color_manual(values=c("blue", "green", "red")) # You can set your favorite colors
x= 1:10
y0 = x
y1 = x + 1
y2 = x + 2
y3 = x + 3
ggplot() + geom_line(aes(x=x, y=y0, color="0")) +
geom_line(aes(x=x, y=y1, color="1")) +
geom_line(aes(x=x, y=y2, color="2")) +
geom_line(aes(x=x, y=y3, color="3")) +
scale_color_manual(values =c("black", "red", "blue", "green"))
We demonstrate the use of the “BasketballAnalyzeR” package for creating nice plots.
D = USArrests[1:10,] # Only use the first 10 rows of the USArrests data from the base package
BasketballAnalyzeR::radialprofile(data=D, title = rownames(D))
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
arrange(mtcars, cyl) # By default, arrange the data in ascending order according to the values of "cyl"
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 low 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 low 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 low 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 low 0 3 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 low 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 high 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 low 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 low 1 4 2
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 low 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 low 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 high 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 high 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 high 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 high 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 high 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 high 0 3 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 high 1 5 8
arrange(mtcars, cyl, disp) # By default, arrange data in ascending order, by "cyl" and then by "disp"
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 low 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 low 1 5 2
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 low 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 low 0 3 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 high 1 5 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 low 1 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 low 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 low 0 4 2
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 low 0 3 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 low 0 3 1
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 high 0 3 3
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 high 1 5 8
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 high 0 3 2
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 high 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 high 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 high 0 3 2
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 high 0 3 4
arrange(mtcars, cyl, desc(disp)) # cyl in ascending order and disp in descending order
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 low 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 low 0 4 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 low 1 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 high 1 5 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 low 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 low 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 low 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 low 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 low 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 low 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 high 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 high 0 3 2
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 high 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 high 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 high 0 3 2
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 high 1 5 8
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 high 0 3 3
arrange(mtcars, cyl, -disp) # cyl in ascending order and disp in descending order; the same as above
## mpg cyl disp hp drat wt qsec vs am gear carb
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 low 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 low 0 4 2
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 low 1 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 high 1 5 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 low 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 low 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 low 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 low 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 low 0 3 1
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 low 0 3 1
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 high 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 high 0 3 2
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 high 0 3 2
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 high 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 high 0 3 2
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 high 1 5 8
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 high 0 3 3
D = mtcars # Make a copy
mutate(D,
fake = mpg + disp, # A meaningless column called "fake" added to the D data frame
mpg_cat = (mpg>20), # Add another column called "mpg_cat". The column indicates whether mpg > 20
log_disp = log(disp), # Add still another column called "log_disp": A log-transformation to disp
mpg = log(mpg) # Replace the mpg column by its log value
)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 3.044522 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## Mazda RX4 Wag 3.044522 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## Datsun 710 3.126761 4 108.0 93 3.85 2.320 18.61 low 1 4 1
## Hornet 4 Drive 3.063391 6 258.0 110 3.08 3.215 19.44 low 0 3 1
## Hornet Sportabout 2.928524 8 360.0 175 3.15 3.440 17.02 high 0 3 2
## Valiant 2.895912 6 225.0 105 2.76 3.460 20.22 low 0 3 1
## Duster 360 2.660260 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## Merc 240D 3.194583 4 146.7 62 3.69 3.190 20.00 low 0 4 2
## Merc 230 3.126761 4 140.8 95 3.92 3.150 22.90 low 0 4 2
## Merc 280 2.954910 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## Merc 280C 2.879198 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## Merc 450SE 2.797281 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## Merc 450SL 2.850707 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## Merc 450SLC 2.721295 8 275.8 180 3.07 3.780 18.00 high 0 3 3
## Cadillac Fleetwood 2.341806 8 472.0 205 2.93 5.250 17.98 high 0 3 4
## Lincoln Continental 2.341806 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## Chrysler Imperial 2.687847 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## Fiat 128 3.478158 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## Honda Civic 3.414443 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## Toyota Corolla 3.523415 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## Toyota Corona 3.068053 4 120.1 97 3.70 2.465 20.01 low 0 3 1
## Dodge Challenger 2.740840 8 318.0 150 2.76 3.520 16.87 high 0 3 2
## AMC Javelin 2.721295 8 304.0 150 3.15 3.435 17.30 high 0 3 2
## Camaro Z28 2.587764 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## Pontiac Firebird 2.954910 8 400.0 175 3.08 3.845 17.05 high 0 3 2
## Fiat X1-9 3.306887 4 79.0 66 4.08 1.935 18.90 low 1 4 1
## Porsche 914-2 3.258097 4 120.3 91 4.43 2.140 16.70 high 1 5 2
## Lotus Europa 3.414443 4 95.1 113 3.77 1.513 16.90 low 1 5 2
## Ford Pantera L 2.760010 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## Ferrari Dino 2.980619 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## Maserati Bora 2.708050 8 301.0 335 3.54 3.570 14.60 high 1 5 8
## Volvo 142E 3.063391 4 121.0 109 4.11 2.780 18.60 low 1 4 2
## fake mpg_cat log_disp
## Mazda RX4 181.0 TRUE 5.075174
## Mazda RX4 Wag 181.0 TRUE 5.075174
## Datsun 710 130.8 TRUE 4.682131
## Hornet 4 Drive 279.4 TRUE 5.552960
## Hornet Sportabout 378.7 FALSE 5.886104
## Valiant 243.1 FALSE 5.416100
## Duster 360 374.3 FALSE 5.886104
## Merc 240D 171.1 TRUE 4.988390
## Merc 230 163.6 TRUE 4.947340
## Merc 280 186.8 FALSE 5.121580
## Merc 280C 185.4 FALSE 5.121580
## Merc 450SE 292.2 FALSE 5.619676
## Merc 450SL 293.1 FALSE 5.619676
## Merc 450SLC 291.0 FALSE 5.619676
## Cadillac Fleetwood 482.4 FALSE 6.156979
## Lincoln Continental 470.4 FALSE 6.131226
## Chrysler Imperial 454.7 FALSE 6.086775
## Fiat 128 111.1 TRUE 4.365643
## Honda Civic 106.1 TRUE 4.326778
## Toyota Corolla 105.0 TRUE 4.264087
## Toyota Corona 141.6 TRUE 4.788325
## Dodge Challenger 333.5 FALSE 5.762051
## AMC Javelin 319.2 FALSE 5.717028
## Camaro Z28 363.3 FALSE 5.857933
## Pontiac Firebird 419.2 FALSE 5.991465
## Fiat X1-9 106.3 TRUE 4.369448
## Porsche 914-2 146.3 TRUE 4.789989
## Lotus Europa 125.5 TRUE 4.554929
## Ford Pantera L 366.8 FALSE 5.860786
## Ferrari Dino 164.7 FALSE 4.976734
## Maserati Bora 316.0 FALSE 5.707110
## Volvo 142E 142.4 TRUE 4.795791
by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl
## # A tibble: 32 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 high 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 high 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 low 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 low 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 high 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 low 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 high 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 low 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 low 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 low 0 4 4
## # … with 22 more rows
class(mtcars)
## [1] "data.frame"
class(by_cyl) # The structure of the new data frame is richer
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing how it's grouped):
by_cyl # Now, the mtcars data frame has been grouped and called by_cyl.
## # A tibble: 32 x 11
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 high 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 high 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 low 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 low 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 high 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 low 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 high 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 low 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 low 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 low 0 4 4
## # … with 22 more rows
by_cyl %>% summarise(
mpg_mean = mean(mpg), # Mean of disp for each cyl group
hp_mean = mean(hp) # Mean of hp for each cyl group
)
## # A tibble: 3 x 3
## cyl mpg_mean hp_mean
## <dbl> <dbl> <dbl>
## 1 4 26.7 82.6
## 2 6 19.7 122.
## 3 8 15.1 209.
by_cyl %>% summarise(freq = n()) # Frequency distribution of "cyl";
## # A tibble: 3 x 2
## cyl freq
## <dbl> <int>
## 1 4 11
## 2 6 7
## 3 8 14
# The resulting data frame has two columns: cyl and freq
This example will use the data frame “diamonds” from the ggplot2 package. The dataset contains the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
price: price in US dollars ($326–$18,823)
carat: weight of the diamond (0.2–5.01)
cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color: diamond colour, from D (best) to J (worst)
clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x: length in mm (0–10.74)
y: width in mm (0–58.9)
z: depth in mm (0–31.8)
depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table: width of top of diamond relative to widest point (43–95)
Reference: https://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2
head(diamonds, n = 10)
## # A tibble: 10 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
df <- diamonds %>%
group_by(cut, clarity) %>% ## Must do grouping first in order to do the following operations
summarise(count = n()) %>%
group_by(cut) %>% # If not specified, the following will still be done by default for each category of cut
mutate(cut.count = sum(count),
prop = count/sum(count)) %>%
ungroup() # Remove grouping after we have added the two new columns
## `summarise()` has grouped output by 'cut'. You can override using the `.groups` argument.
ggplot(df, aes(x = cut, y = prop, width = cut.count, fill = clarity)) +
geom_col(colour = "black") +
#geom_text(aes(label = scales::percent(prop)), position = position_stack(vjust = 0.5)) + # if labels are desired
facet_grid(.~cut, scales = "free_x", space = "free_x") +
scale_fill_brewer(palette = "RdYlGn") +
theme_void()
The function “gather” from tidyr package is fabulous for reshaping data from columns to rows (wide to long). It is an alternative to the function “melt” from package “reshape2”. You use gather() when you notice that you have columns that are not variables.
The function “spread” from “tidyr” spreads out a single column into multiple columns, so it’s good to use it to summarize standard data as a contingency table.
library(tidyr)
D=data.frame(Gender=c("Male", "Female"), R=c(20, 30), D=c(23, 32), I=c(34, 45))
D # Looks like a two-way table
## Gender R D I
## 1 Male 20 23 34
## 2 Female 30 32 45
# The following are two ways of converting data from wide format to long format using
# the gather() function from package tidyr.
# The following uses columns R, D and I to make two columns that represent Keys and Values.
# Other columns from the data will remain in the new data frame.
x1=D %>% gather(key = "Party", value = "Count", c(R,D,I))
# Alternatively: "-" means exclusion; that is, use all columns except Gender to form Keys and Values.
x2=D %>% gather(key = Party, value = Count, -Gender)
x1
## Gender Party Count
## 1 Male R 20
## 2 Female R 30
## 3 Male D 23
## 4 Female D 32
## 5 Male I 34
## 6 Female I 45
x2
## Gender Party Count
## 1 Male R 20
## 2 Female R 30
## 3 Male D 23
## 4 Female D 32
## 5 Male I 34
## 6 Female I 45
# The two data frames x1 and x2 are the same and are in the standard format (long format) that R can easily understand.
# The following is a reversal of the previous procedure.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
y=D %>% melt(id.vars=c("Gender"), variable.name = "Party", value.name = "Count") # the melt() function is from package reshape2
y # The same as x1 and x2; that is, we can melt a two-way table to produce a standard table.
## Gender Party Count
## 1 Male R 20
## 2 Female R 30
## 3 Male D 23
## 4 Female D 32
## 5 Male I 34
## 6 Female I 45
z=x1 %>% spread(key=Gender, value=Count) ## Gender on columns
z
## Party Female Male
## 1 D 32 23
## 2 I 45 34
## 3 R 30 20
w=x1 %>% spread(key=Party, value=Count) ## Party on columns
w
## Gender D I R
## 1 Female 32 45 30
## 2 Male 23 34 20
u = xtabs(Count~Gender+Party, y) # Another way
u
## Party
## Gender R D I
## Female 30 32 45
## Male 20 23 34
# To have a particular order you want for your two-way table,
# you need to make character variables factors. Here is how:
y$Party=factor(y$Party, levels = c("D", "I", "R")) # Set the levels of Party in the D-I-R order
y$Gender=factor(y$Gender, levels = c("Male", "Female")) # Set the levels of Gender in Male-Female order
xtabs(Count~Gender+Party, y)
## Party
## Gender D I R
## Male 23 34 20
## Female 32 45 30
# The following are two more examples of data in wide format or long format. You may wish to play with them.
head(who, n = 20) ## In wide format; you can type ?who on the console to study the details of the data
## # A tibble: 20 x 60
## country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Afghani… AF AFG 1980 NA NA NA NA
## 2 Afghani… AF AFG 1981 NA NA NA NA
## 3 Afghani… AF AFG 1982 NA NA NA NA
## 4 Afghani… AF AFG 1983 NA NA NA NA
## 5 Afghani… AF AFG 1984 NA NA NA NA
## 6 Afghani… AF AFG 1985 NA NA NA NA
## 7 Afghani… AF AFG 1986 NA NA NA NA
## 8 Afghani… AF AFG 1987 NA NA NA NA
## 9 Afghani… AF AFG 1988 NA NA NA NA
## 10 Afghani… AF AFG 1989 NA NA NA NA
## 11 Afghani… AF AFG 1990 NA NA NA NA
## 12 Afghani… AF AFG 1991 NA NA NA NA
## 13 Afghani… AF AFG 1992 NA NA NA NA
## 14 Afghani… AF AFG 1993 NA NA NA NA
## 15 Afghani… AF AFG 1994 NA NA NA NA
## 16 Afghani… AF AFG 1995 NA NA NA NA
## 17 Afghani… AF AFG 1996 NA NA NA NA
## 18 Afghani… AF AFG 1997 0 10 6 3
## 19 Afghani… AF AFG 1998 30 129 128 90
## 20 Afghani… AF AFG 1999 8 55 55 47
## # … with 52 more variables: new_sp_m4554 <int>, new_sp_m5564 <int>,
## # new_sp_m65 <int>, new_sp_f014 <int>, new_sp_f1524 <int>,
## # new_sp_f2534 <int>, new_sp_f3544 <int>, new_sp_f4554 <int>,
## # new_sp_f5564 <int>, new_sp_f65 <int>, new_sn_m014 <int>,
## # new_sn_m1524 <int>, new_sn_m2534 <int>, new_sn_m3544 <int>,
## # new_sn_m4554 <int>, new_sn_m5564 <int>, new_sn_m65 <int>,
## # new_sn_f014 <int>, new_sn_f1524 <int>, new_sn_f2534 <int>,
## # new_sn_f3544 <int>, new_sn_f4554 <int>, new_sn_f5564 <int>,
## # new_sn_f65 <int>, new_ep_m014 <int>, new_ep_m1524 <int>,
## # new_ep_m2534 <int>, new_ep_m3544 <int>, new_ep_m4554 <int>,
## # new_ep_m5564 <int>, new_ep_m65 <int>, new_ep_f014 <int>,
## # new_ep_f1524 <int>, new_ep_f2534 <int>, new_ep_f3544 <int>,
## # new_ep_f4554 <int>, new_ep_f5564 <int>, new_ep_f65 <int>,
## # newrel_m014 <int>, newrel_m1524 <int>, newrel_m2534 <int>,
## # newrel_m3544 <int>, newrel_m4554 <int>, newrel_m5564 <int>,
## # newrel_m65 <int>, newrel_f014 <int>, newrel_f1524 <int>,
## # newrel_f2534 <int>, newrel_f3544 <int>, newrel_f4554 <int>,
## # newrel_f5564 <int>, newrel_f65 <int>
# For simplicity, we only consider the first 10 columns
y = who[, 1:10] %>% melt(id.vars=c("country", "iso2", "iso3", "year"), variable.names=c(new_sp_m014, new_sp_m1524, new_sp_m2534, new_sp_m3544, new_sp_m4554, new_sp_m5564))
head(y, n = 20)
## country iso2 iso3 year variable value
## 1 Afghanistan AF AFG 1980 new_sp_m014 NA
## 2 Afghanistan AF AFG 1981 new_sp_m014 NA
## 3 Afghanistan AF AFG 1982 new_sp_m014 NA
## 4 Afghanistan AF AFG 1983 new_sp_m014 NA
## 5 Afghanistan AF AFG 1984 new_sp_m014 NA
## 6 Afghanistan AF AFG 1985 new_sp_m014 NA
## 7 Afghanistan AF AFG 1986 new_sp_m014 NA
## 8 Afghanistan AF AFG 1987 new_sp_m014 NA
## 9 Afghanistan AF AFG 1988 new_sp_m014 NA
## 10 Afghanistan AF AFG 1989 new_sp_m014 NA
## 11 Afghanistan AF AFG 1990 new_sp_m014 NA
## 12 Afghanistan AF AFG 1991 new_sp_m014 NA
## 13 Afghanistan AF AFG 1992 new_sp_m014 NA
## 14 Afghanistan AF AFG 1993 new_sp_m014 NA
## 15 Afghanistan AF AFG 1994 new_sp_m014 NA
## 16 Afghanistan AF AFG 1995 new_sp_m014 NA
## 17 Afghanistan AF AFG 1996 new_sp_m014 NA
## 18 Afghanistan AF AFG 1997 new_sp_m014 0
## 19 Afghanistan AF AFG 1998 new_sp_m014 30
## 20 Afghanistan AF AFG 1999 new_sp_m014 8
airquality ## In wide format; You convert it to long format;
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
## 9 8 19 20.1 61 5 9
## 10 NA 194 8.6 69 5 10
## 11 7 NA 6.9 74 5 11
## 12 16 256 9.7 69 5 12
## 13 11 290 9.2 66 5 13
## 14 14 274 10.9 68 5 14
## 15 18 65 13.2 58 5 15
## 16 14 334 11.5 64 5 16
## 17 34 307 12.0 66 5 17
## 18 6 78 18.4 57 5 18
## 19 30 322 11.5 68 5 19
## 20 11 44 9.7 62 5 20
## 21 1 8 9.7 59 5 21
## 22 11 320 16.6 73 5 22
## 23 4 25 9.7 61 5 23
## 24 32 92 12.0 61 5 24
## 25 NA 66 16.6 57 5 25
## 26 NA 266 14.9 58 5 26
## 27 NA NA 8.0 57 5 27
## 28 23 13 12.0 67 5 28
## 29 45 252 14.9 81 5 29
## 30 115 223 5.7 79 5 30
## 31 37 279 7.4 76 5 31
## 32 NA 286 8.6 78 6 1
## 33 NA 287 9.7 74 6 2
## 34 NA 242 16.1 67 6 3
## 35 NA 186 9.2 84 6 4
## 36 NA 220 8.6 85 6 5
## 37 NA 264 14.3 79 6 6
## 38 29 127 9.7 82 6 7
## 39 NA 273 6.9 87 6 8
## 40 71 291 13.8 90 6 9
## 41 39 323 11.5 87 6 10
## 42 NA 259 10.9 93 6 11
## 43 NA 250 9.2 92 6 12
## 44 23 148 8.0 82 6 13
## 45 NA 332 13.8 80 6 14
## 46 NA 322 11.5 79 6 15
## 47 21 191 14.9 77 6 16
## 48 37 284 20.7 72 6 17
## 49 20 37 9.2 65 6 18
## 50 12 120 11.5 73 6 19
## 51 13 137 10.3 76 6 20
## 52 NA 150 6.3 77 6 21
## 53 NA 59 1.7 76 6 22
## 54 NA 91 4.6 76 6 23
## 55 NA 250 6.3 76 6 24
## 56 NA 135 8.0 75 6 25
## 57 NA 127 8.0 78 6 26
## 58 NA 47 10.3 73 6 27
## 59 NA 98 11.5 80 6 28
## 60 NA 31 14.9 77 6 29
## 61 NA 138 8.0 83 6 30
## 62 135 269 4.1 84 7 1
## 63 49 248 9.2 85 7 2
## 64 32 236 9.2 81 7 3
## 65 NA 101 10.9 84 7 4
## 66 64 175 4.6 83 7 5
## 67 40 314 10.9 83 7 6
## 68 77 276 5.1 88 7 7
## 69 97 267 6.3 92 7 8
## 70 97 272 5.7 92 7 9
## 71 85 175 7.4 89 7 10
## 72 NA 139 8.6 82 7 11
## 73 10 264 14.3 73 7 12
## 74 27 175 14.9 81 7 13
## 75 NA 291 14.9 91 7 14
## 76 7 48 14.3 80 7 15
## 77 48 260 6.9 81 7 16
## 78 35 274 10.3 82 7 17
## 79 61 285 6.3 84 7 18
## 80 79 187 5.1 87 7 19
## 81 63 220 11.5 85 7 20
## 82 16 7 6.9 74 7 21
## 83 NA 258 9.7 81 7 22
## 84 NA 295 11.5 82 7 23
## 85 80 294 8.6 86 7 24
## 86 108 223 8.0 85 7 25
## 87 20 81 8.6 82 7 26
## 88 52 82 12.0 86 7 27
## 89 82 213 7.4 88 7 28
## 90 50 275 7.4 86 7 29
## 91 64 253 7.4 83 7 30
## 92 59 254 9.2 81 7 31
## 93 39 83 6.9 81 8 1
## 94 9 24 13.8 81 8 2
## 95 16 77 7.4 82 8 3
## 96 78 NA 6.9 86 8 4
## 97 35 NA 7.4 85 8 5
## 98 66 NA 4.6 87 8 6
## 99 122 255 4.0 89 8 7
## 100 89 229 10.3 90 8 8
## 101 110 207 8.0 90 8 9
## 102 NA 222 8.6 92 8 10
## 103 NA 137 11.5 86 8 11
## 104 44 192 11.5 86 8 12
## 105 28 273 11.5 82 8 13
## 106 65 157 9.7 80 8 14
## 107 NA 64 11.5 79 8 15
## 108 22 71 10.3 77 8 16
## 109 59 51 6.3 79 8 17
## 110 23 115 7.4 76 8 18
## 111 31 244 10.9 78 8 19
## 112 44 190 10.3 78 8 20
## 113 21 259 15.5 77 8 21
## 114 9 36 14.3 72 8 22
## 115 NA 255 12.6 75 8 23
## 116 45 212 9.7 79 8 24
## 117 168 238 3.4 81 8 25
## 118 73 215 8.0 86 8 26
## 119 NA 153 5.7 88 8 27
## 120 76 203 9.7 97 8 28
## 121 118 225 2.3 94 8 29
## 122 84 237 6.3 96 8 30
## 123 85 188 6.3 94 8 31
## 124 96 167 6.9 91 9 1
## 125 78 197 5.1 92 9 2
## 126 73 183 2.8 93 9 3
## 127 91 189 4.6 93 9 4
## 128 47 95 7.4 87 9 5
## 129 32 92 15.5 84 9 6
## 130 20 252 10.9 80 9 7
## 131 23 220 10.3 78 9 8
## 132 21 230 10.9 75 9 9
## 133 24 259 9.7 73 9 10
## 134 44 236 14.9 81 9 11
## 135 21 259 15.5 76 9 12
## 136 28 238 6.3 77 9 13
## 137 9 24 10.9 71 9 14
## 138 13 112 11.5 71 9 15
## 139 46 237 6.9 78 9 16
## 140 18 224 13.8 67 9 17
## 141 13 27 10.3 76 9 18
## 142 24 238 10.3 68 9 19
## 143 16 201 8.0 82 9 20
## 144 13 238 12.6 64 9 21
## 145 23 14 9.2 71 9 22
## 146 36 139 10.3 81 9 23
## 147 7 49 10.3 69 9 24
## 148 14 20 16.6 63 9 25
## 149 30 193 6.9 70 9 26
## 150 NA 145 13.2 77 9 27
## 151 14 191 14.3 75 9 28
## 152 18 131 8.0 76 9 29
## 153 20 223 11.5 68 9 30
## you can type ?airquality on the console to study the details of the data
y = airquality %>% melt(id.vars=c("Month", "Day"), variable.names=c(Ozone, Solar.R, Wind, Temp))
y
## Month Day variable value
## 1 5 1 Ozone 41.0
## 2 5 2 Ozone 36.0
## 3 5 3 Ozone 12.0
## 4 5 4 Ozone 18.0
## 5 5 5 Ozone NA
## 6 5 6 Ozone 28.0
## 7 5 7 Ozone 23.0
## 8 5 8 Ozone 19.0
## 9 5 9 Ozone 8.0
## 10 5 10 Ozone NA
## 11 5 11 Ozone 7.0
## 12 5 12 Ozone 16.0
## 13 5 13 Ozone 11.0
## 14 5 14 Ozone 14.0
## 15 5 15 Ozone 18.0
## 16 5 16 Ozone 14.0
## 17 5 17 Ozone 34.0
## 18 5 18 Ozone 6.0
## 19 5 19 Ozone 30.0
## 20 5 20 Ozone 11.0
## 21 5 21 Ozone 1.0
## 22 5 22 Ozone 11.0
## 23 5 23 Ozone 4.0
## 24 5 24 Ozone 32.0
## 25 5 25 Ozone NA
## 26 5 26 Ozone NA
## 27 5 27 Ozone NA
## 28 5 28 Ozone 23.0
## 29 5 29 Ozone 45.0
## 30 5 30 Ozone 115.0
## 31 5 31 Ozone 37.0
## 32 6 1 Ozone NA
## 33 6 2 Ozone NA
## 34 6 3 Ozone NA
## 35 6 4 Ozone NA
## 36 6 5 Ozone NA
## 37 6 6 Ozone NA
## 38 6 7 Ozone 29.0
## 39 6 8 Ozone NA
## 40 6 9 Ozone 71.0
## 41 6 10 Ozone 39.0
## 42 6 11 Ozone NA
## 43 6 12 Ozone NA
## 44 6 13 Ozone 23.0
## 45 6 14 Ozone NA
## 46 6 15 Ozone NA
## 47 6 16 Ozone 21.0
## 48 6 17 Ozone 37.0
## 49 6 18 Ozone 20.0
## 50 6 19 Ozone 12.0
## 51 6 20 Ozone 13.0
## 52 6 21 Ozone NA
## 53 6 22 Ozone NA
## 54 6 23 Ozone NA
## 55 6 24 Ozone NA
## 56 6 25 Ozone NA
## 57 6 26 Ozone NA
## 58 6 27 Ozone NA
## 59 6 28 Ozone NA
## 60 6 29 Ozone NA
## 61 6 30 Ozone NA
## 62 7 1 Ozone 135.0
## 63 7 2 Ozone 49.0
## 64 7 3 Ozone 32.0
## 65 7 4 Ozone NA
## 66 7 5 Ozone 64.0
## 67 7 6 Ozone 40.0
## 68 7 7 Ozone 77.0
## 69 7 8 Ozone 97.0
## 70 7 9 Ozone 97.0
## 71 7 10 Ozone 85.0
## 72 7 11 Ozone NA
## 73 7 12 Ozone 10.0
## 74 7 13 Ozone 27.0
## 75 7 14 Ozone NA
## 76 7 15 Ozone 7.0
## 77 7 16 Ozone 48.0
## 78 7 17 Ozone 35.0
## 79 7 18 Ozone 61.0
## 80 7 19 Ozone 79.0
## 81 7 20 Ozone 63.0
## 82 7 21 Ozone 16.0
## 83 7 22 Ozone NA
## 84 7 23 Ozone NA
## 85 7 24 Ozone 80.0
## 86 7 25 Ozone 108.0
## 87 7 26 Ozone 20.0
## 88 7 27 Ozone 52.0
## 89 7 28 Ozone 82.0
## 90 7 29 Ozone 50.0
## 91 7 30 Ozone 64.0
## 92 7 31 Ozone 59.0
## 93 8 1 Ozone 39.0
## 94 8 2 Ozone 9.0
## 95 8 3 Ozone 16.0
## 96 8 4 Ozone 78.0
## 97 8 5 Ozone 35.0
## 98 8 6 Ozone 66.0
## 99 8 7 Ozone 122.0
## 100 8 8 Ozone 89.0
## 101 8 9 Ozone 110.0
## 102 8 10 Ozone NA
## 103 8 11 Ozone NA
## 104 8 12 Ozone 44.0
## 105 8 13 Ozone 28.0
## 106 8 14 Ozone 65.0
## 107 8 15 Ozone NA
## 108 8 16 Ozone 22.0
## 109 8 17 Ozone 59.0
## 110 8 18 Ozone 23.0
## 111 8 19 Ozone 31.0
## 112 8 20 Ozone 44.0
## 113 8 21 Ozone 21.0
## 114 8 22 Ozone 9.0
## 115 8 23 Ozone NA
## 116 8 24 Ozone 45.0
## 117 8 25 Ozone 168.0
## 118 8 26 Ozone 73.0
## 119 8 27 Ozone NA
## 120 8 28 Ozone 76.0
## 121 8 29 Ozone 118.0
## 122 8 30 Ozone 84.0
## 123 8 31 Ozone 85.0
## 124 9 1 Ozone 96.0
## 125 9 2 Ozone 78.0
## 126 9 3 Ozone 73.0
## 127 9 4 Ozone 91.0
## 128 9 5 Ozone 47.0
## 129 9 6 Ozone 32.0
## 130 9 7 Ozone 20.0
## 131 9 8 Ozone 23.0
## 132 9 9 Ozone 21.0
## 133 9 10 Ozone 24.0
## 134 9 11 Ozone 44.0
## 135 9 12 Ozone 21.0
## 136 9 13 Ozone 28.0
## 137 9 14 Ozone 9.0
## 138 9 15 Ozone 13.0
## 139 9 16 Ozone 46.0
## 140 9 17 Ozone 18.0
## 141 9 18 Ozone 13.0
## 142 9 19 Ozone 24.0
## 143 9 20 Ozone 16.0
## 144 9 21 Ozone 13.0
## 145 9 22 Ozone 23.0
## 146 9 23 Ozone 36.0
## 147 9 24 Ozone 7.0
## 148 9 25 Ozone 14.0
## 149 9 26 Ozone 30.0
## 150 9 27 Ozone NA
## 151 9 28 Ozone 14.0
## 152 9 29 Ozone 18.0
## 153 9 30 Ozone 20.0
## 154 5 1 Solar.R 190.0
## 155 5 2 Solar.R 118.0
## 156 5 3 Solar.R 149.0
## 157 5 4 Solar.R 313.0
## 158 5 5 Solar.R NA
## 159 5 6 Solar.R NA
## 160 5 7 Solar.R 299.0
## 161 5 8 Solar.R 99.0
## 162 5 9 Solar.R 19.0
## 163 5 10 Solar.R 194.0
## 164 5 11 Solar.R NA
## 165 5 12 Solar.R 256.0
## 166 5 13 Solar.R 290.0
## 167 5 14 Solar.R 274.0
## 168 5 15 Solar.R 65.0
## 169 5 16 Solar.R 334.0
## 170 5 17 Solar.R 307.0
## 171 5 18 Solar.R 78.0
## 172 5 19 Solar.R 322.0
## 173 5 20 Solar.R 44.0
## 174 5 21 Solar.R 8.0
## 175 5 22 Solar.R 320.0
## 176 5 23 Solar.R 25.0
## 177 5 24 Solar.R 92.0
## 178 5 25 Solar.R 66.0
## 179 5 26 Solar.R 266.0
## 180 5 27 Solar.R NA
## 181 5 28 Solar.R 13.0
## 182 5 29 Solar.R 252.0
## 183 5 30 Solar.R 223.0
## 184 5 31 Solar.R 279.0
## 185 6 1 Solar.R 286.0
## 186 6 2 Solar.R 287.0
## 187 6 3 Solar.R 242.0
## 188 6 4 Solar.R 186.0
## 189 6 5 Solar.R 220.0
## 190 6 6 Solar.R 264.0
## 191 6 7 Solar.R 127.0
## 192 6 8 Solar.R 273.0
## 193 6 9 Solar.R 291.0
## 194 6 10 Solar.R 323.0
## 195 6 11 Solar.R 259.0
## 196 6 12 Solar.R 250.0
## 197 6 13 Solar.R 148.0
## 198 6 14 Solar.R 332.0
## 199 6 15 Solar.R 322.0
## 200 6 16 Solar.R 191.0
## 201 6 17 Solar.R 284.0
## 202 6 18 Solar.R 37.0
## 203 6 19 Solar.R 120.0
## 204 6 20 Solar.R 137.0
## 205 6 21 Solar.R 150.0
## 206 6 22 Solar.R 59.0
## 207 6 23 Solar.R 91.0
## 208 6 24 Solar.R 250.0
## 209 6 25 Solar.R 135.0
## 210 6 26 Solar.R 127.0
## 211 6 27 Solar.R 47.0
## 212 6 28 Solar.R 98.0
## 213 6 29 Solar.R 31.0
## 214 6 30 Solar.R 138.0
## 215 7 1 Solar.R 269.0
## 216 7 2 Solar.R 248.0
## 217 7 3 Solar.R 236.0
## 218 7 4 Solar.R 101.0
## 219 7 5 Solar.R 175.0
## 220 7 6 Solar.R 314.0
## 221 7 7 Solar.R 276.0
## 222 7 8 Solar.R 267.0
## 223 7 9 Solar.R 272.0
## 224 7 10 Solar.R 175.0
## 225 7 11 Solar.R 139.0
## 226 7 12 Solar.R 264.0
## 227 7 13 Solar.R 175.0
## 228 7 14 Solar.R 291.0
## 229 7 15 Solar.R 48.0
## 230 7 16 Solar.R 260.0
## 231 7 17 Solar.R 274.0
## 232 7 18 Solar.R 285.0
## 233 7 19 Solar.R 187.0
## 234 7 20 Solar.R 220.0
## 235 7 21 Solar.R 7.0
## 236 7 22 Solar.R 258.0
## 237 7 23 Solar.R 295.0
## 238 7 24 Solar.R 294.0
## 239 7 25 Solar.R 223.0
## 240 7 26 Solar.R 81.0
## 241 7 27 Solar.R 82.0
## 242 7 28 Solar.R 213.0
## 243 7 29 Solar.R 275.0
## 244 7 30 Solar.R 253.0
## 245 7 31 Solar.R 254.0
## 246 8 1 Solar.R 83.0
## 247 8 2 Solar.R 24.0
## 248 8 3 Solar.R 77.0
## 249 8 4 Solar.R NA
## 250 8 5 Solar.R NA
## 251 8 6 Solar.R NA
## 252 8 7 Solar.R 255.0
## 253 8 8 Solar.R 229.0
## 254 8 9 Solar.R 207.0
## 255 8 10 Solar.R 222.0
## 256 8 11 Solar.R 137.0
## 257 8 12 Solar.R 192.0
## 258 8 13 Solar.R 273.0
## 259 8 14 Solar.R 157.0
## 260 8 15 Solar.R 64.0
## 261 8 16 Solar.R 71.0
## 262 8 17 Solar.R 51.0
## 263 8 18 Solar.R 115.0
## 264 8 19 Solar.R 244.0
## 265 8 20 Solar.R 190.0
## 266 8 21 Solar.R 259.0
## 267 8 22 Solar.R 36.0
## 268 8 23 Solar.R 255.0
## 269 8 24 Solar.R 212.0
## 270 8 25 Solar.R 238.0
## 271 8 26 Solar.R 215.0
## 272 8 27 Solar.R 153.0
## 273 8 28 Solar.R 203.0
## 274 8 29 Solar.R 225.0
## 275 8 30 Solar.R 237.0
## 276 8 31 Solar.R 188.0
## 277 9 1 Solar.R 167.0
## 278 9 2 Solar.R 197.0
## 279 9 3 Solar.R 183.0
## 280 9 4 Solar.R 189.0
## 281 9 5 Solar.R 95.0
## 282 9 6 Solar.R 92.0
## 283 9 7 Solar.R 252.0
## 284 9 8 Solar.R 220.0
## 285 9 9 Solar.R 230.0
## 286 9 10 Solar.R 259.0
## 287 9 11 Solar.R 236.0
## 288 9 12 Solar.R 259.0
## 289 9 13 Solar.R 238.0
## 290 9 14 Solar.R 24.0
## 291 9 15 Solar.R 112.0
## 292 9 16 Solar.R 237.0
## 293 9 17 Solar.R 224.0
## 294 9 18 Solar.R 27.0
## 295 9 19 Solar.R 238.0
## 296 9 20 Solar.R 201.0
## 297 9 21 Solar.R 238.0
## 298 9 22 Solar.R 14.0
## 299 9 23 Solar.R 139.0
## 300 9 24 Solar.R 49.0
## 301 9 25 Solar.R 20.0
## 302 9 26 Solar.R 193.0
## 303 9 27 Solar.R 145.0
## 304 9 28 Solar.R 191.0
## 305 9 29 Solar.R 131.0
## 306 9 30 Solar.R 223.0
## 307 5 1 Wind 7.4
## 308 5 2 Wind 8.0
## 309 5 3 Wind 12.6
## 310 5 4 Wind 11.5
## 311 5 5 Wind 14.3
## 312 5 6 Wind 14.9
## 313 5 7 Wind 8.6
## 314 5 8 Wind 13.8
## 315 5 9 Wind 20.1
## 316 5 10 Wind 8.6
## 317 5 11 Wind 6.9
## 318 5 12 Wind 9.7
## 319 5 13 Wind 9.2
## 320 5 14 Wind 10.9
## 321 5 15 Wind 13.2
## 322 5 16 Wind 11.5
## 323 5 17 Wind 12.0
## 324 5 18 Wind 18.4
## 325 5 19 Wind 11.5
## 326 5 20 Wind 9.7
## 327 5 21 Wind 9.7
## 328 5 22 Wind 16.6
## 329 5 23 Wind 9.7
## 330 5 24 Wind 12.0
## 331 5 25 Wind 16.6
## 332 5 26 Wind 14.9
## 333 5 27 Wind 8.0
## 334 5 28 Wind 12.0
## 335 5 29 Wind 14.9
## 336 5 30 Wind 5.7
## 337 5 31 Wind 7.4
## 338 6 1 Wind 8.6
## 339 6 2 Wind 9.7
## 340 6 3 Wind 16.1
## 341 6 4 Wind 9.2
## 342 6 5 Wind 8.6
## 343 6 6 Wind 14.3
## 344 6 7 Wind 9.7
## 345 6 8 Wind 6.9
## 346 6 9 Wind 13.8
## 347 6 10 Wind 11.5
## 348 6 11 Wind 10.9
## 349 6 12 Wind 9.2
## 350 6 13 Wind 8.0
## 351 6 14 Wind 13.8
## 352 6 15 Wind 11.5
## 353 6 16 Wind 14.9
## 354 6 17 Wind 20.7
## 355 6 18 Wind 9.2
## 356 6 19 Wind 11.5
## 357 6 20 Wind 10.3
## 358 6 21 Wind 6.3
## 359 6 22 Wind 1.7
## 360 6 23 Wind 4.6
## 361 6 24 Wind 6.3
## 362 6 25 Wind 8.0
## 363 6 26 Wind 8.0
## 364 6 27 Wind 10.3
## 365 6 28 Wind 11.5
## 366 6 29 Wind 14.9
## 367 6 30 Wind 8.0
## 368 7 1 Wind 4.1
## 369 7 2 Wind 9.2
## 370 7 3 Wind 9.2
## 371 7 4 Wind 10.9
## 372 7 5 Wind 4.6
## 373 7 6 Wind 10.9
## 374 7 7 Wind 5.1
## 375 7 8 Wind 6.3
## 376 7 9 Wind 5.7
## 377 7 10 Wind 7.4
## 378 7 11 Wind 8.6
## 379 7 12 Wind 14.3
## 380 7 13 Wind 14.9
## 381 7 14 Wind 14.9
## 382 7 15 Wind 14.3
## 383 7 16 Wind 6.9
## 384 7 17 Wind 10.3
## 385 7 18 Wind 6.3
## 386 7 19 Wind 5.1
## 387 7 20 Wind 11.5
## 388 7 21 Wind 6.9
## 389 7 22 Wind 9.7
## 390 7 23 Wind 11.5
## 391 7 24 Wind 8.6
## 392 7 25 Wind 8.0
## 393 7 26 Wind 8.6
## 394 7 27 Wind 12.0
## 395 7 28 Wind 7.4
## 396 7 29 Wind 7.4
## 397 7 30 Wind 7.4
## 398 7 31 Wind 9.2
## 399 8 1 Wind 6.9
## 400 8 2 Wind 13.8
## 401 8 3 Wind 7.4
## 402 8 4 Wind 6.9
## 403 8 5 Wind 7.4
## 404 8 6 Wind 4.6
## 405 8 7 Wind 4.0
## 406 8 8 Wind 10.3
## 407 8 9 Wind 8.0
## 408 8 10 Wind 8.6
## 409 8 11 Wind 11.5
## 410 8 12 Wind 11.5
## 411 8 13 Wind 11.5
## 412 8 14 Wind 9.7
## 413 8 15 Wind 11.5
## 414 8 16 Wind 10.3
## 415 8 17 Wind 6.3
## 416 8 18 Wind 7.4
## 417 8 19 Wind 10.9
## 418 8 20 Wind 10.3
## 419 8 21 Wind 15.5
## 420 8 22 Wind 14.3
## 421 8 23 Wind 12.6
## 422 8 24 Wind 9.7
## 423 8 25 Wind 3.4
## 424 8 26 Wind 8.0
## 425 8 27 Wind 5.7
## 426 8 28 Wind 9.7
## 427 8 29 Wind 2.3
## 428 8 30 Wind 6.3
## 429 8 31 Wind 6.3
## 430 9 1 Wind 6.9
## 431 9 2 Wind 5.1
## 432 9 3 Wind 2.8
## 433 9 4 Wind 4.6
## 434 9 5 Wind 7.4
## 435 9 6 Wind 15.5
## 436 9 7 Wind 10.9
## 437 9 8 Wind 10.3
## 438 9 9 Wind 10.9
## 439 9 10 Wind 9.7
## 440 9 11 Wind 14.9
## 441 9 12 Wind 15.5
## 442 9 13 Wind 6.3
## 443 9 14 Wind 10.9
## 444 9 15 Wind 11.5
## 445 9 16 Wind 6.9
## 446 9 17 Wind 13.8
## 447 9 18 Wind 10.3
## 448 9 19 Wind 10.3
## 449 9 20 Wind 8.0
## 450 9 21 Wind 12.6
## 451 9 22 Wind 9.2
## 452 9 23 Wind 10.3
## 453 9 24 Wind 10.3
## 454 9 25 Wind 16.6
## 455 9 26 Wind 6.9
## 456 9 27 Wind 13.2
## 457 9 28 Wind 14.3
## 458 9 29 Wind 8.0
## 459 9 30 Wind 11.5
## 460 5 1 Temp 67.0
## 461 5 2 Temp 72.0
## 462 5 3 Temp 74.0
## 463 5 4 Temp 62.0
## 464 5 5 Temp 56.0
## 465 5 6 Temp 66.0
## 466 5 7 Temp 65.0
## 467 5 8 Temp 59.0
## 468 5 9 Temp 61.0
## 469 5 10 Temp 69.0
## 470 5 11 Temp 74.0
## 471 5 12 Temp 69.0
## 472 5 13 Temp 66.0
## 473 5 14 Temp 68.0
## 474 5 15 Temp 58.0
## 475 5 16 Temp 64.0
## 476 5 17 Temp 66.0
## 477 5 18 Temp 57.0
## 478 5 19 Temp 68.0
## 479 5 20 Temp 62.0
## 480 5 21 Temp 59.0
## 481 5 22 Temp 73.0
## 482 5 23 Temp 61.0
## 483 5 24 Temp 61.0
## 484 5 25 Temp 57.0
## 485 5 26 Temp 58.0
## 486 5 27 Temp 57.0
## 487 5 28 Temp 67.0
## 488 5 29 Temp 81.0
## 489 5 30 Temp 79.0
## 490 5 31 Temp 76.0
## 491 6 1 Temp 78.0
## 492 6 2 Temp 74.0
## 493 6 3 Temp 67.0
## 494 6 4 Temp 84.0
## 495 6 5 Temp 85.0
## 496 6 6 Temp 79.0
## 497 6 7 Temp 82.0
## 498 6 8 Temp 87.0
## 499 6 9 Temp 90.0
## 500 6 10 Temp 87.0
## 501 6 11 Temp 93.0
## 502 6 12 Temp 92.0
## 503 6 13 Temp 82.0
## 504 6 14 Temp 80.0
## 505 6 15 Temp 79.0
## 506 6 16 Temp 77.0
## 507 6 17 Temp 72.0
## 508 6 18 Temp 65.0
## 509 6 19 Temp 73.0
## 510 6 20 Temp 76.0
## 511 6 21 Temp 77.0
## 512 6 22 Temp 76.0
## 513 6 23 Temp 76.0
## 514 6 24 Temp 76.0
## 515 6 25 Temp 75.0
## 516 6 26 Temp 78.0
## 517 6 27 Temp 73.0
## 518 6 28 Temp 80.0
## 519 6 29 Temp 77.0
## 520 6 30 Temp 83.0
## 521 7 1 Temp 84.0
## 522 7 2 Temp 85.0
## 523 7 3 Temp 81.0
## 524 7 4 Temp 84.0
## 525 7 5 Temp 83.0
## 526 7 6 Temp 83.0
## 527 7 7 Temp 88.0
## 528 7 8 Temp 92.0
## 529 7 9 Temp 92.0
## 530 7 10 Temp 89.0
## 531 7 11 Temp 82.0
## 532 7 12 Temp 73.0
## 533 7 13 Temp 81.0
## 534 7 14 Temp 91.0
## 535 7 15 Temp 80.0
## 536 7 16 Temp 81.0
## 537 7 17 Temp 82.0
## 538 7 18 Temp 84.0
## 539 7 19 Temp 87.0
## 540 7 20 Temp 85.0
## 541 7 21 Temp 74.0
## 542 7 22 Temp 81.0
## 543 7 23 Temp 82.0
## 544 7 24 Temp 86.0
## 545 7 25 Temp 85.0
## 546 7 26 Temp 82.0
## 547 7 27 Temp 86.0
## 548 7 28 Temp 88.0
## 549 7 29 Temp 86.0
## 550 7 30 Temp 83.0
## 551 7 31 Temp 81.0
## 552 8 1 Temp 81.0
## 553 8 2 Temp 81.0
## 554 8 3 Temp 82.0
## 555 8 4 Temp 86.0
## 556 8 5 Temp 85.0
## 557 8 6 Temp 87.0
## 558 8 7 Temp 89.0
## 559 8 8 Temp 90.0
## 560 8 9 Temp 90.0
## 561 8 10 Temp 92.0
## 562 8 11 Temp 86.0
## 563 8 12 Temp 86.0
## 564 8 13 Temp 82.0
## 565 8 14 Temp 80.0
## 566 8 15 Temp 79.0
## 567 8 16 Temp 77.0
## 568 8 17 Temp 79.0
## 569 8 18 Temp 76.0
## 570 8 19 Temp 78.0
## 571 8 20 Temp 78.0
## 572 8 21 Temp 77.0
## 573 8 22 Temp 72.0
## 574 8 23 Temp 75.0
## 575 8 24 Temp 79.0
## 576 8 25 Temp 81.0
## 577 8 26 Temp 86.0
## 578 8 27 Temp 88.0
## 579 8 28 Temp 97.0
## 580 8 29 Temp 94.0
## 581 8 30 Temp 96.0
## 582 8 31 Temp 94.0
## 583 9 1 Temp 91.0
## 584 9 2 Temp 92.0
## 585 9 3 Temp 93.0
## 586 9 4 Temp 93.0
## 587 9 5 Temp 87.0
## 588 9 6 Temp 84.0
## 589 9 7 Temp 80.0
## 590 9 8 Temp 78.0
## 591 9 9 Temp 75.0
## 592 9 10 Temp 73.0
## 593 9 11 Temp 81.0
## 594 9 12 Temp 76.0
## 595 9 13 Temp 77.0
## 596 9 14 Temp 71.0
## 597 9 15 Temp 71.0
## 598 9 16 Temp 78.0
## 599 9 17 Temp 67.0
## 600 9 18 Temp 76.0
## 601 9 19 Temp 68.0
## 602 9 20 Temp 82.0
## 603 9 21 Temp 64.0
## 604 9 22 Temp 71.0
## 605 9 23 Temp 81.0
## 606 9 24 Temp 69.0
## 607 9 25 Temp 63.0
## 608 9 26 Temp 70.0
## 609 9 27 Temp 77.0
## 610 9 28 Temp 75.0
## 611 9 29 Temp 76.0
## 612 9 30 Temp 68.0
Reference: https://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/sql.html
SQL is a database query language - a language designed specifically for interacting with a database.
library(sqldf) # Load the package
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
#library(RSQLite)
sqldf('SELECT mpg FROM mtcars')
## mpg
## 1 21.0
## 2 21.0
## 3 22.8
## 4 21.4
## 5 18.7
## 6 18.1
## 7 14.3
## 8 24.4
## 9 22.8
## 10 19.2
## 11 17.8
## 12 16.4
## 13 17.3
## 14 15.2
## 15 10.4
## 16 10.4
## 17 14.7
## 18 32.4
## 19 30.4
## 20 33.9
## 21 21.5
## 22 15.5
## 23 15.2
## 24 13.3
## 25 19.2
## 26 27.3
## 27 26.0
## 28 30.4
## 29 15.8
## 30 19.7
## 31 15.0
## 32 21.4
sqldf('SELECT mpg, disp FROM mtcars')
## mpg disp
## 1 21.0 160.0
## 2 21.0 160.0
## 3 22.8 108.0
## 4 21.4 258.0
## 5 18.7 360.0
## 6 18.1 225.0
## 7 14.3 360.0
## 8 24.4 146.7
## 9 22.8 140.8
## 10 19.2 167.6
## 11 17.8 167.6
## 12 16.4 275.8
## 13 17.3 275.8
## 14 15.2 275.8
## 15 10.4 472.0
## 16 10.4 460.0
## 17 14.7 440.0
## 18 32.4 78.7
## 19 30.4 75.7
## 20 33.9 71.1
## 21 21.5 120.1
## 22 15.5 318.0
## 23 15.2 304.0
## 24 13.3 350.0
## 25 19.2 400.0
## 26 27.3 79.0
## 27 26.0 120.3
## 28 30.4 95.1
## 29 15.8 351.0
## 30 19.7 145.0
## 31 15.0 301.0
## 32 21.4 121.0
sqldf('SELECT mpg, disp FROM mtcars WHERE am = 1 ORDER BY mpg ASC')
## mpg disp
## 1 15.0 301.0
## 2 15.8 351.0
## 3 19.7 145.0
## 4 21.0 160.0
## 5 21.0 160.0
## 6 21.4 121.0
## 7 22.8 108.0
## 8 26.0 120.3
## 9 27.3 79.0
## 10 30.4 75.7
## 11 30.4 95.1
## 12 32.4 78.7
## 13 33.9 71.1
sqldf('SELECT * FROM mtcars WHERE (mpg > 20 AND disp < 95) OR carb > 1000')
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 32.4 4 78.7 66 4.08 2.200 19.47 low 1 4 1
## 2 30.4 4 75.7 52 4.93 1.615 18.52 low 1 4 2
## 3 33.9 4 71.1 65 4.22 1.835 19.90 low 1 4 1
## 4 27.3 4 79.0 66 4.08 1.935 18.90 low 1 4 1
sqldf('SELECT * FROM mtcars WHERE carb NOT IN (1,2)')
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 21.0 6 160.0 110 3.90 2.620 16.46 high 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 high 1 4 4
## 3 14.3 8 360.0 245 3.21 3.570 15.84 high 0 3 4
## 4 19.2 6 167.6 123 3.92 3.440 18.30 low 0 4 4
## 5 17.8 6 167.6 123 3.92 3.440 18.90 low 0 4 4
## 6 16.4 8 275.8 180 3.07 4.070 17.40 high 0 3 3
## 7 17.3 8 275.8 180 3.07 3.730 17.60 high 0 3 3
## 8 15.2 8 275.8 180 3.07 3.780 18.00 high 0 3 3
## 9 10.4 8 472.0 205 2.93 5.250 17.98 high 0 3 4
## 10 10.4 8 460.0 215 3.00 5.424 17.82 high 0 3 4
## 11 14.7 8 440.0 230 3.23 5.345 17.42 high 0 3 4
## 12 13.3 8 350.0 245 3.73 3.840 15.41 high 0 3 4
## 13 15.8 8 351.0 264 4.22 3.170 14.50 high 1 5 4
## 14 19.7 6 145.0 175 3.62 2.770 15.50 high 1 5 6
## 15 15.0 8 301.0 335 3.54 3.570 14.60 high 1 5 8
sqldf("SELECT am, AVG(mpg) AS Avgmpg FROM mtcars GROUP BY am")
## am Avgmpg
## 1 0 17.14737
## 2 1 24.39231
sqldf("SELECT am, COUNT() as count FROM mtcars GROUP BY am")
## am count
## 1 0 19
## 2 1 13
We introduce how to use SQLite databases in R using the RSQLite package. You will learn to perform the basic tasks such as sending queries to a SQLite database, parameterised queries, or creating tables using RSQLite. The power of the intersection between R and SQLite can be seen in parameterized queries, which could be used if you need to query a database in order to display information based on user input inside an R Shiny app.
For more details, refer to https://www.datacamp.com/community/tutorials/sqlite-in-r.
library(RSQLite)
# Create a connection to our new database, CarsDB.db
conn <- dbConnect(drv = RSQLite::SQLite(), # drv is a database driver
"CarsDB.db" # A path to a SQLite database
)
# Create a column that names each car in the "mtcars" data frame
D = mtcars
D$car_names <- rownames(D)
rownames(D) <- NULL
# Write the mtcars dataset into a table named mtcars_data
dbWriteTable(conn, "cars_data", D, overwrite=TRUE)
# List all the tables available in the database
dbListTables(conn)
## [1] "cars_data"
# Get the car names and horsepower of the cars with 8 cylinders
dbGetQuery(conn,"SELECT car_names, hp, cyl FROM cars_data
WHERE cyl = 8")
## car_names hp cyl
## 1 Hornet Sportabout 175 8
## 2 Duster 360 245 8
## 3 Merc 450SE 180 8
## 4 Merc 450SL 180 8
## 5 Merc 450SLC 180 8
## 6 Cadillac Fleetwood 205 8
## 7 Lincoln Continental 215 8
## 8 Chrysler Imperial 230 8
## 9 Dodge Challenger 150 8
## 10 AMC Javelin 150 8
## 11 Camaro Z28 245 8
## 12 Pontiac Firebird 175 8
## 13 Ford Pantera L 264 8
## 14 Maserati Bora 335 8
# Get the car names and horsepower starting with M that have 6 or 8 cylinders
dbGetQuery(conn,"SELECT car_names, hp, cyl FROM cars_data
WHERE car_names LIKE 'M%' AND cyl IN (6,8)")
## car_names hp cyl
## 1 Mazda RX4 110 6
## 2 Mazda RX4 Wag 110 6
## 3 Merc 280 123 6
## 4 Merc 280C 123 6
## 5 Merc 450SE 180 8
## 6 Merc 450SL 180 8
## 7 Merc 450SLC 180 8
## 8 Maserati Bora 335 8
# Get the average horsepower and mpg by number of cylinder groups
dbGetQuery(conn,"SELECT cyl, AVG(hp) AS 'average_hp', AVG(mpg) AS 'average_mpg' FROM cars_data
GROUP BY cyl
ORDER BY average_hp")
## cyl average_hp average_mpg
## 1 4 82.63636 26.66364
## 2 6 122.28571 19.74286
## 3 8 209.21429 15.10000
avg_HpCyl <- dbGetQuery(conn,"SELECT cyl, AVG(hp) AS 'average_hp'FROM cars_data
GROUP BY cyl
ORDER BY average_hp")
avg_HpCyl
## cyl average_hp
## 1 4 82.63636
## 2 6 122.28571
## 3 8 209.21429
class(avg_HpCyl)
## [1] "data.frame"
# Lets assume that there is some user input that asks us to look only into cars that have over 18 miles per gallon (mpg)
# and more than 6 cylinders
mpg <- 18
cyl <- 6
Result <- dbGetQuery(conn, 'SELECT car_names, mpg, cyl FROM cars_data WHERE mpg >= ? AND cyl >= ?', params = c(mpg,cyl))
Result
## car_names mpg cyl
## 1 Mazda RX4 21.0 6
## 2 Mazda RX4 Wag 21.0 6
## 3 Hornet 4 Drive 21.4 6
## 4 Hornet Sportabout 18.7 8
## 5 Valiant 18.1 6
## 6 Merc 280 19.2 6
## 7 Pontiac Firebird 19.2 8
## 8 Ferrari Dino 19.7 6
# Delete the column belonging to the Mazda RX4. You will see a 1 as the output.
dbExecute(conn, "DELETE FROM cars_data WHERE car_names = 'Mazda RX4'")
## [1] 1
# Insert the data for the Mazda RX4. This will also ouput a 1
dbExecute(conn, "INSERT INTO cars_data VALUES (21.0,6,160.0,110,3.90,2.620,16.46,0,1,4,4,'Mazda RX4')")
## [1] 1
When you scrape webpages to get data, you will need to manipulate the data, since the data may contain some symbols that are not directly suitable for analysis.
Here is ta RStudio cheatsheet for regular expressions: https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
library(stringr)
y = c("2,345", "34,156", "6,571,089", "89,071")
y1 = str_replace_all(string=y, pattern=",", replacement="") ## Replace all commas by ""; i.e., remove all commas
y2 = as.numeric(y1)
mean(y2)
## [1] 1674165
y %>% str_replace_all(pattern=",", replacement="") %>% as.numeric() %>% mean()
## [1] 1674165
z = c("Hello World!", "How are you?", "I'm sorry.")
str_split(z, " ") ## Split each string in the x vector by a single space
## [[1]]
## [1] "Hello" "World!"
##
## [[2]]
## [1] "How" "are" "you?"
##
## [[3]]
## [1] "I'm" "sorry."
strsplit(z, " ") ## Split each string in the x vector by a single space. Available from the base package
## [[1]]
## [1] "Hello" "World!"
##
## [[2]]
## [1] "How" "are" "you?"
##
## [[3]]
## [1] "I'm" "sorry."
x=c("$56,987.00", "f6 8h$7 h", "8j% #2,%f &* ^j-5$d")
x
## [1] "$56,987.00" "f6 8h$7 h"
## [3] "8j% #2,%f &* ^j-5$d"
str_replace_all(string=x, pattern="\\$|\\%", replacement="") ## Replace all "$" or "%" signs by "", i.e., remove all "$" or "%" signs
## [1] "56,987.00" "f6 8h7 h" "8j #2,f &* ^j-5d"
str_replace_all(string=x, pattern="\\D", replacement="") ## Replace all non-digits by "", i.e., remove all non-digits
## [1] "5698700" "687" "825"
str_replace_all(string=x, pattern="\\d", replacement="") ## Replace all digits by "", i.e., remove all digits
## [1] "$,." "f h$ h" "j% #,%f &* ^j-$d"
str_replace_all(string=x, pattern="\\W", replacement="") ## Replace all non-word characters by "", i.e., remove non-word characters
## [1] "5698700" "f68h7h" "8j2fj5d"
str_replace_all(string=x, pattern="\\s", replacement="") ## Replace all spaces by "", i.e., remove non-word characters
## [1] "$56,987.00" "f68h$7h" "8j%#2,%f&*^j-5$d"
## The following replaces all spaces, dollar signs or (|) commas by "", i.e., remove non-word characters
str_replace_all(string=x, pattern="\\s|\\$|,", replacement="")
## [1] "56987.00" "f68h7h" "8j%#2%f&*^j-5d"
## \w matches any “word” character, which includes alphabetic characters, marks and decimal numbers.
## The complement, \W, matches any non-word character.
str_extract_all("Don't eat that 67 m5d!", "\\w")[[1]]
## [1] "D" "o" "n" "t" "e" "a" "t" "t" "h" "a" "t" "6" "7" "m" "5" "d"
## [abc]: matches a, b, or c.
## [a-z]: matches every character between a and z (in Unicode code point order).
## [^abc]: matches anything except a, b, or c.
## [\$\-]: matches $ or -.
# Anchors:
## ^ matches the start of string.
## $ matches the end of the string.
## \d: matches any digit. The complement, \D, matches any character that is not a decimal digit.
## \s: matches any whitespace. This includes tabs, newlines. The complement, \S, matches any non-whitespace character.
## Remove a substring starting with a particular character
str1 <- c("2,099@St Cloud","459,606@DC@USA")
s1<-gsub("@.*|,", "", str1)
s1
## [1] "2099" "459606"
str2 <- c("2,099[3]","459,606")
s2<-gsub("\\[.*|,", "", str2)
s2
## [1] "2099" "459606"
# The "." symbol represents any symbol except the new line symbol "\n"
# + means matches one or more and * matches 0 or more
# In the following, the pattern starts with "[", is followed by
# one or more none new line symbols, and ends with "]".
gsub("\\[.+\\]", "", c("12,908", "450,896,054[12]"))
## [1] "12,908" "450,896,054"
#if you want to keep the @ character in str1:
s<-gsub("(@).*","\\1",str1)
s
## [1] "2,099@" "459,606@"
#If what you want to remove everything from the last @ in str1:
s<-gsub("(.*)@.*","\\1",str1)
s
## [1] "2,099" "459,606@DC"
## Date Formats ##
# Y=year in 4 digits
# M=no such thing
# D=no such thing
# y=tricky
# m=month in two digits
# d=day in a month
x <- c("01-01-1960", "01-02-1960", "1-3-1960", "1-4-1960", "01-05-1960", "01-06-1960", "1-7-1960", "1-8-1960")
z <- as.Date(x = x, format = "%m-%d-%Y") # m=month in number
z
## [1] "1960-01-01" "1960-01-02" "1960-01-03" "1960-01-04" "1960-01-05"
## [6] "1960-01-06" "1960-01-07" "1960-01-08"
stockPrice = c(56, 64, 67, 71, 84, 91, 95, 97)
plot(stockPrice ~ z) # Does not show x-values as dates
plot(stockPrice ~ z, xaxt = "n", xlab = "") # Setting xaxt = "n" removes x-values; Setting xlab = "" removes x-axis title
axis(side = 1, at = z,labels = x, las = 2) # Adding labels on the x-axis at designated positions and displaying labels perpendicularly
par(mar=c(b=7,l=4,t=4,r=4))
plot(stockPrice ~ z, xaxt = "n", xlab = "") # Setting xaxt = "n" removes x-values; Setting xlab = "" removes x-axis title
axis(1, at = z, labels = x, las = 2) # Adding labels on the x-axis at designated positions and displaying labels perpendicularly
ggplot(NULL, aes(x=z, y = stockPrice)) +
geom_point() +
scale_x_date(date_breaks = "day", date_labels = "%a \n %b %d, %Y") + # Formatting x-values
labs(x="")
x <- c("1960-01-01", "1960-01-02", "1960-03-31", "1960-07-30")
z <- as.Date(x, "%Y-%m-%d") # m=month in number
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
## More examples: can be skipped
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- as.Date(x, format="%d%b%Y") # b=month in abb, d=day in a month, Y=year in 4 digits
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
x <- c("1jan60", "2jan60", "31mar60", "30jul60")
z <- as.Date(x, "%d%b%y") # b=month in abb, Y=year in 4 digits
z
## [1] "2060-01-01" "2060-01-02" "2060-03-31" "2060-07-30"
x <- c("jan11960", "jan21960", "mar311960", "jul301960")
z <- as.Date(x, "%b%d%Y") ## Wrong
z
## [1] "0960-01-11" "0960-01-21" "1960-03-31" "1960-07-30"
x <- c("jan11960", "jan21960", "mar311960", "jul301960")
z <- as.Date(x, "%b%D%Y") ## Wrong
z
## [1] NA NA NA NA
x <- c("jan011960", "jan021960", "mar311960", "jul301960")
z <- as.Date(x, "%b%d%Y")
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
x <- c("1960jan1", "1960jan2", "1960mar31", "1960jul30")
z <- as.Date(x, "%Y%b%d")
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
x <- c("01011960", "01021960", "03311960", "07301960")
z <- as.Date(x, "%m%d%Y") # m=month in number
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
x <- c("01/01/1960", "01/02/1960", "03/31/1960", "07/30/1960")
z <- as.Date(x, "%m/%d/%Y") # m=month in number
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
x <- c("1/1/1960", "1/2/1960", "3/31/1960", "7/30/1960")
z <- as.Date(x, "%m/%d/%Y") # Leading 0 ignored
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
y <- c("01011960", "01021960", "03311960", "07301960")
z <- as.Date(y, "%m%d%y") # Wrong in year
z
## [1] "2019-01-01" "2019-01-02" "2019-03-31" "2019-07-30"
str_replace_all(z, "-", "/") # Replace "-" in all dates by "/"
## [1] "2019/01/01" "2019/01/02" "2019/03/31" "2019/07/30"
z
## [1] "2019-01-01" "2019-01-02" "2019-03-31" "2019-07-30"
y <- c("196001011", "19600102", "19600331", "19600730")
z <- as.Date(y, "%Y/%m/%d") # m=month in number
z
## [1] NA NA NA NA
y <- c("1960Jan011", "1960Jan02", "1960Mar31", "1960Jul30")
z <- as.Date(y, "%Y%B%d") # m=month in number
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
format(z, "%a %b %d %X %Y %Z") # a=day in a week
## [1] "Fri Jan 01 00:00:00 1960 UTC" "Sat Jan 02 00:00:00 1960 UTC"
## [3] "Thu Mar 31 00:00:00 1960 UTC" "Sat Jul 30 00:00:00 1960 UTC"
x <- c("1/1/1960", "1/2/1960", "3/31/1960", "7/30/1960")
z <- as.Date(x, "%m/%d/%Y") # Leading 0 ignored
z
## [1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
format(z, "%B %d,%Y")
## [1] "January 01,1960" "January 02,1960" "March 31,1960" "July 30,1960"
format(z, "%Y-%B-%d")
## [1] "1960-January-01" "1960-January-02" "1960-March-31" "1960-July-30"
This is based on a resource here (Right click to open a new tab)
baseline_tbl <- structure(list(var = c("Total", "Age at initial diagnosis", "Age at initial diagnosis",
"Age at initial diagnosis", "Age at initial diagnosis", "Year of inital diagnosis",
"Year of inital diagnosis", "Year of inital diagnosis", "Year of inital diagnosis",
"Year of inital diagnosis", "Year of inital diagnosis", "Year of inital diagnosis",
"Is advanced NSCLC", "Is advanced NSCLC", "Is advanced NSCLC",
"Age at advanced diagnosis", "Age at advanced diagnosis", "Age at advanced diagnosis",
"Age at advanced diagnosis"), level = c("", "<65", ">=65", "Missing",
"Median", "Prior 2015", "2015", "2016", "2017", "2018", "2019",
"2020", "Yes", "No", "Missing", "<65", ">=65", "Missing", "Median"
), nsclc = c(1320L, 695L, 620L, 5L, 64L, 305L, 320L, 403L, 253L,
24L, 0L, 0L, 1026L, 294L, 0L, 553L, 467L, 300L, 64L), nsclc_pct = c(NA,
0.526515151515151, 0.46969696969697, 0.00378787878787879, NA,
0.231060606060606, 0.242424242424242, 0.30530303030303, 0.191666666666667,
0.0181818181818182, 0, 0, 0.777272727272727, 0.222727272727273,
0, 0.418939393939394, 0.353787878787879, 0.227272727272727, NA
), adv_nsclc = c(1026, 575, 446, 5, 63, 275, 260, 295, 170, 16,
0, 0, 1026, 0, 0, 553, 467, 6, 63.7), adv_pct = c(NA, 0.560428849902534,
0.434697855750487, 0.00487329434697856, NA, 0.268031189083821,
0.253411306042885, 0.287524366471735, 0.165692007797271, 0.0155945419103314,
0, 0, 1, 0, 0, 0.538986354775828, 0.455165692007797, 0.00584795321637427,
NA), early_nsclc = c(656, 319, 334, 3, 65, 164, 151, 199, 121,
10, 0, 0, 364, 292, 0, 178, 183, 295, 65.21), early_pct = c(NA,
0.486280487804878, 0.509146341463415, 0.00457317073170732, NA,
0.25, 0.230182926829268, 0.303353658536585, 0.184451219512195,
0.0152439024390244, 0, 0, 0.554878048780488, 0.445121951219512,
0, 0.271341463414634, 0.278963414634146, 0.44969512195122, NA
)), row.names = c(NA, -19L), class = c("tbl_df", "tbl", "data.frame"
))
baseline_tbl %>%
kbl(align = "llcccccc", escape=TRUE, col.names = c("Variable", "Level", "N", "%", "N", "%", "N", "%")) %>%
kable_paper(c("striped"),full_width = FALSE) %>%
column_spec(1, bold = TRUE, background = "#91D1C233",extra_css = "border-bottom: 1px solid;") %>%
column_spec(2,
width = "10em", border_right = TRUE) %>%
column_spec(c(3,5,7), width = "3em") %>%
column_spec(c(4,6,8), italic = TRUE,border_right = TRUE,width = "5em") %>%
row_spec(0, bold = TRUE, font_size = 18, extra_css = "border-bottom: 1px solid;") %>%
row_spec(1, underline = TRUE, extra_css = "border-bottom: 1px solid;") %>%
row_spec(c(5,12,15,19), extra_css = "border-bottom: 1px solid;") %>%
collapse_rows(columns = 1, valign = "top") %>%
add_header_above(c(" " = 2, "NSCLC" = 2, "Advanced NSCLC" = 2, "Early NSCLC" = 2),
bold = TRUE,extra_css ="border-bottom: 1px solid;") %>%
kable_classic_2()
NSCLC
|
Advanced NSCLC
|
Early NSCLC
|
|||||
---|---|---|---|---|---|---|---|
Variable | Level | N | % | N | % | N | % |
Total | 1320 | NA | 1026.0 | NA | 656.00 | NA | |
Age at initial diagnosis | <65 | 695 | 0.5265152 | 575.0 | 0.5604288 | 319.00 | 0.4862805 |
Age at initial diagnosis | >=65 | 620 | 0.4696970 | 446.0 | 0.4346979 | 334.00 | 0.5091463 |
Age at initial diagnosis | Missing | 5 | 0.0037879 | 5.0 | 0.0048733 | 3.00 | 0.0045732 |
Age at initial diagnosis | Median | 64 | NA | 63.0 | NA | 65.00 | NA |
Year of inital diagnosis | Prior 2015 | 305 | 0.2310606 | 275.0 | 0.2680312 | 164.00 | 0.2500000 |
Year of inital diagnosis | 2015 | 320 | 0.2424242 | 260.0 | 0.2534113 | 151.00 | 0.2301829 |
Year of inital diagnosis | 2016 | 403 | 0.3053030 | 295.0 | 0.2875244 | 199.00 | 0.3033537 |
Year of inital diagnosis | 2017 | 253 | 0.1916667 | 170.0 | 0.1656920 | 121.00 | 0.1844512 |
Year of inital diagnosis | 2018 | 24 | 0.0181818 | 16.0 | 0.0155945 | 10.00 | 0.0152439 |
Year of inital diagnosis | 2019 | 0 | 0.0000000 | 0.0 | 0.0000000 | 0.00 | 0.0000000 |
Year of inital diagnosis | 2020 | 0 | 0.0000000 | 0.0 | 0.0000000 | 0.00 | 0.0000000 |
Is advanced NSCLC | Yes | 1026 | 0.7772727 | 1026.0 | 1.0000000 | 364.00 | 0.5548780 |
Is advanced NSCLC | No | 294 | 0.2227273 | 0.0 | 0.0000000 | 292.00 | 0.4451220 |
Is advanced NSCLC | Missing | 0 | 0.0000000 | 0.0 | 0.0000000 | 0.00 | 0.0000000 |
Age at advanced diagnosis | <65 | 553 | 0.4189394 | 553.0 | 0.5389864 | 178.00 | 0.2713415 |
Age at advanced diagnosis | >=65 | 467 | 0.3537879 | 467.0 | 0.4551657 | 183.00 | 0.2789634 |
Age at advanced diagnosis | Missing | 300 | 0.2272727 | 6.0 | 0.0058480 | 295.00 | 0.4496951 |
Age at advanced diagnosis | Median | 64 | NA | 63.7 | NA | 65.21 | NA |
We use four affine transformations:
\(y = A_1 x + b_1\)
\(y = A_2 x + b_2\)
\(y = A_3 x + b_2\)
\(y = A_4 x + b_4\)
where
\[A_1 = \left [\begin{array}{cc} 0.86&0.03 \\ -0.03&0.86 \\ \end{array}\right ], ~~~ b_1 = \left[\begin{array}{cc} 0 \\ 1.5 \\ \end{array}\right]\]
\[A_2 = \left[\begin{array}{cc} 0.2&-0.25 \\ 0.21&0.23 \\ \end{array}\right], ~~~ b_2 = \left[\begin{array}{cc} 0 \\ 1.5 \\ \end{array}\right]\]
\[A_3 = \left[\begin{array}{cc} -0.15&0.27 \\ 0.25&0.26 \\ \end{array}\right], ~~~ b_3 = \left[\begin{array}{cc} 0 \\ 0.45 \\ \end{array}\right]\]
\[A_4 = \left[\begin{array}{cc} 0&0 \\ 0&0.17 \\ \end{array}\right], ~~~ b_4 = \left[\begin{array}{cc} 0 \\ 0 \\ \end{array}\right]\]
To make a fern, start by setting \(x = (0, 0)\) (the origin). Choose one of the four affine transformations at random with probabilities 0.83, 0.08, 0.08, and 0.01. Apply the chosen transformation to \(x\) and denote the result still as \(x\). Plot \(x\). The process is repeated until a certain number of points are plotted. Now, you see a fern.
# Make a matrix that contains all elements from the 4 matrices and the 4 probabilities
A = matrix(0, 4, 7)
A[1, ] = c(0.86, 0.03, -0.03, 0.86, 0, 1.5, 0.83)
A[2, ] = c(0.2, -0.25,0.21, 0.23, 0, 1.5, 0.08)
A[3, ] = c(-0.15, 0.27, 0.25, 0.26, 0, 0.45, 0.08)
A[4, ] = c(0, 0, 0, 0.17, 0, 0, 0.01)
# Start at the origin
x = c(0, 0)
# Starts a graphics device driver for the X Window System (version 11)
#X11()
# Make a plot template without axes and box
plot(0, 0,
type = "n",
xaxt = "n",
yaxt = "n",
bty = "n",
xlab = "",
ylab = "",
main = NULL,
xlim = c(-5, 10),
ylim = c(-2, 10))
# Add points
for (i in 1:2000){
a = A[sample(1:4, 1, prob = A[, 7]), 1:6]
x = matrix(a[1:4], 2, 2, byrow = TRUE)%*%x + a[5:6]
points(x[1], x[2], col = "green")
}
A word cloud is a text mining method to find the most frequently used words in a text.
# You need to install the following packages first
# "tm", "SnowballC", "wordcloud", "RColorBrewer", "RCurl", and "XML"
A function from http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need provides a way of creating word clouds. Here is how the function is defined.
#++++++++++++++++++++++++++++++++++
# rquery.wordcloud() : Word cloud generator
# - http://www.sthda.com
#+++++++++++++++++++++++++++++++++++
# x : character string (plain text, web url, txt file path)
# type : specify whether x is a plain text, a web page url or a file path
# lang : the language of the text
# excludeWords : a vector of words to exclude from the text
# textStemming : reduces words to their root form
# colorPalette : the name of color palette taken from RColorBrewer package,
# or a color name, or a color code
# min.freq : words with frequency below min.freq will not be plotted
# max.words : Maximum number of words to be plotted. least frequent terms dropped
# value returned by the function : a list(tdm, freqTable)
rquery.wordcloud <- function(x, type=c("text", "url", "file"),
lang="english", excludeWords=NULL,
textStemming=FALSE, colorPalette="Dark2",
min.freq=3, max.words=200)
{
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
if(type[1]=="file") text <- readLines(x)
else if(type[1]=="url") text <- html_to_text(x)
else if(type[1]=="text") text <- x
# Load the text as a corpus
docs <- Corpus(VectorSource(text))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove stopwords for the language
docs <- tm_map(docs, removeWords, stopwords(lang))
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Remove your own stopwords
if(!is.null(excludeWords))
docs <- tm_map(docs, removeWords, excludeWords)
# Text stemming
if(textStemming) docs <- tm_map(docs, stemDocument)
# Create term-document matrix
tdm <- TermDocumentMatrix(docs)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
# check the color palette name
if(!colorPalette %in% rownames(brewer.pal.info)) colors = colorPalette
else colors = brewer.pal(8, colorPalette)
# Plot the word cloud
set.seed(1234)
wordcloud(d$word,d$freq, min.freq=min.freq, max.words=max.words,
random.order=FALSE, rot.per=0.35,
use.r.layout=FALSE, colors=colors)
invisible(list(tdm=tdm, freqTable = d))
}
#++++++++++++++++++++++
# Helper function
#++++++++++++++++++++++
# Download and parse webpage
html_to_text<-function(url){
library(RCurl)
library(XML)
# download html
html.doc <- getURL(url)
#convert to plain text
doc = htmlParse(html.doc, asText=TRUE)
# "//text()" returns all text outside of HTML tags.
# We also don’t want text such as style and script codes
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
# Format text vector into one character string
return(paste(text, collapse = " "))
}
Let’s use the function to process the text from http://www.sthda.com/english/wiki/word-cloud-generator-in-r-one-killer-function-to-do-everything-you-need
filePath <- "http://www.sthda.com/sthda/RDoc/example-files/martin-luther-king-i-have-a-dream-speech.txt"
res<-rquery.wordcloud(filePath, type ="file", lang = "english")
## Warning in tm_map.SimpleCorpus(docs, content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(docs, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(docs, removeWords, stopwords(lang)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(docs, removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(docs, stripWhitespace): transformation drops
## documents
res
## $tdm
## <<TermDocumentMatrix (terms: 166, documents: 46)>>
## Non-/sparse entries: 278/7358
## Sparsity : 96%
## Maximal term length: 13
## Weighting : term frequency (tf)
##
## $freqTable
## word freq
## will will 17
## freedom freedom 13
## ring ring 12
## dream dream 11
## day day 11
## let let 11
## every every 9
## one one 8
## able able 8
## together together 7
## nation nation 4
## mountain mountain 4
## shall shall 4
## faith faith 4
## free free 4
## today today 3
## men men 3
## state state 3
## children children 3
## little little 3
## black black 3
## white white 3
## made made 3
## god god 3
## new new 3
## sing sing 3
## land land 3
## last last 3
## even even 2
## live live 2
## meaning meaning 2
## true true 2
## brotherhood brotherhood 2
## former former 2
## georgia georgia 2
## sons sons 2
## heat heat 2
## mississippi mississippi 2
## sweltering sweltering 2
## alabama alabama 2
## boys boys 2
## girls girls 2
## hands hands 2
## join join 2
## words words 2
## hill hill 2
## places places 2
## hope hope 2
## stone stone 2
## thee thee 2
## mountainside mountainside 2
## american american 1
## deeply deeply 1
## difficulties difficulties 1
## face face 1
## rooted rooted 1
## still still 1
## though though 1
## tomorrow tomorrow 1
## creed creed 1
## rise rise 1
## created created 1
## equal equal 1
## hold hold 1
## selfevident selfevident 1
## truths truths 1
## hills hills 1
## owners owners 1
## red red 1
## sit sit 1
## slave slave 1
## slaves slaves 1
## table table 1
## injustice injustice 1
## justice justice 1
## oasis oasis 1
## oppression oppression 1
## transformed transformed 1
## character character 1
## color color 1
## content content 1
## four four 1
## judged judged 1
## skin skin 1
## brothers brothers 1
## dripping dripping 1
## governor governor 1
## interposition interposition 1
## lips lips 1
## nullification nullification 1
## racists racists 1
## right right 1
## sisters sisters 1
## vicious vicious 1
## crooked crooked 1
## exalted exalted 1
## flesh flesh 1
## glory glory 1
## lord lord 1
## low low 1
## plain plain 1
## revealed revealed 1
## rough rough 1
## see see 1
## straight straight 1
## valley valley 1
## back back 1
## south south 1
## beautiful beautiful 1
## despair despair 1
## discords discords 1
## hew hew 1
## jail jail 1
## jangling jangling 1
## knowing knowing 1
## pray pray 1
## stand stand 1
## struggle struggle 1
## symphony symphony 1
## transform transform 1
## work work 1
## country country 1
## liberty liberty 1
## sweet sweet 1
## tis tis 1
## died died 1
## fathers fathers 1
## pilgrim pilgrim 1
## pride pride 1
## america america 1
## become become 1
## great great 1
## must must 1
## hampshire hampshire 1
## hilltops hilltops 1
## prodigious prodigious 1
## mighty mighty 1
## mountains mountains 1
## york york 1
## alleghenies alleghenies 1
## heightening heightening 1
## pennsylvania pennsylvania 1
## colorado colorado 1
## rockies rockies 1
## snowcapped snowcapped 1
## california california 1
## curvaceous curvaceous 1
## slopes slopes 1
## lookout lookout 1
## tennessee tennessee 1
## molehill molehill 1
## allow allow 1
## catholics catholics 1
## city city 1
## gentiles gentiles 1
## hamlet hamlet 1
## happens happens 1
## jews jews 1
## negro negro 1
## old old 1
## protestants protestants 1
## speed speed 1
## spiritual spiritual 1
## village village 1
## almighty almighty 1
## thank thank 1
You are strongly suggested to read this: https://www.quora.com/?pa_story=MTA5ODAwNzM5NTgzODE1NTE4MXwyMTExMDYyMzQ5ODA4MDh8MA**