birthweight <- read.csv("birthweight.csv")
# Basic data types
We have already said that logical values can be used to subset a data frame, and that all the values in a given column of a data frame must be of the same type or class. But what does this mean?
## Understanding class
R has the following basic data classes:
- numeric (includes integer and double)
- character
- logical
- complex
- raw

Generally, in bioinformatics, values belong to one of the first three classes. Read more about the complex and raw data types here.
class(birthweight$birthweight)
## [1] "numeric"
class(birthweight$smoker)
## [1] "character"
class(birthweight$geriatric.pregnancy)
## [1] "logical"
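The relationship between "numeric", "integer", and "double" can be checked directly. The following is a minimal sketch using made-up values (the names `whole` and `decimal` are illustrative and do not come from the birthweight table):

```r
# Made-up values for illustration, not taken from birthweight.csv
whole <- 42L        # the L suffix creates an integer
decimal <- 3.23     # numbers are stored as double by default

class(whole)        # "integer"
class(decimal)      # "numeric"
typeof(decimal)     # "double"

# Both count as numeric as far as is.numeric() is concerned
is.numeric(whole)
is.numeric(decimal)
```

This is why a column of whole numbers may report its class as "integer" while still passing an is.numeric() check.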
The numeric category is fairly self-explanatory. What are character and logical?
Character values are exactly what they sound like: stored characters (letters, numbers, or both). In the birthweight table, the “birth.date” and “location” columns contain character values.
head(birthweight$location)
## [1] "General" "Silver Hill" "Silver Hill" "Silver Hill" "Memorial"
## [6] "Memorial"
Characters are recognizable by the quotation marks that appear around them in the output. R cannot perform mathematical operations on numbers stored as characters.
# 1 + "1"   # left commented out: running it stops with the error "non-numeric argument to binary operator"
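To see the failure without halting a script, and to see one way around it, consider the sketch below. The variable `stored_as_text` is a made-up example; tryCatch() is used here only so that the error message is captured rather than stopping execution.

```r
# A made-up value: a number that was read in as text
stored_as_text <- "1"

# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(1 + stored_as_text, error = function(e) conditionMessage(e))
result

# Coercing to numeric first makes the arithmetic possible
converted <- 1 + as.numeric(stored_as_text)
converted
```

Here `result` holds the text of the error, while `converted` holds the sum.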
Logical values are TRUE, FALSE, or NA (missing). They are produced when one item is compared to another with a relational operator.
The relational operators in R are:
- `>` greater than
- `>=` greater than or equal to
- `<` less than
- `<=` less than or equal to
- `==` equal to
- `!=` not equal to
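Before applying these operators to the data frame, it may help to see them on a short vector. The values in `head_circ` are made up for illustration:

```r
# Hypothetical head circumferences in cm
head_circ <- c(36, 35, 33, 37)

head_circ > 35     # TRUE FALSE FALSE TRUE
head_circ >= 35    # TRUE TRUE FALSE TRUE
head_circ == 33    # FALSE FALSE TRUE FALSE
head_circ != 33    # TRUE TRUE FALSE TRUE

# A logical vector can be used directly to keep the matching elements
large <- head_circ[head_circ > 35]
large
```

Each comparison returns one logical value per element, which is what makes it usable as a row filter inside the square brackets of a data frame.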
birthweight[birthweight$head.circumference > 35, c("length", "weeks.gestation", "maternal.height", "paternal.height")]
## length weeks.gestation maternal.height paternal.height
## 1 52 38 164 NA
## 4 53 41 161 175
## 7 52 40 170 181
## 15 53 40 171 183
## 16 53 40 170 185
## 18 49 40 152 170
## 20 58 41 173 180
## 21 54 38 172 172
## 23 52 39 170 178
## 25 51 38 165 NA
## 31 58 41 172 185
## 33 51 40 168 181
## 34 51 39 157 NA
## 35 54 42 175 184
## 42 53 44 174 189
birthweight[birthweight$maternal.age <= 20, c("location", "maternal.age", "paternal.age")]
## location maternal.age paternal.age
## 11 Memorial 20 22
## 14 Memorial 19 20
## 15 Silver Hill 19 19
## 16 Memorial 20 24
## 21 Silver Hill 18 20
## 22 Silver Hill 20 23
## 26 General 20 23
## 28 General 20 20
## 37 Silver Hill 20 20
## 39 General 19 NA
## 42 Silver Hill 20 26
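Notice that when R is asked to perform a comparison between a number and a missing value, the result is a missing value. When such a result is used to subset a data frame, each NA produces a row filled with NA, as in the output below.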
birthweight[birthweight$paternal.education == 10, c(1,13:16)]
## ID paternal.age paternal.education paternal.cigarettes paternal.height
## NA NA NA NA NA NA
## NA.1 NA NA NA NA NA
## 7 365 30 10 25 181
## 24 321 39 10 0 171
## NA.2 NA NA NA NA NA
## 26 1360 23 10 35 179
## 28 1363 20 10 35 185
## NA.3 NA NA NA NA NA
## 36 1191 21 10 25 185
## 37 431 20 10 35 180
## NA.4 NA NA NA NA NA
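The all-NA rows above come from the NA values in the comparison. One common way to drop them is sketched below on a small made-up data frame (`toy` and its values are hypothetical, chosen only to mirror the column names used here):

```r
# Hypothetical miniature of the data frame; values are made up
toy <- data.frame(
  ID = c(1, 2, 3, 4),
  paternal.education = c(10, NA, 12, 10)
)

# The comparison returns NA where the value is missing
toy$paternal.education == 10

# Subsetting with that vector produces an all-NA row for each NA
with_na_rows <- toy[toy$paternal.education == 10, ]

# which() keeps only the positions that are TRUE, dropping the NAs
matches_which <- toy[which(toy$paternal.education == 10), ]

# An alternative: explicitly require the value to be non-missing
matches_explicit <- toy[!is.na(toy$paternal.education) &
                          toy$paternal.education == 10, ]
```

Both alternatives return only the rows where the comparison is actually TRUE. Which one to prefer is largely a matter of taste.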
birthweight[birthweight$weeks.gestation != 40, "weeks.gestation"]
## [1] 38 39 41 41 39 39 34 38 38 38 41 37 39 41 38 35 39 37 38 44 41 37 41 41 35
## [26] 39 42 42 33 33 39 45 44
birthweight[birthweight$location == "General",]
## ID birth.date location length birthweight head.circumference
## 1 1107 1/25/1967 General 52 3.23 36
## 17 820 10/7/1967 General 52 3.77 34
## 18 752 10/19/1967 General 49 3.32 36
## 26 1360 2/16/1968 General 56 4.55 34
## 28 1363 4/2/1968 General 48 2.37 30
## 33 1088 7/24/1968 General 51 3.27 36
## 36 1191 9/7/1968 General 53 3.65 33
## 39 1600 10/9/1968 General 53 2.90 34
## 40 532 10/25/1968 General 53 3.59 34
## 41 223 12/11/1968 General 50 3.87 33
## weeks.gestation smoker maternal.age maternal.cigarettes maternal.height
## 1 38 no 31 0 164
## 17 40 no 24 0 157
## 18 40 yes 27 12 152
## 26 44 no 20 0 162
## 28 37 yes 20 7 163
## 33 40 no 24 0 168
## 36 42 no 21 0 165
## 39 39 no 19 0 165
## 40 40 yes 31 12 163
## 41 45 yes 28 25 163
## maternal.prepregnant.weight paternal.age paternal.education
## 1 57 NA NA
## 17 50 31 16
## 18 48 37 12
## 26 57 23 10
## 28 47 20 10
## 33 53 29 16
## 36 61 21 10
## 39 57 NA NA
## 40 49 41 12
## 41 54 30 16
## paternal.cigarettes paternal.height low.birthweight geriatric.pregnancy
## 1 NA NA 0 FALSE
## 17 0 173 0 FALSE
## 18 25 170 0 FALSE
## 26 35 179 0 FALSE
## 28 35 185 1 FALSE
## 33 0 181 0 FALSE
## 36 25 185 0 FALSE
## 39 NA NA 0 FALSE
## 40 50 191 0 FALSE
## 41 0 183 0 FALSE
Many of R’s functions also return logical values.
is.numeric(birthweight$ID)
## [1] TRUE
is.numeric(birthweight$smoker)
## [1] FALSE
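A few other base R functions that return logical values are sketched below. The vectors `location` and `paternal_age` are made up for illustration:

```r
# Made-up vectors for illustration
location <- c("General", "Silver Hill", "Memorial", "General")
paternal_age <- c(27, NA, 31, 24)

is.character(location)                    # one value for the whole vector
is.na(paternal_age)                       # one value per element
location %in% c("General", "Memorial")    # membership test, per element

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_missing <- sum(is.na(paternal_age))
n_missing
```

Functions such as is.na() and the %in% operator return one value per element, so their results can be used for subsetting in the same way as the relational operators.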
## Coercion: converting between classes
The birthweight data frame has three columns that should probably be logical values: “smoker”, “low.birthweight”, and “geriatric.pregnancy”. All of these are questions that can be answered with TRUE/FALSE. However, only “geriatric.pregnancy” is stored as a logical value. Storing “smoker” and “low.birthweight” as logical values would be more useful, since it allows us to subset the data frame more easily.
Changing the class of data is known as coercion.
as.logical(birthweight$low.birthweight)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [25] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [37] TRUE TRUE FALSE FALSE FALSE FALSE
as.logical(birthweight$smoker)
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
The as.logical() function converted “low.birthweight” to a logical vector, but could not convert “smoker,” and returned a vector of missing data denoted by NA. Why is this?
The coercion rule in R orders the classes from least to most general:
logical -> integer -> numeric -> complex -> character
R can convert logical values to integers, store integers as the more general numeric type, or represent numeric data as a character, but these coercion operations cannot always be reversed without losing information.
as.numeric(birthweight$geriatric.pregnancy)
## [1] 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
## [39] 0 0 0 0
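The ordering, and the loss of information when going against it, can be seen with a few made-up values. This is a sketch only; suppressWarnings() is used here just to keep the output quiet, since as.numeric() warns when it introduces NA.

```r
# c() silently coerces its arguments to the most general type present
class(c(TRUE, 2L))      # logical and integer   give "integer"
class(c(2L, 3.5))       # integer and double    give "numeric"
class(c(3.5, "a"))      # numeric and character give "character"

# Going the other way can lose information
truncated <- as.integer(3.7)                          # fractional part is dropped
not_a_number <- suppressWarnings(as.numeric("abc"))   # becomes NA

# Text made up only of digits converts back cleanly
round_trip <- as.numeric(as.character(3.23))
```

Only the last conversion recovers the original value. The other two discard part or all of the input.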
The as.logical() function only operates on “low.birthweight” the way we want because the data was encoded as 0s and 1s. When applied to numbers, as.logical() converts 0 to FALSE and every other number to TRUE, so a column encoded in any other way would give misleading results.
as.logical(birthweight$maternal.age)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
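One cautious pattern is to confirm that a column really contains only 0s and 1s before coercing it. The helper `is_binary()` below is a hypothetical function written for this sketch, not part of base R, and the vectors are made up:

```r
# Made-up vectors for illustration
flags <- c(0, 1, 1, 0)
ages <- c(31, 27, 0, 19)

as.logical(flags)   # 0 becomes FALSE, 1 becomes TRUE
as.logical(ages)    # every nonzero age becomes TRUE

# Hypothetical helper: TRUE only if every value is 0, 1, or NA
is_binary <- function(x) all(x %in% c(0, 1, NA))

is_binary(flags)
is_binary(ages)
```

A check like this could be run before any as.logical() call on numeric data, so that a column such as “maternal.age” is not converted by accident.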
Let’s convert the “low.birthweight” column to logical.
birthweight$low.birthweight <- as.logical(birthweight$low.birthweight)
birthweight
## ID birth.date location length birthweight head.circumference
## 1 1107 1/25/1967 General 52 3.23 36
## 2 697 2/6/1967 Silver Hill 48 3.03 35
## 3 1683 2/14/1967 Silver Hill 53 3.35 33
## 4 27 3/9/1967 Silver Hill 53 3.55 37
## 5 1522 3/13/1967 Memorial 50 2.74 33
## 6 569 3/23/1967 Memorial 50 2.51 35
## 7 365 4/23/1967 Memorial 52 3.53 37
## 8 808 5/5/1967 Silver Hill 48 2.92 33
## 9 1369 6/4/1967 Silver Hill 49 3.18 34
## 10 1023 6/7/1967 Memorial 52 3.00 35
## 11 822 6/14/1967 Memorial 50 3.42 35
## 12 1272 6/20/1967 Memorial 53 2.75 32
## 13 1262 6/25/1967 Silver Hill 53 3.19 34
## 14 575 7/12/1967 Memorial 50 2.78 30
## 15 1016 7/13/1967 Silver Hill 53 4.32 36
## 16 792 9/7/1967 Memorial 53 3.64 38
## 17 820 10/7/1967 General 52 3.77 34
## 18 752 10/19/1967 General 49 3.32 36
## 19 619 11/1/1967 Memorial 52 3.41 33
## 20 1764 12/7/1967 Silver Hill 58 4.57 39
## 21 1081 12/14/1967 Silver Hill 54 3.63 38
## 22 516 1/8/1968 Silver Hill 47 2.66 33
## 23 272 1/10/1968 Memorial 52 3.86 36
## 24 321 1/21/1968 Silver Hill 48 3.11 33
## 25 1636 2/2/1968 Silver Hill 51 3.93 38
## 26 1360 2/16/1968 General 56 4.55 34
## 27 1388 2/22/1968 Memorial 51 3.14 33
## 28 1363 4/2/1968 General 48 2.37 30
## 29 1058 4/24/1968 Silver Hill 53 3.15 34
## 30 755 4/25/1968 Memorial 53 3.20 33
## 31 462 6/19/1968 Silver Hill 58 4.10 39
## 32 300 7/18/1968 Silver Hill 46 2.05 32
## 33 1088 7/24/1968 General 51 3.27 36
## 34 57 8/12/1968 Memorial 51 3.32 38
## 35 553 8/17/1968 Silver Hill 54 3.94 37
## 36 1191 9/7/1968 General 53 3.65 33
## 37 431 9/16/1968 Silver Hill 48 1.92 30
## 38 1313 9/27/1968 Silver Hill 43 2.65 32
## 39 1600 10/9/1968 General 53 2.90 34
## 40 532 10/25/1968 General 53 3.59 34
## 41 223 12/11/1968 General 50 3.87 33
## 42 1187 12/19/1968 Silver Hill 53 4.07 38
## weeks.gestation smoker maternal.age maternal.cigarettes maternal.height
## 1 38 no 31 0 164
## 2 39 no 27 0 162
## 3 41 no 27 0 164
## 4 41 yes 37 25 161
## 5 39 yes 21 17 156
## 6 39 yes 22 7 159
## 7 40 yes 26 25 170
## 8 34 no 26 0 167
## 9 38 yes 31 25 162
## 10 38 yes 30 12 165
## 11 38 no 20 0 157
## 12 40 yes 37 50 168
## 13 41 yes 27 35 163
## 14 37 yes 19 7 165
## 15 40 no 19 0 171
## 16 40 yes 20 2 170
## 17 40 no 24 0 157
## 18 40 yes 27 12 152
## 19 39 yes 23 25 181
## 20 41 yes 32 12 173
## 21 38 no 18 0 172
## 22 35 yes 20 35 170
## 23 39 yes 30 25 170
## 24 37 no 28 0 158
## 25 38 no 29 0 165
## 26 44 no 20 0 162
## 27 41 yes 22 7 160
## 28 37 yes 20 7 163
## 29 40 no 29 0 167
## 30 41 no 21 0 155
## 31 41 no 35 0 172
## 32 35 yes 41 7 166
## 33 40 no 24 0 168
## 34 39 yes 23 17 157
## 35 42 no 24 0 175
## 36 42 no 21 0 165
## 37 33 yes 20 7 161
## 38 33 no 24 0 149
## 39 39 no 19 0 165
## 40 40 yes 31 12 163
## 41 45 yes 28 25 163
## 42 44 no 20 0 174
## maternal.prepregnant.weight paternal.age paternal.education
## 1 57 NA NA
## 2 62 27 14
## 3 62 37 14
## 4 66 46 NA
## 5 53 24 12
## 6 52 23 14
## 7 62 30 10
## 8 64 25 12
## 9 57 32 16
## 10 64 38 14
## 11 48 22 14
## 12 61 31 16
## 13 51 31 16
## 14 60 20 14
## 15 62 19 12
## 16 59 24 12
## 17 50 31 16
## 18 48 37 12
## 19 69 23 16
## 20 70 38 14
## 21 50 20 12
## 22 57 23 12
## 23 78 40 16
## 24 54 39 10
## 25 61 NA NA
## 26 57 23 10
## 27 53 24 16
## 28 47 20 10
## 29 60 30 16
## 30 55 25 14
## 31 58 31 16
## 32 57 37 14
## 33 53 29 16
## 34 48 NA NA
## 35 66 30 12
## 36 61 21 10
## 37 50 20 10
## 38 45 26 16
## 39 57 NA NA
## 40 49 41 12
## 41 54 30 16
## 42 68 26 14
## paternal.cigarettes paternal.height low.birthweight geriatric.pregnancy
## 1 NA NA FALSE FALSE
## 2 0 178 FALSE FALSE
## 3 0 170 FALSE FALSE
## 4 0 175 FALSE TRUE
## 5 7 179 FALSE FALSE
## 6 25 NA TRUE FALSE
## 7 25 181 FALSE FALSE
## 8 25 175 FALSE FALSE
## 9 50 194 FALSE FALSE
## 10 50 180 FALSE FALSE
## 11 0 179 FALSE FALSE
## 12 0 173 FALSE TRUE
## 13 25 185 FALSE FALSE
## 14 0 183 FALSE FALSE
## 15 0 183 FALSE FALSE
## 16 12 185 FALSE FALSE
## 17 0 173 FALSE FALSE
## 18 25 170 FALSE FALSE
## 19 2 181 FALSE FALSE
## 20 25 180 FALSE FALSE
## 21 7 172 FALSE FALSE
## 22 50 186 TRUE FALSE
## 23 50 178 FALSE FALSE
## 24 0 171 FALSE FALSE
## 25 NA NA FALSE FALSE
## 26 35 179 FALSE FALSE
## 27 12 176 FALSE FALSE
## 28 35 185 TRUE FALSE
## 29 NA 182 FALSE FALSE
## 30 25 183 FALSE FALSE
## 31 25 185 FALSE TRUE
## 32 25 173 TRUE TRUE
## 33 0 181 FALSE FALSE
## 34 NA NA FALSE FALSE
## 35 0 184 FALSE FALSE
## 36 25 185 FALSE FALSE
## 37 35 180 TRUE FALSE
## 38 0 169 TRUE FALSE
## 39 NA NA FALSE FALSE
## 40 50 191 FALSE FALSE
## 41 0 183 FALSE FALSE
## 42 25 189 FALSE FALSE
Note that the output of as.logical(birthweight$low.birthweight) must be assigned to the “low.birthweight” column in order for the values in the column to change.
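The “smoker” column cannot be converted with as.logical(), because that function only recognizes strings such as "TRUE", "true", "T", "FALSE", "false", and "F". A comparison is one way to produce the logical vector instead. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) and is not necessarily the approach taken elsewhere in this material:

```r
# Hypothetical miniature of the data frame; values are made up
toy <- data.frame(
  smoker = c("no", "yes", "yes", NA),
  birthweight = c(3.23, 2.74, 2.51, 3.03)
)

# "yes" and "no" are not recognized, so every element comes back as NA
as.logical(toy$smoker)

# A comparison yields TRUE for "yes" and FALSE for "no"; NA stays NA
toy$smoker <- toy$smoker == "yes"

# A logical column can be used directly as a row filter;
# which() drops the row where smoker is NA
smokers <- toy[which(toy$smoker), ]
```

As with “low.birthweight”, the result of the comparison has to be assigned back to the column for the change to persist.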