Volchenko, Shirokanova
January 16, 2023
Image source: Practical Data Cleaning. 19 Essential Tips to Scrub Your Dirty Data (and keep your boss happy)
Image source: https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
variables of interest for today’s class:
Nominal variables are stored as factors or characters in R
## [1] "factor"
## [1] "factor"
Central tendency measures
In R there is no built-in mode function but we can create one
## [1] Female
## Levels: Male Female
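One way to write such a function is sketched below. The name `stat_mode` and the toy `gndr` vector are assumptions made for illustration, not the code used in class. Note that base R does have a function called `mode()`, but it reports the storage mode of an object rather than the most frequent value, so the helper gets a different name.

```r
# Hypothetical helper: returns the most frequent value of a vector.
# If several values are tied, the first one encountered is returned.
stat_mode <- function(x) {
  x <- x[!is.na(x)]                        # ignore missing values
  ux <- unique(x)                          # distinct values, original type kept
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}

# Toy data standing in for ess7$gndr
gndr <- factor(c("Male", "Female", "Female", NA, "Female", "Male"),
               levels = c("Male", "Female"))
stat_mode(gndr)
```

Because the result is taken from the input vector itself, a factor stays a factor and keeps its levels.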
Frequency distributions
##
## Male Female
## 18871 21292
##
## Male Female
## 0.4696031 0.5298494
##
## Male Female
## 46.96031 52.98494
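Frequency tables like these can be produced with `table()` and `prop.table()`. The sketch below uses a small made-up vector in place of `ess7$gndr`. The proportions printed above add up to slightly less than 1, which suggests they were divided by the total number of respondents, including those with a missing answer; both variants are shown.

```r
# Toy data standing in for ess7$gndr (one missing value)
gndr <- factor(c("Male", "Female", "Female", NA, "Female", "Male"),
               levels = c("Male", "Female"))

counts <- table(gndr)              # absolute frequencies; NA is dropped by default
counts

prop.table(counts)                 # proportions among non-missing answers
counts / length(gndr)              # proportions of all observations, NA included
round(100 * prop.table(counts), 1) # percentages

table(gndr, useNA = "ifany")       # show the missing values as a separate cell
```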
Ordinal variables can be stored as factors in R
## [1] "factor"
Mode
Frequency distributions
##
## A big city Suburbs or outskirts of big city
## 8492 4727
## Town or small city Country village
## 12920 11410
## Farm or home in countryside
## 2546
##
## A big city Suburbs or outskirts of big city
## 0.21132263 0.11763096
## Town or small city Country village
## 0.32151300 0.28393679
## Farm or home in countryside
## 0.06335697
##
## A big city Suburbs or outskirts of big city
## 21.132263 11.763096
## Town or small city Country village
## 32.151300 28.393679
## Farm or home in countryside
## 6.335697
Barplots are also suitable here, but make sure that the categories appear in a meaningful order.
library(ggplot2)
ggplot(ess7, aes(x = domicil)) +
geom_bar() +
coord_flip() +
scale_x_discrete(limits = rev(levels(ess7$domicil)))
Should be stored as numeric in R
## [1] "factor"
Oops! It is a factor!
But we can clearly see that we have numbers as values:
##
## 1900 1910 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926
## 1 1 1 4 2 1 2 3 9 11 22 28 45 48 69 90
## 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942
## 94 140 151 214 192 231 272 270 314 336 358 387 432 466 490 482
## 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958
## 521 530 616 651 657 701 648 751 650 684 625 640 687 709 636 699
## 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
## 703 733 689 670 666 716 696 654 690 688 641 679 610 657 580 650
## 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
## 641 670 624 622 666 709 617 572 540 552 580 527 528 522 530 519
## 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
## 497 495 533 462 448 499 537 501 346 56
We can change a variable's type with the as.numeric(), as.character() and as.factor() functions.
Let's take a look at our new variable:
## [1] 1 90
## [1] 55.13775
## [1] 55
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 4 2 1 2 3 9 11 22 28 45 48 69 90 94 140 151 214
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 192 231 272 270 314 336 358 387 432 466 490 482 521 530 616 651 657 701 648 751
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 650 684 625 640 687 709 636 699 703 733 689 670 666 716 696 654 690 688 641 679
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 610 657 580 650 641 670 624 622 666 709 617 572 540 552 580 527 528 522 530 519
## 81 82 83 84 85 86 87 88 89 90
## 497 495 533 462 448 499 537 501 346 56
Hm… Something went wrong…
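A hint: if a variable contains only numbers but is stored as a factor in R, you need to use the as.numeric(as.character()) sequence. Applied directly to a factor, as.numeric() returns the internal level codes (1, 2, 3, ...) rather than the original values.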
## [1] 1900 2000
## [1] 1965.137
## [1] 1965
Now the range, the mean and the median look plausible.
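The difference between the two conversions is easy to see on a small example. The vector below is made up for illustration; it only mimics how the year of birth is stored.

```r
# Toy factor whose levels look like numbers
yrbrn <- factor(c("1950", "1985", "2000", "1950"))

as.numeric(yrbrn)                  # level codes: 1 2 3 1
as.numeric(as.character(yrbrn))    # original values: 1950 1985 2000 1950

# An equivalent idiom that converts the levels once and then indexes them
as.numeric(levels(yrbrn))[yrbrn]
```

The last form gives the same result and can be faster on long vectors, because only the distinct levels are converted.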
An unofficial type of scaling that falls between ordinal and interval. Technically ordinal but can be analyzed as if it were interval. Usually there are five or more levels of the variable.
For example:
Would you say that most people can be trusted, or that you can’t be too careful in dealing with people? Please tell me on a score of 0 to 10, where 0 means you can’t be too careful and 10 means that most people can be trusted.
##
## You can't be too careful 1
## 1994 1136
## 2 3
## 2493 3838
## 4 5
## 3781 8356
## 6 7
## 4725 6802
## 8 9
## 5057 1237
## Most people can be trusted
## 691
## [1] "factor"
We can recode it into numeric and calculate mean, median and standard deviation.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 5.000 6.000 6.211 8.000 11.000 75
If we want to keep the original range of the variable (from 0 to 10), we can subtract 1 from each observation.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.000 5.000 5.211 7.000 10.000 75
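A minimal sketch of this recoding follows, on made-up data. The variable name `ppltrst` and the helper vector `trust_levels` are assumptions for illustration. Here `as.numeric(as.character())` would not help, because the two end points are stored as text labels and would turn into `NA`; instead the level codes (1 to 11) are used and shifted down by 1.

```r
# Levels in the order shown in the frequency table above
trust_levels <- c("You can't be too careful", 1:9, "Most people can be trusted")

# Toy answers standing in for the survey variable (hypothetical name)
ppltrst <- factor(c("5", "You can't be too careful",
                    "Most people can be trusted", "7", NA),
                  levels = trust_levels)

trust_num <- as.numeric(ppltrst) - 1   # level codes 1..11 become scores 0..10
trust_num

mean(trust_num, na.rm = TRUE)
median(trust_num, na.rm = TRUE)
sd(trust_num, na.rm = TRUE)
```

This only works when the factor levels are already in the correct order, so it is worth checking `levels()` before relying on the codes.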
Should be stored as numeric in R
##
## 30 32 34 35 36 37 38 38.7 39 39.9 40
## 1 1 2 1 2 4 8 2 6 1 19
## 40.005 40.95 41 41.3 41.85 42 42.2 42.3 42.5 43 43.5
## 1 1 13 1 1 36 1 1 1 31 2
## 43.65 44 44.1 44.5 44.55 45 45.3 45.4 45.45 45.5 45.9
## 1 34 12 7 1 119 1 2 3 1 2
## 46 46.3 46.7 46.8 47 47.2 47.25 47.5 47.6 47.7 48
## 60 1 2 3 98 2 17 3 10 5 175
## 48.1 48.15 48.5 48.6 49 49.4 49.5 49.8 49.9 49.95 50
## 2 3 2 3 140 1 4 1 2 2 428
## 50.005 50.1 50.3 50.4 50.5 50.8 50.85 51 51.1 51.2 51.3
## 1 1 2 48 4 32 3 193 1 1 10
## 51.4 51.5 51.7 51.9 52 52.1 52.2 52.3 52.5 52.6 52.65
## 2 5 8 1 386 1 7 1 1 3 4
## 53 53.005 53.1 53.5 53.55 54 54.4 54.45 54.5 54.9 55
## 385 1 13 7 35 424 6 5 3 15 698
## 55.003 55.05 55.1 55.3 55.35 55.5 55.6 55.8 56 56.2 56.25
## 1 1 1 7 2 4 2 6 455 11 2
## 56.5 56.6 56.7 56.8 57 57.002 57.15 57.2 57.5 57.6 57.7
## 2 1 60 1 438 1 6 76 3 10 1
## 58 58.05 58.1 58.5 58.95 59 59.1 59.4 59.5 59.6 59.8
## 724 3 11 27 7 429 1 34 3 1 1
## 59.85 59.9 60 60.003 60.3 60.5 60.7 60.75 60.8 61 61.2
## 46 7 1417 1 72 5 1 9 9 362 23
## 61.4 61.5 61.65 61.7 62 62.007 62.1 62.5 62.55 62.6 62.9
## 1 4 8 9 858 1 13 6 2 18 1
## 63 63.002 63.45 63.5 63.6 63.9 64 64.001 64.3 64.35 64.4
## 841 1 7 117 2 15 622 1 1 11 21
## 64.5 64.8 64.9 64.992 65 65.005 65.007 65.1 65.2 65.25 65.3
## 2 13 10 1 1450 1 1 1 1 14 19
## 65.5 65.7 65.8 66 66.004 66.1 66.15 66.2 66.3 66.4 66.5
## 2 6 30 484 2 1 55 13 1 1 3
## 66.6 66.7 67 67.05 67.1 67.5 67.6 67.9 67.95 68 68.4
## 19 73 660 3 9 22 4 1 6 1034 7
## 68.5 68.85 68.9 69 69.2 69.3 69.4 69.5 69.7 69.75 69.9
## 9 6 11 512 1 85 2 8 1 5 117
## 70 70.2 70.3 70.4 70.5 70.6 70.65 70.8 71 71.1 71.2
## 1883 17 4 1 7 2 15 15 379 15 16
## 71.5 71.55 71.7 72 72.1 72.2 72.45 72.5 72.6 72.8 72.9
## 5 11 14 1024 14 1 62 6 10 1 15
## 73 73.1 73.2 73.3 73.35 73.4 73.5 73.8 73.9 74 74.2
## 714 1 1 2 1 1 14 17 13 738 1
## 74.25 74.4 74.5 74.7 74.8 75 75.15 75.3 75.5 75.6 75.7
## 1 8 4 11 2 1527 5 5 7 91 2
## 76 76.05 76.2 76.5 76.7 76.9 76.95 77 77.1 77.4 77.5
## 616 3 140 12 8 1 6 449 15 10 4
## 77.6 77.7 77.85 78 78.1 78.3 78.5 78.6 78.75 78.9 79
## 8 3 16 943 1 12 20 1 46 16 489
## 79.1 79.2 79.4 79.5 79.65 79.8 79.9 80 80.005 80.1 80.3
## 1 14 64 4 4 11 2 1923 1 13 5
## 80.5 80.55 80.6 80.7 81 81.02 81.45 81.5 81.6 81.7 81.9
## 6 2 1 17 367 1 5 4 4 1 97
## 82 82.1 82.3 82.35 82.4 82.5 82.6 82.8 83 83.007 83.08
## 789 3 1 2 1 4 103 5 575 1 1
## 83.1 83.3 83.4 83.5 83.7 83.9 84 84.005 84.15 84.2 84.4
## 2 1 1 9 12 3 529 1 7 1 8
## 84.5 84.6 84.8 85 85.05 85.1 85.3 85.5 85.7 85.95 86
## 6 9 13 1149 34 1 7 15 50 5 467
## 86.1 86.2 86.4 86.5 86.6 86.85 87 87.1 87.2 87.3 87.5
## 1 6 7 3 3 1 392 8 2 7 7
## 87.75 88 88.1 88.2 88.3 88.4 88.5 88.65 88.8 88.9 89
## 1 469 1 70 1 1 2 5 3 75 361
## 89.1 89.3 89.4 89.5 89.55 89.6 89.8 89.992 90 90.1 90.3
## 8 1 1 8 3 2 12 1 1151 1 3
## 90.45 90.5 90.7 90.8 90.9 91 91.2 91.35 91.6 91.8 92
## 2 4 7 1 8 201 9 22 2 3 429
## 92.1 92.25 92.5 92.7 92.8 93 93.15 93.4 93.5 93.6 94
## 29 5 6 7 2 277 1 3 3 2 231
## 94.2 94.3 94.5 94.95 95 95.1 95.3 95.4 95.5 95.7 95.85
## 1 3 52 1 606 1 36 4 1 2 3
## 96 96.2 96.3 96.5 96.6 96.7 96.75 97 97.1 97.2 97.5
## 217 7 5 3 3 1 1 158 2 6 4
## 97.65 98 98.1 98.4 98.5 98.55 98.9 99 99.1 99.3 99.45
## 24 308 2 21 2 1 1 86 1 2 1
## 99.8 99.9 100 100.2 100.5 100.6 100.7 100.8 100.9 101 101.6
## 4 2 565 1 1 1 1 43 2 60 27
## 101.7 102 102.5 102.6 103 103.05 103.4 103.5 103.9 103.95 104
## 4 135 3 2 81 2 4 1 1 9 86
## 104.2 104.3 104.4 104.5 104.8 105 105.2 105.3 105.5 105.7 105.75
## 1 1 4 1 8 228 1 4 1 2 1
## 106 106.2 106.4 106.6 106.65 107 107.1 108 109 109.35 109.8
## 65 3 1 1 1 42 17 81 25 3 1
## 110 110.25 110.5 110.7 111 111.1 111.15 111.6 112 113 113.4
## 277 4 1 1 19 3 2 2 50 24 16
## 114 114.3 115 115.2 116 116.55 117 117.5 118 119 119.5
## 23 6 103 2 24 3 24 2 25 9 1
## 119.7 120 120.7 121 122 122.4 122.85 122.9 123 123.4 123.8
## 6 114 3 8 12 1 1 1 10 1 1
## 124 124.7 125 126 126.5 127 128 128.4 129 130 130.2
## 9 1 52 10 1 8 6 1 7 49 1
## 131 132 132.3 132.9 133 133.4 133.65 134 135 136 137
## 3 5 3 1 2 1 1 3 17 2 4
## 138 138.6 139 140 141 141.1 143 144 144.9 145 146
## 3 1 1 15 2 1 1 2 2 3 2
## 147 149 150 150.3 151 151.2 152 153 154 155 157
## 2 1 9 1 1 1 2 1 1 2 1
## 157.5 158.8 160 161 162 163 164 165 165.1 167 170
## 1 1 5 1 1 2 2 2 1 2 4
## 172 173 175 180 182 185 190 195
## 1 1 2 3 2 1 1 1
## [1] "factor"
##
## 76 97 100 101 105 106 107 108 116 117 118
## 1 1 4 1 3 2 3 2 3 1 1
## 120 122 123 125 127 129.54 130 132.08 134.62 135 136
## 2 1 1 1 1 2 2 1 1 3 1
## 138 139 139.7 140 141 142 142.24 143 144 144.78 145
## 3 2 2 15 2 11 3 6 3 2 26
## 146 147 147.32 148 149 149.86 150 151 152 152.4 153
## 11 20 11 42 27 31 363 63 239 69 202
## 154 154.94 155 156 157 157.48 158 159 160 160.02 161
## 234 83 575 472 572 172 883 401 2129 149 382
## 162 162.56 163 164 165 165.1 166 167 167.64 168 169
## 1108 175 1233 1238 2272 154 705 1186 204 2139 855
## 170 170.18 171 172 172.72 173 174 175 175.26 176 177
## 2702 180 564 1515 190 1256 944 1730 128 1245 528
## 177.8 178 179 180 180.34 181 182 182.88 183 184 185
## 160 1668 533 1890 102 400 828 98 834 447 739
## 185.42 186 187 187.96 188 189 190 190.5 191 192 193
## 63 507 375 45 370 181 305 23 97 136 124
## 193.04 194 195 195.58 196 197 198 198.12 199 200 200.66
## 6 71 57 3 42 32 43 5 5 32 2
## 201 202 203 203.2 204 205 206 207 208 210
## 6 9 4 2 4 3 2 1 2 1
## [1] "factor"
Once again, we need to use as.numeric(as.character()) here
ess7$weight2 <- as.numeric(as.character(ess7$weight))
ess7$height2 <- as.numeric(as.character(ess7$height))
Central tendency measures
We can calculate the mean, median and standard deviation of a ratio variable and report its range.
## [1] 74.86203
## [1] 74
## [1] 30 195
## [1] 15.60587
Please note that here we use our new variable weight2.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 76.0 164.0 170.0 170.5 178.0 210.0 434
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.52 22.48 25.14 25.69 28.09 141.97 1313
## [1] 25.69089
## [1] 25.14286
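The BMI values summarised above can be obtained from weight and height. The sketch below assumes the standard definition (weight in kilograms divided by squared height in metres) and assumes that height is recorded in centimetres, as the height table suggests; the toy vectors stand in for `ess7$weight2` and `ess7$height2`.

```r
# Toy data: weight in kg, height in cm (one missing weight)
weight_kg <- c(70, 85, NA, 60)
height_cm <- c(175, 180, 165, 160)

# BMI = weight (kg) / height (m)^2; height is converted from cm to m first
bmi <- weight_kg / (height_cm / 100)^2
round(bmi, 2)

summary(bmi)                   # NA's are counted separately
mean(bmi, na.rm = TRUE)
median(bmi, na.rm = TRUE)
```

An observation with a missing weight or a missing height gets a missing BMI, which is consistent with BMI having more missing values than either variable on its own.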
BMI as a continuous measure -> BMI as categories
| Category | BMI (kg/m²) |
|---|---|
| Very severely underweight | <15 |
| Severely underweight | 15-16 |
| Underweight | 16-18.5 |
| Normal (healthy weight) | 18.5-25 |
| Overweight | 25-30 |
| Obese Class I (Moderately obese) | 30-35 |
| Obese Class II (Severely obese) | 35-40 |
| Obese Class III (Very severely obese) | >40 |
Let’s recode continuous BMI into categories
ess7$bmi_cat <- cut(ess7$bmi, c(0, 15, 16, 18.5, 25, 30, 35, 40, 150))
table(ess7$bmi_cat) # absolute values
##
## (0,15] (15,16] (16,18.5] (18.5,25] (25,30] (30,35] (35,40] (40,150]
## 27 47 929 17997 13909 4539 1052 372
##
## (0,15] (15,16] (16,18.5] (18.5,25] (25,30] (30,35]
## 0.0006718925 0.0011695906 0.0231180789 0.4478536767 0.3461241757 0.1129525943
## (35,40] (40,150]
## 0.0261789225 0.0092571855
##
## (0,15] (15,16] (16,18.5] (18.5,25] (25,30] (30,35]
## 0.06718925 0.11695906 2.31180789 44.78536767 34.61241757 11.29525943
## (35,40] (40,150]
## 2.61789225 0.92571855
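By default `cut()` builds intervals that are closed on the right, which is why the labels read `(18.5,25]`: a BMI of exactly 25 lands in the lower category. If boundary values should instead belong to the higher category (for instance, treating exactly 25 as overweight), `right = FALSE` can be used. The sketch below compares both options on a few made-up values; which convention fits the BMI table is a judgment call that the source does not settle.

```r
# Toy BMI values, including some that sit exactly on a boundary
bmi <- c(14.9, 18.5, 25, 30, 41)
breaks <- c(0, 15, 16, 18.5, 25, 30, 35, 40, 150)

cut(bmi, breaks)                  # default right = TRUE: intervals (a, b]
cut(bmi, breaks, right = FALSE)   # intervals [a, b)

# The boundary value 25 moves from one category to the next
as.character(cut(25, breaks))
as.character(cut(25, breaks, right = FALSE))
```

Values outside the outermost breaks become `NA`, so the first and last break should cover the whole observed range.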
We can add meaningful labels here:
ess7$bmi_cat1 <- factor(ess7$bmi_cat,
levels = c("(0,15]", "(15,16]", "(16,18.5]", "(18.5,25]","(25,30]",
"(30,35]", "(35,40]", "(40,150]"),
labels = c("Very severely underweight",
"Severely underweight",
"Underweight",
"Normal \n (healthy weight)",
"Overweight",
"Obese Class I \n (Moderately obese)",
"Obese Class II \n (Severely obese)",
"Obese Class III \n (Very severely obese)"))
par(mar = c(5.1, 20, 2.1, 2.1))
barplot(table(ess7$bmi_cat1), las = 2, horiz = TRUE)
… and one function table1()
# install.packages("table1")
library(table1)
table1(~ as.numeric(as.character(yrbrn)) + gndr + domicil + height2 + weight2 + bmi + bmi_cat1, data = ess7)
| | Overall (N=40185) |
|---|---|
| as.numeric(as.character(yrbrn)) | |
| Mean (SD) | 1970 (18.8) |
| Median [Min, Max] | 1970 [1900, 2000] |
| Missing | 99 (0.2%) |
| gndr | |
| Male | 18871 (47.0%) |
| Female | 21292 (53.0%) |
| Missing | 22 (0.1%) |
| domicil | |
| A big city | 8492 (21.1%) |
| Suburbs or outskirts of big city | 4727 (11.8%) |
| Town or small city | 12920 (32.2%) |
| Country village | 11410 (28.4%) |
| Farm or home in countryside | 2546 (6.3%) |
| Missing | 90 (0.2%) |
| height2 | |
| Mean (SD) | 170 (9.77) |
| Median [Min, Max] | 170 [76.0, 210] |
| Missing | 434 (1.1%) |
| weight2 | |
| Mean (SD) | 74.9 (15.6) |
| Median [Min, Max] | 74.0 [30.0, 195] |
| Missing | 1162 (2.9%) |
| bmi | |
| Mean (SD) | 25.7 (4.77) |
| Median [Min, Max] | 25.1 [10.5, 142] |
| Missing | 1313 (3.3%) |
| bmi_cat1 | |
| Very severely underweight | 27 (0.1%) |
| Severely underweight | 47 (0.1%) |
| Underweight | 929 (2.3%) |
| Normal (healthy weight) | 17997 (44.8%) |
| Overweight | 13909 (34.6%) |
| Obese Class I (Moderately obese) | 4539 (11.3%) |
| Obese Class II (Severely obese) | 1052 (2.6%) |
| Obese Class III (Very severely obese) | 372 (0.9%) |
| Missing | 1313 (3.3%) |
| Variable type | Type in R | Central tendency measure | Plot type |
|---|---|---|---|
| Nominal | factor or character | mode | barplot |
| Ordinal | factor or character | mode, median | barplot (make sure that categories of a variable are in the correct sequence) |
| Interval | ordered factor or numeric | mode, median, mean | barplot (if there are few categories) or histogram |
| Ratio | numeric | mode, median, mean | histogram |