Variable types in R

Volchenko, Shirokanova

January 16, 2023

What are we going to do today

Variable types

Image source: Practical Data Cleaning. 19 Essential Tips to Scrub Your Dirty Data (and keep your boss happy)

Variable types (a more joyful way)

Image source: https://github.com/allisonhorst

Variable types (an even more joyful way)

Central tendency measures

Image source: https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/

Our dataset for today: ESS, round 7

Variables of interest for today’s class:

# install.packages("foreign")
library(foreign)
ess7 <- read.spss("/Users/olesyavolchenko/Yandex.Disk.localized/datafiles/ESS/ESS7e02_2.sav", use.value.labels = TRUE, to.data.frame = TRUE)
# Or pick the file interactively (file.choose() works on every platform; choose.files() is Windows-only):
# ess7 <- read.spss(file.choose(), use.value.labels = TRUE, to.data.frame = TRUE)

Let’s decide what the type of each variable is

Nominal variables

Nominal variables are stored as factors or character vectors in R

class(ess7$gndr)
## [1] "factor"
class(ess7$cntry)
## [1] "factor"

Central tendency measures

R has no built-in function for the statistical mode (the built-in mode() reports how an object is stored), but we can write one

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Mode(ess7$gndr)
## [1] Female
## Levels: Male Female
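
When two categories are equally frequent, the Mode() helper above returns whichever tied value appears first in the data, because which.max() picks the first maximum. A small check on a made-up vector (the toy data are an assumption, not part of ESS) shows this and one possible way to list all tied values:

```r
# Same helper as above, repeated so the example runs on its own
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

toy <- c("b", "a", "a", "b", "c")   # "a" and "b" both occur twice
Mode(toy)                           # the tied value seen first, here "b"

# To see every tied value, compare the counts with their maximum
counts <- table(toy)
all_modes <- names(counts)[counts == max(counts)]
all_modes                           # "a" "b"
```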

Frequency distributions

table(ess7$gndr) # absolute values
## 
##   Male Female 
##  18871  21292
table(ess7$gndr) / nrow(ess7) # shares
## 
##      Male    Female 
## 0.4696031 0.5298494
table(ess7$gndr) / nrow(ess7)*100 # percentages
## 
##     Male   Female 
## 46.96031 52.98494
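
The shares above do not add up to 1 because nrow(ess7) also counts respondents with a missing answer. If shares of valid answers are wanted instead, prop.table() is one option. A minimal sketch on made-up data (the vector gndr_toy is hypothetical):

```r
# Toy data (made up): three valid answers and one missing value
gndr_toy <- factor(c("Male", "Female", "Female", NA))

shares_all   <- table(gndr_toy) / length(gndr_toy)  # missing value stays in the denominator
shares_valid <- prop.table(table(gndr_toy))         # denominator = valid answers only

shares_all
shares_valid

# Show the missing values as their own category
table(gndr_toy, useNA = "ifany")
```

Which denominator is appropriate depends on whether the missing answers should be reported as part of the distribution.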

Plots for nominal variables

barplot(table(ess7$gndr))

library(ggplot2)
ggplot(data = subset(ess7, !is.na(ess7$gndr)), aes(x = gndr)) +
  geom_bar()

Ordinal variables

Ordinal variables can be stored as factors in R

class(ess7$domicil)
## [1] "factor"
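
A plain factor does not tell R that its levels are ranked. An ordered factor does, which allows comparisons and makes a median category meaningful. A minimal sketch on made-up data (the variable names and the three-step scale are assumptions for illustration):

```r
# Made-up answers on a three-step scale
size <- c("small", "large", "medium", "small", "medium")

# ordered = TRUE tells R that the levels form a ranking
size_ord <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)

class(size_ord)            # "ordered" "factor"
size_ord[1] < size_ord[2]  # TRUE: "small" ranks below "large"

# Median category: take the median of the level codes and map it back to a label
# (with an even number of observations the median code may fall between two levels)
median_code  <- median(as.integer(size_ord))
median_level <- levels(size_ord)[median_code]
median_level               # "medium"
```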

Mode

Frequency distributions

table(ess7$domicil) # absolute values
## 
##                       A big city Suburbs or outskirts of big city 
##                             8492                             4727 
##               Town or small city                  Country village 
##                            12920                            11410 
##      Farm or home in countryside 
##                             2546
table(ess7$domicil) / nrow(ess7) # shares
## 
##                       A big city Suburbs or outskirts of big city 
##                       0.21132263                       0.11763096 
##               Town or small city                  Country village 
##                       0.32151300                       0.28393679 
##      Farm or home in countryside 
##                       0.06335697
table(ess7$domicil) / nrow(ess7)*100 # percentages
## 
##                       A big city Suburbs or outskirts of big city 
##                        21.132263                        11.763096 
##               Town or small city                  Country village 
##                        32.151300                        28.393679 
##      Farm or home in countryside 
##                         6.335697

Plots for ordinal variables

Barplots are also suitable here (but make sure that the categories appear in a meaningful order)

barplot(table(ess7$domicil))

library(ggplot2)
ggplot(ess7, aes(x = domicil)) +
  geom_bar()

library(ggplot2)
ggplot(ess7, aes(x = domicil)) +
  geom_bar() +
  coord_flip() +
  scale_x_discrete(limits = rev(levels(ess7$domicil)))
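
The order of the bars follows the order of the factor levels. When a factor is built from text, factor() sorts the levels alphabetically, which is rarely the intended sequence for an ordinal variable. A small sketch on made-up data showing how to check and set the order:

```r
# Made-up character data; factor() sorts the levels alphabetically by default
answers <- c("low", "high", "medium", "low")
f_default <- factor(answers)
levels(f_default)    # "high" "low" "medium": not a meaningful sequence

# State the intended sequence explicitly
f_fixed <- factor(answers, levels = c("low", "medium", "high"))
levels(f_fixed)      # "low" "medium" "high"
table(f_fixed)       # counts, and therefore bars, now follow this order
```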

Interval variable

Should be stored as numeric in R

class(ess7$yrbrn)
## [1] "factor"

Oops! It is a factor!

But we can clearly see that we have numbers as values:

table(ess7$yrbrn)
## 
## 1900 1910 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 
##    1    1    1    4    2    1    2    3    9   11   22   28   45   48   69   90 
## 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 
##   94  140  151  214  192  231  272  270  314  336  358  387  432  466  490  482 
## 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 
##  521  530  616  651  657  701  648  751  650  684  625  640  687  709  636  699 
## 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 
##  703  733  689  670  666  716  696  654  690  688  641  679  610  657  580  650 
## 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 
##  641  670  624  622  666  709  617  572  540  552  580  527  528  522  530  519 
## 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 
##  497  495  533  462  448  499  537  501  346   56

We can change a variable's type with the as.numeric(), as.character() and as.factor() functions

ess7$yrbrn1 <- as.numeric(ess7$yrbrn)

Let’s take a look at our new variable

range(ess7$yrbrn1, na.rm = T)
## [1]  1 90
mean(ess7$yrbrn1, na.rm = T)
## [1] 55.13775
median(ess7$yrbrn1, na.rm = T)
## [1] 55
table(ess7$yrbrn1)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   4   2   1   2   3   9  11  22  28  45  48  69  90  94 140 151 214 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
## 192 231 272 270 314 336 358 387 432 466 490 482 521 530 616 651 657 701 648 751 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
## 650 684 625 640 687 709 636 699 703 733 689 670 666 716 696 654 690 688 641 679 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
## 610 657 580 650 641 670 624 622 666 709 617 572 540 552 580 527 528 522 530 519 
##  81  82  83  84  85  86  87  88  89  90 
## 497 495 533 462 448 499 537 501 346  56

Hm… Something went wrong: as.numeric() returned the factor's internal level codes (1 to 90) instead of the years.

A hint: if a variable contains only numbers but is stored as a factor in R, you need to use the as.numeric(as.character()) sequence

ess7$yrbrn2 <- as.numeric(as.character(ess7$yrbrn))
range(ess7$yrbrn2, na.rm = T)
## [1] 1900 2000
mean(ess7$yrbrn2, na.rm = T)
## [1] 1965.137
median(ess7$yrbrn2, na.rm = T)
## [1] 1965

Now the range, the mean and the median look plausible.
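
Why the two-step conversion is needed can be seen on a tiny made-up factor (the values are hypothetical, not taken from ESS): as.numeric() alone returns the position of each value among the sorted levels, not the value itself.

```r
# Made-up years stored as a factor, as can happen after importing labelled data
yr <- factor(c("1990", "1985", "2000"))

levels(yr)                               # "1985" "1990" "2000"
codes <- as.numeric(yr)                  # internal level codes: 2 1 3
years <- as.numeric(as.character(yr))    # the actual values: 1990 1985 2000

# An equivalent conversion that goes through the levels
years_alt <- as.numeric(levels(yr))[yr]
```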

Quasi-interval variable

An unofficial type of scaling that falls between ordinal and interval. Technically ordinal but can be analyzed as if it were interval. Usually there are five or more levels of the variable.

For example:

Would you say that most people can be trusted, or that you can’t be too careful in dealing with people? Please tell me on a score of 0 to 10, where 0 means you can’t be too careful and 10 means that most people can be trusted.

table(ess7$ppltrst)
## 
##   You can't be too careful                          1 
##                       1994                       1136 
##                          2                          3 
##                       2493                       3838 
##                          4                          5 
##                       3781                       8356 
##                          6                          7 
##                       4725                       6802 
##                          8                          9 
##                       5057                       1237 
## Most people can be trusted 
##                        691
class(ess7$ppltrst)
## [1] "factor"

We can recode it into numeric and calculate mean, median and standard deviation.

ess7$ppltrst1 <- as.numeric(ess7$ppltrst)
summary(ess7$ppltrst1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   5.000   6.000   6.211   8.000  11.000      75

If we want to keep the original range of the variable (from 0 to 10), we can subtract 1 from each observation, because as.numeric() returns the level codes 1 to 11

ess7$ppltrst1 <- as.numeric(ess7$ppltrst) - 1
summary(ess7$ppltrst1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   4.000   5.000   5.211   7.000  10.000      75
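
Here the level codes are used on purpose: the endpoints of the scale are stored as text labels, so going through as.character() would turn them into missing values. A sketch on a made-up, shortened 0-3 version of the scale (it assumes the levels are stored in the order of the scale, as in the table above):

```r
# Made-up answers on a shortened 0-3 scale whose endpoints carry text labels
lv <- c("You can't be too careful", "1", "2", "Most people can be trusted")
trust <- factor(c("1", "Most people can be trusted", "You can't be too careful", "2"),
                levels = lv)

# Going through as.character() loses the labelled endpoints (they become NA)
via_text <- suppressWarnings(as.numeric(as.character(trust)))
via_text    # 1 NA NA 2

# Level codes run from 1 to 4, so subtracting 1 restores the 0-3 scale
via_codes <- as.numeric(trust) - 1
via_codes   # 1 3 0 2
```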

Ratio variable

Should be stored as numeric in R

table(ess7$weight)
## 
##     30     32     34     35     36     37     38   38.7     39   39.9     40 
##      1      1      2      1      2      4      8      2      6      1     19 
## 40.005  40.95     41   41.3  41.85     42   42.2   42.3   42.5     43   43.5 
##      1      1     13      1      1     36      1      1      1     31      2 
##  43.65     44   44.1   44.5  44.55     45   45.3   45.4  45.45   45.5   45.9 
##      1     34     12      7      1    119      1      2      3      1      2 
##     46   46.3   46.7   46.8     47   47.2  47.25   47.5   47.6   47.7     48 
##     60      1      2      3     98      2     17      3     10      5    175 
##   48.1  48.15   48.5   48.6     49   49.4   49.5   49.8   49.9  49.95     50 
##      2      3      2      3    140      1      4      1      2      2    428 
## 50.005   50.1   50.3   50.4   50.5   50.8  50.85     51   51.1   51.2   51.3 
##      1      1      2     48      4     32      3    193      1      1     10 
##   51.4   51.5   51.7   51.9     52   52.1   52.2   52.3   52.5   52.6  52.65 
##      2      5      8      1    386      1      7      1      1      3      4 
##     53 53.005   53.1   53.5  53.55     54   54.4  54.45   54.5   54.9     55 
##    385      1     13      7     35    424      6      5      3     15    698 
## 55.003  55.05   55.1   55.3  55.35   55.5   55.6   55.8     56   56.2  56.25 
##      1      1      1      7      2      4      2      6    455     11      2 
##   56.5   56.6   56.7   56.8     57 57.002  57.15   57.2   57.5   57.6   57.7 
##      2      1     60      1    438      1      6     76      3     10      1 
##     58  58.05   58.1   58.5  58.95     59   59.1   59.4   59.5   59.6   59.8 
##    724      3     11     27      7    429      1     34      3      1      1 
##  59.85   59.9     60 60.003   60.3   60.5   60.7  60.75   60.8     61   61.2 
##     46      7   1417      1     72      5      1      9      9    362     23 
##   61.4   61.5  61.65   61.7     62 62.007   62.1   62.5  62.55   62.6   62.9 
##      1      4      8      9    858      1     13      6      2     18      1 
##     63 63.002  63.45   63.5   63.6   63.9     64 64.001   64.3  64.35   64.4 
##    841      1      7    117      2     15    622      1      1     11     21 
##   64.5   64.8   64.9 64.992     65 65.005 65.007   65.1   65.2  65.25   65.3 
##      2     13     10      1   1450      1      1      1      1     14     19 
##   65.5   65.7   65.8     66 66.004   66.1  66.15   66.2   66.3   66.4   66.5 
##      2      6     30    484      2      1     55     13      1      1      3 
##   66.6   66.7     67  67.05   67.1   67.5   67.6   67.9  67.95     68   68.4 
##     19     73    660      3      9     22      4      1      6   1034      7 
##   68.5  68.85   68.9     69   69.2   69.3   69.4   69.5   69.7  69.75   69.9 
##      9      6     11    512      1     85      2      8      1      5    117 
##     70   70.2   70.3   70.4   70.5   70.6  70.65   70.8     71   71.1   71.2 
##   1883     17      4      1      7      2     15     15    379     15     16 
##   71.5  71.55   71.7     72   72.1   72.2  72.45   72.5   72.6   72.8   72.9 
##      5     11     14   1024     14      1     62      6     10      1     15 
##     73   73.1   73.2   73.3  73.35   73.4   73.5   73.8   73.9     74   74.2 
##    714      1      1      2      1      1     14     17     13    738      1 
##  74.25   74.4   74.5   74.7   74.8     75  75.15   75.3   75.5   75.6   75.7 
##      1      8      4     11      2   1527      5      5      7     91      2 
##     76  76.05   76.2   76.5   76.7   76.9  76.95     77   77.1   77.4   77.5 
##    616      3    140     12      8      1      6    449     15     10      4 
##   77.6   77.7  77.85     78   78.1   78.3   78.5   78.6  78.75   78.9     79 
##      8      3     16    943      1     12     20      1     46     16    489 
##   79.1   79.2   79.4   79.5  79.65   79.8   79.9     80 80.005   80.1   80.3 
##      1     14     64      4      4     11      2   1923      1     13      5 
##   80.5  80.55   80.6   80.7     81  81.02  81.45   81.5   81.6   81.7   81.9 
##      6      2      1     17    367      1      5      4      4      1     97 
##     82   82.1   82.3  82.35   82.4   82.5   82.6   82.8     83 83.007  83.08 
##    789      3      1      2      1      4    103      5    575      1      1 
##   83.1   83.3   83.4   83.5   83.7   83.9     84 84.005  84.15   84.2   84.4 
##      2      1      1      9     12      3    529      1      7      1      8 
##   84.5   84.6   84.8     85  85.05   85.1   85.3   85.5   85.7  85.95     86 
##      6      9     13   1149     34      1      7     15     50      5    467 
##   86.1   86.2   86.4   86.5   86.6  86.85     87   87.1   87.2   87.3   87.5 
##      1      6      7      3      3      1    392      8      2      7      7 
##  87.75     88   88.1   88.2   88.3   88.4   88.5  88.65   88.8   88.9     89 
##      1    469      1     70      1      1      2      5      3     75    361 
##   89.1   89.3   89.4   89.5  89.55   89.6   89.8 89.992     90   90.1   90.3 
##      8      1      1      8      3      2     12      1   1151      1      3 
##  90.45   90.5   90.7   90.8   90.9     91   91.2  91.35   91.6   91.8     92 
##      2      4      7      1      8    201      9     22      2      3    429 
##   92.1  92.25   92.5   92.7   92.8     93  93.15   93.4   93.5   93.6     94 
##     29      5      6      7      2    277      1      3      3      2    231 
##   94.2   94.3   94.5  94.95     95   95.1   95.3   95.4   95.5   95.7  95.85 
##      1      3     52      1    606      1     36      4      1      2      3 
##     96   96.2   96.3   96.5   96.6   96.7  96.75     97   97.1   97.2   97.5 
##    217      7      5      3      3      1      1    158      2      6      4 
##  97.65     98   98.1   98.4   98.5  98.55   98.9     99   99.1   99.3  99.45 
##     24    308      2     21      2      1      1     86      1      2      1 
##   99.8   99.9    100  100.2  100.5  100.6  100.7  100.8  100.9    101  101.6 
##      4      2    565      1      1      1      1     43      2     60     27 
##  101.7    102  102.5  102.6    103 103.05  103.4  103.5  103.9 103.95    104 
##      4    135      3      2     81      2      4      1      1      9     86 
##  104.2  104.3  104.4  104.5  104.8    105  105.2  105.3  105.5  105.7 105.75 
##      1      1      4      1      8    228      1      4      1      2      1 
##    106  106.2  106.4  106.6 106.65    107  107.1    108    109 109.35  109.8 
##     65      3      1      1      1     42     17     81     25      3      1 
##    110 110.25  110.5  110.7    111  111.1 111.15  111.6    112    113  113.4 
##    277      4      1      1     19      3      2      2     50     24     16 
##    114  114.3    115  115.2    116 116.55    117  117.5    118    119  119.5 
##     23      6    103      2     24      3     24      2     25      9      1 
##  119.7    120  120.7    121    122  122.4 122.85  122.9    123  123.4  123.8 
##      6    114      3      8     12      1      1      1     10      1      1 
##    124  124.7    125    126  126.5    127    128  128.4    129    130  130.2 
##      9      1     52     10      1      8      6      1      7     49      1 
##    131    132  132.3  132.9    133  133.4 133.65    134    135    136    137 
##      3      5      3      1      2      1      1      3     17      2      4 
##    138  138.6    139    140    141  141.1    143    144  144.9    145    146 
##      3      1      1     15      2      1      1      2      2      3      2 
##    147    149    150  150.3    151  151.2    152    153    154    155    157 
##      2      1      9      1      1      1      2      1      1      2      1 
##  157.5  158.8    160    161    162    163    164    165  165.1    167    170 
##      1      1      5      1      1      2      2      2      1      2      4 
##    172    173    175    180    182    185    190    195 
##      1      1      2      3      2      1      1      1
class(ess7$weight)
## [1] "factor"
table(ess7$height)
## 
##     76     97    100    101    105    106    107    108    116    117    118 
##      1      1      4      1      3      2      3      2      3      1      1 
##    120    122    123    125    127 129.54    130 132.08 134.62    135    136 
##      2      1      1      1      1      2      2      1      1      3      1 
##    138    139  139.7    140    141    142 142.24    143    144 144.78    145 
##      3      2      2     15      2     11      3      6      3      2     26 
##    146    147 147.32    148    149 149.86    150    151    152  152.4    153 
##     11     20     11     42     27     31    363     63    239     69    202 
##    154 154.94    155    156    157 157.48    158    159    160 160.02    161 
##    234     83    575    472    572    172    883    401   2129    149    382 
##    162 162.56    163    164    165  165.1    166    167 167.64    168    169 
##   1108    175   1233   1238   2272    154    705   1186    204   2139    855 
##    170 170.18    171    172 172.72    173    174    175 175.26    176    177 
##   2702    180    564   1515    190   1256    944   1730    128   1245    528 
##  177.8    178    179    180 180.34    181    182 182.88    183    184    185 
##    160   1668    533   1890    102    400    828     98    834    447    739 
## 185.42    186    187 187.96    188    189    190  190.5    191    192    193 
##     63    507    375     45    370    181    305     23     97    136    124 
## 193.04    194    195 195.58    196    197    198 198.12    199    200 200.66 
##      6     71     57      3     42     32     43      5      5     32      2 
##    201    202    203  203.2    204    205    206    207    208    210 
##      6      9      4      2      4      3      2      1      2      1
class(ess7$height)
## [1] "factor"

Once again, we need to use as.numeric(as.character()) here

ess7$weight2 <- as.numeric(as.character(ess7$weight))
ess7$height2 <- as.numeric(as.character(ess7$height))

Central tendency measures

We can calculate the mean, median and standard deviation and report the range of a ratio variable

mean(ess7$weight2, na.rm = T)
## [1] 74.86203
median(ess7$weight2, na.rm = T)
## [1] 74
range(ess7$weight2, na.rm = T)
## [1]  30 195
sd(ess7$weight2, na.rm = T)
## [1] 15.60587

Please note that here we use our new variable weight2

Plots for ratio variables

hist(ess7$weight2)

ggplot(ess7, aes(x = weight2)) +
  geom_histogram(binwidth = 5)

We can calculate BMI based on weight and height

#ess7$height2 <- as.numeric(as.character(ess7$height))
summary(ess7$height2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    76.0   164.0   170.0   170.5   178.0   210.0     434
ess7$bmi <- ess7$weight2/(ess7$height2/100)^2 # it works now
summary(ess7$bmi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   10.52   22.48   25.14   25.69   28.09  141.97    1313
hist(ess7$bmi)

mean(ess7$bmi, na.rm = T)
## [1] 25.69089
median(ess7$bmi, na.rm = T)
## [1] 25.14286
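
The calculation above divides weight in kilograms by squared height in metres, so height in centimetres is first divided by 100. A worked example with made-up numbers (the helper name bmi_from_cm is hypothetical):

```r
# Hypothetical helper: weight in kilograms, height in centimetres
bmi_from_cm <- function(weight_kg, height_cm) {
  weight_kg / (height_cm / 100)^2   # convert cm to m before squaring
}

bmi_from_cm(70, 175)   # about 22.86

# A missing weight or height gives a missing BMI
bmi_from_cm(c(70, NA), c(175, 180))
```

Because a missing weight or height is enough to make BMI missing, bmi can have more NA's than either input variable.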

We can move from a higher level of measurement to a lower one

BMI as a continuous measure -> BMI as categories

Category                                 BMI (kg/m²)
Very severely underweight                <15
Severely underweight                     15-16
Underweight                              16-18.5
Normal (healthy weight)                  18.5-25
Overweight                               25-30
Obese Class I (Moderately obese)         30-35
Obese Class II (Severely obese)          35-40
Obese Class III (Very severely obese)    >40

Let’s recode continuous BMI into categories

ess7$bmi_cat <- cut(ess7$bmi, c(0, 15, 16, 18.5, 25, 30, 35, 40, 150))
table(ess7$bmi_cat) # absolute values
## 
##    (0,15]   (15,16] (16,18.5] (18.5,25]   (25,30]   (30,35]   (35,40]  (40,150] 
##        27        47       929     17997     13909      4539      1052       372
table(ess7$bmi_cat) / nrow(ess7) # shares
## 
##       (0,15]      (15,16]    (16,18.5]    (18.5,25]      (25,30]      (30,35] 
## 0.0006718925 0.0011695906 0.0231180789 0.4478536767 0.3461241757 0.1129525943 
##      (35,40]     (40,150] 
## 0.0261789225 0.0092571855
table(ess7$bmi_cat) / nrow(ess7)*100 # percentage
## 
##      (0,15]     (15,16]   (16,18.5]   (18.5,25]     (25,30]     (30,35] 
##  0.06718925  0.11695906  2.31180789 44.78536767 34.61241757 11.29525943 
##     (35,40]    (40,150] 
##  2.61789225  0.92571855

We can add meaningful labels here:

ess7$bmi_cat1 <- factor(ess7$bmi_cat,
  levels = c("(0,15]", "(15,16]", "(16,18.5]", "(18.5,25]", "(25,30]",
             "(30,35]", "(35,40]", "(40,150]"),
  labels = c("Very severely underweight",
             "Severely underweight",
             "Underweight",
             "Normal \n (healthy weight)",
             "Overweight",
             "Obese Class I \n (Moderately obese)",
             "Obese Class II \n (Severely obese)",
             "Obese Class III \n (Very severely obese)"))

par(mar = c(5.1, 20, 2.1, 2.1))
barplot(table(ess7$bmi_cat1), las = 2, horiz = T)
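
cut() can also attach the labels in one step, and its right argument decides which category a value exactly on a boundary falls into. The sketch below uses made-up BMI values and shortened labels. Which boundary convention is appropriate depends on the classification being followed, and the category table above does not specify it, so treat the choice here as an assumption:

```r
# Made-up BMI values, several of them exactly on a category boundary
bmi_toy <- c(14, 18.5, 24.9, 25, 30, 41)
breaks  <- c(0, 15, 16, 18.5, 25, 30, 35, 40, Inf)
labs    <- c("Very severely underweight", "Severely underweight", "Underweight",
             "Normal (healthy weight)", "Overweight", "Obese Class I",
             "Obese Class II", "Obese Class III")

# Default (right = TRUE): intervals like (18.5,25], so exactly 25 counts as normal
cat_right <- cut(bmi_toy, breaks, labels = labs)

# right = FALSE: intervals like [25,30), so exactly 25 counts as overweight
cat_left <- cut(bmi_toy, breaks, labels = labs, right = FALSE)

data.frame(bmi_toy, cat_right, cat_left)
```

Using Inf as the last break avoids having to guess an upper limit such as 150.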

Finally, there is a way to summarize all relevant variables in the dataset using one table

… and one function table1()

# install.packages("table1")
library(table1)
table1(~ as.numeric(as.character(yrbrn)) + gndr + domicil + height2 + weight2 + bmi + bmi_cat1, data=ess7)
                                            Overall (N=40185)
as.numeric(as.character(yrbrn))
  Mean (SD)                                 1970 (18.8)
  Median [Min, Max]                         1970 [1900, 2000]
  Missing                                   99 (0.2%)
gndr
  Male                                      18871 (47.0%)
  Female                                    21292 (53.0%)
  Missing                                   22 (0.1%)
domicil
  A big city                                8492 (21.1%)
  Suburbs or outskirts of big city          4727 (11.8%)
  Town or small city                        12920 (32.2%)
  Country village                           11410 (28.4%)
  Farm or home in countryside               2546 (6.3%)
  Missing                                   90 (0.2%)
height2
  Mean (SD)                                 170 (9.77)
  Median [Min, Max]                         170 [76.0, 210]
  Missing                                   434 (1.1%)
weight2
  Mean (SD)                                 74.9 (15.6)
  Median [Min, Max]                         74.0 [30.0, 195]
  Missing                                   1162 (2.9%)
bmi
  Mean (SD)                                 25.7 (4.77)
  Median [Min, Max]                         25.1 [10.5, 142]
  Missing                                   1313 (3.3%)
bmi_cat1
  Very severely underweight                 27 (0.1%)
  Severely underweight                      47 (0.1%)
  Underweight                               929 (2.3%)
  Normal (healthy weight)                   17997 (44.8%)
  Overweight                                13909 (34.6%)
  Obese Class I (Moderately obese)          4539 (11.3%)
  Obese Class II (Severely obese)           1052 (2.6%)
  Obese Class III (Very severely obese)     372 (0.9%)
  Missing                                   1313 (3.3%)

Summary

Variable type | Type in R | Central tendency measure | Plot type
Nominal | factor or character | mode | barplot
Ordinal | factor or character | mode, median | barplot (make sure that the categories are in the correct sequence)
Interval | ordered factor or numeric | mode, median, mean | barplot (if there are few categories) or histogram
Ratio | numeric | mode, median, mean | histogram