Exercise 1. Variable names

The type of data we are working with will often influence the data visualization technique we use. We will be working with two types of variables: categorical and numeric. Each can be divided into two subgroups: categorical variables can be ordinal or not, whereas numeric variables can be discrete or continuous.
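As a rough illustration of these four kinds of variables, the sketch below builds one small vector of each kind and checks its class. All names and values are made up for the example and are not taken from any dataset.

```r
# Hypothetical toy values, chosen only to illustrate the four variable types.
sex    <- factor(c("Female", "Male", "Male"))        # categorical, not ordinal
size   <- factor(c("short", "tall", "medium"),
                 levels = c("short", "medium", "tall"),
                 ordered = TRUE)                      # categorical, ordinal
visits <- c(2L, 0L, 5L)                               # numeric, discrete
height <- c(61.5, 70.2, 66.0)                         # numeric, continuous

class(sex)          # "factor"
class(size)         # "ordered" "factor"
class(visits)       # "integer"
class(height)       # "numeric"
size[1] < size[2]   # TRUE: ordered factors support comparisons
```

Note that R does not enforce the discrete/continuous distinction: a discrete variable may well be stored as "numeric", so the class is only a hint about how the data should be interpreted.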

We will review data types using some of the examples provided in the dslabs package, starting with the heights dataset.

library(dslabs)
data(heights)
names(heights)
## [1] "sex"    "height"

Exercise 2. Variable type

We saw that sex is the first variable. We know what values are represented by this variable and can confirm this by looking at the first few entries:

library(dslabs)
data(heights)
head(heights)

What data type is the sex variable?
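One general way to inspect the type of each column is to apply class to every column of a data frame. The sketch below does this on a made-up data frame (toy, group, and value are hypothetical names), not on heights itself.

```r
# Hypothetical toy data frame with one categorical and one numeric column.
toy <- data.frame(
  group = factor(c("a", "b", "a", "b")),
  value = c(1.5, 2.0, 3.25, 4.0)
)

col_types <- sapply(toy, class)   # class of every column, as a named vector
col_types
levels(toy$group)                 # the categories a factor column can take
```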

Exercise 3. Numerical values

Keep in mind that discrete numeric data can be considered ordinal. Although this is technically true, we usually reserve the term ordinal data for variables belonging to a small number of different groups, with each group having many members.
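To make the distinction concrete, the sketch below bins a handful of numeric values into three ordered groups with cut. The values and the cutoffs (63 and 70 inches) are assumptions picked purely for illustration.

```r
heights_in <- c(58, 62.5, 66, 69, 71.5, 75)    # hypothetical heights in inches

groups <- cut(heights_in,
              breaks = c(-Inf, 63, 70, Inf),    # assumed cutoffs, for illustration only
              labels = c("short", "medium", "tall"),
              ordered_result = TRUE)

length(unique(heights_in))   # 6 distinct numeric values
nlevels(groups)              # 3 ordered groups
table(groups)                # each group has 2 members in this toy example
```

With only a few groups, each holding several observations, the binned variable fits the usual sense of ordinal data far better than the raw numbers do.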

The height variable could be ordinal if, for example, we reported only a small number of values such as short, medium, and tall. Let's explore how many unique values are used by the height variable. For this we can use the unique function:

x <- c(3, 3, 3, 3, 4, 4, 2)
unique(x)

Use the unique and length functions to determine how many unique heights were reported.

library(dslabs)
data(heights)
x <- heights$height
unique(x)
##   [1] 75.00000 70.00000 68.00000 74.00000 61.00000 65.00000 66.00000 62.00000
##   [9] 67.00000 72.00000 69.00000 64.00000 60.00000 73.00000 71.00000 66.75000
##  [17] 63.00000 70.50000 64.96063 64.17320 68.50000 78.74016 69.60000 66.50000
##  [25] 71.50000 76.00000 74.50000 78.00000 62.50000 77.00000 68.89000 64.17300
##  [33] 59.00000 67.70000 71.70000 70.87000 68.11000 64.57000 51.00000 59.05512
##  [41] 80.00000 64.96000 68.89764 69.68504 72.05000 72.50000 70.07874 53.00000
##  [49] 64.17323 59.05510 66.92000 70.86614 74.80000 68.40000 69.30000 66.53543
##  [57] 68.89760 61.81102 72.04724 67.72000 72.40000 79.05000 63.77953 66.40000
##  [65] 69.29000 66.92913 66.14160 74.80315 72.44094 53.77000 65.74803 67.71650
##  [73] 54.00000 68.11020 64.96100 67.71654 66.14173 70.80000 77.16540 69.29134
##  [81] 67.50000 72.83000 66.70000 68.11024 68.50394 67.78000 67.71000 79.00000
##  [89] 68.80000 70.10000 73.20000 73.62000 68.90000 66.90000 66.92910 70.85000
##  [97] 72.44000 61.32000 66.93000 58.00000 55.00000 73.22000 66.14170 62.99213
## [105] 70.08000 67.20000 72.45000 75.98000 75.59055 70.86600 62.40000 75.60000
## [113] 71.65000 62.60000 67.30000 64.20000 66.14000 64.50000 70.86610 75.40000
## [121] 72.83460 71.65354 64.56693 72.83465 78.74000 64.90000 59.84252 70.86000
## [129] 62.20472 68.11024 70.47244 52.00000 81.00000 68.89764 62.59843 82.67717
## [137] 50.00000 73.22835 63.38583
length(unique(x))
## [1] 139

Exercise 4. Tables

One of the useful outputs of data visualization is that we can learn about the distribution of variables. For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table. Here is an example:

x <- c(3, 3, 3, 3, 4, 4, 2)
table(x)

Use the table function to compute the frequency of each unique height value. Because the resulting frequency table is used in a later exercise, save the result in an object called tab.

x <- c(3, 3, 3, 3, 4, 4, 2)
table(x)
## x
## 2 3 4 
## 1 4 2

library(dslabs)
data(heights)
x <- heights$height
tab <- table(x)
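It may help to see what kind of object table returns before reusing it. The sketch below works on the same small vector as the example above; tab_toy is a hypothetical name used so that it does not overwrite tab.

```r
x_toy <- c(3, 3, 3, 3, 4, 4, 2)    # same toy values as in the table example
tab_toy <- table(x_toy)

names(tab_toy)                 # "2" "3" "4": the unique values, stored as character names
as.integer(tab_toy["3"])       # 4: look up the count for one value by its name
prop.table(tab_toy)            # counts converted to proportions that sum to 1
```

Because the unique values become character names, they need to be converted back (for example with as.numeric) before being used as numbers.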

Exercise 5. Indicator variables

To see why treating the reported heights as ordinal is not useful in practice, note how many values are reported only once.

In the previous exercise we computed the variable tab, which reports the number of times each unique value appears. For values reported only once, the corresponding entry of tab will be 1. Use a logical comparison and the function sum to count the number of times this happens.

library(dslabs)
data(heights)
tab <- table(heights$height)
sum(tab==1)
## [1] 63
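The count works because R treats TRUE as 1 and FALSE as 0 when summing a logical vector. A minimal sketch on made-up values (tab_toy is a hypothetical name) shows each step:

```r
tab_toy <- table(c(3, 3, 3, 3, 4, 4, 2, 7))    # hypothetical values

once <- as.vector(tab_toy == 1)   # TRUE where a value appears exactly once
once                              # TRUE FALSE FALSE TRUE
sum(once)                         # 2: number of values reported only once
mean(once)                        # 0.5: share of unique values reported only once
names(tab_toy)[once]              # "2" "7": which values those are
```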

Exercise 6. Data types - heights

Since there are a finite number of reported heights and technically the height can be considered ordinal, which of the following is true:

Possible Answers: