A categorical variable has a finite number of possible values, called categories.
Categories are the unique values that a variable can take, e.g. bike, car or walk for a variable recording the mode of transport.
\(\mathbf{A}\) is the set of all possible categories, and each \(a_{s}\), \(s=1, \ldots, k\), is one category.
For a random sample of size \(n\) of a categorical variable we can compute:
\[\text{Empirical probability:} \quad P(a_s) = P(\mathbf{x} = a_s) = \frac{\sum_{i=1}^n I(x_i = a_s)}{n} = \frac{\text{count}(x = a_s)}{n} \]
where \(I(\cdot)\) is the indicator function, equal to 1 when its condition holds and 0 otherwise.
For example:
\[P(bike) = P(Q01\_Transport = bike) = \frac{\sum_{i=1}^n I(Q01\_Transport_i = bike)}{n} = \frac{\text{count}(Q01\_Transport = bike)}{n} \]
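The formula can be traced step by step in R. The following is a minimal sketch on a small made-up vector (`transport` is hypothetical toy data, not the survey file); it relies on the fact that R treats `TRUE`/`FALSE` as 1/0 when summing.

```r
# Hypothetical toy sample of a categorical variable (not the survey data)
transport <- c("bike", "car", "bike", "walk", "car", "bike")
n <- length(transport)

# Indicator I(x_i = a_s): a logical vector, summed as 1/0
indicator <- transport == "bike"

# Sum of indicators divided by the sample size
p_bike <- sum(indicator) / n

# The mean of a logical vector gives the same value
p_bike_short <- mean(indicator)

print(p_bike)
```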
In R, categorical variables have the special class factor.
my_data <- read.table("survey.csv", header=TRUE, sep=",", stringsAsFactors=TRUE) # read data (if not done before); stringsAsFactors=TRUE keeps text columns as factors in R >= 4.0
class(my_data$Q01_Transport) # $col_name format for column subsetting
## [1] "factor"
We can get the unique categories with levels
levels(my_data[,"Q01_Transport"]) # [,col_name] format for column subsetting
## [1] "bike" "car" "moto"
## [4] "transport public" "walk"
or just the number of categories with nlevels
nlevels(my_data[, 3]) # [,col_idx] format for column subsetting
## [1] 5
To convert a variable into a factor use as.factor
my_data$Q06_Vegetarian <- as.factor(my_data$Q06_Vegetarian)
class(my_data$Q06_Vegetarian)
## [1] "factor"
First get the list of names of the columns in the data frame
names(my_data)
## [1] "Response" "Filier" "Q01_Transport" "Q02_Time"
## [5] "Q03_Distance" "Q04_Trips" "Q05_Food" "Q06_Vegetarian"
## [9] "Q07_Mode" "Q08_Eiffel"
To get the counts per category use the function table
category_counts <- table(my_data[, 7])
print(category_counts)
##
##      equilibre hit of the day         natura          pizza          salad
##             20             18              6             27              5
##       sandwich
##              9
You can add the total sum with addmargins
addmargins(category_counts)
##
##      equilibre hit of the day         natura          pizza          salad
##             20             18              6             27              5
##       sandwich            Sum
##              9             85
To get the frequencies (empirical probabilities) use the function prop.table
category_frequencies <- prop.table(category_counts)
print(category_frequencies)
##
##      equilibre hit of the day         natura          pizza          salad
##     0.23529412     0.21176471     0.07058824     0.31764706     0.05882353
##       sandwich
##     0.10588235
Frequencies (empirical probabilities) have to sum to 1
addmargins(category_frequencies)
##
##      equilibre hit of the day         natura          pizza          salad
##     0.23529412     0.21176471     0.07058824     0.31764706     0.05882353
##       sandwich            Sum
##     0.10588235     1.00000000
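To see what prop.table does, the frequencies can also be computed by hand as counts divided by their total. The sketch below retypes the counts from the table output above into a named vector (so it runs without the survey file) and compares both routes.

```r
# Counts copied from the table output above, as a named numeric vector
category_counts <- c(equilibre = 20, "hit of the day" = 18, natura = 6,
                     pizza = 27, salad = 5, sandwich = 9)

# Frequency = count / total number of observations
manual_frequencies <- category_counts / sum(category_counts)

# prop.table performs the same division
auto_frequencies <- prop.table(category_counts)

print(round(manual_frequencies, 4))
print(sum(manual_frequencies))  # frequencies sum to 1
```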
To plot the empirical probability distribution of a categorical variable (the frequencies) use barplot
barplot(category_frequencies, main=names(my_data)[7], ylab="Frequencies")
In the barplot above, the “hit of the day” category label is missing because the text is too long. You can change the category names shown in the plot with the names.arg argument.
plot_categories <- c("equ", "hod", "nat", "piz", "sal", "san")
barplot(category_frequencies, main=names(my_data)[7], ylab="Frequencies", names.arg=plot_categories)
There are many more things you can change in a barplot, e.g. color or width of the bars, horizontal / vertical orientation, etc. You can explore all the options described in the help ?barplot
?barplot
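As one possible alternative to abbreviating the labels, the bars can be drawn horizontally so that long category names fit. This is only a sketch: the counts are retyped from the table output above, and the margin values passed to par are an assumption that may need adjusting for other label lengths.

```r
# Counts copied from the table output above
category_counts <- c(equilibre = 20, "hit of the day" = 18, natura = 6,
                     pizza = 27, salad = 5, sandwich = 9)
category_frequencies <- category_counts / sum(category_counts)

# Widen the left margin so the long labels are not cut off
old_par <- par(mar = c(5, 8, 4, 2))

# horiz = TRUE draws horizontal bars, las = 1 keeps labels horizontal
bar_midpoints <- barplot(category_frequencies, horiz = TRUE, las = 1,
                         main = "Q05_Food", xlab = "Frequencies")

par(old_par)  # restore the previous graphical settings
```

barplot returns the bar midpoints, which is handy if you later want to add text or lines at the bar positions.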
Continuous variables can take any value between their minimum and maximum. They have an infinite number of possible values.
Examples of continuous variables:
For continuous variables we cannot calculate the category counts or category frequencies because there are no categories (infinitely many possible values).
To describe the probability distribution we instead look at some summary sample statistics.
\[ \textbf{sample mean: } \quad \mu = \frac{1}{n} \sum_{i=1}^n x_i \qquad (n \dots \text{number of instances}) \]
\[ \textbf{sample median: } \quad q_{0.5}: \ P(X \leq q_{0.5})=\frac{1}{2} \quad \text{and} \quad P(X \geq q_{0.5})=\frac{1}{2} \qquad \text{middle value, } 50\% \text{ quantile} \]
\[ \textbf{sample mode: } \quad m = \arg\max_x P(X = x) \qquad \text{most probable value} \]
\[ \textbf{range: } \quad r = \max_i (x_i) - \min_i (x_i) \]
\[ \textbf{interquartile range: } \quad qr = q_{0.75} - q_{0.25} \]
\[ \textbf{variance: } \quad \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 \qquad \text{average of squared deviations} \]
\[ \textbf{standard deviation: } \quad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2} \qquad \text{square root of variance} \]
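The formulas above can be evaluated directly in R. The following sketch uses a small made-up vector `x` (hypothetical data) and follows the \(\frac{1}{n}\) definitions given here; note that R's built-in var divides by \(n-1\) instead, so the built-in result is rescaled before comparing.

```r
# Hypothetical toy sample
x <- c(2, 4, 4, 5, 10)
n <- length(x)

mu <- sum(x) / n                # sample mean
sigma2 <- sum((x - mu)^2) / n   # variance with the 1/n definition
sigma <- sqrt(sigma2)           # standard deviation
r <- max(x) - min(x)            # range

# var() uses 1/(n - 1); multiply by (n - 1)/n to match the 1/n formula
sigma2_from_var <- var(x) * (n - 1) / n

print(c(mean = mu, variance = sigma2, sd = sigma, range = r))
```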
Continuous variables are numerical.
class(my_data$Q03_Distance)
## [1] "numeric"
When the class of a variable is integer, R will usually treat it as numeric, but you may decide to convert it into a factor.
Decisions like these are in the hands of the analyst, that is, you!
class(my_data$Q04_Trips)
## [1] "integer"
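The effect of that decision can be seen on a small example. This sketch uses a made-up integer vector `trips` (hypothetical, not the survey column): kept as integer it supports numeric summaries, while converted to a factor it is summarised by counts per category.

```r
# Hypothetical integer-valued variable
trips <- c(2L, 1L, 2L, 3L, 2L, 1L)
class(trips)        # "integer"

# Treated as numeric: a mean makes sense
trips_mean <- mean(trips)

# Treated as categorical: convert and count per category
trips_factor <- as.factor(trips)
class(trips_factor)  # "factor"
trip_counts <- table(trips_factor)
print(trip_counts)
```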
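To get the summary statistics for the typical values we use the mean and median functions. R has no built-in function for the statistical mode (the existing mode function returns the storage mode of an object instead).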
mean(my_data$Q03_Distance)
## [1] 15.26353
median(my_data$Q03_Distance)
## [1] 8
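If the most frequent value is needed, one possible workaround is to count the values with table and pick the largest count. This is a sketch on made-up data (`x` is hypothetical); ties are resolved in favour of the first value, and for truly continuous data where every value is unique the result is not informative.

```r
# Hypothetical sample with repeated values
x <- c(8, 3, 8, 12, 3, 8, 40)

# Count how often each distinct value occurs
value_counts <- table(x)

# which.max returns the position of the (first) largest count;
# the names of the table hold the values as text
mode_value <- as.numeric(names(value_counts)[which.max(value_counts)])

print(mode_value)
```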
To get the summary statistics for the spread/variation we begin by getting the maximum and minimum value.
max(my_data$Q03_Distance)
## [1] 177
min(my_data$Q03_Distance)
## [1] 0
We can use these to calculate the range
max(my_data$Q03_Distance) - min(my_data$Q03_Distance)
## [1] 177
For the other indicators we use the functions IQR, var and sd (note that var and sd in R divide by \(n-1\) rather than \(n\))
IQR(my_data$Q03_Distance)
## [1] 12
var(my_data$Q03_Distance)
## [1] 687.8835
sd(my_data$Q03_Distance)
## [1] 26.22753
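The interquartile range and the range can also be obtained from the more general quantile and range functions. The sketch below uses a made-up vector `x` (hypothetical data) and R's default quantile method to show how the pieces relate.

```r
# Hypothetical sample with one large outlier
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 100)

# 25% and 75% quantiles (default quantile type)
q <- quantile(x, probs = c(0.25, 0.75))

# Interquartile range by hand: q_0.75 - q_0.25
iqr_manual <- unname(q[2] - q[1])

# range() returns c(min, max); diff() gives max - min
range_width <- diff(range(x))

print(c(IQR = iqr_manual, range = range_width))
```

The outlier inflates the range but leaves the interquartile range unchanged, which is why both are usually reported.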
Recall: The relation between standard deviation and variance is \(sd(x) = \sqrt{var(x)}\).
You can check that this holds
sqrt(var(my_data$Q03_Distance)) - sd(my_data$Q03_Distance)
## [1] 0
To plot the empirical probability distribution of a continuous variable you cannot calculate the frequencies per category (there are infinitely many possible values).
To get a feeling for the empirical probability distribution we can split the interval between the minimum and maximum value of the variable into bins, e.g. split the values between \((0,5)\) into the intervals \((0,1], (1,2], (2,3], (3,4], (4,5)\), and calculate the frequency of observations falling within each interval. The result of this calculation is the histogram of the data.
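The binning step can be reproduced by hand with cut and table. This is a sketch on a made-up vector `x` (hypothetical data); cut uses right-closed intervals by default, so the last bin appears as (4,5].

```r
# Hypothetical continuous sample with values between 0 and 5
x <- c(0.5, 1.2, 1.7, 2.4, 2.6, 2.9, 3.3, 4.8)

# Assign each value to one of the bins (0,1], (1,2], ..., (4,5]
bins <- cut(x, breaks = 0:5)

# Counts and frequencies per bin, just as for a categorical variable
bin_counts <- table(bins)
bin_frequencies <- prop.table(bin_counts)

print(bin_counts)
print(bin_frequencies)
```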
To plot the histogram of the data we use the hist function.
hist(my_data$Q03_Distance)
hist(my_data$Q03_Distance, breaks=50)
Or we can even define our own bins to plot
hist(my_data$Q03_Distance, breaks=c(0, 3, 6, 9, 12, 200))
To explore other options of how to tweak your histogram, use the R help
?hist
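To inspect the numbers behind a histogram without drawing it, hist can be called with plot=FALSE. The sketch below uses a made-up vector `x` (hypothetical data) with the same custom breaks as above; with unequal bin widths hist plots densities rather than counts by default, so that bar areas, not heights, represent frequencies.

```r
# Hypothetical distances, including the extremes 0 and 177
x <- c(0, 1, 2, 4, 5, 7, 8, 10, 11, 50, 177)

# Compute the histogram without plotting it
h <- hist(x, breaks = c(0, 3, 6, 9, 12, 200), plot = FALSE)

print(h$counts)   # number of observations per bin
print(h$density)  # count / (n * bin width)

# Bar areas (density * width) sum to 1
total_area <- sum(h$density * diff(h$breaks))
```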