3 Univariate analysis

In univariate analysis we focus on a single variable (attribute).

3.1 Categorical variables

A categorical variable has a finite number of possible values, called categories.

Categories are the unique values that a variable can take, e.g.

Coin = {heads, tails}
Transport = {bike, moto, car, walk, transport public}
\(\mathbf{A} = \{a_1, a_2, \ldots, a_k\}\)

\(\mathbf{A}\) is the set of all possible categories, each \(a_{s}, s=1, \ldots, k\) is one category

3.1.1 Sample statistics

For a random sample of size \(n\) of a categorical variable we can compute

counts per category \(count(x = a_s) = \sum_{i=1}^n I(x_i = a_s)\)
frequencies per category = empirical probability

\[\text{Empirical probability:} \quad P(a_s) = P(\mathbf{x} = a_s) = \frac{\sum_{i=1}^n I(x_i = a_s)}{n} = \frac{count(x = a_s)}{n} \]

Here \(I(\cdot)\) is the indicator function: it equals 1 when the condition inside holds and 0 otherwise.

For example:

\[P(bike) = P(Q01\_Transport = bike) = \frac{\sum_{i=1}^n I(Q01\_Transport_i = bike)}{n} = \frac{count(Q01\_Transport = bike)}{n} \]
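
The formula above can be traced step by step in R. This is a minimal sketch on a hypothetical toy vector `x` (not the survey data); the variable names are chosen only for illustration.

```r
# Hypothetical toy sample (an assumption, not the survey data)
x <- c("bike", "car", "bike", "walk", "car", "bike")
n <- length(x)

# The indicator I(x_i = "bike") is TRUE/FALSE; sum() counts the TRUE values
count_bike <- sum(x == "bike")

# Empirical probability = count divided by the sample size
p_bike <- count_bike / n

print(count_bike)  # 3
print(p_bike)      # 0.5

# mean() of the indicator gives the same empirical probability in one step
p_bike_short <- mean(x == "bike")
```

The last line works because the mean of a 0/1 indicator is exactly the count divided by \(n\).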

3.1.2 Properties of probability

\(P(a_s) \geq 0\) for all \(a_s \in \mathbf{A}\)
\(\sum_{s=1}^k P(a_s) = 1\)
\(P(a_s \cup a_r) = P(a_s) + P(a_r)\) for \(a_s, a_r\) disjoint
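
These three properties can be checked numerically for empirical probabilities. The sketch below uses a hypothetical toy vector, so the categories and values are assumptions made for illustration only.

```r
# Hypothetical toy sample (an assumption, not the survey data)
x <- c("bike", "car", "bike", "walk", "car", "bike")
p <- prop.table(table(x))    # empirical probability of each category

# Property 1: every probability is non-negative
all_non_negative <- all(p >= 0)

# Property 2: the probabilities sum to 1 (up to floating-point rounding)
total <- sum(p)

# Property 3: for two different (disjoint) categories the probabilities add up
p_union <- mean(x %in% c("bike", "car"))   # P(bike or car) computed directly
p_sum <- sum(p[c("bike", "car")])          # P(bike) + P(car)

print(all_non_negative)
print(isTRUE(all.equal(total, 1)))
print(isTRUE(all.equal(p_union, p_sum)))
```

The comparison uses `all.equal` rather than `==` because sums of fractions are subject to small floating-point rounding errors.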

3.2 Categorical variables in R

In R, categorical variables have the special class factor

my_data <- read.table("survey.csv", header=TRUE, sep=",", stringsAsFactors=TRUE) # read data (if not done before); since R 4.0.0 text columns become factors only with stringsAsFactors=TRUE

class(my_data$Q01_Transport) # $col_name format for column subsetting

## [1] "factor"

We can get the unique categories with levels

levels(my_data[,"Q01_Transport"]) # [,col_name] format for column subsetting

## [1] "bike"             "car"              "moto"            
## [4] "transport public" "walk"

or just the number of categories with nlevels

nlevels(my_data[, 3]) # [,col_idx] format for column subsetting

## [1] 5

To convert a variable into a factor use as.factor

my_data$Q06_Vegetarian <- as.factor(my_data$Q06_Vegetarian)
class(my_data$Q06_Vegetarian)

## [1] "factor"

3.2.1 Sample statistics in R

First get the list of names of the columns in the data frame

names(my_data)

##  [1] "Response"       "Filier"         "Q01_Transport"  "Q02_Time"      
##  [5] "Q03_Distance"   "Q04_Trips"      "Q05_Food"       "Q06_Vegetarian"
##  [9] "Q07_Mode"       "Q08_Eiffel"

To get the counts per category use the function table

category_counts <- table(my_data[, 7])
print(category_counts)

## 
##      equilibre hit of the day         natura          pizza          salad 
##             20             18              6             27              5 
##       sandwich 
##              9

You can add the total sum with addmargins

addmargins(category_counts)

## 
##      equilibre hit of the day         natura          pizza          salad 
##             20             18              6             27              5 
##       sandwich            Sum 
##              9             85

To get the frequencies (empirical probabilities) use the function prop.table

category_frequencies <- prop.table(category_counts)
print(category_frequencies)

## 
##      equilibre hit of the day         natura          pizza          salad 
##     0.23529412     0.21176471     0.07058824     0.31764706     0.05882353 
##       sandwich 
##     0.10588235

Frequencies (empirical probabilities) have to sum to 1

addmargins(category_frequencies)

## 
##      equilibre hit of the day         natura          pizza          salad 
##     0.23529412     0.21176471     0.07058824     0.31764706     0.05882353 
##       sandwich            Sum 
##     0.10588235     1.00000000
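
Counts and frequencies are often easier to read when sorted or shown as percentages. The following is a small sketch on a hypothetical toy vector `food`; it only combines `table` and `prop.table` with the base functions `sort` and `round`.

```r
# Hypothetical toy sample (an assumption, not the survey data)
food <- c("pizza", "salad", "pizza", "sandwich", "pizza", "salad")

counts <- table(food)

# Sort the categories from most to least frequent
sorted_counts <- sort(counts, decreasing = TRUE)
print(sorted_counts)

# Frequencies expressed as percentages, rounded to one decimal place
percentages <- round(100 * prop.table(counts), 1)
print(percentages)
```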

3.2.2 Plot categorical distribution

To plot the empirical probability distribution of a categorical variable (the frequencies) use barplot

barplot(category_frequencies, main=names(my_data)[7], ylab="Frequencies")

In the barplot above, the “hit of the day” category label is missing because the text is too long. You can change the category names shown on the plot with the names.arg argument.

plot_categories <- c("equ", "hod", "nat", "piz", "sal", "san")
barplot(category_frequencies, main=names(my_data)[7], ylab="Frequencies", names.arg=plot_categories)

There are many more things you can change in a barplot, e.g. color or width of the bars, horizontal / vertical orientation, etc. You can explore all the options described in the help ?barplot

?barplot
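
As one illustration of these options, long category names can also be kept in full by drawing horizontal bars. This is a sketch with hypothetical frequencies (not the survey results); the colour and margin sizes are arbitrary choices.

```r
# Hypothetical frequencies (an assumption, not the survey results)
freqs <- c("equilibre" = 0.25, "hit of the day" = 0.45, "pizza" = 0.30)

old_par <- par(mar = c(5, 8, 4, 2))   # widen the left margin for long labels

# horiz = TRUE draws horizontal bars; las = 1 keeps all axis labels horizontal
bar_mids <- barplot(freqs, horiz = TRUE, las = 1, col = "steelblue",
                    main = "Toy food data", xlab = "Frequencies")

par(old_par)                          # restore the previous margins
```

`barplot` returns the positions of the bar midpoints, which can be useful when adding text or points to the plot afterwards.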

3.3 Continuous variables

Continuous variables can take any value between their minimum and maximum values, so they have infinitely many possible values.

Examples of continuous variables:

Temperature in degrees Celsius (e.g. \(36.7\))
Salary in CHF (e.g. 7510)
Guessed height of the Eiffel Tower in m (e.g. 365)

3.3.1 Sample statistics

For continuous variables we cannot calculate the category counts or category frequencies because there are no categories (infinitely many possible values).

To describe the probability distribution we instead look at some summary sample statistics.

3.3.1.1 Typical values

\[ \textbf{sample mean: } \quad \mu = \frac{1}{n} \sum_{i=1}^n x_i \qquad (n \dots \text{number of instances}) \]

\[ \textbf{sample median: } \quad q_{0.5}: \ P(X \leq q_{0.5}) \geq \frac{1}{2} \quad \text{and} \quad P(X \geq q_{0.5}) \geq \frac{1}{2} \qquad \text{middle value, } 50\% \text{ quantile} \]

\[ \textbf{sample mode: } \quad m = \text{arg} \max_x P(X = x) \qquad \text{most probable value} \]
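
To make the first two definitions concrete, the sketch below computes the mean and the median by hand for a hypothetical toy sample and compares them with R's built-in functions. The sample has an odd number of values, so the median is simply the middle sorted value.

```r
# Hypothetical toy sample (an assumption, not the survey data)
x <- c(2, 4, 4, 7, 13)
n <- length(x)

# Sample mean: sum of the values divided by n, (2+4+4+7+13)/5 = 6
mean_by_hand <- sum(x) / n

# Sample median: the middle value of the sorted sample (valid for odd n only)
x_sorted <- sort(x)
median_by_hand <- x_sorted[(n + 1) / 2]

print(mean_by_hand)    # 6
print(median_by_hand)  # 4

# The built-in functions give the same results
print(mean(x) == mean_by_hand)
print(median(x) == median_by_hand)
```

Note how the single large value 13 pulls the mean above the median.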

3.3.1.2 Indication of spread/variation in values

\[ \textbf{range: } \quad r = \max_i (x_i) - \min_i (x_i) \]

\[ \textbf{interquartile range: } \quad qr = q_{0.75} - q_{0.25} \]

\[ \textbf{variance: } \quad \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 \qquad \text{average of squared deviations} \]

\[ \textbf{standard deviation: } \quad \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2} \qquad \text{square root of variance} \]
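
The spread indicators can be traced on a small example. This sketch applies the formulas above literally (with the \(\frac{1}{n}\) factor) to a hypothetical toy sample; the quartiles are taken from R's `quantile` function with its default settings.

```r
# Hypothetical toy sample (an assumption, not the survey data)
x <- c(2, 4, 4, 7, 13)
n <- length(x)
mu <- sum(x) / n                        # sample mean = 6

# Range: largest minus smallest value
r <- max(x) - min(x)                    # 13 - 2 = 11

# Interquartile range: 75% quantile minus 25% quantile
q <- quantile(x, probs = c(0.25, 0.75))
qr <- unname(q[2] - q[1])               # 7 - 4 = 3

# Variance as defined above: average of the squared deviations
sigma2 <- sum((x - mu)^2) / n           # (16+4+4+1+49)/5 = 14.8

# Standard deviation: square root of the variance
sigma <- sqrt(sigma2)

print(c(range = r, iqr = qr, variance = sigma2, sd = sigma))
```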

3.4 Continuous variables in R

Continuous variables are numerical.

class(my_data$Q03_Distance)

## [1] "numeric"

When the class of a variable is integer, R will usually treat it as numeric, but you may decide to convert it into a factor.

Decisions like these are in the hands of the analyst, that is, you!

class(my_data$Q04_Trips)

## [1] "integer"
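
The difference between the two treatments can be seen on a small example. The sketch below uses a hypothetical integer vector `trips` (not the survey column) and shows one common pitfall when converting a factor back to numbers.

```r
# Hypothetical integer sample (an assumption, not the survey data)
trips <- c(2L, 4L, 2L, 10L, 4L, 2L)
class(trips)                    # "integer"

# Treated as numeric: arithmetic summaries make sense
trips_mean <- mean(trips)       # 24 / 6 = 4

# Treated as categorical: each distinct value becomes a category
trips_f <- factor(trips)
trips_levels <- levels(trips_f) # "2" "4" "10"
print(table(trips_f))

# Pitfall: as.numeric() on a factor returns the internal level codes,
# not the original values; convert via character to recover the values
codes <- as.numeric(trips_f)
values <- as.numeric(as.character(trips_f))
print(codes)
print(values)
```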

3.4.1 Sample statistics in R

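To get the summary statistics for the typical values we use the mean and median functions. R has no built-in function for the statistical mode: the existing mode function returns the storage mode of an object, not its most frequent value.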

mean(my_data$Q03_Distance)

## [1] 15.26353

median(my_data$Q03_Distance)

## [1] 8
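
If the most frequent value is needed, it can be obtained from a frequency table. The helper `sample_mode` below is a hypothetical function written for this illustration, not part of base R; it returns every value that attains the highest count, as character strings.

```r
# Hypothetical helper (not part of base R): most frequent value(s) of x
sample_mode <- function(x) {
  counts <- table(x)                      # counts per distinct value
  names(counts)[counts == max(counts)]    # value(s) with the highest count
}

# Hypothetical toy sample (an assumption, not the survey data)
distances <- c(8, 3, 8, 12, 3, 8)
print(sample_mode(distances))             # "8"

# With ties, all tied values are returned
print(sample_mode(c("a", "b", "a", "b", "c")))
```

For truly continuous measurements most values occur only once, so a mode computed this way is usually informative only for rounded or repeated values.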

To get the summary statistics for the spread/variation we begin by getting the maximum and minimum value.

max(my_data$Q03_Distance)

## [1] 177

min(my_data$Q03_Distance)

## [1] 0

We can use these to calculate the range

max(my_data$Q03_Distance) - min(my_data$Q03_Distance)

## [1] 177

For the other indicators we use the functions IQR, var and sd

IQR(my_data$Q03_Distance)

## [1] 12

var(my_data$Q03_Distance)

## [1] 687.8835

sd(my_data$Q03_Distance)

## [1] 26.22753

Recall: The relation between standard deviation and variance is \(sd(x) = \sqrt{var(x)}\).

You can check it is true

sqrt(var(my_data$Q03_Distance)) - sd(my_data$Q03_Distance)

## [1] 0
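
One detail to keep in mind: R's `var` and `sd` divide by \(n - 1\) rather than by \(n\), so their results differ slightly from a variance formula that uses \(\frac{1}{n}\). The sketch below shows the relation on a hypothetical toy sample and uses `all.equal`, which compares numbers up to a small tolerance and is safer than expecting an exact zero difference.

```r
# Hypothetical toy sample (an assumption, not the survey data)
x <- c(2, 4, 4, 7, 13)
n <- length(x)

var_n <- mean((x - mean(x))^2)   # divides by n:       74 / 5 = 14.8
var_r <- var(x)                  # divides by (n - 1): 74 / 4 = 18.5

print(var_n)
print(var_r)

# Rescaling R's result recovers the 1/n version
print(isTRUE(all.equal(var_r * (n - 1) / n, var_n)))

# sd() is the square root of var(), compared with a tolerance
print(isTRUE(all.equal(sqrt(var(x)), sd(x))))
```

For large samples the two versions are nearly identical, since \(\frac{n-1}{n}\) is then close to 1.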

3.4.2 Plot continuous distribution

To plot the empirical probability distribution of a continuous variable you cannot calculate the frequencies per category (there are infinitely many possible values).

To get a feeling for the empirical probability distribution we can split the interval between the minimum and maximum possible value of a variable into bins, e.g. split the values in \((0,5)\) into the intervals \((0,1], (1,2], (2,3], (3,4], (4,5)\), and calculate the frequency of the variable falling within each interval. This calculation is called creating the histogram of the data.
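
The binning step can be done by hand to see what a histogram computes. This is a minimal sketch on a hypothetical toy sample, using the base functions `cut` (assigns each value to an interval) and `table` (counts the values per interval).

```r
# Hypothetical toy sample with values between 0 and 5 (an assumption)
x <- c(0.4, 1.2, 1.9, 2.5, 3.1, 3.3, 4.8)

# Assign each value to one of the bins [0,1], (1,2], (2,3], (3,4], (4,5]
# include.lowest = TRUE makes the first bin contain its left end point
bins <- cut(x, breaks = 0:5, include.lowest = TRUE)

# Counts and frequencies per bin
bin_counts <- table(bins)
bin_frequencies <- prop.table(bin_counts)

print(bin_counts)
print(bin_frequencies)
```

Once the values are binned, the continuous variable is handled exactly like a categorical one, with the bins playing the role of the categories.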

To plot the histogram of the data we use the hist function.

hist(my_data$Q03_Distance)

hist(my_data$Q03_Distance, breaks=50) # a single number suggests the number of bins; R may choose a nearby "pretty" value

Or we can even define our own bins to plot

hist(my_data$Q03_Distance, breaks=c(0, 3, 6, 9, 12, 200))
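
The numbers behind a histogram can be inspected without drawing it. The sketch below calls `hist` with `plot = FALSE` on a hypothetical toy sample and looks at the returned breaks, counts and densities.

```r
# Hypothetical toy sample (an assumption, not the survey data)
x <- c(0.4, 1.2, 1.9, 2.5, 3.1, 3.3, 4.8)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(x, breaks = 0:5, plot = FALSE)

print(h$breaks)    # bin boundaries
print(h$counts)    # number of values in each bin
print(h$density)   # counts / (n * bin width)

# The densities times the bin widths add up to 1
total_area <- sum(h$density * diff(h$breaks))
print(total_area)
```

When the bins have unequal widths, `hist` plots these densities rather than the raw counts by default, so that the area of each bar stays proportional to the share of values in its bin.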

To explore other options of how to tweak your histogram, use the R help

?hist

Analyse du SI d'enterprise 2

Magda Gregorova

5/3/2019