The goal of this tutorial is to order the levels of a factor variable by how often each level occurs, so that a bar chart of that variable shows its bars sorted by frequency. This makes the plot easier to read.
For this tutorial we will use the training set of the Titanic dataset, which can be downloaded here. We also need to load the ggplot2 and dplyr packages:
library(ggplot2)
library(dplyr)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA 56 NA NA 130 NA NA NA ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
summary(titanic)
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 female:314 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 male :577 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked
## B96 B98 : 4 C :168
## C23 C25 C27: 4 Q : 77
## G6 : 4 S :644
## C22 C26 : 3 NA's: 2
## D : 3
## (Other) :186
## NA's :687
To make the analysis easier, we will remove the rows with missing values in the Embarked variable:
titanic <- titanic[!is.na(titanic$Embarked), ]
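One thing to keep in mind when dropping rows by index: negative indexing with which() behaves unexpectedly when nothing matches, while a logical mask does not. The following is a minimal sketch on a hypothetical toy data frame (toy and its columns are illustrative names, not part of the Titanic data):

```r
# Toy data frame (hypothetical, not the Titanic data) with no missing values
toy <- data.frame(id = 1:3, port = c("S", "C", "Q"))

# which() finds no NA and returns integer(0); negating it is still integer(0),
# and indexing rows with integer(0) selects zero rows
dropped_with_which <- toy[-which(is.na(toy$port)), ]
nrow(dropped_with_which)
## [1] 0

# A logical mask keeps every row whose port is not NA
dropped_with_mask <- toy[!is.na(toy$port), ]
nrow(dropped_with_mask)
## [1] 3
```

Both forms give the same result when at least one value is missing, as is the case for Embarked here, so the difference only matters for columns that happen to be complete.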
Now let's look at the distribution of the port of embarkation:
emb_dist <- ggplot(titanic, aes(x = Embarked)) + geom_bar()
emb_dist
As we can see, the bars follow the order of the factor levels (here C, Q, S). The next step is to arrange the bars by the number of passengers in each category.
To arrange the bar chart by frequency, we must change the order of the factor levels of the Embarked variable. First, we create a new table with the frequency of every Embarked level:
freq_table <- titanic %>% group_by(Embarked) %>% summarise(freq = n())
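As a possible shortcut, dplyr's count() combines the grouping and the tally in one call. The sketch below uses a toy data frame built from the Embarked counts reported by summary() above; toy and freq_table_alt are illustrative names, not objects from the tutorial:

```r
library(dplyr)

# Toy stand-in for the Embarked column, using the counts reported by summary()
toy <- data.frame(
  Embarked = factor(rep(c("C", "Q", "S"), times = c(168, 77, 644)))
)

# count() groups by Embarked and tallies the rows in each group;
# name = "freq" sets the name of the count column
freq_table_alt <- toy %>% count(Embarked, name = "freq")
freq_table_alt
```

The result has one row per level, with the same two columns as the group_by() and summarise() version.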
Next, we reorder the levels of the Embarked variable by the frequency of each level, from most to least frequent:
titanic$Embarked <- factor(titanic$Embarked, levels = freq_table$Embarked[order(freq_table$freq, decreasing = TRUE)])
str(titanic$Embarked)
## Factor w/ 3 levels "S","C","Q": 1 2 1 1 1 3 1 1 1 2 ...
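The same reordering can also be sketched in base R alone, without a separate frequency table, by sorting the output of table(). This is an alternative under the assumption that only the level order is needed; toy, counts and toy_ordered are illustrative names:

```r
# Toy stand-in for the Embarked column, using the counts reported by summary()
toy <- factor(rep(c("C", "Q", "S"), times = c(168, 77, 644)))

# Tabulate the levels and sort the counts from largest to smallest
counts <- sort(table(toy), decreasing = TRUE)

# Rebuild the factor with the levels in that order
toy_ordered <- factor(toy, levels = names(counts))
levels(toy_ordered)
## [1] "S" "C" "Q"
```

The values of the factor are unchanged; only the order of its levels differs, which is what ggplot2 uses to place the bars.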
Finally, we plot the ordered bar chart:
emb_dist_ordered <- ggplot(titanic, aes(x = Embarked)) + geom_bar()
emb_dist_ordered
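If the data frame itself should stay untouched, one option is to reorder the levels inside aes() with reorder() from base R. This is a sketch on a toy data frame built from the counts shown earlier; toy and emb_dist_inline are illustrative names:

```r
library(ggplot2)

# Toy stand-in for the Embarked column, using the counts reported by summary()
toy <- data.frame(
  Embarked = factor(rep(c("C", "Q", "S"), times = c(168, 77, 644)))
)

# reorder() computes one score per level (here minus the number of rows in
# that level) and sorts the levels by ascending score, so the largest group
# ends up first on the x axis
emb_dist_inline <- ggplot(
  toy,
  aes(x = reorder(Embarked, Embarked, FUN = function(v) -length(v)))
) +
  geom_bar() +
  xlab("Embarked")  # restore a readable axis label
emb_dist_inline
```

The trade-off is that the ordering lives in the plot call only, so any other plot of the same column will fall back to the original level order.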
In this tutorial we have learnt how to order a bar chart by frequency. Sorting the bars this way makes the plot look more organized and lets the reader compare the categories at a glance.