Titanic

This document explores various data types in R using the built-in data set ‘Titanic’.

summary(Titanic)
## Number of cases in table: 2201 
## Number of factors: 4 
## Test for independence of all factors:
##  Chisq = 1637.4, df = 25, p-value = 0
##  Chi-squared approximation may be incorrect
str(Titanic)
##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"
class(Titanic)
## [1] "table"

The Titanic data set is of class ‘table’, as can be seen above. This table structure is a little difficult to use for further analysis, so let's convert it into a more user-friendly data frame.

titanicDataFrame <- as.data.frame(Titanic)
str(titanicDataFrame)
## 'data.frame':    32 obs. of  5 variables:
##  $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Sex     : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : Factor w/ 2 levels "Child","Adult": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq    : num  0 0 35 0 0 0 17 0 118 154 ...

This data frame is much easier to read, understand, and use for further analysis.
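As a quick sanity check on the conversion, the sketch below confirms that the data frame holds one row per cell of the original table and that the counts still add up to the 2201 cases reported by summary(). The name backToTable is only an illustrative choice, and xtabs is one possible way to rebuild the table from the data frame.

```r
# Convert the built-in table to a data frame (same call as in the text)
titanicDataFrame <- as.data.frame(Titanic)

# One row per cell of the 4 x 2 x 2 x 2 table
nrow(titanicDataFrame)        # 32
prod(dim(Titanic))            # 32

# The Freq column carries the cell counts, so the total is unchanged
sum(titanicDataFrame$Freq)    # 2201

# Rebuild a contingency table from the data frame (illustrative name)
backToTable <- xtabs(Freq ~ Class + Sex + Age + Survived,
                     data = titanicDataFrame)
all(as.vector(backToTable) == as.vector(Titanic))   # TRUE
```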

titanicDataFrame
##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3    3rd   Male Child       No   35
## 4   Crew   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 7    3rd Female Child       No   17
## 8   Crew Female Child       No    0
## 9    1st   Male Adult       No  118
## 10   2nd   Male Adult       No  154
## 11   3rd   Male Adult       No  387
## 12  Crew   Male Adult       No  670
## 13   1st Female Adult       No    4
## 14   2nd Female Adult       No   13
## 15   3rd Female Adult       No   89
## 16  Crew Female Adult       No    3
## 17   1st   Male Child      Yes    5
## 18   2nd   Male Child      Yes   11
## 19   3rd   Male Child      Yes   13
## 20  Crew   Male Child      Yes    0
## 21   1st Female Child      Yes    1
## 22   2nd Female Child      Yes   13
## 23   3rd Female Child      Yes   14
## 24  Crew Female Child      Yes    0
## 25   1st   Male Adult      Yes   57
## 26   2nd   Male Adult      Yes   14
## 27   3rd   Male Adult      Yes   75
## 28  Crew   Male Adult      Yes  192
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80
## 31   3rd Female Adult      Yes   76
## 32  Crew Female Adult      Yes   20
class(titanicDataFrame)
## [1] "data.frame"
class(titanicDataFrame$Class)
## [1] "factor"
class(titanicDataFrame$Sex)
## [1] "factor"
class(titanicDataFrame$Freq)
## [1] "numeric"

Let us draw some plots of the Titanic data frame. I am using the funModeling library to do so.

library(funModeling)

The freq function plots a frequency graph for every categorical variable in the data frame. Similarly, the plot_num function plots a graph for every numerical variable in the data frame.

freq(titanicDataFrame)

##   Class frequency percentage cumulative_perc
## 1   1st         8         25              25
## 2   2nd         8         25              50
## 3   3rd         8         25              75
## 4  Crew         8         25             100

##      Sex frequency percentage cumulative_perc
## 1   Male        16         50              50
## 2 Female        16         50             100

##     Age frequency percentage cumulative_perc
## 1 Child        16         50              50
## 2 Adult        16         50             100

##   Survived frequency percentage cumulative_perc
## 1       No        16         50              50
## 2      Yes        16         50             100
## [1] "Variables processed: Class, Sex, Age, Survived"
plot_num(titanicDataFrame)
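Note that the frequencies above count rows of the data frame (8 rows per class), not passengers, because each row is an aggregated cell with its own Freq count. A minimal sketch of passenger-level totals, assuming base R's xtabs is an acceptable substitute for the funModeling output, weights each row by Freq. The names classTotals and survivalByClass are illustrative.

```r
titanicDataFrame <- as.data.frame(Titanic)

# Passengers per class: sum Freq within each level of Class
classTotals <- xtabs(Freq ~ Class, data = titanicDataFrame)
classTotals
# 1st: 325, 2nd: 285, 3rd: 706, Crew: 885

# Passengers by class and survival, plus row-wise proportions
survivalByClass <- xtabs(Freq ~ Class + Survived, data = titanicDataFrame)
survivalByClass
round(prop.table(survivalByClass, margin = 1), 2)
```

Weighting by Freq gives the picture most readers expect from a frequency plot of this data set, since the unweighted counts are identical for every level.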

We can also convert any of the “factor” or “numeric” variables into other structures, such as a matrix.

matrixFreq <- as.matrix(titanicDataFrame$Freq)
matrixFreq
##       [,1]
##  [1,]    0
##  [2,]    0
##  [3,]   35
##  [4,]    0
##  [5,]    0
##  [6,]    0
##  [7,]   17
##  [8,]    0
##  [9,]  118
## [10,]  154
## [11,]  387
## [12,]  670
## [13,]    4
## [14,]   13
## [15,]   89
## [16,]    3
## [17,]    5
## [18,]   11
## [19,]   13
## [20,]    0
## [21,]    1
## [22,]   13
## [23,]   14
## [24,]    0
## [25,]   57
## [26,]   14
## [27,]   75
## [28,]  192
## [29,]  140
## [30,]   80
## [31,]   76
## [32,]   20
class(matrixFreq)
## [1] "matrix"
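The class() output for a matrix depends on the R version: from R 4.0.0 onward a matrix also inherits from "array", so class() returns two values there. The sketch below shows checks that do not depend on that detail. It rebuilds matrixFreq so that it runs on its own.

```r
titanicDataFrame <- as.data.frame(Titanic)
matrixFreq <- as.matrix(titanicDataFrame$Freq)

# Version-independent ways to confirm the object is a matrix
is.matrix(matrixFreq)             # TRUE
inherits(matrixFreq, "matrix")    # TRUE

# A vector converted with as.matrix becomes a single-column matrix
dim(matrixFreq)                   # 32 1
```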

We can further perform calculations such as mean, median, mode, SD, variance, distribution, skewness, and kurtosis on this single variable. Let us take the simple Freq variable in its vector form.

freqVector <- as.vector(titanicDataFrame$Freq)
freqVector
##  [1]   0   0  35   0   0   0  17   0 118 154 387 670   4  13  89   3   5
## [18]  11  13   0   1  13  14   0  57  14  75 192 140  80  76  20
class(freqVector)
## [1] "numeric"
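The calculations mentioned above can be sketched in base R as follows. R has no built-in function for the statistical mode, skewness, or kurtosis, so this sketch computes them by hand using simple moment-based definitions, which is an assumption on my part (packages may use slightly different formulas). The statistics describe the 32 cell counts, not individual passengers.

```r
freqVector <- as.vector(as.data.frame(Titanic)$Freq)

mean(freqVector)      # 68.78125
median(freqVector)    # 13.5
sd(freqVector)        # sample standard deviation
var(freqVector)       # sample variance

# Mode: the most frequent value (first one if there are ties)
counts <- table(freqVector)
modeValue <- as.numeric(names(counts)[which.max(counts)])
modeValue             # 0

# Moment-based skewness and kurtosis (population-style definitions)
centered <- freqVector - mean(freqVector)
m2 <- mean(centered^2)
skewness <- mean(centered^3) / m2^1.5
kurtosis <- mean(centered^4) / m2^2
skewness
kurtosis
```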

Let us try to plot a line chart for this freqVector to see its distribution.

plot(freqVector, type = "l")

Some more examples follow:

class('2')
## [1] "character"
class(2.2)
## [1] "numeric"
class(2.2L)
## [1] "numeric"
class(2.0L)
## [1] "integer"
class(2L)
## [1] "integer"
class("Vikas Test")
## [1] "character"
str(2.2)
##  num 2.2
str('Test')
##  chr "Test"
str(2L)
##  int 2
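The examples above hinge on the L suffix, which asks R for an integer literal. A literal such as 2.2L cannot be an integer, so R keeps it numeric (and typically warns about it). The sketch below uses typeof and the is.* helpers to make the distinction explicit.

```r
# A plain number is stored as a double and reported as "numeric" by class()
typeof(2)         # "double"
class(2)          # "numeric"

# The L suffix produces an integer
typeof(2L)        # "integer"
is.integer(2)     # FALSE
is.integer(2L)    # TRUE

# Both doubles and integers count as numeric
is.numeric(2L)    # TRUE

# Explicit conversion truncates toward zero
as.integer(2.9)   # 2
```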

You can practice a lot with data types by using your own data sets, the built-in data sets, or by generating vectors, matrices, lists, factors, etc. Try it!
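As a starting point for that practice, here is a small sketch that builds one object of each kind and inspects it. All object names and values are made up for illustration.

```r
# A numeric vector
practiceVector <- c(1.5, 2.5, 3.5)
class(practiceVector)              # "numeric"

# A 2 x 3 integer matrix
practiceMatrix <- matrix(1:6, nrow = 2)
is.matrix(practiceMatrix)          # TRUE
dim(practiceMatrix)                # 2 3

# A list can mix types
practiceList <- list(name = "Titanic", cases = 2201L)
class(practiceList)                # "list"
sapply(practiceList, class)        # name: "character", cases: "integer"

# A factor stores categories as levels
practiceFactor <- factor(c("1st", "3rd", "1st", "Crew"))
class(practiceFactor)              # "factor"
levels(practiceFactor)             # "1st" "3rd" "Crew"
```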