SOLUTION Day 2: Types of Variables and Summary Statistics

BACKGROUND

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable,” sank after colliding with an iceberg. There were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Load Data

You can use getwd() to check your working directory.
In a file path, .. means "go up one level" to the parent directory.

remove(list = ls())

?getwd # returns an absolute filepath representing the current working directory of the R process

getwd()

[1] "/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/BCE_Summer/Data Analysis/Summer 2025/shared/Day 3"

dir() # returns a character vector of the names of files and directories in the working directory

 [1] "~$Programming and Data Skills Quiz.docx"                                 
 [2] "datasets_import"                                                         
 [3] "II. Basics of R Programming_files"                                       
 [4] "II. Basics of R Programming.html"                                        
 [5] "II. Basics of R Programming.qmd"                                         
 [6] "III. Data Manipulation_files"                                            
 [7] "III. Data Manipulation.html"                                             
 [8] "III. Data Manipulation.qmd"                                              
 [9] "III.-Data-Manipulation_files"                                            
[10] "images"                                                                  
[11] "IV. Data Visualisation_files"                                            
[12] "IV. Data Visualisation.html"                                             
[13] "IV. Data Visualisation.qmd"                                              
[14] "IV.-Data-Visualisation_files"                                            
[15] "R Programming and Data Skills Quiz.docx"                                 
[16] "rsconnect"                                                               
[17] "SOLUTION_day2 discussion_types of variables on titanic dataset_files"    
[18] "SOLUTION_day2 discussion_types of variables on titanic dataset.html"     
[19] "SOLUTION_day2 discussion_types of variables on titanic dataset.qmd"      
[20] "SOLUTION_day2 discussion_types of variables on titanic dataset.rmarkdown"
[21] "SOLUTION_day2-discussion_types-of-variables-on-titanic-dataset_files"    
[22] "SOLUTION_day2-discussion_types-of-variables-on-titanic-dataset.rmarkdown"

train <- read.csv("../Day 2/train.csv")
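The relative path above resolves only if the working directory is the folder shown by getwd(). As a minimal sketch (the name csv_path is an illustrative assumption, not part of the original solution), the path can be built and checked before reading:

```r
# Build the relative path piece by piece; ".." climbs to the parent directory
csv_path <- file.path("..", "Day 2", "train.csv")
print(csv_path)

# Show the absolute path this resolves to from the current working directory
print(normalizePath(csv_path, mustWork = FALSE))

# Read the file only if it exists; otherwise explain what to check
if (file.exists(csv_path)) {
  train <- read.csv(csv_path)
} else {
  message("File not found at ", csv_path, "; check getwd() and the folder layout.")
}
```

If the message appears, the working directory is probably not the folder that sits next to "Day 2".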

Exploratory Data Analysis

?head
head(train, n = 4)

  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S

Variables with missing values

visdat plots every cell of the data frame, so it works well only for small datasets.

# install.packages("visdat")

library(visdat)
?visdat
df <- train 

vis_miss(df)

vis_dat(df)

library(psych)
age_summ_stats <- describe(df$Age)

typeof(age_summ_stats)

[1] "list"

length(df$Age) - age_summ_stats$n # total rows minus n, the count of non-missing values

[1] 177

Age has 177 missing values.

class(df$Age)

[1] "numeric"

head(is.na(df$Age)) # logical vector of T/F

[1] FALSE FALSE FALSE FALSE FALSE  TRUE

class(is.na(df$Age))

[1] "logical"

temp1 <- is.na(df$Age)
temp2 <- as.numeric(is.na(df$Age))

class(as.numeric(is.na(df$Age)))

[1] "numeric"

sum(is.na(df$Age))

[1] 177

You get the same answer, 177 missing values, because sum() counts each TRUE as 1 and each FALSE as 0.
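The same is.na() idea extends to every column at once. The sketch below uses a small made-up data frame (toy, modelled on the first six rows of the Titanic data) rather than the full file, so it runs on its own; the object names are assumptions for illustration.

```r
# Toy data frame modelled on the first six passengers (illustrative only)
toy <- data.frame(
  Age   = c(22, 38, 26, 35, 35, NA),
  SibSp = c(1, 1, 0, 1, 0, 0),
  Fare  = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# is.na() on a data frame returns a logical matrix;
# colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# colMeans() gives the share of missing values per column
na_share <- colMeans(is.na(toy))
print(round(na_share, 3))
```

On the full data, colSums(is.na(df)) should report 177 for Age, matching the count above.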

Type of variable and levels of measurement

names(df)

 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"

# install.packages("dplyr")
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

glimpse(df)

Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

table(df$Pclass)


  1   2   3 
216 184 491

table(df$Embarked)


      C   Q   S 
  2 168  77 644
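The unlabeled first column in the table above is the empty string: two passengers have a blank Embarked value that was read as "" rather than NA, so is.na() does not count them. A hedged sketch of recoding blanks to NA, using a short hypothetical vector (embarked) instead of the real column:

```r
# Hypothetical port codes with two blank entries (not the real data)
embarked <- c("S", "C", "", "S", "Q", "")

# Blank strings are not NA, so is.na() finds nothing yet
print(sum(is.na(embarked)))

# Recode empty strings as missing values
embarked[embarked == ""] <- NA

# useNA = "ifany" adds an <NA> column to the frequency table
print(table(embarked, useNA = "ifany"))

n_missing <- sum(is.na(embarked))
print(n_missing)
```

Another option is to treat blanks as missing at import time with read.csv(..., na.strings = c("", "NA")).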

Variable Name	Type of variable	Level of Measurement
`PassengerId`	Qualitative	Nominal
`Survived` *	Qualitative	Ordinal
`Pclass`	Qualitative	Ordinal
`Name`	Qualitative	Nominal
`Sex`	Qualitative	Nominal
`Age`	Quantitative	Ratio
`SibSp` (Number of Siblings/Spouses Aboard)	Quantitative	Ratio
`Parch` (Number of Parents/Children Aboard)	Quantitative	Ratio
`Ticket` *	Qualitative	Nominal
`Fare`	Quantitative	Ratio
`Cabin`	Qualitative	Nominal
`Embarked` (C = Cherbourg; Q = Queenstown; S = Southampton)	Qualitative	Nominal
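The levels of measurement in the table can be encoded in R with factors: an ordered factor for an ordinal variable such as Pclass, and an unordered factor for a nominal variable such as Sex. This is a minimal sketch on short hypothetical vectors; the names pclass_f and sex_f and the labels "1st", "2nd", "3rd" are assumptions for illustration.

```r
# Hypothetical values in the style of the Titanic columns
pclass <- c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2)
sex    <- c("male", "female", "female", "female", "male")

# Ordinal: levels have a meaningful order, so use ordered = TRUE
pclass_f <- factor(pclass,
                   levels  = c(1, 2, 3),
                   labels  = c("1st", "2nd", "3rd"),
                   ordered = TRUE)

# Nominal: categories without any order
sex_f <- factor(sex)

print(levels(pclass_f))
print(is.ordered(pclass_f))
print(is.ordered(sex_f))
print(table(pclass_f))
```

Storing Pclass as a factor also keeps it out of numeric summaries, where a mean ticket class has no clear interpretation.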

Professional-looking summary statistics

Use the stargazer package to create a basic, professional-looking summary statistics table. Make sure to comment your code, indent it properly, and name each argument explicitly. In fewer than 3 sentences, describe any interesting trends you find.

# install.packages("stargazer")

library(stargazer)


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

?`stargazer-package`

stargazer(df, type = "text")


==============================================
Statistic    N   Mean   St. Dev.  Min    Max  
----------------------------------------------
PassengerId 891 446.000 257.354    1     891  
Survived    891  0.384   0.487     0      1   
Pclass      891  2.309   0.836     1      3   
Age         714 29.699   14.526  0.420 80.000 
SibSp       891  0.523   1.103     0      8   
Parch       891  0.382   0.806     0      6   
Fare        891 32.204   49.693  0.000 512.329
----------------------------------------------

variable_labels <- c("Passenger Id", 
                     "Survived", 
                     "Passenger Class", 
                     "Age", 
                     "# of Siblings/Spouses", 
                     "# Parents or Children", 
                     "Fare"
                     )

class(variable_labels)

[1] "character"

length(variable_labels)

[1] 7

stargazer(df, 
          type              = "text", 
          title             = "Summary Statistics", 
          covariate.labels  = variable_labels, 
          notes             = c("N = 891.", "Age has 177 missing values"), 
          omit.summary.stat = "n", 
          digits            =  2
          )


Summary Statistics
=================================================
Statistic              Mean  St. Dev. Min   Max  
-------------------------------------------------
Passenger Id          446.00  257.35   1    891  
Survived               0.38    0.49    0     1   
Passenger Class        2.31    0.84    1     3   
Age                   29.70   14.53   0.42 80.00 
# of Siblings/Spouses  0.52    1.10    0     8   
# Parents or Children  0.38    0.81    0     6   
Fare                  32.20   49.69   0.00 512.33
-------------------------------------------------
N = 891.                                         
Age has 177 missing values
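In the first stargazer table, N is 714 for Age (891 minus 177) because the summary is computed on the non-missing values only. Base R functions do not drop NA by default. A minimal sketch with a hypothetical six-value age vector:

```r
# Hypothetical ages with one missing value (illustrative only)
age <- c(22, 38, 26, 35, 35, NA)

# Without na.rm, a single NA makes the result NA
print(mean(age))

# na.rm = TRUE drops missing values before computing
age_mean <- mean(age, na.rm = TRUE)
age_sd   <- sd(age, na.rm = TRUE)

# Number of observations actually used
age_n <- sum(!is.na(age))

print(c(n = age_n, mean = age_mean, sd = round(age_sd, 2)))
```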

table(df$Sex)


female   male 
   314    577
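Counts are often easier to compare as shares. The sketch below re-enters the two counts from the table above by hand (the names sex_counts and sex_share are assumptions) so that it runs without the data file:

```r
# Counts copied from the table(df$Sex) output above
sex_counts <- c(female = 314, male = 577)

# prop.table() divides each count by the total
sex_share <- prop.table(sex_counts)

print(round(sex_share, 3))
```

With the data frame loaded, prop.table(table(df$Sex)) gives the same shares directly: roughly 35% female and 65% male.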

Measures of Central Tendency

data <- c(20,40,25,30,50, 37,421,77,1,53, 99,51,33)

mean(data)

[1] 72.07692

median(data)

[1] 40

hist(data)
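The gap between the mean (72.08) and the median (40) comes from the single extreme value 421. A short sketch, assuming we simply drop the largest value to see how each statistic reacts (the name no_outlier is illustrative):

```r
data <- c(20, 40, 25, 30, 50, 37, 421, 77, 1, 53, 99, 51, 33)

# Remove the largest observation (421)
no_outlier <- data[data != max(data)]

mean_no_outlier   <- mean(no_outlier)
median_no_outlier <- median(no_outlier)

# Compare each statistic with and without the extreme value
print(c(mean_all = mean(data), mean_trimmed = mean_no_outlier))
print(c(median_all = median(data), median_trimmed = median_no_outlier))
```

The mean falls from about 72 to 43, while the median moves only from 40 to 38.5, which is why the median is the more robust measure for skewed data.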

Measures of Dispersion

Standard deviation

The standard deviation is expressed in the same units as the data, whereas the variance is in squared units.

?sd

sd(x = data)

[1] 107.7532

round(x = sd(x = data), digits = 2)

[1] 107.75

var(data)

[1] 11610.74

all.equal(var(data), sd(x = data)^2) # compare doubles with a tolerance instead of ==

[1] TRUE
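To see what var() and sd() compute, the sample variance can be rebuilt by hand as the sum of squared deviations from the mean divided by n - 1. A minimal sketch (the names var_manual and sd_manual are illustrative):

```r
data <- c(20, 40, 25, 30, 50, 37, 421, 77, 1, 53, 99, 51, 33)

n          <- length(data)
deviations <- data - mean(data)

# Sample variance: squared deviations summed and divided by n - 1
var_manual <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

# all.equal() compares with a small numerical tolerance
print(all.equal(var_manual, var(data)))
print(all.equal(sd_manual, sd(data)))
```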

max(data)

[1] 421

min(data)

[1] 1

range(data)

[1]   1 421

max(data) - min(data)

[1] 420

Boxplots

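The default boxplot uses the inner fences (1.5 × IQR beyond the quartiles) and flags two outliers, 99 and 421. With the outer fences (3 × IQR, set with range = 3), only one outlier remains, 421.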

?boxplot
boxplot(data, horizontal = TRUE)

53 - 30 # Q3 - Q1: the interquartile range computed by hand

[1] 23

summary(data)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   30.00   40.00   72.08   53.00  421.00

IQR(data)

[1] 23

boxplot(data, horizontal = TRUE, range = 3)
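The fences behind the two boxplots can be computed directly. This is a sketch under one assumption: it uses quantile() for the quartiles, whereas boxplot() uses hinges; for this particular vector the two coincide at 30 and 53. The variable names are illustrative.

```r
data <- c(20, 40, 25, 30, 50, 37, 421, 77, 1, 53, 99, 51, 33)

# Quartiles and interquartile range
q   <- quantile(data, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Inner fences: 1.5 * IQR beyond the quartiles (boxplot default)
inner_fence <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)

# Outer fences: 3 * IQR beyond the quartiles (range = 3)
outer_fence <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)

# Points falling outside each pair of fences
inner_outliers <- sort(data[data < inner_fence[1] | data > inner_fence[2]])
outer_outliers <- sort(data[data < outer_fence[1] | data > outer_fence[2]])

print(inner_fence)
print(inner_outliers)
print(outer_fence)
print(outer_outliers)

# boxplot.stats() reports the points a boxplot would draw as outliers
print(sort(boxplot.stats(data, coef = 3)$out))
```

The inner fences are -4.5 and 87.5, so both 99 and 421 fall outside; the outer fences are -39 and 122, so only 421 does.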