SOLUTION Day 2: Types of Variables and Summary Statistics

Author

AS

BACKGROUND

The sinking of the Titanic is one of the most infamous shipwrecks in history.   

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Load Data

  • You can use getwd() to check your working directory.

  • In a file path, .. means go up one level, to the parent directory.
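
As a minimal sketch of how a relative path with .. resolves, the block below builds a throwaway folder layout in a temporary directory and reads a file from a sibling folder. The folder names mirror the session, but toy.csv and its values are hypothetical stand-ins, not the course files.

```r
# Hypothetical folder layout built in a temporary directory (not the course folders)
root <- tempfile("course_")
dir.create(file.path(root, "Day 2"), recursive = TRUE)
dir.create(file.path(root, "Day 3"))

# A tiny stand-in file; the values are made up for illustration
toy <- data.frame(id = 1:3, value = c(10, 20, 30))
write.csv(toy, file.path(root, "Day 2", "toy.csv"), row.names = FALSE)

# Work from "Day 3"; setwd() returns the previous working directory
old_wd <- setwd(file.path(root, "Day 3"))

# ".." steps up to the parent folder, then "Day 2" steps down into the sibling
toy_back <- read.csv("../Day 2/toy.csv")

setwd(old_wd)  # restore the original working directory
toy_back
```

If read.csv() reports that the file cannot be opened, checking getwd() first usually shows which folder the relative path is being resolved from.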

remove(list = ls())

?getwd # returns an absolute filepath representing the current working directory of the R process

getwd()
[1] "/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/BCE_Summer/Data Analysis/Summer 2025/shared/Day 3"
dir() # lists the names of files and directories in the current working directory
 [1] "~$Programming and Data Skills Quiz.docx"                                 
 [2] "datasets_import"                                                         
 [3] "II. Basics of R Programming_files"                                       
 [4] "II. Basics of R Programming.html"                                        
 [5] "II. Basics of R Programming.qmd"                                         
 [6] "III. Data Manipulation_files"                                            
 [7] "III. Data Manipulation.html"                                             
 [8] "III. Data Manipulation.qmd"                                              
 [9] "III.-Data-Manipulation_files"                                            
[10] "images"                                                                  
[11] "IV. Data Visualisation_files"                                            
[12] "IV. Data Visualisation.html"                                             
[13] "IV. Data Visualisation.qmd"                                              
[14] "IV.-Data-Visualisation_files"                                            
[15] "R Programming and Data Skills Quiz.docx"                                 
[16] "rsconnect"                                                               
[17] "SOLUTION_day2 discussion_types of variables on titanic dataset_files"    
[18] "SOLUTION_day2 discussion_types of variables on titanic dataset.html"     
[19] "SOLUTION_day2 discussion_types of variables on titanic dataset.qmd"      
[20] "SOLUTION_day2 discussion_types of variables on titanic dataset.rmarkdown"
[21] "SOLUTION_day2-discussion_types-of-variables-on-titanic-dataset_files"    
[22] "SOLUTION_day2-discussion_types-of-variables-on-titanic-dataset.rmarkdown"
train <- read.csv("../Day 2/train.csv")

Exploratory Data Analysis

?head
head(train, n = 4)
  PassengerId Survived Pclass
1           1        0      3
2           2        1      1
3           3        1      3
4           4        1      1
                                                 Name    Sex Age SibSp Parch
1                             Braund, Mr. Owen Harris   male  22     1     0
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
3                              Heikkinen, Miss. Laina female  26     0     0
4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
            Ticket    Fare Cabin Embarked
1        A/5 21171  7.2500              S
2         PC 17599 71.2833   C85        C
3 STON/O2. 3101282  7.9250              S
4           113803 53.1000  C123        S

Variables with missing values

  • visdat draws every cell of the data frame, so it is practical only for small datasets.
# install.packages("visdat")

library(visdat)
?visdat
df <- train 

vis_miss(df)

vis_dat(df)

library(psych)
age_summ_stats <- describe(df$Age)

typeof(age_summ_stats)
[1] "list"
length(df$Age) - age_summ_stats$n # n (the second element) is the number of non-missing values
[1] 177
  • Age has 177 missing values.
class(df$Age)
[1] "numeric"
head(is.na(df$Age)) # logical vector of T/F
[1] FALSE FALSE FALSE FALSE FALSE  TRUE
class(is.na(df$Age))
[1] "logical"
temp1 <- is.na(df$Age)
temp2 <- as.numeric(is.na(df$Age))

class(as.numeric(is.na(df$Age)))
[1] "numeric"
sum(is.na(df$Age))
[1] 177

sum() counts each TRUE as 1, so you get the same answer: 177 missing values.
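
The counting trick above can be sketched on a small vector. The ages below are made-up toy values, not rows from train.csv; the point is only that a logical vector behaves like 0/1 in arithmetic.

```r
# Toy ages, made up for illustration (not rows from train.csv)
age <- c(22, 38, NA, 35, NA)

miss <- is.na(age)            # logical vector: TRUE where a value is missing
as.numeric(miss)              # TRUE becomes 1, FALSE becomes 0

n_missing     <- sum(miss)    # adding the 1s counts the missing values
share_missing <- mean(miss)   # the mean of 0/1 values is the proportion missing

n_missing
share_missing
```

The same idea extends to a whole data frame: colSums(is.na(df)) gives one count per column.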

Type of variable and levels of measurement

names(df)
 [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
 [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
[11] "Cabin"       "Embarked"   
# install.packages("dplyr")
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
glimpse(df)
Rows: 891
Columns: 12
$ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
table(df$Pclass)

  1   2   3 
216 184 491 
table(df$Embarked)

      C   Q   S 
  2 168  77 644 
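
The unnamed category with a count of 2 in the Embarked table, like the "" entries for Cabin in the glimpse() output, is an empty string rather than NA, so is.na() does not count it. The sketch below shows one possible way to handle this, assuming blanks should be treated as missing: the na.strings argument of read.csv(). The inline CSV text is a made-up stand-in for train.csv.

```r
# Made-up CSV text standing in for the real file; blank fields are empty strings
csv_text <- "Name,Cabin,Embarked\nA,C85,C\nB,,S\nC,,\n"

# Default behaviour: blank character fields are read as "" and are not NA
raw <- read.csv(text = csv_text)

# Assumption: blanks mean "missing", so map both "" and "NA" to NA on import
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))

colSums(is.na(raw))    # no missing values detected
colSums(is.na(clean))  # blanks now show up as missing
```

Whether a blank really means "missing" depends on the variable, so this is a choice to make per dataset rather than a default to apply blindly.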
Variable Name                                              Type of variable  Level of Measurement
-------------------------------------------------------------------------------------------------
Passenger Id                                               Qualitative       Nominal
Survived *                                                 Qualitative       Ordinal
Pclass                                                     Qualitative       Ordinal
Name                                                       Qualitative       Nominal
Sex                                                        Qualitative       Nominal
Age                                                        Quantitative      Ratio
SibSp (Number of Siblings/Spouses Aboard)                  Quantitative      Ratio
Parch (Number of Parents/Children Aboard)                  Quantitative      Ratio
Ticket *                                                   Qualitative       Nominal
Fare                                                       Quantitative      Ratio
Cabin                                                      Qualitative       Nominal
Embarked (C = Cherbourg; Q = Queenstown; S = Southampton)  Qualitative       Nominal
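
One way to record these levels of measurement in R is with factors: an unordered factor for a nominal variable and an ordered factor for an ordinal one. The sketch below uses made-up toy values, and the labels "3rd", "2nd", "1st" and the ordering 3rd < 2nd < 1st are illustrative assumptions, not part of the dataset.

```r
# Toy values standing in for a few rows of the data (made up for illustration)
sex    <- c("male", "female", "female", "male")
pclass <- c(3, 1, 3, 2)

# Nominal: categories with no inherent order
sex_f <- factor(sex)

# Ordinal: categories with an order; here assumed to be 3rd < 2nd < 1st class
pclass_f <- factor(pclass,
                   levels  = c(3, 2, 1),
                   labels  = c("3rd", "2nd", "1st"),
                   ordered = TRUE)

levels(sex_f)         # "female" "male" (alphabetical, carries no ranking)
is.ordered(sex_f)     # FALSE
is.ordered(pclass_f)  # TRUE
below_first <- pclass_f < "1st"  # comparisons are meaningful only for ordered factors
below_first
```

Storing Pclass as an ordered factor also stops functions from averaging the class codes as if they were quantities.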

Professional looking summary statistics

Use the stargazer package to create a basic, professional-looking summary statistics table. Make sure to comment your code, indent it properly, and name each argument explicitly. In fewer than three sentences, describe any interesting trends you find.

# install.packages("stargazer")

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
?`stargazer-package`

stargazer(df, type = "text")

==============================================
Statistic    N   Mean   St. Dev.  Min    Max  
----------------------------------------------
PassengerId 891 446.000 257.354    1     891  
Survived    891  0.384   0.487     0      1   
Pclass      891  2.309   0.836     1      3   
Age         714 29.699   14.526  0.420 80.000 
SibSp       891  0.523   1.103     0      8   
Parch       891  0.382   0.806     0      6   
Fare        891 32.204   49.693  0.000 512.329
----------------------------------------------
variable_labels <- c("Passenger Id", 
                     "Survived", 
                     "Passenger Class", 
                     "Age", "# of Siblings", 
                     "# Parents or Children", 
                     "Fare"
                     )

class(variable_labels)
[1] "character"
length(variable_labels)
[1] 7
stargazer(df, 
          type              = "text", 
          title             = "Summary Statistics", 
          covariate.labels  = variable_labels, 
          notes             = c("N = 891.", "Age has 177 missing values"), 
          omit.summary.stat = "n", 
          digits            =  2
          )

Summary Statistics
=================================================
Statistic              Mean  St. Dev. Min   Max  
-------------------------------------------------
Passenger Id          446.00  257.35   1    891  
Survived               0.38    0.49    0     1   
Passenger Class        2.31    0.84    1     3   
Age                   29.70   14.53   0.42 80.00 
# of Siblings          0.52    1.10    0     8   
# Parents or Children  0.38    0.81    0     6   
Fare                  32.20   49.69   0.00 512.33
-------------------------------------------------
N = 891.                                         
Age has 177 missing values                       
table(df$Sex)

female   male 
   314    577 
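
The stargazer tables above list only the numeric columns, and Age is summarised over its 714 non-missing values. As a rough cross-check of those columns (not a description of stargazer's internals), the same statistics can be computed in base R. The data frame below is a made-up toy, not train.csv.

```r
# Toy data frame, made up for illustration (not rows from train.csv)
toy <- data.frame(
  Age  = c(22, 38, NA, 35),
  Fare = c(7.25, 71.28, 7.92, 53.10),
  Sex  = c("male", "female", "female", "female")
)

# Keep only the numeric columns; character columns such as Sex are dropped
num <- toy[sapply(toy, is.numeric)]

# One row per variable; na.rm = TRUE skips missing values column by column
summ <- data.frame(
  N    = sapply(num, function(x) sum(!is.na(x))),
  Mean = sapply(num, mean, na.rm = TRUE),
  SD   = sapply(num, sd,   na.rm = TRUE),
  Min  = sapply(num, min,  na.rm = TRUE),
  Max  = sapply(num, max,  na.rm = TRUE)
)

round(summ, 2)
```

Because N is computed per column, a variable with missing values (Age here) reports a smaller N than the others, which is the pattern seen in the first stargazer table.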

Measures of Central Tendency

data <- c(20,40,25,30,50, 37,421,77,1,53, 99,51,33)
mean(data)
[1] 72.07692
median(data)
[1] 40
hist(data)
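
The gap between the mean (72.08) and the median (40) comes largely from the single value 421. A quick sketch of this, reusing the same data vector and dropping its largest value, shows how differently the two measures react.

```r
# Same values as the data vector above
data <- c(20, 40, 25, 30, 50, 37, 421, 77, 1, 53, 99, 51, 33)

# Drop the single largest value (421) to see how each measure reacts
trimmed <- data[data != max(data)]

with_outlier    <- c(mean = mean(data),    median = median(data))
without_outlier <- c(mean = mean(trimmed), median = median(trimmed))

with_outlier
without_outlier
```

The mean falls from about 72 to 43 while the median moves only from 40 to 38.5, which is why the median is often preferred for skewed data.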

Measures of Dispersion

Standard deviation

  • The standard deviation is in the same units as the data; the variance is in squared units.
?sd

sd(x = data)
[1] 107.7532
round(x = sd(x = data), digits = 2)
[1] 107.75
var(data)
[1] 11610.74
var(data) == sd(x = data)^2
[1] TRUE
max(data)
[1] 421
min(data)
[1] 1
range(data)
[1]   1 421
max(data) - min(data)
[1] 420
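
To make the link between var() and sd() concrete, the sketch below recomputes both by hand with the sample formula (dividing by n - 1). It compares the results with all.equal(), which tolerates tiny floating-point differences; an exact == comparison happens to return TRUE above but is not guaranteed to in general.

```r
# Same values as the data vector above
data <- c(20, 40, 25, 30, 50, 37, 421, 77, 1, 53, 99, 51, 33)

n          <- length(data)
deviations <- data - mean(data)          # distance of each value from the mean

# Sample variance: sum of squared deviations divided by n - 1
var_manual <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance, back in the original units
sd_manual <- sqrt(var_manual)

all.equal(var_manual, var(data))
all.equal(sd_manual, sd(data))
```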

Boxplots

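  • With the default inner fences (range = 1.5, i.e. 1.5 × IQR beyond the quartiles) the boxplot flags two outliers, 99 and 421. With the outer fences (range = 3) only one outlier remains, 421.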
?boxplot
boxplot(data, horizontal = TRUE)

53 - 30 # third quartile minus first quartile, i.e. the interquartile range
[1] 23
summary(data)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   30.00   40.00   72.08   53.00  421.00 
IQR(data)
[1] 23
boxplot(data, horizontal = TRUE, range = 3)
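
The fences behind the two boxplots can be worked out by hand. This is a sketch using quantile() for the quartiles; boxplot() itself uses hinges, which coincide with the quartiles for this particular data vector but need not in general.

```r
# Same values as the data vector above
data <- c(20, 40, 25, 30, 50, 37, 421, 77, 1, 53, 99, 51, 33)

q   <- quantile(data, c(0.25, 0.75), names = FALSE)  # first and third quartiles
iqr <- q[2] - q[1]                                   # interquartile range

# Inner fences: 1.5 x IQR beyond the quartiles (the boxplot default, range = 1.5)
inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)

# Outer fences: 3 x IQR beyond the quartiles (range = 3)
outer <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)

out_inner <- data[data < inner[1] | data > inner[2]]  # points beyond the inner fences
out_outer <- data[data < outer[1] | data > outer[2]]  # points beyond the outer fences

inner
outer
out_inner
out_outer
```

Here the inner fences are -4.5 and 87.5, so 99 and 421 fall outside; the outer fences are -39 and 122, so only 421 does.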