mydata <- read.csv('c:/train.csv') 
library(psych)
options(repos = c(CRAN = "https://cran.rstudio.com/"))

Homework 1: Titanic

Haiding Luo

9/11/2023

Question 1a

What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

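I think PassengerId is a qualitative variable, while Age is a quantitative variable. For the levels of measurement, PassengerId is nominal: it is a unique number that only labels each passenger, so its order and arithmetic carry no meaning. Age is a ratio measurement rather than an interval one, because it has a true zero, so both differences and ratios are meaningful (a 40-year-old is twice as old as a 20-year-old).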

Question 1b

Which variable has the most missing observations?

install.packages("Hmisc")
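## Installing package into 'C:/Users/pokem/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## 
##   There is a binary version available but the source version is later:
##       binary source needs_compilation
## Hmisc  5.1-0  5.1-1              TRUE
## installing the source package 'Hmisc'
## Warning in install.packages("Hmisc"): installation of package 'Hmisc' had non-zero exit status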
Hmisc::describe(mydata)
## mydata 
## 
##  12  Variables      891  Observations
## --------------------------------------------------------------------------------
## PassengerId 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      891        0      891        1      446    297.3     45.5     90.0 
##      .25      .50      .75      .90      .95 
##    223.5    446.0    668.5    802.0    846.5 
## 
## lowest :   1   2   3   4   5, highest: 887 888 889 890 891
## --------------------------------------------------------------------------------
## Survived 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##      891        0        2     0.71      342   0.3838   0.4735 
## 
## --------------------------------------------------------------------------------
## Pclass 
##        n  missing distinct     Info     Mean      Gmd 
##      891        0        3     0.81    2.309   0.8631 
##                             
## Value          1     2     3
## Frequency    216   184   491
## Proportion 0.242 0.207 0.551
## 
## For the frequency table, variable is rounded to the nearest 0.02
## --------------------------------------------------------------------------------
## Name 
##        n  missing distinct 
##      891        0      891 
## 
## lowest : Abbing, Mr. Anthony                    Abbott, Mr. Rossmore Edward            Abbott, Mrs. Stanton (Rosa Hunt)       Abelson, Mr. Samuel                    Abelson, Mrs. Samuel (Hannah Wizosky) 
## highest: Yousseff, Mr. Gerious                  Yrois, Miss. Henriette ("Mrs Harbeck") Zabour, Miss. Hileni                   Zabour, Miss. Thamine                  Zimmerman, Mr. Leo                    
## --------------------------------------------------------------------------------
## Sex 
##        n  missing distinct 
##      891        0        2 
##                         
## Value      female   male
## Frequency     314    577
## Proportion  0.352  0.648
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      714      177       88    0.999     29.7    16.21     4.00    14.00 
##      .25      .50      .75      .90      .95 
##    20.12    28.00    38.00    50.00    56.00 
## 
## lowest : 0.42 0.67 0.75 0.83 0.92, highest: 70   70.5 71   74   80  
## --------------------------------------------------------------------------------
## SibSp 
##        n  missing distinct     Info     Mean      Gmd 
##      891        0        7    0.669    0.523    0.823 
##                                                     
## Value       0.00  0.96  2.00  2.96  4.00  4.96  8.00
## Frequency    608   209    28    16    18     5     7
## Proportion 0.682 0.235 0.031 0.018 0.020 0.006 0.008
## 
## For the frequency table, variable is rounded to the nearest 0.08
## --------------------------------------------------------------------------------
## Parch 
##        n  missing distinct     Info     Mean      Gmd 
##      891        0        7    0.556   0.3816   0.6259 
##                                                     
## Value       0.00  0.96  1.98  3.00  3.96  4.98  6.00
## Frequency    678   118    80     5     4     5     1
## Proportion 0.761 0.132 0.090 0.006 0.004 0.006 0.001
## 
## For the frequency table, variable is rounded to the nearest 0.06
## --------------------------------------------------------------------------------
## Ticket 
##        n  missing distinct 
##      891        0      681 
## 
## lowest : 110152      110413      110465      110564      110813     
## highest: W./C. 6608  W./C. 6609  W.E.P. 5734 W/C 14208   WE/P 5735  
## --------------------------------------------------------------------------------
## Fare 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      891        0      248        1     32.2    36.78    7.225    7.550 
##      .25      .50      .75      .90      .95 
##    7.910   14.454   31.000   77.958  112.079 
## 
## lowest : 0       4.0125  5       6.2375  6.4375 
## highest: 227.525 247.521 262.375 263     512.329
## --------------------------------------------------------------------------------
## Cabin 
##        n  missing distinct 
##      204      687      147 
## 
## lowest : A10 A14 A16 A19 A20, highest: F33 F38 F4  G6  T  
## --------------------------------------------------------------------------------
## Embarked 
##        n  missing distinct 
##      889        2        3 
##                             
## Value          C     Q     S
## Frequency    168    77   644
## Proportion 0.189 0.087 0.724
## --------------------------------------------------------------------------------

Cabin has the most missing observations (687 of 891), followed by Age (177) and Embarked (2). All other variables are complete.
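
The missing counts can also be cross-checked without Hmisc. The following is a minimal sketch on a small made-up data frame; `toy` and `count_missing` are hypothetical names, not part of the assignment. It assumes that blank strings should count as missing, which matters for character columns such as Cabin because read.csv() by default reads empty text fields as "" rather than NA.

```r
# Toy data standing in for the Titanic file (values are made up).
toy <- data.frame(
  Age      = c(22, NA, 35, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count values that are NA or, for character columns, blank strings.
count_missing <- function(df) {
  sapply(df, function(col) {
    blank <- if (is.character(col)) trimws(col) == "" else FALSE
    sum(is.na(col) | blank)
  })
}

missing_counts <- count_missing(toy)
print(sort(missing_counts, decreasing = TRUE))
```

Applied to the real data, count_missing(mydata) should agree with the "missing" column of the Hmisc output, provided the blank-string assumption matches how Hmisc treated the file.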

Question 2

Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE). You can read up on indexing in data frames (i.e. “dataframe_name$variable_name”).

median_age <- median(mydata$Age, na.rm=TRUE)
original_age_data <- mydata$Age
mydata$Age[is.na(mydata$Age)] <- median_age
mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
modified_age_data <- mydata$Age
summary(original_age_data)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177
summary(mydata$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00
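
Only Age changes here: the Hmisc output shows no missing values for SibSp or Parch, so their imputation lines leave the data untouched. The rule in the question (median for ordered or numeric columns, mode otherwise) can be sketched as a small helper. This is an illustration on made-up vectors under the assumption that "numeric" is a good enough proxy for "ordinal, interval, or ratio"; `column_mode` and `impute_na` are hypothetical names.

```r
# Toy columns (made-up values): one numeric, one categorical.
age      <- c(22, NA, 38, 26, NA, 35)
embarked <- c("S", "C", NA, "S", "Q", "S")

# Most frequent non-missing value; ties go to the first value in table() order.
column_mode <- function(x) {
  counts <- table(x)               # table() drops NA by default
  names(counts)[which.max(counts)]
}

# Replace NA with the median (numeric columns) or the mode (everything else).
impute_na <- function(x) {
  fill <- if (is.numeric(x)) median(x, na.rm = TRUE) else column_mode(x)
  x[is.na(x)] <- fill
  x
}

age_filled      <- impute_na(age)
embarked_filled <- impute_na(embarked)
print(age_filled)
print(embarked_filled)
```

Filling every gap with one value pulls the quartiles toward the median, which is the same effect visible in the two summaries above (1st quartile 20.12 to 22.00, 3rd quartile 38.00 to 35.00).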

Question 3

Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).

library(psych)
psych::describe(mydata$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
hist(mydata$Age, xlab = "Age", ylab = "number of people", main = "Age chart")

psych::describe(mydata$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
hist(mydata$SibSp, xlab = "Siblings / Spouses", ylab = "number of people", main = "Siblings and Spouses chart")

psych::describe(mydata$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03
hist(mydata$Parch, xlab = "Parents / Children", ylab="number", main = "Parents Children")

Question 4

Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?

table(mydata$Survived, mydata$Sex)
##    
##     female male
##   0     81  468
##   1    233  109

Based on this table, I noticed a large disparity between the survival rates of men and women. Only 109 of 577 men survived (about 19%), while 233 of 314 women survived (about 74%). This could be attributed to the fact that, during the emergency, women and children were given priority for places in the lifeboats.
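
The survival rates by sex can be read off the table more directly by converting counts to column proportions. The sketch below rebuilds the table from the counts printed above rather than from the file, so it runs on its own; `counts` and `rates` are illustrative names.

```r
# Counts copied from the cross-tabulation above (rows: Survived, columns: Sex).
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Column-wise proportions: each column sums to 1, so row "1" is the survival rate.
rates <- prop.table(counts, margin = 2)
print(round(rates, 3))
```

With the real data, prop.table(table(mydata$Survived, mydata$Sex), margin = 2) gives the same proportions.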

Question 5

Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T, ylab = "Survived",
        xlab = "Age", main = "Survived and Age boxplot graph")

The age distributions of passengers who survived and those who did not are quite similar. There is a higher number of casualties among passengers who are older (approximately 55 years and older), and I noticed that there are 4 children who did not survive. Because the plot groups passengers only by survival, it says nothing about sex; it speaks only to the age part of the point I made in my previous answer, that more women and children survived.
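
To judge whether the two medians really differ, the notches can be compared numerically. In R, the notch limits come from boxplot.stats() and extend 1.58 times the hinge spread divided by sqrt(n) on either side of the median; per the R documentation, notches that do not overlap are taken as strong evidence that the medians differ. The sketch below uses made-up ages, not the Titanic data, and the variable names are illustrative.

```r
# Made-up ages for two groups (not the Titanic data).
ages_died     <- c(19, 21, 24, 28, 28, 30, 33, 40, 47, 61)
ages_survived <- c(2, 4, 18, 22, 27, 28, 29, 35, 36, 52)

# boxplot.stats() returns the notch limits in $conf:
# median +/- 1.58 * (upper hinge - lower hinge) / sqrt(n).
notch_died     <- boxplot.stats(ages_died)$conf
notch_survived <- boxplot.stats(ages_survived)$conf

# TRUE when the two notch intervals overlap.
notches_overlap <- notch_died[1] <= notch_survived[2] &&
  notch_survived[1] <= notch_died[2]

print(notch_died)
print(notch_survived)
print(notches_overlap)
```

For the homework data, the same check would use boxplot.stats(mydata$Age[mydata$Survived == 0])$conf and the corresponding call for Survived == 1. Keep in mind that the 177 imputed ages all equal the median, which narrows the boxes and the notches.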