mydata <- read.csv('c:/train.csv')
library(psych)
options(repos = c(CRAN = "https://cran.rstudio.com/"))
haiding Luo
What are the variable types (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?
I think PassengerId is a qualitative variable, while Age is a quantitative variable. For the levels of measurement, I think PassengerId is nominal because it is a unique number that only identifies a passenger; Age is a ratio measurement because it has a true zero, so ratios of ages are meaningful.
Which variable has the most missing observations?
install.packages("Hmisc")
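## Installing package into 'C:/Users/pokem/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
##
## There is a binary version available but the source version is later:
## binary source needs_compilation
## Hmisc 5.1-0 5.1-1 TRUE
## installing the source package 'Hmisc'
## Warning in install.packages("Hmisc"): installation of package 'Hmisc' had non-zero exit status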
Hmisc::describe(mydata)
## mydata
##
## 12 Variables 891 Observations
## --------------------------------------------------------------------------------
## PassengerId
## n missing distinct Info Mean Gmd .05 .10
## 891 0 891 1 446 297.3 45.5 90.0
## .25 .50 .75 .90 .95
## 223.5 446.0 668.5 802.0 846.5
##
## lowest : 1 2 3 4 5, highest: 887 888 889 890 891
## --------------------------------------------------------------------------------
## Survived
## n missing distinct Info Sum Mean Gmd
## 891 0 2 0.71 342 0.3838 0.4735
##
## --------------------------------------------------------------------------------
## Pclass
## n missing distinct Info Mean Gmd
## 891 0 3 0.81 2.309 0.8631
##
## Value 1 2 3
## Frequency 216 184 491
## Proportion 0.242 0.207 0.551
##
## For the frequency table, variable is rounded to the nearest 0.02
## --------------------------------------------------------------------------------
## Name
## n missing distinct
## 891 0 891
##
## lowest : Abbing, Mr. Anthony Abbott, Mr. Rossmore Edward Abbott, Mrs. Stanton (Rosa Hunt) Abelson, Mr. Samuel Abelson, Mrs. Samuel (Hannah Wizosky)
## highest: Yousseff, Mr. Gerious Yrois, Miss. Henriette ("Mrs Harbeck") Zabour, Miss. Hileni Zabour, Miss. Thamine Zimmerman, Mr. Leo
## --------------------------------------------------------------------------------
## Sex
## n missing distinct
## 891 0 2
##
## Value female male
## Frequency 314 577
## Proportion 0.352 0.648
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 714 177 88 0.999 29.7 16.21 4.00 14.00
## .25 .50 .75 .90 .95
## 20.12 28.00 38.00 50.00 56.00
##
## lowest : 0.42 0.67 0.75 0.83 0.92, highest: 70 70.5 71 74 80
## --------------------------------------------------------------------------------
## SibSp
## n missing distinct Info Mean Gmd
## 891 0 7 0.669 0.523 0.823
##
## Value 0.00 0.96 2.00 2.96 4.00 4.96 8.00
## Frequency 608 209 28 16 18 5 7
## Proportion 0.682 0.235 0.031 0.018 0.020 0.006 0.008
##
## For the frequency table, variable is rounded to the nearest 0.08
## --------------------------------------------------------------------------------
## Parch
## n missing distinct Info Mean Gmd
## 891 0 7 0.556 0.3816 0.6259
##
## Value 0.00 0.96 1.98 3.00 3.96 4.98 6.00
## Frequency 678 118 80 5 4 5 1
## Proportion 0.761 0.132 0.090 0.006 0.004 0.006 0.001
##
## For the frequency table, variable is rounded to the nearest 0.06
## --------------------------------------------------------------------------------
## Ticket
## n missing distinct
## 891 0 681
##
## lowest : 110152 110413 110465 110564 110813
## highest: W./C. 6608 W./C. 6609 W.E.P. 5734 W/C 14208 WE/P 5735
## --------------------------------------------------------------------------------
## Fare
## n missing distinct Info Mean Gmd .05 .10
## 891 0 248 1 32.2 36.78 7.225 7.550
## .25 .50 .75 .90 .95
## 7.910 14.454 31.000 77.958 112.079
##
## lowest : 0 4.0125 5 6.2375 6.4375
## highest: 227.525 247.521 262.375 263 512.329
## --------------------------------------------------------------------------------
## Cabin
## n missing distinct
## 204 687 147
##
## lowest : A10 A14 A16 A19 A20, highest: F33 F38 F4 G6 T
## --------------------------------------------------------------------------------
## Embarked
## n missing distinct
## 889 2 3
##
## Value C Q S
## Frequency 168 77 644
## Proportion 0.189 0.087 0.724
## --------------------------------------------------------------------------------
Cabin has the most missing observations (687 of 891), followed by Age (177) and Embarked (2).
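The same answer can be checked without reading through the full describe() output. The following is a minimal sketch in base R, assuming that empty strings should count as missing, as the Cabin and Embarked counts above suggest; the toy data frame and its values are hypothetical stand-ins for mydata.

```r
# Toy data frame standing in for mydata (hypothetical values).
toy <- data.frame(
  Age = c(22, NA, 26, NA),
  Cabin = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count values that are NA or an empty string in each column.
missing_counts <- sapply(toy, function(col) sum(is.na(col) | col == ""))

# Columns ordered from most to fewest missing values.
sort(missing_counts, decreasing = TRUE)

# Name of the column with the most missing values.
names(which.max(missing_counts))
```

Alternatively, passing na.strings = c("", "NA") to read.csv() would turn the empty strings into NA at import time, so that is.na() alone is enough.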
Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE). You can read up on indexing in a dataframe (i.e., “dataframe_name$variable_name”).
median_age <- median(mydata$Age, na.rm=TRUE)
original_age_data <- mydata$Age
mydata$Age[is.na(mydata$Age)] <- median_age
mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
modified_age_data <- mydata$Age
summary(original_age_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177
summary(mydata$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.42 22.00 28.00 29.36 35.00 80.00
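The question also mentions imputing with the column mode, which base R does not provide as a statistical function. The following is a minimal sketch of one way to do it; column_mode is a hypothetical helper name, and the toy vector stands in for a categorical column such as Embarked, assuming its missing values are coded as NA.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# Ties are resolved in favour of the value that appears first.
column_mode <- function(x) {
  values <- unique(x[!is.na(x)])
  values[which.max(tabulate(match(x, values)))]
}

# Toy vector standing in for a categorical column (hypothetical values).
embarked <- c("S", "C", NA, "S", "Q", NA)

# Replace the missing entries with the mode, mirroring the median pattern above.
embarked[is.na(embarked)] <- column_mode(embarked)
embarked
```

Because the helper returns an element of the original vector, it keeps the column's type, so it can be used on numeric count columns as well as on character columns.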
Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).
library(psych)
psych::describe(mydata$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
hist(mydata$Age, xlab = "Age", ylab = "number of people", main = "Age chart")
psych::describe(mydata$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
hist(mydata$SibSp, xlab = "Siblings",ylab = "number of people", main = "Siblings chart")
psych::describe(mydata$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
hist(mydata$Parch, xlab = "Parents / Children", ylab="number", main = "Parents Children")
Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?
table(mydata$Survived, mydata$Sex)
##
## female male
## 0 81 468
## 1 233 109
Based on this table, there is a large disparity between the survival rates of men and women: 233 of 314 women (about 74%) survived, compared with 109 of 577 men (about 19%). This could be attributed to the policy of prioritizing women and children for the lifeboats during the evacuation.
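The survival rates by sex can be computed directly from the cross-tabulation. The following is a minimal sketch that rebuilds the table from the counts printed above, so that it runs without the CSV file; the object names tab and rates are my own.

```r
# Rebuild the cross-tabulation from the counts shown in the output above.
tab <- as.table(matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# Column proportions: share of each sex that died (0) or survived (1).
rates <- prop.table(tab, margin = 2)
round(rates, 3)
```

With the real data, the same call would be prop.table(table(mydata$Survived, mydata$Sex), margin = 2).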
Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?
boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=TRUE, ylab = "Survived",
xlab = "Age", main = "Survived and Age boxplot graph")
The ages of passengers who survived and those who did not are quite similar. There is a higher number of casualties among passengers who are older (approximately 55 years and older), and I noticed that there are 4 children who did not survive. Because this chart does not show sex, it only partly supports the point I made in my previous answer, which is that more females and children survived.
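The visual comparison can be backed up with numbers. The following is a minimal sketch that computes the median age and the notch limits for each group; the toy data frame and its values are hypothetical stand-ins for mydata. boxplot.stats() returns the notch limits in its conf element, computed as the median plus or minus 1.58 times the IQR divided by the square root of n. If the notches of two groups do not overlap, that is usually read as evidence that the medians differ.

```r
# Toy data standing in for mydata (hypothetical values).
toy <- data.frame(
  Survived = c(0, 0, 0, 0, 1, 1, 1, 1),
  Age = c(20, 28, 35, 60, 4, 24, 30, 40)
)

# Split ages by survival status.
age_by_group <- split(toy$Age, toy$Survived)

# Median age in each group.
medians <- sapply(age_by_group, median)
medians

# Lower and upper notch limits in each group.
notches <- lapply(age_by_group, function(a) boxplot.stats(a)$conf)
notches
```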