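Q1a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?
A1a: PassengerId is a qualitative variable measured at the nominal level: the numbers only label passengers. Age is a quantitative variable measured at the ratio level, because it has a true zero and ratios of ages are meaningful. Age is continuous in principle. The data records it mostly in whole years, so it behaves much like a discrete variable, although the minimum of 0.42 in the describe() output for Q3 shows that some values are fractional.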
note: found this site that explains R Markdown formatting https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
Q1b. Which variable has the most missing observations? You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.
A1b. Cabin has the most missing observations. colSums(is.na(myData)) shows that the Age column is missing 177 observations, while running is.empty (from the rapportools library) on Cabin shows 687 empty values. colSums reports no NA values in any other column.
myData <- read.csv('/Users/valcourc/Personal/courses/ADEC7301.02/data/Assignment_1/train.csv')
colSums(is.na(myData))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0           0           0
sum(rapportools::is.empty(myData$Cabin))
## [1] 687
## I could not find a single function that counts both empty and NA values for every column of the dataset in one statement. Should I use a loop?
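A loop is not strictly needed for this. The sketch below counts values that are either NA or an empty string in every column with a single sapply() call. The data frame `toy` and its values are hypothetical stand-ins for the Titanic data, not taken from train.csv.

```r
# Hypothetical toy data standing in for the Titanic data frame
toy <- data.frame(
  Age   = c(22, NA, 30),
  Cabin = c("", "C85", ""),
  Fare  = c(7.25, 71.28, 8.05),
  stringsAsFactors = FALSE
)

# For each column, count entries that are NA or an empty string.
# is.na(x) is checked first, so an NA on the right-hand side of `|`
# never propagates into the sum.
missing_counts <- sapply(toy, function(x) sum(is.na(x) | x == ""))
print(missing_counts)
```

Another option, assuming the empty strings should be treated as missing everywhere, is to pass na.strings = c("", "NA") to read.csv() so that colSums(is.na(myData)) alone reports them.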
Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE).
A2. To impute using the median age of all passengers on the Titanic, you can indeed assign the median to every row where Age is NA. SibSp and Parch already have values in every row, so they need no imputation. I believe you could write logic to adjust a passenger’s imputed age based on whether they had a parent, children, spouse, and/or siblings on board, but that is more complicated than I want at this stage of my analysis. I simply used the median age of all passengers here:
myData$Age[is.na(myData$Age)] <- median(myData$Age, na.rm=TRUE)
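Q2 also mentions imputing with the column mode, and base R has no built-in function for the statistical mode. The sketch below shows one possible way to do both. The helper names `impute_median` and `stat_mode` and the `toy` data are hypothetical, introduced only for illustration.

```r
# Replace NA values in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Most frequent non-missing value (ties resolve to the value seen first)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical toy data, not taken from train.csv
toy <- data.frame(Age = c(22, NA, 38, 26), SibSp = c(1, 0, NA, 0))

toy$Age <- impute_median(toy$Age)                  # median of 22, 26, 38
toy$SibSp[is.na(toy$SibSp)] <- stat_mode(toy$SibSp) # mode of 1, 0, 0
print(toy)
```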
Q3. Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.
A3. See the following code, which uses the describe() function from the ‘psych’ package on the Age, SibSp, and Parch columns.
## Comment: I ran install.packages() and library() for the 'psych' package from the console. I am not sure whether this can be done directly in R Markdown; I got errors when I tried, possibly because I was doing it incorrectly.
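Installing from inside an R Markdown chunk is possible. One common cause of errors when knitting is that no CRAN mirror is set in the non-interactive session, which may or may not be what happened here. The sketch below installs a package only when it is missing and names the mirror explicitly. The helper `ensure_package` is hypothetical, and the mirror URL is an assumption that can be swapped for any CRAN mirror.

```r
# Install a package only if it is not already available, naming the
# CRAN mirror explicitly so that knitting does not depend on session defaults.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so this call downloads nothing
ok <- ensure_package("stats")
print(ok)
```

In the report itself, a call such as ensure_package("psych") would go in an early chunk, after which psych::describe() can be used as in the code that follows.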
psych::describe(myData$Age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.36 13.02 28 28.83 8.9 0.42 80 79.58 0.51 0.97 0.44
psych::describe(myData$SibSp)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.52 1.1 0 0.27 0 0 8 8 3.68 17.73 0.04
psych::describe(myData$Parch)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 0.38 0.81 0 0.18 0 0 6 6 2.74 9.69 0.03
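Looking at the output of describe() on each of these columns, the median age is 28 and the mean is 29.36, with a slight right (positive) skew of 0.51. The kurtosis of 0.97 is positive; describe() reports excess kurtosis, so this indicates a distribution somewhat more peaked and heavier-tailed than a normal distribution rather than a low peak. Imputing 177 missing ages with the median in Q2 likely contributes to that peak. For SibSp, the mean of 0.52 is the average number of siblings or spouses per passenger, not the share of passengers who had one; the median of 0 shows that at least half of the passengers had none on board. Likewise for Parch, the mean of 0.38 is the average number of parents or children per passenger, and the median is again 0. Both are strongly right-skewed counts (skew of 3.68 and 2.74), so a frequency table would be more informative for them than the mean and standard deviation.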
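The difference between the mean of a count variable and the share of passengers with at least one relative can be illustrated with a short sketch. The vector `sib_sp` below is hypothetical and is not taken from train.csv.

```r
# Hypothetical counts of siblings/spouses for six passengers
sib_sp <- c(0, 0, 0, 1, 1, 4)

avg_count   <- mean(sib_sp)      # average number of relatives per passenger
share_any   <- mean(sib_sp > 0)  # proportion of passengers with at least one
freq_counts <- table(sib_sp)     # how many passengers have each count

print(avg_count)    # 1
print(share_any)    # 0.5
print(freq_counts)
```

Applied to the real data, mean(myData$SibSp > 0) would give the proportion of passengers with a sibling or spouse on board, which is the figure a percentage statement needs.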
Q4. Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?
A4.
table(myData$Survived, myData$Sex)
##
##     female male
##   0     81  468
##   1    233  109
If you have seen a movie or read about the Titanic, you would expect this table to show that more women survived than men, and it does: 233 of 314 women survived (about 74%) and 81 did not, while only 109 of 577 men survived (about 19%) and 468 did not. I hadn’t noticed before, but the table also shows that there were substantially more men than women on board (577 vs 314).
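Survival rates by sex can be computed directly from the cross-tabulation. The sketch below rebuilds the table from the counts printed above and uses prop.table() with margin = 2 to get column-wise proportions. The object names `tab` and `rates` are illustrative.

```r
# Rebuild the Survived x Sex table from the counts shown in the output
tab <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Proportions within each column (each sex), so every column sums to 1
rates <- prop.table(tab, margin = 2)
print(round(rates, 3))
```

With the real data, the same call works on table(myData$Survived, myData$Sex) without rebuilding the counts by hand.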
Q5. Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?
A5.
boxplot(myData$Age ~ myData$Survived, notch = TRUE, horizontal = TRUE, xlab = "Age", ylab = "Survived")
This boxplot shows that the median age of survivors was slightly lower than the median age of those who did not survive, but overall the age distributions of the two groups look about the same. The outliers include more older passengers among those who did not survive and, unfortunately, a few younger passengers.
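The notches give a rough way to compare the two medians: when the notches of two boxes do not overlap, that is commonly read as evidence that the medians differ. The sketch below uses boxplot.stats() to retrieve the notch limits for one group and reproduces them from the formula median ± 1.58 × IQR / sqrt(n), where the IQR is taken between the hinges. The vector `ages` is hypothetical, not taken from train.csv.

```r
# Hypothetical ages for one group of passengers
ages <- c(20, 22, 25, 28, 30, 35, 40)

# boxplot.stats() returns the five-number summary ($stats) and the
# lower and upper notch limits ($conf)
bs <- boxplot.stats(ages)
notch <- bs$conf

# Reproduce the notch limits by hand from the hinges and the median
five <- fivenum(ages)  # min, lower hinge, median, upper hinge, max
manual <- five[3] + c(-1.58, 1.58) * (five[4] - five[2]) / sqrt(length(ages))

print(notch)
print(manual)
```

Running boxplot.stats() on myData$Age separately for each value of Survived would give the two notch intervals to compare.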