Week 1 HW 1 assignment - Titanic

Let’s first load the data.

setwd("/Users/ginaocchipinti/Documents/ADEC 7310  Data Analytics/Week 1")

mydata <- read.csv('/Users/ginaocchipinti/Documents/ADEC 7310  Data Analytics/Week 1/train.csv')

Q1 a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

The code below uses the str() function to inspect the variables in our dataset. PassengerId is stored as an integer, but it is qualitative, nominal data. Although the values are numbers, they have no intrinsic ordering or magnitude: Passenger 1 is not lower or higher than Passenger 2; the ID is simply a label for the observation. Adding up or averaging the Passenger IDs would not produce a meaningful result. Age is a quantitative variable and is stored as numeric. We could take the average age, for example, and that would be meaningful. Age is measured at the ratio level because it has a true zero point and ratios are meaningful: a 40-year-old is twice as old as a 20-year-old. There are no passengers aged exactly 0 in our dataset, though there are a few children under age 1. A ratio variable also has the properties of the lower levels: ages can be ordered (someone who is 10 is younger than someone who is 50), and the intervals are equal (the gap between 20 and 30 is the same as the gap between 30 and 40). Fractional values in the dataset, such as 70.50 and 0.42, do not change this; they are simply ages recorded more precisely than whole years.

?str
str(mydata)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
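
The distinction between a nominal label and a ratio measurement can be made explicit in R's own types. The following is a minimal sketch on a small invented data frame (`toy` is hypothetical and only borrows the Titanic column names): the ID is stored as text so that arithmetic on it is no longer possible, while differences, ratios, and means of Age remain meaningful.

```r
# Hypothetical toy data; column names mirror the Titanic file, values are invented
toy <- data.frame(
  PassengerId = 1:4,
  Age         = c(22, 38, 0.42, 70.5)
)

# Nominal: treat the ID as a label so arithmetic on it is no longer possible
toy$PassengerId <- as.character(toy$PassengerId)

# Ratio: differences, ratios, and means of ages are all meaningful
age_gap   <- toy$Age[2] - toy$Age[1]   # a difference in years
age_ratio <- toy$Age[2] / toy$Age[1]   # "how many times as old"
mean_age  <- mean(toy$Age)

print(class(toy$PassengerId))
print(c(gap = age_gap, ratio = age_ratio, mean = mean_age))
```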

Q1b. Which variable has the most missing observations?

Using the code below, we can see that Cabin is the variable with the most missing observations, at 687.

mydata[mydata == "" | mydata == " "] <- NA
colSums(is.na(mydata))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
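
As an alternative sketch, missing values can be counted without overwriting the data first. The helper name `count_missing` and the data frame `toy` below are hypothetical and the values are invented; the idea is to treat both NA and blank or whitespace-only strings as missing, then pick the column with the largest count.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  Age      = c(22, NA, 26, NA),
  Cabin    = c("", "C85", " ", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Count values that are NA or, for text columns, empty after trimming whitespace
count_missing <- function(x) {
  blank <- is.character(x) & !is.na(x) & trimws(x) == ""
  sum(is.na(x) | blank)
}

missing_counts <- sapply(toy, count_missing)
print(missing_counts)

# Name of the column with the most missing values
print(names(which.max(missing_counts)))
```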

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE) . You can read up on indexing in dataframe (i.e. “dataframe_name$variable_name”).

The code below selects the entries of Age that are missing and assigns them the median of the non-missing ages. The median of Age is 28, as the summary confirms; filling the gaps with the median itself leaves the median unchanged. We then verify the result by counting the missing values left in the Age column. The count is 0 because all 177 missing values were just replaced with the median.

mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE)
median(mydata$Age, na.rm=TRUE)
## [1] 28
summary(mydata$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.42   22.00   28.00   29.36   35.00   80.00
sum(is.na(mydata$Age))
## [1] 0

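With SibSp, we can do the same. The median of SibSp is 0, so any missing values would be imputed with 0. The missing-value counts from Q1b show that SibSp has no missing observations, so this step leaves the column unchanged.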

mydata$SibSp[is.na(mydata$SibSp)] <- median(mydata$SibSp, na.rm=TRUE)
median(mydata$SibSp, na.rm=TRUE)
## [1] 0
summary(mydata$SibSp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.523   1.000   8.000
sum(is.na(mydata$SibSp))
## [1] 0

We can do the same for Parch, which also has a median of 0 and no missing observations.

mydata$Parch[is.na(mydata$Parch)] <- median(mydata$Parch, na.rm=TRUE)
median(mydata$Parch, na.rm=TRUE)
## [1] 0
summary(mydata$Parch)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3816  0.0000  6.0000
sum(is.na(mydata$Parch))
## [1] 0
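
The three imputations above repeat one pattern, and the question also mentions the column mode, for which base R has no built-in statistical function. The sketch below is one possible way to wrap both cases; `stat_mode`, `impute_column`, and `toy` are hypothetical names, and the data are invented. It records the NA counts before and after, so a column that had nothing to impute is easy to spot.

```r
# Hypothetical helper: most frequent non-missing value (ties go to the first one seen)
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: fill NAs with the median (numeric) or the mode (anything else)
impute_column <- function(x) {
  fill <- if (is.numeric(x)) median(x, na.rm = TRUE) else stat_mode(x)
  x[is.na(x)] <- fill
  x
}

# Invented toy data, not the Titanic file
toy <- data.frame(
  Age      = c(22, NA, 26, 35, NA),
  Embarked = c("S", NA, "S", "C", "S"),
  stringsAsFactors = FALSE
)

na_before <- colSums(is.na(toy))
toy[] <- lapply(toy, impute_column)   # toy[] keeps the data frame structure
na_after <- colSums(is.na(toy))

print(na_before)
print(na_after)
print(toy)
```

Comparing `na_before` with `na_after` shows which columns actually changed; a column whose count was already 0 passes through untouched.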

Q3. Install the psych package in R: install.packages("psych"). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.

Age, SibSp, and Parch have very different means. The mean of Age (29.36) is much higher because ages are typically far larger than the number of siblings, spouses, parents, or children someone has. The means of the latter two variables are close together (0.52 and 0.38). Age also has much lower skew (0.51) and kurtosis (0.97), so its distribution is roughly symmetric, with only a mild right skew and much lighter tails than the other two variables. SibSp and Parch have large positive skew (3.68 and 2.74) and kurtosis (17.73 and 9.69), indicating long right tails with most of the data concentrated at the low end; the median of both is 0. This makes sense, as these counts are typically small (people won’t have tens of siblings, spouses, parents, or children).

# install.packages("psych")
library(psych)
describe(mydata$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
describe(mydata$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
describe(mydata$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03
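
To make the skew and kurtosis columns more concrete, here is a minimal base R sketch of the plain moment-based formulas on invented vectors. The function names `skewness_m` and `excess_kurtosis_m` are hypothetical, and describe() may apply a slightly different small-sample adjustment, so the numbers are not expected to match its output exactly.

```r
# Moment-based skewness: third central moment scaled by variance^1.5
skewness_m <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment scaled by variance^2, minus 3
excess_kurtosis_m <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Invented examples: a symmetric vector and a count-like vector
symmetric  <- c(1, 2, 3, 4, 5)
count_like <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 8)  # mostly zeros, one large value

print(skewness_m(symmetric))
print(skewness_m(count_like))
print(excess_kurtosis_m(count_like))
```

The symmetric vector has zero skewness, while the count-like vector, which resembles SibSp and Parch in shape, gives positive skewness and positive excess kurtosis.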

Q4. Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?

More females than males survived on the Titanic: 233 of 314 females (about 74%) survived, compared with 109 of 577 males (about 19%).

table(mydata$Survived, mydata$Sex)
##    
##     female male
##   0     81  468
##   1    233  109
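
Because there were far more males than females aboard, survival rates are easier to compare than raw counts. The sketch below rebuilds the table from the counts printed above, so it runs without the data file, and converts each column to proportions with prop.table(); the object names `counts` and `rates` are just illustrative.

```r
# Counts copied from the cross-tabulation above (rows: Survived, columns: Sex)
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Column proportions: share of each sex that did or did not survive
rates <- prop.table(counts, margin = 2)
print(round(rates, 3))
```

With the full dataset loaded, the same result should come from prop.table(table(mydata$Survived, mydata$Sex), margin = 2).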

Q5. Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

From the boxplot, the median age is about the same for those who survived and those who didn’t, in the late 20s. The quartiles differ a bit, though. The lower quartile is a little under 20 for those who survived and slightly over 20 for those who didn’t. Likewise, the upper quartile is slightly under 40 for those who didn’t survive and closer to 35 for those who survived. Those who didn’t survive skew slightly older, with a longer “whisker” and a few more older outliers, around 70-75.

boxplot(mydata$Age ~ mydata$Survived, notch = TRUE, horizontal = TRUE,
        main = "Boxplot of Age vs Survived on the Titanic",
        ylab = "Survived",
        xlab = "Age")
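
Reading quartiles off a plot is approximate, so it can help to compute them as well. The following is a sketch on invented data (`toy` is hypothetical, not the Titanic file): tapply() gives the quartiles per group, and boxplot() with plot = FALSE returns the statistics it would draw, including the notch limits.

```r
# Invented toy data, not the Titanic file
toy <- data.frame(
  Age      = c(4, 19, 27, 28, 36, 52, 21, 25, 28, 39, 47, 71),
  Survived = c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Quartiles of Age within each Survived group
quartiles <- tapply(toy$Age, toy$Survived, quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# boxplot() can also return the statistics it would draw, without plotting
bp <- boxplot(Age ~ Survived, data = toy, plot = FALSE)
print(bp$stats)  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$conf)   # rows: lower and upper notch limits, one column per group
```

If the notch ranges in `bp$conf` overlap between the two groups, the plot gives no strong evidence that the medians differ.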