# Data directory "Titanic"

mydata <- read.csv('/Users/pin.lyu/Desktop/titanic/test.csv')

Q1

a) What are the variable types (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?
```
# Print "PassengerId"
str(mydata$PassengerId)
```
```
##  int [1:418] 892 893 894 895 896 897 898 899 900 901 ...
```
```
# Print "Age"
str(mydata$Age)
```
```
##  num [1:418] 34.5 47 62 27 22 14 30 26 18 21 ...
```
Answer: “PassengerId” is qualitative. Although it is stored as an integer, the ID only labels individual passengers, so arithmetic on it is meaningless and it is measured on a nominal scale. “Age” is quantitative and is measured on a ratio scale: it has a true zero (the moment of birth), so ratios are meaningful, e.g. a 40-year-old is twice as old as a 20-year-old. A missing age is recorded as NA, not as 0.
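One way to check that an integer column acts as a label rather than a measurement is to confirm that every value is unique and then store it as text. The sketch below uses a made-up `toy` data frame (an assumption for illustration, not the contents of test.csv).

```r
# Toy data (hypothetical values, not read from test.csv)
toy <- data.frame(
  PassengerId = c(892L, 893L, 894L, 895L),
  Age         = c(34.5, 47, 62, 27)
)

# An identifier has one distinct value per row
id_is_unique <- !anyDuplicated(toy$PassengerId)

# Storing the ID as character prevents accidental arithmetic on it
toy$PassengerId <- as.character(toy$PassengerId)

# Ratios of ages are meaningful on a ratio scale
age_ratio <- toy$Age[3] / toy$Age[4]

str(toy)
```

Converting the ID to character is a design choice, not a requirement; a factor would serve the same purpose.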

b) Which variable has the most missing observations? You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.

# Number of 0s in "SibSp"
table(mydata$SibSp)

## 
##   0   1   2   3   4   5   8 
## 283 110  14   4   4   1   2

# Number of 0s in "Parch"
table(mydata$Parch)

## 
##   0   1   2   3   4   5   6   9 
## 324  52  33   3   2   1   1   2

# Frequency table of "Age" (table() omits NAs by default)
table(mydata$Age)

## 
## 0.17 0.33 0.75 0.83 0.92    1    2    3    5    6    7    8    9   10 11.5   12 
##    1    1    1    1    1    3    2    1    1    3    1    2    2    2    1    2 
##   13   14 14.5   15   16   17   18 18.5   19   20   21   22 22.5   23   24   25 
##    3    2    1    1    2    7   13    3    4    8   17   16    1   11   17   11 
##   26 26.5   27   28 28.5   29   30   31   32 32.5   33   34 34.5   35   36 36.5 
##   12    1   12    7    1   10   15    6    6    2    6    1    1    5    9    1 
##   37   38 38.5   39   40 40.5   41   42   43   44   45   46   47   48   49   50 
##    3    3    1    6    5    1    5    5    4    1    9    3    5    5    3    5 
##   51   53   54   55   57   58   59   60 60.5   61   62   63   64   67   76 
##    1    3    2    6    3    1    1    3    1    2    1    2    3    1    1

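Answer: The tables above count values, not missing entries. The 283 zeros in “SibSp” and the 324 zeros in “Parch” are valid observations (passengers travelling without siblings/spouses or without parents/children), and both tables add up to all 418 rows, so neither variable has missing values. The “Age” table adds up to only 332 of the 418 rows because table() omits NAs by default, so “Age” has 86 missing values. Of the three variables examined, “Age” therefore has the most missing observations.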
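Missing values in R are stored as NA, so a per-column count can be sketched with `is.na()` and `colSums()`. The `toy` data frame below is hypothetical; it also shows that blank strings in character columns are not NA and need a separate count.

```r
# Toy data frame (hypothetical values) with NAs and blank strings
toy <- data.frame(
  Age   = c(34.5, NA, 62, NA),
  SibSp = c(0L, 1L, 0L, 0L),
  Cabin = c("", "B45", "", ""),
  stringsAsFactors = FALSE
)

# NA count per column: zeros in SibSp are real values and are not counted
na_counts <- colSums(is.na(toy))

# Blank strings are not NA, so count them separately for character columns
blank_counts <- sapply(toy, function(x) {
  if (is.character(x)) sum(x == "", na.rm = TRUE) else 0L
})

# Name of the column with the most NAs
most_missing <- names(which.max(na_counts))
```

On the real data, the same two lines would be applied to `mydata` instead of `toy`.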

Q2

Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE) . You can read up on indexing in dataframe. (i.e. “dataframe_name$variable_name”).

# Calculate Median 
median_age <- median(mydata$Age, na.rm=TRUE)

# Turning N/As into median in "Age"
mydata$Age[is.na(mydata$Age)] <- median_age

summary(mydata)

##   PassengerId         Pclass          Name               Sex           
##  Min.   : 892.0   Min.   :1.000   Length:418         Length:418        
##  1st Qu.: 996.2   1st Qu.:1.000   Class :character   Class :character  
##  Median :1100.5   Median :3.000   Mode  :character   Mode  :character  
##  Mean   :1100.5   Mean   :2.266                                        
##  3rd Qu.:1204.8   3rd Qu.:3.000                                        
##  Max.   :1309.0   Max.   :3.000                                        
##                                                                        
##       Age            SibSp            Parch           Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
##  1st Qu.:23.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :29.60   Mean   :0.4474   Mean   :0.3923                     
##  3rd Qu.:35.75   3rd Qu.:1.0000   3rd Qu.:0.0000                     
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
##                                                                      
##       Fare            Cabin             Embarked        
##  Min.   :  0.000   Length:418         Length:418        
##  1st Qu.:  7.896   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character  
##  Mean   : 35.627                                        
##  3rd Qu.: 31.500                                        
##  Max.   :512.329                                        
##  NA's   :1

Answer: The “Age” column of the summary no longer has an NA's row (only “Fare” still reports one), which shows that the missing ages have been replaced by the median age, 27. Next, we apply the same procedure to “SibSp” and “Parch”.

# Same process on "SibSp"
median_SibSp <- median(mydata$SibSp, na.rm = TRUE)
mydata$SibSp[is.na(mydata$SibSp)] <- median_SibSp

# Same process on "Parch"
median_Parch <- median(mydata$Parch, na.rm = TRUE)
mydata$Parch[is.na(mydata$Parch)] <- median_Parch


summary(mydata)

##   PassengerId         Pclass          Name               Sex           
##  Min.   : 892.0   Min.   :1.000   Length:418         Length:418        
##  1st Qu.: 996.2   1st Qu.:1.000   Class :character   Class :character  
##  Median :1100.5   Median :3.000   Mode  :character   Mode  :character  
##  Mean   :1100.5   Mean   :2.266                                        
##  3rd Qu.:1204.8   3rd Qu.:3.000                                        
##  Max.   :1309.0   Max.   :3.000                                        
##                                                                        
##       Age            SibSp            Parch           Ticket         
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
##  1st Qu.:23.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
##  Mean   :29.60   Mean   :0.4474   Mean   :0.3923                     
##  3rd Qu.:35.75   3rd Qu.:1.0000   3rd Qu.:0.0000                     
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
##                                                                      
##       Fare            Cabin             Embarked        
##  Min.   :  0.000   Length:418         Length:418        
##  1st Qu.:  7.896   Class :character   Class :character  
##  Median : 14.454   Mode  :character   Mode  :character  
##  Mean   : 35.627                                        
##  3rd Qu.: 31.500                                        
##  Max.   :512.329                                        
##  NA's   :1

Note that the summary is unchanged for “SibSp” and “Parch”: both still have a minimum of 0. This is expected. Neither column contains any NA values, so the imputation step has nothing to replace. The zeros are valid counts rather than missing entries, so they are left alone. The median of each variable also happens to be 0.
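The question also allows imputing with the column mode. A minimal sketch of both options follows; `col_mode()` and `impute_column()` are hypothetical helper names, and the `toy` data frame is made up for illustration.

```r
# Most frequent non-missing value (ties go to the first value in sorted order)
col_mode <- function(x) {
  counts <- table(x)              # table() drops NAs by default
  names(counts)[which.max(counts)]
}

# Hypothetical helper: median for numeric columns, mode for everything else
impute_column <- function(x) {
  if (!anyNA(x)) return(x)        # nothing to impute, e.g. SibSp and Parch
  fill <- if (is.numeric(x)) median(x, na.rm = TRUE) else col_mode(x)
  x[is.na(x)] <- fill
  x
}

toy <- data.frame(
  Age      = c(22, NA, 30, 40),
  SibSp    = c(0, 1, 0, 0),
  Embarked = c("S", "S", NA, "C"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, impute_column)
```

Columns without NAs pass through untouched, which mirrors what happens to “SibSp” and “Parch” above.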

Q3

Install the psych package in R: install.packages('psych'). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)).

# Load the "psych" package
library(psych)

# Descriptive stats of "Age", "SibSp", and "Parch"
describe(mydata$Age)

##    vars   n mean   sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 418 29.6 12.7     27    28.8 7.41 0.17  76 75.83 0.66     0.88 0.62

describe(mydata$SibSp)

##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 418 0.45 0.9      0    0.28   0   0   8     8 4.14    26.03 0.04

describe(mydata$Parch)

##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 418 0.39 0.98      0    0.16   0   0   9     9 4.62    30.86 0.05
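For a quick cross-check without the psych package, a few of the same statistics can be computed in base R for several columns at once. This is only a sketch: `basic_stats()` is a hypothetical helper, the `toy` values are made up, and it covers a subset of what `describe()` reports.

```r
# Toy data (hypothetical values, not the Titanic file)
toy <- data.frame(
  Age   = c(22, 27, 30, 41),
  SibSp = c(0, 1, 0, 3),
  Parch = c(0, 0, 2, 0)
)

# Hypothetical helper returning a named vector of basic statistics
basic_stats <- function(x) {
  c(n      = sum(!is.na(x)),
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    min    = min(x, na.rm = TRUE),
    max    = max(x, na.rm = TRUE))
}

# One row per variable, one column per statistic
stats <- t(sapply(toy, basic_stats))
print(round(stats, 2))
```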

Q4

Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?
```
mydata2 <- read.csv('/Users/pin.lyu/Desktop/titanic/train.csv')

table(mydata2$Survived, mydata2$Sex)
```
```
##    
##     female male
##   0     81  468
##   1    233  109
```
Answer: Survival differs sharply by sex. Of the 314 women, 233 (about 74%) survived, compared with 109 of the 577 men (about 19%). This is consistent with women and children being given priority when boarding the lifeboats.
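Raw counts are easier to compare as proportions within each sex. The sketch below re-enters the counts from the cross-tabulation above by hand (so it runs without train.csv) and uses `prop.table()` with `margin = 2` to normalise each column.

```r
# Counts copied from the cross-tabulation above (rows: Survived, columns: Sex)
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Share of each sex that died (row "0") or survived (row "1")
col_share <- prop.table(counts, margin = 2)
print(round(col_share, 3))
```

With the data loaded, `prop.table(table(mydata2$Survived, mydata2$Sex), margin = 2)` would give the same proportions directly.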

Q5

Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

# Notched boxplot of age by survival status (missing ages are dropped)
boxplot(mydata2$Age ~ mydata2$Survived,
        notch = TRUE,
        horizontal = TRUE,
        xlab = "Age",
        ylab = "Survived (0 = died, 1 = survived)",
        main = "Age by Survival Status",
        col  = "skyblue",
        border = "navy")

Answer: In both groups (survived and died), most passengers were roughly 20 to 40 years old, and the median age in each group is around 28. The survivor group also includes many children. Note that this plot uses train.csv (mydata2), where “Age” was not imputed; boxplot() drops the missing ages, so the plot reflects only passengers with a recorded age.
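The medians read off a boxplot can be checked numerically with `tapply()`, and the number of ages the plot silently drops can be counted with `is.na()`. The `toy` data frame here is hypothetical and only illustrates the calls.

```r
# Toy data (hypothetical values, not read from train.csv)
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1),
  Age      = c(22, NA, 40, 4, 28, 35)
)

# Median age per survival group, ignoring missing ages
group_median <- tapply(toy$Age, toy$Survived, median, na.rm = TRUE)

# Number of rows excluded from the plot because Age is missing
n_missing_age <- sum(is.na(toy$Age))

print(group_median)
print(n_missing_age)
```

On the real data, the same calls would use `mydata2$Age` and `mydata2$Survived`.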

DA_Assignment#1

2023-09-10
