This is my first markdown document, made for HW 1!

First, I will bring in data with the train.csv file:

mydata <- read.csv("C:/Users/cosovich/Desktop/Intro to Data Analysis/Week 1/train.csv")

Q1a. What are the types of variable (quantitative / qualitative) and levels of measurement (nominal / ordinal / interval / ratio) for PassengerId and Age?

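PassengerId and Age are both stored as numbers, but they are different types of variable. PassengerId only identifies each case uniquely, so arithmetic on it is meaningless: it is a qualitative variable measured at the nominal level. Age is quantitative and should be categorized as ratio, since it has a true zero and ratios of ages are meaningful.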

Q1b. Which variable has the most missing observations? You could have Googled for “Count NA values in R for all columns in a dataframe” or something like that.

colSums(is.na(mydata))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Here we can see that Age is the only variable with NA values (177), so it has the highest count of missing observations as counted by is.na().
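One caveat with this count: is.na() only sees values that R stored as NA. In text columns, read.csv() keeps an empty field as the string "" by default, so blanks there are not counted. Below is a minimal sketch of counting both kinds of missing value. The toy data frame and its values are made up for illustration and are not the real train.csv.

```r
# Toy data frame (hypothetical values, not the real train.csv)
toy <- data.frame(
  Age   = c(22, NA, 35),
  Cabin = c("C85", "", ""),
  stringsAsFactors = FALSE
)

# is.na() alone misses the blank strings in Cabin
na_only <- colSums(is.na(toy))

# Treat both NA and "" as missing
na_or_blank <- colSums(is.na(toy) | toy == "")

print(na_only)
print(na_or_blank)
```

Another option is to pass na.strings = c("", "NA") to read.csv() so that blank text fields are read in as NA from the start.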

Q2. Impute missing observations for Age, SibSp, and Parch with the column median (ordinal, interval, or ratio), or the column mode. To do so, use something like this: mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE). You can read up on indexing in dataframes (i.e. “dataframe_name$variable_name”).

# Now we'll impute the missing Age observations with the column median
# (SibSp and Parch have no NA values above, so they need no imputation)
mydata$Age[is.na(mydata$Age)] <- median(mydata$Age, na.rm=TRUE)

If that worked, every column should now show zero NA values:

colSums(is.na(mydata))
## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0           0 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0           0           0

Bingo!
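Q2 also mentions the column mode, and base R has no built-in function for the most frequent value (mode() returns the storage type instead). Here is a rough sketch of how median-or-mode imputation could look. The helper names col_mode and impute_column are my own hypothetical names, and the vectors are toy values, not columns from train.csv.

```r
# Hypothetical helper: most frequent non-missing value of a vector
col_mode <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]   # ties resolve to the first value
}

# Hypothetical helper: median for numeric columns, mode otherwise
impute_column <- function(x) {
  if (is.numeric(x)) {
    fill <- median(x, na.rm = TRUE)
  } else {
    fill <- col_mode(x)
  }
  x[is.na(x)] <- fill
  x
}

# Toy values (not taken from train.csv)
age      <- c(22, NA, 38, 26, NA)
embarked <- c("S", "C", NA, "S", "Q")

print(impute_column(age))
print(impute_column(embarked))
```

This only fills NA values, so blank strings would need to be converted to NA first if they should be imputed too.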

Q3. Install the psych package in R: install.packages("psych"). Invoke the package using library(psych). Then provide descriptive statistics for Age, SibSp, and Parch (e.g., describe(mydata$Age)). Please comment on what you observe from the summary statistics.

#install the psych package
install.packages("psych")
## Installing package into 'C:/Users/cosovich/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'psych' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\cosovich\AppData\Local\Temp\RtmpsDBGcV\downloaded_packages
library(psych)

#Now I'll get the descriptive statistics for 3 variables
describe(mydata$Age)
##    vars   n  mean    sd median trimmed mad  min max range skew kurtosis   se
## X1    1 891 29.36 13.02     28   28.83 8.9 0.42  80 79.58 0.51     0.97 0.44
describe(mydata$SibSp)
##    vars   n mean  sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.52 1.1      0    0.27   0   0   8     8 3.68    17.73 0.04
describe(mydata$Parch)
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 891 0.38 0.81      0    0.18   0   0   6     6 2.74     9.69 0.03

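The median of SibSp is 0, so at least half of the passengers were travelling without siblings or a spouse, and 0 is also the most common value. Parch (the number of parents/children aboard) is similarly low, with a median of 0 and a mean of 0.38, suggesting that most passengers were not travelling with parents or children. Both counts are strongly right-skewed (skew of 3.68 and 2.74), with maximum values of 8 and 6. Ages range widely, from 0.42 to 80, but with a median of 28 it is plausible that many passengers were young adults, possibly without spouses or children. Note that the 177 imputed ages all sit at the median, so the reported spread of Age (sd, mad) is somewhat smaller than it would be for the observed ages alone.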
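Since the 177 missing ages were filled with the median before describe() was run, the Age statistics include those filled-in values. The sketch below uses a small made-up vector (not the real Age column) to illustrate how median imputation can pull the standard deviation down while leaving the median unchanged.

```r
# Toy ages with missing values (hypothetical, not from train.csv)
ages <- c(2, 19, 24, 28, 31, 40, 58, NA, NA, NA)

# Spread of the observed values only
sd_observed <- sd(ages, na.rm = TRUE)

# Fill the NAs with the median, as done for mydata$Age
imputed <- ages
imputed[is.na(imputed)] <- median(ages, na.rm = TRUE)
sd_imputed <- sd(imputed)

print(c(observed = sd_observed, imputed = sd_imputed))
print(median(imputed))
```

Running describe() on the Age column before imputing would show the spread of the observed ages for comparison.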

Q4. Provide a cross-tabulation of Survived and Sex (e.g., table(mydata$Survived, mydata$Sex)). What do you notice?

table(mydata$Survived, mydata$Sex)
##    
##     female male
##   0     81  468
##   1    233  109

This cross-tabulation shows a large gap in survival rates between men (about 19%, 109 of 577) and women (about 74%, 233 of 314). As women and children were prioritized when filling the lifeboats, more men may have gone down with the ship or died swimming for their lives in the frigid water.
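The survival rates can be computed from the counts rather than by hand. Here is a small sketch that rebuilds the table from the four counts printed above and converts it to within-sex proportions with prop.table(). The object names counts and rates are just illustrative.

```r
# Counts copied from the cross-tabulation above (rows: Survived, columns: Sex)
counts <- matrix(
  c(81, 233, 468, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 2 divides each cell by its column total,
# giving the proportions within each sex
rates <- prop.table(counts, margin = 2)

print(round(rates, 3))
```

On the actual data the same idea would be prop.table(table(mydata$Survived, mydata$Sex), margin = 2).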

Q5. Provide notched boxplots for Survived and Age (e.g., boxplot(mydata$Age~mydata$Survived, notch=TRUE, horizontal=T)). What do you notice?

boxplot(Age ~ Survived, data = mydata, notch = TRUE, horizontal = TRUE)

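This boxplot compares the age distributions of passengers who died (0) and those who survived (1). It is hard to miss the many outlying data points. If the oldest outliers are concentrated in the group that died, that would suggest low survival among the elderly, though a boxplot shows age distributions rather than survival rates, so it cannot demonstrate this on its own. Some of the outliers are also likely an artifact of the imputation: filling 177 ages with the median narrows the boxes, so more points fall outside the whiskers. Lower survival among the elderly would make sense, as they were more likely to be unable to escape the sinking ship and more susceptible to the low temperatures if they did reach the lifeboats.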
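The numbers behind a notched boxplot can also be read off directly, which helps when the picture is crowded with outliers. Below is a minimal sketch using boxplot.stats() from base R on simulated data. The toy data frame is generated with rnorm() purely for illustration and does not reproduce the Titanic ages.

```r
set.seed(1)  # toy data only; values are simulated, not from train.csv
toy <- data.frame(
  Survived = rep(c(0, 1), each = 50),
  Age      = c(rnorm(50, mean = 30, sd = 12), rnorm(50, mean = 28, sd = 12))
)

# boxplot.stats() returns the numbers a notched boxplot draws:
#   $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
#   $conf  = notch limits, roughly median +/- 1.58 * IQR / sqrt(n)
#   $out   = points lying beyond the whiskers
by_group <- lapply(split(toy$Age, toy$Survived), boxplot.stats)

for (g in names(by_group)) {
  s <- by_group[[g]]
  cat("Survived =", g,
      "| median:", round(s$stats[3], 1),
      "| notch:", round(s$conf, 1),
      "| outliers:", length(s$out), "\n")
}
```

If the notch ranges of the two groups do not overlap, that is usually read as rough evidence that the medians differ. Overlapping notches mean the plot gives no such evidence.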