Most of my conclusions seem to be confirmed by the analysis above. The survival rate among women appears to be much higher than among men. However, my assumption about classes seems not to be entirely correct. 2nd Class seems to be displaying different data. 1st Class and 3rd Class survival rates are in line with my assumptions.
- Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)
NewData <- data.frame(MyData$class, MyData$age, MyData$sex, MyData$survived)
colnames(NewData) <- c("Class","Age", "Gender", "Survival")
head(NewData)
##       Class    Age Gender Survival
## 1 1st class adults    man      yes
## 2 1st class adults    man      yes
## 3 1st class adults    man      yes
## 4 1st class adults    man      yes
## 5 1st class adults    man      yes
## 6 1st class adults    man      yes
AdultData <- NewData[which(NewData$Age=='adults'),]
head(AdultData)
##       Class    Age Gender Survival
## 1 1st class adults    man      yes
## 2 1st class adults    man      yes
## 3 1st class adults    man      yes
## 4 1st class adults    man      yes
## 5 1st class adults    man      yes
## 6 1st class adults    man      yes
I have excluded children from my data because that data makes me sad.
AdultData$SurvivalNum <- ifelse(AdultData$Survival=='yes', 1, 0)
myTable <- as.data.frame(table(AdultData$Class, AdultData$SurvivalNum))
myTable3 <- as.data.frame(table(AdultData$Gender, AdultData$Survival))
myTable3
##    Var1 Var2 Freq
## 1   man   no  659
## 2 women   no  106
## 3   man  yes  146
## 4 women  yes  296
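The tables below show each group's count as a percentage of all 1207 adult observations in my dataset. These are shares of the whole dataset, not survival rates within a class or gender.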
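# Survivors (Var2 == 1) and casualties (Var2 == 0) by class
MyTable2 <- myTable[which(myTable$Var2=='1'),]
MyTable3 <- myTable[which(myTable$Var2=='0'),]
# Same split by gender; MyTable3 and myTable3 are different objects (R is case-sensitive)
MyTable2Gen <- myTable3[which(myTable3$Var2=='yes'),]
MyTable3Gen <- myTable3[which(myTable3$Var2=='no'),]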
PercMyTable2 <- myTable
# Divide by the total count (1207 adults) rather than hard-coding it
PercMyTable2$Percentage <- PercMyTable2$Freq / sum(PercMyTable2$Freq) * 100
PercMyTable2
##        Var1 Var2 Freq Percentage
## 1 1st class    0  122  10.107705
## 2 2nd class    0  167  13.835957
## 3 3rd class    0  476  39.436620
## 4 1st class    1  197  16.321458
## 5 2nd class    1   94   7.787904
## 6 3rd class    1  151  12.510356
PercMyTable3 <- myTable3
# Divide by the total count (1207 adults) rather than hard-coding it
PercMyTable3$Percentage <- PercMyTable3$Freq / sum(PercMyTable3$Freq) * 100
PercMyTable3
##    Var1 Var2 Freq Percentage
## 1   man   no  659  54.598177
## 2 women   no  106   8.782104
## 3   man  yes  146  12.096106
## 4 women  yes  296  24.523612
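The percentages above are shares of all adults, so they do not directly answer how likely a passenger in a given class or of a given gender was to survive. A minimal sketch of the within-group rates follows, assuming the counts printed in the tables above; the object names `class_counts`, `gender_counts`, `class_rates` and `gender_rates` are introduced here for illustration only.

```r
# Adult counts copied from the frequency tables above
class_counts <- matrix(
  c(122, 167, 476,   # did not survive: 1st, 2nd, 3rd class
    197,  94, 151),  # survived:        1st, 2nd, 3rd class
  nrow = 3,
  dimnames = list(Class = c("1st class", "2nd class", "3rd class"),
                  Survival = c("no", "yes"))
)
gender_counts <- matrix(
  c(659, 106,        # did not survive: man, women
    146, 296),       # survived:        man, women
  nrow = 2,
  dimnames = list(Gender = c("man", "women"),
                  Survival = c("no", "yes"))
)

# margin = 1 divides every row by its own total, giving rates within each group
class_rates  <- round(prop.table(class_counts,  margin = 1) * 100, 1)
gender_rates <- round(prop.table(gender_counts, margin = 1) * 100, 1)

print(class_rates)
print(gender_rates)
```

With the data frame itself, the same idea would be `prop.table(table(AdultData$Class, AdultData$Survival), margin = 1)`.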
- Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
tbl <- with(AdultData, table(Gender, SurvivalNum))
barplot(tbl, main="Survival by Gender", names.arg = c("Did Not Survive", "Survived"), beside = TRUE, legend = TRUE)

tbl2 <- with(AdultData, table(Class, SurvivalNum))
barplot(tbl2, main="Survival by Class", names.arg = c("Did Not Survive", "Survived"), beside = TRUE, legend = TRUE)

hist(MyTable2$Freq, col='red')

library(ggplot2)  # harmless if already loaded earlier

# Survivor counts by gender
ggplot(MyTable2Gen, aes(x = Var1, y = Freq)) + geom_point() + xlab("Gender") + ylab("Survived Count") + ggtitle("Survivals by Gender")

# Casualty counts by gender
ggplot(MyTable3Gen, aes(x = Var1, y = Freq)) + geom_point() + xlab("Gender") + ylab("Casualty Count") + ggtitle("Casualties by Gender")

# Survivor counts by class
ggplot(MyTable2, aes(x = Var1, y = Freq)) + geom_point() + xlab("Class") + ylab("Survived Count") + ggtitle("Survivals by Class")

# Casualty counts by class
ggplot(MyTable3, aes(x = Var1, y = Freq)) + geom_point() + xlab("Class") + ylab("Casualty Count") + ggtitle("Casualties by Class")
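The point plots above show raw counts, which are hard to compare because the classes differ in size. One possible alternative is a proportional stacked bar chart; the sketch below assumes the adult class counts from the tables earlier in this report, and the names `adult_counts` and `p` are made up for this example.

```r
library(ggplot2)

# Adult counts by class and survival, copied from the frequency table above
adult_counts <- data.frame(
  Class    = rep(c("1st class", "2nd class", "3rd class"), times = 2),
  Survival = rep(c("no", "yes"), each = 3),
  Freq     = c(122, 167, 476, 197, 94, 151)
)

# position = "fill" rescales every bar to 100%, so bars show shares, not counts
p <- ggplot(adult_counts, aes(x = Class, y = Freq, fill = Survival)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = function(x) paste0(x * 100, "%")) +
  xlab("Class") +
  ylab("Share of adults") +
  ggtitle("Adult survival share by class")

print(p)
```

Swapping `Class` for a gender column with the corresponding counts would give the same view by gender.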

Most of my conclusions seem to be confirmed by the analysis above. The survival rate among women appears to be much higher than among men. It looks like my assumption that chivalry was not dead at that time was right. However, my assumption about classes seems not to be entirely correct. 2nd Class seems to be displaying different results: according to the table above, 167 adults from 2nd Class died and 94 survived, so more of them died than survived. 1st Class and 3rd Class survival rates are in line with my assumptions. In retrospect, the dataset that I picked made it very challenging to include graphs in my analysis, since the data was largely not numeric.
- BONUS – place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career. Done!
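For reference, reading a CSV from a link can be sketched as below. This is only an illustration: `read_remote_csv` is a hypothetical helper, the GitHub address in the final comment is a placeholder rather than the real link, and the demo uses a temporary file behind a `file://` URL (assuming a Unix-like path) so that it runs offline.

```r
# Hypothetical helper: read.csv() accepts a URL in place of a file path
read_remote_csv <- function(csv_url) {
  read.csv(csv_url, stringsAsFactors = FALSE)
}

# Offline demonstration: a temporary file addressed through a file:// URL
# stands in for the GitHub link (assumes a Unix-like file path)
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(class = "1st class", age = "adults", sex = "man", survived = "yes"),
  tmp,
  row.names = FALSE
)
demo <- read_remote_csv(paste0("file://", tmp))
print(demo)

# With a real repository, pass the "raw" file link instead (placeholder below):
# MyData <- read_remote_csv("https://raw.githubusercontent.com/<user>/<repo>/<branch>/<file>.csv")
```

The link has to point at the raw file contents; the regular GitHub page for a file returns HTML, which `read.csv()` cannot parse as data.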