R Bridge Course Final Project
This is a final project to show off what you have learned. Select your data set from the list below: http://vincentarelbundock.github.io/Rdatasets/ (click on the csv index for a list). Another good source is found here: https://archive.ics.uci.edu/ml/datasets.html
The presentation approach is up to you but it should contain the following:
Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.
Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example, if it makes sense, you could sum two columns together).
Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.
Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.
BONUS - place the original .csv in a github file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.
Please submit your .rmd file and the .csv file as well as a link to your RPubs.
This study is about social class and survival chances on the Titanic. The Titanic was called an ‘unsinkable’ luxury ship, yet it sank in the North Atlantic Ocean on April 15th, 1912 after it struck an iceberg. Nearly 1500 lives were lost because there were very few lifeboats for the passengers. My analysis presents the survival outcomes for first-, second-, and third-class passengers based on the data set, and examines whether social class and age affected the chance of survival in this life-and-death situation.
Methods: 1. Observe how survival chance relates to a passenger’s social class and sex by comparing the numbers and ratios of men and women in the three classes. 2. Observe how survival chance relates to a passenger’s age by examining the age distribution in each group from a subset of the data.
https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv
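Titanic<-read.csv(url("https://raw.githubusercontent.com/czhu505/R_W2_Assignment/master/Titanic.csv"), stringsAsFactors = TRUE)[-1] #stringsAsFactors = TRUE keeps text columns as factors (not the default since R 4.0.0); [-1] drops the first column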
options(repos = c(CRAN = "http://cran.rstudio.com"))
install.packages("ggplot2") #install first; the package is loaded with library() afterwards
## Installing package into 'C:/Users/Zhu/Documents/R/win-library/3.3'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Zhu\AppData\Local\Temp\Rtmp88wamZ\downloaded_packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
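The `[-1]` at the end of the `read.csv()` call above drops the first column of the file, presumably a row-index column. The sketch below shows that behaviour on a small inline CSV. The `csv_text` string and its three rows are made up for illustration and are not taken from the real Titanic.csv. It also passes `stringsAsFactors = TRUE`, because the `str()` output in this report shows factor columns, and `read.csv()` stopped creating factors by default in R 4.0.0.

```r
# Hypothetical three-row CSV with a leading row-index column,
# standing in for the real Titanic.csv.
csv_text <- '"","Name","PClass","Age"
"1","Passenger A","1st",29
"2","Passenger B","2nd",NA
"3","Passenger C","3rd",41'

raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
names(raw)        # the unnamed index column is read in as "X"

demo <- raw[-1]   # a negative index drops the first column
names(demo)
str(demo)         # PClass is a factor, Age keeps its NA
```

Dropping by position is compact, but it assumes the index column is always first; checking `names()` before and after is a cheap safeguard.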
Creating two data frames:
* passenger_Class - find survival chances by comparing the numbers and ratios in the different classes
* passenger_Age - find whether age matters for survival chance
1313 observations of 6 variables: Name, PClass, Age, Sex, Survived, SexCode
head(Titanic)
## Name PClass Age Sex
## 1 Allen, Miss Elisabeth Walton 1st 29.00 female
## 2 Allison, Miss Helen Loraine 1st 2.00 female
## 3 Allison, Mr Hudson Joshua Creighton 1st 30.00 male
## 4 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female
## 5 Allison, Master Hudson Trevor 1st 0.92 male
## 6 Anderson, Mr Harry 1st 47.00 male
## Survived SexCode
## 1 1 1
## 2 0 1
## 3 0 0
## 4 0 1
## 5 1 0
## 6 1 0
str(Titanic)
## 'data.frame': 1313 obs. of 6 variables:
## $ Name : Factor w/ 1310 levels "Abbing, Mr Anthony",..: 22 25 26 27 24 31 45 46 50 54 ...
## $ PClass : Factor w/ 4 levels "*","1st","2nd",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : num 29 2 30 25 0.92 47 63 39 58 71 ...
## $ Sex : Factor w/ 2 levels "female","male": 1 1 2 1 2 2 1 2 1 2 ...
## $ Survived: int 1 0 0 0 1 1 1 0 1 0 ...
## $ SexCode : int 1 1 0 1 0 0 1 0 1 0 ...
Observations from the summary statistics:
PClass (number of passengers): 1st (322), 2nd (279), 3rd (711), and * (1, unknown)
Age: 557 NA (number of missing values)
summary(Titanic)
## Name PClass Age
## Carlsson, Mr Frans Olof : 2 * : 1 Min. : 0.17
## Connolly, Miss Kate : 2 1st:322 1st Qu.:21.00
## Kelly, Mr James : 2 2nd:279 Median :28.00
## Abbing, Mr Anthony : 1 3rd:711 Mean :30.40
## Abbott, Master Eugene Joseph: 1 3rd Qu.:39.00
## Abbott, Mr Rossmore Edward : 1 Max. :71.00
## (Other) :1304 NA's :557
## Sex Survived SexCode
## female:462 Min. :0.0000 Min. :0.0000
## male :851 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3427 Mean :0.3519
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
##
(The 1 row with the unknown PClass value ’*’ is excluded from the analysis.)
T1 <- data.frame(Titanic$PClass,Titanic$Survived, Titanic$SexCode)
C1<-T1$Titanic.PClass=="1st"
C2<-T1$Titanic.PClass=="2nd"
C3<-T1$Titanic.PClass=="3rd"
S<-T1$Titanic.Survived>0
M<-T1$Titanic.SexCode<1
W<-T1$Titanic.SexCode>0
C1_M<-T1[which(C1 & M),] #dataset for 1st Class men
C2_M<-T1[which(C2 & M),] #dataset for 2nd Class men
C3_M<-T1[which(C3 & M),] #dataset for 3rd Class men
C1_W<-T1[which(C1 & W),] #dataset for 1st Class women
C2_W<-T1[which(C2 & W),] #dataset for 2nd Class women
C3_W<-T1[which(C3 & W),] #dataset for 3rd Class women
m1<-nrow(C1_M) #number of 1st class men
m2<-nrow(C2_M) #number of 2nd class men
m3<-nrow(C3_M) #number of 3rd class men
w1<-nrow(C1_W) #number of 1st class women
w2<-nrow(C2_W) #number of 2nd class women
w3<-nrow(C3_W) #number of 3rd class women
C1_S_M<-T1[which(C1 & S & M),] #dataset for 1st Class surviving men
C2_S_M<-T1[which(C2 & S & M),] #dataset for 2nd Class surviving men
C3_S_M<-T1[which(C3 & S & M),] #dataset for 3rd Class surviving men
C1_S_W<-T1[which(C1 & S & W),] #dataset for 1st Class surviving women
C2_S_W<-T1[which(C2 & S & W),] #dataset for 2nd Class surviving women
C3_S_W<-T1[which(C3 & S & W),] #dataset for 3rd Class surviving women
m_s1<-nrow(C1_S_M) #number of 1st class surviving men
m_s2<-nrow(C2_S_M) #number of 2nd class surviving men
m_s3<-nrow(C3_S_M) #number of 3rd class surviving men
w_s1<-nrow(C1_S_W) #number of 1st class surviving women
w_s2<-nrow(C2_S_W) #number of 2nd class surviving women
w_s3<-nrow(C3_S_W) #number of 3rd class surviving women
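count_matrix<-matrix(c(m1,m2,m3,w1,w2,w3,m_s1,m_s2,m_s3,w_s1,w_s2,w_s3),ncol = 4) #create a 3x4 matrix, filled column by column (not named t, to avoid masking base::t)
colnames(count_matrix)<-c('Men','Women','Saved_Men','Saved_Women')
row.names(count_matrix)<-c('1st','2nd','3rd')
passenger_Class<-as.data.frame(count_matrix) #create passenger_Class data frame
passenger_Class$Sum_inClass<-passenger_Class$Men + passenger_Class$Women #add column Sum_inClass (322, 279, 711)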
passenger_Class$Saved_inClass<- passenger_Class$Saved_Men + passenger_Class$Saved_Women #add column Saved_inClass
#print(passenger_Class[,1:6]) # 5.1
passenger_Class$Men_inClass<-passenger_Class$Men/passenger_Class$Sum_inClass #add column Men_inClass
passenger_Class$Women_inClass<-1-passenger_Class$Men_inClass #add column Women_inClass
passenger_Class$Saved_Men_inClass<-passenger_Class$Saved_Men/passenger_Class$Men #add column Saved_Men_inClass
passenger_Class$Saved_Women_inClass<-passenger_Class$Saved_Women/passenger_Class$Women #add column Saved_Women_inClass
passenger_Class$TotalSaved_inClass<-passenger_Class$Saved_inClass/passenger_Class$Sum_inClass #add column TotalSaved_inClass
passenger_Class$TotalSaved<-passenger_Class$Saved_inClass/nrow(Titanic) #add column TotalSaved (share of all 1313 passengers)
#print(passenger_Class[,7:12]) #5.1
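The counts built above with logical masks and `nrow()` can also be produced with a contingency table. The following is only a sketch on a made-up eight-row data frame (`toy` and its values are hypothetical); it is not the code used for the results in this report.

```r
# Hypothetical toy data with the same column meanings as Titanic:
# PClass (class), SexCode (1 = female, 0 = male), Survived (1 = yes).
toy <- data.frame(
  PClass   = c("1st", "1st", "1st", "2nd", "2nd", "3rd", "3rd", "3rd"),
  SexCode  = c(1, 0, 0, 1, 0, 1, 0, 0),
  Survived = c(1, 1, 0, 1, 0, 0, 0, 1)
)

# Passengers per class and sex (rows: class, columns: SexCode).
counts <- table(toy$PClass, toy$SexCode)

# Survivors per class and sex: summing a 0/1 column counts the 1s.
saved <- tapply(toy$Survived, list(toy$PClass, toy$SexCode), sum)

# Survival rate per class and sex: the mean of a 0/1 column is a proportion.
rate <- tapply(toy$Survived, list(toy$PClass, toy$SexCode), mean)

print(counts)
print(saved)
print(round(rate, 2))
```

This trades the twelve intermediate subsets for three calls, at the cost of a result laid out as class-by-sex matrices rather than the single wide data frame used above.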
Removed the 557 rows with NA in Age
756 observations of 5 variables: Titanic.Age, Titanic.PClass, Titanic.Sex, Titanic.Survived, Titanic.SexCode
T2 <- na.omit(data.frame(Titanic$Age,Titanic$PClass,Titanic$Sex, Titanic$Survived, Titanic$SexCode)) #remove rows with NA (the '*' PClass row is dropped as well)
vPClass<-as.integer(T2$Titanic.PClass)-1 #convert PClass from factor to numeric ("*" is level 1, so subtracting 1 gives 1st = 1, 2nd = 2, 3rd = 3)
passenger_Age<-data.frame(T2$Titanic.Age,vPClass,T2$Titanic.Sex,T2$Titanic.Survived,T2$Titanic.SexCode) #create passenger_Age data frame
names(passenger_Age)<-c("Age","PClass","Sex","Survived","SexCode") #rename column names
str(passenger_Age)
## 'data.frame': 756 obs. of 5 variables:
## $ Age : num 29 2 30 25 0.92 47 63 39 58 71 ...
## $ PClass : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "female","male": 1 1 2 1 2 2 1 2 1 2 ...
## $ Survived: int 1 0 0 0 1 1 1 0 1 0 ...
## $ SexCode : int 1 1 0 1 0 0 1 0 1 0 ...
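The `as.integer(...) - 1` step above relies on the order of the factor levels: `str(Titanic)` lists `"*"` first, so `"1st"`, `"2nd"` and `"3rd"` end up as 1, 2 and 3 after subtracting 1. A small sketch of that mapping, using a made-up vector `pclass` with the same four levels:

```r
# Hypothetical vector using the four levels that str(Titanic) reports.
pclass <- factor(c("1st", "3rd", "*", "2nd", "1st"),
                 levels = c("*", "1st", "2nd", "3rd"))

# as.integer() returns the internal level codes (1-based, in level order),
# so subtracting 1 maps "*" -> 0, "1st" -> 1, "2nd" -> 2, "3rd" -> 3.
codes <- as.integer(pclass) - 1
print(codes)

# If the "*" level were absent, the same arithmetic would shift every
# code down by one, so it is worth checking levels() first.
levels(pclass)
levels(droplevels(pclass[pclass != "*"]))
```

Note that the conversion only works on a factor; on a plain character column `as.integer()` would return `NA` for every value.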
Create the passenger_Age_Survived subset for survivors (313 out of 756 samples)
passenger_Age_Survived<-subset(passenger_Age,passenger_Age$Survived>0)
passenger_Age_Survived[4]<-NULL #remove 4th column (Survived)
names(passenger_Age_Survived)<-c("Age_S","PClass_S","Sex_S","SexCode_S") #rename column names
str(passenger_Age_Survived)
## 'data.frame': 313 obs. of 4 variables:
## $ Age_S : num 29 0.92 47 63 58 19 50 37 47 26 ...
## $ PClass_S : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex_S : Factor w/ 2 levels "female","male": 1 2 2 1 1 1 1 2 1 2 ...
## $ SexCode_S: int 1 0 0 1 1 1 1 0 1 0 ...
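Removing a column by position, as done above for the survivors subset, depends on the column order staying the same. One possible alternative is to filter and drop by name with `subset()`. The sketch below uses a made-up data frame (`toy_age` is hypothetical) with the same column names as `passenger_Age`.

```r
# Hypothetical stand-in for passenger_Age.
toy_age <- data.frame(
  Age      = c(29, 2, 30, 47),
  PClass   = c(1, 1, 2, 3),
  Sex      = factor(c("female", "female", "male", "male")),
  Survived = c(1, 0, 0, 1),
  SexCode  = c(1, 1, 0, 0)
)

# Keep survivors only and drop the now-constant Survived column by name.
survivors <- subset(toy_age, Survived > 0, select = -Survived)

# Append the "_S" suffix to every remaining column name.
names(survivors) <- paste0(names(survivors), "_S")
str(survivors)
```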
Analysis of numbers:
-By columns:
The numbers of men and women are more even in the higher classes: the ratios of men to women in 1st, 2nd and 3rd class are 1.25, 1.6 and 2.35 respectively. More women than men survived in each class.
-By rows:
There are more men and more women in 3rd class than in either of the other classes. There are 2.79 times as many men in 3rd class as in 1st class and 2.9 times as many as in 2nd class. There are 1.48 times as many women in 3rd class as in 1st class and 1.98 times as many as in 2nd class.
The total number of surviving men and women is higher in 1st class than in the other two classes.
The numbers of survivors in each class (193, 119 and 138) look fairly close.
print(passenger_Class[,1:6])
## Men Women Saved_Men Saved_Women Sum_inClass Saved_inClass
## 1st 179 143 59 134 322 193
## 2nd 172 107 25 94 279 119
## 3rd 499 212 58 80 711 138
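The ratios quoted in the analysis of numbers follow directly from the printed counts. As a quick check, the sketch below re-enters the Men and Women columns by hand (the `counts` data frame is a copy of the table above, not a new data source) and recomputes them.

```r
# Men and Women counts copied from the printed passenger_Class table.
counts <- data.frame(
  Men   = c(179, 172, 499),
  Women = c(143, 107, 212),
  row.names = c("1st", "2nd", "3rd")
)

# Men-to-women ratio within each class.
sex_ratio <- round(counts$Men / counts$Women, 2)
print(sex_ratio)    # 1.25 1.61 2.35

# 3rd class counts relative to 1st and 2nd class.
men_3rd_vs <- round(counts["3rd", "Men"] / counts[c("1st", "2nd"), "Men"], 2)
print(men_3rd_vs)   # 2.79 2.90

women_3rd_vs <- round(counts["3rd", "Women"] / counts[c("1st", "2nd"), "Women"], 2)
print(women_3rd_vs) # 1.48 1.98
```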
Analysis of ratios:
-By columns:
The gap between the proportions of men and women widens from the higher classes to the lower ones. In each class, the survival rate for women is much higher than for men.
-By rows:
The majority of women in 1st and 2nd class were rescued.
The total number of surviving men and women is higher in 1st class than in the other classes; about 38% of 3rd class women were rescued.
The survival rate in 1st class is high compared with 2nd and 3rd class: the within-class rates are about 60%, 43% and 19% respectively.
Although each class's survivors make up a fairly similar share of all passengers, the survival rate within each class shows that a higher class meant a higher chance of survival.
print(passenger_Class[,7:12])
## Men_inClass Women_inClass Saved_Men_inClass Saved_Women_inClass
## 1st 0.5559006 0.4440994 0.3296089 0.9370629
## 2nd 0.6164875 0.3835125 0.1453488 0.8785047
## 3rd 0.7018284 0.2981716 0.1162325 0.3773585
## TotalSaved_inClass TotalSaved
## 1st 0.5993789 0.14699162
## 2nd 0.4265233 0.09063214
## 3rd 0.1940928 0.10510282
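The last two columns above use different denominators: TotalSaved_inClass divides by the number of passengers in that class, while TotalSaved divides by all 1313 passengers. A short sketch that recomputes both from the printed counts (the values are copied from the tables above) makes the difference visible.

```r
# Counts copied from the printed passenger_Class table.
saved_in_class <- c("1st" = 193, "2nd" = 119, "3rd" = 138)
sum_in_class   <- c("1st" = 322, "2nd" = 279, "3rd" = 711)
all_passengers <- 1313  # includes the one row with unknown class "*"

# Survival rate within each class.
within_class <- saved_in_class / sum_in_class

# Each class's survivors as a share of all passengers.
share_of_all <- saved_in_class / all_passengers

print(round(100 * within_class, 1))  # 59.9 42.7 19.4
print(round(100 * share_of_all, 1))  # 14.7  9.1 10.5
```

The within-class rate is the one that answers the question of whether class affected the chance of survival, because it is not distorted by 3rd class being more than twice the size of the others.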
Overview from Histogram:
In the histogram, the count of women in each class appears larger than the count of men.
The age distributions for men and women are similar in each class, which suggests that the 756 samples with known ages are spread evenly across the sexes.
3rd class has a bigger sample size than 1st and 2nd class because 3rd class had many more people than the other two classes.
qplot(Age,data= passenger_Age,fill= Sex,facets = .~PClass,bins=30)
-Men
The number of surviving men drops noticeably in all three classes compared with the full sample, especially in 2nd and 3rd class.
In 1st class, no men older than 60 survived; in 2nd class, none older than 43; in 3rd class, none older than 48.
The number of young surviving men (under 20) is similar to the number in the full sample.
-Women
In 1st class, the age distribution of surviving women is similar to that of the full sample, since nearly 94% of women were saved.
In 2nd class, the age distribution of surviving women under 45 has a similar shape, since nearly 88% of women survived.
In 3rd class, where only about 38% of women survived, most women aged around 25-31 died.
qplot(Age_S,data= passenger_Age_Survived,fill= Sex_S,facets = .~PClass_S,bins=30)
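Statements such as "no men older than 60 survived in 1st class" are read off the histogram, but they can also be checked numerically. A minimal sketch on made-up data (`toy_surv` is hypothetical; it only borrows the `_S` column names used above) finds the oldest survivor for each class and sex.

```r
# Hypothetical stand-in for passenger_Age_Survived.
toy_surv <- data.frame(
  Age_S    = c(60, 35, 0.92, 43, 28, 48, 19, 63),
  PClass_S = c(1, 1, 1, 2, 2, 3, 3, 1),
  Sex_S    = factor(c("male", "male", "male", "male",
                      "female", "male", "female", "female"))
)

# Oldest survivor for every class/sex combination.
oldest <- tapply(toy_surv$Age_S,
                 list(Class = toy_surv$PClass_S, Sex = toy_surv$Sex_S),
                 max)
print(oldest)
```

Swapping `max` for `min`, `median` or `length` gives other per-group summaries from the same call.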
From the boxplot, the median ages in 1st, 2nd and 3rd class are approximately 40, 28 and 23 respectively.
The bulk of passenger ages (the box in each plot) lies in the approximate range (29,50) for 1st class, (23,37) for 2nd class, and (18,31) for 3rd class.
Clearly, passengers in 3rd class were younger than those in 1st class.
boxplot(passenger_Age$Age~passenger_Age$PClass, xlab='PClass',ylab='Age',col=c("light yellow","pink","light grey"), main='Boxplot Distribution for 756 Passenger Age in 3 classes')
From the boxplot, the median ages of survivors in 1st, 2nd and 3rd class drop to approximately 37, 25 and 21 respectively.
The bulk of survivor ages (the box in each plot) lies in the approximate range (23,49) for 1st class, (15,33) for 2nd class, and (17,29) for 3rd class.
This shows that younger passengers had a higher survival rate.
boxplot(passenger_Age_Survived$Age_S ~ passenger_Age_Survived$PClass_S, xlab='PClass',ylab='Age',col=c("light yellow","pink","light grey"), main='Boxplot for age of survivors in classes (313/756 samples)')
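The medians and ranges quoted for the two boxplots are read from the plots by eye. They could also be computed directly. The sketch below does so on a small made-up data frame (`toy_box` is hypothetical), applying `quantile()` to the ages of each class. Note that `boxplot()` draws its box from hinges, which can differ slightly from the default quantiles on small samples.

```r
# Hypothetical stand-in for passenger_Age.
toy_box <- data.frame(
  Age    = c(29, 40, 50, 58, 23, 28, 37, 18, 23, 31, 21),
  PClass = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# Split ages by class, then take the first quartile, median and
# third quartile of each group; t() puts one class per row.
quartiles <- t(sapply(split(toy_box$Age, toy_box$PClass),
                      quantile, probs = c(0.25, 0.5, 0.75)))
print(quartiles)
```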
From the scatter plot, surviving young men aged between 19 and 40 made up a higher share of survivors (men vs. women) in 3rd class than in the other two classes.
qplot(PClass_S,Age_S,data=passenger_Age_Survived,color= Sex_S,facets = .~PClass_S)
In this study of social class and survival chance on the Titanic, my findings are:
Passengers in higher classes, both men and women, had a better chance of being rescued.
Women in all classes had a better chance of being rescued than men.
Young people in all classes were more likely to be saved.