This project cleans the Titanic passenger data so that it can be used for analysis and visualization. It then applies several statistical tests and visualizations and interprets the results.

Data from titanic3.csv

The Titanic3 data set (titanic3.csv) contains the survival status of individual passengers on board the Titanic. It does not cover the ship's crew, but it includes actual or estimated ages for nearly 80% of the passengers.

WORDCLOUD

A word cloud is a technique for showing which words occur most frequently in a given text.

Create a wordcloud from the data in the home/destination variable and interpret it.

Here, the word cloud makes it easier to see which homes and destinations were most common among the Titanic's passengers.

library(wordcloud2)
library(rvest)
library(RColorBrewer)
library(tm)
## Loading required package: NLP
library(SnowballC)
library(tidytext)
dat <- read.csv(file="D:/A SEMESTER 5/KOMSTAT/uts/titanic3.csv",header=TRUE)
str(dat)
## 'data.frame':    1309 obs. of  14 variables:
##  $ pclass   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : int  1 1 0 0 0 1 1 0 1 0 ...
##  $ name     : chr  "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
##  $ sex      : chr  "female" "male" "female" "male" ...
##  $ age      : num  29 0.92 2 30 25 48 63 39 53 71 ...
##  $ sibsp    : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : chr  "24160" "113781" "113781" "113781" ...
##  $ fare     : num  211 152 152 152 152 ...
##  $ cabin    : chr  "B5" "C22 C26" "C22 C26" "C22 C26" ...
##  $ embarked : chr  "S" "S" "S" "S" ...
##  $ boat     : chr  "2" "11" "" "" ...
##  $ body     : int  NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: chr  "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
# Work in the project folder and export the home/destination column as text
setwd("D:\\A SEMESTER 5\\KOMSTAT\\uts")
home = dat$home.dest
write.table(home,"home.txt")
data=read.table("home.txt")
cdata=file.path("D:","A SEMESTER 5","KOMSTAT","uts","home.txt")
# Build a text corpus from home.txt
docs=Corpus(DirSource("D:\\A SEMESTER 5\\KOMSTAT\\uts", pattern = "home.txt", encoding = "UTF-8"))

Cleaning Data

The text has to be cleaned first so that it is easier to analyze.
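
To make the effect of each cleaning step visible, here is a minimal base-R sketch that roughly mirrors the four steps (punctuation, numbers, case, whitespace) on a few strings. The helper name `clean_text` and the sample vector are hypothetical and only illustrate the idea; the report itself uses the tm transformations.

```r
# Hypothetical sample values in the style of the home.dest column
x <- c("St Louis, MO", "Montreal, PQ / Chesterville, ON", "New York, NY")

# Hypothetical helper: applies the cleaning steps in the same order
clean_text <- function(s) {
  s <- gsub("[[:punct:]]", "", s)  # remove punctuation
  s <- gsub("[[:digit:]]", "", s)  # remove numbers
  s <- tolower(s)                  # convert to lower case
  s <- gsub("\\s+", " ", s)        # collapse runs of whitespace
  trimws(s)                        # drop leading/trailing whitespace
}

cleaned <- clean_text(x)
print(cleaned)
```

After cleaning, differently punctuated or capitalized spellings of the same place collapse to the same tokens, which is what makes the word counts meaningful.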

#cleaning data
docs=tm_map(docs, removePunctuation)
docs=tm_map(docs,removeNumbers)
docs=tm_map(docs,content_transformer(tolower))
docs=tm_map(docs,stripWhitespace)
#Create the document-term matrix
dtm=DocumentTermMatrix(docs)
#Convert the document-term matrix to a data frame
df=tidy(dtm)
#Sort by descending count and keep the term and count columns
df=df[order(-df$count),c(2,3)]
df
## # A tibble: 414 x 2
##    term     count
##    <chr>    <dbl>
##  1 new        123
##  2 york       116
##  3 england     99
##  4 london      44
##  5 sweden      38
##  6 cornwall    30
##  7 ireland     27
##  8 paris       27
##  9 montreal    24
## 10 chicago     20
## # ... with 404 more rows
#wordcloud
wordcloud2(data = df, shape = 'circle',size=0.5)

Interpretation: Based on the word cloud above, the most frequent words in the home/destination variable are "new" and "york" (New York), followed by "england". London, Sweden, Cornwall, and other places come next.
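
Because the document-term matrix counts single words, "new" and "york" appear as separate terms, and "new" is also counted for any other place name that contains it. As a cross-check, one could count whole home/destination strings instead. The sketch below uses a small hypothetical vector, not the real data, to show the difference.

```r
# Hypothetical sample values in the style of the home.dest column
home <- c("New York, NY", "New York, NY", "London", "", "New Forest, England")

# Word-level counts: "new" is counted for New York and for New Forest
words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", home)), "\\s+"))
word_freq <- sort(table(words[words != ""]), decreasing = TRUE)

# Place-level counts: each non-empty home.dest string is one unit
place_freq <- sort(table(home[home != ""]), decreasing = TRUE)

print(word_freq)
print(place_freq)
```

On the real data, the same place-level count could be obtained from `dat$home.dest`, assuming empty strings mark missing values as in the output of `str(dat)`.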

Two-way contingency table

Build a two-way contingency table for the survived and pclass variables.

A two-way contingency table is used in statistical analysis to summarize the relationship between two categorical variables.

tabel = xtabs(~survived+pclass,data=dat)
tabel
##         pclass
## survived   1   2   3
##        0 123 158 528
##        1 200 119 181
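
Raw counts are hard to compare because the classes differ in size. As a possible follow-up, the sketch below rebuilds the table from the counts printed above and converts it to within-class proportions with `prop.table()`. The object name `col_prop` is introduced here for illustration.

```r
# Counts copied from the contingency table above (filled column by column)
tabel <- as.table(matrix(c(123, 200, 158, 119, 528, 181), nrow = 2,
                         dimnames = list(survived = c("0", "1"),
                                         pclass = c("1", "2", "3"))))

# margin = 2: proportions within each passenger class (each column sums to 1)
col_prop <- prop.table(tabel, margin = 2)
print(round(col_prop, 3))
```

The survived = 1 row gives the survival share per class, which falls from about 62% in first class to about 26% in third class.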

Chi-Square Test of Independence

Perform a chi-square test of independence to determine whether survival status (survived) and passenger class (pclass) are independent or dependent.

▪ Hypothesis

𝐻0: survival status and passenger class are mutually independent

𝐻1: survival status (survived) and passenger class are mutually dependent

▪ Level of significance 𝛼 = 0.05

▪ Test Statistics

chisq.test(tabel)
## 
##  Pearson's Chi-squared test
## 
## data:  tabel
## X-squared = 127.86, df = 2, p-value < 2.2e-16

▪ Conclusion

From the output above, X^2 = 127.86 with 2 degrees of freedom and p-value < 2.2e-16. Since the p-value < 0.05, 𝐻0 is rejected. It can therefore be concluded that survival status (survived) and passenger class (pclass) are dependent.
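
To show where the test statistic comes from, the following sketch recomputes it by hand from the counts in the contingency table: expected counts are (row total × column total) / n, and the statistic sums (observed − expected)^2 / expected over all cells. The variable names are chosen here for illustration.

```r
# Observed counts copied from the contingency table (rows: survived 0/1)
observed <- matrix(c(123, 200, 158, 119, 528, 181), nrow = 2,
                   dimnames = list(survived = c("0", "1"),
                                   pclass = c("1", "2", "3")))

n <- sum(observed)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic, degrees of freedom, and upper-tail p-value
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

The result should agree with the `chisq.test()` output above, up to rounding.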

t-test

Extract the surviving female passengers and save the result as female.surv; do the same for male passengers and save the result as male.surv. Then print the first 3 observations for each gender. Take the age variable and calculate the mean and standard deviation for each sex. Do the mean ages of the surviving female and male passengers differ? Test this hypothesis both manually (with your own function) and with the t.test() function in R, at a significance level of α = 0.01. Assume that both samples are normally distributed and have equal variances.
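
For reference, the equal-variance (pooled) two-sample t-test that the manual function below implements can be written as follows. The subscripts 1 and 2 (female and male samples) and the symbols for the population means are notation assumed here; the report does not state them.

```latex
H_0:\ \mu_1 = \mu_2 \qquad H_1:\ \mu_1 \neq \mu_2

s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
\mathrm{df} = n_1 + n_2 - 2

\text{two-sided } p\text{-value} = 2\,P\!\left(T_{\mathrm{df}} \le -|t|\right),
\qquad
\text{CI}_{1-\alpha}:\ (\bar{y}_1 - \bar{y}_2) \pm t_{1-\alpha/2,\ \mathrm{df}}\,\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
```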

surv <- dat[dat$survived==1, ]
female.surv <- surv[surv$sex=="female",]
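# Filter on surv$sex so that the logical index matches the 500 rows of surv.
# Indexing with dat$sex (1309 values) pads the result with all-NA rows; the
# male figures printed below (mean, sd, df = 1180) reflect that padding.
male.surv <- surv[surv$sex=="male", ]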
head(female.surv,3)
##   pclass survived                                          name    sex age
## 1      1        1                 Allen, Miss. Elisabeth Walton female  29
## 7      1        1             Andrews, Miss. Kornelia Theodosia female  63
## 9      1        1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female  53
##   sibsp parch ticket     fare cabin embarked boat body           home.dest
## 1     0     0  24160 211.3375    B5        S    2   NA        St Louis, MO
## 7     1     0  13502  77.9583    D7        S   10   NA          Hudson, NY
## 9     2     0  11769  51.4792  C101        S    D   NA Bayside, Queens, NY
usia.female = female.surv$age
usia.female[is.na(usia.female)]<-0
usia.male = male.surv$age
usia.male[is.na(usia.male)]<-0
mean(usia.male)
## [1] 8.152242
mean(usia.female)
## [1] 25.68168
sd(usia.male)
## [1] 15.16748
sd(usia.female)
## [1] 17.15438
#manually: pooled-variance two-sample t-test
twosam <- function(y1, y2,alpha=0.01) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  var.s1 <- var(y1); var.s2 <- var(y2)
  # pooled variance under the equal-variance assumption
  var.s <- ((n1-1)*var.s1 + (n2-1)*var.s2)/(n1+n2-2)
  tst <- (yb1 - yb2)/sqrt(var.s*(1/n1 + 1/n2))
  df <- n1+n2-2
  # two-sided p-value
  pvalue <- 2*pt(-abs(tst), df)
  # (1 - alpha) confidence interval for the difference in means
  lower.CI <- (yb1-yb2)-qt(1-alpha/2,df)*sqrt(var.s*(1/n1 + 1/n2))
  upper.CI <- (yb1-yb2)+qt(1-alpha/2,df)*sqrt(var.s*(1/n1 + 1/n2))
  list(t.stat=tst,df = df, p.value = pvalue, CI = c(lower.CI,upper.CI)
       ,mean.x = yb1,mean.y=yb2)
  }
tstat <- twosam(usia.female, usia.male, alpha = 0.01)
tstat
## $t.stat
## [1] 17.2924
## 
## $df
## [1] 1180
## 
## $p.value
## [1] 6.855954e-60
## 
## $CI
## [1] 14.91407 20.14481
## 
## $mean.x
## [1] 25.68168
## 
## $mean.y
## [1] 8.152242
#built in
t.test(usia.female,usia.male,alternative="two.sided",conf.level = 
         0.99,var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  usia.female and usia.male
## t = 17.292, df = 1180, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  14.91407 20.14481
## sample estimates:
## mean of x mean of y 
## 25.681681  8.152242

The manual function and R's built-in t.test() function produce the same results.

Since the p-value < α = 0.01, H0 is rejected. It can therefore be concluded that the mean ages of the surviving male and female passengers differ.
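
The analysis above replaces missing ages with 0 before computing means and standard deviations, which pulls the means toward zero. A common alternative, not used in this report, is to leave missing values out of the calculation. The sketch below contrasts the two approaches on a small hypothetical vector.

```r
# Hypothetical ages with one missing value (not taken from titanic3.csv)
age <- c(20, 30, NA, 40)

# Approach used above: replace NA with 0, then average over all 4 values
age_zero <- age
age_zero[is.na(age_zero)] <- 0
mean_zero <- mean(age_zero)

# Alternative: drop missing values, average over the 3 observed ages
mean_narm <- mean(age, na.rm = TRUE)

print(c(zero_filled = mean_zero, na_removed = mean_narm))
```

With the alternative, `sd(age, na.rm = TRUE)` would be used in the same way, and the sample sizes in the manual test would have to count only non-missing ages.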

Which surviving female passengers were over 50 years old and had New York, NY as their home/destination? (Hint: use the subsetting function)

The surviving female passengers over 50 years old whose home/destination is New York, NY are:

subset(female.surv, subset = (age > 50 & home.dest == "New York, NY"))
##     pclass survived                                                 name    sex
## 80       1        1 Cornell, Mrs. Robert Clifford (Malvina Helen Lamson) female
## 248      1        1       Rothschild, Mrs. Martin (Elizabeth L. Barrett) female
##     age sibsp parch   ticket fare cabin embarked boat body    home.dest
## 80   55     2     0    11770 25.7  C101        S    2   NA New York, NY
## 248  54     1     0 PC 17603 59.4              C    6   NA New York, NY
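
One reason to prefer `subset()` here is how it treats missing ages: rows where the condition evaluates to NA are dropped, whereas plain bracket indexing returns an all-NA row for each of them. The sketch below illustrates this on a hypothetical four-row data frame; the names `pax`, `cond`, and the `by_*` objects are made up for the example.

```r
# Hypothetical mini data frame (not taken from titanic3.csv)
pax <- data.frame(
  name = c("A", "B", "C", "D"),
  age = c(55, NA, 48, 61),
  home.dest = c("New York, NY", "New York, NY", "New York, NY", "London"),
  stringsAsFactors = FALSE
)

# Condition is TRUE for A, NA for B (age missing), FALSE for C and D
cond <- pax$age > 50 & pax$home.dest == "New York, NY"

by_subset  <- subset(pax, age > 50 & home.dest == "New York, NY")  # drops NA
by_bracket <- pax[cond, ]         # NA in cond yields an all-NA row
by_which   <- pax[which(cond), ]  # which() keeps only TRUE positions

print(by_subset)
print(by_bracket)
print(by_which)
```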