This project cleans the Titanic passenger data so that it can be used for analysis and visualization, then applies several statistical tests and visualizations and interprets the results.
Data from titanic3.csv
Titanic3 data (titanic3.csv) contains information on the survival status of individual passengers on board the Titanic. This data does not contain information about the ship’s crew, but includes the actual and estimated ages of nearly 80% of the passengers.
A word cloud shows which words occur most frequently in a given text.
Create a word cloud from the data in the home/destination variable and interpret it.
Here, the word cloud makes it easier to see the most common homes and destinations among the Titanic's passengers.
library(wordcloud2)
library(rvest)
library(RColorBrewer)
library(tm)
## Loading required package: NLP
library(SnowballC)
library(tidytext)
dat <- read.csv(file = "D:/A SEMESTER 5/KOMSTAT/uts/titanic3.csv", header = TRUE)
str(dat)
## 'data.frame': 1309 obs. of 14 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : chr "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
## $ sex : chr "female" "male" "female" "male" ...
## $ age : num 29 0.92 2 30 25 48 63 39 53 71 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : chr "24160" "113781" "113781" "113781" ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : chr "B5" "C22 C26" "C22 C26" "C22 C26" ...
## $ embarked : chr "S" "S" "S" "S" ...
## $ boat : chr "2" "11" "" "" ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: chr "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
setwd("D:\\A SEMESTER 5\\KOMSTAT\\uts")
home = dat$home.dest
write.table(home,"home.txt")
data=read.table("home.txt")
cdata=file.path("D:","A SEMESTER 5","KOMSTAT","uts","home.txt")
docs=Corpus(DirSource("D:\\A SEMESTER 5\\KOMSTAT\\uts", pattern = "home.txt", encoding = "UTF-8"))
We clean the text first so that it is easier to analyze.
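# Clean the text: strip punctuation and digits, lower-case, collapse whitespace
docs=tm_map(docs, removePunctuation)
docs=tm_map(docs, removeNumbers)
# wrap base tolower() so that tm keeps the document structure
docs=tm_map(docs, content_transformer(tolower))
docs=tm_map(docs, stripWhitespace)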
# Create the document-term matrix
dtm=DocumentTermMatrix(docs)
# Convert the document-term matrix to a data frame
df=tidy(dtm)
df=df[order(-df$count),c(2,3)]
df
## # A tibble: 414 x 2
## term count
## <chr> <dbl>
## 1 new 123
## 2 york 116
## 3 england 99
## 4 london 44
## 5 sweden 38
## 6 cornwall 30
## 7 ireland 27
## 8 paris 27
## 9 montreal 24
## 10 chicago 20
## # ... with 404 more rows
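The effect of the cleaning steps can be checked on a couple of strings before running them on the whole corpus. The sketch below uses base R `gsub()` and `tolower()` as rough stand-ins for the tm transformations; the patterns are an assumption about what those transformations do, not their actual implementation. The two sample strings are taken from the `home.dest` values shown in the `str(dat)` output.

```r
# Two home.dest values copied from the str(dat) output
x <- c("St Louis, MO", "Montreal, PQ / Chesterville, ON")

x <- gsub("[[:punct:]]+", "", x)   # drop punctuation (stand-in for removePunctuation)
x <- gsub("[[:digit:]]+", "", x)   # drop digits (stand-in for removeNumbers)
x <- tolower(x)                    # lower-case every character
x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace (stand-in for stripWhitespace)
x

# Splitting on spaces shows the tokens that would be counted
tokens <- unlist(strsplit(x, " "))
tokens
```

Each place name is broken into separate tokens at this point, which is why multi-word names are counted word by word in the term table.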
#wordcloud
wordcloud2(data = df, shape = 'circle',size=0.5)
- Interpretation: In the word cloud above, the most frequent terms in the home/destination variable are "new" and "york" (which together form New York), followed by England. After these come London, Sweden, Cornwall, and so on.
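Because the term matrix counts single words, "new" and "york" appear as two separate entries. One alternative is to count whole `home.dest` values instead. The sketch below is only an illustration: the vector `home` is a small hypothetical sample, and the assumption that missing entries are stored as empty strings should be checked against the real column. The column names `word` and `freq` are chosen for illustration as the word and frequency columns of a two-column data frame for `wordcloud2()`.

```r
# Hypothetical sample of home.dest values; "" stands for a missing entry
home <- c("New York, NY", "London", "New York, NY", "",
          "Montreal, PQ / Chesterville, ON", "London", "New York, NY")

home <- home[home != ""]                        # drop missing entries
counts <- sort(table(home), decreasing = TRUE)  # frequency of each full place name

# Two-column data frame: place name and how often it occurs
top <- data.frame(word = names(counts), freq = as.vector(counts))
top
```

A data frame like `top` could then be passed to `wordcloud2(data = top)` in place of the per-word counts.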
Two-way contingency table for survived and pclass variables
A two-way contingency table is used in statistical analysis to summarize the relationship between two categorical variables.
tabel = xtabs(~survived+pclass,data=dat)
tabel
## pclass
## survived 1 2 3
## 0 123 158 528
## 1 200 119 181
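The counts are easier to compare as proportions within each class. The sketch below rebuilds the table from the counts printed above (the object name `tab` is introduced here for illustration) and uses `prop.table()` with `margin = 2` so that each column sums to 1.

```r
# Counts copied from the contingency table above
tab <- matrix(c(123, 158, 528,
                200, 119, 181),
              nrow = 2, byrow = TRUE,
              dimnames = list(survived = c("0", "1"),
                              pclass = c("1", "2", "3")))

# Column proportions: share of survivors and non-survivors within each class
rate <- prop.table(tab, margin = 2)
round(rate, 3)
```

Row "1" of `rate` gives the survival rate per class: about 62% in first class, 43% in second class, and 26% in third class.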
Perform a chi-square test of independence to find out whether survival status (survived) and passenger class (pclass) are independent or dependent.
▪ Hypothesis
𝐻0: survival status and passenger class are mutually independent
𝐻1: survival status (survived) and passenger class are mutually dependent
▪ Level of significance 𝛼 = 0.05
▪ Test Statistics
chisq.test(tabel)
##
## Pearson's Chi-squared test
##
## data: tabel
## X-squared = 127.86, df = 2, p-value < 2.2e-16
▪ Conclusion
From the output above, 𝑋^2 = 127.86 with p-value < 2.2e-16. Since the p-value < 0.05, 𝐻0 is rejected. We conclude that survival status (survived) and passenger class (pclass) are dependent.
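The statistic reported by `chisq.test()` can be reproduced by hand from the contingency table, which makes clear where 127.86 comes from. This is a minimal sketch using the counts printed earlier; the object names (`tab`, `expected`, `x2`) are introduced here for illustration.

```r
# Counts copied from the contingency table of survived by pclass
tab <- matrix(c(123, 158, 528,
                200, 119, 181),
              nrow = 2, byrow = TRUE,
              dimnames = list(survived = c("0", "1"),
                              pclass = c("1", "2", "3")))

n <- sum(tab)
# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p <- pchisq(x2, df, lower.tail = FALSE)

c(X.squared = x2, df = df)
```

The largest contributions come from first class (more survivors than expected) and third class (fewer survivors than expected).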
Select the surviving female passengers and save the result as female.surv; do the same for male passengers and save the result as male.surv. Then print the first 3 observations for each gender. Take the age variable and calculate the mean and standard deviation for each sex. Do the mean ages of surviving female and male passengers differ? Perform the hypothesis test manually (write your own function) and with the t.test() function in R, at significance level 𝛼 = 0.01. Assume that each sample is normally distributed and that the variances are equal.
surv <- dat[dat$survived==1, ]
female.surv <- surv[surv$sex=="female",]
# Index with surv$sex: dat$sex has 1309 elements and does not line up with the rows of surv
male.surv <- surv[surv$sex=="male", ]
head(female.surv,3)
## pclass survived name sex age
## 1 1 1 Allen, Miss. Elisabeth Walton female 29
## 7 1 1 Andrews, Miss. Kornelia Theodosia female 63
## 9 1 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53
## sibsp parch ticket fare cabin embarked boat body home.dest
## 1 0 0 24160 211.3375 B5 S 2 NA St Louis, MO
## 7 1 0 13502 77.9583 D7 S 10 NA Hudson, NY
## 9 2 0 11769 51.4792 C101 S D NA Bayside, Queens, NY
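When a data frame is subset with a logical vector, that vector should be built from the same data frame. A logical vector taken from a longer data frame does not line up with the rows, and positions beyond the last row come back as all-NA rows. The toy data below is hypothetical and only illustrates the effect.

```r
# Hypothetical toy data: 6 passengers, 4 of whom survived
dat.toy <- data.frame(sex = c("female", "male", "male", "female", "male", "male"),
                      survived = c(1, 0, 1, 1, 0, 1),
                      age = c(29, 40, 31, 53, 22, 18),
                      stringsAsFactors = FALSE)

surv.toy <- dat.toy[dat.toy$survived == 1, ]     # 4 rows

# Misaligned: the index has 6 elements but surv.toy has only 4 rows
wrong <- surv.toy[dat.toy$sex == "male", ]
# Aligned: the index is computed from surv.toy itself
right <- surv.toy[surv.toy$sex == "male", ]

nrow(wrong)   # one row per TRUE in the 6-element index
nrow(right)   # only the surviving males
wrong
```

In `wrong`, some rows are picked by position rather than by sex and the remainder are filled with NA, so any summary of `age` computed from it is unreliable.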
usia.female = female.surv$age
usia.female[is.na(usia.female)]<-0
usia.male = male.surv$age
usia.male[is.na(usia.male)]<-0
mean(usia.male)
## [1] 8.152242
mean(usia.female)
## [1] 25.68168
sd(usia.male)
## [1] 15.16748
sd(usia.female)
## [1] 17.15438
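Replacing missing ages with 0 treats every passenger of unknown age as a newborn, which lowers the mean and changes the standard deviation. A short sketch on a hypothetical vector (three ages taken from the rows printed above plus two invented missing values) shows the difference between zero-filling and simply ignoring the missing values with `na.rm = TRUE`.

```r
# Three recorded ages and two hypothetical missing values
age <- c(29, NA, 53, NA, 63)

# Zero-filling: each NA is counted as an age of 0
zero.filled <- ifelse(is.na(age), 0, age)
mean.zero <- mean(zero.filled)

# Dropping NAs: only recorded ages enter the calculation
mean.obs <- mean(age, na.rm = TRUE)
sd.obs <- sd(age, na.rm = TRUE)

c(zero.filled = mean.zero, observed.only = mean.obs)
```

In this example the zero-filled mean is 29, while the mean of the recorded ages is about 48.3.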
#manually
twosam <- function(y1, y2, alpha = 0.01) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  var.s1 <- var(y1); var.s2 <- var(y2)
  # pooled variance under the equal-variance assumption
  var.s <- ((n1-1)*var.s1 + (n2-1)*var.s2)/(n1+n2-2)
  se <- sqrt(var.s*(1/n1 + 1/n2))
  tst <- (yb1 - yb2)/se
  df <- n1+n2-2
  pvalue <- 2*pt(-abs(tst), df)
  # half-width of the two-sided (1 - alpha) confidence interval
  half <- qt(1-alpha/2, df)*se
  out <- list(t.stat = tst, df = df, p.value = pvalue,
              CI = c((yb1-yb2)-half, (yb1-yb2)+half),
              mean.x = yb1, mean.y = yb2)
  # label the interval with its actual confidence level (99% for alpha = 0.01)
  names(out)[4] <- paste0(100*(1-alpha), "% CI")
  out
}
tstat <- twosam(usia.female, usia.male, alpha = 0.01)
tstat
## $t.stat
## [1] 17.2924
##
## $df
## [1] 1180
##
## $p.value
## [1] 6.855954e-60
##
## $`99% CI`
## [1] 14.91407 20.14481
##
## $mean.x
## [1] 25.68168
##
## $mean.y
## [1] 8.152242
#built in
t.test(usia.female, usia.male, alternative = "two.sided", conf.level = 0.99, var.equal = TRUE)
##
## Two Sample t-test
##
## data: usia.female and usia.male
## t = 17.292, df = 1180, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## 14.91407 20.14481
## sample estimates:
## mean of x mean of y
## 25.681681 8.152242
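The manual function and R's built-in t.test() give the same result (t = 17.29, df = 1180, p-value < 2.2e-16).
Since the p-value < 𝛼 = 0.01, H0 is rejected, so the output indicates that the mean ages of the surviving male and female passengers differ. The printed figures should be read with caution, however: df = 1180 means that 1,182 ages entered the test although only 500 passengers survived, and missing ages were replaced with 0, which pulls both means down (a mean age of 8.15 years for male survivors is implausibly low). The test should be rerun on the recorded ages of the survivors only before drawing a firm conclusion.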
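The pooled two-sample t statistic can be checked against `t.test(..., var.equal = TRUE)` on any pair of samples. The sketch below uses simulated data (the means, standard deviations, and sample sizes are arbitrary assumptions, not Titanic values) and also runs Welch's test, which is the usual alternative when the equal-variance assumption is in doubt.

```r
set.seed(1)
# Simulated samples; all parameters are arbitrary and for illustration only
x <- rnorm(30, mean = 30, sd = 10)
y <- rnorm(25, mean = 27, sd = 10)

n1 <- length(x); n2 <- length(y)
# Pooled variance and the resulting t statistic
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t.manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Built-in pooled test at the 99% confidence level
builtin <- t.test(x, y, var.equal = TRUE, conf.level = 0.99)
all.equal(t.manual, unname(builtin$statistic))

# Welch's test does not assume equal variances; its df is at most n1 + n2 - 2
welch <- t.test(x, y, var.equal = FALSE, conf.level = 0.99)
welch$parameter
```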
Which surviving female passengers over 50 years old have New York, NY as their home/destination? (Hint: use the subsetting function)
The surviving female passengers over 50 years old whose home/destination is New York, NY are:
subset(female.surv, subset = (age > 50 & home.dest == "New York, NY"))
## pclass survived name sex
## 80 1 1 Cornell, Mrs. Robert Clifford (Malvina Helen Lamson) female
## 248 1 1 Rothschild, Mrs. Martin (Elizabeth L. Barrett) female
## age sibsp parch ticket fare cabin embarked boat body home.dest
## 80 55 2 0 11770 25.7 C101 S 2 NA New York, NY
## 248 54 1 0 PC 17603 59.4 C 6 NA New York, NY
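`subset()` silently drops rows where the condition evaluates to NA, for example when age is missing. Plain bracket indexing keeps those rows as all-NA rows unless the condition is wrapped in `which()`. The toy data frame below is hypothetical and only illustrates the difference.

```r
# Hypothetical toy data with one missing age
toy <- data.frame(name = c("A", "B", "C", "D"),
                  age = c(55, NA, 54, 30),
                  home.dest = c("New York, NY", "New York, NY",
                                "New York, NY", "London"),
                  stringsAsFactors = FALSE)

cond <- toy$age > 50 & toy$home.dest == "New York, NY"   # TRUE, NA, TRUE, FALSE

by.subset  <- subset(toy, age > 50 & home.dest == "New York, NY")  # NA treated as FALSE
by.bracket <- toy[cond, ]          # the NA in cond produces an all-NA row
by.which   <- toy[which(cond), ]   # which() keeps only the TRUE positions

nrow(by.subset); nrow(by.bracket); nrow(by.which)
```

`subset()` and the `which()` form return the same two rows, while plain bracket indexing returns an extra row of NAs.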