Project One: Titanic
This project uses the dataset titanic.csv to explore the data with at least one data visualization. A short essay then describes the source and topic of the data, the variables included, what each visualization represents, any interesting patterns or surprises that arise within the visualizations, and anything that I could not get to work or wished I could have included.
Load the data
library(readr) # provides read_csv()
setwd("C:/Users/wrxio/projects/Datasets")
titanic <- read_csv("titanic.csv")
## Rows: 1310 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): name, sex, ticket, cabin, embarked, boat, home.dest
## dbl (7): pclass, survived, age, sibsp, parch, fare, body
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(titanic)
## # A tibble: 6 x 14
## pclass survived name sex age sibsp parch ticket fare cabin embarked
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 1 1 Allen, M~ fema~ 29 0 0 24160 211. B5 S
## 2 1 1 Allison,~ male 0.917 1 2 113781 152. C22 ~ S
## 3 1 0 Allison,~ fema~ 2 1 2 113781 152. C22 ~ S
## 4 1 0 Allison,~ male 30 1 2 113781 152. C22 ~ S
## 5 1 0 Allison,~ fema~ 25 1 2 113781 152. C22 ~ S
## 6 1 1 Anderson~ male 48 0 0 19952 26.6 E12 S
## # ... with 3 more variables: boat <chr>, body <dbl>, home.dest <chr>
Organize and clean the dataset
Make all headers lowercase and remove spaces.
Check the result: after cleaning up, inspect the variable names and the structure of the data.
names(titanic) <- tolower(names(titanic))
names(titanic) <- gsub(" ","",names(titanic))
names(titanic)
## [1] "pclass" "survived" "name" "sex" "age" "sibsp"
## [7] "parch" "ticket" "fare" "cabin" "embarked" "boat"
## [13] "body" "home.dest"
str(titanic)
## spec_tbl_df [1,310 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ pclass : num [1:1310] 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : num [1:1310] 1 1 0 0 0 1 1 0 1 0 ...
## $ name : chr [1:1310] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
## $ sex : chr [1:1310] "female" "male" "female" "male" ...
## $ age : num [1:1310] 29 0.917 2 30 25 ...
## $ sibsp : num [1:1310] 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : num [1:1310] 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : chr [1:1310] "24160" "113781" "113781" "113781" ...
## $ fare : num [1:1310] 211 152 152 152 152 ...
## $ cabin : chr [1:1310] "B5" "C22 C26" "C22 C26" "C22 C26" ...
## $ embarked : chr [1:1310] "S" "S" "S" "S" ...
## $ boat : chr [1:1310] "2" "11" NA NA ...
## $ body : num [1:1310] NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: chr [1:1310] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
## - attr(*, "spec")=
## .. cols(
## .. pclass = col_double(),
## .. survived = col_double(),
## .. name = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. sibsp = col_double(),
## .. parch = col_double(),
## .. ticket = col_character(),
## .. fare = col_double(),
## .. cabin = col_character(),
## .. embarked = col_character(),
## .. boat = col_character(),
## .. body = col_double(),
## .. home.dest = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
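In this file the headers were already lowercase and contained no spaces, so the two cleaning steps leave the names unchanged. A minimal sketch on a made-up data frame (the `messy` object and its column names are hypothetical, not part of titanic.csv) shows what the same two calls do when headers do need cleaning:

```r
# Hypothetical data frame with untidy headers (not from titanic.csv)
messy <- data.frame(1:2, c("a", "b"), c(7.25, 71.28))
names(messy) <- c("Passenger Class", "Home Dest", "FARE")

# Same two steps as above: lowercase, then strip spaces
names(messy) <- tolower(names(messy))
names(messy) <- gsub(" ", "", names(messy))

cleaned_names <- names(messy)
print(cleaned_names)
```

Note that gsub(" ", "", ...) only removes spaces; other characters, such as the dot in home.dest, are left as they are.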
Count the passengers who did not survive and who survived
library(ggplot2)
# Total count of passengers who did not survive (0) and who survived (1)
ggplot(titanic, aes(x = survived)) +
  geom_bar(width = 0.5, fill = "blue") + # Tried: fill = "coral"
  # after_stat(count) replaces the deprecated stat(count)
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic: Passengers Who Did Not Survive vs. Survived") +
  xlab("Did Not Survive (0) and Survived (1)") +
  ylab("Count")
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Removed 1 rows containing non-finite values (stat_count).

Count the passengers in each class
# pclass is not mapped to fill because the bars use one fixed color
ggplot(titanic, aes(x = pclass)) +
  geom_bar(width = 0.5, fill = "green") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic: Number of Passengers in Each Class") +
  xlab("Class (1, 2, 3)") +
  ylab("Count")
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Removed 1 rows containing non-finite values (stat_count).
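pclass is stored as a number, but it is really a category with three levels. One possible approach, sketched here on a few made-up values rather than the real column, is to convert it to a factor before counting or plotting, and to count missing values explicitly instead of letting them be dropped with a warning. The vector `pclass_toy` and the labels "1st", "2nd" and "3rd" are assumptions chosen for illustration.

```r
# Made-up class values, including one missing entry (hypothetical data)
pclass_toy <- c(1, 3, 3, 2, NA, 3, 1)

# Treat the numeric codes as categories
pclass_factor <- factor(pclass_toy, levels = c(1, 2, 3),
                        labels = c("1st", "2nd", "3rd"))

# useNA = "ifany" adds a separate <NA> count if any values are missing
class_counts <- table(pclass_factor, useNA = "ifany")
print(class_counts)
```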

Count the passengers by sex
ggplot(titanic, aes(x = sex)) +
  geom_bar(width = 0.5, fill = "red") + # Tried: fill = "coral"
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic: Number of Passengers by Sex") +
  xlab("Sex") +
  ylab("Count")

Count the passengers who did not survive and who survived, by sex
ggplot(titanic, aes(x = survived, fill = sex)) +
  geom_bar(position = position_dodge()) +
  # Dodge width 0.9 matches the default bar width so labels sit over their bars
  geom_text(stat = "count", aes(label = after_stat(count)),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic: Did Not Survive vs. Survived, by Sex") +
  xlab("Did Not Survive (0) and Survived (1)") +
  ylab("Count") +
  labs(fill = "Sex")
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Removed 1 rows containing non-finite values (stat_count).

Plot the age density
# Age density
ggplot(titanic, aes(x = age)) +
  geom_density(fill = "coral") +
  theme_classic() +
  ggtitle("Titanic: Age Density") +
  xlab("Age") +
  ylab("Density")
## Warning: Removed 264 rows containing non-finite values (stat_density).

Create a treemap
The treemap explores survivors by sex
library(treemap) # provides treemap()
treemap(titanic, index = "sex", vSize = "survived",
        vColor = "pclass", type = "value", # note: type = "value"
        palette = "RdYlBu")

Create a second treemap
The treemap again explores survivors by sex
This version keeps the RColorBrewer palette RdYlBu but switches to type = "manual", which maps the pclass values directly onto the palette
treemap(titanic, index = "sex", vSize = "survived",
        vColor = "pclass", type = "manual", # note: type = "manual"
        palette = "RdYlBu")
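Because vSize = "survived" points at a 0/1 column, and treemap appears to sum vSize within each index group, the area of each rectangle should correspond to the number of survivors of that sex. A rough base-R sketch of that aggregation on made-up rows (the `toy` data frame is hypothetical) might look like this:

```r
# Made-up passengers (hypothetical data, not from titanic.csv)
toy <- data.frame(
  sex      = c("female", "male", "female", "male", "male"),
  survived = c(1, 0, 1, 1, 0)
)

# Sum the 0/1 survived column within each sex: the number of survivors
survivors_by_sex <- aggregate(survived ~ sex, data = toy, FUN = sum)
print(survivors_by_sex)
```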

a. The source and topic of the data, any variables included, what kind of variables they are, how you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).
The data comes from the file “titanic.csv”, which was downloaded from https://www.kaggle.com/ in CSV format. Its topic is the passengers of the Titanic, and it has 1310 rows and 14 columns. The dataset includes these variables and their data types (written as name-type, where num is numeric and chr is character): “pclass-num”, “survived-num”, “name-chr”, “sex-chr”, “age-num”, “sibsp-num”, “parch-num”, “ticket-chr”, “fare-num”, “cabin-chr”, “embarked-chr”, “boat-chr”, “body-num”, and “home.dest-chr”.
The dataset was cleaned up in the following ways:
Make all headers lowercase: this keeps the variable names in a consistent format and makes them easier to use later on.
Remove spaces: this avoids unnecessary mistakes when referring to the variables.
b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.
The first visualization shows the total count of passengers who did not survive (809) and who survived (500), with the bars filled in blue. Most of the people on the Titanic did not survive: about 62% of the 1,309 passengers with a recorded outcome.
The second visualization shows the number of passengers in each class: Class 1 (323), Class 2 (277) and Class 3 (709), with the bars filled in green. More than half of the passengers traveled in Class 3.
The third visualization shows the total number of passengers by sex, 466 female and 843 male, with the bars filled in red. One row has a missing (NA) value for sex.
The fourth visualization shows the number of passengers who did not survive and who survived, split by sex. Among those who did not survive (0), 127 were female and 682 were male; among those who survived (1), 339 were female and 161 were male. Color and a legend distinguish the two sexes. About 73% of the women survived compared with about 19% of the men, which is consistent with a policy of letting women escape first.
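Survival rates by sex can be derived from the four counts in that chart. A minimal sketch in base R, with the counts typed in by hand rather than recomputed from titanic.csv (the `counts` and `rates` names are only illustrative):

```r
# Counts read off the bar chart described above
counts <- matrix(c(127, 339,
                   682, 161),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("female", "male"),
                                 outcome = c("died", "survived")))

# Row-wise proportions: share of each sex that died or survived
rates <- prop.table(counts, margin = 1)
print(round(rates, 3))
```

Using margin = 1 makes each row sum to 1, so the two sexes can be compared even though there were far more men than women on board.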
The fifth visualization shows the density of passenger age. Most passengers appear to be between 20 and 35 years old. The 264 rows with a missing age were dropped from this plot.
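Reading an age range off a density curve is approximate. One way to check such a claim numerically is to compute the share of known ages inside the range. The sketch below uses a made-up vector (`age_toy` is hypothetical), so its result says nothing about the real data:

```r
# Made-up ages with missing values (hypothetical data)
age_toy <- c(22, 38, NA, 26, 35, 54, 2, NA, 27, 31)

# Share of known ages that fall between 20 and 35 inclusive
in_range <- age_toy >= 20 & age_toy <= 35
share_20_35 <- mean(in_range, na.rm = TRUE)

# Number of missing ages, which a density plot drops with a warning
n_missing <- sum(is.na(age_toy))
print(c(share = share_20_35, missing = n_missing))
```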
Finally, the two treemaps explore survivors by sex: the area of each rectangle reflects the number of survivors of that sex, and its color is driven by pclass.
c. Anything that you might have shown that you could not get to work or that you wished you could have included.
I still have many questions about cleaning the dataset and building models for further analysis. I cleaned the dataset by making all headers lowercase and removing spaces, but I need to learn more. More questions also come to mind, such as how to clean up missing data in R, how to find and remove irrelevant data, how to remove unwanted observations from the dataset and whether there are rules for doing so, how to fix structural errors, how to filter out unwanted outliers while keeping the wanted ones, and how to validate whether the dataset is good before and after cleaning.
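As one possible starting point for the missing-data question, the sketch below counts missing values per column and drops rows in which every column is missing. It runs on a made-up data frame (`toy` is hypothetical), and whether dropping rows is appropriate for titanic.csv is an open question, not something tested here:

```r
# Made-up data with gaps (hypothetical, not from titanic.csv)
toy <- data.frame(
  age  = c(29, NA, 2, NA),
  fare = c(211.3, 151.6, NA, NA),
  sex  = c("female", "male", "female", NA)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))

# Drop rows where every column is missing (such as a blank trailing line)
all_missing <- rowSums(is.na(toy)) == ncol(toy)
toy_clean <- toy[!all_missing, ]

print(na_per_column)
print(nrow(toy_clean))
```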
I would also like to learn how to build more models for analysis and how to use different models to handle different cases.