Project One Titanic

This project uses the dataset titanic.csv. It explores the data with at least one data visualization and includes a short essay describing the source and topic of the data, the variables included, what each visualization represents, any interesting patterns or surprises that arise within the visualizations, and anything that I could not get to work or that I wish I could have included.

Load the data

library(readr)   # provides read_csv()
setwd("C:/Users/wrxio/projects/Datasets")
titanic <- read_csv("titanic.csv")
## Rows: 1310 Columns: 14
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): name, sex, ticket, cabin, embarked, boat, home.dest
## dbl (7): pclass, survived, age, sibsp, parch, fare, body
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(titanic)
## # A tibble: 6 x 14
##   pclass survived name      sex      age sibsp parch ticket  fare cabin embarked
##    <dbl>    <dbl> <chr>     <chr>  <dbl> <dbl> <dbl> <chr>  <dbl> <chr> <chr>   
## 1      1        1 Allen, M~ fema~ 29         0     0 24160  211.  B5    S       
## 2      1        1 Allison,~ male   0.917     1     2 113781 152.  C22 ~ S       
## 3      1        0 Allison,~ fema~  2         1     2 113781 152.  C22 ~ S       
## 4      1        0 Allison,~ male  30         1     2 113781 152.  C22 ~ S       
## 5      1        0 Allison,~ fema~ 25         1     2 113781 152.  C22 ~ S       
## 6      1        1 Anderson~ male  48         0     0 19952   26.6 E12   S       
## # ... with 3 more variables: boat <chr>, body <dbl>, home.dest <chr>

Organize and clean the dataset

Make all headers lowercase and remove spaces

Check the result: after cleaning, inspect the variable names and the structure of the data.

names(titanic) <- tolower(names(titanic))
names(titanic) <- gsub(" ","",names(titanic))

names(titanic)
##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"    
##  [7] "parch"     "ticket"    "fare"      "cabin"     "embarked"  "boat"     
## [13] "body"      "home.dest"
str(titanic)
## spec_tbl_df [1,310 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ pclass   : num [1:1310] 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : num [1:1310] 1 1 0 0 0 1 1 0 1 0 ...
##  $ name     : chr [1:1310] "Allen, Miss. Elisabeth Walton" "Allison, Master. Hudson Trevor" "Allison, Miss. Helen Loraine" "Allison, Mr. Hudson Joshua Creighton" ...
##  $ sex      : chr [1:1310] "female" "male" "female" "male" ...
##  $ age      : num [1:1310] 29 0.917 2 30 25 ...
##  $ sibsp    : num [1:1310] 0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : num [1:1310] 0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : chr [1:1310] "24160" "113781" "113781" "113781" ...
##  $ fare     : num [1:1310] 211 152 152 152 152 ...
##  $ cabin    : chr [1:1310] "B5" "C22 C26" "C22 C26" "C22 C26" ...
##  $ embarked : chr [1:1310] "S" "S" "S" "S" ...
##  $ boat     : chr [1:1310] "2" "11" NA NA ...
##  $ body     : num [1:1310] NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: chr [1:1310] "St Louis, MO" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" "Montreal, PQ / Chesterville, ON" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   pclass = col_double(),
##   ..   survived = col_double(),
##   ..   name = col_character(),
##   ..   sex = col_character(),
##   ..   age = col_double(),
##   ..   sibsp = col_double(),
##   ..   parch = col_double(),
##   ..   ticket = col_character(),
##   ..   fare = col_double(),
##   ..   cabin = col_character(),
##   ..   embarked = col_character(),
##   ..   boat = col_character(),
##   ..   body = col_double(),
##   ..   home.dest = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Count the passengers who did not survive and who survived

library(ggplot2)
# Total counts of non-survivors (0) and survivors (1)
ggplot(titanic, aes(x = survived)) +
  geom_bar(width = 0.5, fill = "blue") +    # Tried: fill = "coral"
  # after_stat(count) replaces the deprecated stat(count)
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic: Total Number of Non-Survived and Survived") +
  xlab("Total Number of Non-Survived (0) and Survived (1)") +
  ylab("Count")
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Removed 1 rows containing non-finite values (stat_count).

Count the passengers in each class

ggplot(titanic, aes(x = pclass)) +
  # A constant fill is used, so no fill mapping or dodging is needed
  geom_bar(width = 0.5, fill = "green") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic: Total Number in Each Class") +
  xlab("Class level 1, 2, 3") +
  ylab("Count for Each Class Level")
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Removed 1 rows containing non-finite values (stat_count).

Count the passengers by sex

ggplot(titanic, aes(x = sex)) +
  geom_bar(width = 0.5, fill = "red") +    # Tried: fill = "coral"
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic Total Number by Sex") +
  xlab("Sex") +
  ylab("Count")

Count the non-survivors and survivors by sex

ggplot(titanic, aes(x = survived, fill = sex)) +
  geom_bar(position = position_dodge()) +
  # Dodge width 0.9 matches the default bar width so labels sit over their bars
  geom_text(stat = "count",
            aes(label = after_stat(count)),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  theme_classic() +
  ggtitle("Titanic Number of Non-Survived and Survived by Sex") +
  xlab("Non-Survived (0) and Survived (1) by Sex") +
  ylab("Count") +
  labs(fill = "Sex")
## Warning: Removed 1 rows containing non-finite values (stat_count).
## Removed 1 rows containing non-finite values (stat_count).
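
The same comparison can also be read as a table of counts and row proportions. The sketch below uses a small made-up data frame (`toy` is hypothetical and stands in for `titanic`), so the numbers it prints are not results from titanic.csv.

```r
# Made-up example rows; with the real data, use titanic$sex and titanic$survived
toy <- data.frame(
  sex      = c("female", "female", "male", "male", "male", NA),
  survived = c(1, 1, 0, 0, 1, NA)
)

# Cross-tabulate sex against survival; table() drops NA values by default
counts <- table(sex = toy$sex, survived = toy$survived)
counts

# Row proportions: share of each sex that did not survive (0) or survived (1)
rates <- prop.table(counts, margin = 1)
round(rates, 2)
```

Row proportions make the two sexes comparable even when one group is much larger than the other.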

Plot the age density

# Age density
ggplot(titanic, aes(x = age)) +
  geom_density(fill = "coral") +
  theme_classic() +
  ggtitle("Titanic Age Density") +
  xlab("Age") +
  ylab("Density")
## Warning: Removed 264 rows containing non-finite values (stat_density).

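Create a treemap

The treemap explores survival by sex: tile size comes from survived and tile color from pclass.

library(treemap)   # provides treemap()
treemap(titanic, index = "sex", vSize = "survived",
        vColor = "pclass", type = "value",   # note: type = "value"
        palette = "RdYlBu")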

Create a second treemap

This treemap again explores survival by sex, but this time with type = "manual" and the RColorBrewer palette RdYlBu.

treemap(titanic, index="sex", vSize="survived", 
        vColor="pclass", type="manual",    # note: type = "manual"
        palette="RdYlBu")

a. The source and topic of the data, any variables included, what kind of variables they are, how you cleaned the dataset up (be detailed and specific, using proper terminology where appropriate).

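The dataset, titanic.csv, was read in with 1,310 rows and 14 columns. Seven columns are numeric (pclass, survived, age, sibsp, parch, fare, body) and seven are character (name, sex, ticket, cabin, embarked, boat, home.dest).

I cleaned the dataset in two steps:

Make all headers lowercase: this keeps the variable names in a consistent format and makes them easier to use later on.

Remove spaces: this avoids unnecessary mistakes when referring to the variables.

In this file the headers were already lowercase and contained no spaces, so the names printed after cleaning are the same as those reported when the file was loaded.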
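
As a minimal sketch of what these two steps do, the example below applies them to a small made-up data frame. The column names `Passenger Class` and `Home Dest` are hypothetical and are not taken from titanic.csv.

```r
# Hypothetical data frame with messy headers (not the real titanic.csv columns)
toy <- data.frame(1:2, c("a", "b"))
names(toy) <- c("Passenger Class", "Home Dest")

# Step 1: make all headers lowercase
names(toy) <- tolower(names(toy))

# Step 2: remove spaces (fixed = TRUE treats " " as a literal string)
names(toy) <- gsub(" ", "", names(toy), fixed = TRUE)

names(toy)
```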

b. What the visualization represents, any interesting patterns or surprises that arise within the visualization.

The first visualization shows the total counts of passengers who did not survive (809) and who survived (500), drawn as blue bars. Most of the people on the Titanic did not survive: non-survivors outnumber survivors by 309.
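
To put these two counts on a common scale, they can be converted to percentages. This is only a sketch that reuses the numbers read from the chart, not a new analysis of the file; the names `counts` and `shares` are my own.

```r
# Counts read from the first bar chart
counts <- c(non_survived = 809, survived = 500)

# Share of each group among the passengers with a known outcome
shares <- round(100 * counts / sum(counts), 1)
shares
```

Under these counts, about 61.8% of the 1,309 passengers with a recorded outcome did not survive and about 38.2% did.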

The second visualization shows the number of passengers in each class, drawn as green bars: Class 1 (323), Class 2 (277) and Class 3 (709). Most passengers traveled in Class 3.

The next visualization shows the total number of passengers by sex, drawn as red bars: 466 were female and 843 were male. One record has a missing (NA) value for sex.

The age density visualization shows how ages are distributed. Most passengers appear to have been between about 20 and 35 years old.

Finally, the treemaps explore survival by sex using different colors.

c. Anything that you might have shown that you could not get to work or that you wished you could have included.

I still have open questions about cleaning the dataset and building models for analysis. I cleaned the dataset by making all headers lowercase and removing spaces, but I need to learn more. More questions came to mind, such as how to clean up missing data in R, how to find and remove irrelevant data, how to remove unwanted observations and whether there are rules for doing so, how to fix structural errors, how to filter out unwanted outliers while keeping the wanted ones, and how to validate that the dataset is sound before and after cleaning.
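
As a starting point for the missing-data question, here is a minimal base R sketch. It runs on a made-up data frame (`toy` and its values are hypothetical), and the two remedies shown, dropping incomplete rows and filling with the median, are common simple options that I assume as examples, not a recommendation for this dataset.

```r
# Made-up data frame standing in for the real titanic tibble
toy <- data.frame(
  age  = c(29, NA, 2, 30, NA),
  fare = c(211.3, 151.6, NA, 151.6, 26.6),
  sex  = c("female", "male", "female", NA, "male")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Option 1: keep only rows with no missing values (simple, but discards data)
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)

# Option 2: fill gaps in a numeric column with the median of the observed values
toy$age_filled <- ifelse(is.na(toy$age), median(toy$age, na.rm = TRUE), toy$age)
toy$age_filled
```

Counting missing values per column first shows which variables would lose the most rows before choosing between dropping and filling.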

I would also like to learn how to build more models for analysis and how to use different models to handle different cases.