1 Set Up

Clear the environment.

Install and load all the libraries. Make sure each library is installed before you load it.
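The set-up code itself is not shown above. A minimal sketch of what it might look like follows. The package list is an assumption inferred from the functions used later (glimpse from tidyverse, vis_dat and vis_miss from visdat, melt from reshape2, stargazer), and psych is a guess at where describe() comes from.

```r
# Clear all objects from the global environment
rm(list = ls())

# Packages this walkthrough appears to rely on (assumed list, adjust as needed)
pkgs <- c("tidyverse", "visdat", "psych", "stargazer", "reshape2")

# Check which of these packages are already installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Install anything that is missing (run once, then comment out again)
# install.packages(pkgs[!installed])

# Load every package that is available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Printing `installed` afterwards shows which packages still need to be installed.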

2 Import Data

Now, I will import my data.

Make sure you comment out or remove any View() command, such as View(train).

df <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/share/Day 2/train.csv")

## Check that the data has been imported correctly
# head(df) # first 6 rows of the data 
# tail(df) # last 6 rows of the data 
# str(df)

glimpse(df) # from tidyverse package

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
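In the output above, Cabin holds empty strings ("") rather than NA, so R does not count those cells as missing. If you would rather treat blanks as missing, one option is the na.strings argument of read.csv. The sketch below uses a small made-up CSV written to a temporary file (not the Titanic file) to show the difference.

```r
# Write a tiny made-up CSV to a temporary file (hypothetical data)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Age,Cabin",
             "1,22,",
             "2,38,C85",
             "3,,"),
           csv_path)

# Default import: blank text fields stay as empty strings
raw <- read.csv(csv_path)

# Alternative import: blank fields are read as NA
recoded <- read.csv(csv_path,
                    na.strings = c("", "NA"))

colSums(is.na(raw))      # missing values per column, default import
colSums(is.na(recoded))  # missing values per column, blanks as NA
```

With the default import only the blank Age is NA; with na.strings set, the two blank Cabin values are NA as well.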

| Variable                          | Type         | Level of Measurement |
|-----------------------------------|--------------|----------------------|
| Passenger ID                      | Qualitative  | Nominal              |
| Survived\*\*                      | Qualitative  | Nominal              |
| Passenger class\*\*               | Qualitative  | Ordinal              |
| Name                              | Qualitative  | Nominal              |
| Sex\*\*                           | Qualitative  | Nominal              |
| Age                               | Quantitative | Ratio                |
| Number of Siblings/Spouses Aboard | Quantitative | Ratio                |
| Number of Parents/Children Aboard | Quantitative | Ratio                |
| Ticket                            | Qualitative  | Nominal              |
| Fare                              | Quantitative | Ratio                |
| Cabin                             | Qualitative  | ??                   |
| Embarked                          | Qualitative  | Nominal              |

Read up on indexing.

2.1 Visualization of Dataset

vis_dat(df)   # from the visdat package

vis_miss(df)  # from the visdat package

About 20% of the Age values are missing (177 of 891), which accounts for about 1.7% of all cells in the dataset.
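The same two percentages can be checked numerically with is.na(). The sketch below uses a small made-up data frame (toy is a hypothetical name) so it runs on its own; replace toy with df to check the Titanic data.

```r
# Small made-up data frame with two missing Age values
toy <- data.frame(Age  = c(22, NA, 26, 35, NA),
                  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05))

# Share of missing values within each column
col_share <- colMeans(is.na(toy))
col_share

# Share of missing values across all cells of the data frame
cell_share <- mean(is.na(toy))
cell_share
```

Here Age is 40% missing, while only 20% of all cells are missing, because the other column is complete.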

2.2 Treating Missing Data

I can either drop all observations with a missing Age value (df_drop) or impute the missing Age values (df_imputed).

Drop all rows corresponding to missing Age values.

?na.omit
df_drop    <- na.omit(df)
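Note that na.omit() drops a row if any column is NA, not only Age. In the glimpse output above Age is the only column showing NA, so the result should be the same here, but on other data the two approaches can differ. A minimal sketch on made-up data (toy is a hypothetical name):

```r
# Made-up data: row 2 is missing both values, row 3 is missing only Fare
toy <- data.frame(Age  = c(22, NA, 26),
                  Fare = c(7.25, NA, NA))

# Drops every row that has an NA in any column
drop_any <- na.omit(toy)

# Drops only the rows where Age is missing
drop_age <- toy[!is.na(toy$Age), ]

nrow(drop_any)  # rows left after na.omit()
nrow(drop_age)  # rows left after filtering on Age only
```

Filtering on Age alone keeps the row whose only gap is in Fare.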

Now I will impute the missing values with the mean of the same variable (Age).

# replace missing values of age with the mean/median

######### STEP 1: FIND THE MEAN/MEDIAN OF THE AGE VARIABLE
?mean()
mean(df$Age, na.rm = TRUE)

## [1] 29.69912

median(df$Age, na.rm = TRUE)

## [1] 28

?is.na

######### STEP 2: DUPLICATE YOUR ORIGINAL DATA (I do not want to overwrite the raw data)
df_imputed <- df # duplicate the original data


######### STEP 3: CHANGE THE VARIABLE VALUES HERE
  describe(df_imputed$Age) # mean before imputing

##    vars   n mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 714 29.7 14.53     28   29.27 13.34 0.42  80 79.58 0.39     0.16 0.54

  # REPLACING THE MISSING AGE VALUES WITH THE MEAN OF THE AGE VARIABLE 
df_imputed$Age[is.na(df_imputed$Age)] <- mean(df_imputed$Age, 
                                              na.rm = TRUE
                                              )

  describe(df_imputed$Age)  # mean after imputing

##    vars   n mean sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 891 29.7 13   29.7   29.25 9.34 0.42  80 79.58 0.43     0.95 0.44
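Comparing the two describe() outputs, the mean stays at 29.7 but the standard deviation falls from 14.53 to 13. Mean imputation adds values that sit exactly at the centre, so the spread shrinks. The sketch below reproduces that effect on a short made-up vector (the numbers are illustrative, not the Titanic data).

```r
# Made-up ages with two missing values
age <- c(22, 38, 26, 35, NA, 54, 2, NA)

# Copy the vector, then replace the missing values with the observed mean
age_imputed <- age
age_imputed[is.na(age_imputed)] <- mean(age, na.rm = TRUE)

# The mean is unchanged by mean imputation
mean(age, na.rm = TRUE)
mean(age_imputed)

# The standard deviation gets smaller
sd(age, na.rm = TRUE)
sd(age_imputed)
```

Swapping mean() for median() in the replacement line gives median imputation instead.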

3 Summary Statistics

Here is my table from stargazer on the clean data. DO NOT RUN summary statistics on the original data.

Split the arguments of a function across separate lines.
Align your code too (the = signs and the # comments).

?stargazer

# BASIC COMMAND
## stargazer(df,  type = "text") # Age has 714 observations only, while all other variables have 891 observations.

# EMBELLISHED COMMAND
stargazer(df_drop,                  
          type           = "text",                                   # output format - "html"
          notes          = "N=891, but age has 177 missing values", 
          summary.stat   = c("mean","sd","min", "max"), 
          digits         = 1,                                        # decimal places 
          title          = "Titanic Data Summary Statistics"
          )

## 
## Titanic Data Summary Statistics
## =========================================
## Statistic      Mean   St. Dev. Min   Max 
## -----------------------------------------
## PassengerId    448.6   259.1    1    891 
## Survived        0.4     0.5     0     1  
## Pclass          2.2     0.8     1     3  
## Age            29.7     14.5   0.4  80.0 
## SibSp           0.5     0.9     0     5  
## Parch           0.4     0.9     0     6  
## Fare           34.7     52.9   0.0  512.3
## -----------------------------------------
## N=891, but age has 177 missing values

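In the original data, Age has only 714 observations, while all other variables have 891. The table above is computed on df_drop, so every statistic uses the 714 rows that remain after dropping the missing Age values.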

DESCRIBE YOUR OBSERVATIONS HERE…

4 Variables Visualization

Simple base R code to create charts.

?boxplot
 
# Layout to split the screen

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  heights = c(1,8))
 

# Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))

boxplot(df$Age , 
        horizontal = TRUE,  
        ylim       = c(0, 100), 
        xaxt       = "n" ,
        col        = rgb(0.8, 0.8, 0,0.5) , 
        frame      = FALSE
        )


par(mar=c(4, 3.1, 1.1, 2.1))

?hist
hist(df$Age , 
     breaks  = 10 , 
     col     = rgb(0.2,0.8,0.5,0.5) , 
     border  = FALSE , 
     main    = "" , 
     xlab    = "Age", 
     xlim    = c(0,100)
     )

5 Graphing with ggplot2 package instead of Base R

Make sure to install the ggplot2 package.
Watch some videos on ggplot2 (a Grammar of Graphics description and a basic syntax introduction).
Then test the ggplot2 command on the mpg data first.

5.1 ggplot2 - try it out with the 3 basic arguments

5.1.1 Get “fake” data just to test the command

data() # list the datasets available in the loaded packages
?mpg   # open the help file to see the variable names in the data

dim(mpg)       # dimensions (rows and columns)

## [1] 234  11

colnames(mpg)  # print out the column names

##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"

5.1.2 Now, run the basic ggplot2 command

Note for histograms and boxplots you need only one variable.
Note for scatterplot (below) you need two variables.

ggplot(data = mpg,
       mapping =  aes(x = cty, 
                      y = hwy)
       ) + geom_point()
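For the one-variable case mentioned above, the same three basic arguments apply, with only x mapped. A minimal sketch on mpg follows; the choice of hwy and the binwidth of 2 are arbitrary, and p_hist and p_box are hypothetical names.

```r
library(ggplot2)

# Histogram: one numeric variable mapped to x
p_hist <- ggplot(data    = mpg,
                 mapping = aes(x = hwy)
                 ) + geom_histogram(binwidth = 2)

# Boxplot: the same single variable, different geom
p_box <- ggplot(data    = mpg,
                mapping = aes(x = hwy)
                ) + geom_boxplot()
```

Typing p_hist or p_box at the console draws the chart.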

5.1.3 Try faceting the data

You can create subplots by splitting the data by a variable!

colnames(mpg)  # print out the column names

##  [1] "manufacturer" "model"        "displ"        "year"         "cyl"         
##  [6] "trans"        "drv"          "cty"          "hwy"          "fl"          
## [11] "class"

ggplot(data = mpg,
       mapping =  aes(x = cty, 
                      y = hwy)
       ) + 
  geom_point() + 
  facet_grid(rows = vars(year))

5.2 Apply ggplot2 on your actual data

Use the Titanic data in the Day 2 Dropbox folder (see my import command).

df <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/share/Day 2/train.csv")

Note for histograms and boxplots you need only one variable. I will randomly choose one variable.

5.2.1 Test on a single variable on actual data

?ggplot

ggplot(data    = df,              # change df to the name of your data frame 
       mapping = aes(x = SibSp)   # choose your numeric variable
        ) + geom_bar()            # play around with option to create different graphs

# shorter code for the same graph above
ggplot(mapping = aes(x = df$SibSp)
        ) + geom_bar()

5.3 Break the chart by another variable

Try to fix the labels of the chart.

?facet_grid

ggplot(data    = df,              # change df to the name of your data frame 
       mapping = aes(x = SibSp)   # choose your numeric variable
        ) + geom_bar()  + facet_grid(rows = vars(Pclass))

# A set of variables or expressions quoted by vars() and defining faceting groups on the rows
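One possible way to fix the labels is to combine labs() for the title and axes with the labeller argument of facet_grid() for the strip text. The sketch below uses a small made-up data frame so it runs on its own; toy, class_labels, and the label wording are all illustrative choices.

```r
library(ggplot2)

# Made-up stand-in for the Titanic columns used in the chart
toy <- data.frame(SibSp  = c(0, 1, 0, 2, 1, 0),
                  Pclass = c(1, 1, 2, 2, 3, 3))

# Lookup table: facet value -> text shown on the facet strip
class_labels <- c(`1` = "1st class",
                  `2` = "2nd class",
                  `3` = "3rd class")

p <- ggplot(data    = toy,
            mapping = aes(x = SibSp)
            ) +
  geom_bar() +
  facet_grid(rows     = vars(Pclass),
             labeller = labeller(Pclass = class_labels)) +
  labs(title = "Siblings/spouses aboard by passenger class",
       x     = "Number of siblings/spouses aboard",
       y     = "Number of passengers")
```

Replacing toy with df should give the same chart for the full data.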

6 Using ggplot to visualize numeric data

We will have to transform our entire data into “long form” and use the facet option to see the distributions of variables in our data.

We will have to install the reshape2 package and use the melt function in it to get our data into long format.

6.1 Step 1: Reshape your data for appropriate input for ggplot

6.1.1 Original Data

str(df)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

head(df)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

tail(df)

##     PassengerId Survived Pclass                                     Name    Sex
## 886         886        0      3     Rice, Mrs. William (Margaret Norton) female
## 887         887        0      2                    Montvila, Rev. Juozas   male
## 888         888        1      1             Graham, Miss. Margaret Edith female
## 889         889        0      3 Johnston, Miss. Catherine Helen "Carrie" female
## 890         890        1      1                    Behr, Mr. Karl Howell   male
## 891         891        0      3                      Dooley, Mr. Patrick   male
##     Age SibSp Parch     Ticket   Fare Cabin Embarked
## 886  39     0     5     382652 29.125              Q
## 887  27     0     0     211536 13.000              S
## 888  19     0     0     112053 30.000   B42        S
## 889  NA     1     2 W./C. 6607 23.450              S
## 890  26     0     0     111369 30.000  C148        C
## 891  32     0     0     370376  7.750              Q

?melt

  df_melted <- melt(df)

## Using Name, Sex, Ticket, Cabin, Embarked as id variables

# df_melted <- reshape2::melt(df) ## equivalent command - I am just specifying the package name too
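The message above shows that melt() guessed the id variables: the character columns became ids and every numeric column, including PassengerId, was stacked. You can state the split yourself with id.vars and measure.vars. A minimal sketch on made-up data (toy and toy_long are hypothetical names):

```r
library(reshape2)

# Made-up wide data: one row per passenger
toy <- data.frame(PassengerId = 1:3,
                  Sex         = c("male", "female", "female"),
                  Age         = c(22, 38, 26),
                  Fare        = c(7.25, 71.2833, 7.925))

# Keep PassengerId and Sex as identifiers, stack only Age and Fare
toy_long <- melt(data         = toy,
                 id.vars      = c("PassengerId", "Sex"),
                 measure.vars = c("Age", "Fare"))

dim(toy_long)  # rows = original rows x number of measure variables
```

The long data has one row per passenger per measured variable, which is also why the full Titanic result has 891 x 7 = 6237 rows.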

6.1.2 Transformed Data:

Data is in long format now.

str(df_melted)

## 'data.frame':    6237 obs. of  7 variables:
##  $ Name    : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex     : chr  "male" "female" "female" "female" ...
##  $ Ticket  : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Cabin   : chr  "" "C85" "" "C123" ...
##  $ Embarked: chr  "S" "C" "S" "S" ...
##  $ variable: Factor w/ 7 levels "PassengerId",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value   : num  1 2 3 4 5 6 7 8 9 10 ...

head(df_melted)

##                                                  Name    Sex           Ticket
## 1                             Braund, Mr. Owen Harris   male        A/5 21171
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female         PC 17599
## 3                              Heikkinen, Miss. Laina female STON/O2. 3101282
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female           113803
## 5                            Allen, Mr. William Henry   male           373450
## 6                                    Moran, Mr. James   male           330877
##   Cabin Embarked    variable value
## 1              S PassengerId     1
## 2   C85        C PassengerId     2
## 3              S PassengerId     3
## 4  C123        S PassengerId     4
## 5              S PassengerId     5
## 6              Q PassengerId     6

tail(df_melted)

##                                          Name    Sex     Ticket Cabin Embarked
## 6232     Rice, Mrs. William (Margaret Norton) female     382652              Q
## 6233                    Montvila, Rev. Juozas   male     211536              S
## 6234             Graham, Miss. Margaret Edith female     112053   B42        S
## 6235 Johnston, Miss. Catherine Helen "Carrie" female W./C. 6607              S
## 6236                    Behr, Mr. Karl Howell   male     111369  C148        C
## 6237                      Dooley, Mr. Patrick   male     370376              Q
##      variable  value
## 6232     Fare 29.125
## 6233     Fare 13.000
## 6234     Fare 30.000
## 6235     Fare 23.450
## 6236     Fare 30.000
## 6237     Fare  7.750

6.2 Step 2: Apply ggplot and use the facet wrap function

The first command works, but the data are hard to read.

ggplot(data = df_melted, 
       aes(x = value)
       ) + 
  geom_histogram() + 
  facet_wrap(facets = . ~ variable) # . ~ variable just means to facet the plot on the variable

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

Just add one more option to control the scale and be able to read the data much better.

?facet_wrap

ggplot(data = df_melted, 
       aes(x = value)
       ) + 
  geom_histogram() + 
  facet_wrap(facets = ~ variable, 
             scales = "free_x") # let your x axis vary for every subplot

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

Of course, you can clean up the x axis and y axis labels, such as “value” and “count” above, too. Try it!
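One way to try this is sketched below: labs() renames the axes, bins sets the number of bins that the message asks about, and na.rm = TRUE silences the warning about removed rows. The data frame is a small made-up stand-in for df_melted, and the label text and bin count are arbitrary choices.

```r
library(ggplot2)

# Made-up long data with the same two columns used in the plot
toy_long <- data.frame(
  variable = rep(c("Age", "Fare"), each = 5),
  value    = c(22, 38, 26, 35, NA,
               7.25, 71.28, 7.93, 53.10, 8.05)
)

p <- ggplot(data = toy_long,
            aes(x = value)
            ) +
  geom_histogram(bins  = 10,       # choose the number of bins yourself
                 na.rm = TRUE) +   # drop missing values without a warning
  facet_wrap(facets = ~ variable,
             scales = "free_x") +
  labs(x = "Value",
       y = "Number of passengers")
```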

7 WIDE TO LONG

Reshaping from wide to long changes only the layout. The data values remain the same.

dim(df)

## [1] 891  12

glimpse(df)

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

df_subset <- df[3:8, c(1, 3, 6, 10)] # rows 3 to 8; columns PassengerId, Pclass, Age, Fare
df_subset

##   PassengerId Pclass Age    Fare
## 3           3      3  26  7.9250
## 4           4      1  35 53.1000
## 5           5      3  35  8.0500
## 6           6      3  NA  8.4583
## 7           7      1  54 51.8625
## 8           8      3   2 21.0750

melted_df_subset <- melt(data = df_subset)

## No id variables; using all as measure variables

melted_df_subset

##       variable   value
## 1  PassengerId  3.0000
## 2  PassengerId  4.0000
## 3  PassengerId  5.0000
## 4  PassengerId  6.0000
## 5  PassengerId  7.0000
## 6  PassengerId  8.0000
## 7       Pclass  3.0000
## 8       Pclass  1.0000
## 9       Pclass  3.0000
## 10      Pclass  3.0000
## 11      Pclass  1.0000
## 12      Pclass  3.0000
## 13         Age 26.0000
## 14         Age 35.0000
## 15         Age 35.0000
## 16         Age      NA
## 17         Age 54.0000
## 18         Age  2.0000
## 19        Fare  7.9250
## 20        Fare 53.1000
## 21        Fare  8.0500
## 22        Fare  8.4583
## 23        Fare 51.8625
## 24        Fare 21.0750
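In the output above PassengerId was stacked as a measure variable, so the long rows no longer say which passenger they belong to. A possible variation keeps PassengerId as the id variable, which also lets dcast() turn the long data back into wide form. The sketch below retypes the six rows printed above so that it runs on its own; long and wide_again are hypothetical names.

```r
library(reshape2)

# The same six rows as df_subset above, typed in by hand
df_subset <- data.frame(
  PassengerId = 3:8,
  Pclass      = c(3, 1, 3, 3, 1, 3),
  Age         = c(26, 35, 35, NA, 54, 2),
  Fare        = c(7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21.0750)
)

# Wide to long, keeping PassengerId as the identifier
long <- melt(data    = df_subset,
             id.vars = "PassengerId")

# Long back to wide: one column per level of variable
wide_again <- dcast(data      = long,
                    formula   = PassengerId ~ variable,
                    value.var = "value")

dim(long)        # 6 passengers x 3 measure variables
dim(wide_again)  # back to 6 rows and 4 columns
```

The round trip returns the same values, including the missing Age for passenger 6.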

8 GGPLOT2 OPTIONS

Play with options.

ggplot(data = df_melted, 
       aes(x = value)
       ) + 
  geom_histogram() + 
  facet_wrap(facets = . ~ variable) + 
  labs(title = "Titanic Dataset Histograms",
       y     = "", 
       x     = "Variables"
       ) + 
  theme_bw()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(data = df_melted, 
       aes(x = value)
       ) + 
  geom_boxplot() + 
  facet_wrap(facets = . ~ variable) + 
  labs(title = "Titanic Dataset Boxplots",
       y     = "", 
       x     = "Variables"
       ) + 
  theme_gray()

## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
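Two more options that may be worth playing with: dropping PassengerId before plotting, since an ID has no meaningful distribution, and freeing the x scale so that one wide-ranging variable does not squash the other boxplots. The sketch below uses a small made-up stand-in for df_melted; toy_long and plot_data are hypothetical names.

```r
library(ggplot2)

# Made-up long data, including an ID variable we do not want to plot
toy_long <- data.frame(
  variable = rep(c("PassengerId", "Age", "Fare"), each = 4),
  value    = c(1, 2, 3, 4,
               22, 38, 26, NA,
               7.25, 71.28, 7.93, 53.10)
)

# Keep only the variables worth plotting
plot_data <- subset(toy_long, variable != "PassengerId")

p <- ggplot(data = plot_data,
            aes(x = value)
            ) +
  geom_boxplot(na.rm = TRUE) +
  facet_wrap(facets = ~ variable,
             scales = "free_x") +
  labs(title = "Boxplots of numeric variables",
       x     = "Value",
       y     = "") +
  theme_bw()
```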

ggplot on titanic data

Arvind Sharma

2024-04-16