1 Set Up

Clear the environment.

Install and load all the libraries. Make sure you have the libraries installed before loading them.
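A minimal sketch of this setup step. The package list is an assumption inferred from the functions called later in these notes (glimpse, vis_dat and vis_miss, describe, stargazer); it is not stated in the original.

```r
# Clear all objects from the global environment.
rm(list = ls())

# Packages assumed from the functions used later in these notes.
pkgs <- c("tidyverse", "visdat", "psych", "stargazer")

# Find which of these packages are not yet installed.
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only the missing ones (uncomment to run).
# install.packages(missing_pkgs)

# Load every package that is already available.
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```

Checking with requireNamespace() first avoids an error from library() when a package is not installed.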

2 Import Data

Now, I will import my data.

Make sure you comment out, remove, or simply do not run the View(train) command.

df <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/share/Day 2/train.csv")

## Check that the data has been imported correctly
# head(df) # first 6 rows of the data 
# tail(df) # last 6 rows of the data 
# str(df)  # structure: variable names, types, and sample values

glimpse(df) # from tidyverse package

## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
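The glimpse output shows that Cabin holds empty strings ("") rather than NA, so those cells are not counted as missing. As an optional alternative (not what these notes do), the na.strings argument of read.csv can turn empty strings into NA at import time. The sketch below uses a tiny made-up CSV written to a temporary file so that it runs on its own.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("PassengerId,Age,Cabin",
             "1,22,",
             "2,38,C85",
             "3,,"),
           tmp)

# Default import: empty Cabin cells stay as "" and are not counted as missing.
raw <- read.csv(tmp)

# Alternative: treat empty strings as NA at import time.
clean <- read.csv(tmp, na.strings = c("", "NA"))

sum(is.na(raw$Cabin))    # empty strings are not NA here
sum(is.na(clean$Cabin))  # empty strings are now NA
```

Note that an empty numeric field (Age in the third row) becomes NA under both imports; only character columns differ.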

| Variable                          | Type         | Level of Measurement |
|-----------------------------------|--------------|----------------------|
| Passenger ID                      | Qualitative  | Nominal              |
| Survived\*\*                      | Qualitative  | Nominal              |
| Passenger class\*\*               | Qualitative  | Ordinal              |
| Name                              | Qualitative  | Nominal              |
| Sex\*\*                           | Qualitative  | Nominal              |
| Age                               | Quantitative | Ratio                |
| Number of Siblings/Spouses Aboard | Quantitative | Ratio                |
| Number of Parents/Children Aboard | Quantitative | Ratio                |
| Ticket                            | Qualitative  | Nominal              |
| Fare                              | Quantitative | Ratio                |
| Cabin                             | Qualitative  | ??                   |
| Embarked                          | Qualitative  | Nominal              |

Read up on indexing.
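As a starting point, here is a short sketch of the indexing forms these notes rely on. The toy data frame and its values are made up for illustration.

```r
# Toy data frame; the values are made up for illustration.
toy <- data.frame(
  Name = c("A", "B", "C"),
  Age  = c(22, NA, 35),
  Sex  = c("male", "female", "female")
)

toy$Age                      # one column as a vector
toy[2, ]                     # second row, all columns
toy[, c("Name", "Age")]      # all rows, two columns selected by name
toy[toy$Sex == "female", ]   # rows where a condition holds
toy$Age[is.na(toy$Age)]      # elements of Age that are missing
```

The last form, a logical vector inside square brackets, is the same pattern used later to replace missing Age values.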

2.1 Visualization of Dataset

vis_dat(df)   # from the visdat package

vis_miss(df)  # from the visdat package

About 20% of the Age values are missing (177 of 891), which accounts for about 1.7% of all cells in the dataset.
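These two percentages can be checked by hand or computed with is.na(). The sketch below first redoes the arithmetic with the counts reported in these notes (177 missing ages, 891 rows, 12 columns), then shows the general computation on a made-up toy data frame.

```r
# Arithmetic with the counts reported in these notes.
177 / 891          # share of Age values that are missing
177 / (891 * 12)   # share of all cells in the dataset that are missing

# General computation on a toy data frame (values made up).
toy <- data.frame(
  Age  = c(22, NA, 35, NA, 54),
  Fare = c(7.25, 71.28, 7.93, 53.10, 8.05)
)

miss_by_col  <- colMeans(is.na(toy))  # share of missing values per column
miss_overall <- mean(is.na(toy))      # share of missing cells overall
```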

2.2 Treating Missing Data

I can drop all observations with missing Age values (df_drop), or I can impute all missing Age values (df_imputed).

Drop all rows corresponding to missing Age values.

?na.omit
df_drop    <- na.omit(df)
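Note that na.omit() drops a row if any column contains NA, not just Age. That gives the same result here only if Age is the sole column with NA values. A minimal sketch of the difference, using a made-up toy data frame:

```r
# Toy data frame with NA in two different columns (values made up).
toy <- data.frame(
  Age  = c(22, NA, 35),
  Fare = c(7.25, NA, NA)
)

drop_any <- na.omit(toy)             # drops every row with an NA in any column
drop_age <- toy[!is.na(toy$Age), ]   # drops only rows where Age is NA

nrow(drop_any)
nrow(drop_age)
```

Filtering on the column explicitly makes the intent clear and avoids losing rows because of missing values in unrelated variables.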

Now I will impute the missing values with the mean of the same variable (Age).

# replace missing values of age with the mean/median

######### STEP 1: FIND THE MEAN/MEDIAN OF THE AGE VARIABLE
?mean
mean(df$Age, na.rm = TRUE)

## [1] 29.69912

median(df$Age, na.rm = TRUE)

## [1] 28

?is.na

######### STEP 2: DUPLICATE YOUR ORIGINAL DATA (I do not want to overwrite the raw data)
df_imputed <- df # duplicate the original data


######### STEP 3: CHANGE THE VARIABLE VALUES HERE
  describe(df_imputed$Age) # mean before imputing

##    vars   n mean    sd median trimmed   mad  min max range skew kurtosis   se
## X1    1 714 29.7 14.53     28   29.27 13.34 0.42  80 79.58 0.39     0.16 0.54

  # REPLACING THE MISSING AGE VALUES WITH THE MEAN OF THE AGE VARIABLE 
df_imputed$Age[is.na(df_imputed$Age)] <- mean(df_imputed$Age, 
                                              na.rm = TRUE
                                              )

  describe(df_imputed$Age)  # mean after imputing

##    vars   n mean sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 891 29.7 13   29.7   29.25 9.34 0.42  80 79.58 0.43     0.95 0.44
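Comparing the two describe() outputs, n rises from 714 to 891 and the mean stays at 29.7, but the standard deviation falls from 14.53 to 13. Mean imputation adds values exactly at the center, so it shrinks the spread. The sketch below reproduces that effect on a small made-up vector and also shows the median variant mentioned in the comments above.

```r
# Toy age vector with two missing values (values made up).
age <- c(22, 38, NA, 35, NA, 54)

# Mean imputation.
age_mean <- age
age_mean[is.na(age_mean)] <- mean(age, na.rm = TRUE)

# Median imputation, as an alternative.
age_median <- age
age_median[is.na(age_median)] <- median(age, na.rm = TRUE)

mean(age, na.rm = TRUE)   # mean of the observed values
mean(age_mean)            # unchanged after mean imputation
sd(age, na.rm = TRUE)     # spread of the observed values
sd(age_mean)              # smaller after mean imputation
```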

3 Summary Statistics

Here is my table from stargazer on the clean data. DO NOT RUN summary statistics on the original data.

Split the arguments of a function call across separate lines.
Align your code too (the = signs and the # comments).

?stargazer

# BASIC COMMAND
## stargazer(df,  type = "text") # Age has 714 observations only, while all other variables have 891 observations.

# EMBELLISHED COMMAND
stargazer(df_drop,                  
          type           = "text",                                   # output format; "html" is another option
          notes          = "N=891, but age has 177 missing values", 
          summary.stat   = c("mean","sd","min", "max"), 
          digits         = 1,                                        # decimal places 
          title          = "Titanic Data Summary Statistics"
          )

## 
## Titanic Data Summary Statistics
## =========================================
## Statistic      Mean   St. Dev. Min   Max 
## -----------------------------------------
## PassengerId    448.6   259.1    1    891 
## Survived        0.4     0.5     0     1  
## Pclass          2.2     0.8     1     3  
## Age            29.7     14.5   0.4  80.0 
## SibSp           0.5     0.9     0     5  
## Parch           0.4     0.9     0     6  
## Fare           34.7     52.9   0.0  512.3
## -----------------------------------------
## N=891, but age has 177 missing values

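In the original data, Age has only 714 observations, while all other variables have 891. The table above is built from df_drop, so every statistic is computed on the 714 rows with a non-missing Age.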
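For comparison, the same four statistics (mean, standard deviation, minimum, maximum) can be computed in base R without stargazer. This is a sketch on a made-up toy data frame, not a replacement for the table above.

```r
# Toy data frame with numeric and character columns (values made up).
toy <- data.frame(
  Age  = c(22, 38, 26, 35),
  Fare = c(7.25, 71.28, 7.93, 53.10),
  Sex  = c("male", "female", "female", "female")
)

# Keep only the numeric columns.
num_cols <- toy[sapply(toy, is.numeric)]

# One row per variable, one column per statistic.
stats <- t(sapply(num_cols, function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}))

round(stats, 1)
```

If the data still contained missing values, each function would need na.rm = TRUE; here the toy data is complete, matching the advice to summarize the clean data.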

DESCRIBE YOUR OBSERVATIONS HERE…

4 Variables Visualization

Simple base R code to create charts: a boxplot of Age stacked above a histogram of Age.

?boxplot
 
# Layout to split the screen

layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  heights = c(1,8))
 

# Draw the boxplot and the histogram 
par(mar=c(0, 3.1, 1.1, 2.1))

boxplot(df$Age , 
        horizontal = TRUE,  
        ylim       = c(0, 100), 
        xaxt       = "n" ,
        col        = rgb(0.8, 0.8, 0,0.5) , 
        frame      = FALSE
        )


par(mar=c(4, 3.1, 1.1, 2.1))

?hist
hist(df$Age , 
     breaks  = 10 , 
     col     = rgb(0.2,0.8,0.5,0.5) , 
     border  = FALSE , 
     main    = "" , 
     xlab    = "Age", 
     xlim    = c(0,100)
     )

Day 3 In Class

Arvind Sharma

2023-07-20