Read this help file for R Markdown.
Clear all environments.
Installing and loading all the libraries. Make sure you have the libraries installed.
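A minimal sketch of the clearing and loading steps. The package list is an assumption, inferred from the functions used in these notes (glimpse, vis_dat, describe, stargazer); adjust it to match your own script. Nothing is installed automatically here.

```r
# Clear all objects from the global environment
rm(list = ls())

# Packages assumed from the functions used in these notes
pkgs <- c("tidyverse", "visdat", "psych", "stargazer")

# Check which packages are not installed yet
installed    <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install these first: ", paste(missing_pkgs, collapse = ", "))
}

# Load the packages that are available
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

If the message lists any packages, run install.packages() on them once in the console, not inside the R Markdown file.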
Now, I will import my data.
Make sure you comment out or remove any View(train) command.
df <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/share/Day 2/train.csv")
## Data has been imported correctly -
# head(df) # first 6 rows of the data (default n = 6)
# tail(df) # last 6 rows of the data (default n = 6)
# str(df)
glimpse(df) # from tidyverse package
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
| Variable | Type | Level of Measurement |
|-----------------------------------|--------------|----------------------|
| Passenger ID | Qualitative | Nominal |
| Survived\*\* | Qualitative | Nominal |
| Passenger class\*\* | Qualitative | Ordinal |
| Name | Qualitative | Nominal |
| Sex\*\* | Qualitative | Nominal |
| Age | Quantitative | Ratio |
| Number of Siblings/Spouses Aboard | Quantitative | Ratio |
| Number of Parents/Children Aboard | Quantitative | Ratio |
| Ticket | Qualitative | Nominal |
| Fare | Quantitative | Ratio |
| Cabin | Qualitative | ?? |
| Embarked | Qualitative | Nominal |
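One way to make the levels of measurement in the table explicit in R is to store nominal variables as factors and ordinal variables as ordered factors. This is only a sketch on a toy data frame (the values are the first five rows shown by glimpse); treating 1 < 2 < 3 as the order for Pclass is an assumption that simply follows the numeric labels.

```r
# Toy data frame built from the first five rows shown above
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0),
  Pclass   = c(3, 1, 3, 1, 3),
  Sex      = c("male", "female", "female", "female", "male")
)

# Nominal variables: unordered factors
toy$Survived <- factor(toy$Survived, levels = c(0, 1))
toy$Sex      <- factor(toy$Sex)

# Ordinal variable: ordered factor (order of labels is an assumption)
toy$Pclass <- factor(toy$Pclass, levels = c(1, 2, 3), ordered = TRUE)

str(toy)
```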
Read up on indexing.
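As a starting point for indexing, here is a short sketch in base R. The toy data frame uses the first six values of Pclass, Sex, and Age from the glimpse output; the object names are only for illustration.

```r
toy <- data.frame(
  Pclass = c(3, 1, 3, 1, 3, 3),
  Sex    = c("male", "female", "female", "female", "male", "male"),
  Age    = c(22, 38, 26, 35, 35, NA)
)

ages         <- toy$Age                       # one column as a vector
second_row   <- toy[2, ]                      # one row, all columns
first_class  <- toy[toy$Pclass == 1, ]        # rows where a condition is TRUE
female_ages  <- toy$Age[toy$Sex == "female"]  # filter one column by another
missing_rows <- which(is.na(toy$Age))         # positions of missing Age values

# A comparison with NA returns NA, so wrap it in which() to keep only TRUE rows
over_30 <- toy[which(toy$Age > 30), ]
```

The same logical indexing idea, x[is.na(x)], is what makes it possible to overwrite only the missing values of a variable.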
vis_dat(df) # from the visdat package
vis_miss(df) # from the visdat package
About 20% of the Age values are missing (177 of 891), which accounts for about 1.7% of all cells in the dataset.
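The two percentages can be cross-checked in base R: one is the share of missing values within a column, the other is the share of missing cells in the whole data frame. A small sketch on a toy data frame (first ten values of Age and Pclass from the glimpse output):

```r
toy <- data.frame(
  Age    = c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14),
  Pclass = c(3, 1, 3, 1, 3, 3, 1, 3, 3, 2)
)

miss_count   <- colSums(is.na(toy))         # number of NAs per column
miss_by_col  <- colMeans(is.na(toy)) * 100  # % missing within each column
miss_overall <- mean(is.na(toy)) * 100      # % missing over all cells

miss_count
miss_by_col
miss_overall
```

Note that is.na() only counts NA; empty strings such as the "" entries in Cabin are not treated as missing.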
I can drop all observations with missing Age values (df_drop)
OR
I can impute all missing Age values (df_imputed).
Drop all rows corresponding to missing Age values.
?na.omit
df_drop <- na.omit(df) # drops every row that has an NA in any column
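Keep in mind that na.omit() looks at every column, not just Age. The sketch below shows the difference on a toy data frame; the NA in Fare is hypothetical and is there only to make the two approaches give different results.

```r
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.28, NA, 53.10)  # the NA here is hypothetical
)

drop_any_na <- na.omit(toy)            # drops rows with an NA in any column
drop_age_na <- toy[!is.na(toy$Age), ]  # drops only rows where Age is NA

nrow(drop_any_na)  # 2
nrow(drop_age_na)  # 3
```

When Age is the only column with NA values, both approaches keep the same rows.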
Now I will impute the missing values with the mean of the same variable (Age).
# replace missing values of age with the mean/median
######### STEP 1: FIND THE MEAN/MEDIAN OF THE AGE VARIABLE
?mean()
mean(df$Age, na.rm = TRUE)
## [1] 29.69912
median(df$Age, na.rm = TRUE)
## [1] 28
?is.na
######### STEP 2: DUPLICATE YOUR ORIGINAL DATA (I do not want to override the raw data)
df_imputed <- df # duplicate the original data
######### STEP 3: CHANGE THE VARIABLE VALUES HERE
describe(df_imputed$Age) # mean before imputing
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 714 29.7 14.53 28 29.27 13.34 0.42 80 79.58 0.39 0.16 0.54
# REPLACING THE MISSING AGE VALUES WITH THE MEAN OF THE AGE VARIABLE
df_imputed$Age[is.na(df_imputed$Age)] <- mean(df_imputed$Age,
                                              na.rm = TRUE
)
describe(df_imputed$Age) # mean after imputing
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.7 13 29.7 29.25 9.34 0.42 80 79.58 0.43 0.95 0.44
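The two describe() outputs show what mean imputation does: the mean stays the same, but the standard deviation shrinks because every filled-in value sits exactly at the mean. Here is a small sketch of that effect, with median imputation as the alternative. The helper impute_with() is hypothetical and written only for this illustration; the toy vector is the first ten Age values from the glimpse output.

```r
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

# Hypothetical helper: fill NAs with a summary of the observed values
impute_with <- function(x, fun = mean) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

age_mean   <- impute_with(age, mean)
age_median <- impute_with(age, median)

# The mean is unchanged by mean imputation, but the spread shrinks
c(before = mean(age, na.rm = TRUE), after = mean(age_mean))
c(before = sd(age, na.rm = TRUE),   after = sd(age_mean))
```

Median imputation follows the same pattern and is less sensitive to extreme values; which one to use is a judgment call for your data.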
Here is my table from stargazer on the clean data. DO NOT RUN summary statistics on the original data.
Split the arguments into different lines within a function.
Align your code too (the = signs and the # comments).
?stargazer
# BASIC COMMAND
## stargazer(df, type = "text") # Age has only 714 observations, while all other variables have 891 observations.
# EMBELLISHED COMMAND
stargazer(df_drop,
          type         = "text",                                  # output format ("html" is another option)
          notes        = "N=891, but age has 177 missing values",
          summary.stat = c("mean", "sd", "min", "max"),
          digits       = 1,                                       # decimal places
          title        = "Titanic Data Summary Statistics"
)
##
## Titanic Data Summary Statistics
## =========================================
## Statistic Mean St. Dev. Min Max
## -----------------------------------------
## PassengerId 448.6 259.1 1 891
## Survived 0.4 0.5 0 1
## Pclass 2.2 0.8 1 3
## Age 29.7 14.5 0.4 80.0
## SibSp 0.5 0.9 0 5
## Parch 0.4 0.9 0 6
## Fare 34.7 52.9 0.0 512.3
## -----------------------------------------
## N=891, but age has 177 missing values
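In the original data (df), Age has only 714 observations, while all other variables have 891. The table above is built from df_drop, so it uses only the rows that remain after na.omit().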
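As a cross-check on the stargazer table, the same four statistics can be computed in base R. This is a sketch on a toy data frame (first five rows of Age, Fare, and Sex from the glimpse output), not a replacement for the table above.

```r
toy <- data.frame(
  Age  = c(22, 38, 26, 35, 35),
  Fare = c(7.25, 71.2833, 7.925, 53.1, 8.05),
  Sex  = c("male", "female", "female", "female", "male")
)

# Keep only the numeric columns, as stargazer does for summary tables
num_cols <- toy[sapply(toy, is.numeric)]

# One row per variable, one column per statistic
stats <- t(sapply(num_cols, function(x) {
  c(Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))
}))

round(stats, 1)
```

If a column still contains NA values, add na.rm = TRUE to each function, otherwise the result for that column is NA.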
DESCRIBE YOUR OBSERVATIONS HERE…
Simple code in base R to create charts.
See some examples of how to create charts in base R here.
?boxplot
# Layout to split the screen
layout(mat = matrix(c(1, 2), 2, 1, byrow = TRUE), heights = c(1, 8))
# Draw the boxplot and the histogram
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$Age,
        horizontal = TRUE,
        ylim       = c(0, 100),
        xaxt       = "n",
        col        = rgb(0.8, 0.8, 0, 0.5),
        frame      = FALSE
)
par(mar=c(4, 3.1, 1.1, 2.1))
?hist
hist(df$Age,
     breaks = 10,
     col    = rgb(0.2, 0.8, 0.5, 0.5),
     border = FALSE,
     main   = "",
     xlab   = "Age",
     xlim   = c(0, 100)
)
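layout() and par() change the settings of the graphics device, so later plots keep the split screen and the narrow margins unless you reset them. Here is a sketch of the same boxplot and histogram pair with the settings restored at the end. It uses a toy Age vector (first ten values from the glimpse output) and draws to an off-screen device, which is an assumption made so the example runs anywhere; drop the pdf(NULL) and dev.off() lines to see the plot.

```r
age <- c(22, 38, 26, 35, 35, NA, 54, 2, 27, 14)

pdf(NULL)               # off-screen device, no file is written
old_mar <- par("mar")   # remember the current margins

# Two panels stacked vertically: a thin boxplot above a tall histogram
layout(matrix(c(1, 2), nrow = 2), heights = c(1, 8))

par(mar = c(0, 3.1, 1.1, 2.1))
b <- boxplot(age, horizontal = TRUE, ylim = c(0, 100), xaxt = "n", frame = FALSE)

par(mar = c(4, 3.1, 1.1, 2.1))
h <- hist(age, breaks = 10, main = "", xlab = "Age", xlim = c(0, 100))

# Restore the margins and go back to a single panel
par(mar = old_mar)
layout(1)
dev.off()
```

Both boxplot() and hist() ignore NA values, so the missing Age entry does not need to be removed first.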