Read this help file for R Markdown.
Clear all environments.
Installing and loading all the libraries. Make sure you have the libraries installed.
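A minimal sketch of this setup step. The package list is an assumption inferred from the functions used later in this document (for example, describe() is assumed to come from psych), and the helper name missing_packages is hypothetical. The sketch only reports what is missing and does not install anything automatically.

```r
# Packages assumed from the functions used later in this document.
pkgs <- c("tidyverse", "visdat", "psych", "stargazer", "reshape2")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

if (length(to_install) > 0) {
  # Install these once from the console, e.g. install.packages(to_install)
  message("Not installed yet: ", paste(to_install, collapse = ", "))
}

# Load every package from the list that is available.
for (p in setdiff(pkgs, to_install)) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Keeping install.packages() out of the knitted document avoids reinstalling packages on every knit.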
Now, I will import my data.
Make sure you comment out, exclude, or simply do not use the View(train) command. View() opens an interactive data viewer, which can cause knitting to fail.
df <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/share/Day 2/train.csv")
## Data has been imported correctly -
# head(df) # first 6 rows of the data
# tail(df) # last 6 rows of the data
# str(df)
glimpse(df) # from tidyverse package
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
| Variable | Type | Level of Measurement |
|-----------------------------------|--------------|----------------------|
| Passenger ID | Qualitative | Nominal |
| Survived\*\* | Qualitative | Nominal |
| Passenger class\*\* | Qualitative | Ordinal |
| Name | Qualitative | Nominal |
| Sex\*\* | Qualitative | Nominal |
| Age | Quantitative | Ratio |
| Number of Siblings/Spouses Aboard | Quantitative | Ratio |
| Number of Parents/Children Aboard | Quantitative | Ratio |
| Ticket | Qualitative | Nominal |
| Fare | Quantitative | Ratio |
| Cabin | Qualitative | ?? |
| Embarked | Qualitative | Nominal |
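One possible way to encode these levels of measurement in R, sketched on a toy data frame. The values are hypothetical and only the column names follow the Titanic data. Nominal variables become unordered factors, the ordinal variable becomes an ordered factor, and the ratio variable stays numeric.

```r
# Toy data frame (hypothetical values) with Titanic-style column names.
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 2, 3),
  Sex      = c("male", "female", "female", "male"),
  Age      = c(22, 38, 26, 35)
)

# Nominal variables: unordered factors.
toy$Survived <- factor(toy$Survived,
                       levels = c(0, 1),
                       labels = c("No", "Yes"))
toy$Sex      <- factor(toy$Sex)

# Ordinal variable: ordered factor, ranked here as 3rd < 2nd < 1st.
toy$Pclass <- factor(toy$Pclass,
                     levels  = c(3, 2, 1),
                     labels  = c("3rd", "2nd", "1st"),
                     ordered = TRUE)

# Ratio variable: Age stays numeric.
str(toy)
```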
Read up on indexing.
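As a quick reference, here is a small sketch of the most common ways to index a data frame in base R. The toy data frame and its values are hypothetical.

```r
# Toy data frame (hypothetical values).
toy <- data.frame(
  PassengerId = 1:5,
  Pclass      = c(3, 1, 3, 1, 3),
  Age         = c(22, 38, 26, 35, NA)
)

cell        <- toy[2, 3]                 # one cell: row 2, column 3
some_rows   <- toy[2:4, ]                # rows 2 to 4, all columns
some_cols   <- toy[, c(1, 3)]            # all rows, columns 1 and 3
by_name     <- toy[, c("Pclass", "Age")] # columns selected by name
one_column  <- toy$Age                   # a single column as a vector
first_class <- toy[toy$Pclass == 1, ]    # rows that satisfy a condition
n_missing   <- sum(is.na(toy$Age))       # count of missing Age values
```

The general pattern is data[rows, columns]. Leaving either side of the comma empty means "all".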
vis_dat(df) # from the visdat package
vis_miss(df) # from the visdat package
Age is missing for about 20% of the passengers (177 of 891 rows), which accounts for about 1.7% of all cells in the dataset (891 rows x 12 columns).
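These two percentages can be reproduced without a plot. The sketch below uses a toy data frame with hypothetical values, then repeats the arithmetic for the counts reported in this document (177 missing ages, 891 rows, 12 columns).

```r
# Toy data frame (hypothetical values): 2 of 5 ages are missing.
toy <- data.frame(
  Age  = c(22, NA, 26, 35, NA),
  Fare = c(7.25, 71.28, 7.92, 53.10, 8.05)
)

missing_per_column <- colMeans(is.na(toy)) # share of missing values in each column
missing_overall    <- mean(is.na(toy))     # share of missing cells in the whole table

missing_per_column
missing_overall

# Same arithmetic with the counts reported for the Titanic data.
share_of_age   <- 177 / 891        # share of passengers with a missing Age
share_of_cells <- 177 / (891 * 12) # share of all cells that are missing
```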
I can drop all observations corresponding to missing Age values (df_drop)
OR
I can impute all missing Age values (df_imputed).
Drop all rows corresponding to missing Age values.
?na.omit
df_drop <- na.omit(df) # drops rows with an NA in ANY column; in this data only Age has NAs
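Note that na.omit() removes a row if any column is missing, not only Age. The sketch below, on a toy data frame with hypothetical values, contrasts it with dropping rows based on Age alone, which may be preferable when other columns also contain NA.

```r
# Toy data frame (hypothetical values): one NA in Age, one NA in Fare.
toy <- data.frame(
  Age  = c(22, NA, 26, 35),
  Fare = c(7.25, 71.28, NA, 53.10)
)

drop_any <- na.omit(toy)           # removes rows 2 and 3 (NA in any column)
drop_age <- toy[!is.na(toy$Age), ] # removes only row 2 (NA in Age)

nrow(drop_any)
nrow(drop_age)
```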
Now I will impute the missing values with the mean of the same variable (Age).
# replace missing values of age with the mean/median
######### STEP 1: FIND THE MEAN/MEDIAN OF THE AGE VARIABLE
?mean()
mean(df$Age, na.rm = TRUE)
## [1] 29.69912
median(df$Age, na.rm = TRUE)
## [1] 28
?is.na
######### STEP 2: DUPLICATE YOUR ORIGINAL DATA (I do not want to override the raw data)
df_imputed <- df # duplicate the original data
######### STEP 3: CHANGE THE VARIABLE VALUES HERE
describe(df_imputed$Age) # mean before imputing
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 714 29.7 14.53 28 29.27 13.34 0.42 80 79.58 0.39 0.16 0.54
# REPLACING THE MISSING AGE VALUES WITH THE MEAN OF AGE VARIABLE
df_imputed$Age[is.na(df_imputed$Age)] <- mean(df_imputed$Age,
na.rm = TRUE
)
describe(df_imputed$Age) # mean after imputing
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 891 29.7 13 29.7 29.25 9.34 0.42 80 79.58 0.43 0.95 0.44
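The two describe() outputs above show the typical side effect of mean imputation: the mean stays the same, but the standard deviation shrinks because every imputed value sits exactly at the mean. A minimal sketch on a toy vector (hypothetical values) shows the same pattern, with median imputation included for comparison.

```r
# Toy ages (hypothetical values) with two missing entries.
age <- c(22, 38, 26, NA, 35, NA, 54)

# Mean imputation.
age_mean <- age
age_mean[is.na(age_mean)] <- mean(age, na.rm = TRUE)

# Median imputation, as an alternative.
age_median <- age
age_median[is.na(age_median)] <- median(age, na.rm = TRUE)

mean(age, na.rm = TRUE) # mean before imputing
mean(age_mean)          # mean after imputing: unchanged
sd(age, na.rm = TRUE)   # spread before imputing
sd(age_mean)            # spread after imputing: smaller
```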
Here is my table from stargazer on the clean data. DO NOT RUN summary statistics on the original data.
Split the arguments into different lines within a function.
Align your code too: the = signs, the commas, the # comments, and the closing parenthesis.
?stargazer
# BASIC COMMAND
## stargazer(df, type = "text") # Age has 714 observations only, while all other variables have 891 observations.
# EMBELLISHED COMMAND
stargazer(df_drop,
type = "text", # output format - "html"
notes = "N=891, but age has 177 missing values",
summary.stat = c("mean","sd","min", "max"),
digits = 1, # decimal places
title = "Titanic Data Summary Statistics"
)
##
## Titanic Data Summary Statistics
## =========================================
## Statistic Mean St. Dev. Min Max
## -----------------------------------------
## PassengerId 448.6 259.1 1 891
## Survived 0.4 0.5 0 1
## Pclass 2.2 0.8 1 3
## Age 29.7 14.5 0.4 80.0
## SibSp 0.5 0.9 0 5
## Parch 0.4 0.9 0 6
## Fare 34.7 52.9 0.0 512.3
## -----------------------------------------
## N=891, but age has 177 missing values
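In the original data, Age has only 714 observations, while all other variables have 891. The table above is built from df_drop, so every statistic in it is computed on the 714 rows with a known Age.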
DESCRIBE YOUR OBSERVATIONS HERE…
Simple code in base R to create charts.
See some example code in base R for creating charts here.
?boxplot
# Layout to split the screen
layout(mat = matrix(c(1,2),2,1, byrow=TRUE), heights = c(1,8))
# Draw the boxplot and the histogram
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(df$Age ,
horizontal = TRUE,
ylim = c(0, 100),
xaxt = "n" ,
col = rgb(0.8, 0.8, 0,0.5) ,
frame = FALSE
)
par(mar=c(4, 3.1, 1.1, 2.1))
?hist
hist(df$Age ,
breaks = 10 ,
col = rgb(0.2,0.8,0.5,0.5) ,
border = FALSE ,
main = "" ,
xlab = "Age",
xlim = c(0,100)
)
Make sure to install the ggplot2 package.
Watch some videos on ggplot2. Grammar of Graphics description, Basic Syntax Introduction
Then, first test the ggplot2 commands on the mpg data.
data() # list the datasets available in the loaded packages (mpg comes with ggplot2)
?mpg # open help file to see variables name in the data
dim(mpg) # dimensions (rows and columns)
## [1] 234 11
colnames(mpg) # print out the column names
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
Note for histograms and boxplots you need only one variable.
Note for scatterplot (below) you need two variables.
ggplot(data = mpg,
mapping = aes(x = cty,
y = hwy)
) + geom_point()
You can create subplots - split the data by a variable !
colnames(mpg) # print out the column names
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
ggplot(data = mpg,
mapping = aes(x = cty,
y = hwy)
) + geom_point() + facet_grid(vars(year))
Use the titanic data in Day 2 Dropbox folder (see my import command).
df <- read.csv("~/Library/CloudStorage/Dropbox/WCAS/Summer/Data Analysis/share/Day 2/train.csv")
?ggplot
ggplot(data = df, # change df to whatever you call your data frame name
mapping = aes(x = SibSp) # choose your numeric variable
) + geom_bar() # play around with option to create different graphs
# shorter code for the same graph above
ggplot(mapping = aes(x = df$SibSp)
) + geom_bar()
Try to fix the labels of the chart.
?facet_grid
ggplot(data = df, # change df to whatever you call your data frame name
mapping = aes(x = SibSp) # choose your numeric variable
) + geom_bar() + facet_grid(rows = vars(Pclass)) # one row of panels per passenger class
# A set of variables or expressions quoted by vars() and defining faceting groups on the rows
We will have to transform our entire data into “long form” and use the facet option to see the distributions of variables in our data.
We will have to install the reshape2 package and use the melt function in it to get our data into long format.
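To see what “long format” means before running melt, here is a small base R sketch. It uses stack() as a stand-in, not the melt function itself, and the toy values are hypothetical. Each numeric column is stacked into one value column, with a second column recording which variable each value came from.

```r
# Toy wide data (hypothetical values): one column per variable.
toy <- data.frame(
  Age  = c(26, 35, NA),
  Fare = c(7.925, 53.1, 8.05)
)

# stack() returns two columns: values and ind (the source column name).
toy_long <- stack(toy)

# Rename to match the column names that melt produces.
names(toy_long) <- c("value", "variable")

toy_long
```

The wide table has 3 rows and 2 columns. The long table has 6 rows, one per original cell.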
str(df)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
head(df)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 S
## 6 330877 8.4583 Q
tail(df)
## PassengerId Survived Pclass Name Sex
## 886 886 0 3 Rice, Mrs. William (Margaret Norton) female
## 887 887 0 2 Montvila, Rev. Juozas male
## 888 888 1 1 Graham, Miss. Margaret Edith female
## 889 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female
## 890 890 1 1 Behr, Mr. Karl Howell male
## 891 891 0 3 Dooley, Mr. Patrick male
## Age SibSp Parch Ticket Fare Cabin Embarked
## 886 39 0 5 382652 29.125 Q
## 887 27 0 0 211536 13.000 S
## 888 19 0 0 112053 30.000 B42 S
## 889 NA 1 2 W./C. 6607 23.450 S
## 890 26 0 0 111369 30.000 C148 C
## 891 32 0 0 370376 7.750 Q
?melt
df_melted <- melt(df)
## Using Name, Sex, Ticket, Cabin, Embarked as id variables
# df_melted <- reshape2::melt(df) ## equivalent command - I am just specifying the package name too
Data is in long format now.
str(df_melted)
## 'data.frame': 6237 obs. of 7 variables:
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked: chr "S" "C" "S" "S" ...
## $ variable: Factor w/ 7 levels "PassengerId",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ value : num 1 2 3 4 5 6 7 8 9 10 ...
head(df_melted)
## Name Sex Ticket
## 1 Braund, Mr. Owen Harris male A/5 21171
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female PC 17599
## 3 Heikkinen, Miss. Laina female STON/O2. 3101282
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 113803
## 5 Allen, Mr. William Henry male 373450
## 6 Moran, Mr. James male 330877
## Cabin Embarked variable value
## 1 S PassengerId 1
## 2 C85 C PassengerId 2
## 3 S PassengerId 3
## 4 C123 S PassengerId 4
## 5 S PassengerId 5
## 6 Q PassengerId 6
tail(df_melted)
## Name Sex Ticket Cabin Embarked
## 6232 Rice, Mrs. William (Margaret Norton) female 382652 Q
## 6233 Montvila, Rev. Juozas male 211536 S
## 6234 Graham, Miss. Margaret Edith female 112053 B42 S
## 6235 Johnston, Miss. Catherine Helen "Carrie" female W./C. 6607 S
## 6236 Behr, Mr. Karl Howell male 111369 C148 C
## 6237 Dooley, Mr. Patrick male 370376 Q
## variable value
## 6232 Fare 29.125
## 6233 Fare 13.000
## 6234 Fare 30.000
## 6235 Fare 23.450
## 6236 Fare 30.000
## 6237 Fare 7.750
The first command works, but the panels are hard to read because they all share one x-axis scale.
ggplot(data = df_melted,
aes(x = value)
) +
geom_histogram() +
facet_wrap(facets = . ~ variable) # facet_wrap(. ~ variable) just means one panel for each level of variable
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
Just add one more option to control the scale and be able to read the data much better.
?facet_wrap
ggplot(data = df_melted,
aes(x = value)
) +
geom_histogram() +
facet_wrap(facets = ~ variable,
scales = "free_x") # let your x axis vary for every subplot
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
Of course, you can clean up x axis and y axis labels such as “value” and “count” above too. Try it !
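One way to do this is with labs(). The sketch below is a starting point only. It assumes ggplot2 is installed and uses a toy long-format data frame with hypothetical values, so swap in df_melted and your own label text.

```r
library(ggplot2)

# Toy long-format data (hypothetical values) shaped like df_melted.
toy_long <- data.frame(
  variable = rep(c("Age", "Fare"), each = 4),
  value    = c(22, 38, 26, 35, 7.25, 71.28, 7.92, 53.10)
)

p <- ggplot(data    = toy_long,
            mapping = aes(x = value)
            ) +
  geom_histogram(bins = 5) +            # set bins explicitly to avoid the bins = 30 message
  facet_wrap(facets = ~ variable,
             scales = "free_x") +       # each panel gets its own x axis
  labs(title = "Toy histograms",
       x     = "Value of each variable", # replaces the default "value"
       y     = "Number of passengers"    # replaces the default "count"
       )

p # print the plot
```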
Data values remain the same.
dim(df)
## [1] 891 12
glimpse(df)
## Rows: 891
## Columns: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male", "mal…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6", "C…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…
df_subset <- df[3:8, c(1, 3, 6, 10)] # rows 3 to 8; columns PassengerId, Pclass, Age, Fare
df_subset
## PassengerId Pclass Age Fare
## 3 3 3 26 7.9250
## 4 4 1 35 53.1000
## 5 5 3 35 8.0500
## 6 6 3 NA 8.4583
## 7 7 1 54 51.8625
## 8 8 3 2 21.0750
melted_df_subset <- melt(data = df_subset)
## No id variables; using all as measure variables
melted_df_subset
## variable value
## 1 PassengerId 3.0000
## 2 PassengerId 4.0000
## 3 PassengerId 5.0000
## 4 PassengerId 6.0000
## 5 PassengerId 7.0000
## 6 PassengerId 8.0000
## 7 Pclass 3.0000
## 8 Pclass 1.0000
## 9 Pclass 3.0000
## 10 Pclass 3.0000
## 11 Pclass 1.0000
## 12 Pclass 3.0000
## 13 Age 26.0000
## 14 Age 35.0000
## 15 Age 35.0000
## 16 Age NA
## 17 Age 54.0000
## 18 Age 2.0000
## 19 Fare 7.9250
## 20 Fare 53.1000
## 21 Fare 8.0500
## 22 Fare 8.4583
## 23 Fare 51.8625
## 24 Fare 21.0750
Play with options.
ggplot(data = df_melted,
aes(x = value)
) +
geom_histogram() +
facet_wrap(facets = . ~ variable) + labs(title = "Titanic Dataset Histograms",
y = "",
x ="Variables"
) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(data = df_melted,
aes(x = value)
) +
geom_boxplot() +
facet_wrap(facets = . ~ variable) + labs(title = "Titanic Dataset Boxplots",
y = "",
x ="Variables"
) + theme_gray()
## Warning: Removed 177 rows containing non-finite outside the scale range
## (`stat_boxplot()`).