Welcome to the third episode of our R from Zero to Hero journey. In this section we will get to know our data and see whether anything obvious stands out just by looking at it. For anything more complex, we will rely on ML a bit later. But for now: what sort of data do we have, and what is special about it? This part of the machine learning pipeline is similar to the descriptive analysis done in the early stage of a statistical analysis.
In the previous episode, we learned how to import and export data. Today we will start by loading our insider-trades dataset:
my_dataframe_rds <- readRDS('data/my_dataframe_rds.rds')
print(head(my_dataframe_rds))
Perfect, the dataset is intact, so we can start exploring. R has several built-in functions that give us an idea of the dataset pretty quickly:
# number of rows
nrow(my_dataframe_rds)
[1] 353
# number of columns
ncol(my_dataframe_rds)
[1] 7
# dimensionality
dim(my_dataframe_rds)
[1] 353 7
# column names
names(my_dataframe_rds)
[1] "name" "market_cap" "gics_setor"
[4] "country" "insider_buys_usd_1mo" "insider_buys_volume_1mo"
[7] "X"
Another function that I find easy and useful is str. Let's take a look:
str(my_dataframe_rds)
'data.frame': 353 obs. of 7 variables:
$ name : chr "3I GROUP PLC" "AB IGNITIS GRUPE" "ADDLIFE AB-B" "AFRY AB" ...
$ market_cap : num 3.57e+10 1.41e+09 1.30e+09 1.99e+09 6.06e+09 ...
$ gics_setor : chr "Financials" "Utilities" "Health Care" "Industrials" ...
$ country : chr "United Kingdom" "Lithuania" "Sweden" "Sweden" ...
$ insider_buys_usd_1mo : num 304737 9776 106983 87888 1248654 ...
$ insider_buys_volume_1mo: num 0.03 0.22 0.65 0.17 0.15 0.1 0.03 0.05 0.08 1.75 ...
$ X : logi NA NA NA NA NA NA ...
str gives us an overview of the number of observations and variables, as well as each variable's type and a preview of its values. I use it hundreds of times a day.
Now let's jump to the next family of functions for exploratory analysis. The first one is summary. This function gives us the minimum, maximum, mean, and quartiles of each numeric column, plus a count of its missing values. For non-numeric columns it won't help much: it only reports the length, class, and mode.
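If you only need the column types rather than the full str printout, a minimal sketch with sapply can help. The toy data frame below uses made-up values; only the column names mirror our dataset.

```r
# Toy data frame with hypothetical values (not the real insider-trades data)
toy <- data.frame(
  name = c("A CORP", "B PLC", "C AB"),
  market_cap = c(1.2e9, 3.5e10, 2.0e9),
  X = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the numeric columns only
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
print(numeric_cols)
```

Knowing which columns are numeric is handy later, when a function only makes sense for numbers.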
summary(my_dataframe_rds)
     name             market_cap          gics_setor          country
 Length:353         Min.   :1.001e+09   Length:353         Length:353
 Class :character   1st Qu.:1.601e+09   Class :character   Class :character
 Mode  :character   Median :2.977e+09   Mode  :character   Mode  :character
                    Mean   :1.229e+10
                    3rd Qu.:1.006e+10
                    Max.   :2.208e+11
                    NA's   :2
 insider_buys_usd_1mo insider_buys_volume_1mo    X
 Min.   :    6838     Min.   :   0.000        Mode:logical
 1st Qu.:   85215     1st Qu.:   0.020        NA's:353
 Median :  205036     Median :   0.080
 Mean   : 1158610     Mean   :   8.486
 3rd Qu.:  590408     3rd Qu.:   0.210
 Max.   :81152111     Max.   :2820.380
 NA's   :2            NA's   :2
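Since summary says little about character columns, one possible complement is table, which counts how often each value occurs. This is only a sketch on a toy data frame with invented rows; the column names are borrowed from our dataset.

```r
# Toy data frame with hypothetical values (not the real insider-trades data)
toy <- data.frame(
  gics_setor = c("Financials", "Utilities", "Industrials", "Financials", NA),
  country = c("Sweden", "Lithuania", "Sweden", "United Kingdom", NA),
  stringsAsFactors = FALSE
)

# Frequency of each sector, most common first;
# useNA = "ifany" keeps missing values visible in the counts
sector_counts <- sort(table(toy$gics_setor, useNA = "ifany"), decreasing = TRUE)
print(sector_counts)

# Number of distinct, non-missing countries
n_countries <- length(unique(na.omit(toy$country)))
print(n_countries)
```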
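The describe function from the Hmisc package goes further: for each variable it reports the number of non-missing, missing, and distinct values along with the lowest and highest values, and for numeric variables it adds the mean and a wider set of quantiles.
library(Hmisc)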
describe(my_dataframe_rds)
my_dataframe_rds
7 Variables 353 Observations
-------------------------------------------------------------------------------------
name
n missing distinct
350 3 350
lowest : 3I GROUP PLC AB IGNITIS GRUPE ADDLIFE AB-B AFRY AB AGREE REALTY CORP
highest: ZAI LAB LTD-ADR ZHEJIANG DINGLI MACHINERY -A ZHEJIANG NARADA POWER SOUR-A ZIGNAGO VETRO SPA ZWSOFT CO LTD GUANGZHOU-A
-------------------------------------------------------------------------------------
market_cap
n missing distinct Info Mean Gmd .05 .10
351 2 351 1 1.229e+10 1.744e+10 1.122e+09 1.196e+09
.25 .50 .75 .90 .95
1.601e+09 2.977e+09 1.006e+10 2.798e+10 5.727e+10
lowest : 1000900000 1018090000 1018670000 1025820000 1027190000
highest: 1.3625e+11 1.41575e+11 1.65911e+11 1.85403e+11 2.20802e+11
-------------------------------------------------------------------------------------
gics_setor
n missing distinct
351 2 11
lowest : Communication Services Consumer Discretionary Consumer Staples Energy Financials
highest: Industrials Information Technology Materials Real Estate Utilities
-------------------------------------------------------------------------------------
country
n missing distinct
351 2 35
lowest : Australia Austria Belgium Bermuda Canada
highest: Switzerland Thailand Turkey United Kingdom United States
-------------------------------------------------------------------------------------
insider_buys_usd_1mo
n missing distinct Info Mean Gmd .05 .10 .25
351 2 351 1 1158610 1919961 19607 33976 85215
.50 .75 .90 .95
205036 590408 1973687 2866866
lowest : 6837.91 8937.85 9166.68 9775.64 11580.2
highest: 14334200 18870800 40444100 43797500 81152100
-------------------------------------------------------------------------------------
insider_buys_volume_1mo
n missing distinct Info Mean Gmd .05 .10 .25
351 2 80 0.996 8.486 16.82 0.010 0.010 0.020
.50 .75 .90 .95
0.080 0.210 0.690 1.785
lowest : 0 0.01 0.02 0.03 0.04 , highest: 10.78 12.48 14.27 20.88 2820.38
-------------------------------------------------------------------------------------
Variables with all observations missing:
[1] X
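The missing-value counts that describe reports can also be obtained in base R. The sketch below, on a toy data frame with made-up values, counts the NAs per column and drops any column that is entirely missing, as X is in our dataset.

```r
# Toy data frame with hypothetical values (not the real insider-trades data)
toy <- data.frame(
  market_cap = c(1.2e9, NA, 2.0e9, 5.1e9),
  country = c("Sweden", "Lithuania", NA, "Sweden"),
  X = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Columns where every observation is missing
all_missing <- names(toy)[na_per_column == nrow(toy)]
print(all_missing)

# Keep only the columns that have at least one observed value
toy_clean <- toy[, na_per_column < nrow(toy), drop = FALSE]
print(names(toy_clean))
```

The drop = FALSE argument keeps the result a data frame even if only one column survives.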
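For individual statistics, base R has dedicated functions: mean(), median(), and range() for the mean, median, and range, and quantile() for quartiles and percentiles. Since market_cap contains missing values, we pass na.rm = TRUE so that they are ignored.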
range(my_dataframe_rds$market_cap, na.rm = TRUE)
[1] 1000901159 220801584887
quantile(my_dataframe_rds$market_cap, na.rm = TRUE)
0% 25% 50% 75% 100%
1000901159 1601382638 2976572110 10059097262 220801584887
quantile(my_dataframe_rds$market_cap, c(0.1, 0.3, 0.65), na.rm = TRUE)
10% 30% 65%
1195810194 1804607088 6179684461
# variance
var(my_dataframe_rds$market_cap, na.rm = TRUE)
[1] 6.624067e+20
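To see why na.rm = TRUE matters, here is a small sketch on a made-up vector (the values are hypothetical, chosen only to be easy to check by hand). It also computes the standard deviation with sd, which is the square root of the variance and is expressed in the same unit as the data.

```r
# Hypothetical market caps, with one missing value
caps <- c(1.0e9, 1.6e9, 3.0e9, 1.0e10, NA)

# Without na.rm, a single NA makes the result NA
print(mean(caps))

# With na.rm = TRUE, missing values are dropped before computing
mean_cap <- mean(caps, na.rm = TRUE)
median_cap <- median(caps, na.rm = TRUE)
sd_cap <- sd(caps, na.rm = TRUE)

print(mean_cap)
print(median_cap)
print(sd_cap)
```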
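# histogram of market capitalization
hist(my_dataframe_rds$market_cap)
library(magrittr) # for the %>% pipe operator
# drop missing values, estimate the density, and plot it
my_dataframe_rds$market_cap %>% na.omit() %>% density() %>%
  plot(main = 'Density of Market Capitalization')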
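The summary output showed a mean market cap (1.229e+10) far above the median (2.977e+09), which suggests a strongly right-skewed distribution. In that situation, one option is to plot the histogram on a log10 scale so the bulk of the companies is not squeezed into the first bar. The sketch below uses simulated values as a stand-in for market_cap; the simulation is an assumption made for illustration, not a model of the real data.

```r
set.seed(42)

# Hypothetical right-skewed values standing in for market_cap, plus two NAs
caps <- c(10^runif(200, min = 9, max = 11), NA, NA)

# Remove missing values, then move to a log10 scale
caps_clean <- caps[!is.na(caps)]
log_caps <- log10(caps_clean)

# Histogram of the transformed values
hist(log_caps,
     main = "Histogram of log10(Market Capitalization)",
     xlab = "log10(market cap)")
```

On this scale, a value of 9 corresponds to 1 billion and 11 to 100 billion.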