In this lecture we will see how to use visualization, transformation and modeling to explore your data in a systematic way. This task is usually referred to by statisticians as exploratory data analysis, or EDA for short. EDA is an iterative cycle in which you: 1. generate questions about your data; 2. search for answers by visualizing, transforming and modeling your data; 3. use what you learn to refine your questions and/or generate new questions.
During the initial phases of EDA you should feel free to explore and investigate any idea that occurs to you. EDA is a creative process, and the key is to ask your data a large number of questions. Indeed, the ultimate goal of EDA is to develop an understanding of your data, and the best way to do so is to use questions as tools to guide you through the investigation.
There is no general rule about which questions you should ask to guide you through the research. However, two types of questions will always be useful for making discoveries within your data: 1. what type of variation occurs within my variables? 2. what type of covariation occurs between my variables?
Before going through the exploratory analysis of your data you need to load your data into the R environment. You can consider this step as a “pre-processing” phase of the analysis. In the last lecture we saw the main vector types in R (atomic vectors and lists). Here, we will see how to load and handle matrices of data. The most widely used rectangular data structures in R are data frames and tibbles. A data frame is simply a table in which each column can be of a different type (i.e., numeric, character, logical). In a data frame, rows correspond to observations while columns correspond to variables.
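As a quick illustration, here is a toy data frame (the values are invented): three observations (rows) described by a character, a numeric and a logical variable (columns).
# A toy data frame: one row per observation, one column per variable
firms <- data.frame(ticker = c("AAA", "BBB", "CCC"),  # character
                    sales  = c(120.5, 98.0, 301.2),   # numeric
                    listed = c(TRUE, FALSE, TRUE))    # logical
str(firms)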
R comes with several built-in datasets, including the famous “iris” data collected by Anderson and analyzed by Fisher in the 1930s. Another widely used dataset for introducing data analysis in R is the “cars” dataset. You can type “iris” or “cars” to display them.
# Upload the "iris" dataset and "assign" it to a dataframe
data <- iris
The first thing you can do is to check the dimensions of the dataset and the names of the columns.
# Check out the dimension of the data
dim(data)
## [1] 150 5
# Check out the names of the columns
ls(data) # alphabetic order
## [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width"
## [5] "Species"
colnames(data) #dataset order
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
A very useful command is the “summary()” function. “summary()” reports a set of descriptive statistics for each numeric column (minimum, 1st quartile, median, mean, 3rd quartile, maximum).
summary(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The “iris” and “cars” data are the staple of introductory R programming classes. However, these datasets contain only a handful of well-behaved variables and are very different from “real world” data.
Before importing your data, make sure that you set the working directory to the folder that contains your data.
setwd('...')
R has built-in functions to import text data: “read.table()” and “read.csv()”. Depending on the data source, you may want to use the arguments “header = TRUE” (which tells R that the first row contains the column names), “sep = ";"” (which tells R that the fields in your text file are separated by “;”) and “stringsAsFactors = FALSE” (a logical indicating whether strings should be converted to factors or kept as plain character vectors).
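For instance, a call using all three arguments could look like the sketch below (the file name is a placeholder):
dataset <- read.csv("my_data.csv",            # placeholder file name
                    header = TRUE,            # first row contains the column names
                    sep = ";",                # fields are separated by ";"
                    stringsAsFactors = FALSE) # keep strings as plain characters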
An alternative is the “read_csv()” function from the “readr” package, which returns a tibble.
library(readr)
dataset <- read_csv(NULL) # replace NULL with the path to your .csv file
In order to import data from Stata, SAS or SPSS you need to install the “haven” package.
library(haven)
dataset <- read_sav(NULL) # for SPSS data
dataset <- read_sas(NULL, NULL) # for SAS data
dataset <- read_stata(NULL) # for Stata data
In order to import data from Excel you need to install the “readxl” package.
In the following we will import the Compustat data that can be found here. Compustat is a database of financial, statistical and market information on active and inactive companies throughout the world. Here, we will focus on the financial accounts of North American enterprises in the years from 1997 to 2017.
library(readxl)
data <- read_excel("G:\\Il mio Drive\\Econometrics Lab\\Data\\Compustat Data.xlsx")
To see the data that you just imported you can use the “View()” function. Beware that, as the size of your dataset increases, it may take a few seconds to display.
View(data)
When you import “complex” data sources, the first thing you should check is the type of each column in the data frame. You can do so with the “str()” function.
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 266663 obs. of 32 variables:
## $ Global Company Key : num 1004 1004 1004 1004 1004 ...
## $ Data Date : chr "31/05/1998" "31/05/1999" "31/05/2000" "31/05/2001" ...
## $ Data Year - Fiscal : num 1997 1998 1999 2000 2001 ...
## $ Industry Format : chr "INDL" "INDL" "INDL" "INDL" ...
## $ Level of Consolidation - Company Annual Descriptor: chr "C" "C" "C" "C" ...
## $ Population Source : chr "D" "D" "D" "D" ...
## $ Data Format : chr "STD" "STD" "STD" "STD" ...
## $ Ticker Symbol : chr "AIR" "AIR" "AIR" "AIR" ...
## $ ISO Currency Code : chr "USD" "USD" "USD" "USD" ...
## $ Current Assets - Total : num 468 508 511 486 437 ...
## $ Assets - Total : num 671 727 741 702 710 ...
## $ Average Short-Term Borrowings : num 20.7 45.5 94.9 71.9 36.2 ...
## $ Long-Term Debt Due in One Year : num 0.237 0.42 0.429 0.41 2.025 ...
## $ Debt in Current Liabilities - Total : num 0.237 0.42 26.314 13.652 42.525 ...
## $ Long-Term Debt - Total : num 178 181 180 180 218 ...
## $ Earnings Before Interest and Taxes : num 64.72 77.38 70.66 45.79 4.71 ...
## $ Earnings Before Interest : num 79 94.4 89 64.4 27.2 ...
## $ Employees : num 2.7 2.9 2.9 2.5 2.2 2.1 2.3 2.6 3.3 3.9 ...
## $ Current Liabilities - Total : num 149 174 164 125 150 ...
## $ Liabilities - Total : num 370 401 401 362 400 ...
## $ Net Income (Loss) : num 35.7 41.7 35.2 18.5 -58.9 ...
## $ Net Interest Income : num NA NA NA NA NA NA NA NA NA NA ...
## $ Nonperforming Assets - Total : num NA NA NA NA NA NA NA NA NA NA ...
## $ In Process R&D Expense : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sales/Turnover (Net) : num 782 918 1024 874 639 ...
## $ Interest Expense - Total (Financial Services) : num NA NA NA NA NA NA NA NA NA NA ...
## $ Income Taxes - Total : num 15.5 18.11 14.36 1.69 -39.29 ...
## $ Active/Inactive Status Marker : chr "A" "A" "A" "A" ...
## $ Research Co Reason for Deletion : num NA NA NA NA NA NA NA NA NA NA ...
## $ GIC Groups : num 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ GIC Sectors : num 20 20 20 20 20 20 20 20 20 20 ...
## $ Standard Industry Classification Code : num 5080 5080 5080 5080 5080 5080 5080 5080 5080 5080 ...
As you can see, these data are a collection of “character” (chr) and “numeric” (num) vectors. “str()” also reports the dimensions of the dataset (266663 observations of 32 variables) and its classes: “tibble” (tbl_df, tbl) and “data frame” (data.frame).
What happens if we run the “summary()” function on non-numeric data? R just tells you that those columns contain “character” vectors.
Moreover, “summary()” gives you the number of “NA’s” for each numeric column.
summary(data)
## Global Company Key Data Date Data Year - Fiscal
## Min. : 1004 Length:266663 Min. :1997
## 1st Qu.: 17673 Class :character 1st Qu.:2001
## Median : 62110 Mode :character Median :2007
## Mean : 79169 Mean :2007
## 3rd Qu.:145786 3rd Qu.:2012
## Max. :330227 Max. :2017
## NA's :844
## Industry Format Level of Consolidation - Company Annual Descriptor
## Length:266663 Length:266663
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## Population Source Data Format Ticker Symbol
## Length:266663 Length:266663 Length:266663
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## ISO Currency Code Current Assets - Total Assets - Total
## Length:266663 Min. : -7.76 Min. : 0
## Class :character 1st Qu.: 8.79 1st Qu.: 38
## Mode :character Median : 58.86 Median : 309
## Mean : 917.93 Mean : 12063
## 3rd Qu.: 317.01 3rd Qu.: 1731
## Max. :192486.65 Max. :3771200
## NA's :98138 NA's :39030
## Average Short-Term Borrowings Long-Term Debt Due in One Year
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.0 Median : 0.4
## Mean : 198.2 Mean : 302.1
## 3rd Qu.: 0.0 3rd Qu.: 11.1
## Max. :335926.0 Max. :496570.1
## NA's :245632 NA's :51973
## Debt in Current Liabilities - Total Long-Term Debt - Total
## Min. : -2267.1 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0
## Median : 2.5 Median : 16
## Mean : 1219.2 Mean : 2066
## 3rd Qu.: 36.0 3rd Qu.: 300
## Max. :605462.5 Max. :3296298
## NA's :40727 NA's :39524
## Earnings Before Interest and Taxes Earnings Before Interest
## Min. :-80053.00 Min. :-76735.00
## 1st Qu.: -1.97 1st Qu.: -0.86
## Median : 6.17 Median : 11.12
## Mean : 329.49 Mean : 468.21
## 3rd Qu.: 78.07 3rd Qu.: 115.87
## Max. :130622.00 Max. :130622.00
## NA's :67546 NA's :70945
## Employees Current Liabilities - Total Liabilities - Total
## Min. : 0.00 Min. : 0.0 Min. : 0
## 1st Qu.: 0.09 1st Qu.: 4.4 1st Qu.: 12
## Median : 0.48 Median : 27.6 Median : 161
## Mean : 7.98 Mean : 730.4 Mean : 10409
## 3rd Qu.: 3.12 3rd Qu.: 181.2 3rd Qu.: 1135
## Max. :2545.21 Max. :329795.0 Max. :3589783
## NA's :76862 NA's :97225 NA's :39377
## Net Income (Loss) Net Interest Income Nonperforming Assets - Total
## Min. :-99289.00 Min. :-3546.00 Min. : 0.0
## 1st Qu.: -4.68 1st Qu.: 8.21 1st Qu.: 0.3
## Median : 1.59 Median : 25.23 Median : 4.0
## Mean : 140.94 Mean : 656.08 Mean : 426.5
## 3rd Qu.: 34.68 3rd Qu.: 92.98 3rd Qu.: 21.2
## Max. :104821.00 Max. :54652.00 Max. :316713.0
## NA's :65728 NA's :245224 NA's :246675
## In Process R&D Expense Sales/Turnover (Net)
## Min. :-6831.27 Min. :-15009.3
## 1st Qu.: 0.00 1st Qu.: 11.7
## Median : 0.00 Median : 101.1
## Mean : -1.35 Mean : 2394.4
## 3rd Qu.: 0.00 3rd Qu.: 738.9
## Max. : 11.00 Max. :496785.0
## NA's :117161 NA's :65747
## Interest Expense - Total (Financial Services) Income Taxes - Total
## Min. : 0.00 Min. :-45415.00
## 1st Qu.: 6.17 1st Qu.: 0.00
## Median : 15.56 Median : 0.66
## Mean : 617.64 Mean : 70.88
## 3rd Qu.: 53.06 3rd Qu.: 13.54
## Max. :85948.88 Max. : 37162.00
## NA's :250704 NA's :40255
## Active/Inactive Status Marker Research Co Reason for Deletion
## Length:266663 Min. : 1.0
## Class :character 1st Qu.: 1.0
## Mode :character Median : 1.0
## Mean : 3.8
## 3rd Qu.: 7.0
## Max. :20.0
## NA's :150836
## GIC Groups GIC Sectors Standard Industry Classification Code
## Min. :1010 Min. :10.00 Min. : 100
## 1st Qu.:2030 1st Qu.:20.00 1st Qu.:3569
## Median :3520 Median :35.00 Median :6020
## Mean :3326 Mean :33.09 Mean :5084
## 3rd Qu.:4040 3rd Qu.:40.00 3rd Qu.:6722
## Max. :6010 Max. :60.00 Max. :9997
## NA's :31855 NA's :31855
A good way to start exploring your data is to check whether there is anything “strange”. Since you cannot go through all the observations, a good rule of thumb is to look at the “head” and “tail” of your data: the “head()” and “tail()” commands show the first 6 and the last 6 observations, respectively.
head(data)
## # A tibble: 6 x 32
## `Global Company~ `Data Date` `Data Year - Fi~ `Industry Forma~
## <dbl> <chr> <dbl> <chr>
## 1 1004 31/05/1998 1997 INDL
## 2 1004 31/05/1999 1998 INDL
## 3 1004 31/05/2000 1999 INDL
## 4 1004 31/05/2001 2000 INDL
## 5 1004 31/05/2002 2001 INDL
## 6 1004 31/05/2003 2002 INDL
## # ... with 28 more variables: `Level of Consolidation - Company Annual
## # Descriptor` <chr>, `Population Source` <chr>, `Data Format` <chr>,
## # `Ticker Symbol` <chr>, `ISO Currency Code` <chr>, `Current Assets -
## # Total` <dbl>, `Assets - Total` <dbl>, `Average Short-Term
## # Borrowings` <dbl>, `Long-Term Debt Due in One Year` <dbl>, `Debt in
## # Current Liabilities - Total` <dbl>, `Long-Term Debt - Total` <dbl>,
## # `Earnings Before Interest and Taxes` <dbl>, `Earnings Before
## # Interest` <dbl>, Employees <dbl>, `Current Liabilities - Total` <dbl>,
## # `Liabilities - Total` <dbl>, `Net Income (Loss)` <dbl>, `Net Interest
## # Income` <dbl>, `Nonperforming Assets - Total` <dbl>, `In Process R&D
## # Expense` <dbl>, `Sales/Turnover (Net)` <dbl>, `Interest Expense -
## # Total (Financial Services)` <dbl>, `Income Taxes - Total` <dbl>,
## # `Active/Inactive Status Marker` <chr>, `Research Co Reason for
## # Deletion` <dbl>, `GIC Groups` <dbl>, `GIC Sectors` <dbl>, `Standard
## # Industry Classification Code` <dbl>
tail(data)
## # A tibble: 6 x 32
## `Global Company~ `Data Date` `Data Year - Fi~ `Industry Forma~
## <dbl> <chr> <dbl> <chr>
## 1 328795 31/12/2013 2013 INDL
## 2 328795 31/12/2014 2014 INDL
## 3 328795 31/12/2015 2015 INDL
## 4 328795 31/12/2016 2016 INDL
## 5 328795 31/12/2017 2017 INDL
## 6 330227 30/09/2017 2017 INDL
## # ... with 28 more variables: `Level of Consolidation - Company Annual
## # Descriptor` <chr>, `Population Source` <chr>, `Data Format` <chr>,
## # `Ticker Symbol` <chr>, `ISO Currency Code` <chr>, `Current Assets -
## # Total` <dbl>, `Assets - Total` <dbl>, `Average Short-Term
## # Borrowings` <dbl>, `Long-Term Debt Due in One Year` <dbl>, `Debt in
## # Current Liabilities - Total` <dbl>, `Long-Term Debt - Total` <dbl>,
## # `Earnings Before Interest and Taxes` <dbl>, `Earnings Before
## # Interest` <dbl>, Employees <dbl>, `Current Liabilities - Total` <dbl>,
## # `Liabilities - Total` <dbl>, `Net Income (Loss)` <dbl>, `Net Interest
## # Income` <dbl>, `Nonperforming Assets - Total` <dbl>, `In Process R&D
## # Expense` <dbl>, `Sales/Turnover (Net)` <dbl>, `Interest Expense -
## # Total (Financial Services)` <dbl>, `Income Taxes - Total` <dbl>,
## # `Active/Inactive Status Marker` <chr>, `Research Co Reason for
## # Deletion` <dbl>, `GIC Groups` <dbl>, `GIC Sectors` <dbl>, `Standard
## # Industry Classification Code` <dbl>
You can subset your data in two ways: 1. get a subset of columns (variables); 2. get a subset of rows (observations).
In order to extract a single column from your dataset you can proceed in three ways: (i) extract the column by its position, or (ii-iii) extract the column by its name. Keep in mind that R does not print the whole column to the console; only a limited number of values are shown.
colnames(data)
## [1] "Global Company Key"
## [2] "Data Date"
## [3] "Data Year - Fiscal"
## [4] "Industry Format"
## [5] "Level of Consolidation - Company Annual Descriptor"
## [6] "Population Source"
## [7] "Data Format"
## [8] "Ticker Symbol"
## [9] "ISO Currency Code"
## [10] "Current Assets - Total"
## [11] "Assets - Total"
## [12] "Average Short-Term Borrowings"
## [13] "Long-Term Debt Due in One Year"
## [14] "Debt in Current Liabilities - Total"
## [15] "Long-Term Debt - Total"
## [16] "Earnings Before Interest and Taxes"
## [17] "Earnings Before Interest"
## [18] "Employees"
## [19] "Current Liabilities - Total"
## [20] "Liabilities - Total"
## [21] "Net Income (Loss)"
## [22] "Net Interest Income"
## [23] "Nonperforming Assets - Total"
## [24] "In Process R&D Expense"
## [25] "Sales/Turnover (Net)"
## [26] "Interest Expense - Total (Financial Services)"
## [27] "Income Taxes - Total"
## [28] "Active/Inactive Status Marker"
## [29] "Research Co Reason for Deletion"
## [30] "GIC Groups"
## [31] "GIC Sectors"
## [32] "Standard Industry Classification Code"
head(data[,15])
## # A tibble: 6 x 1
## `Long-Term Debt - Total`
## <dbl>
## 1 178.
## 2 181.
## 3 180.
## 4 180.
## 5 218.
## 6 165.
head(data$`Long-Term Debt - Total`)
## [1] 177.509 180.939 180.447 179.987 217.699 164.658
head(data[,c("Long-Term Debt - Total")]) # note: values are rounded in the tibble printout
## # A tibble: 6 x 1
## `Long-Term Debt - Total`
## <dbl>
## 1 178.
## 2 181.
## 3 180.
## 4 180.
## 5 218.
## 6 165.
To subset a number of observations, select the corresponding rows as follows:
data[1:10,]
## # A tibble: 10 x 32
## `Global Company~ `Data Date` `Data Year - Fi~ `Industry Forma~
## <dbl> <chr> <dbl> <chr>
## 1 1004 31/05/1998 1997 INDL
## 2 1004 31/05/1999 1998 INDL
## 3 1004 31/05/2000 1999 INDL
## 4 1004 31/05/2001 2000 INDL
## 5 1004 31/05/2002 2001 INDL
## 6 1004 31/05/2003 2002 INDL
## 7 1004 31/05/2004 2003 INDL
## 8 1004 31/05/2005 2004 INDL
## 9 1004 31/05/2006 2005 INDL
## 10 1004 31/05/2007 2006 INDL
## # ... with 28 more variables: `Level of Consolidation - Company Annual
## # Descriptor` <chr>, `Population Source` <chr>, `Data Format` <chr>,
## # `Ticker Symbol` <chr>, `ISO Currency Code` <chr>, `Current Assets -
## # Total` <dbl>, `Assets - Total` <dbl>, `Average Short-Term
## # Borrowings` <dbl>, `Long-Term Debt Due in One Year` <dbl>, `Debt in
## # Current Liabilities - Total` <dbl>, `Long-Term Debt - Total` <dbl>,
## # `Earnings Before Interest and Taxes` <dbl>, `Earnings Before
## # Interest` <dbl>, Employees <dbl>, `Current Liabilities - Total` <dbl>,
## # `Liabilities - Total` <dbl>, `Net Income (Loss)` <dbl>, `Net Interest
## # Income` <dbl>, `Nonperforming Assets - Total` <dbl>, `In Process R&D
## # Expense` <dbl>, `Sales/Turnover (Net)` <dbl>, `Interest Expense -
## # Total (Financial Services)` <dbl>, `Income Taxes - Total` <dbl>,
## # `Active/Inactive Status Marker` <chr>, `Research Co Reason for
## # Deletion` <dbl>, `GIC Groups` <dbl>, `GIC Sectors` <dbl>, `Standard
## # Industry Classification Code` <dbl>
To get a random sample of observations you can use the “sample()” function. This will be very useful when we discuss machine learning algorithms, as you will usually divide your dataset into a training set (on which you build your machine learning algorithm) and a test set (on which you test it).
sample_of_observations <- sample(seq_len(nrow(data)), size = nrow(data)*0.1)
head(data[sample_of_observations,]) #10 percent of obs
## # A tibble: 6 x 32
## `Global Company~ `Data Date` `Data Year - Fi~ `Industry Forma~
## <dbl> <chr> <dbl> <chr>
## 1 117827 30/11/2013 2013 INDL
## 2 157679 31/12/2004 2004 INDL
## 3 13597 31/12/2005 2005 INDL
## 4 23245 31/12/2014 2014 INDL
## 5 128559 31/12/2002 2002 FS
## 6 181790 30/06/2015 2015 INDL
## # ... with 28 more variables: `Level of Consolidation - Company Annual
## # Descriptor` <chr>, `Population Source` <chr>, `Data Format` <chr>,
## # `Ticker Symbol` <chr>, `ISO Currency Code` <chr>, `Current Assets -
## # Total` <dbl>, `Assets - Total` <dbl>, `Average Short-Term
## # Borrowings` <dbl>, `Long-Term Debt Due in One Year` <dbl>, `Debt in
## # Current Liabilities - Total` <dbl>, `Long-Term Debt - Total` <dbl>,
## # `Earnings Before Interest and Taxes` <dbl>, `Earnings Before
## # Interest` <dbl>, Employees <dbl>, `Current Liabilities - Total` <dbl>,
## # `Liabilities - Total` <dbl>, `Net Income (Loss)` <dbl>, `Net Interest
## # Income` <dbl>, `Nonperforming Assets - Total` <dbl>, `In Process R&D
## # Expense` <dbl>, `Sales/Turnover (Net)` <dbl>, `Interest Expense -
## # Total (Financial Services)` <dbl>, `Income Taxes - Total` <dbl>,
## # `Active/Inactive Status Marker` <chr>, `Research Co Reason for
## # Deletion` <dbl>, `GIC Groups` <dbl>, `GIC Sectors` <dbl>, `Standard
## # Industry Classification Code` <dbl>
head(data[-sample_of_observations,]) #90 percent of obs
## # A tibble: 6 x 32
## `Global Company~ `Data Date` `Data Year - Fi~ `Industry Forma~
## <dbl> <chr> <dbl> <chr>
## 1 1004 31/05/1998 1997 INDL
## 2 1004 31/05/1999 1998 INDL
## 3 1004 31/05/2000 1999 INDL
## 4 1004 31/05/2001 2000 INDL
## 5 1004 31/05/2002 2001 INDL
## 6 1004 31/05/2003 2002 INDL
## # ... with 28 more variables: `Level of Consolidation - Company Annual
## # Descriptor` <chr>, `Population Source` <chr>, `Data Format` <chr>,
## # `Ticker Symbol` <chr>, `ISO Currency Code` <chr>, `Current Assets -
## # Total` <dbl>, `Assets - Total` <dbl>, `Average Short-Term
## # Borrowings` <dbl>, `Long-Term Debt Due in One Year` <dbl>, `Debt in
## # Current Liabilities - Total` <dbl>, `Long-Term Debt - Total` <dbl>,
## # `Earnings Before Interest and Taxes` <dbl>, `Earnings Before
## # Interest` <dbl>, Employees <dbl>, `Current Liabilities - Total` <dbl>,
## # `Liabilities - Total` <dbl>, `Net Income (Loss)` <dbl>, `Net Interest
## # Income` <dbl>, `Nonperforming Assets - Total` <dbl>, `In Process R&D
## # Expense` <dbl>, `Sales/Turnover (Net)` <dbl>, `Interest Expense -
## # Total (Financial Services)` <dbl>, `Income Taxes - Total` <dbl>,
## # `Active/Inactive Status Marker` <chr>, `Research Co Reason for
## # Deletion` <dbl>, `GIC Groups` <dbl>, `GIC Sectors` <dbl>, `Standard
## # Industry Classification Code` <dbl>
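As a side note, the same idea gives the training/test split mentioned above; here is a minimal sketch (the 80/20 proportion and the seed are arbitrary choices):
set.seed(123)                    # arbitrary seed, for reproducibility
train_idx <- sample(seq_len(nrow(data)), size = floor(0.8 * nrow(data)))
train_set <- data[train_idx, ]   # 80 percent of the observations
test_set  <- data[-train_idx, ]  # remaining 20 percent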
Moreover, you may want to subset your data based on a condition on one of its variables; for example, you can drop the observations for which “Employees” is missing.
sub_data <- subset(data, !is.na(data$Employees))
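In the same way you can filter on a specific value of a variable; the currency filter below is just an illustration:
usd_data <- subset(data, `ISO Currency Code` == "USD") # keep only observations reported in USD
dim(usd_data)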
Before starting your analysis it is essential to check how the missing values are distributed. There are three categories of missing data: 1. missing completely at random (MCAR); 2. missing at random (MAR); 3. missing not at random (MNAR).
Missing Completely at Random, MCAR, means there is no relationship between the missingness of the data and any values, observed or missing. Those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.
Missing at Random, MAR, means there is a systematic relationship between the propensity of missing values and the observed data. Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, weight is MAR.
Missing Not at Random, MNAR, means there is a relationship between the propensity of a value to be missing and its value. This is the case, for example, when people with the lowest education are missing education values, or when the sickest people are the most likely to drop out of a study.
MNAR is called “non-ignorable” because the missing data mechanism itself has to be modeled as you deal with the missing data. You have to include some model for why the data are missing and what the likely values are.
“Missing Completely at Random” and “Missing at Random” are both considered ‘ignorable’ because we don’t have to include any information about the missing data itself when we deal with the missing data.
Each of these scenarios must be handled differently. For instance, multiple imputation assumes the data are at least missing at random, so the important distinction there is whether the data are MAR as opposed to MNAR. Listwise deletion, however, requires the data to be MCAR in order not to introduce bias into the results.
There is a very useful test for MCAR, Little’s test, which tests the null hypothesis that the missing data are MCAR. A p-value smaller than 0.05 is usually interpreted as evidence that the missing data are not MCAR (i.e., they are either MAR or MNAR).
BaylorEdPsych::LittleMCAR(data) # requires the BaylorEdPsych package to be installed
But like all tests of assumptions, it’s not definitive. So run it, but use it as only one piece of information.
A second technique is to create a dummy variable indicating whether a value is missing:
\[ k = \begin{cases} 1 & \text{if the value is missing,} \\ 0 & \text{if the value is observed.} \end{cases} \]
You can then run t-tests and chi-square tests between this variable and other variables in the data set to see if the missingness on this variable is related to the values of other variables.
For example, if women really are less likely to tell you their weight than men, a chi-square test will tell you that the percentage of missing data on the weight variable is higher for women than men.
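A minimal sketch of this idea on the Compustat data (the choice of variables is illustrative, and we work on a copy so that “data” is left untouched):
miss_check <- data
miss_check$employees_missing <- ifelse(is.na(miss_check$Employees), 1, 0) # 1 = missing, 0 = observed
# Chi-square test: is missingness related to a categorical variable?
chisq.test(table(miss_check$employees_missing, miss_check$`Industry Format`))
# t-test: do missing and non-missing groups differ on a numeric variable?
t.test(miss_check$`Assets - Total` ~ miss_check$employees_missing)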
The only true way to distinguish between MNAR and MAR is to measure some of that missing data. It is common practice among professional surveyors to, for example, follow up on a paper survey with phone calls to a group of the non-respondents and ask a few key survey items. This allows you to compare respondents to non-respondents.
If their responses on those key items differ by very much, that’s good evidence that the data are MNAR.
However, in most missing-data situations we don’t have the luxury of getting hold of the missing data. So, while we can’t test the mechanism directly, we can examine patterns in the data to get an idea of the most likely mechanism.
The Amelia package, developed by Harvard professor Gary King and his colleagues, is probably the best tool to perform such an analysis.
library(Amelia)
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.6.1
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.5, built: 2018-05-07)
## ## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
missmap(data[1:100,], main = "Missing values vs Observed")
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
A good idea is not to check only the first observations, but also other slices of the data:
missmap(data[1798:2198,], main = "Missing values vs Observed")
## Warning in if (class(obj) == "amelia") {: the condition has length > 1 and
## only the first element will be used
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'arguments'.
## Warning: Unknown or uninitialised column: 'imputations'.
Three variables turn out to have very high missing-value rates (more than \(90\%\) of missing values).
summary(data$`Net Interest Income`)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -3546.00 8.21 25.23 656.08 92.98 54652.00 245224
length(which(!is.na(data$`Net Interest Income`)))/nrow(data)
## [1] 0.08039736
summary(data$`Nonperforming Assets - Total`)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.3 4.0 426.5 21.2 316713.0 246675
length(which(!is.na(data$`Nonperforming Assets - Total`)))/nrow(data)
## [1] 0.07495603
summary(data$`Interest Expense - Total (Financial Services)`)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.17 15.56 617.64 53.06 85948.88 250704
length(which(!is.na(data$`Interest Expense - Total (Financial Services)`)))/nrow(data)
## [1] 0.05984707
A way to deal with variables that have many missing values is to remove them from your data (using a subsetting operation).
dim(data)
## [1] 266663 32
data <- data[, !names(data) %in% c("Interest Expense - Total (Financial Services)", "Net Interest Income", "Nonperforming Assets - Total")]
dim(data)
## [1] 266663 29
Once we have excluded the three variables with a high number of missing values (probably MNAR), we can remove the remaining missing observations from the dataset by making the explicit assumption that they are MAR. This assumption is quite strong and you should not make it lightly in a real data analysis. You can try different imputation methods from the packages “Amelia” and “mice” (a minimal example follows below).
data_clean <- na.omit(data)
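As an alternative to dropping rows, a minimal imputation sketch with the “mice” package could look as follows (the variable subset, the “pmm” method and the seed are illustrative assumptions, not part of the analysis above):
library(mice)
num_vars <- data.frame(employees = data$Employees,       # a small copy with syntactic names
                       assets    = data$`Assets - Total`,
                       sales     = data$`Sales/Turnover (Net)`)
imp <- mice(num_vars, m = 5, method = "pmm", seed = 123) # predictive mean matching
data_imputed <- complete(imp, 1)                         # extract the first completed dataset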
A good way to start exploring variation in your data is to visualize the distribution of your variables’ values. R comes with built-in plotting functions (e.g., “plot()”, “hist()”). A good alternative for better-looking plots is the “ggplot2” package. Here we will see how to make basic plots in both ways.
How to visualize the distribution of a variable depends on whether the variable is categorical or continuous. A variable is categorical if it can only take a small set of values. In R, categorical variables are usually stored as character vectors or factors. You can use a bar chart to examine their distribution.
library(ggplot2)
ggplot(data = data_clean) +
  geom_bar(mapping = aes(x = `ISO Currency Code`))
The height of each bar shows how many observations occur for each value of \(x\). You can obtain the same counts with the “table()” function.
table(data_clean$`ISO Currency Code`)
##
## CAD USD
## 745 4996
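The same counts can also be drawn with base R by passing the output of “table()” to “barplot()”:
barplot(table(data_clean$`ISO Currency Code`),
        main = "Observations per ISO Currency Code",
        ylab = "Frequency")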
A variable is continuous if it can take any of a large set of ordered values. You can examine the distribution of a continuous variable with a histogram.
# Lets First Compare the Obs in Data v. Data_clean
# Histogram
hist(data_clean$`Data Year - Fiscal`)
hist(data_clean$`Data Year - Fiscal`,
col=2,
main = "Histogram",
sub = "Firms per Year Observations",
xlab = "Year",
ylab = "Frequency")
axis(1, at=1997:2012, labels=c(seq(1997,2012,1)))
ggplot(data = data_clean) +
  geom_histogram(mapping = aes(x = `Data Year - Fiscal`),
                 binwidth = 0.5,
                 fill = "blue") +
  scale_x_discrete(name = "Year", limits = seq(1997, 2012, 1)) +
  scale_y_continuous(name = "Frequency") +
  ggtitle("Firms per Year \n Observations") # \n splits a long title over two lines
table(data_clean$`Data Year - Fiscal`)
##
## 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 97 138 224 156 611 552 527 527 533 477 440 417 371 333 299
## 2012
## 39
To check how two variables covary you can use a scatter plot (or two-way plot).
plot(data_clean$`Assets - Total`, data_clean$`Sales/Turnover (Net)`)
ggplot(data = data_clean,
       mapping = aes(x = `Assets - Total`,
                     y = `Sales/Turnover (Net)`)) +
  geom_point()
To draw multiple scatter plots you can use the “pairs” command.
pairs(~ Employees + `Assets - Total` + `Sales/Turnover (Net)`, data = data_clean)
An easy way to check whether there are outliers in your data is to use a boxplot.
year <- as.factor(data_clean$`Data Year - Fiscal`)
plot(year, data_clean$Employees)
plot(year, data_clean$Employees, varwidth=T)
plot(year, data_clean$Employees, varwidth=T, horizontal=T)
plot(year, data_clean$Employees, varwidth=T, horizontal=T, xlab="Employees",ylab="Years")
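For comparison, a possible ggplot2 version of the same boxplot (the axis labels are my own):
ggplot(data = data_clean,
       mapping = aes(x = factor(`Data Year - Fiscal`), y = Employees)) +
  geom_boxplot() +
  labs(x = "Year", y = "Employees")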
Let’s now focus on the density distribution of employees.
plot(density(data_clean$Employees))
summary(data_clean$Employees)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.018 0.139 2.363 0.591 235.000
plot(density(data_clean$Employees[which(data_clean$Employees<0.591)]))
You may want to compare the empirical distribution of a variable with a set of simulated distributions. A simulated distribution is generated from a statistical model and provides a description of how the data could have been generated.
Before comparing empirical and simulated distributions, we can introduce the normal distribution. You can generate normally distributed data by using the “rnorm” function.
x <- rnorm(1000, mean = 2, sd = 10)
bin <- hist(x,100)
You can explore the empirical cumulative distribution function of \(x\) by running the “ecdf()” function.
ecdf(x)
## Empirical CDF
## Call: ecdf(x)
## x[1:1000] = -31.521, -29.762, -29.448, ..., 30.585, 35.76
plot(ecdf(x))
A plot that is extensively used to compare different distributions is the quantile-quantile (Q-Q) plot (remember that it is not a probability distribution plot).
A point (x, y) on the Q-Q plot corresponds to a quantile of the second distribution (y-coordinate) plotted against the same quantile of the first distribution (x-coordinate); the plot can thus be read as a parametric curve indexed by the quantile level. If the two distributions are similar, the points of the Q-Q plot lie close to the line y = x.
qqnorm(x)
qqline(x, col="red")
x_norm<-(x-mean(x))/sd(x)
qqnorm(x_norm)
abline(0,1)
If you want to test another distribution, you can still use the qq-plot.
x_weibull<-rweibull(n=500,shape=3,scale=1.5)
y_teo<-rweibull(n=500, shape=2.9, scale=1.45)
qqplot(x_weibull,y_teo)
abline(0,1)
#install.packages("fitdistrplus")
library("fitdistrplus")
## Warning: package 'fitdistrplus' was built under R version 3.6.1
## Loading required package: MASS
## Loading required package: survival
## Loading required package: npsurv
## Loading required package: lsei
plotdist(data_clean$Employees, histo = TRUE, demp = TRUE)
Let’s now cut the tail of the distribution to get a “clearer” picture of the density of the variable.
quantile(data_clean$Employees, probs = c(0.05, 0.95))
## 5% 95%
## 0.0 8.3
employees <- data_clean$Employees[which(data_clean$Employees < 8.3)] + 1 # shift by +1 so all values are strictly positive for the fits below
plotdist(employees, histo = TRUE, demp = TRUE)
We can now try to fit a series of theoretical distributions to the empirical distribution of the data. Let’s first get some summary statistics on the min, max, median, mean, standard deviation, skewness (a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean: positive skewness indicates a longer right tail, negative skewness a longer left tail) and kurtosis (a measure of the “tailedness” of the probability distribution of a real-valued random variable).
descdist(employees, boot = 1000)
## summary statistics
## ------
## min: 1 max: 9.221
## median: 1.121
## mean: 1.555673
## estimated sd: 1.172585
## estimated skewness: 3.605735
## estimated kurtosis: 17.51095
You can compare these values with those of a standard normal sample:
x<-rnorm(1000, mean = 0, sd = 1)
descdist(x, boot = 1000)
## summary statistics
## ------
## min: -3.146009 max: 3.147178
## median: -0.07835143
## mean: -0.04105328
## estimated sd: 0.968878
## estimated skewness: 0.06082422
## estimated kurtosis: 2.925351
Clearly, the distribution of the number of employees is not normal. Let’s see how it “visually” compares with a Weibull, a Gamma and a log-normal distribution.
# Weibull
fw <- fitdist(employees, "weibull", method = "mle", lower = c(0, 0))
plot(fw)
# Gamma
fg <- fitdist(employees, "gamma", method = "mle", lower = c(0, 0))
plot(fg)
# log-normal
fln <- fitdist(employees, "lnorm", method = "mle", lower = c(0, 0))
plot(fln)
How about a power law distribution?
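Note that “plcon” is not one of R’s built-in distributions: “fitdist()” looks for density and distribution functions named “dplcon()” and “pplcon()”. One way to provide them (an assumption on my part, since the code above does not show where they come from) is the “poweRlaw” package, which implements the continuous power law:
library(poweRlaw) # assumed source of dplcon() and pplcon() for the continuous power law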
# power-law distribution
plw <- fitdist(employees, "plcon",
start = list(xmin=1,alpha=2),
lower = c(xmin = 0, alpha = 1))
plot(plw)
Now we can make a comparative plot with all the theoretical distributions and the empirical distribution.
par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "lognormal", "gamma", "Power law")
denscomp(list(fw, fln, fg, plw), legendtext = plot.legend)
qqcomp(list(fw, fln, fg, plw), legendtext = plot.legend)
cdfcomp(list(fw, fln, fg, plw), legendtext = plot.legend)
ppcomp(list(fw, fln, fg, plw), legendtext = plot.legend)
The P-P plot (probability-probability plot or percent-percent plot) is a probability plot for assessing how closely two datasets agree: it plots the two cumulative distribution functions against each other.
Kolmogorov-Smirnov, Cramér-von Mises and Anderson-Darling are all goodness-of-fit statistics based on a distance between CDFs (e.g., the Kolmogorov-Smirnov statistic quantifies the distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples). AIC and BIC are classical penalized criteria based on the log-likelihood.
gofstat(list(fw, fln, fg, plw),fitnames = c("weibull", "lognorma", "gamma", "Power law"))
## Goodness-of-fit statistics
## weibull lognorma gamma
## Kolmogorov-Smirnov statistic 0.3357433 0.2480156 0.2667879
## Cramer-von Mises statistic 147.7416083 105.3514542 133.3044502
## Anderson-Darling statistic 759.7990779 569.3733191 694.8111291
## Power law
## Kolmogorov-Smirnov statistic 0.2249882
## Cramer-von Mises statistic 105.7809480
## Anderson-Darling statistic 1374.1499989
##
## Goodness-of-fit criteria
## weibull lognorma gamma Power law
## Akaike's Information Criterion 13700.11 9995.39 12016.61 1234.001
## Bayesian Information Criterion 13713.31 10008.60 12029.82 1247.209