class: center, middle, inverse, title-slide .title[ # Exploratory Data Analysis ] .subtitle[ ## using R Programming ] .author[ ### Allan R. Lao, MBA, DIT ] .date[ ### Department of ICT 2023-10-02 ] --- # Exploratory Data Analysis Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypothesis and to check assumptions with the help of summary statistics and graphical representations  --- #EDA Key Terms Before you begin exploratory data analysis, it's important to understand a few key terms: - **Value**: A data value is a piece of information, such as a number or a date. - **Variable**: A data variable is a characteristic that you can measure, such as weight or income. - **Distribution**: The distribution of a dataset is how the dataset is spread out. You can visualize a dataset's distribution by observing its shape on a graph. - **Outlier**: An outlier is a data value that is significantly different, including much higher or lower, from the rest of a dataset. - **Data model**: A data model is a method of organizing data and relationships between values in a dataset. --- # Know your data The first step to conducting exploratory data analysis is to observe your dataset at a high level. Start by determining the size of your dataset, including how many rows and columns it has. This can help you predict any future issues you might have with your data. --- # The hflights Dataset Includes data on all flights that departed **Houston, TX in 2011** ```r df <- as.data.frame(hflights) str(df) ``` ``` ## 'data.frame': 227496 obs. of 21 variables: ## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ... ## $ Month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ... ## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ... ## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ... ## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ... ## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ... ## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ... ## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ... ## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ... ## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ... ## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ... ## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ... ## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ... ## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ... ## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ... ## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ... ## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ... ## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ... ## $ CancellationCode : chr "" "" "" "" ... ## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ... ``` >There are 227496 observations; 21 variables --- # The hflights Dataset ```r knitr::kable( head(df, 10),format = 'html' ) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Year </th> <th style="text-align:right;"> Month </th> <th style="text-align:right;"> DayofMonth </th> <th style="text-align:right;"> DayOfWeek </th> <th style="text-align:right;"> DepTime </th> <th style="text-align:right;"> ArrTime </th> <th style="text-align:left;"> UniqueCarrier </th> <th style="text-align:right;"> FlightNum </th> <th style="text-align:left;"> TailNum </th> <th style="text-align:right;"> ActualElapsedTime </th> <th style="text-align:right;"> AirTime </th> <th style="text-align:right;"> ArrDelay </th> <th style="text-align:right;"> DepDelay </th> <th style="text-align:left;"> Origin </th> <th style="text-align:left;"> Dest </th> <th style="text-align:right;"> Distance </th> <th style="text-align:right;"> TaxiIn </th> <th style="text-align:right;"> TaxiOut </th> <th style="text-align:right;"> Cancelled </th> <th style="text-align:left;"> CancellationCode </th> <th style="text-align:right;"> Diverted </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 5424 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1400 </td> <td style="text-align:right;"> 1500 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N576AA </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> -10 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5425 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 1401 </td> <td style="text-align:right;"> 1501 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N557AA </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5426 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1352 </td> <td style="text-align:right;"> 1502 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N541AA </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 48 </td> <td style="text-align:right;"> -8 </td> <td style="text-align:right;"> -8 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 17 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5427 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 1403 </td> <td style="text-align:right;"> 1513 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N403AA </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 39 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5428 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1405 </td> <td style="text-align:right;"> 1507 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N492AA </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> -3 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5429 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1359 </td> <td style="text-align:right;"> 1503 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N262AA </td> <td style="text-align:right;"> 64 </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> -7 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5430 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 1359 </td> <td style="text-align:right;"> 1509 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N493AA </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5431 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 1355 </td> <td style="text-align:right;"> 1454 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N477AA </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 40 </td> <td style="text-align:right;"> -16 </td> <td style="text-align:right;"> -5 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 12 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5432 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 9 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 1443 </td> <td style="text-align:right;"> 1554 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N476AA </td> <td style="text-align:right;"> 71 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 44 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 22 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> <tr> <td style="text-align:left;"> 5433 </td> <td style="text-align:right;"> 2011 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1443 </td> <td style="text-align:right;"> 1553 </td> <td style="text-align:left;"> AA </td> <td style="text-align:right;"> 428 </td> <td style="text-align:left;"> N504AA </td> <td style="text-align:right;"> 70 </td> <td style="text-align:right;"> 45 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:right;"> 43 </td> <td style="text-align:left;"> IAH </td> <td style="text-align:left;"> DFW </td> <td style="text-align:right;"> 224 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> --- #What are Types of Data in Statistics .pull-left[  ] .pull-right[ - **Categorical data** describes the data that fits into the categories. ex. (Gender , Country) - **Quantitative data** is also known as numerical data which represents the numerical value (i.e., how much, how often, how many). ex( age, sales, population) ] Source : <https://byjus.com/maths/types-of-data-in-statistics/> --- #Converting to Factors **Factor** variables are **categorical** variables that can be either numeric or string variables ```r df$Origin[1:20] ``` ``` ## [1] "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" ## [13] "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" "IAH" ``` ```r class(df$Origin) ``` ``` ## [1] "character" ``` ```r unique(df$Origin) ``` ``` ## [1] "IAH" "HOU" ``` Since Origin appears to have categorical values of **IAH** and **HOU** only then convert to Factor Examine also DayOfWeek and Month as possible factors. ```r df$Origin <- as.factor(df$Origin) class(df$Origin) ``` ``` ## [1] "factor" ``` --- #Categorical Variables as Bar Plots ```r plot(df$Origin) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ```r summary(df) ``` ``` ## Year Month DayofMonth DayOfWeek DepTime ## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000 Min. : 1 ## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000 1st Qu.:1021 ## Median :2011 Median : 7.000 Median :16.00 Median :4.000 Median :1416 ## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948 Mean :1396 ## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000 3rd Qu.:1801 ## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000 Max. :2400 ## NA's :2905 ## ArrTime UniqueCarrier FlightNum TailNum ## Min. : 1 Length:227496 Min. : 1 Length:227496 ## 1st Qu.:1215 Class :character 1st Qu.: 855 Class :character ## Median :1617 Mode :character Median :1696 Mode :character ## Mean :1578 Mean :1962 ## 3rd Qu.:1953 3rd Qu.:2755 ## Max. :2400 Max. :7290 ## NA's :3066 ## ActualElapsedTime AirTime ArrDelay DepDelay ## Min. : 34.0 Min. : 11.0 Min. :-70.000 Min. :-33.000 ## 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000 1st Qu.: -3.000 ## Median :128.0 Median :107.0 Median : 0.000 Median : 0.000 ## Mean :129.3 Mean :108.1 Mean : 7.094 Mean : 9.445 ## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000 3rd Qu.: 9.000 ## Max. :575.0 Max. :549.0 Max. :978.000 Max. :981.000 ## NA's :3622 NA's :3622 NA's :3622 NA's :2905 ## Origin Dest Distance TaxiIn ## HOU: 52299 Length:227496 Min. : 79.0 Min. : 1.000 ## IAH:175197 Class :character 1st Qu.: 376.0 1st Qu.: 4.000 ## Mode :character Median : 809.0 Median : 5.000 ## Mean : 787.8 Mean : 6.099 ## 3rd Qu.:1042.0 3rd Qu.: 7.000 ## Max. :3904.0 Max. :165.000 ## NA's :3066 ## TaxiOut Cancelled CancellationCode Diverted ## Min. : 1.00 Min. :0.00000 Length:227496 Min. :0.000000 ## 1st Qu.: 10.00 1st Qu.:0.00000 Class :character 1st Qu.:0.000000 ## Median : 14.00 Median :0.00000 Mode :character Median :0.000000 ## Mean : 15.09 Mean :0.01307 Mean :0.002853 ## 3rd Qu.: 18.00 3rd Qu.:0.00000 3rd Qu.:0.000000 ## Max. :163.00 Max. :1.00000 Max. :1.000000 ## NA's :2947 ``` add describe() here --- #Univariate analysis In univariate analysis, the output is a single variable and all data collected is for it. There is no cause-and-effect relationship at all ## Charts to use - Categorical ( Bar , Pie Charts) - Quantitative (Histogram, BoxPlot, DensityPlot) --- #Categorical Variables .pull-left[ ```r plot(df$Month) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r plot(df$DayOfWeek) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> ] --- #Categorical Variables ```r plot(df$UniqueCarrier) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- #Quantitative Variables .pull-left[ ```r hist(df$DepDelay) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> ] .pull-right[ ```r hist(df$ArrDelay) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> ] --- #BiVariate Analysis Bivariate analysis refers to the analysis of two variables to determine **relationships** between them. One variable here is dependent while the other is independent. ###Types of bivariate data analysis - **Numerical and Numerical** – In this type, both the variables of bivariate data, independent and dependent, - are having numerical values. - **Categorical and Categorical** – When both the variables are categorical. - **Numerical and Categorical** – When one variable is numerical and one is categorical. --- Numerical and Numerical ```r # Get only flights from American Airlines AA <- subset(hflights, UniqueCarrier == "AA" ) plot(x= AA$ArrDelay , y= AA$DepDelay, xlab = "Arrival Delay", ylab = "Departure Delay", col = "dodgerblue1") ``` <img src="eda-basics_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- Numerical and Categorical ```r plot(AA$ArrDelay ~ AA$DayOfWeek , xlab = "Day of Week", ylab = "Arrival Delay", col = "dodgerblue1") ``` <img src="eda-basics_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- Categorical and Categorical ```r plot(as.factor(df$DayOfWeek) ~ as.factor(df$Origin)) ``` <img src="eda-basics_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle #Next : Data Manipulation