Exploratory Data Analysis (EDA)

Zhenning "Jimmy" Xu, follow me on Twitter: https://twitter.com/MKTJimmyxu
10.29.2021

Agenda

This presentation was created using RStudio and R Markdown. For more details on authoring R presentations, please visit https://support.rstudio.com/hc/en-us/articles/200486468.

Intro (5 mins)
What is EDA?
Why R?
Avocado Pricing at Retail (tutorial)
Perceptual Mapping (if time allows)

Why R (R programming/R Studio)?

[Figure omitted: chunk figures-side]

Ref:
Why R? https://techvidvan.com/tutorials/r-tutorial/
R Career - Discover various Opportunities and Scope of R Programming! https://data-flair.training/blogs/r-careers/

Exploratory Data Analysis

EDA is used to understand the features of our data. The following are some common procedures:

Data importing / Data cleaning
Data Processing/Exploration (outlier detection, correlations, subsets, etc.)
Trend Analysis (distribution, etc.)
Pattern Recognition/Feature Selection/Factor Analysis
Dimension Reduction - Principal Component Analysis (PCA)
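
As a rough illustration of the outlier detection and subsetting steps listed above, here is a minimal sketch in base R. The data frame `toy` and its values are made up for this example (only the column names mirror the avocado dataset used later), and the 1.5 * IQR fence is one common rule of thumb, not the only way to flag outliers.

```r
# Toy data with made-up values; only the column names mirror the avocado dataset.
toy <- data.frame(
  average_price = c(1.10, 1.25, 1.32, 1.40, 1.57, 1.20, 1.35, 4.90),
  type = c("conventional", "conventional", "organic", "organic",
           "organic", "conventional", "conventional", "organic")
)

# Quartiles and interquartile range of the price column.
q <- unname(quantile(toy$average_price, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- toy$average_price < lower | toy$average_price > upper

# Drop flagged rows, then keep a subset of interest.
clean <- toy[!is_outlier, ]
organic <- subset(clean, type == "organic")

print(toy[is_outlier, ])
print(organic)
```

Flagged rows are worth inspecting before they are dropped, since an extreme value can be a data-entry error or a real observation.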

Exploratory Data Analysis

Exploratory Data Analysis (EDA) and data visualization often go hand in hand. The following are some popular methods:

Summary statistics (means, medians, quantiles, histograms, boxplots)
Data Subsetting
Time Series Analysis
PCA (Perceptual Mapping - A top marketing skill to add to your resume)
Cluster analysis
Mapping (Spatial marketing, geo-marketing, heatmapping, etc.)
Networks and Graphs
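
To make the PCA / perceptual mapping item above more concrete, here is a minimal sketch using `prcomp()` from base R. The brand names, attributes, and ratings are all hypothetical values invented for this example; a real perceptual map would start from survey data.

```r
# Toy brand ratings (made-up values on a 1-10 scale); brands and attributes are hypothetical.
ratings <- data.frame(
  price   = c(8, 6, 3, 7, 2),
  quality = c(7, 8, 4, 5, 3),
  style   = c(3, 5, 9, 4, 8),
  row.names = c("BrandA", "BrandB", "BrandC", "BrandD", "BrandE")
)

# Standardize each attribute, then extract the principal components.
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Brand coordinates on the first two components: the points of a perceptual map.
coords <- pca$x[, 1:2]

print(round(explained, 3))
print(round(coords, 2))
```

Calling `biplot(pca)` afterwards would draw the brands and the attribute directions on the same axes, which is one simple way to produce a perceptual map.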

EDA Examples

Top marketing skills to add to your resume:

Perceptual Mapping - https://rpubs.com/utjimmyx/brand_positioning
Time Series Analysis/Trend Analysis (seasonality analysis) - https://rpubs.com/utjimmyx/retailpricing
Cluster analysis - https://rpubs.com/utjimmyx/segmentation_analysis
Mapping (Spatial marketing, geo-marketing, heatmapping, etc.) - https://rpubs.com/utjimmyx/mapboxapp
Dashboard Design (turn your observations into interactive stories) - https://rpubs.com/utjimmyx/Flexdashboard_Kern_vote

Slide With Code

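# We will be using the avocado dataset for this exercise
urlfile <- 'https://raw.github.com/utjimmyx/resources/master/avocado_HAA.csv'
# fileEncoding = "UTF-8-BOM" drops the byte-order mark at the start of the file
data <- read.csv(urlfile, fileEncoding = "UTF-8-BOM")
# Summary statistics for every column
summary(data)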

     date           average_price    total_volume         type          
 Length:12628       Min.   :0.500   Min.   :    253   Length:12628      
 Class :character   1st Qu.:1.100   1st Qu.:  15733   Class :character  
 Mode  :character   Median :1.320   Median :  94806   Mode  :character  
                    Mean   :1.359   Mean   : 325259                     
                    3rd Qu.:1.570   3rd Qu.: 430222                     
                    Max.   :2.780   Max.   :5660216                     
      year       geography        
 Min.   :2017   Length:12628      
 1st Qu.:2018   Class :character  
 Median :2019   Mode  :character  
 Mean   :2019                     
 3rd Qu.:2020                     
 Max.   :2020

cor(data$total_volume, data$average_price, method = "pearson", use = "complete.obs")

[1] -0.4169306
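
The coefficient above pools every row. A possible follow-up is to check whether the negative relationship also holds within each level of the `type` column and whether it is statistically distinguishable from zero. The sketch below uses a made-up data frame `toy` so that it runs without a download; the values and the two `type` labels are assumptions for illustration, not figures from the avocado dataset.

```r
# Toy data with made-up values; column names mirror the avocado dataset.
toy <- data.frame(
  average_price = c(1.00, 1.10, 1.20, 1.30, 1.60, 1.70, 1.80, 1.90),
  total_volume  = c(900, 800, 750, 600, 120, 100, 90, 60),
  type = rep(c("conventional", "organic"), each = 4)
)

# Pooled Pearson correlation, same call as on the slide.
pooled <- cor(toy$total_volume, toy$average_price,
              method = "pearson", use = "complete.obs")

# Correlation within each type: split the rows, then apply cor() per group.
by_type <- sapply(split(toy, toy$type), function(d) {
  cor(d$total_volume, d$average_price, use = "complete.obs")
})

# Significance test and confidence interval for the pooled correlation.
ct <- cor.test(toy$total_volume, toy$average_price)

print(pooled)
print(by_type)
print(ct$p.value)
```

Comparing the pooled and per-group values helps show whether a correlation is driven by differences between groups rather than by a relationship inside each group.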

Slide With Plot

[Figure omitted: chunk unnamed-chunk-3]