Title: “Exploratory Data Analysis in R”
Author: “T K Chakrabarty”
Date: “2024-06-18”
Output: slidy_presentation

Learning from Data: Environment

Data Types and Variables

| Data Type              | Example            | Variable             |
|------------------------|--------------------|----------------------|
| Character              | "A", "Good"        | String/Character     |
| Categorical            | Male, Female       | Nominal              |
| Categorical with order | Poor, Middle, Rich | Ordinal              |
| Integer                | -100, 3, 213       | Numeric (Discrete)   |
| Real/Floating          | 21.2, -675.45      | Numeric (Continuous) |
| Complex                | 5+2i, 3-7i         | Complex              |
| Logical                | TRUE, FALSE        | Logical              |
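
As a rough illustration of how the types above map onto R objects, here is a minimal sketch. The variable names (`ch`, `sex`, `inc`, and so on) are made up for this example. Note that a bare number such as `213` is stored as a double in R; the `L` suffix is needed to get an integer.

```r
# One value of each data type from the table (names are illustrative only)
ch   <- "Good"                                    # character
sex  <- factor(c("Male", "Female"))               # categorical, nominal
inc  <- factor(c("Poor", "Rich", "Middle"),
               levels = c("Poor", "Middle", "Rich"),
               ordered = TRUE)                    # categorical with order (ordinal)
n    <- 213L                                      # integer (note the L suffix)
x    <- -675.45                                   # real / floating point
z    <- 5+2i                                      # complex
flag <- TRUE                                      # logical

# Inspect how R classifies each object
classes <- sapply(list(ch = ch, sex = sex, inc = inc, n = n,
                       x = x, z = z, flag = flag),
                  function(v) class(v)[1])
print(classes)

# Ordered factors support comparisons; unordered factors do not
inc[1] < inc[2]
```

`class()` returns two values for an ordered factor, so the sketch keeps only the first one when building the summary vector.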

Data Generating Process

Learning: Characterization of uncertainty

Exploratory Data Analysis

A systematic way of handling the data in hand: visualizing it, examining summary statistics, transforming, and modeling, in order to generate research questions and refine answers to the following questions:

Steps in EDA: Step 1 - First Approach to Data

Steps in EDA: Step 2

Steps in EDA

Step 3 - Analyzing numerical variables (discrete or continuous): graphically and quantitatively
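
One possible way to carry out Step 3 on a single numeric variable, sketched with the built-in `cars` data. The choice of `speed` and of these particular summaries and plots is an assumption made for illustration, not a prescribed recipe.

```r
x <- cars$speed   # a numeric variable from the built-in cars data

# Quantitatively: location and spread
five_num <- fivenum(x)                            # min, hinges, median, max
center   <- c(mean = mean(x), median = median(x))
spread   <- c(sd = sd(x), IQR = IQR(x))
print(five_num); print(center); print(spread)

# Graphically: shape of the distribution and possible outliers
hist(x, main = "Distribution of speed", xlab = "speed")
boxplot(x, horizontal = TRUE, main = "Boxplot of speed")
```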

Step 4 - Analyzing pairs of categorical variables: joint distribution and conditional distribution

Step 5 - Analyzing pairs of categorical and numeric variables (discrete/continuous)
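
A minimal sketch of Step 5, assuming the built-in `iris` data with `Species` as the categorical variable and `Sepal.Length` as the numeric one. The pairing and the summary functions are chosen only for illustration.

```r
# Numeric summaries of Sepal.Length within each level of Species
means_by_species   <- tapply(iris$Sepal.Length, iris$Species, mean)
medians_by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)
print(means_by_species)
print(medians_by_species)

# Side-by-side boxplots compare the distributions across groups
boxplot(Sepal.Length ~ Species, data = iris,
        ylab = "Sepal length", main = "Sepal length by species")
```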

Step 6 - Modeling and Implementations
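
As one hedged example of what a modeling step might look like, the sketch below fits a simple linear regression of `dist` on `speed` for the built-in `cars` data. The slides do not specify a model, so the choice of `lm()` here is purely an assumption.

```r
# Simple linear regression: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))          # intercept and slope estimates

# Overlay the fitted line on the scatter plot
plot(cars)
abline(fit, col = "red")
```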

Step 7 - Statistical Inference

Step 8 - Results and Conclusions

Slide with R Output

str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Plots

plot(cars)

Categorical Variables

data1=data.frame(Gender=c("M","F","M","F"), Admit=c("yes","yes","no","no"),
                 Number=c(1198,557,1493,1278))
attach(data1)
data1
##   Gender Admit Number
## 1      M   yes   1198
## 2      F   yes    557
## 3      M    no   1493
## 4      F    no   1278
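
The joint and conditional distributions named in Step 4 can be computed from these counts. The sketch below rebuilds the same data under the hypothetical name `adm` so that it runs on its own. `xtabs()` and `prop.table()` are base R functions.

```r
# Same counts as data1, rebuilt here so the example is self-contained
adm <- data.frame(Gender = c("M", "F", "M", "F"),
                  Admit  = c("yes", "yes", "no", "no"),
                  Number = c(1198, 557, 1493, 1278))

# Two-way contingency table of counts
tab <- xtabs(Number ~ Gender + Admit, data = adm)

joint <- prop.table(tab)               # joint distribution: all cells sum to 1
cond  <- prop.table(tab, margin = 1)   # distribution of Admit given Gender: rows sum to 1

print(tab)
print(round(joint, 3))
print(round(cond, 3))
```

Using `margin = 2` instead would give the distribution of Gender given Admit.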

R Codes for plots

library(ggplot2)

bp=ggplot(data1,aes(x=Gender, y=Number,fill=Admit)) + geom_bar(stat="identity", position="stack") + scale_fill_manual(values=c('yellowgreen', 'yellow2'))

bp

Bar Plots

New Data

library(MASS)
attach(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Scatter Plot

# Map each species to a fill colour; index by species name rather than factor code
lookup = c(setosa='blue', versicolor='green', virginica='orange')
col.ind = lookup[as.character(iris$Species)]
plot(Sepal.Width ~ Sepal.Length, data = iris, pch=21, col="gray", bg= col.ind)

library(MASS)  # provides truehist()
truehist(iris$Sepal.Width)