Title: “Exploratory Data Analysis in R”

Author: “T K Chakrabarty”

Date: “2024-06-18”

Output: slidy_presentation

Learning from Data: Environment

A learning environment consists of several input and output features, called variables.
A variable is any characteristic, number or quantity that can be measured or counted.
These variables generate observations which we call data.
We classify each variable according to the type of data it generates in a given environment.

Data Types and Variables

Data Type	Example	Variable Type
Character	“A”, “Good”	String/Character
Categorical	Male, Female	Nominal
Categorical with order	Poor, Middle, Rich	Ordinal
Integer	-100, 3, 213	Numeric (Discrete)
Real/Floating	21.2, -675.45	Numeric (Continuous)
Complex	5+2i, 3-7i	Complex
Logical	TRUE, FALSE	Logical
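The table above can be checked directly in R. The following is a minimal sketch, not part of the original slides: the list name `examples` and the helper vector `type_of` are illustrative, and the values are taken from the Example column.

```r
# One small example per row of the table above.
examples <- list(
  character = c("A", "Good"),
  nominal   = factor(c("Male", "Female")),
  ordinal   = factor(c("Poor", "Middle", "Rich"),
                     levels = c("Poor", "Middle", "Rich"),
                     ordered = TRUE),
  integer   = c(-100L, 3L, 213L),
  real      = c(21.2, -675.45),
  complex   = c(5+2i, 3-7i),
  logical   = c(TRUE, FALSE)
)

# class() may return several classes; the first one is the most specific.
type_of <- vapply(examples, function(x) class(x)[1], character(1))
print(type_of)

# Ordered factors support comparisons between levels; nominal factors do not.
print(examples$ordinal[1] < examples$ordinal[3])  # is "Poor" below "Rich"?
```

Note that R stores both nominal and ordinal variables as factors; only the `ordered = TRUE` flag distinguishes them.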

Data Generating Process

Learning: Characterization of uncertainty

A learning environment may have one, two, or many variables, of one or several types.
The learning goal is to understand the data generating process approximately (estimation/inference).
Uncertainty is described through univariate or joint probability distributions.
The goal therefore becomes either to estimate the unknown parameters of a probability model or to examine the plausibility of a hypothesized model, based on the given data.
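As an illustration of estimation and of checking a hypothesized model, here is a small sketch on simulated data. The normal model, its parameter values (mean 10, standard deviation 2), the sample size and the seed are all assumptions chosen for the example; they do not come from the slides.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical data generating process: Normal(mean = 10, sd = 2).
x <- rnorm(200, mean = 10, sd = 2)

# Estimation: sample mean and standard deviation as parameter estimates.
est <- c(mean = mean(x), sd = sd(x))
print(est)

# Inference: is a hypothesized mean of 10 plausible given the data?
tt <- t.test(x, mu = 10)
print(tt$conf.int)  # 95% confidence interval for the mean
print(tt$p.value)   # p-value of the test H0: mean = 10
```

In practice the true parameters are unknown; simulating from a known model is only a way to see how close the estimates come.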

Exploratory Data Analysis

A systematic way of handling the data in hand by visualizing, examining summary statistics, transforming, and modeling, in order to generate research questions and refine the answers to questions such as the following:

Variation: the tendency of the values of a variable to change from measurement to measurement
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
Look at the distribution of every variable.
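The questions about common and rare values can be explored with a frequency table and a histogram. This is a minimal sketch using the built-in `cars` dataset; the choice of the `speed` column is only for illustration.

```r
# Frequency of each distinct speed, most common first.
freq <- sort(table(cars$speed), decreasing = TRUE)

print(head(freq, 3))  # the most common values
print(tail(freq, 3))  # the rarest values

# The histogram shows the overall shape of the variation.
hist(cars$speed, main = "Distribution of speed", xlab = "speed")
```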

Steps in EDA: Step 1 - First Approach to the Data

Assess the general characteristics of the dataset
How many records do we have? How many variables?
What are the variable names? Are they meaningful?
What type is each variable (e.g., numeric, categorical, logical)?
How many unique values does each variable have?
What value occurs most frequently, and how often does it occur?
Are there missing observations? If so, how frequently does this occur?
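The checklist above maps onto a handful of base R calls. The sketch below wraps them in a helper; the function names `first_look` and `most_common` are hypothetical, introduced only for this example, and `cars` is used because it ships with R.

```r
# Most frequent non-missing value of a vector, as a character string.
most_common <- function(x) {
  tab <- table(x)                       # table() drops NA by default
  if (length(tab) == 0) return(NA_character_)
  names(tab)[which.max(tab)]
}

# One row per variable: type, distinct values, missing values, modal value.
first_look <- function(df) {
  data.frame(
    variable    = names(df),
    type        = vapply(df, function(x) class(x)[1], character(1)),
    n_unique    = vapply(df, function(x) length(unique(x)), integer(1)),
    n_missing   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    most_common = vapply(df, most_common, character(1)),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

print(dim(cars))             # number of records and number of variables
overview <- first_look(cars)
print(overview)
```

The same helper can be applied to any data frame, for example `first_look(iris)`.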

Steps in EDA: Step 2

Analyzing distributions of categorical variables
Analyzing a categorical (ordinal or nominal) variable involves questions about the number of levels and the number of data points in each level.
How many levels does the variable have?
How many data points does each level have?
Are the data points uniformly distributed across the levels?
What proportion of the total does each level represent?
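These questions can be answered with `nlevels()`, `table()` and `prop.table()`. A minimal sketch, using the `Species` factor of the built-in `iris` dataset as the categorical variable:

```r
species <- iris$Species

print(nlevels(species))       # how many levels?

counts <- table(species)      # data points per level
print(counts)

props <- prop.table(counts)   # proportion of the total per level
print(props)

# Bars of equal height indicate a uniform distribution across levels.
barplot(counts, main = "Observations per species")
```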

Steps in EDA

Step 3 - Analyzing numerical variables (discrete or continuous): graphically and quantitatively

Step 4 - Analyzing pairs of categorical variables: joint distribution and conditional distribution

Step 5 - Analyzing pairs of categorical and numeric (discrete/continuous) variables

Step 6 - Modeling and Implementations

Step 7 - Statistical Inference

Step 8 - Results and Conclusions
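Steps 3 and 5 can be sketched with base R on the built-in `iris` dataset. The choice of `Sepal.Length` as the numeric variable and `Species` as the categorical one is an assumption made for illustration.

```r
# Step 3, quantitatively: location, spread and selected quantiles.
print(summary(iris$Sepal.Length))
print(sd(iris$Sepal.Length))
print(quantile(iris$Sepal.Length, probs = c(0.05, 0.95)))

# Step 3, graphically: histogram of a single numeric variable.
hist(iris$Sepal.Length, main = "Sepal length", xlab = "Sepal.Length")

# Step 5, quantitatively: mean of the numeric variable within each category.
by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(by_species)

# Step 5, graphically: one box per category.
boxplot(Sepal.Length ~ Species, data = iris, main = "Sepal length by species")
```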

Slide with R Output

str(cars)

## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Plots

plot(cars)
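As a hedged illustration of the modeling step, a straight-line fit can be overlaid on the same scatter plot. The simple linear model `dist ~ speed` is an assumption made for this sketch, not a model proposed in the slides.

```r
# Fit stopping distance as a linear function of speed.
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))          # intercept and slope estimates

plot(cars)                # same scatter plot as above
abline(fit, col = "red")  # add the fitted line
```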

Categorical Variables

# Admission counts by gender and admission status
data1 <- data.frame(Gender = c("M", "F", "M", "F"),
                    Admit  = c("yes", "yes", "no", "no"),
                    Number = c(1198, 557, 1493, 1278))
data1

##   Gender Admit Number
## 1      M   yes   1198
## 2      F   yes    557
## 3      M    no   1493
## 4      F    no   1278
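From these counts, the joint and conditional distributions of the two categorical variables can be computed with `xtabs()` and `prop.table()`. The sketch below re-creates the data frame so that it runs on its own; the object names `tab`, `joint` and `cond` are illustrative.

```r
data1 <- data.frame(Gender = c("M", "F", "M", "F"),
                    Admit  = c("yes", "yes", "no", "no"),
                    Number = c(1198, 557, 1493, 1278))

# Two-way table of counts: rows are Gender, columns are Admit.
tab <- xtabs(Number ~ Gender + Admit, data = data1)
print(tab)

# Joint distribution: each cell as a share of the grand total.
joint <- prop.table(tab)
print(joint)

# Conditional distribution of Admit given Gender: each row sums to 1.
cond <- prop.table(tab, margin = 1)
print(cond)
```

Using `margin = 2` instead would give the distribution of Gender within each admission status.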

R Codes for plots

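library(ggplot2)

# Stacked bar chart: bar height is Number, segments are coloured by Admit
bp <- ggplot(data1, aes(x = Gender, y = Number, fill = Admit)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("yellowgreen", "yellow2"))
bp  # printing the plot object draws it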

Bar Plots

New Data

library(MASS)
attach(iris)
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Scatter Plot

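# Map each species name to a fill colour
lookup <- c(setosa = "blue", versicolor = "green", virginica = "orange")
# Index by name; indexing with the factor itself would use its integer codes
col.ind <- lookup[as.character(iris$Species)]
plot(Sepal.Width ~ Sepal.Length, data = iris, pch = 21, col = "gray", bg = col.ind)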

library(MASS)  # provides truehist()
truehist(iris$Sepal.Width)
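A similar density-scaled histogram can be drawn with base R alone. This is a sketch of an alternative to `truehist()`, not a replacement chosen by the slides; the default bin choice may differ from the one `truehist()` makes.

```r
w <- iris$Sepal.Width

# freq = FALSE scales the bars so that their total area is 1.
h <- hist(w, freq = FALSE, main = "Sepal width", xlab = "Sepal.Width")

# Overlay a kernel density estimate for comparison.
lines(density(w), lwd = 2)
```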