| Title: "Exploratory Data Analysis in R" |
| Author: "T K Chakrabarty" |
| Date: "2024-06-18" |
| Output: slidy_presentation |
| Data Type | Example | Variable Type |
|---|---|---|
| Character | "A", "Good" | String/Character |
| Categorical | Male, Female | Nominal |
| Categorical with order | Poor, Middle, Rich | Ordinal |
| Integer | -100, 3, 213 | Numeric (Discrete) |
| Real/Floating | 21.2, -675.45 | Numeric (Continuous) |
| Complex | 5+2i, 3-7i | Complex |
| Logical | TRUE, FALSE | Logical |
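As a rough illustration of how the rows of this table map onto R objects, the sketch below builds one value per data type and asks R for its class. The list name `examples` and its element names are illustrative choices, not part of the original slides.

```r
# One example value for each row of the data type table
examples <- list(
  character = "Good",
  nominal   = factor(c("Male", "Female")),
  ordinal   = factor(c("Poor", "Middle", "Rich"),
                     levels = c("Poor", "Middle", "Rich"), ordered = TRUE),
  integer   = c(-100L, 3L, 213L),   # the L suffix requests integer storage
  real      = c(21.2, -675.45),
  complex   = c(5+2i, 3-7i),
  logical   = c(TRUE, FALSE)
)

# class() reports how R stores each value; keep only the first class label
types <- sapply(examples, function(x) class(x)[1])
print(types)
```

Note that nominal and ordinal variables are both stored as factors; only the `ordered = TRUE` flag distinguishes them.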
Exploratory data analysis (EDA) is a systematic way of handling the data at hand: visualizing it, examining summary statistics, transforming, and modeling, in order to generate research questions and refine answers to questions such as the following:
Variation: the tendency of the values of a variable to change from measurement to measurement
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
Look at the distribution of every variable
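One possible way to approach the variation questions is to tabulate a variable and sort the counts. The sketch below uses the `speed` column of the built-in `cars` data; the choice of variable and the names `counts` and `rare` are assumptions made for illustration.

```r
# Frequency of each distinct speed, most common first
counts <- sort(table(cars$speed), decreasing = TRUE)
print(head(counts, 3))                 # the most common values

# Values that occur least often
rare <- counts[counts == min(counts)]
print(rare)

# A histogram gives a quick view of the overall shape and any unusual patterns
hist(cars$speed, main = "Distribution of speed", xlab = "speed")
```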
Step 1 - Assess the general characteristics of the dataset
How many records do we have? How many variables?
What are the variable names? Are they meaningful?
What type is each variable (e.g., numeric, categorical, logical)?
How many unique values does each variable have?
What value occurs most frequently, and how often does it occur?
Are there missing observations? If so, how frequently does this occur?
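The questions above can each be answered with a base R call. The following is a minimal sketch that assumes the built-in `airquality` data set, chosen here only because it contains missing values; it does not appear in the original slides, and the names `df`, `n_unique`, `freq` and `n_missing` are illustrative.

```r
df <- airquality   # assumed example data set with some NA values

print(dim(df))                 # number of records and number of variables
print(names(df))               # variable names
print(sapply(df, class))       # type of each variable

# Number of unique values per variable (NA counts as one value here)
n_unique <- sapply(df, function(x) length(unique(x)))
print(n_unique)

# Most frequent value of one variable and how often it occurs
freq <- sort(table(df$Month), decreasing = TRUE)
print(freq[1])

# Missing observations per variable
n_missing <- colSums(is.na(df))
print(n_missing)
```

The same calls work unchanged on any data frame, so `df` can be pointed at the data actually under study.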
Step 2 - Analyzing distributions of categorical variables
Analyzing a categorical (ordinal or nominal) variable typically starts with the number of levels and the number of data points in each level.
How many levels does the variable have?
How many data points does each level have?
Are the data points uniformly distributed across the levels?
What proportion of the total does each level represent?
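A short sketch of how these four questions might be answered for one categorical variable, using `Species` from the built-in `iris` data. The variable names `species`, `counts`, `props` and `is_uniform` are illustrative.

```r
species <- iris$Species

print(nlevels(species))          # how many levels the variable has

counts <- table(species)         # data points in each level
print(counts)

props <- prop.table(counts)      # proportion of the total in each level
print(round(props, 3))

# The distribution is uniform if every level has the same count
is_uniform <- length(unique(as.vector(counts))) == 1
print(is_uniform)

# Bar chart of the level counts
barplot(counts, main = "Observations per species", ylab = "count")
```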
Step 3 - Analyzing numerical variables (discrete or continuous): graphically and quantitatively
Step 4 - Analyzing pairs of categorical variables: joint distribution and conditional distribution
Step 5 - Analyzing pairs of categorical and numeric (discrete or continuous) variables
Step 6 - Modeling and Implementations
Step 7 - Statistical Inference
Step 8 - Results and Conclusions
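To make Step 3 concrete, the sketch below looks at one numeric variable both quantitatively and graphically. It uses `dist` from the built-in `cars` data; the particular statistics and plots are one reasonable selection, not a prescribed list.

```r
x <- cars$dist   # stopping distance

# Quantitative view: centre and spread
stats <- c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
print(round(stats, 2))
print(quantile(x, probs = c(0.05, 0.25, 0.5, 0.75, 0.95)))

# Graphical view: shape of the distribution and possible outliers
hist(x, main = "Histogram of dist", xlab = "dist")
boxplot(x, horizontal = TRUE, main = "Boxplot of dist")
```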
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
plot(cars)
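The scatter plot suggests that stopping distance grows with speed. As a hedged follow-up, the sketch below quantifies that association with a correlation coefficient and overlays a straight-line fit. The simple linear model is an assumption made for illustration, not a model taken from the slides.

```r
# Strength of the linear association between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(r)

# A straight-line fit as one possible first model
fit <- lm(dist ~ speed, data = cars)
print(coef(fit))

# Scatter plot with the fitted line overlaid
plot(cars)
abline(fit, col = "red")
```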
# Admission counts by gender and admission decision
data1 <- data.frame(Gender = c("M", "F", "M", "F"),
                    Admit  = c("yes", "yes", "no", "no"),
                    Number = c(1198, 557, 1493, 1278))
data1
## Gender Admit Number
## 1 M yes 1198
## 2 F yes 557
## 3 M no 1493
## 4 F no 1278
library(ggplot2)
# Stacked bar chart of counts by gender, filled by admission decision
bp <- ggplot(data1, aes(x = Gender, y = Number, fill = Admit)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_manual(values = c("yellowgreen", "yellow2"))
bp
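The stacked bars show counts; the joint and conditional distributions named in Step 4 can be derived from the same counts. This is a minimal sketch using base R's `xtabs()` and `prop.table()`; the data frame is rebuilt here so the block runs on its own, and the names `tab`, `joint` and `cond` are illustrative.

```r
data1 <- data.frame(Gender = c("M", "F", "M", "F"),
                    Admit  = c("yes", "yes", "no", "no"),
                    Number = c(1198, 557, 1493, 1278))

# Two-way table of counts: rows are Gender, columns are Admit
tab <- xtabs(Number ~ Gender + Admit, data = data1)
print(tab)

# Joint distribution: each cell as a share of the grand total
joint <- prop.table(tab)
print(round(joint, 3))

# Conditional distribution of Admit given Gender: each row sums to 1
cond <- prop.table(tab, margin = 1)
print(round(cond, 3))
```

Passing `margin = 2` instead would condition on the admission decision and give the distribution of gender within each outcome.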
library(MASS)
attach(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
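# Map each species name to a fill colour
lookup <- c(setosa = "blue", versicolor = "green", virginica = "orange")
# Index by species name rather than by the factor's integer codes
col.ind <- lookup[as.character(iris$Species)]
plot(Sepal.Width ~ Sepal.Length, data = iris, pch = 21, col = "gray", bg = col.ind)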
library(MASS)  # provides truehist()
truehist(iris$Sepal.Width)
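The histogram pools all three species. For Step 5, a categorical variable paired with a numeric one, a possible next move is to compare the same measurement across groups. The sketch below uses only base R; the names `means` and `medians` are illustrative.

```r
# Mean sepal width within each species
means <- tapply(iris$Sepal.Width, iris$Species, mean)
print(round(means, 3))

# Median sepal width within each species, returned as a data frame
medians <- aggregate(Sepal.Width ~ Species, data = iris, FUN = median)
print(medians)

# Side-by-side boxplots to compare the group distributions
boxplot(Sepal.Width ~ Species, data = iris,
        main = "Sepal width by species", ylab = "Sepal.Width")
```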