Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Exploratory data analysis (EDA) methods are often called descriptive statistics because they simply describe, or provide estimates based on, the data at hand.
EDA combines numerical summaries with graphical displays, and it is useful for understanding a dataset before any formal analysis.
For practice we will use the data from this GitHub source: https://raw.githubusercontent.com/BijayLalPradhan/D4P/main/BikeData.csv
# Read the bike data directly from GitHub
datacycle <- read.csv("https://raw.githubusercontent.com/BijayLalPradhan/D4P/main/BikeData.csv")
head(datacycle)
## user_id age gender student employed cyc_freq distance time speed
## 1 1 28 M 1 1 Daily 3.25 15 13.00
## 2 2 35 M 0 1 Daily 1.11 5 13.32
## 3 3 28 M 0 1 Daily 5.59 23 14.58
## 4 4 44 F 0 1 <once a month 3.24 24 8.10
## 5 5 42 M 0 1 >> per week 7.81 26 18.02
## 6 6 36 M 0 1 >> per week 3.00 20 9.00
## abs_days
## 1 3
## 2 2
## 3 0
## 4 3
## 5 4
## 6 4
Examine the structure of the variables and observations with str():
str(datacycle)
## 'data.frame': 121 obs. of 10 variables:
## $ user_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 28 35 28 44 42 36 45 54 39 44 ...
## $ gender : chr "M" "M" "M" "F" ...
## $ student : int 1 0 0 0 0 0 0 0 0 0 ...
## $ employed: int 1 1 1 1 1 1 1 1 1 0 ...
## $ cyc_freq: chr "Daily" "Daily" "Daily" "<once a month" ...
## $ distance: num 3.25 1.11 5.59 3.24 7.81 ...
## $ time : int 15 5 23 24 26 20 51 39 50 44 ...
## $ speed : num 13 13.3 14.6 8.1 18 ...
## $ abs_days: int 3 2 0 3 4 4 5 0 5 3 ...
We are interested in information such as:
What percentage of the respondents are male and female?
How are the respondents distributed among the cycling-frequency categories? Do the percentages follow some kind of pattern?
In order to summarize the distribution of a categorical variable, we first create a table of the different values (categories) the variable takes, how many times each value occurs (count) and, more importantly, how often each value occurs (by converting the counts to percentages).
One-way Frequency Table (Counts)
cbind(table(datacycle$gender))
## [,1]
## F 31
## M 90
We can convert the counts to percentages:
cbind(100*((table(datacycle$gender)/sum(table(datacycle$gender)))))
## [,1]
## F 25.61983
## M 74.38017
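Equivalently, base R's prop.table() converts a table of counts into proportions, which we can scale to percentages:
cbind(100 * prop.table(table(datacycle$gender)))  # same percentages as above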
Visual / Graphical Displays
There are two simple graphical displays for visualizing the distribution of one categorical variable: the bar chart and the pie chart.
barplot(table(datacycle$cyc_freq))
Sorted frequency bar diagram:
freq_table <- table(datacycle$cyc_freq)
sorted_freq <- sort(freq_table, decreasing = TRUE)
barplot(sorted_freq, main = "Bar Chart of Frequencies", xlab = "Frequency of cycling", ylab = "Count",col="green")
To print the counts above the bars:
sorted_freq <- sort(table(datacycle$cyc_freq), decreasing = TRUE)
barplot(sorted_freq, ylim=c(0,80), main = "Bar Chart of Frequencies", xlab = "Frequency of cycling", ylab = "Count")
# Add numbers above the bars
text(seq_along(sorted_freq), sorted_freq, labels = sorted_freq, pos = 3, col = "blue")
You can also draw a pie chart of the data:
pie(table(datacycle$cyc_freq))
There are two types of quantitative variables:
Discrete Variables
Discrete variables are variables that can only take on a finite or countable number of values. These values are typically distinct and separate from each other, with no values in between. Discrete variables often represent counts.
Examples of discrete variables include:
Number of siblings: This variable can only take on integer values such as 0, 1, 2, 3, etc. It cannot take on non-integer values like 1.5 or 2.7.
Number of cars in a parking lot: Similar to the number of siblings, this variable can only take on whole number values. It cannot have fractions or non-integer values.
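In the bike data, abs_days (the number of days a respondent was absent) is a discrete count variable. A minimal sketch of tabulating it, assuming datacycle is already loaded:
table(datacycle$abs_days)  # how many respondents report each count of absent days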
Continuous Data
Continuous data refers to data that can take on any value within a certain range and can be measured with any level of precision. It is characterized by an infinite number of possible values between any two points.
In practical terms, continuous data can be measured and represented at any level of precision, including fractions and decimals. Continuous data is often obtained through measurement rather than counting. It is typically represented using real numbers.
Examples of continuous data include:
Height: Height can be measured to any level of precision using a ruler or measuring tape. It can be 5 feet 6 inches, 5.5 feet, or even more precise measurements such as 5 feet 6.25 inches.
Weight: Weight can be measured on a scale to any level of precision, such as 150 pounds, 150.5 pounds, or 150.75 pounds.
Temperature: Temperature can be measured using a thermometer to any level of precision, such as 75 degrees Fahrenheit, 75.5 degrees Fahrenheit, or 75.75 degrees Fahrenheit.
Continuous data is often analyzed using statistical methods appropriate for continuous variables, such as calculating means, standard deviations, and performing regression analysis.
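As a quick illustration with the bike data (a sketch, assuming datacycle is already loaded), a continuous variable such as distance can be summarized numerically:
summary(datacycle$distance)  # minimum, quartiles, mean, and maximum
sd(datacycle$distance)       # standard deviation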
Numerical Measures
The “mean,” often referred to as the “average,” is a measure of central tendency that represents the typical value of a set of numbers. It is calculated by summing up all the values in a dataset and then dividing by the total number of values.
Mathematically, the mean (\(\bar{x}\)) of n numbers
x₁, x₂, ..., xₙ is given by:
\(\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n}\)
In words, you add up all the values and then divide by the total number of values.
For example, consider the following set of numbers: 4, 6, 8, 10, and 12.
To find the mean: \(\bar{x} = \frac{4 + 6 + 8 + 10 + 12}{5} = \frac{40}{5} = 8\). So, the mean of this dataset is 8.
The mean is a useful measure because it gives a single value that represents the central tendency of the data. However, it can be influenced by extreme values, which may not be representative of the majority of the data. Therefore, it’s often used in conjunction with other measures of central tendency, such as the median and mode, to provide a more comprehensive understanding of the data.
The mean() function is used to find the mean value:
mean(datacycle$distance)
## [1] 5.990661
cat("The mean value of distance is", mean(datacycle$distance))
## The mean value of distance is 5.990661
Logical calculation
mean = \(\frac{\text{sum of all observations}}{\text{number of observations}}\) = \(\frac{\text{sum\_data}}{\text{length\_data}}\)
sum_data=sum(datacycle$distance)
length_data=length(datacycle$distance)
sum_data/length_data # mean of distance
## [1] 5.990661
The median is a measure of central tendency that represents the middle value of a dataset when it’s ordered from least to greatest. To find the median, sort the values from least to greatest; if the number of observations is odd, the median is the middle value, and if it is even, the median is the average of the two middle values.
For example, consider the dataset: 3, 6, 1, 9, 2, 5, 8, 7, 4. Sorted, it becomes 1, 2, 3, 4, 5, 6, 7, 8, 9, and the middle (5th) value, 5, is the median.
Another example with an even number of observations: 2, 4, 6, 8, 10, 12. The two middle values are 6 and 8, so the median is their average, 7.
The median is a robust measure of central tendency because it is not affected by extreme values (outliers) in the dataset, unlike the mean. It’s often used when the data is skewed or when there are outliers present.
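A quick verification of both examples in R, together with a demonstration of this robustness (the numbers in the comments are easy to check by hand):
odd_data <- c(3, 6, 1, 9, 2, 5, 8, 7, 4)
sort(odd_data)    # 1 2 3 4 5 6 7 8 9; the middle (5th) value is 5
median(odd_data)  # 5
even_data <- c(2, 4, 6, 8, 10, 12)
median(even_data) # average of 6 and 8, i.e. 7
# Replacing 12 with the extreme value 1200 shifts the mean but not the median
mean(c(2, 4, 6, 8, 10, 1200))   # 205
median(c(2, 4, 6, 8, 10, 1200)) # still 7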
quantity <- c(45, 48, 52, 57, 40, 48, 44, 48, 45, 52, 51, 53, 51, 50, 45, 43, 52, 42, 45, 48, 57, 51, 58, 55, 54, 44, 47, 53, 53, 58, 52, 47, 48, 52, 48, 53, 42, 48, 57, 56, 57, 45, 49, 47, 55, 44, 46, 56, 51, 44, 52, 55, 46, 51, 49, 59, 57, 42, 57, 44, 43)
median(quantity)
## [1] 50
The mode is a measure of central tendency that represents the value that appears most frequently in a dataset. Unlike the mean and median, which represent typical values of a dataset, the mode represents the most common value or values. A dataset can have one mode, called unimodal, or multiple modes, called multimodal.
For example, consider the dataset: 2, 3, 3, 5, 5, 5, 7, 8, 8, 9. In this dataset, the mode is 5 because it appears more frequently than any other value.
In some datasets, there may be no mode if all values occur with the same frequency, or if no value repeats. For example, in the dataset: 1, 2, 3, 4, 5, 6, each value occurs only once, so there is no mode.
In R, there is no built-in function for the statistical mode (the base mode() function returns the storage mode of an object, not the most frequent value), so we write one ourselves:
# Define a function to find the mode
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
datap <- c(2, 3, 3, 5, 5, 5, 7, 8, 8, 9)
cat("Mode of the data is", Mode(datap))
## Mode of the data is 5
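Note that this Mode() function returns only the first of the most frequent values, so for multimodal data it silently drops the other modes. A small variation (a sketch) returns all of them:
# Return every value that attains the maximum frequency
Modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}
Modes(c(2, 3, 3, 5, 5, 9))  # both 3 and 5 appear twice, so both are modes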
Presenting Quantitative Data
Presenting Discrete Data
barplot(table(datacycle$abs_days), xlab="Nos of days absent", ylab="Frequency")
Presenting Continuous Data
Continuous data are presented in class intervals. Common methods for constructing the intervals include:
Equal-width intervals: Divide the range of the data into equal-sized intervals. This method is simple and easy to implement but may not always be suitable for datasets with uneven distributions.
Quantiles or percentiles: Divide the data into intervals based on quantiles or percentiles. For example, quartiles divide the data into four equal parts, while deciles divide it into ten equal parts. This method is useful for analyzing the data distribution and identifying outliers (see the sketch after this list).
Natural breaks (Jenks optimization): Use statistical algorithms to determine breakpoints that maximize the difference between groups while minimizing the variation within groups. This method is commonly used in geographic information systems (GIS) for choropleth mapping.
Domain knowledge: Use domain-specific knowledge to determine meaningful intervals for the data. For example, age groups, income brackets, or temperature ranges may be predefined based on relevant criteria.
Custom intervals: Define intervals based on specific criteria or requirements of the analysis. This approach allows for flexibility in grouping the data based on context-specific considerations.
The choice of classification method depends on the characteristics of the data, the objectives of the analysis, and the preferences of the analyst. It’s important to consider factors such as data distribution, sample size, interpretability, and the intended audience when selecting a classification method.
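For instance, quartile-based intervals can be created by passing the output of quantile() as the breaks argument of cut(). A minimal sketch using the speed variable (any continuous variable would do):
# Quartile-based class intervals: each interval holds roughly 25% of the data
q_breaks <- quantile(datacycle$speed, probs = seq(0, 1, 0.25))
q_intervals <- cut(datacycle$speed, breaks = q_breaks, include.lowest = TRUE)
table(q_intervals)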
Creating Class Intervals in R
Let’s use the cut() function to create class intervals for a dataset of ages.
ages <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)
# Create class intervals using cut()
class_intervals <- cut(ages, breaks = c(20, 30, 40, 50, 60, 70), include.lowest = TRUE, right=FALSE)
cbind(table(class_intervals))
## [,1]
## [20,30) 2
## [30,40) 2
## [40,50) 2
## [50,60) 2
## [60,70] 3
Drawing Histogram
quantity <- c(45, 48, 52, 57, 46, 48, 47, 48, 45, 52, 51, 53, 51, 50, 45, 43, 52, 42, 45, 48, 57, 51, 58, 55, 54, 44, 47, 53, 53, 58, 52, 47, 48, 52, 48, 53, 42, 48, 57, 56, 57, 45, 49, 47, 55, 44, 46, 56, 51, 44, 52, 55, 46, 51, 49, 59, 57, 42, 57, 44, 43)
hist(quantity,breaks=c(40,45,50,55,60,65,70))
The purpose of studying the distribution of data is to understand its:
Shape: the overall appearance of the histogram; it can be symmetric, bell-shaped, left-skewed, right-skewed, etc.
Center: the mean or median.
Spread: how far the data spread, measured by the range, the interquartile range (IQR), the standard deviation, or the variance.
Outliers: data points that fall far from the bulk of the data (see the sketch below).
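A minimal sketch computing the center, spread, and boxplot-rule outliers of speed (assuming datacycle is loaded):
mean(datacycle$speed); median(datacycle$speed)  # center
sd(datacycle$speed); IQR(datacycle$speed)       # spread
boxplot.stats(datacycle$speed)$out              # values flagged by the 1.5 * IQR boxplot rule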
Drawing the mean and median on a histogram
hist(datacycle$speed, breaks=20)
abline(v = mean(datacycle$speed) , col = "red", lwd = 2)
abline(v=median(datacycle$speed), col="green", lwd = 3)
A measure of dispersion, also known as a measure of spread, is a statistic that describes the variability or spread of values in a dataset. It provides information about how much the individual data points differ from the central tendency (e.g., mean, median) of the dataset. Measures of dispersion are essential for understanding the distribution of data and assessing the level of uncertainty or variability within the dataset.
The common measures of dispersion are the range, the interquartile range (IQR), and the standard deviation.
The range covered by the data is the most intuitive measure of variability: it is exactly the distance between the smallest data point (min) and the largest one (max).
Range = max − min
range(datacycle$speed)
## [1] 4.29 30.84
Note that range() returns the smallest and largest values rather than their difference. To obtain the range as a single number, compute the difference explicitly:
max(datacycle$speed)-min(datacycle$speed)
## [1] 26.55
While the range quantifies the variability by looking at the range covered by ALL the data, the Inter-Quartile Range or IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.
IQR = Q3 – Q1
Q3 = 3rd Quartile = 75th Percentile
Q1 = 1st Quartile = 25th Percentile
The IQR() function is used to find the IQR:
IQR(datacycle$speed)
## [1] 3.78
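The same value can be computed from the quartiles directly (quantile() has several estimation types; its default matches what IQR() uses):
# IQR as the difference between the 75th and 25th percentiles
quartiles <- quantile(datacycle$speed, probs = c(0.25, 0.75))
unname(quartiles[2] - quartiles[1])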
Standard Deviation
Standard deviation is a statistical measure of the dispersion or spread of a set of data points. It quantifies the average amount of variation or deviation of individual data points from the mean of the dataset. In other words, it measures how much the data points are spread out around the mean.
Mathematically, the standard deviation \(\sigma\) of a dataset is calculated as the square root of the variance. The variance \(\sigma^2\) is the average of the squared differences between each data point and the mean of the dataset.
The formula for calculating the standard deviation is:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]
Where: - \(x_i\) represents each individual data point. - \(\bar{x}\) represents the mean of the dataset. - \(n\) represents the total number of data points.
This formula is known as the population standard deviation.
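As a sketch, the population formula translates directly into R (assuming datacycle is loaded); the result should match the value obtained later by rescaling the sample SD:
x <- datacycle$speed
sqrt(sum((x - mean(x))^2) / length(x))  # population standard deviation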
Key points about standard deviation:
Standard deviation is expressed in the same units as the original data.
A smaller standard deviation indicates that the data points are closer to the mean and less spread out, while a larger standard deviation indicates that the data points are more spread out.
Standard deviation is heavily influenced by outliers or extreme values in the dataset.
The formula for calculating the sample standard deviation is:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
Most software uses this formula, and R's sd() function likewise calculates the sample standard deviation:
sd(datacycle$speed)
## [1] 3.93216
To find the population SD from the sample SD \(s\), we can use \[ \sigma = \sqrt{\frac{n-1}{n}} \, s \]
n=length(datacycle$speed)
sdp= sqrt((n-1)/n)*sd(datacycle$speed)
cat("Standard deviation of population is", sdp)
## Standard deviation of population is 3.915877
Skewness:
Skewness measures the asymmetry of the distribution of data around its mean. It indicates whether the data is skewed to the left or right relative to the mean.
Kurtosis:
Kurtosis measures the peakedness or flatness of the distribution of data relative to a normal distribution. It indicates whether the distribution is more or less peaked (leptokurtic) or flat (platykurtic) compared to a normal distribution.
In summary, skewness measures the symmetry of the distribution, while kurtosis measures its peakedness or flatness relative to a normal distribution. Both provide insights into the shape and characteristics of the data. We use the moments package to calculate skewness and kurtosis:
library(moments)
skewness(datacycle$speed)
## [1] 1.264264
kurtosis(datacycle$speed)
## [1] 7.282869
If you don't have the moments package, you can implement the functions yourself from the definitions:
# Define a function to calculate skewness
calculate_skewness <- function(x) {
n <- length(x)
mean_x <- mean(x)
sd_x <- sqrt(sum((x - mean_x)^2) / n)  # population SD (divides by n), matching the moments results above
skewness <- (sum((x - mean_x)^3) / n) / (sd_x^3)
return(skewness)
}
# Define a function to calculate kurtosis
calculate_kurtosis <- function(x) {
n <- length(x)
mean_x <- mean(x)
sd_x <- sqrt(sum((x - mean_x)^2) / n)  # population SD (divides by n)
kurtosis <- (sum((x - mean_x)^4) / n) / (sd_x^4)
return(kurtosis)
}
calculate_skewness(datacycle$speed)
## [1] 1.264264
calculate_kurtosis(datacycle$speed)
## [1] 7.282869