1 Introduction

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

Exploratory data analysis (EDA) methods are often called Descriptive Statistics because they simply describe, or provide estimates based on, the data at hand.

1.1 Exploratory Data Analysis

EDA consists of:

  • Organizing and summarizing the raw data,
  • Discovering important features and patterns in the data, and any striking deviations from those patterns,
  • Interpreting our findings in the context of the problem

And can be useful for:

  • Describing the distribution of a single variable (center, spread, shape, outliers)
  • Checking data (for errors or other problems)
  • Checking assumptions for more complex statistical analyses
  • Investigating relationships between variables

1.2 Features of EDA

  • This notebook covers two broad topics:
    • Examining Distributions — exploring data one variable at a time.
    • Examining Relationships — exploring data two variables at a time.
  • In Exploratory Data Analysis, our exploration of data will always consist of the following two elements:
    • Visual displays
    • Numerical measures.

1.3 Working with data

For practice, we will load the data from the GitHub source https://raw.githubusercontent.com/BijayLalPradhan/D4P/main/BikeData.csv

datacycle=read.csv("https://raw.githubusercontent.com/BijayLalPradhan/D4P/main/BikeData.csv")
head(datacycle)
##   user_id age gender student employed      cyc_freq distance time speed
## 1       1  28      M       1        1         Daily     3.25   15 13.00
## 2       2  35      M       0        1         Daily     1.11    5 13.32
## 3       3  28      M       0        1         Daily     5.59   23 14.58
## 4       4  44      F       0        1 <once a month     3.24   24  8.10
## 5       5  42      M       0        1   >> per week     7.81   26 18.02
## 6       6  36      M       0        1   >> per week     3.00   20  9.00
##   abs_days
## 1        3
## 2        2
## 3        0
## 4        3
## 5        4
## 6        4

Examine the structure of the variables and observations:

str(datacycle)
## 'data.frame':    121 obs. of  10 variables:
##  $ user_id : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age     : int  28 35 28 44 42 36 45 54 39 44 ...
##  $ gender  : chr  "M" "M" "M" "F" ...
##  $ student : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ employed: int  1 1 1 1 1 1 1 1 1 0 ...
##  $ cyc_freq: chr  "Daily" "Daily" "Daily" "<once a month" ...
##  $ distance: num  3.25 1.11 5.59 3.24 7.81 ...
##  $ time    : int  15 5 23 24 26 20 51 39 50 44 ...
##  $ speed   : num  13 13.3 14.6 8.1 18 ...
##  $ abs_days: int  3 2 0 3 4 4 5 0 5 3 ...

1.4 Analysis of one categorical variable

  • Distribution of One Categorical Variable
  • Numerical Summaries
    • One-way Frequency Table (Counts)
    • One-way Frequency Table (Percentages)
    • One-way Frequency Table (Combination of Counts and Percentages)
  • Visual or Graphical Displays
    • Bar Chart - Great for categorical data visualization
    • Pie Chart - Use with caution for summarizing categorical data

We are interested in answering questions like:

What percentage of the survey respondents are male and female?

How are the respondents distributed among the different cycling-frequency categories? Do the percentages follow some kind of pattern?

1.5 Numerical measures

In order to summarize the distribution of a categorical variable, we first create a table of the different values (categories) the variable takes, how many times each value occurs (count) and, more importantly, how often each value occurs (by converting the counts to percentages).

  • The result is often called a Frequency Distribution or Frequency Table.
  • A Frequency Distribution or Frequency Table is the primary set of numerical measures for one categorical variable.
  • Consists of a table with each category along with the count and percentage for each category.
  • Provides a summary of the distribution for one categorical variable.

One-way Frequency Table (Counts)

cbind(table(datacycle$gender))
##   [,1]
## F   31
## M   90

We can also express the counts as percentages:

cbind(100*((table(datacycle$gender)/sum(table(datacycle$gender)))))
##       [,1]
## F 25.61983
## M 74.38017
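
The combined counts-and-percentages table listed earlier can be produced by binding both columns together; a minimal sketch (the column names Count and Percent are my own choice):

counts <- table(datacycle$gender)               # counts per category
pct <- 100 * counts / sum(counts)               # convert counts to percentages
cbind(Count = counts, Percent = round(pct, 2))  # combine into one table
##   Count Percent
## F    31   25.62
## M    90   74.38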

Visual / Graphical Displays

There are two simple graphical displays for visualizing the distribution of one categorical variable:

  • Bar Charts
  • Pie Charts

Bar Charts

  • To describe the number of observations in each category of the categorical variable

barplot(table(datacycle$cyc_freq))

Sorted frequency bar chart:

freq_table <- table(datacycle$cyc_freq)
sorted_freq <- sort(freq_table, decreasing = TRUE)
barplot(sorted_freq, main = "Bar Chart of Frequencies", xlab = "Frequency of cycling", ylab = "Count", col = "green")

Adding the counts above the bars:

sorted_freq <- sort(table(datacycle$cyc_freq), decreasing = TRUE)
barplot(sorted_freq, ylim=c(0,80), main = "Bar Chart of Frequencies", xlab = "Frequency of cycling", ylab = "Count")
# Add numbers above the bars
text(seq_along(sorted_freq), sorted_freq, labels = sorted_freq, pos = 3,  col = "blue")

You can also draw a pie chart for the data:

pie(table(datacycle$cyc_freq))
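
Pie charts are easier to read with the categories and their shares attached; a possible sketch (the label format is my own choice):

freq <- table(datacycle$cyc_freq)
lbls <- paste0(names(freq), " (", round(100 * freq / sum(freq), 1), "%)")  # "category (percent%)"
pie(freq, labels = lbls, main = "Cycling frequency")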

1.6 Analysis of one quantitative variable

There are two types of quantitative variables:

  • Discrete variables
  • Continuous variables

Discrete variables

Discrete variables are variables that can only take on a finite or countable number of values. These values are typically distinct and separate from each other, with no values in between. Discrete variables often represent counts.

Examples of discrete variables include:

  • Number of siblings: This variable can only take on integer values such as 0, 1, 2, 3, etc. It cannot take on non-integer values like 1.5 or 2.7.

  • Number of cars in a parking lot: Similar to the number of siblings, this variable can only take on whole number values. It cannot have fractions or non-integer values.

Continuous Data

Continuous data refers to data that can take on any value within a certain range and can be measured with any level of precision. It is characterized by an infinite number of possible values between any two points.

In practical terms, continuous data can be measured and represented at any level of precision, including fractions and decimals. Continuous data is often obtained through measurement rather than counting. It is typically represented using real numbers.

Examples of continuous data include:

  • Height: Height can be measured to any level of precision using a ruler or measuring tape. It can be 5 feet 6 inches, 5.5 feet, or even more precise measurements such as 5 feet 6.25 inches.

  • Weight: Weight can be measured on a scale to any level of precision, such as 150 pounds, 150.5 pounds, or 150.75 pounds.

  • Temperature: Temperature can be measured using a thermometer to any level of precision, such as 75 degrees Fahrenheit, 75.5 degrees Fahrenheit, or 75.75 degrees Fahrenheit.

Continuous data is often analyzed using statistical methods appropriate for continuous variables, such as calculating means, standard deviations, and performing regression analysis.

Numerical Measures

2 Measures of Center

  • Mean
  • Median
  • Mode

2.0.1 Mean

The “mean,” often referred to as the “average,” is a measure of central tendency that represents the typical value of a set of numbers. It is calculated by summing up all the values in a dataset and then dividing by the total number of values.

Mathematically, the mean (\(\bar{x}\)) of n numbers x₁, x₂, ..., xₙ is given by:

\(\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n}\)

In words, you add up all the values and then divide by the total number of values.

For example, consider the following set of numbers: 4, 6, 8, 10, and 12.

To find the mean: \(\bar{x} = \frac{4 + 6 + 8 + 10 + 12}{5} = \frac{40}{5} = 8\). So, the mean of this dataset is 8.
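
We can verify this small example directly in R:

mean(c(4, 6, 8, 10, 12))
## [1] 8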

The mean is a useful measure because it gives a single value that represents the central tendency of the data. However, it can be influenced by extreme values, which may not be representative of the majority of the data. Therefore, it’s often used in conjunction with other measures of central tendency, such as the median and mode, to provide a more comprehensive understanding of the data.

The mean() function is used to find the mean value:

mean(datacycle$distance)
## [1] 5.990661
cat("The mean value of distance is", mean(datacycle$distance))
## The mean value of distance is 5.990661

Logical calculation

mean = \(\frac{\text{sum of all observations}}{\text{number of observations}}\) = \(\frac{\text{sum\_data}}{\text{length\_data}}\)

sum_data=sum(datacycle$distance)
length_data=length(datacycle$distance)
sum_data/length_data # mean of distance
## [1] 5.990661

2.0.2 Median

The median is a measure of central tendency that represents the middle value of a dataset when it’s ordered from least to greatest. To find the median:

  1. Arrange the data in ascending order.
  2. If the number of observations (\(n\)) is odd, the median is the middle value.
  3. If the number of observations (\(n\)) is even, the median is the average of the two middle values.

For example, consider the dataset: 3, 6, 1, 9, 2, 5, 8, 7, 4.

  1. Arrange the data in ascending order: 1, 2, 3, 4, 5, 6, 7, 8, 9.
  2. Since there are 9 observations (odd), the median is the middle value, which is 5.

Another example with an even number of observations: 2, 4, 6, 8, 10, 12.

  1. Arrange the data in ascending order: 2, 4, 6, 8, 10, 12.
  2. Since there are 6 observations (even), the median is the average of the two middle values: (6 + 8) / 2 = 7.

The median is a robust measure of central tendency because it is not affected by extreme values (outliers) in the dataset, unlike the mean. It’s often used when the data is skewed or when there are outliers present.
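
Both worked examples can be checked directly in R before applying median() to a larger dataset:

median(c(3, 6, 1, 9, 2, 5, 8, 7, 4))  # odd n: the middle value
## [1] 5
median(c(2, 4, 6, 8, 10, 12))         # even n: average of the two middle values
## [1] 7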

quantity <- c(45, 48, 52, 57, 40, 48, 44, 48, 45, 52, 51, 53, 51, 50, 45, 43, 52, 42, 45, 48, 57, 51, 58, 55, 54, 44, 47, 53, 53, 58, 52, 47, 48, 52, 48, 53, 42, 48, 57, 56, 57, 45, 49, 47, 55, 44, 46, 56, 51, 44, 52, 55, 46, 51, 49, 59, 57, 42, 57, 44, 43)
median(quantity)
## [1] 50

2.0.3 Mode

The mode is a measure of central tendency that represents the value that appears most frequently in a dataset. Unlike the mean and median, which represent typical values of a dataset, the mode represents the most common value or values. A dataset can have one mode, called unimodal, or multiple modes, called multimodal.

For example, consider the dataset: 2, 3, 3, 5, 5, 5, 7, 8, 8, 9. In this dataset, the mode is 5 because it appears more frequently than any other value.

In some datasets, there may be no mode if all values occur with the same frequency, or if no value repeats. For example, in the dataset: 1, 2, 3, 4, 5, 6, each value occurs only once, so there is no mode.

By default, R does not provide a function for the statistical mode (the built-in mode() function returns an object's storage mode, not the most frequent value), so we write one ourselves. Here's an example:

# Define a function to find the mode
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
datap <- c(2, 3, 3, 5, 5, 5, 7, 8, 8, 9)
cat("Mode of the data is", Mode(datap))
## Mode of the data is 5
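
Note that Mode() returns only the first value attaining the maximum frequency, so it reports a single mode even for multimodal data. If all tied modes are wanted, a variant sketch (the name Modes is my own) is:

# Return every value tied for the highest frequency (handles multimodal data)
Modes <- function(x) {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))
  ux[counts == max(counts)]
}
Modes(c(1, 1, 2, 2, 3))
## [1] 1 2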

Presenting quantitative data

Presenting Discrete Data

barplot(table(datacycle$abs_days), xlab="Nos of days absent", ylab="Frequency")

Presenting Continuous Data

Continuous data are presented in class intervals. Common methods for forming the intervals include:

  1. Equal-width intervals: Divide the range of the data into equal-sized intervals. This method is simple and easy to implement but may not always be suitable for datasets with uneven distributions.

  2. Quantiles or percentiles: Divide the data into intervals based on quantiles or percentiles. For example, quartiles divide the data into four equal parts, while deciles divide it into ten equal parts. This method is useful for analyzing data distribution and identifying outliers (a quantile-based sketch appears after the cut() example below).

  3. Natural breaks (Jenks optimization): Use statistical algorithms to determine breakpoints that maximize the difference between groups while minimizing the variation within groups. This method is commonly used in geographic information systems (GIS) for choropleth mapping.

  4. Domain knowledge: Use domain-specific knowledge to determine meaningful intervals for the data. For example, age groups, income brackets, or temperature ranges may be predefined based on relevant criteria.

  5. Custom intervals: Define intervals based on specific criteria or requirements of the analysis. This approach allows for flexibility in grouping the data based on context-specific considerations.

The choice of classification method depends on the characteristics of the data, the objectives of the analysis, and the preferences of the analyst. It’s important to consider factors such as data distribution, sample size, interpretability, and the intended audience when selecting a classification method.

Creating Class Intervals in R

Let’s use the cut() function to create class intervals for a dataset of ages.

ages <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70)
# Create class intervals using cut()
class_intervals <- cut(ages, breaks = c(20, 30, 40, 50, 60, 70), include.lowest = TRUE, right=FALSE)
cbind(table(class_intervals))
##         [,1]
## [20,30)    2
## [30,40)    2
## [40,50)    2
## [50,60)    2
## [60,70]    3
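
Method 2 above (quantiles or percentiles) can be implemented the same way by passing quantile breakpoints to cut(); a sketch using the quartiles of speed (the variable names are my own choice):

q_breaks <- quantile(datacycle$speed, probs = seq(0, 1, 0.25))  # min, Q1, median, Q3, max
speed_groups <- cut(datacycle$speed, breaks = q_breaks, include.lowest = TRUE)
table(speed_groups)  # roughly equal counts in each interval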

Drawing a Histogram

quantity <- c(45, 48, 52, 57, 46, 48, 47, 48, 45, 52, 51, 53, 51, 50, 45, 43, 52, 42, 45, 48, 57, 51, 58, 55, 54, 44, 47, 53, 53, 58, 52, 47, 48, 52, 48, 53, 42, 48, 57, 56, 57, 45, 49, 47, 55, 44, 46, 56, 51, 44, 52, 55, 46, 51, 49, 59, 57, 42, 57, 44, 43)
hist(quantity,breaks=c(40,45,50,55,60,65,70))

The purpose of studying the distribution of data is to learn about its:

  • Shape: Overall appearance of the histogram. Can be symmetric, bell-shaped, left skewed, right skewed, etc.

  • Center: Mean or Median

  • Spread: How far our data spreads. Range, Interquartile Range (IQR), standard deviation, variance.

  • Outliers: Data points that fall far from the bulk of the data

Drawing the mean and median on a histogram:

hist(datacycle$speed, breaks=20)
abline(v = mean(datacycle$speed) , col = "red", lwd = 2)
abline(v = median(datacycle$speed), col = "green", lwd = 3)

3 Measures of Spread / Measures of Dispersion

A measure of dispersion, also known as a measure of spread, is a statistic that describes the variability or spread of values in a dataset. It provides information about how much the individual data points differ from the central tendency (e.g., mean, median) of the dataset. Measures of dispersion are essential for understanding the distribution of data and assessing the level of uncertainty or variability within the dataset.

The common measures of dispersion are:

  • Range
  • Inter-Quartile Range (IQR)
  • Standard Deviation

3.1 Range

The range covered by the data is the most intuitive measure of variability. The range is exactly the distance between the smallest data point (min) and the largest one (max).

Range = max – min

range(datacycle$speed)
## [1]  4.29 30.84

range() gives you the smallest and largest values. To obtain the range as the difference between the largest and smallest values, compute it explicitly:

max(datacycle$speed)-min(datacycle$speed)
## [1] 26.55
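
Equivalently, since range() returns the two endpoints, diff() computes the difference in one step:

diff(range(datacycle$speed))
## [1] 26.55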

3.2 Inter-Quartile Range (IQR)

While the range quantifies the variability by looking at the range covered by ALL the data, the Inter-Quartile Range or IQR measures the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.

IQR = Q3 – Q1

Q3 = 3rd Quartile = 75th Percentile

Q1 = 1st Quartile = 25th Percentile

The IQR() function is used to find the IQR:

IQR(datacycle$speed)
## [1] 3.78
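
The same value follows directly from the definition IQR = Q3 – Q1 using the quantile() function:

q <- quantile(datacycle$speed, probs = c(0.25, 0.75))  # Q1 and Q3
unname(q[2] - q[1])                                    # Q3 - Q1
## [1] 3.78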


3.3 Standard Deviation

Standard deviation is a statistical measure of the dispersion or spread of a set of data points. It quantifies the average amount of variation or deviation of individual data points from the mean of the dataset. In other words, it measures how much the data points are spread out around the mean.

Mathematically, the standard deviation \(\sigma\) of a dataset is calculated as the square root of the variance. The variance \(\sigma^2\) is the average of the squared differences between each data point and the mean of the dataset.

The formula for calculating the standard deviation is:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]

Where:

  • \(x_i\) represents each individual data point.
  • \(\bar{x}\) represents the mean of the dataset.
  • \(n\) represents the total number of data points.

This formula is known as the population standard deviation.

Key points about standard deviation:

  • Standard deviation is expressed in the same units as the original data.

  • A smaller standard deviation indicates that the data points are closer to the mean and less spread out, while a larger standard deviation indicates that the data points are more spread out.

  • Standard deviation is heavily influenced by outliers or extreme values in the dataset.

The formula for calculating the sample standard deviation is:

\[ S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]

Most statistical software uses this formula.

The sd() function in R also calculates the sample SD:

sd(datacycle$speed)
## [1] 3.93216

To find the population SD from the sample SD, we can use \[ \sigma = \sqrt{\frac{n-1}{n}} \cdot s \]

n=length(datacycle$speed)
sdp= sqrt((n-1)/n)*sd(datacycle$speed)
cat("Standard deviation of population is", sdp)
## Standard deviation of population is 3.915877
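
Alternatively, the population formula can be evaluated directly from its definition:

x <- datacycle$speed
sqrt(sum((x - mean(x))^2) / length(x))  # population SD computed from the formula
## [1] 3.915877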

4 Skewness and Kurtosis

Skewness:

Skewness measures the asymmetry of the distribution of data around its mean. It indicates whether the data is skewed to the left or right relative to the mean.

  • Positively skewed: the tail of the distribution extends to the right, indicating that the majority of data points are concentrated to the left of the mean. The skewness value is positive.
  • Negatively skewed: the tail of the distribution extends to the left, indicating that the majority of data points are concentrated to the right of the mean. The skewness value is negative.
  • A skewness value of 0 indicates a symmetrical distribution (perfectly symmetric around the mean).

Kurtosis:

Kurtosis measures the peakedness or flatness of the distribution of data relative to a normal distribution. It indicates whether the distribution is more or less peaked (leptokurtic) or flat (platykurtic) compared to a normal distribution.

  • Leptokurtic: a high peak and heavy tails, indicating that data points are concentrated around the mean with more extreme values than a normal distribution. The kurtosis value is greater than 3.
  • Mesokurtic: peakedness similar to a normal distribution. The kurtosis value is close to 3.
  • Platykurtic: a low peak and lighter tails, indicating that data points are more spread out with fewer extreme values than a normal distribution. The kurtosis value is less than 3.

In short, skewness measures the symmetry of the distribution, while kurtosis measures its peakedness or flatness relative to a normal distribution. Both provide insights into the shape and characteristics of the data. We use the moments package to calculate skewness and kurtosis:

library(moments)
skewness(datacycle$speed)
## [1] 1.264264
kurtosis(datacycle$speed)
## [1] 7.282869
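
As a quick sanity check (a simulation, assuming the moments package is already loaded), data drawn from a normal distribution should show skewness near 0 and kurtosis near 3:

set.seed(42)        # for reproducibility
z <- rnorm(100000)  # simulated standard normal data
skewness(z)         # approximately 0
kurtosis(z)         # approximately 3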

If you don't have the moments package, you can write the functions yourself from the definitions:

# Define a function to calculate skewness
calculate_skewness <- function(x) {
  n <- length(x)
  mean_x <- mean(x)
  sd_x <- sqrt(sum((x - mean_x)^2) / n )
  skewness <- (sum((x - mean_x)^3) / n) / (sd_x^3)
  return(skewness)
}

# Define a function to calculate kurtosis
calculate_kurtosis <- function(x) {
  n <- length(x)
  mean_x <- mean(x)
  sd_x <- sqrt(sum((x - mean_x)^2) / n )
  kurtosis <- (sum((x - mean_x)^4) / n) / (sd_x^4)
  return(kurtosis)
}

calculate_skewness(datacycle$speed)
## [1] 1.264264
calculate_kurtosis(datacycle$speed)
## [1] 7.282869