This tutorial is designed as a complete book-style R Markdown resource for students, researchers, data analysts, business analysts, public health researchers, and early-career statisticians.
Exploratory Data Analysis, commonly called EDA, is the first serious conversation we have with data. Before applying regression, machine learning, forecasting, Bayesian modelling, artificial intelligence, or policy analysis, we must understand the data.
EDA helps us answer:
This book covers EDA and descriptive statistics from basic to advanced level using R. This version is corrected for RPubs knitting and avoids common package-conflict errors.
Data are recorded observations, measurements, facts, or responses collected for analysis, interpretation, decision-making, or prediction.
Examples of data include student marks, patient blood pressure, daily temperature, customer sales, website clicks, and survey responses.
example_data <- tibble(
Student = c("A", "B", "C", "D", "E"),
Age = c(21, 24, 22, 25, 23),
Gender = c("Female", "Male", "Female", "Male", "Female"),
Score = c(78, 85, 69, 92, 81),
Attendance = c(90, 88, 76, 95, 84)
)
kable(example_data, caption = "Example of a Small Dataset") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Student | Age | Gender | Score | Attendance |
|---|---|---|---|---|
| A | 21 | Female | 78 | 90 |
| B | 24 | Male | 85 | 88 |
| C | 22 | Female | 69 | 76 |
| D | 25 | Male | 92 | 95 |
| E | 23 | Female | 81 | 84 |
Exploratory Data Analysis is the systematic process of inspecting, summarising, visualising, and questioning data to discover patterns, anomalies, relationships, data quality problems, and modelling assumptions.
EDA is not only about producing graphs. It is about thinking critically.
Critical EDA questions:
1. What is the unit of
observation?
2. What does each row represent?
3. What does each
column measure?
4. Are the values reasonable?
5. Are there
duplicated records?
6. Are there missing or impossible values?
7. Are the distributions symmetric, skewed, or multimodal?
8. Which
variables may explain the outcome of interest?
Skipping EDA is dangerous. A model can produce results even when the data contain errors, miscoded variables, missing values, outliers, or invalid assumptions.
EDA supports:
eda_confirmatory <- tibble(
Feature = c("Purpose", "Question Type", "Main Tools", "Output"),
EDA = c(
"Explore and understand data",
"What patterns exist?",
"Tables, graphs, summaries",
"Insights, assumptions, possible hypotheses"
),
Confirmatory_Analysis = c(
"Test specific hypotheses",
"Is there evidence for a specific claim?",
"Statistical tests and models",
"P-values, confidence intervals, model estimates"
)
)
kable(eda_confirmatory, caption = "EDA versus Confirmatory Analysis") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Feature | EDA | Confirmatory_Analysis |
|---|---|---|
| Purpose | Explore and understand data | Test specific hypotheses |
| Question Type | What patterns exist? | Is there evidence for a specific claim? |
| Main Tools | Tables, graphs, summaries | Statistical tests and models |
| Output | Insights, assumptions, possible hypotheses | P-values, confidence intervals, model estimates |
data_types <- tibble(
Type = c("Qualitative", "Quantitative"),
Meaning = c("Non-numerical categories", "Numerical measurements or counts"),
Examples = c("Gender, blood group, region", "Age, income, weight, marks"),
Common_Graphs = c("Bar chart, pie chart, mosaic plot", "Histogram, boxplot, density plot, scatterplot")
)
kable(data_types, caption = "Qualitative and Quantitative Data") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Type | Meaning | Examples | Common_Graphs |
|---|---|---|---|
| Qualitative | Non-numerical categories | Gender, blood group, region | Bar chart, pie chart, mosaic plot |
| Quantitative | Numerical measurements or counts | Age, income, weight, marks | Histogram, boxplot, density plot, scatterplot |
Discrete data are countable values. Continuous data are measurable values that can take many values within an interval.
Examples:
scales_table <- tibble(
Scale = c("Nominal", "Ordinal", "Interval", "Ratio"),
Description = c(
"Categories without natural order",
"Categories with meaningful order",
"Numerical scale with equal intervals but no true zero",
"Numerical scale with equal intervals and true zero"
),
Examples = c(
"Gender, region, blood group",
"Education level, satisfaction rating",
"Temperature in Celsius",
"Age, income, height, weight"
),
Suitable_Summary = c(
"Count, percentage, mode",
"Median, percentile, rank",
"Mean, SD, correlation",
"All arithmetic summaries"
)
)
kable(scales_table, caption = "Measurement Scales") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Scale | Description | Examples | Suitable_Summary |
|---|---|---|---|
| Nominal | Categories without natural order | Gender, region, blood group | Count, percentage, mode |
| Ordinal | Categories with meaningful order | Education level, satisfaction rating | Median, percentile, rank |
| Interval | Numerical scale with equal intervals but no true zero | Temperature in Celsius | Mean, SD, correlation |
| Ratio | Numerical scale with equal intervals and true zero | Age, income, height, weight | All arithmetic summaries |
The measurement scale determines the appropriate statistical method. A mean is meaningful for income but not meaningful for blood group.
scale_example <- tibble(
Variable = c("Blood Group", "Satisfaction Rating", "Temperature", "Income"),
Scale = c("Nominal", "Ordinal", "Interval", "Ratio"),
Poor_Summary = c("Mean blood group", "Mean may be weak", "Percentage only", "Mode only"),
Better_Summary = c("Frequency table", "Median and distribution", "Mean and SD", "Mean, median, SD, IQR")
)
kable(scale_example, caption = "Choosing Summaries by Measurement Scale") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Variable | Scale | Poor_Summary | Better_Summary |
|---|---|---|---|
| Blood Group | Nominal | Mean blood group | Frequency table |
| Satisfaction Rating | Ordinal | Mean may be weak | Median and distribution |
| Temperature | Interval | Percentage only | Mean and SD |
| Income | Ratio | Mode only | Mean, median, SD, IQR |
This book uses built-in R datasets so all examples run immediately.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## [1] 32 11
The mtcars dataset has rows representing cars and
columns representing car characteristics.
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Some numerical columns in mtcars are actually
categorical variables.
cars_data <- mtcars %>%
rownames_to_column("car_name") %>%
mutate(
cyl = factor(cyl),
am = factor(am, labels = c("Automatic", "Manual")),
gear = factor(gear),
carb = factor(carb)
)
glimpse(cars_data)## Rows: 32
## Columns: 12
## $ car_name <chr> "Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive",…
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 1…
## $ cyl <fct> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180,…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1…
## $ am <fct> Manual, Manual, Manual, Automatic, Automatic, Automatic, Auto…
## $ gear <fct> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3…
## $ carb <fct> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1…
Descriptive statistics are numerical and graphical techniques used to summarise the main features of a dataset.
They include:
psych::describe(mtcars) %>%
round(3) %>%
kable(caption = "Descriptive Statistics for mtcars") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 1 | 32 | 20.091 | 6.027 | 19.200 | 19.696 | 5.411 | 10.400 | 33.900 | 23.500 | 0.611 | -0.373 | 1.065 |
| cyl | 2 | 32 | 6.188 | 1.786 | 6.000 | 6.231 | 2.965 | 4.000 | 8.000 | 4.000 | -0.175 | -1.762 | 0.316 |
| disp | 3 | 32 | 230.722 | 123.939 | 196.300 | 222.523 | 140.476 | 71.100 | 472.000 | 400.900 | 0.382 | -1.207 | 21.909 |
| hp | 4 | 32 | 146.688 | 68.563 | 123.000 | 141.192 | 77.095 | 52.000 | 335.000 | 283.000 | 0.726 | -0.136 | 12.120 |
| drat | 5 | 32 | 3.597 | 0.535 | 3.695 | 3.579 | 0.704 | 2.760 | 4.930 | 2.170 | 0.266 | -0.715 | 0.095 |
| wt | 6 | 32 | 3.217 | 0.978 | 3.325 | 3.153 | 0.767 | 1.513 | 5.424 | 3.911 | 0.423 | -0.023 | 0.173 |
| qsec | 7 | 32 | 17.849 | 1.787 | 17.710 | 17.828 | 1.416 | 14.500 | 22.900 | 8.400 | 0.369 | 0.335 | 0.316 |
| vs | 8 | 32 | 0.438 | 0.504 | 0.000 | 0.423 | 0.000 | 0.000 | 1.000 | 1.000 | 0.240 | -2.002 | 0.089 |
| am | 9 | 32 | 0.406 | 0.499 | 0.000 | 0.385 | 0.000 | 0.000 | 1.000 | 1.000 | 0.364 | -1.925 | 0.088 |
| gear | 10 | 32 | 3.688 | 0.738 | 4.000 | 3.615 | 1.483 | 3.000 | 5.000 | 2.000 | 0.529 | -1.070 | 0.130 |
| carb | 11 | 32 | 2.812 | 1.615 | 2.000 | 2.654 | 1.483 | 1.000 | 8.000 | 7.000 | 1.051 | 1.257 | 0.286 |
custom_summary <- mtcars %>%
summarise(
mean_mpg = mean(mpg),
median_mpg = median(mpg),
sd_mpg = sd(mpg),
min_mpg = min(mpg),
max_mpg = max(mpg),
iqr_mpg = IQR(mpg),
skewness_mpg = skewness(mpg),
kurtosis_mpg = kurtosis(mpg)
)
kable(custom_summary, digits = 3, caption = "Custom Summary for MPG") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| mean_mpg | median_mpg | sd_mpg | min_mpg | max_mpg | iqr_mpg | skewness_mpg | kurtosis_mpg |
|---|---|---|---|---|---|---|---|
| 20.091 | 19.2 | 6.027 | 10.4 | 33.9 | 7.375 | 0.64 | 2.799 |
summary_many <- mtcars %>%
summarise(across(
where(is.numeric),
list(
mean = mean,
median = median,
sd = sd,
min = min,
max = max,
skew = skewness
),
.names = "{.col}_{.fn}"
))
summary_many[, 1:12] %>%
kable(digits = 3, caption = "Multiple Summary Measures") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| mpg_mean | mpg_median | mpg_sd | mpg_min | mpg_max | mpg_skew | cyl_mean | cyl_median | cyl_sd | cyl_min | cyl_max | cyl_skew |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20.091 | 19.2 | 6.027 | 10.4 | 33.9 | 0.64 | 6.188 | 6 | 1.786 | 4 | 8 | -0.183 |
The mean is the arithmetic average of all observations.
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]
## [1] 20.09062
The median is the middle value after observations are arranged in ascending order.
## [1] 19.2
The mode is the most frequently occurring value.
get_mode <- function(x){
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
get_mode(mtcars$cyl)## [1] 8
central_table <- tibble(
Statistic = c("Mean", "Median", "Mode"),
Best_For = c(
"Symmetric numerical data",
"Skewed numerical data",
"Categorical or repeated values"
),
Sensitive_to_Outliers = c("Yes", "No", "No"),
Example = c("Average height", "Median income", "Most common blood group")
)
kable(central_table, caption = "Mean, Median and Mode") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Statistic | Best_For | Sensitive_to_Outliers | Example |
|---|---|---|---|
| Mean | Symmetric numerical data | Yes | Average height |
| Median | Skewed numerical data | No | Median income |
| Mode | Categorical or repeated values | No | Most common blood group |
normal_values <- c(45, 48, 50, 51, 52, 53, 55)
outlier_values <- c(normal_values, 150)
outlier_effect <- tibble(
Dataset = c("Without Outlier", "With Outlier"),
Mean = c(mean(normal_values), mean(outlier_values)),
Median = c(median(normal_values), median(outlier_values)),
Difference = c(mean(normal_values) - median(normal_values),
mean(outlier_values) - median(outlier_values))
)
kable(outlier_effect, digits = 2, caption = "Effect of Outlier on Mean and Median") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Dataset | Mean | Median | Difference |
|---|---|---|---|
| Without Outlier | 50.57 | 51.0 | -0.43 |
| With Outlier | 63.00 | 51.5 | 11.50 |
outlier_df <- tibble(
value = c(normal_values, 150),
group = c(rep("Normal values", length(normal_values)), "Extreme outlier")
)
ggplot(outlier_df, aes(x = value)) +
geom_histogram(bins = 12, fill = "#5DADE2", color = "white") +
geom_vline(aes(xintercept = mean(value)), color = "#C0392B", linewidth = 1.2) +
geom_vline(aes(xintercept = median(value)), color = "#1E8449", linewidth = 1.2, linetype = "dashed") +
labs(
title = "Mean is Pulled Toward the Outlier",
subtitle = "Red line = mean, green dashed line = median",
x = "Value",
y = "Frequency"
) +
theme_minimal()For skewed data, report both mean and median. If they differ strongly, explain why.
Variance measures the average squared deviation from the mean.
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1} \]
## [1] 36.3241
The coefficient of variation compares standard deviation relative to the mean.
\[ CV = \frac{s}{\bar{x}} \times 100 \]
cv_mpg <- sd(mtcars$mpg) / mean(mtcars$mpg) * 100
cv_hp <- sd(mtcars$hp) / mean(mtcars$hp) * 100
tibble(
Variable = c("MPG", "Horsepower"),
CV_Percentage = c(cv_mpg, cv_hp)
) %>%
kable(digits = 2, caption = "Coefficient of Variation") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Variable | CV_Percentage |
|---|---|
| MPG | 30.00 |
| Horsepower | 46.74 |
cars_data %>%
group_by(cyl) %>%
summarise(
n = n(),
mean_mpg = mean(mpg),
sd_mpg = sd(mpg),
iqr_mpg = IQR(mpg),
cv_mpg = sd(mpg) / mean(mpg) * 100,
.groups = "drop"
) %>%
kable(digits = 2, caption = "MPG Dispersion by Cylinder Group") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| cyl | n | mean_mpg | sd_mpg | iqr_mpg | cv_mpg |
|---|---|---|---|---|---|
| 4 | 11 | 26.66 | 4.51 | 7.60 | 16.91 |
| 6 | 7 | 19.74 | 1.45 | 2.35 | 7.36 |
| 8 | 14 | 15.10 | 2.56 | 1.85 | 16.95 |
##
## 4 6 8
## 34.38 21.88 43.75
##
## Automatic Manual
## 4 3 8
## 6 4 3
## 8 12 2
##
## Automatic Manual
## 4 27.27 72.73
## 6 57.14 42.86
## 8 85.71 14.29
ggplot(cars_data, aes(x = cyl, fill = cyl)) +
geom_bar(color = "black") +
labs(
title = "Number of Cars by Cylinder Category",
x = "Cylinders",
y = "Count"
) +
theme_minimal() +
theme(legend.position = "none")ggplot(cars_data, aes(x = cyl, fill = am)) +
geom_bar(position = "fill", color = "black") +
scale_y_continuous(labels = percent) +
labs(
title = "Transmission Type within Cylinder Groups",
x = "Cylinders",
y = "Percentage",
fill = "Transmission"
) +
theme_minimal()For categorical variables, counts alone may be misleading. Percentages are often easier to compare across groups.
ggplot(cars_data, aes(x = mpg)) +
geom_histogram(bins = 10, fill = "#2E86C1", color = "white") +
labs(
title = "Histogram of Miles per Gallon",
x = "Miles per Gallon",
y = "Frequency"
) +
theme_minimal()ggplot(cars_data, aes(x = mpg)) +
geom_density(fill = "#58D68D", alpha = 0.7) +
labs(
title = "Density Plot of Miles per Gallon",
x = "Miles per Gallon",
y = "Density"
) +
theme_minimal()ggplot(cars_data, aes(y = mpg)) +
geom_boxplot(fill = "#F5B041", color = "black") +
labs(
title = "Boxplot of Miles per Gallon",
y = "Miles per Gallon"
) +
theme_minimal()ggplot(cars_data, aes(x = cyl, y = mpg, fill = cyl)) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_boxplot(width = 0.15, fill = "white") +
labs(
title = "MPG Distribution by Cylinder Group",
x = "Cylinders",
y = "Miles per Gallon"
) +
theme_minimal() +
theme(legend.position = "none")Skewness measures the asymmetry of a distribution.
\[ Skewness = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{s^3} \]
skew_table <- tibble(
Skewness_Value = c("Approximately 0", "Positive", "Negative", "Greater than 1", "Less than -1"),
Interpretation = c(
"Approximately symmetric",
"Right-skewed distribution",
"Left-skewed distribution",
"Strong positive skewness",
"Strong negative skewness"
),
Mean_Median_Relationship = c(
"Mean approximately equals median",
"Mean often greater than median",
"Mean often less than median",
"Mean can be strongly pulled right",
"Mean can be strongly pulled left"
)
)
kable(skew_table, caption = "Interpretation of Skewness") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Skewness_Value | Interpretation | Mean_Median_Relationship |
|---|---|---|
| Approximately 0 | Approximately symmetric | Mean approximately equals median |
| Positive | Right-skewed distribution | Mean often greater than median |
| Negative | Left-skewed distribution | Mean often less than median |
| Greater than 1 | Strong positive skewness | Mean can be strongly pulled right |
| Less than -1 | Strong negative skewness | Mean can be strongly pulled left |
skew_example <- tibble(
symmetric = rnorm(1000, mean = 50, sd = 10),
positive_skew = rexp(1000, rate = 0.1),
negative_skew = -rexp(1000, rate = 0.1)
) %>%
pivot_longer(cols = everything(), names_to = "Distribution", values_to = "Value")
ggplot(skew_example, aes(x = Value, fill = Distribution)) +
geom_histogram(bins = 30, color = "white", alpha = 0.8) +
facet_wrap(~Distribution, scales = "free") +
labs(
title = "Symmetric, Positive Skew and Negative Skew Distributions",
x = "Value",
y = "Frequency"
) +
theme_minimal() +
theme(legend.position = "none")skew_summary <- mtcars %>%
summarise(across(where(is.numeric), skewness)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness") %>%
arrange(desc(abs(Skewness)))
kable(skew_summary, digits = 3, caption = "Skewness of mtcars Variables") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Variable | Skewness |
|---|---|
| carb | 1.102 |
| hp | 0.761 |
| mpg | 0.640 |
| gear | 0.555 |
| wt | 0.444 |
| disp | 0.400 |
| qsec | 0.387 |
| am | 0.382 |
| drat | 0.279 |
| vs | 0.252 |
| cyl | -0.183 |
income <- rexp(1000, rate = 0.002)
income_df <- tibble(
income = income,
log_income = log(income)
)
p1 <- ggplot(income_df, aes(x = income)) +
geom_histogram(bins = 30, fill = "#3498DB", color = "white") +
labs(title = "Original Skewed Income", x = "Income", y = "Frequency") +
theme_minimal()
p2 <- ggplot(income_df, aes(x = log_income)) +
geom_histogram(bins = 30, fill = "#2ECC71", color = "white") +
labs(title = "Log-Transformed Income", x = "Log Income", y = "Frequency") +
theme_minimal()
gridExtra::grid.arrange(p1, p2, ncol = 2)If data are strongly right-skewed, log transformation can make patterns clearer. However, transformation changes interpretation, so always explain it.
Kurtosis describes the tail heaviness of a distribution.
## [1] 2.799467
kurtosis_table <- tibble(
Type = c("Mesokurtic", "Leptokurtic", "Platykurtic"),
Meaning = c(
"Similar to normal distribution",
"Heavy tails and sharper peak",
"Light tails and flatter peak"
),
Practical_Concern = c(
"Standard assumption may be reasonable",
"Extreme values are more likely",
"Extreme values are less common"
)
)
kable(kurtosis_table, caption = "Types of Kurtosis") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Type | Meaning | Practical_Concern |
|---|---|---|
| Mesokurtic | Similar to normal distribution | Standard assumption may be reasonable |
| Leptokurtic | Heavy tails and sharper peak | Extreme values are more likely |
| Platykurtic | Light tails and flatter peak | Extreme values are less common |
ggplot(cars_data, aes(sample = mpg)) +
stat_qq(color = "#2E86C1") +
stat_qq_line(color = "#D35400", linewidth = 1) +
labs(
title = "Normal Q-Q Plot for MPG",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()tail_data <- tibble(
Normal = rnorm(1000),
Heavy_Tailed = rt(1000, df = 3)
) %>%
pivot_longer(cols = everything(), names_to = "Distribution", values_to = "Value")
ggplot(tail_data, aes(x = Value, fill = Distribution)) +
geom_density(alpha = 0.5) +
labs(
title = "Normal versus Heavy-Tailed Distribution",
x = "Value",
y = "Density"
) +
theme_minimal()An outlier is an observation that is unusually far from the majority of the data.
Outliers may represent:
\[ Lower\ Fence = Q_1 - 1.5(IQR) \]
\[ Upper\ Fence = Q_3 + 1.5(IQR) \]
q1 <- quantile(mtcars$mpg, 0.25)
q3 <- quantile(mtcars$mpg, 0.75)
iqr_value <- IQR(mtcars$mpg)
lower_fence <- q1 - 1.5 * iqr_value
upper_fence <- q3 + 1.5 * iqr_value
c(lower_fence = lower_fence, upper_fence = upper_fence)## lower_fence.25% upper_fence.75%
## 4.3625 33.8625
## car mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
\[ z_i = \frac{x_i-\bar{x}}{s} \]
## integer(0)
A point may not be an outlier in one variable but may be unusual when several variables are considered together.
cars_numeric <- mtcars %>% dplyr::select(mpg, hp, wt, qsec)
center <- colMeans(cars_numeric)
cov_matrix <- cov(cars_numeric)
mahal_values <- mahalanobis(cars_numeric, center, cov_matrix)
outlier_cutoff <- qchisq(0.975, df = ncol(cars_numeric))
tibble(
car = rownames(mtcars),
mahalanobis_distance = mahal_values,
possible_outlier = mahal_values > outlier_cutoff
) %>%
arrange(desc(mahalanobis_distance)) %>%
head(10) %>%
kable(digits = 3, caption = "Multivariate Outlier Check using Mahalanobis Distance") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| car | mahalanobis_distance | possible_outlier |
|---|---|---|
| Maserati Bora | 13.804 | TRUE |
| Merc 230 | 13.429 | TRUE |
| Chrysler Imperial | 10.036 | FALSE |
| Toyota Corolla | 7.895 | FALSE |
| Fiat 128 | 6.877 | FALSE |
| Lincoln Continental | 5.786 | FALSE |
| Ford Pantera L | 5.458 | FALSE |
| Toyota Corona | 5.216 | FALSE |
| Lotus Europa | 5.131 | FALSE |
| Cadillac Fleetwood | 4.826 | FALSE |
Do not remove outliers automatically. First investigate whether they are errors or meaningful extreme values.
Covariance measures whether two variables move together, but its scale depends on the units of measurement.
\[ Cov(X,Y)=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{n-1} \]
## [1] -5.116685
\[ r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})} {\sqrt{\sum (x_i-\bar{x})^2}\sqrt{\sum (y_i-\bar{y})^2}} \]
## [1] -0.8676594
Spearman correlation is useful when the relationship is monotonic but not necessarily linear.
## [1] -0.886422
numeric_data <- mtcars %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(numeric_data)
corrplot(
cor_matrix,
method = "color",
type = "upper",
tl.cex = 0.8,
addCoef.col = "black",
number.cex = 0.7
)ggplot(cars_data, aes(x = wt, y = mpg)) +
geom_point(size = 3, color = "#1F618D") +
geom_smooth(method = "lm", se = TRUE, color = "#C0392B") +
labs(
title = "Relationship Between Car Weight and MPG",
x = "Weight",
y = "Miles per Gallon"
) +
theme_minimal()nonlinear_df <- tibble(
x = seq(-5, 5, length.out = 200),
y = x^2 + rnorm(200, 0, 3)
)
ggplot(nonlinear_df, aes(x = x, y = y)) +
geom_point(alpha = 0.7, color = "#2E86C1") +
geom_smooth(se = FALSE, color = "#C0392B") +
labs(
title = "Example of a Nonlinear Relationship",
x = "X",
y = "Y"
) +
theme_minimal()Correlation does not prove causation. A strong association does not automatically mean that one variable causes the other.
Missing data occur when values are not recorded, lost, unavailable, or intentionally skipped.
Missing data can cause:
missing_types <- tibble(
Type = c("MCAR", "MAR", "MNAR"),
Full_Name = c(
"Missing Completely at Random",
"Missing at Random",
"Missing Not at Random"
),
Meaning = c(
"Missingness unrelated to observed or unobserved data",
"Missingness related to observed variables",
"Missingness related to the missing value itself"
),
Example = c(
"Random machine error",
"Older people less likely to answer online survey",
"High-income people refusing to report income"
)
)
kable(missing_types, caption = "Types of Missing Data") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Type | Full_Name | Meaning | Example |
|---|---|---|---|
| MCAR | Missing Completely at Random | Missingness unrelated to observed or unobserved data | Random machine error |
| MAR | Missing at Random | Missingness related to observed variables | Older people less likely to answer online survey |
| MNAR | Missing Not at Random | Missingness related to the missing value itself | High-income people refusing to report income |
missing_data <- cars_data
missing_data$mpg[c(2, 5, 8)] <- NA
missing_data$hp[c(4, 10)] <- NA
colSums(is.na(missing_data))## car_name mpg cyl disp hp drat wt qsec
## 0 3 0 0 2 0 0 0
## vs am gear carb
## 0 0 0 0
missing_summary <- missing_data %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Percentage = Missing_Count / nrow(missing_data) * 100)
kable(missing_summary, caption = "Missing Data Summary") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Variable | Missing_Count | Missing_Percentage |
|---|---|---|
| car_name | 0 | 0.000 |
| mpg | 3 | 9.375 |
| cyl | 0 | 0.000 |
| disp | 0 | 0.000 |
| hp | 2 | 6.250 |
| drat | 0 | 0.000 |
| wt | 0 | 0.000 |
| qsec | 0 | 0.000 |
| vs | 0 | 0.000 |
| am | 0 | 0.000 |
| gear | 0 | 0.000 |
| carb | 0 | 0.000 |
missing_summary %>%
filter(Missing_Count > 0) %>%
ggplot(aes(x = reorder(Variable, Missing_Count), y = Missing_Count, fill = Variable)) +
geom_col(color = "black") +
coord_flip() +
labs(
title = "Missing Values by Variable",
x = "Variable",
y = "Missing Count"
) +
theme_minimal() +
theme(legend.position = "none")imputed_data <- missing_data %>%
mutate(
mpg = ifelse(is.na(mpg), median(mpg, na.rm = TRUE), mpg),
hp = ifelse(is.na(hp), median(hp, na.rm = TRUE), hp)
)
colSums(is.na(imputed_data))## car_name mpg cyl disp hp drat wt qsec
## 0 0 0 0 0 0 0 0
## vs am gear carb
## 0 0 0 0
Mean or median imputation is simple, but it can underestimate variability. For formal research, consider multiple imputation or model-based imputation.
Standardisation converts a variable into a scale with mean 0 and standard deviation 1.
\[ z = \frac{x-\bar{x}}{s} \]
standardized_data <- mtcars %>%
mutate(
mpg_z = as.numeric(scale(mpg)),
hp_z = as.numeric(scale(hp)),
wt_z = as.numeric(scale(wt))
)
head(standardized_data[, c("mpg", "mpg_z", "hp", "hp_z", "wt", "wt_z")])## mpg mpg_z hp hp_z wt wt_z
## Mazda RX4 21.0 0.1508848 110 -0.5350928 2.620 -0.610399567
## Mazda RX4 Wag 21.0 0.1508848 110 -0.5350928 2.875 -0.349785269
## Datsun 710 22.8 0.4495434 93 -0.7830405 2.320 -0.917004624
## Hornet 4 Drive 21.4 0.2172534 110 -0.5350928 3.215 -0.002299538
## Hornet Sportabout 18.7 -0.2307345 175 0.4129422 3.440 0.227654255
## Valiant 18.1 -0.3302874 105 -0.6080186 3.460 0.248094592
Transformations are useful when:
log_example <- tibble(
original = rexp(1000, rate = 0.1),
log_transformed = log(original)
)
log_example_long <- log_example %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(log_example_long, aes(x = Value, fill = Variable)) +
geom_histogram(bins = 30, color = "white") +
facet_wrap(~Variable, scales = "free") +
labs(
title = "Original and Log-Transformed Data",
x = "Value",
y = "Frequency"
) +
theme_minimal() +
theme(legend.position = "none")Regression can be used in EDA to understand direction, strength, and approximate functional form of relationships.
A simple linear regression model is:
\[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \]
where:
The intercept is the expected value of \(Y\) when \(X=0\). The slope is the expected change in \(Y\) for one-unit increase in \(X\).
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
broom::tidy(model_simple) %>%
kable(digits = 4, caption = "Simple Linear Regression Coefficients") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 37.2851 | 1.8776 | 19.8576 | 0 |
| wt | -5.3445 | 0.5591 | -9.5590 | 0 |
If the slope for wt is negative, it means that as car
weight increases, fuel efficiency decreases.
## (Intercept) wt
## 37.285126 -5.344472
For example, if the slope is approximately -5.34, then a one-unit increase in car weight is associated with an average decrease of about 5.34 miles per gallon.
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point(size = 3, color = "#1F618D") +
geom_smooth(method = "lm", se = TRUE, color = "#C0392B") +
labs(
title = "Simple Linear Regression: MPG versus Weight",
subtitle = "Regression line shows expected MPG for each value of weight",
x = "Weight",
y = "Miles per Gallon"
) +
theme_minimal()broom::glance(model_simple) %>%
kable(digits = 4, caption = "Model Fit Statistics") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.7528 | 0.7446 | 3.0459 | 91.3753 | 0 | 1 | -80.0147 | 166.0294 | 170.4266 | 278.3219 | 30 | 32 |
A residual is the difference between observed and fitted value.
\[ e_i = y_i - \hat{y}_i \]
## # A tibble: 6 × 9
## .rownames mpg wt .fitted .resid .hat .sigma .cooksd .std.resid
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 21 2.62 23.3 -2.28 0.0433 3.07 1.33e-2 -0.766
## 2 Mazda RX4 Wag 21 2.88 21.9 -0.920 0.0352 3.09 1.72e-3 -0.307
## 3 Datsun 710 22.8 2.32 24.9 -2.09 0.0584 3.07 1.54e-2 -0.706
## 4 Hornet 4 Drive 21.4 3.22 20.1 1.30 0.0313 3.09 3.02e-3 0.433
## 5 Hornet Sportabout 18.7 3.44 18.9 -0.200 0.0329 3.10 7.60e-5 -0.0668
## 6 Valiant 18.1 3.46 18.8 -0.693 0.0332 3.10 9.21e-4 -0.231
ggplot(reg_aug, aes(x = .fitted, y = .resid)) +
geom_point(size = 3, color = "#2E86C1") +
geom_hline(yintercept = 0, color = "#C0392B", linewidth = 1) +
labs(
title = "Residual Plot",
x = "Fitted Values",
y = "Residuals"
) +
theme_minimal()For a good linear model, residuals should not show a strong curve or funnel pattern.
A multiple linear regression model is:
\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \cdots + \beta_pX_{pi} + \epsilon_i \]
##
## Call:
## lm(formula = mpg ~ wt + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9290 -1.5598 -0.5311 1.1850 5.8986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
## wt -3.16697 0.74058 -4.276 0.000199 ***
## hp -0.01804 0.01188 -1.519 0.140015
## cyl -0.94162 0.55092 -1.709 0.098480 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
broom::tidy(model_multiple) %>%
kable(digits = 4, caption = "Multiple Regression Coefficients") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 38.7518 | 1.7869 | 21.6870 | 0.0000 |
| wt | -3.1670 | 0.7406 | -4.2764 | 0.0002 |
| hp | -0.0180 | 0.0119 | -1.5188 | 0.1400 |
| cyl | -0.9416 | 0.5509 | -1.7092 | 0.0985 |
In multiple regression, each coefficient represents the expected change in the outcome for one-unit increase in that predictor, while holding the other predictors constant.
confint(model_multiple) %>%
as.data.frame() %>%
rownames_to_column("Term") %>%
kable(digits = 4, caption = "Confidence Intervals for Regression Coefficients") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Term | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 35.0916 | 42.4120 |
| wt | -4.6840 | -1.6500 |
| hp | -0.0424 | 0.0063 |
| cyl | -2.0701 | 0.1869 |
Multicollinearity occurs when predictors are strongly correlated with each other.
## wt hp cyl
## 2.580486 3.258481 4.757456
resid_df <- tibble(residuals = residuals(model_multiple))
ggplot(resid_df, aes(sample = residuals)) +
stat_qq(color = "#2E86C1") +
stat_qq_line(color = "#C0392B") +
labs(
title = "Q-Q Plot of Regression Residuals",
x = "Theoretical Quantiles",
y = "Sample Quantiles"
) +
theme_minimal()A good EDA report should be systematic, reproducible, visual, and critical.
Recommended workflow:
quick_eda <- function(data){
numeric_part <- data %>% dplyr::select(where(is.numeric))
list(
dimensions = dim(data),
variable_names = names(data),
missing_values = colSums(is.na(data)),
numeric_summary = psych::describe(numeric_part)
)
}
quick_eda(mtcars)## $dimensions
## [1] 32 11
##
## $variable_names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
##
## $missing_values
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
##
## $numeric_summary
## vars n mean sd median trimmed mad min max range skew
## mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 23.50 0.61
## cyl 2 32 6.19 1.79 6.00 6.23 2.97 4.00 8.00 4.00 -0.17
## disp 3 32 230.72 123.94 196.30 222.52 140.48 71.10 472.00 400.90 0.38
## hp 4 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 283.00 0.73
## drat 5 32 3.60 0.53 3.70 3.58 0.70 2.76 4.93 2.17 0.27
## wt 6 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 3.91 0.42
## qsec 7 32 17.85 1.79 17.71 17.83 1.42 14.50 22.90 8.40 0.37
## vs 8 32 0.44 0.50 0.00 0.42 0.00 0.00 1.00 1.00 0.24
## am 9 32 0.41 0.50 0.00 0.38 0.00 0.00 1.00 1.00 0.36
## gear 10 32 3.69 0.74 4.00 3.62 1.48 3.00 5.00 2.00 0.53
## carb 11 32 2.81 1.62 2.00 2.65 1.48 1.00 8.00 7.00 1.05
## kurtosis se
## mpg -0.37 1.07
## cyl -1.76 0.32
## disp -1.21 21.91
## hp -0.14 12.12
## drat -0.71 0.09
## wt -0.02 0.17
## qsec 0.34 0.32
## vs -2.00 0.09
## am -1.92 0.09
## gear -1.07 0.13
## carb 1.26 0.29
Research Question: How are car weight, horsepower, transmission type, and number of cylinders related to fuel efficiency?
cars_data %>%
group_by(cyl, am) %>%
summarise(
n = n(),
mean_mpg = mean(mpg),
median_mpg = median(mpg),
sd_mpg = sd(mpg),
mean_hp = mean(hp),
mean_wt = mean(wt),
.groups = "drop"
) %>%
kable(digits = 2, caption = "Grouped Summary by Cylinder and Transmission") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| cyl | am | n | mean_mpg | median_mpg | sd_mpg | mean_hp | mean_wt |
|---|---|---|---|---|---|---|---|
| 4 | Automatic | 3 | 22.90 | 22.80 | 1.45 | 84.67 | 2.94 |
| 4 | Manual | 8 | 28.08 | 28.85 | 4.48 | 81.88 | 2.04 |
| 6 | Automatic | 4 | 19.12 | 18.65 | 1.63 | 115.25 | 3.39 |
| 6 | Manual | 3 | 20.57 | 21.00 | 0.75 | 131.67 | 2.76 |
| 8 | Automatic | 12 | 15.05 | 15.20 | 2.77 | 194.17 | 4.10 |
| 8 | Manual | 2 | 15.40 | 15.40 | 0.57 | 299.50 | 3.37 |
ggplot(cars_data, aes(x = wt, y = mpg, color = cyl, shape = am)) +
geom_point(size = 3) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "MPG Declines as Car Weight Increases",
subtitle = "Relationship shown by cylinder group and transmission type",
x = "Car Weight",
y = "Miles per Gallon",
color = "Cylinders",
shape = "Transmission"
) +
theme_minimal()##
## Call:
## lm(formula = mpg ~ wt + hp + cyl + am, data = cars_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## wt -2.49683 0.88559 -2.819 0.00908 **
## hp -0.03211 0.01369 -2.345 0.02693 *
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
broom::tidy(final_eda_model) %>%
kable(digits = 4, caption = "Final Exploratory Model Coefficients") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 33.7083 | 2.6049 | 12.9404 | 0.0000 |
| wt | -2.4968 | 0.8856 | -2.8194 | 0.0091 |
| hp | -0.0321 | 0.0137 | -2.3450 | 0.0269 |
| cyl6 | -3.0313 | 1.4073 | -2.1540 | 0.0407 |
| cyl8 | -2.1637 | 2.2843 | -0.9472 | 0.3523 |
| amManual | 1.8092 | 1.3963 | 1.2957 | 0.2065 |
The EDA suggests that heavier cars tend to have lower fuel efficiency. Cars with more cylinders generally show lower MPG. Horsepower and weight are positively related, while MPG and weight are negatively related. Regression provides a structured way to quantify these patterns.
Use the iris dataset and calculate mean, median,
standard deviation, skewness, and kurtosis for
Sepal.Length.
data(iris)
tibble(
Mean = mean(iris$Sepal.Length),
Median = median(iris$Sepal.Length),
SD = sd(iris$Sepal.Length),
Skewness = skewness(iris$Sepal.Length),
Kurtosis = kurtosis(iris$Sepal.Length)
) %>%
kable(digits = 3, caption = "Summary of Sepal Length") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Mean | Median | SD | Skewness | Kurtosis |
|---|---|---|---|---|
| 5.843 | 5.8 | 0.828 | 0.312 | 2.426 |
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "#5DADE2", color = "white") +
geom_density(color = "#C0392B", linewidth = 1.2) +
labs(
title = "Histogram and Density Plot of Sepal Length",
x = "Sepal Length",
y = "Density"
) +
theme_minimal()ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
labs(
title = "Sepal Length by Iris Species",
x = "Species",
y = "Sepal Length"
) +
theme_minimal() +
theme(legend.position = "none")iris_numeric <- iris %>% dplyr::select(where(is.numeric))
corrplot(cor(iris_numeric), method = "color", addCoef.col = "black")iris_model <- lm(Sepal.Length ~ Petal.Length + Petal.Width + Species, data = iris)
summary(iris_model)##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Petal.Width + Species,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75238 -0.23089 -0.00211 0.23100 1.03108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.682982 0.107403 34.291 < 2e-16 ***
## Petal.Length 0.905946 0.074311 12.191 < 2e-16 ***
## Petal.Width -0.005995 0.156260 -0.038 0.969
## Speciesversicolor -1.598362 0.205706 -7.770 1.32e-12 ***
## Speciesvirginica -2.112647 0.304024 -6.949 1.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3392 on 145 degrees of freedom
## Multiple R-squared: 0.8367, Adjusted R-squared: 0.8322
## F-statistic: 185.8 on 4 and 145 DF, p-value: < 2.2e-16
broom::tidy(iris_model) %>%
kable(digits = 4, caption = "Regression Model for Iris Data") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.6830 | 0.1074 | 34.2913 | 0.0000 |
| Petal.Length | 0.9059 | 0.0743 | 12.1913 | 0.0000 |
| Petal.Width | -0.0060 | 0.1563 | -0.0384 | 0.9694 |
| Speciesversicolor | -1.5984 | 0.2057 | -7.7701 | 0.0000 |
| Speciesvirginica | -2.1126 | 0.3040 | -6.9489 | 0.0000 |
## Ozone Solar.R Wind Temp
## Min. : 1.0 Min. : 7.0 Min. : 2.30 Min. :57.00
## 1st Qu.: 18.0 1st Qu.:113.5 1st Qu.: 7.40 1st Qu.:71.00
## Median : 31.0 Median :207.0 Median : 9.70 Median :79.00
## Mean : 42.1 Mean :184.8 Mean : 9.94 Mean :77.79
## 3rd Qu.: 62.0 3rd Qu.:255.5 3rd Qu.:11.50 3rd Qu.:84.50
## Max. :168.0 Max. :334.0 Max. :20.70 Max. :97.00
## Month Day
## Min. :5.000 Min. : 1.00
## 1st Qu.:6.000 1st Qu.: 9.00
## Median :7.000 Median :16.00
## Mean :7.216 Mean :15.95
## 3rd Qu.:9.000 3rd Qu.:22.50
## Max. :9.000 Max. :31.00
ggplot(airquality_clean, aes(x = Temp, y = Ozone)) +
geom_point(color = "#2E86C1", size = 3) +
geom_smooth(method = "lm", color = "#C0392B") +
labs(
title = "Ozone and Temperature",
x = "Temperature",
y = "Ozone"
) +
theme_minimal()##
## Call:
## lm(formula = Ozone ~ Temp + Wind + Solar.R, data = airquality_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.219 -3.551 10.097 95.619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.34208 23.05472 -2.791 0.00623 **
## Temp 1.65209 0.25353 6.516 2.42e-09 ***
## Wind -3.33359 0.65441 -5.094 1.52e-06 ***
## Solar.R 0.05982 0.02319 2.580 0.01124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.18 on 107 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
## F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
This book-style tutorial covered the full foundation of Exploratory Data Analysis and Descriptive Statistics using R, from basic summaries to regression-based exploratory modelling.
You learned how to:
The best data analysts do not simply run code. They ask critical questions, inspect assumptions, compare summaries with graphs, and explain results clearly.
Please email Dr Hom at homnath1988@gmail.com. I hope you love numbers like me..