Report Overview

This report introduces the basics of Exploratory Data Analysis (EDA), Descriptive Analysis, and Diagnostic Analysis using R. The goal is to equip you with the skills to summarise, visualise, and draw inferences from data.

Learning Objectives

Activity 1: Data Import and Cleaning

# Load dataset
data <- mtcars

# Check structure and summary
str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(data)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Load dplyr for wrangling
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Clean data (remove NAs if any)
clean_data <- data %>% na.omit()

# Filter rows where mpg > 20
filtered_data <- clean_data %>% filter(mpg > 20)

# Add a new calculated column
modified_data <- filtered_data %>% mutate(hp_per_cyl = hp / cyl)

Discussion:
- What does the structure and summary output tell us about this dataset?
- Why is removing missing data important?
- How can new variables like hp_per_cyl help in deeper analysis?
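
One way to approach the missing-data question is to count the missing values before dropping anything. The sketch below uses only base R; `na_counts` and `rows_dropped` are illustrative names that do not appear in the activity itself.

```r
# Illustrative check (not part of the original activity):
# count missing values per column before calling na.omit()
data <- mtcars
na_counts <- colSums(is.na(data))
print(na_counts)

# Compare row counts before and after na.omit()
rows_dropped <- nrow(data) - nrow(na.omit(data))
print(rows_dropped)
```

For `mtcars` every count is zero, so `na.omit()` leaves the data unchanged here. On a dataset with gaps, the same check shows how many rows the cleaning step would cost.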

Activity 2: Exploratory Data Analysis

# View sample data
head(clean_data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(clean_data)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Load ggplot2 for visualisations
library(ggplot2)
# Histogram of mpg
ggplot(clean_data, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  theme_minimal()

# Boxplot of hp
ggplot(clean_data, aes(y = hp)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  theme_minimal()

# Scatter plot of wt vs mpg
ggplot(clean_data, aes(x = wt, y = mpg)) +
  geom_point(color = "red") +
  theme_minimal() +
  labs(title = "Scatter Plot of Weight vs. MPG")

Discussion:
- What patterns can be seen in the mpg distribution?
- Are there any outliers in horsepower?
- What kind of relationship is evident between weight and mpg?
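
The outlier and relationship questions can also be checked numerically. The sketch below applies the 1.5 × IQR rule that box plots conventionally use for their whiskers, then computes the Pearson correlation between weight and mpg. The names `fence_low`, `fence_high`, `hp_outliers`, and `r_wt_mpg` are illustrative.

```r
# Flag hp values outside the 1.5 * IQR fences (the usual box plot whisker rule)
hp <- mtcars$hp
q <- quantile(hp, c(0.25, 0.75))
fence_low  <- q[1] - 1.5 * IQR(hp)
fence_high <- q[2] + 1.5 * IQR(hp)
hp_outliers <- mtcars[hp < fence_low | hp > fence_high, "hp", drop = FALSE]
print(hp_outliers)

# Quantify the weight-mpg relationship seen in the scatter plot
r_wt_mpg <- cor(mtcars$wt, mtcars$mpg)
print(round(r_wt_mpg, 2))
```

Under this rule the only flagged value is the maximum, 335 hp. The correlation of about -0.87 agrees with the downward trend in the scatter plot.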

Activity 3: Descriptive Analysis

# Descriptive statistics using psych package
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(clean_data)
##      vars  n   mean     sd median trimmed    mad   min    max  range  skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50  0.61
## cyl     2 32   6.19   1.79   6.00    6.23   2.97  4.00   8.00   4.00 -0.17
## disp    3 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90  0.38
## hp      4 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00  0.73
## drat    5 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17  0.27
## wt      6 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91  0.42
## qsec    7 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40  0.37
## vs      8 32   0.44   0.50   0.00    0.42   0.00  0.00   1.00   1.00  0.24
## am      9 32   0.41   0.50   0.00    0.38   0.00  0.00   1.00   1.00  0.36
## gear   10 32   3.69   0.74   4.00    3.62   1.48  3.00   5.00   2.00  0.53
## carb   11 32   2.81   1.62   2.00    2.65   1.48  1.00   8.00   7.00  1.05
##      kurtosis    se
## mpg     -0.37  1.07
## cyl     -1.76  0.32
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32
## vs      -2.00  0.09
## am      -1.92  0.09
## gear    -1.07  0.13
## carb     1.26  0.29
# Grouped statistics by cylinder count
clean_data %>%
  group_by(cyl) %>%
  summarise(
    count = n(),
    mean_mpg = mean(mpg),
    sd_mpg = sd(mpg),
    median_mpg = median(mpg),
    min_mpg = min(mpg),
    max_mpg = max(mpg)
  )
## # A tibble: 3 × 7
##     cyl count mean_mpg sd_mpg median_mpg min_mpg max_mpg
##   <dbl> <int>    <dbl>  <dbl>      <dbl>   <dbl>   <dbl>
## 1     4    11     26.7   4.51       26      21.4    33.9
## 2     6     7     19.7   1.45       19.7    17.8    21.4
## 3     8    14     15.1   2.56       15.2    10.4    19.2

Optional Visualisation:

# Mean mpg by number of cylinders
clean_data %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col(fill = "skyblue") +
  labs(x = "Number of Cylinders", y = "Mean MPG", title = "Mean MPG by Cylinders") +
  theme_minimal()

Discussion:
- What do the summary stats tell us about central tendency and spread?
- How does mpg differ by cylinder count?
- Why do we care about mean and standard deviation when comparing groups?
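
To see why the mean and standard deviation matter together, it can help to turn them into a standard error for each group mean. The sketch below is a base R version, written without dplyr so it runs on its own. `mpg_by_cyl`, `group_stats`, and `se_mpg` are illustrative names.

```r
# Base R sketch: mean, SD and standard error of mpg per cylinder group
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)   # list with elements "4", "6", "8"
group_stats <- data.frame(
  cyl      = names(mpg_by_cyl),
  n        = sapply(mpg_by_cyl, length),
  mean_mpg = sapply(mpg_by_cyl, mean),
  sd_mpg   = sapply(mpg_by_cyl, sd)
)

# Standard error: how precisely each group mean is estimated
group_stats$se_mpg <- group_stats$sd_mpg / sqrt(group_stats$n)
print(group_stats, digits = 3)
```

A difference between two group means is more convincing when it is large relative to these standard errors. That depends on both the spread within each group and the group size.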

Activity 4: Diagnostic Analysis

# Correlation matrix
cor_matrix <- round(cor(clean_data), 2)
print(cor_matrix)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
# Correlation plot
library(corrplot)
## corrplot 0.95 loaded
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8)

# Relationship: hp vs mpg
ggplot(clean_data, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  theme_minimal() +
  labs(title = "HP vs MPG with Linear Fit")
## `geom_smooth()` using formula = 'y ~ x'

# Boxplot: mpg by gear count
ggplot(clean_data, aes(x = factor(gear), y = mpg)) +
  geom_boxplot(fill = "orange") +
  labs(x = "Number of Gears", y = "MPG", title = "MPG by Gear Count") +
  theme_minimal()

Discussion:
- Which variables are highly correlated?
- What does the linear fit between hp and mpg show?
- Are there performance differences in cars with different gear counts?
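
The first two questions can be answered directly from the numbers rather than by reading the plots. The sketch below lists the variable pairs whose absolute correlation exceeds a cutoff, then fits the straight line that `geom_smooth(method = "lm")` draws. The 0.8 cutoff is an arbitrary choice for illustration, and `strong_pairs` and `fit` are illustrative names.

```r
# List variable pairs whose absolute correlation exceeds a chosen cutoff.
# The 0.8 cutoff is an assumption for illustration, not a rule from the lesson.
cor_matrix <- cor(mtcars)
upper <- upper.tri(cor_matrix)   # keep each pair once and skip the diagonal
idx <- which(upper & abs(cor_matrix) > 0.8, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = round(cor_matrix[idx], 2)
)
print(strong_pairs)

# Fit the same simple linear model that the hp vs mpg plot overlays
fit <- lm(mpg ~ hp, data = mtcars)
print(coef(fit))                 # intercept and slope
print(summary(fit)$r.squared)    # share of mpg variance explained by hp
```

The slope estimates the change in mpg for each additional unit of horsepower. With a single predictor, the R-squared is the square of the hp-mpg correlation shown in the matrix.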

Optional: Statistical Inference

# Hypothesis test: Is mean mpg > 20?
t.test(clean_data$mpg, mu = 20, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  clean_data$mpg
## t = 0.08506, df = 31, p-value = 0.4664
## alternative hypothesis: true mean is greater than 20
## 95 percent confidence interval:
##  18.28418      Inf
## sample estimates:
## mean of x 
##  20.09062
# Two-sided 95% confidence interval for mean mpg
t.test(clean_data$mpg)$conf.int
## [1] 17.91768 22.26357
## attr(,"conf.level")
## [1] 0.95

Discussion:
- What do the hypothesis test results suggest?
- How do we interpret the confidence interval in this context?
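
Recomputing the test by hand can make the output easier to interpret. The sketch below rebuilds the t statistic, the one-sided p-value, and the two-sided 95% interval from the sample mean and standard error. The variable names are illustrative.

```r
# Recompute the one-sample t-test by hand to see where the output comes from
x  <- mtcars$mpg
n  <- length(x)
se <- sd(x) / sqrt(n)                   # standard error of the mean
t_stat  <- (mean(x) - 20) / se          # distance from mu = 20 in standard errors
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # one-sided ("greater")

# Two-sided 95% confidence interval for the mean
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(round(c(t = t_stat, p = p_value), 4))
print(round(ci, 2))
```

The sample mean sits less than a tenth of a standard error above 20, which is why the p-value is large and the test gives no evidence that the true mean exceeds 20. The confidence interval also contains 20, which points the same way.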

Conclusion

In this lesson, you learned how to:
- Prepare and explore data using R
- Apply descriptive and diagnostic techniques
- Conduct basic statistical inference