Introduction

This report analyzes the distribution of ozone measurements in the airquality dataset, using skewness and kurtosis to characterize symmetry and tail behavior. Skewness measures how far the distribution leans to one side, while kurtosis describes how heavy its tails are, that is, how prone it is to extreme values. Together these metrics help assess how far the data depart from normality, which guides later analysis and modeling decisions.

Data Summary and Setup

The airquality dataset contains daily air quality measurements in New York from May to September 1973, with variables such as Ozone, Temp (temperature), Wind, and Solar.R (solar radiation). This analysis focuses on the Ozone variable.

library(moments)
data = airquality
head(data)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
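The preview shows NA entries in Ozone and Solar.R, which is why missing values are dropped before any statistic is computed. A minimal sketch for counting them, using only base R; the names missing_counts and n_ozone are illustrative and not part of the original analysis:

```r
# Count missing values per column in the built-in airquality data
missing_counts <- colSums(is.na(airquality))
print(missing_counts)

# Number of Ozone observations that remain once NAs are dropped
n_ozone <- sum(!is.na(airquality$Ozone))
cat("Non-missing Ozone values:", n_ozone, "of", nrow(airquality), "\n")
```

Every statistic reported for Ozone in the rest of the report is based on these non-missing observations only.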

Statistical Analysis

Summary Statistics

This section presents basic descriptive statistics for the Ozone variable, computed after removing missing values, as a first look at its center and spread.

ozone_data = na.omit(data$Ozone)

#summary
summary_stats = summary(ozone_data)
summary_stats
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   18.00   31.50   42.13   63.25  168.00
#calculate standard deviation
std_dev = sd(ozone_data)
cat("Standard Deviation:", round(std_dev,2), "\n")
## Standard Deviation: 32.99

Interpretation of Summary Statistics

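The minimum (1) and maximum (168) give the range of ozone concentrations, which are recorded in parts per billion. The mean (42.13) is well above the median (31.50), an early sign that a few high readings pull the distribution to the right. The standard deviation (32.99) is large relative to the mean, indicating a wide spread.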

Skewness and Kurtosis Analysis

The skewness and kurtosis of the Ozone variable describe the shape of its distribution and the weight of its tails.

skewness_value = skewness(ozone_data)
kurtosis_value = kurtosis(ozone_data)


cat("skewness:", round(skewness_value, 2), "\n")
## skewness: 1.23
cat("Kurtosis:", round(kurtosis_value, 2), "\n")
## Kurtosis: 4.18

Interpretation

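Skewness: A value near zero suggests a symmetric distribution. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail. The observed value of 1.23 points to a pronounced right skew: most ozone readings are low to moderate, with a few much higher ones.

Kurtosis: The kurtosis() function in the moments package reports Pearson's kurtosis, for which a normal distribution has a value of 3. A higher value indicates heavier tails, suggesting more extreme values (outliers) than a normal distribution would produce. The observed value of 4.18 is above 3, so the Ozone distribution is heavier-tailed than normal.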
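Both statistics can be reproduced from their moment definitions, which makes the interpretation concrete. The sketch below assumes the 1/n (population) convention for central moments, which is the one the moments package uses; the names m2, m3, m4, skew_manual, kurt_manual, and excess_kurt are illustrative:

```r
# Ozone values with missing entries removed
x <- na.omit(airquality$Ozone)

# Central moments with the 1/n (population) convention
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m4 <- mean((x - mean(x))^4)

skew_manual <- m3 / m2^(3/2)    # third standardized moment
kurt_manual <- m4 / m2^2        # Pearson's kurtosis (normal = 3)
excess_kurt <- kurt_manual - 3  # excess kurtosis (normal = 0)

cat("Skewness:", round(skew_manual, 2), "\n")
cat("Kurtosis:", round(kurt_manual, 2), "\n")
cat("Excess kurtosis:", round(excess_kurt, 2), "\n")
```

Some other tools report excess kurtosis by default, so it is worth checking which convention is in use before comparing a value against 3 or against 0.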

Data Visualization

The histogram and density plot visually display the distribution of Ozone values, allowing us to assess its symmetry and potential deviations from normality.

library(ggplot2)
ggplot(data = data.frame(ozone_data) ,aes(x= ozone_data))+
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "skyblue", bins = 15)+
  geom_density(color = "darkblue", linewidth = 1) +
  labs(title = "Ozone Level Distribution in New York",
       subtitle = paste("Skewness:", round(skewness_value, 2), "| Kurtosis:", round(kurtosis_value, 2)),
       x = "Ozone Levels", y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "blue")
  )
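As a complement to the histogram, a normal Q-Q plot and a Shapiro-Wilk test are common ways to look for departures from normality. The sketch below is not part of the original analysis; it uses only base R functions (qqnorm, qqline, shapiro.test), and the names ozone and sw are illustrative:

```r
# Ozone values with missing entries removed
ozone <- na.omit(airquality$Ozone)

# Normal Q-Q plot: points that bend away from the reference line
# in the upper tail are consistent with right skew
qqnorm(ozone, main = "Normal Q-Q Plot of Ozone")
qqline(ozone, col = "darkblue", lwd = 2)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw <- shapiro.test(ozone)
print(sw)
```

A small p-value is evidence against normality, but the test says nothing about the kind of departure, so it is best read together with the plot and with the skewness and kurtosis values.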

Conclusion

The analysis of the Ozone variable’s distribution reveals the following insights:

Skewness and Kurtosis: With a skewness of 1.23 and a kurtosis of 4.18, the Ozone distribution is right-skewed and heavier-tailed than a normal distribution. Most days have low to moderate ozone levels, while a small number of high readings stretch the right tail.

Data Visualization: The histogram and density plot give a visual check on these metrics and help judge whether the data should be transformed before modeling.

Foundational Analysis: This distribution analysis is a first step in exploratory data analysis (EDA). Knowing that the data are skewed or heavy-tailed informs the choice of transformations and of models that tolerate non-normal data.
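One way to act on these findings is to check whether a transformation reduces the asymmetry. The sketch below tries a natural log transform, which is an assumption for illustration and not a step taken in the original analysis. The helper skew() is hypothetical and uses the same 1/n moment convention as the moments package:

```r
# Ozone values with missing entries removed
ozone <- na.omit(airquality$Ozone)

# All observed Ozone values are at least 1, so the log is defined
log_ozone <- log(ozone)

# Hypothetical helper: moment-based skewness with the 1/n convention
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3/2)
}

cat("Skewness (raw):", round(skew(ozone), 2), "\n")
cat("Skewness (log):", round(skew(log_ozone), 2), "\n")
```

A skewness closer to zero after the transform would suggest that the log scale is a more convenient basis for methods that assume approximate symmetry. Whether it is the right choice depends on the modeling goal.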