This report provides a comprehensive analysis of the distribution of variables in the airquality dataset, focusing on skewness and kurtosis to understand data symmetry and tail behavior. Skewness reveals the data’s tendency to lean towards one side, while kurtosis sheds light on the extremity of data points in the tails. These metrics are essential for assessing the normality of the distribution, which guides further analysis and modeling decisions.
The airquality dataset contains daily air quality measurements in New York, with variables such as Ozone, Temp (temperature), Wind, and Solar.R (solar radiation). In this analysis, we focus on the Ozone variable. The first six rows of the dataset are shown below.
Ozone | Solar.R | Wind | Temp | Month | Day |
---|---|---|---|---|---|
41 | 190 | 7.4 | 67 | 5 | 1 |
36 | 118 | 8.0 | 72 | 5 | 2 |
12 | 149 | 12.6 | 74 | 5 | 3 |
18 | 313 | 11.5 | 62 | 5 | 4 |
NA | NA | 14.3 | 56 | 5 | 5 |
28 | NA | 14.9 | 66 | 5 | 6 |
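The later snippets use a vector called `ozone_data` that the text never defines. The following is a minimal sketch of how it could be built, assuming the missing Ozone readings were simply dropped (the summary output reports no NA count, which suggests this, but the original preparation step is not shown):

```r
# airquality ships with base R (datasets package), so nothing needs downloading
data(airquality)

# Preview the first six rows, matching the table above
head(airquality)

# Assumption: missing Ozone readings are dropped before any statistics are computed
ozone_data <- as.numeric(na.omit(airquality$Ozone))

# Report how many observations remain and how many were dropped
cat("Observations kept:", length(ozone_data), "\n")
cat("Missing values dropped:", sum(is.na(airquality$Ozone)), "\n")
```

Dropping the NA rows up front keeps every later call (`summary()`, `sd()`, the shape statistics) working on the same set of observations.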
This section presents basic descriptive statistics for the Ozone variable, giving a snapshot of its central tendency and spread.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 18.00 31.50 42.13 63.25 168.00
# Calculate the standard deviation
std_dev <- sd(ozone_data)
cat("Standard Deviation:", round(std_dev, 2), "\n")
## Standard Deviation: 32.99
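The minimum (1) and maximum (168) give the range of ozone concentrations. The mean (42.13) and median (31.50) describe the central tendency; the mean sitting well above the median is a first hint of right skew. The standard deviation (32.99) measures the spread.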
The skewness and kurtosis of the Ozone variable provide insight into its shape and tail weight.
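# skewness() and kurtosis() are assumed to come from the moments package
library(moments)
skewness_value <- skewness(ozone_data)
kurtosis_value <- kurtosis(ozone_data)
cat("skewness:", round(skewness_value, 2), "\n")
## skewness: 1.23
cat("Kurtosis:", round(kurtosis_value, 2), "\n")
## Kurtosis: 4.18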
Skewness: A value near zero suggests a symmetric distribution. Positive skewness indicates a longer right tail, while negative skewness indicates a longer left tail. The Ozone skewness of 1.23 points to a pronounced right tail.
Kurtosis: A normal distribution has a kurtosis of 3. A higher value indicates heavier tails, suggesting more extreme values (outliers) than a normal distribution would produce. The Ozone kurtosis of 4.18 is above that benchmark.
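To make the two definitions concrete, here is a minimal base R sketch of the moment-based formulas. The helper names `moment_skewness` and `moment_kurtosis` are hypothetical, and it is an assumption that the functions used in the report follow the same definitions. The kurtosis here is not excess kurtosis, so a normal distribution scores 3.

```r
# Moment-based skewness: third central moment over the 1.5 power of the second
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Pearson kurtosis (not excess): fourth central moment over the squared second
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Apply both helpers to the Ozone readings with missing values dropped
ozone_data <- as.numeric(na.omit(airquality$Ozone))
cat("skewness:", round(moment_skewness(ozone_data), 2), "\n")
cat("kurtosis:", round(moment_kurtosis(ozone_data), 2), "\n")
```

As a sanity check, the symmetric sample `1:5` gives a skewness of 0 and a kurtosis of 1.7 under these formulas. If the report used the same definitions, the Ozone results should land close to the values printed above.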
The histogram and density plot visually display the distribution of Ozone values, allowing us to assess its symmetry and potential deviations from normality.
library(ggplot2)
ggplot(data = data.frame(ozone_data), aes(x = ozone_data)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "skyblue", bins = 15) +
  geom_density(color = "darkblue", linewidth = 1) +
  labs(title = "Ozone Level Distribution in New York",
       subtitle = paste("Skewness:", round(skewness_value, 2), "| Kurtosis:", round(kurtosis_value, 2)),
       x = "Ozone Levels", y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12, color = "blue")
  )
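A normal Q-Q plot is a common companion to the histogram when judging deviations from normality. It is not part of the original report. The sketch below uses base R graphics rather than ggplot2 to stay self-contained, and it assumes `ozone_data` is the Ozone column with missing values dropped.

```r
# Assumption: ozone_data is the Ozone column with missing values removed
ozone_data <- as.numeric(na.omit(airquality$Ozone))

# Plot sample quantiles against theoretical normal quantiles
qq <- qqnorm(ozone_data, main = "Normal Q-Q Plot of Ozone Levels")

# Add a reference line through the first and third quartiles
qqline(ozone_data, col = "darkblue", lwd = 2)
```

Points that bend away from the reference line in the upper right would be consistent with the long right tail suggested by the positive skewness.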
The analysis of the Ozone variable’s distribution reveals the following insights:
Skewness and Kurtosis: A skewness of 1.23 shows that the Ozone distribution is right-skewed, and a kurtosis of 4.18 (above the normal benchmark of 3) indicates heavier tails than a normal distribution. Together they point to a small number of unusually high ozone readings.
Data Visualization: The histogram and density curve help evaluate the shape of the distribution and indicate whether a transformation is needed before modeling.
Foundational Analysis: This initial look at the distribution informs the next steps in exploratory data analysis (EDA) and statistical modeling. Knowing that the data are skewed or heavy-tailed supports better choices of transformation and model.
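As an illustration of the transformation step mentioned above, the sketch below compares the skewness of the raw Ozone values with that of their natural logarithm. The log transform is only one possible choice and is not something the report itself applies. `moment_skewness` is a hypothetical helper defined here so the snippet runs in base R.

```r
# Assumption: ozone_data is the Ozone column with missing values removed
ozone_data <- as.numeric(na.omit(airquality$Ozone))

# Hypothetical helper: moment-based skewness (third central moment, standardized)
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# All Ozone readings are at least 1, so the logarithm is defined everywhere
skew_raw <- moment_skewness(ozone_data)
skew_log <- moment_skewness(log(ozone_data))

cat("Skewness (raw):", round(skew_raw, 2), "\n")
cat("Skewness (log):", round(skew_log, 2), "\n")
```

Because the logarithm compresses large values more than small ones, it pulls in the right tail. Whether the transformed data are close enough to symmetric for a given model still has to be judged case by case.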