EDA for Air_index dataset

Dataset Description

Head

  country          state      city                        station
1   India Andhra_Pradesh Amaravati Secretariat, Amaravati - APPCB
2   India Andhra_Pradesh Amaravati Secretariat, Amaravati - APPCB
3   India Andhra_Pradesh Anantapur   Gulzarpet, Anantapur - APPCB
4   India Andhra_Pradesh Anantapur   Gulzarpet, Anantapur - APPCB
5   India Andhra_Pradesh Anantapur   Gulzarpet, Anantapur - APPCB
6   India Andhra_Pradesh Anantapur   Gulzarpet, Anantapur - APPCB
          last_update latitude longitude pollutant_id pollutant_min
1 18-09-2024 08:00:00 16.51508  80.51817        PM2.5            31
2 18-09-2024 08:00:00 16.51508  80.51817        OZONE             6
3 18-09-2024 08:00:00 14.67589  77.59303          NO2            11
4 18-09-2024 08:00:00 14.67589  77.59303          NH3             1
5 18-09-2024 08:00:00 14.67589  77.59303          SO2             1
6 18-09-2024 08:00:00 14.67589  77.59303        OZONE             7
  pollutant_max pollutant_avg
1            72            49
2            52            12
3           124            29
4             4             2
5            33            12
6            34            16

Structure

'data.frame':   3307 obs. of  11 variables:
 $ country      : chr  "India" "India" "India" "India" ...
 $ state        : chr  "Andhra_Pradesh" "Andhra_Pradesh" "Andhra_Pradesh" "Andhra_Pradesh" ...
 $ city         : chr  "Amaravati" "Amaravati" "Anantapur" "Anantapur" ...
 $ station      : chr  "Secretariat, Amaravati - APPCB" "Secretariat, Amaravati - APPCB" "Gulzarpet, Anantapur - APPCB" "Gulzarpet, Anantapur - APPCB" ...
 $ last_update  : chr  "18-09-2024 08:00:00" "18-09-2024 08:00:00" "18-09-2024 08:00:00" "18-09-2024 08:00:00" ...
 $ latitude     : num  16.5 16.5 14.7 14.7 14.7 ...
 $ longitude    : num  80.5 80.5 77.6 77.6 77.6 ...
 $ pollutant_id : chr  "PM2.5" "OZONE" "NO2" "NH3" ...
 $ pollutant_min: int  31 6 11 1 1 7 3 1 53 69 ...
 $ pollutant_max: int  72 52 124 4 33 34 4 1 92 69 ...
 $ pollutant_avg: int  49 12 29 2 12 16 3 1 75 69 ...

Summary

   country             state               city             station         
 Length:3307        Length:3307        Length:3307        Length:3307       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 last_update           latitude        longitude     pollutant_id      
 Length:3307        Min.   : 8.515   Min.   :70.91   Length:3307       
 Class :character   1st Qu.:19.059   1st Qu.:75.58   Class :character  
 Mode  :character   Median :23.234   Median :77.30   Mode  :character  
                    Mean   :22.684   Mean   :78.55                     
                    3rd Qu.:27.194   3rd Qu.:80.32                     
                    Max.   :34.066   Max.   :94.64                     
                                                                       
 pollutant_min   pollutant_max    pollutant_avg   
 Min.   :  1.0   Min.   :  1.00   Min.   :  1.00  
 1st Qu.:  4.0   1st Qu.: 14.00   1st Qu.:  8.00  
 Median : 12.0   Median : 35.00   Median : 20.00  
 Mean   : 17.7   Mean   : 51.43   Mean   : 29.81  
 3rd Qu.: 25.0   3rd Qu.: 69.00   3rd Qu.: 41.00  
 Max.   :215.0   Max.   :500.00   Max.   :296.00  
 NA's   :306     NA's   :306      NA's   :306

Null Values

      country         state          city       station   last_update 
            0             0             0             0             0 
     latitude     longitude  pollutant_id pollutant_min pollutant_max 
            0             0             0           306           306 
pollutant_avg 
          306

      country         state          city       station   last_update 
            0             0             0             0             0 
     latitude     longitude  pollutant_id pollutant_min pollutant_max 
            0             0             0             0             0 
pollutant_avg 
            0

Univariate analysis

Distribution of Pollutant_min levels

Distribution of Pollutant_max levels

Distribution of Pollutant_avg levels

Plot the average pollutant levels by state

Plot the average pollutant level by pollutant_id

Bivariate analysis

Scatter plot of pollutant_avg vs pollutant_max

Box plot of pollutant_avg by pollutant_id

Box plot of pollutant_avg by state

Pollutant average levels by city and pollutant ID

Multivariate analysis

Heatmap

---
title: "EDA for Air_index dataset"
output: 
  flexdashboard::flex_dashboard:
    orientation: rows
    vertical_layout: scroll
    theme: yeti
    social: menu
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
library(ggplot2)
library(dplyr)
library(tidyr)
```
## Dataset Description {.tabset}
```{r}
# Read the dataset
df <- read.csv("air_index.csv")
```

### Head
```{r}
head(df)
```
### Structure
```{r}
str(df)
```
### Summary
```{r}
summary(df)
```
### Null Values
```{r}
# Check for missing values
colSums(is.na(df))
```
```{r}
# Fill missing values in 'pollutant_min', 'pollutant_max', and 'pollutant_avg' with their column means.
# Note: the means are not whole numbers, so the three integer columns become numeric.
df <- df %>%
  mutate(
    pollutant_min = ifelse(is.na(pollutant_min), mean(pollutant_min, na.rm = TRUE), pollutant_min),
    pollutant_max = ifelse(is.na(pollutant_max), mean(pollutant_max, na.rm = TRUE), pollutant_max),
    pollutant_avg = ifelse(is.na(pollutant_avg), mean(pollutant_avg, na.rm = TRUE), pollutant_avg)
  )
```
```{r}

# Confirm that no missing values remain after imputation
colSums(is.na(df))
```
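
The mean-fill step above uses a single column-wide mean, which pools pollutants that sit on very different scales (PM2.5 and NH3, for example). One possible refinement, shown here only as a minimal sketch, is to impute within each `pollutant_id` group instead. The helper name `impute_by_pollutant` is hypothetical and not part of the original analysis; it assumes the column names listed in the Structure tab.

```{r}
library(dplyr)

# Hypothetical helper (an assumption, not the original method): replace each
# missing value with the mean of the same pollutant rather than the overall mean.
impute_by_pollutant <- function(data) {
  data %>%
    group_by(pollutant_id) %>%
    mutate(across(
      c(pollutant_min, pollutant_max, pollutant_avg),
      # as.numeric() keeps both ifelse() branches numeric for integer columns
      ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), as.numeric(.x))
    )) %>%
    ungroup()
}
```

To try it, apply the helper to the freshly read data in place of the column-mean fill. A pollutant whose values are all missing would come back as `NaN` and would need separate handling.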
## Univariate analysis {.tabset}

### Distribution of Pollutant_min levels
```{r}
ggplot(df, aes(x = pollutant_min)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 30) +
  labs(title = "Distribution of Pollutant Min Levels", x = "Pollutant Min", y = "Count")
```

### Distribution of Pollutant_max levels
```{r}
ggplot(df, aes(x = pollutant_max)) +
  geom_histogram(fill = "lightgreen", color = "black", bins = 30) +
  labs(title = "Distribution of Pollutant Max Levels", x = "Pollutant Max", y = "Count")
```


### Distribution of Pollutant_avg levels
```{r}
ggplot(df, aes(x = pollutant_avg)) +
  geom_histogram(fill = "lightcoral", color = "black", bins = 30) +
  labs(title = "Distribution of Pollutant Avg Levels", x = "Pollutant Avg", y = "Count")
```

### Plot the average pollutant levels by state
```{r}
state_pollution_summary <- df %>%
  group_by(state) %>%
  summarise(avg_pollutant_level = mean(pollutant_avg, na.rm = TRUE))
# Sort states by the average pollutant level to find the highest
most_polluted_state <- state_pollution_summary %>%
  arrange(desc(avg_pollutant_level))
# Plot the average pollutant levels by state
ggplot(state_pollution_summary, aes(x = reorder(state, avg_pollutant_level), y = avg_pollutant_level)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() + # Flip coordinates for better readability
  labs(title = "Average Pollutant Levels by State",
       x = "State",
       y = "Average Pollutant Level") +
  theme_minimal()
```

### Plot the average pollutant level by pollutant_id
```{r}
ggplot(df, aes(x = pollutant_id, y = pollutant_avg)) +
  geom_bar(stat = "summary", fun = "mean", fill = "lightcoral") +
  labs(title = "Average Pollutant Levels by Pollutant ID",
       x = "Pollutant ID",
       y = "Average Pollutant Level") +
  theme_minimal()
```
  
## Bivariate analysis {.tabset}
### Scatter plot of pollutant_avg vs pollutant_max
```{r}
# Scatter plot of pollutant_avg vs pollutant_max
ggplot(df, aes(x = pollutant_max, y = pollutant_avg)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(title = "Scatter Plot of Pollutant Avg vs Pollutant Max", x = "Pollutant Max", y = "Pollutant Avg")
```

### Box plot of pollutant_avg by pollutant_id
```{r}
# Bivariate Plot: Box plot of pollutant_avg by pollutant_id
ggplot(df, aes(x = pollutant_id, y = pollutant_avg, fill = pollutant_id)) +
  geom_boxplot() +
  labs(title = "Box Plot of Pollutant Avg by Pollutant Type", x = "Pollutant Type", y = "Pollutant Avg") +
  theme(legend.position = "none")  # Removes the legend for simplicity
```

### Box plot of pollutant_avg by state
```{r}
# Bivariate Plot: Box plot of pollutant_avg by state
ggplot(df, aes(x = state, y = pollutant_avg, fill = state)) +
  geom_boxplot() +
  labs(title = "Box Plot of Pollutant Avg by State", x = "State", y = "Pollutant Avg") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none")  # Rotate state labels
```


### Pollutant average levels by city and pollutant ID
```{r}
# Bar plot of pollutant average levels by city and pollutant ID, restricted to Tamil Nadu
tamilnadu_data <- df %>%
  filter(state == "TamilNadu")
ggplot(tamilnadu_data, aes(x = city, y = pollutant_avg, fill = pollutant_id)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Pollutant Average Levels by City and Pollutant Type in Tamil Nadu",
       x = "City",
       y = "Pollutant Average Level",
       fill = "Pollutant ID") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_brewer(palette = "Set1")  # Optional: for a more visually appealing color palette
```


## Multivariate analysis
### Heatmap
```{r}
# Create a heatmap using ggplot
tamilnadu_data <- df %>%
  filter(state == "TamilNadu")
ggplot(tamilnadu_data, aes(x = city, y = pollutant_id, fill = pollutant_avg)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Avg Pollutant") +
  labs(title = "Heatmap of Pollutant Averages by City and Pollutant Type in Tamil Nadu",
       x = "City", y = "Pollutant Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
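
If a city has more than one monitoring station, the filtered data holds several rows per city and pollutant, and `geom_tile()` draws those tiles on top of each other so that only the last one drawn is visible. A minimal sketch of one way to handle this, assuming that averaging across stations is acceptable, is to summarise to one value per city and pollutant before plotting. The helper names `city_pollutant_means` and `plot_city_heatmap` are hypothetical and not part of the original analysis.

```{r}
library(dplyr)
library(ggplot2)

# Hypothetical helper: one mean value per city and pollutant for a given state.
city_pollutant_means <- function(data, state_name) {
  data %>%
    filter(state == state_name) %>%
    group_by(city, pollutant_id) %>%
    summarise(pollutant_avg = mean(pollutant_avg, na.rm = TRUE), .groups = "drop")
}

# Hypothetical helper: heatmap built from the aggregated table,
# styled like the heatmap chunk above.
plot_city_heatmap <- function(city_means) {
  ggplot(city_means, aes(x = city, y = pollutant_id, fill = pollutant_avg)) +
    geom_tile(color = "white") +
    scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Avg Pollutant") +
    labs(x = "City", y = "Pollutant Type") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
```

With the data frame from this dashboard, the call would be `plot_city_heatmap(city_pollutant_means(df, "TamilNadu"))`.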