Data Dive 4

For this data dive, I have chosen the following columns for analysis.

  1. “country” (Nominal)
  2. “temperature_celsius” (Interval)
  3. “wind_mph” (Ratio)
  4. “humidity” (Ratio)
  5. “uv_index” (Ratio)
  6. “air_quality_PM2.5” (Ratio)
  7. “air_quality_PM10” (Ratio)

Loading and Summarizing data

The first step is to read the CSV file and draw five random subsamples from it. For each subsample, I print summary statistics for the seven columns listed above plus “wind_direction” and “condition_text”.

Each subsample contains 1,267 rows. The five summaries below correspond to the five subsamples.

##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 2.90      
##  Class :character   Class :character   Class :character   1st Qu.:17.80      
##  Mode  :character   Mode  :character   Mode  :character   Median :23.60      
##                                                           Mean   :22.45      
##                                                           3rd Qu.:27.00      
##                                                           Max.   :45.00      
##     wind_mph         humidity         uv_index     air_quality_PM2.5
##  Min.   : 2.200   Min.   : 10.00   Min.   :1.000   Min.   :  0.50   
##  1st Qu.: 2.900   1st Qu.: 64.00   1st Qu.:1.000   1st Qu.:  3.30   
##  Median : 5.600   Median : 77.00   Median :1.000   Median :  8.00   
##  Mean   : 6.609   Mean   : 72.92   Mean   :2.512   Mean   : 21.81   
##  3rd Qu.: 9.400   3rd Qu.: 88.00   3rd Qu.:5.000   3rd Qu.: 18.30   
##  Max.   :36.700   Max.   :100.00   Max.   :9.000   Max.   :814.30   
##  air_quality_PM10
##  Min.   :  0.50  
##  1st Qu.:  6.00  
##  Median : 12.90  
##  Mean   : 33.61  
##  3rd Qu.: 29.60  
##  Max.   :937.50  
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 2.90      
##  Class :character   Class :character   Class :character   1st Qu.:18.00      
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00      
##                                                           Mean   :22.63      
##                                                           3rd Qu.:28.00      
##                                                           Max.   :42.00      
##     wind_mph       humidity         uv_index      air_quality_PM2.5
##  Min.   : 2.2   Min.   :  6.00   Min.   : 1.000   Min.   :  0.50   
##  1st Qu.: 2.5   1st Qu.: 63.00   1st Qu.: 1.000   1st Qu.:  3.20   
##  Median : 4.3   Median : 78.00   Median : 1.000   Median :  7.40   
##  Mean   : 6.4   Mean   : 73.24   Mean   : 2.467   Mean   : 18.69   
##  3rd Qu.: 8.5   3rd Qu.: 89.00   3rd Qu.: 5.000   3rd Qu.: 16.40   
##  Max.   :36.7   Max.   :100.00   Max.   :10.000   Max.   :700.80   
##  air_quality_PM10
##  Min.   :  0.60  
##  1st Qu.:  5.40  
##  Median : 11.50  
##  Mean   : 30.39  
##  3rd Qu.: 27.30  
##  Max.   :869.60  
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 2.90      
##  Class :character   Class :character   Class :character   1st Qu.:17.00      
##  Mode  :character   Mode  :character   Mode  :character   Median :23.00      
##                                                           Mean   :22.36      
##                                                           3rd Qu.:27.45      
##                                                           Max.   :42.00      
##     wind_mph         humidity         uv_index      air_quality_PM2.5
##  Min.   : 2.200   Min.   :  6.00   Min.   : 1.000   Min.   :  0.50   
##  1st Qu.: 3.400   1st Qu.: 64.00   1st Qu.: 1.000   1st Qu.:  3.20   
##  Median : 4.900   Median : 78.00   Median : 1.000   Median :  7.70   
##  Mean   : 6.516   Mean   : 73.57   Mean   : 2.193   Mean   : 19.53   
##  3rd Qu.: 8.100   3rd Qu.: 89.00   3rd Qu.: 1.000   3rd Qu.: 17.20   
##  Max.   :43.800   Max.   :100.00   Max.   :10.000   Max.   :895.10   
##  air_quality_PM10 
##  Min.   :   0.50  
##  1st Qu.:   5.55  
##  Median :  12.40  
##  Mean   :  31.01  
##  3rd Qu.:  26.50  
##  Max.   :1079.10  
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 3.00      
##  Class :character   Class :character   Class :character   1st Qu.:18.00      
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00      
##                                                           Mean   :22.58      
##                                                           3rd Qu.:27.00      
##                                                           Max.   :45.00      
##     wind_mph         humidity         uv_index     air_quality_PM2.5
##  Min.   : 2.200   Min.   : 10.00   Min.   :1.000   Min.   :  0.50   
##  1st Qu.: 3.600   1st Qu.: 64.00   1st Qu.:1.000   1st Qu.:  3.10   
##  Median : 5.600   Median : 78.00   Median :1.000   Median :  7.30   
##  Mean   : 6.638   Mean   : 73.52   Mean   :2.305   Mean   : 19.46   
##  3rd Qu.: 9.400   3rd Qu.: 89.00   3rd Qu.:4.000   3rd Qu.: 17.85   
##  Max.   :30.000   Max.   :100.00   Max.   :9.000   Max.   :895.10   
##  air_quality_PM10 
##  Min.   :   0.50  
##  1st Qu.:   5.45  
##  Median :  12.30  
##  Mean   :  31.56  
##  3rd Qu.:  28.50  
##  Max.   :1079.10  
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 5.00      
##  Class :character   Class :character   Class :character   1st Qu.:17.15      
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00      
##                                                           Mean   :22.63      
##                                                           3rd Qu.:28.00      
##                                                           Max.   :39.60      
##     wind_mph         humidity         uv_index     air_quality_PM2.5
##  Min.   : 2.200   Min.   : 12.00   Min.   :1.000   Min.   :  0.50   
##  1st Qu.: 3.800   1st Qu.: 62.50   1st Qu.:1.000   1st Qu.:  3.10   
##  Median : 5.600   Median : 78.00   Median :1.000   Median :  6.90   
##  Mean   : 6.753   Mean   : 73.09   Mean   :2.424   Mean   : 18.52   
##  3rd Qu.: 8.800   3rd Qu.: 88.00   3rd Qu.:4.500   3rd Qu.: 17.30   
##  Max.   :43.800   Max.   :100.00   Max.   :9.000   Max.   :725.90   
##  air_quality_PM10
##  Min.   :  0.60  
##  1st Qu.:  5.80  
##  Median : 11.50  
##  Mean   : 30.04  
##  3rd Qu.: 26.70  
##  Max.   :911.70
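
The loading and subsampling step described above can be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data are synthetic (the file name is not given here), the 50% sampling fraction and sampling without replacement are guesses, and `weather`, `subsamples`, and `sample_fraction` are hypothetical names.

```r
# Synthetic stand-in for the weather data; the real file is not named in this report.
set.seed(42)  # fixed seed so the same subsamples are drawn on every run
n <- 200
weather <- data.frame(
  country = sample(c("Indonesia", "Bulgaria", "Turkey", "Sudan"), n, replace = TRUE),
  temperature_celsius = round(rnorm(n, mean = 22, sd = 6.5), 1),
  humidity = sample(10:100, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Round-trip through a temporary CSV so that read.csv() is exercised.
csv_path <- tempfile(fileext = ".csv")
write.csv(weather, csv_path, row.names = FALSE)
weather <- read.csv(csv_path, stringsAsFactors = FALSE)

# Draw five random subsamples without replacement (fraction is an assumption).
sample_fraction <- 0.5
subsamples <- lapply(1:5, function(i) {
  rows <- sample(nrow(weather), size = floor(sample_fraction * nrow(weather)))
  weather[rows, ]
})

# Print summary statistics for the chosen columns of each subsample.
cols <- c("country", "temperature_celsius", "humidity")
for (i in seq_along(subsamples)) {
  cat("Subsample", i, "\n")
  print(summary(subsamples[[i]][, cols]))
}
```

Setting the seed once before sampling keeps every later step working on the same five subsamples when the analysis is rerun.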

Calculating the frequency of country names

For each subsample, the code prints the summary of the data frame again and then lists the five most frequently occurring countries in that subsample, along with their counts.

## Summary for DataFrame:
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 2.90      
##  Class :character   Class :character   Class :character   1st Qu.:17.80      
##  Mode  :character   Mode  :character   Mode  :character   Median :23.60      
##                                                           Mean   :22.45      
##                                                           3rd Qu.:27.00      
##                                                           Max.   :45.00      
##     wind_mph         humidity         uv_index     air_quality_PM2.5
##  Min.   : 2.200   Min.   : 10.00   Min.   :1.000   Min.   :  0.50   
##  1st Qu.: 2.900   1st Qu.: 64.00   1st Qu.:1.000   1st Qu.:  3.30   
##  Median : 5.600   Median : 77.00   Median :1.000   Median :  8.00   
##  Mean   : 6.609   Mean   : 72.92   Mean   :2.512   Mean   : 21.81   
##  3rd Qu.: 9.400   3rd Qu.: 88.00   3rd Qu.:5.000   3rd Qu.: 18.30   
##  Max.   :36.700   Max.   :100.00   Max.   :9.000   Max.   :814.30   
##  air_quality_PM10
##  Min.   :  0.50  
##  1st Qu.:  6.00  
##  Median : 12.90  
##  Mean   : 33.61  
##  3rd Qu.: 29.60  
##  Max.   :937.50  
## 
## Frequency of top 5 country:
## 
## Indonesia  Bulgaria    Turkey   Bolivia     China 
##        21        14        14        13        13 
## Summary for DataFrame:
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 2.90      
##  Class :character   Class :character   Class :character   1st Qu.:18.00      
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00      
##                                                           Mean   :22.63      
##                                                           3rd Qu.:28.00      
##                                                           Max.   :42.00      
##     wind_mph       humidity         uv_index      air_quality_PM2.5
##  Min.   : 2.2   Min.   :  6.00   Min.   : 1.000   Min.   :  0.50   
##  1st Qu.: 2.5   1st Qu.: 63.00   1st Qu.: 1.000   1st Qu.:  3.20   
##  Median : 4.3   Median : 78.00   Median : 1.000   Median :  7.40   
##  Mean   : 6.4   Mean   : 73.24   Mean   : 2.467   Mean   : 18.69   
##  3rd Qu.: 8.5   3rd Qu.: 89.00   3rd Qu.: 5.000   3rd Qu.: 16.40   
##  Max.   :36.7   Max.   :100.00   Max.   :10.000   Max.   :700.80   
##  air_quality_PM10
##  Min.   :  0.60  
##  1st Qu.:  5.40  
##  Median : 11.50  
##  Mean   : 30.39  
##  3rd Qu.: 27.30  
##  Max.   :869.60  
## 
## Frequency of top 5 country:
## 
##  Thailand Indonesia  Bulgaria   Belgium    Panama 
##        21        19        17        16        16 
## Summary for DataFrame:
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 2.90      
##  Class :character   Class :character   Class :character   1st Qu.:17.00      
##  Mode  :character   Mode  :character   Mode  :character   Median :23.00      
##                                                           Mean   :22.36      
##                                                           3rd Qu.:27.45      
##                                                           Max.   :42.00      
##     wind_mph         humidity         uv_index      air_quality_PM2.5
##  Min.   : 2.200   Min.   :  6.00   Min.   : 1.000   Min.   :  0.50   
##  1st Qu.: 3.400   1st Qu.: 64.00   1st Qu.: 1.000   1st Qu.:  3.20   
##  Median : 4.900   Median : 78.00   Median : 1.000   Median :  7.70   
##  Mean   : 6.516   Mean   : 73.57   Mean   : 2.193   Mean   : 19.53   
##  3rd Qu.: 8.100   3rd Qu.: 89.00   3rd Qu.: 1.000   3rd Qu.: 17.20   
##  Max.   :43.800   Max.   :100.00   Max.   :10.000   Max.   :895.10   
##  air_quality_PM10 
##  Min.   :   0.50  
##  1st Qu.:   5.55  
##  Median :  12.40  
##  Mean   :  31.01  
##  3rd Qu.:  26.50  
##  Max.   :1079.10  
## 
## Frequency of top 5 country:
## 
##  Bulgaria     Sudan    Bhutan Indonesia Venezuela 
##        20        17        15        15        15 
## Summary for DataFrame:
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 3.00      
##  Class :character   Class :character   Class :character   1st Qu.:18.00      
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00      
##                                                           Mean   :22.58      
##                                                           3rd Qu.:27.00      
##                                                           Max.   :45.00      
##     wind_mph         humidity         uv_index     air_quality_PM2.5
##  Min.   : 2.200   Min.   : 10.00   Min.   :1.000   Min.   :  0.50   
##  1st Qu.: 3.600   1st Qu.: 64.00   1st Qu.:1.000   1st Qu.:  3.10   
##  Median : 5.600   Median : 78.00   Median :1.000   Median :  7.30   
##  Mean   : 6.638   Mean   : 73.52   Mean   :2.305   Mean   : 19.46   
##  3rd Qu.: 9.400   3rd Qu.: 89.00   3rd Qu.:4.000   3rd Qu.: 17.85   
##  Max.   :30.000   Max.   :100.00   Max.   :9.000   Max.   :895.10   
##  air_quality_PM10 
##  Min.   :   0.50  
##  1st Qu.:   5.45  
##  Median :  12.30  
##  Mean   :  31.56  
##  3rd Qu.:  28.50  
##  Max.   :1079.10  
## 
## Frequency of top 5 country:
## 
##  Bulgaria Indonesia  Maldives  Thailand    Gambia 
##        23        16        14        14        13 
## Summary for DataFrame:
##    country          wind_direction     condition_text     temperature_celsius
##  Length:1267        Length:1267        Length:1267        Min.   : 5.00      
##  Class :character   Class :character   Class :character   1st Qu.:17.15      
##  Mode  :character   Mode  :character   Mode  :character   Median :24.00      
##                                                           Mean   :22.63      
##                                                           3rd Qu.:28.00      
##                                                           Max.   :39.60      
##     wind_mph         humidity         uv_index     air_quality_PM2.5
##  Min.   : 2.200   Min.   : 12.00   Min.   :1.000   Min.   :  0.50   
##  1st Qu.: 3.800   1st Qu.: 62.50   1st Qu.:1.000   1st Qu.:  3.10   
##  Median : 5.600   Median : 78.00   Median :1.000   Median :  6.90   
##  Mean   : 6.753   Mean   : 73.09   Mean   :2.424   Mean   : 18.52   
##  3rd Qu.: 8.800   3rd Qu.: 88.00   3rd Qu.:4.500   3rd Qu.: 17.30   
##  Max.   :43.800   Max.   :100.00   Max.   :9.000   Max.   :725.90   
##  air_quality_PM10
##  Min.   :  0.60  
##  1st Qu.:  5.80  
##  Median : 11.50  
##  Mean   : 30.04  
##  3rd Qu.: 26.70  
##  Max.   :911.70  
## 
## Frequency of top 5 country:
## 
##      Sudan  Indonesia     Turkey   Bulgaria Madagascar 
##         26         18         17         16         16
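
The frequency count described above can be sketched with base R as follows. The data frame `subsample` is a small hypothetical example, not the real data.

```r
# Hypothetical subsample; only the country column matters here.
subsample <- data.frame(
  country = c("Sudan", "Sudan", "Sudan", "Indonesia", "Indonesia", "Turkey",
              "Turkey", "Bulgaria", "Madagascar", "China", "Bolivia"),
  stringsAsFactors = FALSE
)

# table() counts each country, sort() orders the counts from high to low,
# and head() keeps the first five entries.
top5 <- head(sort(table(subsample$country), decreasing = TRUE), 5)
print(top5)
```

Countries with equal counts can appear in either order, so the fifth place in such a list should not be over-interpreted.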

Calculating the Summary Statistics

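We can now calculate summary statistics (Mean, Median, Variance, Standard Deviation) for each numeric column in each subsample.

Note that these values do not match the summaries above (for example, the mean temperature_celsius of the first subsample is 22.08 here versus 22.45 earlier), which suggests that the subsamples were drawn again for this step.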

## Summary statistics for Subsample 1 :
##          temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean     22.07916            73.73481 6.62281  2.190213 17.93228         
## Median   22.2                78       5.6      1        7.3              
## Variance 41.96032            398.1808 20.26404 4.909288 1842.841         
## StdDev   6.477679            19.95447 4.501559 2.215691 42.92833         
##          air_quality_PM10
## Mean     28.70805        
## Median   12.2            
## Variance 3020.887        
## StdDev   54.9626         
## 
## Summary statistics for Subsample 2 :
##          temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean     22.54917            73.49329 6.613812 2.374901 17.85036         
## Median   24                  78       5.1      1        7.3              
## Variance 41.16448            392.1048 22.72048 5.403572 1411.39          
## StdDev   6.415955            19.80164 4.766601 2.324558 37.56847         
##          air_quality_PM10
## Mean     29.55762        
## Median   11.9            
## Variance 2703.957        
## StdDev   51.99959        
## 
## Summary statistics for Subsample 3 :
##          temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean     22.49061            72.83189 6.556906 2.343331 17.6588          
## Median   23                  78       5.6      1        7.2              
## Variance 41.11734            421.666  20.09632 5.290404 1726.691         
## StdDev   6.412281            20.53451 4.482892 2.300088 41.55348         
##          air_quality_PM10
## Mean     29.31358        
## Median   11.4            
## Variance 3302.87         
## StdDev   57.4706         
## 
## Summary statistics for Subsample 4 :
##          temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean     22.47837            73.72849 6.433386 2.303078 19.80205         
## Median   24                  78       5.6      1        7.1              
## Variance 41.85723            398.6908 17.97962 5.178213 2949.439         
## StdDev   6.469716            19.96724 4.240238 2.275569 54.30874         
##          air_quality_PM10
## Mean     31.58248        
## Median   12.3            
## Variance 5320.122        
## StdDev   72.93916        
## 
## Summary statistics for Subsample 5 :
##          temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean     22.82028            72.94317 6.425099 2.277032 21.68137         
## Median   24                  78       5.6      1        8                
## Variance 43.71265            413.9841 19.41509 5.195704 3347.369         
## StdDev   6.611554            20.3466  4.406256 2.279409 57.85645         
##          air_quality_PM10
## Mean     34.80781        
## Median   13.3            
## Variance 5855.506        
## StdDev   76.52128
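
A table of this shape can be produced with `sapply()`. The sketch below uses synthetic data and a hypothetical data frame named `subsample`. Note that `var()` and `sd()` in R use the sample (n - 1) denominator.

```r
set.seed(1)
# Synthetic numeric columns standing in for one subsample.
subsample <- data.frame(
  temperature_celsius = rnorm(100, mean = 22, sd = 6.5),
  humidity = runif(100, min = 10, max = 100),
  wind_mph = rexp(100, rate = 1 / 6.5)
)

# One column per variable, one row per statistic.
summary_stats <- sapply(subsample, function(x) {
  c(Mean = mean(x), Median = median(x), Variance = var(x), StdDev = sd(x))
})
print(summary_stats)
```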

Comparing Standard Deviation across Sub Samples

The code creates a bar plot that compares the standard deviations of the previously selected variables across the five subsamples.
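
A grouped bar plot of this kind can be sketched with base graphics as follows. The subsamples are synthetic, and the names `subsamples`, `vars`, and `sd_matrix` are hypothetical.

```r
set.seed(2)
vars <- c("temperature_celsius", "humidity", "wind_mph")

# Five synthetic subsamples (assumed structure: a list of data frames).
subsamples <- lapply(1:5, function(i) {
  data.frame(
    temperature_celsius = rnorm(100, mean = 22, sd = 6.5),
    humidity = runif(100, min = 10, max = 100),
    wind_mph = rexp(100, rate = 1 / 6.5)
  )
})

# Rows = subsamples, columns = variables.
sd_matrix <- t(sapply(subsamples, function(df) sapply(df[vars], sd)))
rownames(sd_matrix) <- paste("Subsample", 1:5)

# With beside = TRUE, each column (variable) becomes a cluster of bars,
# and each row (subsample) becomes one bar within the cluster.
barplot(sd_matrix, beside = TRUE, col = gray.colors(5),
        ylab = "Standard deviation", legend.text = rownames(sd_matrix),
        args.legend = list(x = "topright", bty = "n"))
```

Because the variables are on very different scales (the PM columns have standard deviations an order of magnitude larger than the others in the tables above), plotting each variable in its own panel may be easier to read.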

Visualizing outliers in sub samples

A separate boxplot panel is created for each variable, with one box per subsample, so that the distribution and spread of that variable can be compared across subsamples.
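
One way to lay out such boxplots in base R is sketched below, using synthetic subsamples and hypothetical names (`subsamples`, `vars`, `values`).

```r
set.seed(3)
# Five synthetic subsamples with two of the variables.
subsamples <- lapply(1:5, function(i) {
  data.frame(temperature_celsius = rnorm(100, mean = 22, sd = 6.5),
             wind_mph = rexp(100, rate = 1 / 6.5))
})
vars <- c("temperature_celsius", "wind_mph")

# One panel per variable; within a panel, one box per subsample.
op <- par(mfrow = c(1, length(vars)))
for (v in vars) {
  values <- lapply(subsamples, function(df) df[[v]])
  names(values) <- paste0("S", seq_along(subsamples))
  boxplot(values, main = v, xlab = "Subsample", ylab = v)
}
par(op)  # restore the previous plotting layout
```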

### Visualizing outliers to depict anomaly in sub samples

Red markers are added to the boxplots so that potential outliers are easy to identify visually for each variable in each subsample.
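
The red markers can be obtained directly from `boxplot()`, which by default draws points lying more than 1.5 times the interquartile range beyond the box as individual outliers. The sketch below assumes this default rule is what defines an outlier here, and uses synthetic, right-skewed data.

```r
set.seed(4)
# Right-skewed synthetic values, loosely mimicking a pollution reading.
pm25 <- lapply(1:5, function(i) rlnorm(100, meanlog = 2, sdlog = 1))
names(pm25) <- paste0("S", 1:5)

# range = 1.5 is the default whisker rule; outcol and outpch style only
# the points drawn beyond the whiskers.
bp <- boxplot(pm25, range = 1.5, outcol = "red", outpch = 19,
              main = "air_quality_PM2.5 (synthetic)", xlab = "Subsample")

# bp$out holds the flagged values and bp$group the box each one belongs to.
print(table(factor(bp$group, levels = 1:5, labels = names(pm25))))
```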

### Count of anomalies within sub samples

We can now count the number of anomalies within each variable across various sub samples.

We do not observe significant differences in anomalies between the subsamples for the temperature_celsius variable. However, sampling does affect the uv_index variable, as can be seen in the bar chart.
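
The anomaly count in this section can be sketched as follows. The definition of an anomaly is an assumption: values outside Q1 - 1.5 * IQR and Q3 + 1.5 * IQR are flagged. The data are synthetic, and `count_outliers` is a hypothetical helper.

```r
set.seed(5)
# Synthetic subsamples; uv_index is mostly 1, as in the summaries above.
subsamples <- lapply(1:5, function(i) {
  data.frame(temperature_celsius = rnorm(100, mean = 22, sd = 6.5),
             uv_index = sample(1:10, 100, replace = TRUE,
                               prob = c(0.6, rep(0.4 / 9, 9))))
})

# Count values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] (assumed rule).
count_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}

# Rows = subsamples, columns = variables.
anomaly_counts <- t(sapply(subsamples, function(df) sapply(df, count_outliers)))
rownames(anomaly_counts) <- paste("Subsample", seq_along(subsamples))
print(anomaly_counts)

# When Q1 equals Q3, the IQR is 0 and every other value is flagged.
print(count_outliers(c(rep(1, 8), 5, 9)))
```

The last call shows a property of this rule that may be relevant here: when most values are identical, Q1 equals Q3 and every differing value counts as an anomaly. One of the summaries above reports a 3rd quartile of 1 for uv_index, so if an IQR rule was used, this could explain why the uv_index count is sensitive to which rows are sampled.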