For this data dive, the first step is to import the data set and create five samples. I have chosen the columns “country”, “wind_direction”, “condition_text”, “temperature_celsius”, “wind_mph”, “humidity”, “uv_index”, “air_quality_PM2.5”, and “air_quality_PM10” for analysis and summarized them.
We start by reading the CSV file, then draw five random subsamples from the data (1,267 rows each in the output below), and finally print summary statistics for the selected columns in each subsample.
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 2.90
## Class :character Class :character Class :character 1st Qu.:17.80
## Mode :character Mode :character Mode :character Median :23.60
## Mean :22.45
## 3rd Qu.:27.00
## Max. :45.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 10.00 Min. :1.000 Min. : 0.50
## 1st Qu.: 2.900 1st Qu.: 64.00 1st Qu.:1.000 1st Qu.: 3.30
## Median : 5.600 Median : 77.00 Median :1.000 Median : 8.00
## Mean : 6.609 Mean : 72.92 Mean :2.512 Mean : 21.81
## 3rd Qu.: 9.400 3rd Qu.: 88.00 3rd Qu.:5.000 3rd Qu.: 18.30
## Max. :36.700 Max. :100.00 Max. :9.000 Max. :814.30
## air_quality_PM10
## Min. : 0.50
## 1st Qu.: 6.00
## Median : 12.90
## Mean : 33.61
## 3rd Qu.: 29.60
## Max. :937.50
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 2.90
## Class :character Class :character Class :character 1st Qu.:18.00
## Mode :character Mode :character Mode :character Median :24.00
## Mean :22.63
## 3rd Qu.:28.00
## Max. :42.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.2 Min. : 6.00 Min. : 1.000 Min. : 0.50
## 1st Qu.: 2.5 1st Qu.: 63.00 1st Qu.: 1.000 1st Qu.: 3.20
## Median : 4.3 Median : 78.00 Median : 1.000 Median : 7.40
## Mean : 6.4 Mean : 73.24 Mean : 2.467 Mean : 18.69
## 3rd Qu.: 8.5 3rd Qu.: 89.00 3rd Qu.: 5.000 3rd Qu.: 16.40
## Max. :36.7 Max. :100.00 Max. :10.000 Max. :700.80
## air_quality_PM10
## Min. : 0.60
## 1st Qu.: 5.40
## Median : 11.50
## Mean : 30.39
## 3rd Qu.: 27.30
## Max. :869.60
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 2.90
## Class :character Class :character Class :character 1st Qu.:17.00
## Mode :character Mode :character Mode :character Median :23.00
## Mean :22.36
## 3rd Qu.:27.45
## Max. :42.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 6.00 Min. : 1.000 Min. : 0.50
## 1st Qu.: 3.400 1st Qu.: 64.00 1st Qu.: 1.000 1st Qu.: 3.20
## Median : 4.900 Median : 78.00 Median : 1.000 Median : 7.70
## Mean : 6.516 Mean : 73.57 Mean : 2.193 Mean : 19.53
## 3rd Qu.: 8.100 3rd Qu.: 89.00 3rd Qu.: 1.000 3rd Qu.: 17.20
## Max. :43.800 Max. :100.00 Max. :10.000 Max. :895.10
## air_quality_PM10
## Min. : 0.50
## 1st Qu.: 5.55
## Median : 12.40
## Mean : 31.01
## 3rd Qu.: 26.50
## Max. :1079.10
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 3.00
## Class :character Class :character Class :character 1st Qu.:18.00
## Mode :character Mode :character Mode :character Median :24.00
## Mean :22.58
## 3rd Qu.:27.00
## Max. :45.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 10.00 Min. :1.000 Min. : 0.50
## 1st Qu.: 3.600 1st Qu.: 64.00 1st Qu.:1.000 1st Qu.: 3.10
## Median : 5.600 Median : 78.00 Median :1.000 Median : 7.30
## Mean : 6.638 Mean : 73.52 Mean :2.305 Mean : 19.46
## 3rd Qu.: 9.400 3rd Qu.: 89.00 3rd Qu.:4.000 3rd Qu.: 17.85
## Max. :30.000 Max. :100.00 Max. :9.000 Max. :895.10
## air_quality_PM10
## Min. : 0.50
## 1st Qu.: 5.45
## Median : 12.30
## Mean : 31.56
## 3rd Qu.: 28.50
## Max. :1079.10
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 5.00
## Class :character Class :character Class :character 1st Qu.:17.15
## Mode :character Mode :character Mode :character Median :24.00
## Mean :22.63
## 3rd Qu.:28.00
## Max. :39.60
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 12.00 Min. :1.000 Min. : 0.50
## 1st Qu.: 3.800 1st Qu.: 62.50 1st Qu.:1.000 1st Qu.: 3.10
## Median : 5.600 Median : 78.00 Median :1.000 Median : 6.90
## Mean : 6.753 Mean : 73.09 Mean :2.424 Mean : 18.52
## 3rd Qu.: 8.800 3rd Qu.: 88.00 3rd Qu.:4.500 3rd Qu.: 17.30
## Max. :43.800 Max. :100.00 Max. :9.000 Max. :725.90
## air_quality_PM10
## Min. : 0.60
## 1st Qu.: 5.80
## Median : 11.50
## Mean : 30.04
## 3rd Qu.: 26.70
## Max. :911.70
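The sampling and summary step behind the output above could be sketched roughly as follows. This is a minimal illustration, not the original code: the helper `make_subsamples`, the synthetic `weather` data frame, the subsample size, and sampling without replacement are all assumptions, and the real CSV path is not given in the text.

```r
# Sketch only: `make_subsamples`, `weather`, and the sampling settings are
# illustrative assumptions, not the author's original code.
make_subsamples <- function(df, cols, n_samples = 5, size = nrow(df) %/% 2) {
  lapply(seq_len(n_samples), function(i) {
    rows <- sample(nrow(df), size)   # row indices drawn without replacement
    df[rows, cols, drop = FALSE]
  })
}

# Column names as listed in the text.
cols <- c("country", "wind_direction", "condition_text",
          "temperature_celsius", "wind_mph", "humidity",
          "uv_index", "air_quality_PM2.5", "air_quality_PM10")

# Synthetic stand-in for the real data. With the actual file this would be
# something like read.csv(<path to the CSV>), which the text does not specify.
set.seed(1)
n <- 200
weather <- data.frame(
  country = sample(c("Indonesia", "Bulgaria", "Turkey", "Sudan"), n, replace = TRUE),
  wind_direction = sample(c("N", "E", "S", "W"), n, replace = TRUE),
  condition_text = sample(c("Sunny", "Cloudy"), n, replace = TRUE),
  temperature_celsius = round(rnorm(n, mean = 22, sd = 6), 1),
  wind_mph = round(runif(n, min = 2, max = 40), 1),
  humidity = sample(5:100, n, replace = TRUE),
  uv_index = sample(1:10, n, replace = TRUE),
  `air_quality_PM2.5` = round(rexp(n, rate = 1 / 20), 1),
  air_quality_PM10 = round(rexp(n, rate = 1 / 30), 1),
  check.names = FALSE
)

# Draw five subsamples and print a summary() of each one.
subsamples <- make_subsamples(weather, cols, n_samples = 5, size = 100)
for (s in subsamples) print(summary(s))
```

Calling `set.seed()` before sampling makes the subsamples reproducible between runs, which helps when comparing outputs across sections.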
For each subsample, the code prints a summary of that subsample's data frame and then lists the five most frequently occurring countries in that subsample, along with their frequencies.
## Summary for DataFrame:
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 2.90
## Class :character Class :character Class :character 1st Qu.:17.80
## Mode :character Mode :character Mode :character Median :23.60
## Mean :22.45
## 3rd Qu.:27.00
## Max. :45.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 10.00 Min. :1.000 Min. : 0.50
## 1st Qu.: 2.900 1st Qu.: 64.00 1st Qu.:1.000 1st Qu.: 3.30
## Median : 5.600 Median : 77.00 Median :1.000 Median : 8.00
## Mean : 6.609 Mean : 72.92 Mean :2.512 Mean : 21.81
## 3rd Qu.: 9.400 3rd Qu.: 88.00 3rd Qu.:5.000 3rd Qu.: 18.30
## Max. :36.700 Max. :100.00 Max. :9.000 Max. :814.30
## air_quality_PM10
## Min. : 0.50
## 1st Qu.: 6.00
## Median : 12.90
## Mean : 33.61
## 3rd Qu.: 29.60
## Max. :937.50
##
## Frequency of top 5 country:
##
## Indonesia Bulgaria Turkey Bolivia China
## 21 14 14 13 13
## Summary for DataFrame:
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 2.90
## Class :character Class :character Class :character 1st Qu.:18.00
## Mode :character Mode :character Mode :character Median :24.00
## Mean :22.63
## 3rd Qu.:28.00
## Max. :42.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.2 Min. : 6.00 Min. : 1.000 Min. : 0.50
## 1st Qu.: 2.5 1st Qu.: 63.00 1st Qu.: 1.000 1st Qu.: 3.20
## Median : 4.3 Median : 78.00 Median : 1.000 Median : 7.40
## Mean : 6.4 Mean : 73.24 Mean : 2.467 Mean : 18.69
## 3rd Qu.: 8.5 3rd Qu.: 89.00 3rd Qu.: 5.000 3rd Qu.: 16.40
## Max. :36.7 Max. :100.00 Max. :10.000 Max. :700.80
## air_quality_PM10
## Min. : 0.60
## 1st Qu.: 5.40
## Median : 11.50
## Mean : 30.39
## 3rd Qu.: 27.30
## Max. :869.60
##
## Frequency of top 5 country:
##
## Thailand Indonesia Bulgaria Belgium Panama
## 21 19 17 16 16
## Summary for DataFrame:
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 2.90
## Class :character Class :character Class :character 1st Qu.:17.00
## Mode :character Mode :character Mode :character Median :23.00
## Mean :22.36
## 3rd Qu.:27.45
## Max. :42.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 6.00 Min. : 1.000 Min. : 0.50
## 1st Qu.: 3.400 1st Qu.: 64.00 1st Qu.: 1.000 1st Qu.: 3.20
## Median : 4.900 Median : 78.00 Median : 1.000 Median : 7.70
## Mean : 6.516 Mean : 73.57 Mean : 2.193 Mean : 19.53
## 3rd Qu.: 8.100 3rd Qu.: 89.00 3rd Qu.: 1.000 3rd Qu.: 17.20
## Max. :43.800 Max. :100.00 Max. :10.000 Max. :895.10
## air_quality_PM10
## Min. : 0.50
## 1st Qu.: 5.55
## Median : 12.40
## Mean : 31.01
## 3rd Qu.: 26.50
## Max. :1079.10
##
## Frequency of top 5 country:
##
## Bulgaria Sudan Bhutan Indonesia Venezuela
## 20 17 15 15 15
## Summary for DataFrame:
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 3.00
## Class :character Class :character Class :character 1st Qu.:18.00
## Mode :character Mode :character Mode :character Median :24.00
## Mean :22.58
## 3rd Qu.:27.00
## Max. :45.00
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 10.00 Min. :1.000 Min. : 0.50
## 1st Qu.: 3.600 1st Qu.: 64.00 1st Qu.:1.000 1st Qu.: 3.10
## Median : 5.600 Median : 78.00 Median :1.000 Median : 7.30
## Mean : 6.638 Mean : 73.52 Mean :2.305 Mean : 19.46
## 3rd Qu.: 9.400 3rd Qu.: 89.00 3rd Qu.:4.000 3rd Qu.: 17.85
## Max. :30.000 Max. :100.00 Max. :9.000 Max. :895.10
## air_quality_PM10
## Min. : 0.50
## 1st Qu.: 5.45
## Median : 12.30
## Mean : 31.56
## 3rd Qu.: 28.50
## Max. :1079.10
##
## Frequency of top 5 country:
##
## Bulgaria Indonesia Maldives Thailand Gambia
## 23 16 14 14 13
## Summary for DataFrame:
## country wind_direction condition_text temperature_celsius
## Length:1267 Length:1267 Length:1267 Min. : 5.00
## Class :character Class :character Class :character 1st Qu.:17.15
## Mode :character Mode :character Mode :character Median :24.00
## Mean :22.63
## 3rd Qu.:28.00
## Max. :39.60
## wind_mph humidity uv_index air_quality_PM2.5
## Min. : 2.200 Min. : 12.00 Min. :1.000 Min. : 0.50
## 1st Qu.: 3.800 1st Qu.: 62.50 1st Qu.:1.000 1st Qu.: 3.10
## Median : 5.600 Median : 78.00 Median :1.000 Median : 6.90
## Mean : 6.753 Mean : 73.09 Mean :2.424 Mean : 18.52
## 3rd Qu.: 8.800 3rd Qu.: 88.00 3rd Qu.:4.500 3rd Qu.: 17.30
## Max. :43.800 Max. :100.00 Max. :9.000 Max. :725.90
## air_quality_PM10
## Min. : 0.60
## 1st Qu.: 5.80
## Median : 11.50
## Mean : 30.04
## 3rd Qu.: 26.70
## Max. :911.70
##
## Frequency of top 5 country:
##
## Sudan Indonesia Turkey Bulgaria Madagascar
## 26 18 17 16 16
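A frequency table like the ones above can be obtained by tabulating the `country` column, sorting the counts in decreasing order, and keeping the first five entries. The sketch below illustrates this on a small made-up data frame; the helper names `top_countries` and `describe_subsample` and the `demo` data are hypothetical.

```r
# Hypothetical helper: the n most frequent values of `country`, with counts.
top_countries <- function(df, n = 5) {
  counts <- sort(table(df$country), decreasing = TRUE)
  head(counts, n)
}

# Hypothetical helper that mirrors the layout of the output above.
describe_subsample <- function(df, n = 5) {
  cat("Summary for DataFrame:\n")
  print(summary(df))
  cat("\nFrequency of top", n, "country:\n")
  print(top_countries(df, n))
}

# Made-up data for illustration only.
demo <- data.frame(
  country = c("Sudan", "Sudan", "Sudan", "Turkey", "Turkey",
              "Bulgaria", "China", "Peru", "Peru", "Chile"),
  humidity = c(78, 64, 88, 77, 89, 63, 100, 12, 62, 73)
)

describe_subsample(demo, n = 3)
top <- top_countries(demo, n = 3)
```

Because each subsample has only a few rows per country, the top-five list is expected to change from one subsample to the next, as it does in the output above.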
We can now calculate summary statistics (mean, median, variance, and standard deviation) for each numeric column in each subsample.
## Summary statistics for Subsample 1 :
## temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean 22.07916 73.73481 6.62281 2.190213 17.93228
## Median 22.2 78 5.6 1 7.3
## Variance 41.96032 398.1808 20.26404 4.909288 1842.841
## StdDev 6.477679 19.95447 4.501559 2.215691 42.92833
## air_quality_PM10
## Mean 28.70805
## Median 12.2
## Variance 3020.887
## StdDev 54.9626
##
## Summary statistics for Subsample 2 :
## temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean 22.54917 73.49329 6.613812 2.374901 17.85036
## Median 24 78 5.1 1 7.3
## Variance 41.16448 392.1048 22.72048 5.403572 1411.39
## StdDev 6.415955 19.80164 4.766601 2.324558 37.56847
## air_quality_PM10
## Mean 29.55762
## Median 11.9
## Variance 2703.957
## StdDev 51.99959
##
## Summary statistics for Subsample 3 :
## temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean 22.49061 72.83189 6.556906 2.343331 17.6588
## Median 23 78 5.6 1 7.2
## Variance 41.11734 421.666 20.09632 5.290404 1726.691
## StdDev 6.412281 20.53451 4.482892 2.300088 41.55348
## air_quality_PM10
## Mean 29.31358
## Median 11.4
## Variance 3302.87
## StdDev 57.4706
##
## Summary statistics for Subsample 4 :
## temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean 22.47837 73.72849 6.433386 2.303078 19.80205
## Median 24 78 5.6 1 7.1
## Variance 41.85723 398.6908 17.97962 5.178213 2949.439
## StdDev 6.469716 19.96724 4.240238 2.275569 54.30874
## air_quality_PM10
## Mean 31.58248
## Median 12.3
## Variance 5320.122
## StdDev 72.93916
##
## Summary statistics for Subsample 5 :
## temperature_celsius humidity wind_mph uv_index air_quality_PM2.5
## Mean 22.82028 72.94317 6.425099 2.277032 21.68137
## Median 24 78 5.6 1 8
## Variance 43.71265 413.9841 19.41509 5.195704 3347.369
## StdDev 6.611554 20.3466 4.406256 2.279409 57.85645
## air_quality_PM10
## Mean 34.80781
## Median 13.3
## Variance 5855.506
## StdDev 76.52128
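One way to build a table of this shape is to apply the four statistics to every numeric column with `sapply`. The following is a sketch under assumptions: `summary_stats` is a hypothetical helper and `demo` is made-up data, so the numbers do not correspond to the output above.

```r
# Hypothetical helper: rows are statistics, columns are variables.
summary_stats <- function(df, num_cols) {
  sapply(df[num_cols], function(x) c(
    Mean     = mean(x, na.rm = TRUE),
    Median   = median(x, na.rm = TRUE),
    Variance = var(x, na.rm = TRUE),   # sample variance (n - 1 denominator)
    StdDev   = sd(x, na.rm = TRUE)     # square root of the sample variance
  ))
}

# Made-up data for illustration only.
demo <- data.frame(
  temperature_celsius = c(20, 22, 24, 26),
  humidity = c(60, 70, 80, 90)
)

stats <- summary_stats(demo, c("temperature_celsius", "humidity"))
cat("Summary statistics for Subsample 1 :\n")
print(stats)
```

Note that the standard deviation is the square root of the variance, which is consistent with the output above (for example, 6.477679 squared is approximately 41.96).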
This code creates a bar plot that compares the standard deviations of the variables above across the different subsamples.
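A grouped bar plot of this kind could be produced with base R's `barplot`, as in the sketch below. The plotting library actually used is not stated in the text, so base graphics is an assumption here, as are the helper `sd_by_subsample` and the synthetic `pool` data.

```r
# Hypothetical helper: matrix of standard deviations,
# one row per variable and one column per subsample.
sd_by_subsample <- function(subsamples, num_cols) {
  m <- do.call(cbind, lapply(subsamples, function(df) {
    sapply(df[num_cols], sd, na.rm = TRUE)
  }))
  colnames(m) <- paste("Subsample", seq_along(subsamples))
  m
}

# Synthetic stand-in data and subsamples, for illustration only.
set.seed(1)
pool <- data.frame(
  temperature_celsius = rnorm(500, mean = 22, sd = 6.5),
  uv_index = sample(1:10, 500, replace = TRUE)
)
subsamples <- lapply(1:5, function(i) pool[sample(nrow(pool), 100), ])

sds <- sd_by_subsample(subsamples, c("temperature_celsius", "uv_index"))

# t(sds) puts subsamples in rows, so bars are grouped by variable
# and each group contains one bar per subsample.
barplot(t(sds), beside = TRUE, legend.text = TRUE,
        ylab = "Standard deviation",
        main = "Standard deviation by variable and subsample")
```

Variables on very different scales (for example PM10 versus uv_index) can make the smaller bars hard to read, so plotting them in separate panels may be worth considering.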
The boxplots are created separately for each variable. Each panel shows one box per subsample, so the distribution and spread of that variable can be compared across the subsamples.
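As a rough sketch of this step, the code below draws one panel per variable with one box per subsample, using base R's `boxplot`. The helper `boxplots_by_variable`, the short labels `S1` to `S5`, and the synthetic data are assumptions made for illustration.

```r
# Hypothetical helper: one boxplot panel per variable, one box per subsample.
# Returns the boxplot statistics invisibly so they can be inspected later.
boxplots_by_variable <- function(subsamples, num_cols) {
  old <- par(mfrow = c(1, length(num_cols)))   # panels side by side
  on.exit(par(old))                            # restore graphics settings
  stats <- lapply(num_cols, function(v) {
    values <- lapply(subsamples, function(df) df[[v]])
    names(values) <- paste0("S", seq_along(subsamples))
    boxplot(values, main = v, xlab = "Subsample", ylab = v)
  })
  names(stats) <- num_cols
  invisible(stats)
}

# Synthetic stand-in data and subsamples, for illustration only.
set.seed(2)
pool <- data.frame(
  humidity = sample(5:100, 500, replace = TRUE),
  wind_mph = round(rexp(500, rate = 1 / 6) + 2, 1)
)
subsamples <- lapply(1:5, function(i) pool[sample(nrow(pool), 100), ])

res <- boxplots_by_variable(subsamples, c("humidity", "wind_mph"))
```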
### Visualizing outliers to depict anomalies in subsamples
Red markers added to the boxplots make it easier to visually identify potential outliers for each variable in each subsample.
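With base R graphics, outlier points can be colored through the `outcol` and `outpch` arguments that `boxplot` passes on to `bxp`. The sketch below assumes base graphics and synthetic, right-skewed data standing in for `air_quality_PM2.5`; the original plotting code is not shown in the text.

```r
# Synthetic right-skewed stand-in for air_quality_PM2.5, for illustration only.
set.seed(3)
pool <- data.frame(`air_quality_PM2.5` = rexp(1000, rate = 1 / 20),
                   check.names = FALSE)
subsamples <- lapply(1:5, function(i) {
  pool[sample(nrow(pool), 300), , drop = FALSE]
})

values <- lapply(subsamples, function(df) df[["air_quality_PM2.5"]])
names(values) <- paste0("S", seq_along(values))

# Points beyond the whiskers (1.5 * IQR from the box by default) are drawn
# as filled red circles.
b <- boxplot(values, outcol = "red", outpch = 19,
             main = "air_quality_PM2.5", xlab = "Subsample")

# b$out holds the flagged values and b$group the box each one belongs to.
flagged <- data.frame(subsample = names(values)[b$group], value = b$out)
head(flagged)
```

The points flagged here are only candidates under the default whisker rule; for heavily skewed variables such as the air quality measures, many of them may be genuine extreme readings rather than errors.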
### Count of anomalies within subsamples
We can now count the number of anomalies for each variable in each subsample.
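The text does not define what counts as an anomaly, so the sketch below assumes the same rule that boxplots use: a value is flagged when it lies more than 1.5 times the interquartile range beyond the hinges, as computed by `boxplot.stats`. The helpers `count_outliers` and `outlier_counts` and the two tiny data frames are hypothetical.

```r
# Assumed anomaly rule: values beyond 1.5 * IQR from the hinges.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Hypothetical helper: matrix of counts,
# one row per variable and one column per subsample.
outlier_counts <- function(subsamples, num_cols) {
  m <- do.call(cbind, lapply(subsamples, function(df) {
    sapply(df[num_cols], count_outliers)
  }))
  colnames(m) <- paste("Subsample", seq_along(subsamples))
  m
}

# Two tiny made-up subsamples, for illustration only.
s1 <- data.frame(air_quality_PM10 = c(5, 6, 7, 8, 9, 900),
                 humidity = c(60, 62, 64, 66, 68, 70))
s2 <- data.frame(air_quality_PM10 = c(5, 6, 7, 8, 9, 10),
                 humidity = c(1, 70, 71, 72, 73, 74))

counts <- outlier_counts(list(s1, s2), c("air_quality_PM10", "humidity"))
print(counts)
```

In this toy example the value 900 is flagged in the first subsample and the humidity reading of 1 in the second, so each variable has one anomaly in exactly one subsample.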
We do not observe significant differences in the variance of the temperature_celsius variable across the subsamples (roughly 41.1 to 43.7). However, sampling does affect the uv_index variable, as can be seen in the bar chart.