These are the packages we have used to tidy our Weather dataset.
library(magrittr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(lubridate)
library(readr)
To tidy and preprocess this dataset on weather, we took many steps which will be outlined here. Base R functions were employed to import both data sets and perform a merge using a key shared between both sets. We first had to grasp an understanding of the dataset such as the stucture of the variables, how many variables there were, which ones were useful to use and which ones weren’t. We used basic functions to summarise and understand our data. To begin the tidying process we subset our data, removing any variables we thought were not useful for our project. We performed type conversions we thought were appropriate on the data. We created two new columns which held the difference in temperatures, one column for the value in celsius and one for farenheit. We scoured the dataset for NA values, if found we thought it was appropriate to impute missing values with the mean values for the respective columns. Outliers were scanned for through the use of box plots and removed using math functions and filtering methods. logarithmic transformations were performed on both newly created columns to normalise the values and obtain a graph that is more easily understood.
Data was obtained from kaggle (https://www.kaggle.com/smid80/weatherww2), theres were two csv files. The first described weather conditions which were recorded on multiple days at certain weather stations around the world such a maximum and minimum temperatures. The second dataset had information on the weather stations themselves such as location, lattitude and longitude. Both datasets were merged into a new datset called “data_w” on the “STA”" key.
data_wsum <- read.csv("C:/Users/shubhdeep/Desktop/DataPreprocess/WeatherSummary.csv")
head(data_wsum)
data_wloc <- read.csv("C:/Users/shubhdeep/Desktop/DataPreprocess/StationLocations.csv")
head(data_wloc)
data_w <- merge(data_wsum, data_wloc, by.x = "STA", by.y = "WBAN")
head(data_w)
In this step we gained an understanding of the variables within the data. There were many so the ones we filtered out in the proceeding step will not be mention here, 1. STA: Weather station ID 2. Precip: Precipitation in mm 3. MaxTemp: Max temp in degrees C 4. MinTemp: Min temp in degrees C 5. MeanTemp: Mean temp in degrees C 6. YR: Year of observation 7. MO: Month of observation 8. DA: Day of observation 9. PRCP: Precipitation in inches and hundredths 10. MAX: Max temp in degrees F 11. MIN: Min temp in degrees F 12. MEA: Mean temp in degrees F 13. NAME: Weather station name 14. STATE.COUNTRY.ID: Location 15. LAT: Latitude (String) 16. LON: Longitude (String) 17. ELEV: Elevation (Above sea level) 19. Latitude: Numeric Latitude value 20. Longitude: Numeric Longitude value
summary(data_w)
STA Date Precip WindGustSpd MaxTemp
Min. :10001 1945-4-16: 122 0 :64267 Min. :18.52 Min. :-33.33
1st Qu.:11801 1945-4-17: 122 T :16753 1st Qu.:29.63 1st Qu.: 25.56
Median :22508 1945-4-19: 122 0.254 : 3389 Median :37.04 Median : 29.44
Mean :29659 1945-4-20: 122 0.508 : 2909 Mean :37.77 Mean : 27.05
3rd Qu.:33501 1945-4-22: 122 0.762 : 2015 3rd Qu.:43.06 3rd Qu.: 31.67
Max. :82506 1945-4-23: 122 1.016 : 1639 Max. :75.93 Max. : 50.00
(Other) :118308 (Other):28068 NA's :118508
MinTemp MeanTemp Snowfall PoorWeather YR
Min. :-38.33 Min. :-35.56 0 :115690 :84803 Min. :40.00
1st Qu.: 15.00 1st Qu.: 20.56 : 1163 1 :31980 1st Qu.:43.00
Median : 21.11 Median : 25.56 5.08 : 534 0 : 870 Median :44.00
Mean : 17.79 Mean : 22.41 2.54 : 339 1 1: 310 Mean :43.81
3rd Qu.: 23.33 3rd Qu.: 27.22 7.62 : 330 100000 : 263 3rd Qu.:45.00
Max. : 34.44 Max. : 40.00 10.16 : 205 1 1 : 133 Max. :45.00
(Other): 779 (Other): 681
MO DA PRCP DR SPD
Min. : 1.000 Min. : 1.0 0 :62335 Min. : 2 Min. :10.00
1st Qu.: 4.000 1st Qu.: 8.0 T :16753 1st Qu.:11 1st Qu.:16.00
Median : 7.000 Median :16.0 0.01 : 3389 Median :32 Median :20.00
Mean : 6.726 Mean :15.8 0.02 : 2909 Mean :27 Mean :20.40
3rd Qu.:10.000 3rd Qu.:23.0 0.03 : 2015 3rd Qu.:34 3rd Qu.:23.25
Max. :12.000 Max. :31.0 : 1932 Max. :78 Max. :41.00
(Other):29707 NA's :118507 NA's :118508
MAX MIN MEA SNF SND
Min. :-28 Min. :-37.00 Min. :-32.00 0 :115690 Min. :0
1st Qu.: 78 1st Qu.: 59.00 1st Qu.: 69.00 : 1163 1st Qu.:0
Median : 85 Median : 70.00 Median : 78.00 0.2 : 534 Median :0
Mean : 81 Mean : 64.27 Mean : 72.64 0.1 : 339 Mean :0
3rd Qu.: 89 3rd Qu.: 74.00 3rd Qu.: 81.00 0.3 : 330 3rd Qu.:0
Max. :122 Max. : 94.00 Max. :104.00 0.4 : 205 Max. :0
NA's :474 NA's :468 NA's :498 (Other): 779 NA's :113477
FT FB FTI ITH PGT TSHDSBRSGF
Mode:logical Mode:logical Mode:logical Mode:logical Min. : 0.00 :84803
NA's:119040 NA's:119040 NA's:119040 NA's:119040 1st Qu.: 8.50 1 :31980
Median :11.60 0 : 870
Mean :12.09 1 1: 310
3rd Qu.:15.00 100000 : 263
Max. :23.90 1 1 : 133
NA's :118515 (Other): 681
SD3 RHX RHN RVG WTE
Mode:logical Mode:logical Mode:logical Mode:logical Mode:logical
NA's:119040 NA's:119040 NA's:119040 NA's:119040 NA's:119040
NAME STATE.COUNTRY.ID LAT LON ELEV
WHEELER/AFB 810.1: 2192 PM :10329 2129N : 2796 08008W : 2588 Min. : 1.0
BALBOA/ALBROOK : 2185 IN : 9530 0855N : 2547 07936W : 2547 1st Qu.: 9.0
MOLOKAI/AP 524 : 2154 HI : 9185 0858N : 2185 15802W : 2192 Median : 26.0
HICKAM/AFB : 2118 BZ : 6265 2109N : 2154 07933W : 2185 Mean : 416.4
SAN JOSE : 2044 IY : 5551 2120N : 2118 15706W : 2154 3rd Qu.: 93.0
RIO HATO : 1750 AU : 5337 0822N : 1750 15757W : 2118 Max. :9999.0
(Other) :106597 (Other):72843 (Other):105490 (Other):105256
Latitude Longitude
Min. :-27.60 Min. :-175.00
1st Qu.: 6.75 1st Qu.: -79.50
Median : 17.90 Median : -22.62
Mean : 17.55 Mean : -15.79
3rd Qu.: 27.68 3rd Qu.: 44.87
Max. : 67.02 Max. : 177.37
str(data_w)
'data.frame': 119040 obs. of 38 variables:
$ STA : int 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
$ Date : Factor w/ 2192 levels "1940-1-1","1940-1-10",..: 1005 1016 1027 1030 1031 1032 1033 1034 1035 1006 ...
$ Precip : Factor w/ 540 levels "0","0.254","0.508",..: 5 1 182 182 1 1 540 244 540 244 ...
$ WindGustSpd : num NA NA NA NA NA NA NA NA NA NA ...
$ MaxTemp : num 25.6 28.9 26.1 26.7 26.7 ...
$ MinTemp : num 22.2 21.7 22.2 22.2 21.7 ...
$ MeanTemp : num 23.9 25.6 24.4 24.4 24.4 ...
$ Snowfall : Factor w/ 36 levels "","#VALUE!","0",..: 3 3 3 3 3 3 3 3 3 3 ...
$ PoorWeather : Factor w/ 39 levels "","0","1","1 1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ YR : int 42 42 42 42 42 42 42 42 42 42 ...
$ MO : int 7 7 7 7 7 7 7 7 7 7 ...
$ DA : int 1 2 3 4 5 6 7 8 9 10 ...
$ PRCP : Factor w/ 541 levels "","0","0.01",..: 6 2 12 12 2 2 541 16 541 16 ...
$ DR : int NA NA NA NA NA NA NA NA NA NA ...
$ SPD : int NA NA NA NA NA NA NA NA NA NA ...
$ MAX : int 78 84 79 80 80 80 83 80 81 78 ...
$ MIN : int 72 71 72 72 71 71 73 72 73 71 ...
$ MEA : int 75 78 76 76 76 76 78 76 77 74 ...
$ SNF : Factor w/ 36 levels "","0","0.1","0.2",..: 2 2 2 2 2 2 2 2 2 2 ...
$ SND : int NA NA NA NA NA NA NA NA NA NA ...
$ FT : logi NA NA NA NA NA NA ...
$ FB : logi NA NA NA NA NA NA ...
$ FTI : logi NA NA NA NA NA NA ...
$ ITH : logi NA NA NA NA NA NA ...
$ PGT : num NA NA NA NA NA NA NA NA NA NA ...
$ TSHDSBRSGF : Factor w/ 39 levels "","0","1","1 1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ SD3 : logi NA NA NA NA NA NA ...
$ RHX : logi NA NA NA NA NA NA ...
$ RHN : logi NA NA NA NA NA NA ...
$ RVG : logi NA NA NA NA NA NA ...
$ WTE : logi NA NA NA NA NA NA ...
$ NAME : Factor w/ 159 levels "ABADAN","ACCRA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ STATE.COUNTRY.ID: Factor w/ 64 levels "AL","AT","AU",..: 21 21 21 21 21 21 21 21 21 21 ...
$ LAT : Factor w/ 157 levels "0123S","0159N",..: 11 11 11 11 11 11 11 11 11 11 ...
$ LON : Factor w/ 158 levels "00010W","00037E",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ELEV : int 62 62 62 62 62 62 62 62 62 62 ...
$ Latitude : num 5.6 5.6 5.6 5.6 5.6 5.6 5.6 5.6 5.6 5.6 ...
$ Longitude : num -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 ...
Here we subset our data removing any variables we deemed were unfit for our purpose or were just completely full of unobtainable NA values. We performed some type conversions on selected attributes shown in code below.
data_w <- subset(data_w, select = -c(4,8,9,14,15,19,20,21:31))
head(data_w)
data_w$Date <- as.Date(data_w$Date, format="%Y-%m-%d")
str(data_w)
'data.frame': 119040 obs. of 20 variables:
$ STA : int 10001 10001 10001 10001 10001 10001 10001 10001 10001 10001 ...
$ Date : Date, format: "1942-07-01" "1942-07-02" "1942-07-03" ...
$ Precip : Factor w/ 540 levels "0","0.254","0.508",..: 5 1 182 182 1 1 540 244 540 244 ...
$ MaxTemp : num 25.6 28.9 26.1 26.7 26.7 ...
$ MinTemp : num 22.2 21.7 22.2 22.2 21.7 ...
$ MeanTemp : num 23.9 25.6 24.4 24.4 24.4 ...
$ YR : int 42 42 42 42 42 42 42 42 42 42 ...
$ MO : int 7 7 7 7 7 7 7 7 7 7 ...
$ DA : int 1 2 3 4 5 6 7 8 9 10 ...
$ PRCP : Factor w/ 541 levels "","0","0.01",..: 6 2 12 12 2 2 541 16 541 16 ...
$ MAX : int 78 84 79 80 80 80 83 80 81 78 ...
$ MIN : int 72 71 72 72 71 71 73 72 73 71 ...
$ MEA : int 75 78 76 76 76 76 78 76 77 74 ...
$ NAME : Factor w/ 159 levels "ABADAN","ACCRA",..: 2 2 2 2 2 2 2 2 2 2 ...
$ STATE.COUNTRY.ID: Factor w/ 64 levels "AL","AT","AU",..: 21 21 21 21 21 21 21 21 21 21 ...
$ LAT : Factor w/ 157 levels "0123S","0159N",..: 11 11 11 11 11 11 11 11 11 11 ...
$ LON : Factor w/ 158 levels "00010W","00037E",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ELEV : int 62 62 62 62 62 62 62 62 62 62 ...
$ Latitude : num 5.6 5.6 5.6 5.6 5.6 5.6 5.6 5.6 5.6 5.6 ...
$ Longitude : num -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 -0.3 ...
data_w$Precip <- as.character(data_w$Precip)
data_w$Precip <- as.numeric(data_w$Precip)
NAs introduced by coercion
data_w$PRCP <- as.character(data_w$PRCP)
data_w$PRCP <- as.numeric(data_w$PRCP)
NAs introduced by coercion
data_w$NAME <- as.character(data_w$NAME)
data_w$STATE.COUNTRY.ID <- as.character(data_w$STATE.COUNTRY.ID)
data_w$YR <- as.factor(data_w$YR)
data_w$MO <- as.factor(data_w$MO)
data_w$DA <- as.factor(data_w$DA)
data_w$Latitude <- as.factor(data_w$Latitude)
data_w$Longitude <- as.factor(data_w$Longitude)
head(data_w)
Two new columns were made using the “mutate” function. One for temperature difference in degrees Celsius and one for temperature difference in Farenheit.
data_w <- data_w %>% mutate(TempDiffCel = MaxTemp - MinTemp)
data_w <- data_w %>% mutate(TempDiffFar = MAX - MIN)
head(data_w)
The colSums() function helped us in identifying and displaying the number of NA values in each column. Any NA values we found were imputed with the mean of their respective column. Data type conversion were made after using the impute function.
colSums(is.na(data_w))
STA Date Precip MaxTemp MinTemp
0 0 16753 0 0
MeanTemp YR MO DA PRCP
0 0 0 0 18685
MAX MIN MEA NAME STATE.COUNTRY.ID
474 468 498 0 0
LAT LON ELEV Latitude Longitude
0 0 0 0 0
TempDiffCel TempDiffFar
0 498
data_w$Precip = impute(data_w$Precip, fun = mean)
data_w$PRCP = impute(data_w$PRCP, fun = mean)
data_w$MAX = impute(data_w$MAX, fun = mean)
data_w$MIN = impute(data_w$MIN, fun = mean)
data_w$MEA = impute(data_w$MEA, fun = mean)
data_w$TempDiffFar = impute(data_w$TempDiffFar, fun = mean)
class(data_w$Precip)
[1] "impute"
class(data_w$PRCP)
[1] "impute"
class(data_w$MAX)
[1] "impute"
class(data_w$MIN)
[1] "impute"
class(data_w$MEA)
[1] "impute"
class(data_w$TempDiffFar)
[1] "impute"
data_w$Precip <- as.numeric(data_w$Precip)
data_w$PRCP <- as.numeric(data_w$PRCP)
data_w$MAX <- as.numeric(data_w$MAX)
data_w$MIN <- as.numeric(data_w$MIN)
data_w$MEA <- as.numeric(data_w$MEA)
data_w$TempDiffFar <- as.numeric(data_w$TempDiffFar)
Outliers were found using the help of boxplots of numeric attributes. Removal was done using maths to remove any values lieing outside of the upper and lower fences for the respective attribute. The filter() function was used to remove values outside the specified fence range. To check for removal, a second set of boxplots were made to then confirm removal. This second set of boxplots showed that outliers still existed but these new outliers were still within the fences of the original boxplots so we decided not to remove the new “outliers”.
weather2 <- data_w
weather2$Precip %>% boxplot(main="Precip outlier Boxplot")
summary(weather2$Precip)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 0.000 3.754 3.754 307.340
upperPrecip = 3.754 + (3.754 * 1.5)
weather2 <- weather2 %>% filter(Precip < upperPrecip)
class(weather2$Precip)
[1] "numeric"
weather2$Precip %>% boxplot(main="Precip outlier Boxplot (Post outlier removal)")
weather2$MaxTemp %>% boxplot(main="MaxTemp outlier Boxplot")
summary(weather2$MaxTemp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-33.33 25.00 29.44 27.03 31.67 50.00
IQRMaxTemp = 31.67 - 25
upperMaxTemp = 31.67 + IQRMaxTemp * 1.5
lowerMaxTemp = 25 - IQRMaxTemp * 1.5
weather2 <- weather2 %>% filter(MaxTemp < upperMaxTemp & MaxTemp > lowerMaxTemp)
class(weather2$MaxTemp)
[1] "numeric"
weather2$MaxTemp %>% boxplot(main="MaxTemp outlier Boxplot (Post outlier removal)")
weather2$MinTemp %>% boxplot(main="MinTemp outlier Boxplot")
summary(weather2$MinTemp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-17.78 16.67 21.11 19.40 23.33 34.44
IQRMinTemp = 23.33 - 16.67
upperMinTemp = 23.33 + IQRMinTemp * 1.5
lowerMinTemp = 16.67 - IQRMinTemp * 1.5
weather2 <- weather2 %>% filter(MinTemp < upperMinTemp & MinTemp > lowerMinTemp)
class(weather2$MinTemp)
[1] "numeric"
weather2$MinTemp %>% boxplot(main="MinTemp outlier Boxplot (Post outlier removal)")
weather2$MeanTemp %>% boxplot(main="MeanTemp outlier Boxplot")
summary(weather2$MeanTemp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-17.78 22.78 25.56 24.75 27.78 37.22
IQRMeanTemp = 27.78 - 22.78
upperMeanTemp = 27.78 + IQRMeanTemp * 1.5
lowerMeanTemp = 22.78 - IQRMeanTemp * 1.5
weather2 <- weather2 %>% filter(MeanTemp < upperMeanTemp & MeanTemp > lowerMeanTemp)
class(weather2$MeanTemp)
[1] "numeric"
weather2$MeanTemp %>% boxplot(main="MeanTemp outlier Boxplot (Post outlier removal)")
weather2$TempDiffCel %>% boxplot(main="TempDiffCel outlier Boxplot")
summary(weather2$TempDiffCel)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 6.667 8.889 9.431 11.667 27.778
IQRTempDiffCel = 11.667 - 6.667
upperTempDiffCel = 11.667 + IQRTempDiffCel * 1.5
lowerTempDiffCel = 6.667 - IQRTempDiffCel * 1.5
weather2 <- weather2 %>% filter(TempDiffCel < upperTempDiffCel & TempDiffCel > lowerTempDiffCel)
class(weather2$TempDiffCel)
[1] "numeric"
weather2$TempDiffCel %>% boxplot(main="TempDiffCel outlier Boxplot (Post outlier removal)")
weather2$MAX %>% boxplot(main="MAX outlier Boxplot")
summary(weather2$MAX)
Min. 1st Qu. Median Mean 3rd Qu. Max.
61.0 82.0 86.0 85.5 89.0 107.0
IQRMAX = 89 - 82
upperMAX =89 + IQRMAX * 1.5
lowerMAX = 82 - IQRMAX * 1.5
weather2 <- weather2 %>% filter(MAX < upperMAX & MAX > lowerMAX)
class(weather2$MAX)
[1] "numeric"
weather2$MAX %>% boxplot(main="MAX outlier Boxplot (Post outlier removal)")
weather2$MIN %>% boxplot(main="MIN outlier Boxplot")
summary(weather2$MIN)
Min. 1st Qu. Median Mean 3rd Qu. Max.
45.00 65.00 71.00 69.21 74.00 90.00
IQRMIN = 74 - 65
upperMIN = 74 + IQRMIN * 1.5
lowerMIN = 65 - IQRMIN * 1.5
weather2 <- weather2 %>% filter(MIN < upperMIN & MIN > lowerMIN)
class(weather2$MIN)
[1] "numeric"
weather2$MIN %>% boxplot(main="MIN outlier Boxplot (Post outlier removal)")
weather2$MEA %>% boxplot(main="MEA outlier Boxplot")
summary(weather2$MEA)
Min. 1st Qu. Median Mean 3rd Qu. Max.
60.00 75.00 79.00 77.86 82.00 94.00
IQRMEA = 82 - 74
upperMEA = 82 + IQRMEA * 1.5
lowerMEA = 74 - IQRMEA * 1.5
weather2 <- weather2 %>% filter(MEA < upperMEA & MEA > lowerMEA)
class(weather2$MEA)
[1] "numeric"
weather2$MEA %>% boxplot(main="MEA outlier Boxplot (Post outlier removal)")
weather2$TempDiffFar %>% boxplot(main="TempDiffFar outlier Boxplot")
summary(weather2$TempDiffFar)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 12.00 15.00 16.06 20.00 34.00
IQRTempDiffFar = 20 - 12
upperTempDiffFar = 20 + IQRTempDiffFar * 1.5
lowerTempDiffFar = 12 - IQRTempDiffFar * 1.5
weather2 <- weather2 %>% filter(TempDiffFar < upperTempDiffFar & TempDiffFar > lowerTempDiffFar)
class(weather2$TempDiffFar)
[1] "numeric"
weather2$TempDiffFar %>% boxplot(main="TempDiffFar outlier Boxplot (Post outlier removal)")
Histograms of the two newly created variables were made. We them performed log transformation on these to normalise the histogram for better understanding as the initial histograms showed some slight skewing.
hist(weather2$TempDiffCel, main = "Histogram of diffference in temperature (Celsius)", xlab = "Temperature difference (Celsius)")
hist(log(weather2$TempDiffCel), main = "Histogram of log tranformed temperature difference (Celsius)", xlab = "log(Temperature Difference)")
hist(weather2$TempDiffFar, main = "Histogram of diffference in temperature (Farenheit)", xlab = "Temperature difference (Farenheit)")
hist(log(weather2$TempDiffFar), main = "Histogram of log tranformed temperature difference (Farenheit)", xlab = "log(Temperature Difference)")