Most of the time in data analysis is spent cleaning and preparing data repetitively throughout the journey of analysis [Hadley Wickham n.d]. So, it is important that one should know how to properly utilize the tools and frameworks of data cleaning. In this vignette, we are going to clean the weather dataset from DataCamp by following the concept of tidy data by Hadley Wickham.
The source code for this vignette can be accessed via https://github.com/justin-htet/stds-vignette.
Before cleaning the data, we must first understand the basic structure of the data by observing the dimensions, column names and the data type of each column.
# cascually explore the data
weather <- readRDS("weather.rds")
weather
# check the structure of the data frame
dplyr::glimpse(weather)
## Observations: 286
## Variables: 35
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ year <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 20…
## $ month <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
## $ measure <chr> "Max.TemperatureF", "Mean.TemperatureF", "Min.Temperatur…
## $ X1 <chr> "64", "52", "39", "46", "40", "26", "74", "63", "52", "3…
## $ X2 <chr> "42", "38", "33", "40", "27", "17", "92", "72", "51", "3…
## $ X3 <chr> "51", "44", "37", "49", "42", "24", "100", "79", "57", "…
## $ X4 <chr> "43", "37", "30", "24", "21", "13", "69", "54", "39", "3…
## $ X5 <chr> "42", "34", "26", "37", "25", "12", "85", "66", "47", "3…
## $ X6 <chr> "45", "42", "38", "45", "40", "36", "100", "93", "85", "…
## $ X7 <chr> "38", "30", "21", "36", "20", "-3", "92", "61", "29", "3…
## $ X8 <chr> "29", "24", "18", "28", "16", "3", "92", "70", "47", "30…
## $ X9 <chr> "49", "39", "29", "49", "41", "28", "100", "93", "86", "…
## $ X10 <chr> "48", "43", "38", "45", "39", "37", "100", "95", "89", "…
## $ X11 <chr> "39", "36", "32", "37", "31", "27", "92", "87", "82", "2…
## $ X12 <chr> "39", "35", "31", "28", "27", "25", "85", "75", "64", "2…
## $ X13 <chr> "42", "37", "32", "28", "26", "24", "75", "65", "55", "2…
## $ X14 <chr> "45", "39", "33", "29", "27", "25", "82", "68", "53", "2…
## $ X15 <chr> "42", "37", "32", "33", "29", "27", "89", "75", "60", "3…
## $ X16 <chr> "44", "40", "35", "42", "36", "30", "96", "85", "73", "3…
## $ X17 <chr> "49", "45", "41", "46", "41", "32", "100", "85", "70", "…
## $ X18 <chr> "44", "40", "36", "34", "30", "26", "89", "73", "57", "2…
## $ X19 <chr> "37", "33", "29", "25", "22", "20", "69", "63", "56", "3…
## $ X20 <chr> "36", "32", "27", "30", "24", "20", "89", "79", "69", "3…
## $ X21 <chr> "36", "33", "30", "30", "27", "25", "85", "77", "69", "3…
## $ X22 <chr> "44", "39", "33", "39", "34", "25", "89", "79", "69", "3…
## $ X23 <chr> "47", "45", "42", "45", "42", "37", "100", "91", "82", "…
## $ X24 <chr> "46", "44", "41", "46", "44", "41", "100", "98", "96", "…
## $ X25 <chr> "59", "52", "44", "58", "43", "29", "100", "75", "49", "…
## $ X26 <chr> "50", "44", "37", "31", "29", "28", "70", "60", "49", "3…
## $ X27 <chr> "52", "45", "38", "34", "31", "29", "70", "60", "50", "3…
## $ X28 <chr> "52", "46", "40", "42", "35", "27", "76", "65", "53", "2…
## $ X29 <chr> "41", "36", "30", "26", "20", "10", "64", "51", "37", "3…
## $ X30 <chr> "30", "26", "22", "10", "4", "-6", "50", "38", "26", "30…
## $ X31 <chr> "30", "25", "20", "8", "5", "1", "57", "44", "31", "30.3…
So, the weather data frame consists of 286 rows with 35 variables, mostly character variables with three interger variables.
The columns X1 to X31 seems like days of the month with the values of various weather measures.
The first column X is just a duplicate of row numbers and we are going to remove it.
# remove the first column
weather <- weather[, -1]
head(weather)
It looks like the values for the weather measurements (column 3) for each day of the month are stored in the columns X1 to X31. From tidy data perspective, the data set is messy because:
In each case, we will be using gather and spread from tidyr package. The basic idea is shown by the animation below (made by Garrick Aden-Buie) [R-Ladies Sydney 2018].
The column names X1
-X31
represent days of the month, which should be values of a new variable called day
.
# gather X1-X31 variable names into the new "day" column
# and the values of those 31 variables into the new "value" column
collected_days_weather2 <- weather %>%
gather(key = day, value = value, c("X1":"X31"), na.rm = TRUE)
head(collected_days_weather2)
The values in the column measure
of the weather dataset should be variables.
# spread the "measure" variables across the dataset with values from the column "value"
tidy_weather <- collected_days_weather2 %>%
spread(key = measure, value = value)
head(tidy_weather)
Now, we can see that each observation represents the various weather measures for a day of the month.
After tidying our data, the next step is to prepare for the data analysis by checking the data type, value, and range of each variable and transforming data types and values necessarily, and combining or separating columns.
# remove X's from day column
tidy_weather$day <- str_replace(tidy_weather$day, "X", "")
# unite the year, month, and day columns into a new column
tidy_weather <- tidy_weather %>%
unite(col = date, year, month, day, sep = "-")
head(tidy_weather)
The data type of the variable date
is character. So, it is needed to convert into the date data type.
# using lubridate's ymd()
tidy_weather$date <- ymd(tidy_weather$date)
glimpse(tidy_weather)
## Observations: 366
## Variables: 23
## $ date <date> 2014-12-01, 2014-12-10, 2014-12-11, 2…
## $ CloudCover <chr> "6", "8", "8", "7", "5", "4", "2", "8"…
## $ Events <chr> "Rain", "Rain", "Rain-Snow", "Snow", "…
## $ Max.Dew.PointF <chr> "46", "45", "37", "28", "28", "29", "3…
## $ Max.Gust.SpeedMPH <chr> "29", "29", "28", "21", "23", "20", "2…
## $ Max.Humidity <chr> "74", "100", "92", "85", "75", "82", "…
## $ Max.Sea.Level.PressureIn <chr> "30.45", "29.58", "29.81", "29.88", "2…
## $ Max.TemperatureF <chr> "64", "48", "39", "39", "42", "45", "4…
## $ Max.VisibilityMiles <chr> "10", "10", "10", "10", "10", "10", "1…
## $ Max.Wind.SpeedMPH <chr> "22", "23", "21", "16", "17", "15", "1…
## $ Mean.Humidity <chr> "63", "95", "87", "75", "65", "68", "7…
## $ Mean.Sea.Level.PressureIn <chr> "30.13", "29.5", "29.61", "29.85", "29…
## $ Mean.TemperatureF <chr> "52", "43", "36", "35", "37", "39", "3…
## $ Mean.VisibilityMiles <chr> "10", "3", "7", "10", "10", "10", "10"…
## $ Mean.Wind.SpeedMPH <chr> "13", "13", "13", "11", "12", "10", "6…
## $ MeanDew.PointF <chr> "40", "39", "31", "27", "26", "27", "2…
## $ Min.DewpointF <chr> "26", "37", "27", "25", "24", "25", "2…
## $ Min.Humidity <chr> "52", "89", "82", "64", "55", "53", "6…
## $ Min.Sea.Level.PressureIn <chr> "30.01", "29.43", "29.44", "29.81", "2…
## $ Min.TemperatureF <chr> "39", "38", "32", "31", "32", "33", "3…
## $ Min.VisibilityMiles <chr> "10", "1", "1", "7", "10", "10", "10",…
## $ PrecipitationIn <chr> "0.01", "0.28", "0.02", "T", "T", "0.0…
## $ WindDirDegrees <chr> "268", "357", "230", "286", "298", "30…
After glimpsing at tidy_weather
, it looks like all the weather measures except Events
are numbers but in character
data type. So, these measures must be converted to a numeric data type.
# first, move the Events variable to column 2
tidy_weather <- tidy_weather %>%
select(date, Events, CloudCover:WindDirDegrees)
# convert the other measure columns to numeric
convert_num_weather <- mutate_each(tidy_weather,
funs = list(as.numeric),
CloudCover:WindDirDegrees)
## Warning: NAs introduced by coercion
There is a warning about NAs being returned by numeric conversion. It is meant that some of the values are letters. Let’s see which column is causing the problem.
# check if all the values in each column are numeric with
# all.is.numeric from Hmisc package
tidy_weather %>%
select(CloudCover:WindDirDegrees) %>%
mutate_all(all.is.numeric) %>%
map(all)
## $CloudCover
## [1] TRUE
##
## $Max.Dew.PointF
## [1] TRUE
##
## $Max.Gust.SpeedMPH
## [1] FALSE
##
## $Max.Humidity
## [1] TRUE
##
## $Max.Sea.Level.PressureIn
## [1] TRUE
##
## $Max.TemperatureF
## [1] TRUE
##
## $Max.VisibilityMiles
## [1] TRUE
##
## $Max.Wind.SpeedMPH
## [1] TRUE
##
## $Mean.Humidity
## [1] TRUE
##
## $Mean.Sea.Level.PressureIn
## [1] TRUE
##
## $Mean.TemperatureF
## [1] TRUE
##
## $Mean.VisibilityMiles
## [1] TRUE
##
## $Mean.Wind.SpeedMPH
## [1] TRUE
##
## $MeanDew.PointF
## [1] TRUE
##
## $Min.DewpointF
## [1] TRUE
##
## $Min.Humidity
## [1] TRUE
##
## $Min.Sea.Level.PressureIn
## [1] TRUE
##
## $Min.TemperatureF
## [1] TRUE
##
## $Min.VisibilityMiles
## [1] TRUE
##
## $PrecipitationIn
## [1] FALSE
##
## $WindDirDegrees
## [1] TRUE
# check the columns that have FALSE value
tail(tidy_weather[, c("Max.Gust.SpeedMPH", "PrecipitationIn")], n = 25)
So, the column Max.Gust.SpeedMPH
has NA
s and the column PrecipitationIn
has “T”s (Trace). “Traces” in precipitation are defined as less than 0.005 inch. So, in this case, we need to map “T”s to 0.
# map "T" to "0" in PrecipitationIn
tidy_weather$PrecipitationIn <- str_replace(tidy_weather$PrecipitationIn,
"T", "0")
# convert to numeric again
tidy_weather <- mutate_each(tidy_weather,
funs = list(as.numeric),
CloudCover:WindDirDegrees)
glimpse(tidy_weather)
## Observations: 366
## Variables: 23
## $ date <date> 2014-12-01, 2014-12-10, 2014-12-11, 2…
## $ Events <chr> "Rain", "Rain", "Rain-Snow", "Snow", "…
## $ CloudCover <dbl> 6, 8, 8, 7, 5, 4, 2, 8, 8, 7, 4, 7, 6,…
## $ Max.Dew.PointF <dbl> 46, 45, 37, 28, 28, 29, 33, 42, 46, 34…
## $ Max.Gust.SpeedMPH <dbl> 29, 29, 28, 21, 23, 20, 21, 10, 26, 30…
## $ Max.Humidity <dbl> 74, 100, 92, 85, 75, 82, 89, 96, 100, …
## $ Max.Sea.Level.PressureIn <dbl> 30.45, 29.58, 29.81, 29.88, 29.86, 29.…
## $ Max.TemperatureF <dbl> 64, 48, 39, 39, 42, 45, 42, 44, 49, 44…
## $ Max.VisibilityMiles <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ Max.Wind.SpeedMPH <dbl> 22, 23, 21, 16, 17, 15, 15, 8, 20, 23,…
## $ Mean.Humidity <dbl> 63, 95, 87, 75, 65, 68, 75, 85, 85, 73…
## $ Mean.Sea.Level.PressureIn <dbl> 30.13, 29.50, 29.61, 29.85, 29.82, 29.…
## $ Mean.TemperatureF <dbl> 52, 43, 36, 35, 37, 39, 37, 40, 45, 40…
## $ Mean.VisibilityMiles <dbl> 10, 3, 7, 10, 10, 10, 10, 9, 6, 10, 10…
## $ Mean.Wind.SpeedMPH <dbl> 13, 13, 13, 11, 12, 10, 6, 4, 11, 14, …
## $ MeanDew.PointF <dbl> 40, 39, 31, 27, 26, 27, 29, 36, 41, 30…
## $ Min.DewpointF <dbl> 26, 37, 27, 25, 24, 25, 27, 30, 32, 26…
## $ Min.Humidity <dbl> 52, 89, 82, 64, 55, 53, 60, 73, 70, 57…
## $ Min.Sea.Level.PressureIn <dbl> 30.01, 29.43, 29.44, 29.81, 29.78, 29.…
## $ Min.TemperatureF <dbl> 39, 38, 32, 31, 32, 33, 32, 35, 41, 36…
## $ Min.VisibilityMiles <dbl> 10, 1, 1, 7, 10, 10, 10, 5, 1, 10, 10,…
## $ PrecipitationIn <dbl> 0.01, 0.28, 0.02, 0.00, 0.00, 0.00, 0.…
## $ WindDirDegrees <dbl> 268, 357, 230, 286, 298, 306, 324, 79,…
Let’s dig deep to find missing values and errors by:
# count all the missing values in the dataset
sum(is.na(tidy_weather))
## [1] 6
summary(tidy_weather)
## date Events CloudCover Max.Dew.PointF
## Min. :2014-12-01 Length:366 Min. :0.000 Min. :-6.00
## 1st Qu.:2015-03-02 Class :character 1st Qu.:3.000 1st Qu.:32.00
## Median :2015-06-01 Mode :character Median :5.000 Median :47.50
## Mean :2015-06-01 Mean :4.708 Mean :45.48
## 3rd Qu.:2015-08-31 3rd Qu.:7.000 3rd Qu.:61.00
## Max. :2015-12-01 Max. :8.000 Max. :75.00
##
## Max.Gust.SpeedMPH Max.Humidity Max.Sea.Level.PressureIn
## Min. : 0.00 Min. : 39.00 Min. :29.58
## 1st Qu.:21.00 1st Qu.: 73.25 1st Qu.:30.00
## Median :25.50 Median : 86.00 Median :30.14
## Mean :26.99 Mean : 85.69 Mean :30.16
## 3rd Qu.:31.25 3rd Qu.: 93.00 3rd Qu.:30.31
## Max. :94.00 Max. :1000.00 Max. :30.88
## NA's :6
## Max.TemperatureF Max.VisibilityMiles Max.Wind.SpeedMPH Mean.Humidity
## Min. :18.00 Min. : 2.000 Min. : 8.00 Min. :28.00
## 1st Qu.:42.00 1st Qu.:10.000 1st Qu.:16.00 1st Qu.:56.00
## Median :60.00 Median :10.000 Median :20.00 Median :66.00
## Mean :58.93 Mean : 9.907 Mean :20.62 Mean :66.02
## 3rd Qu.:76.00 3rd Qu.:10.000 3rd Qu.:24.00 3rd Qu.:76.75
## Max. :96.00 Max. :10.000 Max. :38.00 Max. :98.00
##
## Mean.Sea.Level.PressureIn Mean.TemperatureF Mean.VisibilityMiles
## Min. :29.49 Min. : 8.00 Min. :-1.000
## 1st Qu.:29.87 1st Qu.:36.25 1st Qu.: 8.000
## Median :30.03 Median :53.50 Median :10.000
## Mean :30.04 Mean :51.40 Mean : 8.861
## 3rd Qu.:30.19 3rd Qu.:68.00 3rd Qu.:10.000
## Max. :30.77 Max. :84.00 Max. :10.000
##
## Mean.Wind.SpeedMPH MeanDew.PointF Min.DewpointF Min.Humidity
## Min. : 4.00 Min. :-11.00 Min. :-18.00 Min. :16.00
## 1st Qu.: 8.00 1st Qu.: 24.00 1st Qu.: 16.25 1st Qu.:35.00
## Median :10.00 Median : 41.00 Median : 35.00 Median :46.00
## Mean :10.68 Mean : 38.96 Mean : 32.25 Mean :48.31
## 3rd Qu.:13.00 3rd Qu.: 56.00 3rd Qu.: 51.00 3rd Qu.:60.00
## Max. :22.00 Max. : 71.00 Max. : 68.00 Max. :96.00
##
## Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
## Min. :29.16 Min. :-3.00 Min. : 0.000
## 1st Qu.:29.76 1st Qu.:30.00 1st Qu.: 2.000
## Median :29.94 Median :46.00 Median :10.000
## Mean :29.93 Mean :43.33 Mean : 6.716
## 3rd Qu.:30.09 3rd Qu.:60.00 3rd Qu.:10.000
## Max. :30.64 Max. :74.00 Max. :10.000
##
## PrecipitationIn WindDirDegrees
## Min. :0.0000 Min. : 1.0
## 1st Qu.:0.0000 1st Qu.:113.0
## Median :0.0000 Median :222.0
## Mean :0.1016 Mean :200.1
## 3rd Qu.:0.0400 3rd Qu.:275.0
## Max. :2.9000 Max. :360.0
##
So, all the missing values in this dataset is associated with the column Max.Gust.SpeedMPH
.
# find indexes of NAs in Max.Gust.SpeedMPH
indexes <- which(is.na(tidy_weather$Max.Gust.SpeedMPH))
# observe each row with the mssing values
tidy_weather[indexes,]
In this case, we might have to discuss with the data provider or domain expert to clarify the issue.
After going through summary(tidy_weather)
, we can see errors from the columns Max.Humidity
and Mean.VisibilityMiles
. For max humidity, the error value is 1000. So, it is probable that an extra zero must have been entered. In the case of the negative one value in the mean visibility miles, it is also probable that the appropriate value must be 10.
# find rows with values greate than 100 for Max.Humidity
indexes <- which(tidy_weather$Max.Humidity > 100)
tidy_weather[indexes, ] %>%
select(date, Events, Max.Humidity, CloudCover:WindDirDegrees)
# fix the error
tidy_weather$Max.Humidity[indexes] <- 100
# find rows with values less than 0 for Mean.VisibilityMiles
indexes <- which(tidy_weather$Mean.VisibilityMiles < 0)
tidy_weather[indexes, ] %>%
select(date, Events, Mean.VisibilityMiles, CloudCover:WindDirDegrees)
# fix the error
tidy_weather$Mean.VisibilityMiles[indexes] <- 10
Event
columnSome of the values in the Event
column are empty for any day on which there was no significant weather events such as rain, fog, storm, etc. It is a good practice to be explicit about it and the empty strings should be replaced with “None”.
tidy_weather$Events[tidy_weather$Events == ""] <- "None"
head(tidy_weather)
Finally, the variable names should follow a standard convention. Some of the variables contain the period “.” for word separation and it has special meaning in certain situations. So, it is best to follow a recommend approach which is using camel case.
# clean_names from janitor helps naming variables in a convinient way
clean_weather <- clean_names(tidy_weather)
# and, we have our final clean data
glimpse(clean_weather)
## Observations: 366
## Variables: 23
## $ date <date> 2014-12-01, 2014-12-10, 2014-12-11, …
## $ events <chr> "Rain", "Rain", "Rain-Snow", "Snow", …
## $ cloud_cover <dbl> 6, 8, 8, 7, 5, 4, 2, 8, 8, 7, 4, 7, 6…
## $ max_dew_point_f <dbl> 46, 45, 37, 28, 28, 29, 33, 42, 46, 3…
## $ max_gust_speed_mph <dbl> 29, 29, 28, 21, 23, 20, 21, 10, 26, 3…
## $ max_humidity <dbl> 74, 100, 92, 85, 75, 82, 89, 96, 100,…
## $ max_sea_level_pressure_in <dbl> 30.45, 29.58, 29.81, 29.88, 29.86, 29…
## $ max_temperature_f <dbl> 64, 48, 39, 39, 42, 45, 42, 44, 49, 4…
## $ max_visibility_miles <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ max_wind_speed_mph <dbl> 22, 23, 21, 16, 17, 15, 15, 8, 20, 23…
## $ mean_humidity <dbl> 63, 95, 87, 75, 65, 68, 75, 85, 85, 7…
## $ mean_sea_level_pressure_in <dbl> 30.13, 29.50, 29.61, 29.85, 29.82, 29…
## $ mean_temperature_f <dbl> 52, 43, 36, 35, 37, 39, 37, 40, 45, 4…
## $ mean_visibility_miles <dbl> 10, 3, 7, 10, 10, 10, 10, 9, 6, 10, 1…
## $ mean_wind_speed_mph <dbl> 13, 13, 13, 11, 12, 10, 6, 4, 11, 14,…
## $ mean_dew_point_f <dbl> 40, 39, 31, 27, 26, 27, 29, 36, 41, 3…
## $ min_dewpoint_f <dbl> 26, 37, 27, 25, 24, 25, 27, 30, 32, 2…
## $ min_humidity <dbl> 52, 89, 82, 64, 55, 53, 60, 73, 70, 5…
## $ min_sea_level_pressure_in <dbl> 30.01, 29.43, 29.44, 29.81, 29.78, 29…
## $ min_temperature_f <dbl> 39, 38, 32, 31, 32, 33, 32, 35, 41, 3…
## $ min_visibility_miles <dbl> 10, 1, 1, 7, 10, 10, 10, 5, 1, 10, 10…
## $ precipitation_in <dbl> 0.01, 0.28, 0.02, 0.00, 0.00, 0.00, 0…
## $ wind_dir_degrees <dbl> 268, 357, 230, 286, 298, 306, 324, 79…
clean_weather
Overall, the cleaning process includes
Hadley Wickham n.d, Tidy data, The Comprehensive R Archive Network, viewed 30 March 2019, https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
R-Ladies Sydney 2018, Clean It Up 4: Wide to Long to Wide to…, R-Ladies Sydney, 30 March 2019, https://rladiessydney.org/post/2018/11/28/clean-it-up-4/