Introduction

Most of the time in data analysis is spent cleaning and preparing data repetitively throughout the journey of analysis [Hadley Wickham n.d]. So, it is important that one should know how to properly utilize the tools and frameworks of data cleaning. In this vignette, we are going to clean the weather dataset from DataCamp by following the concept of tidy data by Hadley Wickham.

The source code for this vignette can be accessed via https://github.com/justin-htet/stds-vignette.

Explore the Dataset

Before cleaning the data, we must first understand the basic structure of the data by observing the dimensions, column names and the data type of each column.

# cascually explore the data
weather <- readRDS("weather.rds")
weather
# check the structure of the data frame
dplyr::glimpse(weather)
## Observations: 286
## Variables: 35
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ year    <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 20…
## $ month   <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
## $ measure <chr> "Max.TemperatureF", "Mean.TemperatureF", "Min.Temperatur…
## $ X1      <chr> "64", "52", "39", "46", "40", "26", "74", "63", "52", "3…
## $ X2      <chr> "42", "38", "33", "40", "27", "17", "92", "72", "51", "3…
## $ X3      <chr> "51", "44", "37", "49", "42", "24", "100", "79", "57", "…
## $ X4      <chr> "43", "37", "30", "24", "21", "13", "69", "54", "39", "3…
## $ X5      <chr> "42", "34", "26", "37", "25", "12", "85", "66", "47", "3…
## $ X6      <chr> "45", "42", "38", "45", "40", "36", "100", "93", "85", "…
## $ X7      <chr> "38", "30", "21", "36", "20", "-3", "92", "61", "29", "3…
## $ X8      <chr> "29", "24", "18", "28", "16", "3", "92", "70", "47", "30…
## $ X9      <chr> "49", "39", "29", "49", "41", "28", "100", "93", "86", "…
## $ X10     <chr> "48", "43", "38", "45", "39", "37", "100", "95", "89", "…
## $ X11     <chr> "39", "36", "32", "37", "31", "27", "92", "87", "82", "2…
## $ X12     <chr> "39", "35", "31", "28", "27", "25", "85", "75", "64", "2…
## $ X13     <chr> "42", "37", "32", "28", "26", "24", "75", "65", "55", "2…
## $ X14     <chr> "45", "39", "33", "29", "27", "25", "82", "68", "53", "2…
## $ X15     <chr> "42", "37", "32", "33", "29", "27", "89", "75", "60", "3…
## $ X16     <chr> "44", "40", "35", "42", "36", "30", "96", "85", "73", "3…
## $ X17     <chr> "49", "45", "41", "46", "41", "32", "100", "85", "70", "…
## $ X18     <chr> "44", "40", "36", "34", "30", "26", "89", "73", "57", "2…
## $ X19     <chr> "37", "33", "29", "25", "22", "20", "69", "63", "56", "3…
## $ X20     <chr> "36", "32", "27", "30", "24", "20", "89", "79", "69", "3…
## $ X21     <chr> "36", "33", "30", "30", "27", "25", "85", "77", "69", "3…
## $ X22     <chr> "44", "39", "33", "39", "34", "25", "89", "79", "69", "3…
## $ X23     <chr> "47", "45", "42", "45", "42", "37", "100", "91", "82", "…
## $ X24     <chr> "46", "44", "41", "46", "44", "41", "100", "98", "96", "…
## $ X25     <chr> "59", "52", "44", "58", "43", "29", "100", "75", "49", "…
## $ X26     <chr> "50", "44", "37", "31", "29", "28", "70", "60", "49", "3…
## $ X27     <chr> "52", "45", "38", "34", "31", "29", "70", "60", "50", "3…
## $ X28     <chr> "52", "46", "40", "42", "35", "27", "76", "65", "53", "2…
## $ X29     <chr> "41", "36", "30", "26", "20", "10", "64", "51", "37", "3…
## $ X30     <chr> "30", "26", "22", "10", "4", "-6", "50", "38", "26", "30…
## $ X31     <chr> "30", "25", "20", "8", "5", "1", "57", "44", "31", "30.3…

So, the weather data frame consists of 286 rows with 35 variables, mostly character variables with three interger variables.

The columns X1 to X31 seems like days of the month with the values of various weather measures.

The first column X is just a duplicate of row numbers and we are going to remove it.

# remove the first column
weather <- weather[, -1]
head(weather)

Tidy with Gather and Spread

It looks like the values for the weather measurements (column 3) for each day of the month are stored in the columns X1 to X31. From tidy data perspective, the data set is messy because:

  • column names are values (X1 to X31)
  • variable names are represented as values (column 3 - measure)

In each case, we will be using gather and spread from tidyr package. The basic idea is shown by the animation below (made by Garrick Aden-Buie) [R-Ladies Sydney 2018].

Column Names are Values

The column names X1-X31 represent days of the month, which should be values of a new variable called day.

# gather X1-X31 variable names into the new "day" column 
# and the values of those 31 variables into the new "value" column
collected_days_weather2 <- weather %>% 
  gather(key = day, value = value, c("X1":"X31"), na.rm = TRUE)
head(collected_days_weather2)

Values are Column Names

The values in the column measure of the weather dataset should be variables.

# spread the "measure" variables across the dataset with values from the column "value"
tidy_weather <- collected_days_weather2 %>% 
  spread(key = measure, value = value)
head(tidy_weather)

Now, we can see that each observation represents the various weather measures for a day of the month.

Prepare for Data for Analysis

After tidying our data, the next step is to prepare for the data analysis by checking the data type, value, and range of each variable and transforming data types and values necessarily, and combining or separating columns.

Clean up Dates

# remove X's from day column
tidy_weather$day <- str_replace(tidy_weather$day, "X", "")

# unite the year, month, and day columns into a new column
tidy_weather <- tidy_weather %>% 
  unite(col = date, year, month, day, sep = "-")

head(tidy_weather)

The data type of the variable date is character. So, it is needed to convert into the date data type.

# using lubridate's ymd()
tidy_weather$date <- ymd(tidy_weather$date)
glimpse(tidy_weather)
## Observations: 366
## Variables: 23
## $ date                      <date> 2014-12-01, 2014-12-10, 2014-12-11, 2…
## $ CloudCover                <chr> "6", "8", "8", "7", "5", "4", "2", "8"…
## $ Events                    <chr> "Rain", "Rain", "Rain-Snow", "Snow", "…
## $ Max.Dew.PointF            <chr> "46", "45", "37", "28", "28", "29", "3…
## $ Max.Gust.SpeedMPH         <chr> "29", "29", "28", "21", "23", "20", "2…
## $ Max.Humidity              <chr> "74", "100", "92", "85", "75", "82", "…
## $ Max.Sea.Level.PressureIn  <chr> "30.45", "29.58", "29.81", "29.88", "2…
## $ Max.TemperatureF          <chr> "64", "48", "39", "39", "42", "45", "4…
## $ Max.VisibilityMiles       <chr> "10", "10", "10", "10", "10", "10", "1…
## $ Max.Wind.SpeedMPH         <chr> "22", "23", "21", "16", "17", "15", "1…
## $ Mean.Humidity             <chr> "63", "95", "87", "75", "65", "68", "7…
## $ Mean.Sea.Level.PressureIn <chr> "30.13", "29.5", "29.61", "29.85", "29…
## $ Mean.TemperatureF         <chr> "52", "43", "36", "35", "37", "39", "3…
## $ Mean.VisibilityMiles      <chr> "10", "3", "7", "10", "10", "10", "10"…
## $ Mean.Wind.SpeedMPH        <chr> "13", "13", "13", "11", "12", "10", "6…
## $ MeanDew.PointF            <chr> "40", "39", "31", "27", "26", "27", "2…
## $ Min.DewpointF             <chr> "26", "37", "27", "25", "24", "25", "2…
## $ Min.Humidity              <chr> "52", "89", "82", "64", "55", "53", "6…
## $ Min.Sea.Level.PressureIn  <chr> "30.01", "29.43", "29.44", "29.81", "2…
## $ Min.TemperatureF          <chr> "39", "38", "32", "31", "32", "33", "3…
## $ Min.VisibilityMiles       <chr> "10", "1", "1", "7", "10", "10", "10",…
## $ PrecipitationIn           <chr> "0.01", "0.28", "0.02", "T", "T", "0.0…
## $ WindDirDegrees            <chr> "268", "357", "230", "286", "298", "30…

Convert Measure Variables

After glimpsing at tidy_weather, it looks like all the weather measures except Events are numbers but in character data type. So, these measures must be converted to a numeric data type.

# first, move the Events variable to column 2
tidy_weather <- tidy_weather %>% 
  select(date, Events, CloudCover:WindDirDegrees)

# convert the other measure columns to numeric
convert_num_weather <- mutate_each(tidy_weather,
                              funs = list(as.numeric),
                              CloudCover:WindDirDegrees)
## Warning: NAs introduced by coercion

There is a warning about NAs being returned by numeric conversion. It is meant that some of the values are letters. Let’s see which column is causing the problem.

# check if all the values in each column are numeric with
# all.is.numeric from Hmisc package
tidy_weather %>% 
  select(CloudCover:WindDirDegrees) %>% 
  mutate_all(all.is.numeric) %>% 
  map(all)
## $CloudCover
## [1] TRUE
## 
## $Max.Dew.PointF
## [1] TRUE
## 
## $Max.Gust.SpeedMPH
## [1] FALSE
## 
## $Max.Humidity
## [1] TRUE
## 
## $Max.Sea.Level.PressureIn
## [1] TRUE
## 
## $Max.TemperatureF
## [1] TRUE
## 
## $Max.VisibilityMiles
## [1] TRUE
## 
## $Max.Wind.SpeedMPH
## [1] TRUE
## 
## $Mean.Humidity
## [1] TRUE
## 
## $Mean.Sea.Level.PressureIn
## [1] TRUE
## 
## $Mean.TemperatureF
## [1] TRUE
## 
## $Mean.VisibilityMiles
## [1] TRUE
## 
## $Mean.Wind.SpeedMPH
## [1] TRUE
## 
## $MeanDew.PointF
## [1] TRUE
## 
## $Min.DewpointF
## [1] TRUE
## 
## $Min.Humidity
## [1] TRUE
## 
## $Min.Sea.Level.PressureIn
## [1] TRUE
## 
## $Min.TemperatureF
## [1] TRUE
## 
## $Min.VisibilityMiles
## [1] TRUE
## 
## $PrecipitationIn
## [1] FALSE
## 
## $WindDirDegrees
## [1] TRUE
# check the columns that have FALSE value
tail(tidy_weather[, c("Max.Gust.SpeedMPH", "PrecipitationIn")], n = 25)

So, the column Max.Gust.SpeedMPH has NAs and the column PrecipitationIn has “T”s (Trace). “Traces” in precipitation are defined as less than 0.005 inch. So, in this case, we need to map “T”s to 0.

# map "T" to "0" in PrecipitationIn
tidy_weather$PrecipitationIn <- str_replace(tidy_weather$PrecipitationIn,
                                              "T", "0")

# convert to numeric again
tidy_weather <- mutate_each(tidy_weather,
                              funs = list(as.numeric),
                              CloudCover:WindDirDegrees)
glimpse(tidy_weather)
## Observations: 366
## Variables: 23
## $ date                      <date> 2014-12-01, 2014-12-10, 2014-12-11, 2…
## $ Events                    <chr> "Rain", "Rain", "Rain-Snow", "Snow", "…
## $ CloudCover                <dbl> 6, 8, 8, 7, 5, 4, 2, 8, 8, 7, 4, 7, 6,…
## $ Max.Dew.PointF            <dbl> 46, 45, 37, 28, 28, 29, 33, 42, 46, 34…
## $ Max.Gust.SpeedMPH         <dbl> 29, 29, 28, 21, 23, 20, 21, 10, 26, 30…
## $ Max.Humidity              <dbl> 74, 100, 92, 85, 75, 82, 89, 96, 100, …
## $ Max.Sea.Level.PressureIn  <dbl> 30.45, 29.58, 29.81, 29.88, 29.86, 29.…
## $ Max.TemperatureF          <dbl> 64, 48, 39, 39, 42, 45, 42, 44, 49, 44…
## $ Max.VisibilityMiles       <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ Max.Wind.SpeedMPH         <dbl> 22, 23, 21, 16, 17, 15, 15, 8, 20, 23,…
## $ Mean.Humidity             <dbl> 63, 95, 87, 75, 65, 68, 75, 85, 85, 73…
## $ Mean.Sea.Level.PressureIn <dbl> 30.13, 29.50, 29.61, 29.85, 29.82, 29.…
## $ Mean.TemperatureF         <dbl> 52, 43, 36, 35, 37, 39, 37, 40, 45, 40…
## $ Mean.VisibilityMiles      <dbl> 10, 3, 7, 10, 10, 10, 10, 9, 6, 10, 10…
## $ Mean.Wind.SpeedMPH        <dbl> 13, 13, 13, 11, 12, 10, 6, 4, 11, 14, …
## $ MeanDew.PointF            <dbl> 40, 39, 31, 27, 26, 27, 29, 36, 41, 30…
## $ Min.DewpointF             <dbl> 26, 37, 27, 25, 24, 25, 27, 30, 32, 26…
## $ Min.Humidity              <dbl> 52, 89, 82, 64, 55, 53, 60, 73, 70, 57…
## $ Min.Sea.Level.PressureIn  <dbl> 30.01, 29.43, 29.44, 29.81, 29.78, 29.…
## $ Min.TemperatureF          <dbl> 39, 38, 32, 31, 32, 33, 32, 35, 41, 36…
## $ Min.VisibilityMiles       <dbl> 10, 1, 1, 7, 10, 10, 10, 5, 1, 10, 10,…
## $ PrecipitationIn           <dbl> 0.01, 0.28, 0.02, 0.00, 0.00, 0.00, 0.…
## $ WindDirDegrees            <dbl> 268, 357, 230, 286, 298, 306, 324, 79,…

Find Missing Values and Errors

Let’s dig deep to find missing values and errors by:

  • identifying observations with missing values and figure out why before dealing with it
  • looking out for plausible ranges of weather measures from summary
    • percentages (0 - 100)
    • temperatures (Fahrenheit)
    • wind speeds (miles per hour)
    • pressures (inches of mercury)
    • distances (miles)
    • eights (of cloud cover)
# count all the missing values in the dataset
sum(is.na(tidy_weather))
## [1] 6
summary(tidy_weather)
##       date               Events            CloudCover    Max.Dew.PointF 
##  Min.   :2014-12-01   Length:366         Min.   :0.000   Min.   :-6.00  
##  1st Qu.:2015-03-02   Class :character   1st Qu.:3.000   1st Qu.:32.00  
##  Median :2015-06-01   Mode  :character   Median :5.000   Median :47.50  
##  Mean   :2015-06-01                      Mean   :4.708   Mean   :45.48  
##  3rd Qu.:2015-08-31                      3rd Qu.:7.000   3rd Qu.:61.00  
##  Max.   :2015-12-01                      Max.   :8.000   Max.   :75.00  
##                                                                         
##  Max.Gust.SpeedMPH  Max.Humidity     Max.Sea.Level.PressureIn
##  Min.   : 0.00     Min.   :  39.00   Min.   :29.58           
##  1st Qu.:21.00     1st Qu.:  73.25   1st Qu.:30.00           
##  Median :25.50     Median :  86.00   Median :30.14           
##  Mean   :26.99     Mean   :  85.69   Mean   :30.16           
##  3rd Qu.:31.25     3rd Qu.:  93.00   3rd Qu.:30.31           
##  Max.   :94.00     Max.   :1000.00   Max.   :30.88           
##  NA's   :6                                                   
##  Max.TemperatureF Max.VisibilityMiles Max.Wind.SpeedMPH Mean.Humidity  
##  Min.   :18.00    Min.   : 2.000      Min.   : 8.00     Min.   :28.00  
##  1st Qu.:42.00    1st Qu.:10.000      1st Qu.:16.00     1st Qu.:56.00  
##  Median :60.00    Median :10.000      Median :20.00     Median :66.00  
##  Mean   :58.93    Mean   : 9.907      Mean   :20.62     Mean   :66.02  
##  3rd Qu.:76.00    3rd Qu.:10.000      3rd Qu.:24.00     3rd Qu.:76.75  
##  Max.   :96.00    Max.   :10.000      Max.   :38.00     Max.   :98.00  
##                                                                        
##  Mean.Sea.Level.PressureIn Mean.TemperatureF Mean.VisibilityMiles
##  Min.   :29.49             Min.   : 8.00     Min.   :-1.000      
##  1st Qu.:29.87             1st Qu.:36.25     1st Qu.: 8.000      
##  Median :30.03             Median :53.50     Median :10.000      
##  Mean   :30.04             Mean   :51.40     Mean   : 8.861      
##  3rd Qu.:30.19             3rd Qu.:68.00     3rd Qu.:10.000      
##  Max.   :30.77             Max.   :84.00     Max.   :10.000      
##                                                                  
##  Mean.Wind.SpeedMPH MeanDew.PointF   Min.DewpointF     Min.Humidity  
##  Min.   : 4.00      Min.   :-11.00   Min.   :-18.00   Min.   :16.00  
##  1st Qu.: 8.00      1st Qu.: 24.00   1st Qu.: 16.25   1st Qu.:35.00  
##  Median :10.00      Median : 41.00   Median : 35.00   Median :46.00  
##  Mean   :10.68      Mean   : 38.96   Mean   : 32.25   Mean   :48.31  
##  3rd Qu.:13.00      3rd Qu.: 56.00   3rd Qu.: 51.00   3rd Qu.:60.00  
##  Max.   :22.00      Max.   : 71.00   Max.   : 68.00   Max.   :96.00  
##                                                                      
##  Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
##  Min.   :29.16            Min.   :-3.00    Min.   : 0.000     
##  1st Qu.:29.76            1st Qu.:30.00    1st Qu.: 2.000     
##  Median :29.94            Median :46.00    Median :10.000     
##  Mean   :29.93            Mean   :43.33    Mean   : 6.716     
##  3rd Qu.:30.09            3rd Qu.:60.00    3rd Qu.:10.000     
##  Max.   :30.64            Max.   :74.00    Max.   :10.000     
##                                                               
##  PrecipitationIn  WindDirDegrees 
##  Min.   :0.0000   Min.   :  1.0  
##  1st Qu.:0.0000   1st Qu.:113.0  
##  Median :0.0000   Median :222.0  
##  Mean   :0.1016   Mean   :200.1  
##  3rd Qu.:0.0400   3rd Qu.:275.0  
##  Max.   :2.9000   Max.   :360.0  
## 

So, all the missing values in this dataset is associated with the column Max.Gust.SpeedMPH.

# find indexes of NAs in Max.Gust.SpeedMPH
indexes <- which(is.na(tidy_weather$Max.Gust.SpeedMPH))

# observe each row with the mssing values
tidy_weather[indexes,]

In this case, we might have to discuss with the data provider or domain expert to clarify the issue.

After going through summary(tidy_weather), we can see errors from the columns Max.Humidity and Mean.VisibilityMiles. For max humidity, the error value is 1000. So, it is probable that an extra zero must have been entered. In the case of the negative one value in the mean visibility miles, it is also probable that the appropriate value must be 10.

# find rows with values greate than 100 for Max.Humidity
indexes <- which(tidy_weather$Max.Humidity > 100)

tidy_weather[indexes, ] %>% 
  select(date, Events, Max.Humidity, CloudCover:WindDirDegrees)
# fix the error
tidy_weather$Max.Humidity[indexes] <- 100

# find rows with values less than 0 for Mean.VisibilityMiles
indexes <- which(tidy_weather$Mean.VisibilityMiles < 0)

tidy_weather[indexes, ] %>% 
  select(date, Events, Mean.VisibilityMiles, CloudCover:WindDirDegrees)
# fix the error
tidy_weather$Mean.VisibilityMiles[indexes] <- 10

Empty values in Event column

Some of the values in the Event column are empty for any day on which there was no significant weather events such as rain, fog, storm, etc. It is a good practice to be explicit about it and the empty strings should be replaced with “None”.

tidy_weather$Events[tidy_weather$Events == ""] <- "None"

head(tidy_weather)

Naming Conventions

Finally, the variable names should follow a standard convention. Some of the variables contain the period “.” for word separation and it has special meaning in certain situations. So, it is best to follow a recommend approach which is using camel case.

# clean_names from janitor helps naming variables in a convinient way
clean_weather <- clean_names(tidy_weather)

# and, we have our final clean data
glimpse(clean_weather)
## Observations: 366
## Variables: 23
## $ date                       <date> 2014-12-01, 2014-12-10, 2014-12-11, …
## $ events                     <chr> "Rain", "Rain", "Rain-Snow", "Snow", …
## $ cloud_cover                <dbl> 6, 8, 8, 7, 5, 4, 2, 8, 8, 7, 4, 7, 6…
## $ max_dew_point_f            <dbl> 46, 45, 37, 28, 28, 29, 33, 42, 46, 3…
## $ max_gust_speed_mph         <dbl> 29, 29, 28, 21, 23, 20, 21, 10, 26, 3…
## $ max_humidity               <dbl> 74, 100, 92, 85, 75, 82, 89, 96, 100,…
## $ max_sea_level_pressure_in  <dbl> 30.45, 29.58, 29.81, 29.88, 29.86, 29…
## $ max_temperature_f          <dbl> 64, 48, 39, 39, 42, 45, 42, 44, 49, 4…
## $ max_visibility_miles       <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ max_wind_speed_mph         <dbl> 22, 23, 21, 16, 17, 15, 15, 8, 20, 23…
## $ mean_humidity              <dbl> 63, 95, 87, 75, 65, 68, 75, 85, 85, 7…
## $ mean_sea_level_pressure_in <dbl> 30.13, 29.50, 29.61, 29.85, 29.82, 29…
## $ mean_temperature_f         <dbl> 52, 43, 36, 35, 37, 39, 37, 40, 45, 4…
## $ mean_visibility_miles      <dbl> 10, 3, 7, 10, 10, 10, 10, 9, 6, 10, 1…
## $ mean_wind_speed_mph        <dbl> 13, 13, 13, 11, 12, 10, 6, 4, 11, 14,…
## $ mean_dew_point_f           <dbl> 40, 39, 31, 27, 26, 27, 29, 36, 41, 3…
## $ min_dewpoint_f             <dbl> 26, 37, 27, 25, 24, 25, 27, 30, 32, 2…
## $ min_humidity               <dbl> 52, 89, 82, 64, 55, 53, 60, 73, 70, 5…
## $ min_sea_level_pressure_in  <dbl> 30.01, 29.43, 29.44, 29.81, 29.78, 29…
## $ min_temperature_f          <dbl> 39, 38, 32, 31, 32, 33, 32, 35, 41, 3…
## $ min_visibility_miles       <dbl> 10, 1, 1, 7, 10, 10, 10, 5, 1, 10, 10…
## $ precipitation_in           <dbl> 0.01, 0.28, 0.02, 0.00, 0.00, 0.00, 0…
## $ wind_dir_degrees           <dbl> 268, 357, 230, 286, 298, 306, 324, 79…
clean_weather

Conclusion

Overall, the cleaning process includes

  • inspecting the data
  • tidying the data
  • improving date format
  • appropriate data type conversions
  • dealing with missing values, empty strings and errors
  • following standard naming conventions

Hadley Wickham n.d, Tidy data, The Comprehensive R Archive Network, viewed 30 March 2019, https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

R-Ladies Sydney 2018, Clean It Up 4: Wide to Long to Wide to…, R-Ladies Sydney, 30 March 2019, https://rladiessydney.org/post/2018/11/28/clean-it-up-4/