Historical weather data from Boston, USA - 12 months beginning Dec 2014 - The data are dirty - Objective is to clean the data

###Explore the data Before diving into our data cleaning routine, we must first understand the basic structure of the data. This involves looking at things like the class() of the data object to make sure it’s what we expect (generally a data.frame) in addition to checking its dimensions with dim() and the column names with names().

weather = read.csv("weather.csv")
# Verify that weather is a data.frame
class(weather)
[1] "data.frame"
# Check the dimensions
dim(weather)
[1] 286  35
# View the column names
names(weather)
 [1] "X"       "year"    "month"   "measure" "X1"      "X2"     
 [7] "X3"      "X4"      "X5"      "X6"      "X7"      "X8"     
[13] "X9"      "X10"     "X11"     "X12"     "X13"     "X14"    
[19] "X15"     "X16"     "X17"     "X18"     "X19"     "X20"    
[25] "X21"     "X22"     "X23"     "X24"     "X25"     "X26"    
[31] "X27"     "X28"     "X29"     "X30"     "X31"    

Next up is to look at some summaries of the data. This is where functions like str(), glimpse() from dplyr, and summary() come in handy.

# View the structure of the data
str(weather)
'data.frame':   286 obs. of  35 variables:
 $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ year   : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
 $ month  : int  12 12 12 12 12 12 12 12 12 12 ...
 $ measure: Factor w/ 22 levels "CloudCover","Events",..: 7 12 19 3 15 16 5 10 17 6 ...
 $ X1     : Factor w/ 129 levels "","-6","0","0.01",..: 101 89 76 83 78 31 111 100 89 66 ...
 $ X2     : Factor w/ 126 levels "","-5","0","0.01",..: 82 77 72 80 34 18 122 109 91 69 ...
 $ X3     : Factor w/ 124 levels "","-3","0","0.01",..: 85 78 70 82 76 27 11 109 90 59 ...
 $ X4     : Factor w/ 131 levels "","0","0.02",..: 82 75 49 27 23 14 108 94 77 68 ...
 $ X5     : Factor w/ 119 levels "","-4","-6","-7",..: 74 66 31 69 29 14 109 96 77 61 ...
 $ X6     : Factor w/ 119 levels "","-1","-5","-8",..: 75 73 69 75 72 67 11 116 111 59 ...
 $ X7     : Factor w/ 120 levels "","-13","-3",..: 70 44 22 68 20 3 116 92 33 61 ...
 $ X8     : Factor w/ 129 levels "","-1","-18",..: 35 27 19 33 17 43 123 106 82 65 ...
 $ X9     : Factor w/ 127 levels "","0","0.01",..: 81 73 34 81 76 33 10 121 116 63 ...
 $ X10    : Factor w/ 125 levels "","-2","0","0.05",..: 87 82 77 84 78 76 13 123 121 39 ...
 $ X11    : Factor w/ 126 levels "","-1","0","0.01",..: 74 72 68 73 67 32 120 115 111 45 ...
 $ X12    : Factor w/ 126 levels "","0","0.01",..: 77 74 68 33 32 29 117 110 100 50 ...
 $ X13    : Factor w/ 128 levels "","-10","-4",..: 85 80 74 32 29 26 113 105 95 47 ...
 $ X14    : Factor w/ 125 levels "","-4","-9","0",..: 85 81 75 36 32 29 115 104 92 51 ...
 $ X15    : Factor w/ 129 levels "","-13","-2",..: 83 78 69 74 37 34 120 112 101 60 ...
 $ X16    : Factor w/ 123 levels "","-11","-16",..: 83 80 74 82 75 53 120 114 108 61 ...
 $ X17    : Factor w/ 133 levels "","-3","-7","-9",..: 90 86 83 87 83 74 12 125 113 47 ...
 $ X18    : Factor w/ 122 levels "","-1","-6","0",..: 76 73 70 68 46 29 114 104 88 40 ...
 $ X19    : Factor w/ 126 levels "","-2","-4","-5",..: 73 71 36 29 26 21 101 95 90 58 ...
 $ X20    : Factor w/ 128 levels "","-10","-4",..: 79 74 32 54 28 23 121 117 110 70 ...
 $ X21    : Factor w/ 126 levels "","-5","0","0.07",..: 73 69 49 49 31 28 115 108 100 65 ...
 $ X22    : Factor w/ 128 levels "","-5","0","0.05",..: 83 78 70 78 72 27 120 115 107 66 ...
 $ X23    : Factor w/ 122 levels "","-14","-6",..: 87 86 84 86 84 79 11 119 115 66 ...
 $ X24    : Factor w/ 125 levels "","-15","-4",..: 81 80 78 81 80 78 10 120 119 51 ...
 $ X25    : Factor w/ 130 levels "","0","0.03",..: 99 93 85 98 84 34 9 115 89 49 ...
 $ X26    : Factor w/ 118 levels "","-2","0","0.02",..: 81 74 67 60 30 29 99 90 79 52 ...
 $ X27    : Factor w/ 132 levels "","-2","-5","0",..: 94 86 78 75 71 37 112 102 92 59 ...
 $ X28    : Factor w/ 127 levels "","-1","0","0.01",..: 91 86 81 83 75 33 112 104 92 50 ...
 $ X29    : Factor w/ 113 levels "","-2","0","0.04",..: 70 64 45 27 18 6 86 76 65 53 ...
 $ X30    : Factor w/ 124 levels "","-6","0","0.02",..: 50 29 24 7 73 2 81 71 29 60 ...
 $ X31    : Factor w/ 95 levels "","-8","0","0.05",..: 42 23 17 86 65 6 71 61 53 51 ...
# Load dplyr package
library(dplyr)

# Look at the structure using dplyr's glimpse()
glimpse(weather)
Observations: 286
Variables: 35
$ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ year    <int> 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 20...
$ month   <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
$ measure <fct> Max.TemperatureF, Mean.TemperatureF, Min.Temperatu...
$ X1      <fct> 64, 52, 39, 46, 40, 26, 74, 63, 52, 30.45, 30.13, ...
$ X2      <fct> 42, 38, 33, 40, 27, 17, 92, 72, 51, 30.71, 30.59, ...
$ X3      <fct> 51, 44, 37, 49, 42, 24, 100, 79, 57, 30.4, 30.07, ...
$ X4      <fct> 43, 37, 30, 24, 21, 13, 69, 54, 39, 30.56, 30.33, ...
$ X5      <fct> 42, 34, 26, 37, 25, 12, 85, 66, 47, 30.68, 30.59, ...
$ X6      <fct> 45, 42, 38, 45, 40, 36, 100, 93, 85, 30.42, 30.24,...
$ X7      <fct> 38, 30, 21, 36, 20, -3, 92, 61, 29, 30.69, 30.46, ...
$ X8      <fct> 29, 24, 18, 28, 16, 3, 92, 70, 47, 30.77, 30.67, 3...
$ X9      <fct> 49, 39, 29, 49, 41, 28, 100, 93, 86, 30.51, 30.04,...
$ X10     <fct> 48, 43, 38, 45, 39, 37, 100, 95, 89, 29.58, 29.5, ...
$ X11     <fct> 39, 36, 32, 37, 31, 27, 92, 87, 82, 29.81, 29.61, ...
$ X12     <fct> 39, 35, 31, 28, 27, 25, 85, 75, 64, 29.88, 29.85, ...
$ X13     <fct> 42, 37, 32, 28, 26, 24, 75, 65, 55, 29.86, 29.82, ...
$ X14     <fct> 45, 39, 33, 29, 27, 25, 82, 68, 53, 29.91, 29.83, ...
$ X15     <fct> 42, 37, 32, 33, 29, 27, 89, 75, 60, 30.15, 30.05, ...
$ X16     <fct> 44, 40, 35, 42, 36, 30, 96, 85, 73, 30.17, 30.09, ...
$ X17     <fct> 49, 45, 41, 46, 41, 32, 100, 85, 70, 29.91, 29.75,...
$ X18     <fct> 44, 40, 36, 34, 30, 26, 89, 73, 57, 29.87, 29.78, ...
$ X19     <fct> 37, 33, 29, 25, 22, 20, 69, 63, 56, 30.15, 29.98, ...
$ X20     <fct> 36, 32, 27, 30, 24, 20, 89, 79, 69, 30.31, 30.26, ...
$ X21     <fct> 36, 33, 30, 30, 27, 25, 85, 77, 69, 30.37, 30.32, ...
$ X22     <fct> 44, 39, 33, 39, 34, 25, 89, 79, 69, 30.4, 30.35, 3...
$ X23     <fct> 47, 45, 42, 45, 42, 37, 100, 91, 82, 30.31, 30.23,...
$ X24     <fct> 46, 44, 41, 46, 44, 41, 100, 98, 96, 30.13, 29.9, ...
$ X25     <fct> 59, 52, 44, 58, 43, 29, 100, 75, 49, 29.96, 29.63,...
$ X26     <fct> 50, 44, 37, 31, 29, 28, 70, 60, 49, 30.16, 30.11, ...
$ X27     <fct> 52, 45, 38, 34, 31, 29, 70, 60, 50, 30.22, 30.14, ...
$ X28     <fct> 52, 46, 40, 42, 35, 27, 76, 65, 53, 29.99, 29.87, ...
$ X29     <fct> 41, 36, 30, 26, 20, 10, 64, 51, 37, 30.22, 30.12, ...
$ X30     <fct> 30, 26, 22, 10, 4, -6, 50, 38, 26, 30.36, 30.32, 3...
$ X31     <fct> 30, 25, 20, 8, 5, 1, 57, 44, 31, 30.32, 30.25, 30....
# View a summary of the data
summary(weather)
       X               year          month       
 Min.   :  1.00   Min.   :2014   Min.   : 1.000  
 1st Qu.: 72.25   1st Qu.:2015   1st Qu.: 4.000  
 Median :143.50   Median :2015   Median : 7.000  
 Mean   :143.50   Mean   :2015   Mean   : 6.923  
 3rd Qu.:214.75   3rd Qu.:2015   3rd Qu.:10.000  
 Max.   :286.00   Max.   :2015   Max.   :12.000  
                                                 
                     measure          X1            X2     
 CloudCover              : 13   10     : 33   10     : 27  
 Events                  : 13   6      :  8   0      :  9  
 Max.Dew.PointF          : 13   22     :  7          :  7  
 Max.Gust.SpeedMPH       : 13   0      :  6   7      :  7  
 Max.Humidity            : 13   30     :  6   37     :  5  
 Max.Sea.Level.PressureIn: 13   (Other):225   (Other):209  
 (Other)                 :208   NA's   :  1   NA's   : 22  
       X3            X4            X5            X6            X7     
 10     : 23   10     : 28   10     : 29   10     : 32   10     : 30  
 1      :  7   0      : 10          :  8          :  9          :  7  
 0      :  6          :  7   0      :  8   0      :  9   0      :  7  
 7      :  6   49     :  5   22     :  6   9      :  5   21     :  6  
 9      :  6   6      :  5   5      :  5   15     :  4   29     :  5  
 (Other):214   (Other):209   (Other):208   (Other):205   (Other):209  
 NA's   : 24   NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 22  
       X8            X9           X10           X11           X12     
 10     : 24   10     : 22   10     : 26   10     : 24   10     : 26  
        :  6   8      :  8   7      :  8   7      :  9   0      : 10  
 0      :  6   0      :  6   8      :  8   0      :  7   7      :  7  
 25     :  6   93     :  6   0      :  6          :  6          :  6  
 9      :  6          :  5   21     :  6   68     :  6   25     :  6  
 (Other):215   (Other):217   (Other):210   (Other):212   (Other):208  
 NA's   : 23   NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 23  
      X13           X14           X15           X16           X17     
 10     : 31   10     : 34   10     : 28   10     : 32   10     : 29  
        :  9   0      :  8   0      : 10   0      :  9          :  8  
 0      :  9          :  7          :  6          :  8   0      :  7  
 59     :  5   39     :  5   6      :  5   3      :  6   8      :  6  
 6      :  5   51     :  5   67     :  5   8      :  5   1      :  5  
 (Other):205   (Other):205   (Other):210   (Other):204   (Other):209  
 NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 22  
      X18           X19           X20           X21           X22     
 10     : 30   10     : 32   10     : 26   10     : 24   10     : 24  
 0      :  9   0      :  9          :  8   0      :  7   0      :  7  
 8      :  8          :  8   0      :  6   5      :  7   7      :  7  
        :  6   20     :  7   7      :  6   6      :  6   39     :  6  
 17     :  5   30     :  6   26     :  5          :  5   8      :  5  
 (Other):205   (Other):202   (Other):213   (Other):215   (Other):215  
 NA's   : 23   NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 22  
      X23           X24           X25           X26           X27     
 10     : 33   10     : 27   10     : 29   10     : 29   10     : 28  
 0      : 12   0      :  8   0      :  8   0      :  8   0      :  9  
        :  8          :  7          :  6   8      :  8          :  7  
 18     :  5   16     :  7   4      :  5          :  6   52     :  5  
 42     :  5   20     :  7   55     :  5   16     :  6   7      :  5  
 (Other):201   (Other):208   (Other):211   (Other):207   (Other):210  
 NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 22   NA's   : 22  
      X28           X29           X30           X31     
 10     : 25   10     : 28   10     : 28   10     : 17  
 0      :  7   0      :  9   0      :  8   0      :  7  
 8      :  7          :  8          :  7          :  5  
 Rain   :  6   7      :  8   26     :  7   31     :  5  
 4      :  5   20     :  5   6      :  5   1      :  3  
 (Other):214   (Other):184   (Other):187   (Other):117  
 NA's   : 22   NA's   : 44   NA's   : 44   NA's   :132  

After understanding the structure of the data and looking at some brief summaries, it often helps to preview the actual data. The functions head() and tail() allow you to view the top and bottom rows of the data, Recall you’ll be shown 6 rows by default, but you can alter this behavior with a second argument to the function.

# View first 6 rows
head(weather)

# View first 15 rows
head(weather, 15)

# View the last 6 rows
tail(weather)

# View the last 10 rows
tail(weather, 10)

Cleaning the data

The weather dataset suffers from one of the five most common symptoms of messy data: column names are values. In particular, the column names X1-X31 represent days of the month, which should really be values of a new variable called day.

The tidyr package provides the gather() function for exactly this scenario.

# Load the tidyr package
library(tidyr)

# Gather the columns
weather2 <- gather(weather, day, value, X1:X31, na.rm = TRUE)
attributes are not identical across measure variables;
they will be dropped
# View the head
head(weather2)

Our data suffer from a second common symptom of messy data: values are variable names. Specifically, values in the measure column should be variables (i.e. column names) in our dataset.

The spread() function from tidyr is designed to help with this.

# First remove column of row names
without_x <- weather2[, -1]

# Spread the data
weather3 <- spread(without_x, measure, value)

# View the head
head(weather3)

###Prepare the data for analysis Now that the weather dataset adheres to tidy data principles, the next step is to prepare it for analysis. We’ll start by combining the year, month, and day columns and recoding the resulting character column as a date. We can use a combination of base R, stringr, and lubridate to accomplish this task.

# Load the stringr and lubridate packages
library(stringr)
library(lubridate)


# Remove X's from day column
weather3$day <- str_replace(weather3$day, "X", "")

# Unite the year, month, and day columns
weather4 <- unite(weather3, date, year, month, day, sep = "-")

# Convert date column to proper date format using lubridates's ymd()
weather4$date <- ymd(weather4$date)

# Rearrange columns using dplyr's select()
weather5 <- select(weather4, date, Events, CloudCover:WindDirDegrees)

# View the head of weather5
head(weather5)

It’s important for analysis that variables are coded appropriately. This is not yet the case with our weather data. Recall that functions such as as.numeric() and as.character() can be used to coerce variables into different types.

It’s important to keep in mind that coercions are not always successful, particularly if there’s some data in a column that you don’t expect.

# View the structure of weather5
str(weather5)
'data.frame':   366 obs. of  23 variables:
 $ date                     : Date, format: "2014-12-01" ...
 $ Events                   : chr  "Rain" "Rain" "Rain-Snow" "Snow" ...
 $ CloudCover               : chr  "6" "8" "8" "7" ...
 $ Max.Dew.PointF           : chr  "46" "45" "37" "28" ...
 $ Max.Gust.SpeedMPH        : chr  "29" "29" "28" "21" ...
 $ Max.Humidity             : chr  "74" "100" "92" "85" ...
 $ Max.Sea.Level.PressureIn : chr  "30.45" "29.58" "29.81" "29.88" ...
 $ Max.TemperatureF         : chr  "64" "48" "39" "39" ...
 $ Max.VisibilityMiles      : chr  "10" "10" "10" "10" ...
 $ Max.Wind.SpeedMPH        : chr  "22" "23" "21" "16" ...
 $ Mean.Humidity            : chr  "63" "95" "87" "75" ...
 $ Mean.Sea.Level.PressureIn: chr  "30.13" "29.5" "29.61" "29.85" ...
 $ Mean.TemperatureF        : chr  "52" "43" "36" "35" ...
 $ Mean.VisibilityMiles     : chr  "10" "3" "7" "10" ...
 $ Mean.Wind.SpeedMPH       : chr  "13" "13" "13" "11" ...
 $ MeanDew.PointF           : chr  "40" "39" "31" "27" ...
 $ Min.DewpointF            : chr  "26" "37" "27" "25" ...
 $ Min.Humidity             : chr  "52" "89" "82" "64" ...
 $ Min.Sea.Level.PressureIn : chr  "30.01" "29.43" "29.44" "29.81" ...
 $ Min.TemperatureF         : chr  "39" "38" "32" "31" ...
 $ Min.VisibilityMiles      : chr  "10" "1" "1" "7" ...
 $ PrecipitationIn          : chr  "0.01" "0.28" "0.02" "T" ...
 $ WindDirDegrees           : chr  "268" "357" "230" "286" ...
# Examine the first 20 rows of weather5. Are most of the characters numeric?
head(weather5, 20)

# See what happens if we try to convert PrecipitationIn to numeric
as.numeric(weather5$PrecipitationIn)
NAs introducidos por coerci昼㸳n
  [1] 0.01 0.28 0.02   NA   NA 0.00 0.00   NA 0.43 0.01 0.00 0.10   NA
 [14]   NA 0.05 0.25 0.56 0.14 0.00 0.00 0.01 0.00 0.44 0.00 0.00 0.00
 [27] 0.11 1.09 0.13 0.03 2.90 0.00 0.00 0.00 0.20 0.00   NA 0.12 0.00
 [40] 0.00 0.15 0.00 0.00 0.00 0.00   NA 0.00 0.71 0.00 0.10 0.95 0.01
 [53]   NA 0.62 0.06 0.05 0.57 0.00 0.02   NA 0.00 0.01 0.00 0.05 0.01
 [66] 0.03 0.00 0.23 0.39 0.00 0.02 0.01 0.06 0.78 0.00 0.17 0.11 0.00
 [79]   NA 0.07 0.02 0.00 0.00 0.00 0.00 0.09   NA 0.07 0.37 0.88 0.17
 [92] 0.06 0.01 0.00 0.00 0.80 0.27 0.00 0.14 0.00 0.00 0.01 0.05 0.09
[105] 0.00 0.00 0.00 0.04 0.80 0.21 0.12 0.00 0.26   NA 0.00 0.02   NA
[118] 0.00 0.00   NA 0.00 0.00 0.09 0.00 0.00 0.00 0.01 0.00 0.00 0.06
[131] 0.00 0.00 0.00 0.61 0.54   NA 0.00   NA 0.00 0.00 0.10 0.07 0.00
[144] 0.03 0.00 0.39 0.00 0.00 0.03 0.26 0.09 0.00 0.00 0.00 0.02 0.00
[157] 0.00 0.00   NA 0.00 0.00 0.27 0.00 0.00 0.00   NA 0.00 0.00   NA
[170] 0.00 0.00   NA 0.00 0.00 0.00 0.91 0.00 0.02 0.00 0.00 0.00 0.00
[183] 0.38 0.00 0.00 0.00   NA 0.00 0.40   NA 0.00 0.00 0.00 0.74 0.04
[196] 1.72 0.00 0.01 0.00 0.00   NA 0.20 1.43   NA 0.00 0.00 0.00   NA
[209] 0.09 0.00   NA   NA 0.50 1.12 0.00 0.00 0.00 0.03   NA 0.00   NA
[222] 0.14   NA 0.00   NA   NA 0.00 0.00 0.01 0.00   NA 0.06 0.00 0.00
[235] 0.00 0.02 0.00   NA 0.00 0.00 0.02   NA 0.15   NA 0.00 0.83 0.00
[248] 0.00 0.00 0.08 0.00 0.00 0.14 0.00 0.00 0.00 0.63   NA 0.02   NA
[261] 0.00   NA 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.00 0.00
[274] 0.00 0.00 0.17 0.66 0.01 0.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[287]   NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.01 2.46   NA
[300] 0.00 0.00 0.00 0.20 0.00   NA 0.00 0.00 0.00 0.12 0.00 0.00   NA
[313]   NA   NA 0.00 0.08   NA 0.07   NA 0.00 0.00 0.03 0.00 0.00 0.36
[326] 0.73 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.34   NA 0.07 0.54
[339] 0.04 0.01 0.00 0.00 0.00 0.00 0.00   NA 0.00 0.86 0.00 0.30 0.04
[352] 0.00 0.00 0.00 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[365] 0.00 0.14

What values in PrecipitationIn would become NA if coerced to numbers? “T” was used to denote a trace amount (i.e. too small to be accurately measured) of precipitation in the PrecipitationIn column. In order to coerce this column to numeric, you’ll need to deal with this somehow. To keep things simple, we will just replace “T” with zero, as a string (“0”).

# Replace "T" with "0" (T = trace)
weather5$PrecipitationIn <- str_replace(weather5$PrecipitationIn, "T", "0")

# Convert characters to numerics
weather6 <- mutate_at(weather5, vars(CloudCover:WindDirDegrees), funs(as.numeric))
funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once per session.
# Look at result
str(weather6)
'data.frame':   366 obs. of  23 variables:
 $ date                     : Date, format: "2014-12-01" ...
 $ Events                   : chr  "Rain" "Rain" "Rain-Snow" "Snow" ...
 $ CloudCover               : num  6 8 8 7 5 4 2 8 8 7 ...
 $ Max.Dew.PointF           : num  46 45 37 28 28 29 33 42 46 34 ...
 $ Max.Gust.SpeedMPH        : num  29 29 28 21 23 20 21 10 26 30 ...
 $ Max.Humidity             : num  74 100 92 85 75 82 89 96 100 89 ...
 $ Max.Sea.Level.PressureIn : num  30.4 29.6 29.8 29.9 29.9 ...
 $ Max.TemperatureF         : num  64 48 39 39 42 45 42 44 49 44 ...
 $ Max.VisibilityMiles      : num  10 10 10 10 10 10 10 10 10 10 ...
 $ Max.Wind.SpeedMPH        : num  22 23 21 16 17 15 15 8 20 23 ...
 $ Mean.Humidity            : num  63 95 87 75 65 68 75 85 85 73 ...
 $ Mean.Sea.Level.PressureIn: num  30.1 29.5 29.6 29.9 29.8 ...
 $ Mean.TemperatureF        : num  52 43 36 35 37 39 37 40 45 40 ...
 $ Mean.VisibilityMiles     : num  10 3 7 10 10 10 10 9 6 10 ...
 $ Mean.Wind.SpeedMPH       : num  13 13 13 11 12 10 6 4 11 14 ...
 $ MeanDew.PointF           : num  40 39 31 27 26 27 29 36 41 30 ...
 $ Min.DewpointF            : num  26 37 27 25 24 25 27 30 32 26 ...
 $ Min.Humidity             : num  52 89 82 64 55 53 60 73 70 57 ...
 $ Min.Sea.Level.PressureIn : num  30 29.4 29.4 29.8 29.8 ...
 $ Min.TemperatureF         : num  39 38 32 31 32 33 32 35 41 36 ...
 $ Min.VisibilityMiles      : num  10 1 1 7 10 10 10 5 1 10 ...
 $ PrecipitationIn          : num  0.01 0.28 0.02 0 0 0 0 0 0.43 0.01 ...
 $ WindDirDegrees           : num  268 357 230 286 298 306 324 79 311 281 ...

Missing, extreme, and unexpected values

Missing values

Before dealing with missing values in the data, it’s important to find them and figure out why they exist in the first place. If your dataset is too big to look at all at once, like it is here, remember you can use sum() and is.na() to quickly size up the situation by counting the number of NA values.

The summary() function may also come in handy for identifying which variables contain the missing values. Finally, the which() function is useful for locating the missing values within a particular column.

# Count missing values
sum(is.na(weather6))
[1] 6
# Find missing values
summary(weather6)
      date               Events            CloudCover   
 Min.   :2014-12-01   Length:366         Min.   :0.000  
 1st Qu.:2015-03-02   Class :character   1st Qu.:3.000  
 Median :2015-06-01   Mode  :character   Median :5.000  
 Mean   :2015-06-01                      Mean   :4.708  
 3rd Qu.:2015-08-31                      3rd Qu.:7.000  
 Max.   :2015-12-01                      Max.   :8.000  
                                                        
 Max.Dew.PointF  Max.Gust.SpeedMPH  Max.Humidity    
 Min.   :-6.00   Min.   : 0.00     Min.   :  39.00  
 1st Qu.:32.00   1st Qu.:21.00     1st Qu.:  73.25  
 Median :47.50   Median :25.50     Median :  86.00  
 Mean   :45.48   Mean   :26.99     Mean   :  85.69  
 3rd Qu.:61.00   3rd Qu.:31.25     3rd Qu.:  93.00  
 Max.   :75.00   Max.   :94.00     Max.   :1000.00  
                 NA's   :6                          
 Max.Sea.Level.PressureIn Max.TemperatureF Max.VisibilityMiles
 Min.   :29.58            Min.   :18.00    Min.   : 2.000     
 1st Qu.:30.00            1st Qu.:42.00    1st Qu.:10.000     
 Median :30.14            Median :60.00    Median :10.000     
 Mean   :30.16            Mean   :58.93    Mean   : 9.907     
 3rd Qu.:30.31            3rd Qu.:76.00    3rd Qu.:10.000     
 Max.   :30.88            Max.   :96.00    Max.   :10.000     
                                                              
 Max.Wind.SpeedMPH Mean.Humidity   Mean.Sea.Level.PressureIn
 Min.   : 8.00     Min.   :28.00   Min.   :29.49            
 1st Qu.:16.00     1st Qu.:56.00   1st Qu.:29.87            
 Median :20.00     Median :66.00   Median :30.03            
 Mean   :20.62     Mean   :66.02   Mean   :30.04            
 3rd Qu.:24.00     3rd Qu.:76.75   3rd Qu.:30.19            
 Max.   :38.00     Max.   :98.00   Max.   :30.77            
                                                            
 Mean.TemperatureF Mean.VisibilityMiles Mean.Wind.SpeedMPH
 Min.   : 8.00     Min.   :-1.000       Min.   : 4.00     
 1st Qu.:36.25     1st Qu.: 8.000       1st Qu.: 8.00     
 Median :53.50     Median :10.000       Median :10.00     
 Mean   :51.40     Mean   : 8.861       Mean   :10.68     
 3rd Qu.:68.00     3rd Qu.:10.000       3rd Qu.:13.00     
 Max.   :84.00     Max.   :10.000       Max.   :22.00     
                                                          
 MeanDew.PointF   Min.DewpointF     Min.Humidity  
 Min.   :-11.00   Min.   :-18.00   Min.   :16.00  
 1st Qu.: 24.00   1st Qu.: 16.25   1st Qu.:35.00  
 Median : 41.00   Median : 35.00   Median :46.00  
 Mean   : 38.96   Mean   : 32.25   Mean   :48.31  
 3rd Qu.: 56.00   3rd Qu.: 51.00   3rd Qu.:60.00  
 Max.   : 71.00   Max.   : 68.00   Max.   :96.00  
                                                  
 Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
 Min.   :29.16            Min.   :-3.00    Min.   : 0.000     
 1st Qu.:29.76            1st Qu.:30.00    1st Qu.: 2.000     
 Median :29.94            Median :46.00    Median :10.000     
 Mean   :29.93            Mean   :43.33    Mean   : 6.716     
 3rd Qu.:30.09            3rd Qu.:60.00    3rd Qu.:10.000     
 Max.   :30.64            Max.   :74.00    Max.   :10.000     
                                                              
 PrecipitationIn  WindDirDegrees 
 Min.   :0.0000   Min.   :  1.0  
 1st Qu.:0.0000   1st Qu.:113.0  
 Median :0.0000   Median :222.0  
 Mean   :0.1016   Mean   :200.1  
 3rd Qu.:0.0400   3rd Qu.:275.0  
 Max.   :2.9000   Max.   :360.0  
                                 
# Find indices of NAs in Max.Gust.SpeedMPH
ind <- which(is.na(weather6$Max.Gust.SpeedMPH))

# Look at the full rows for records missing Max.Gust.SpeedMPH
weather6[ind, ]

In this situation it’s unclear why these values are missing and there doesn’t appear to be any obvious pattern to their missingness, so we’ll leave them alone for now.

extreme values

Besides missing values, we want to know if there are values in the data that are too extreme or bizarre to be plausible. A great way to start the search for these values is with summary().

Once implausible values are identified, they must be dealt with in an intelligent and informed way. Sometimes the best way forward is obvious and other times it may require some research and/or discussions with the original collectors of the data.

# Review distributions for all variables
summary(weather6)
      date               Events            CloudCover   
 Min.   :2014-12-01   Length:366         Min.   :0.000  
 1st Qu.:2015-03-02   Class :character   1st Qu.:3.000  
 Median :2015-06-01   Mode  :character   Median :5.000  
 Mean   :2015-06-01                      Mean   :4.708  
 3rd Qu.:2015-08-31                      3rd Qu.:7.000  
 Max.   :2015-12-01                      Max.   :8.000  
                                                        
 Max.Dew.PointF  Max.Gust.SpeedMPH  Max.Humidity    
 Min.   :-6.00   Min.   : 0.00     Min.   :  39.00  
 1st Qu.:32.00   1st Qu.:21.00     1st Qu.:  73.25  
 Median :47.50   Median :25.50     Median :  86.00  
 Mean   :45.48   Mean   :26.99     Mean   :  85.69  
 3rd Qu.:61.00   3rd Qu.:31.25     3rd Qu.:  93.00  
 Max.   :75.00   Max.   :94.00     Max.   :1000.00  
                 NA's   :6                          
 Max.Sea.Level.PressureIn Max.TemperatureF Max.VisibilityMiles
 Min.   :29.58            Min.   :18.00    Min.   : 2.000     
 1st Qu.:30.00            1st Qu.:42.00    1st Qu.:10.000     
 Median :30.14            Median :60.00    Median :10.000     
 Mean   :30.16            Mean   :58.93    Mean   : 9.907     
 3rd Qu.:30.31            3rd Qu.:76.00    3rd Qu.:10.000     
 Max.   :30.88            Max.   :96.00    Max.   :10.000     
                                                              
 Max.Wind.SpeedMPH Mean.Humidity   Mean.Sea.Level.PressureIn
 Min.   : 8.00     Min.   :28.00   Min.   :29.49            
 1st Qu.:16.00     1st Qu.:56.00   1st Qu.:29.87            
 Median :20.00     Median :66.00   Median :30.03            
 Mean   :20.62     Mean   :66.02   Mean   :30.04            
 3rd Qu.:24.00     3rd Qu.:76.75   3rd Qu.:30.19            
 Max.   :38.00     Max.   :98.00   Max.   :30.77            
                                                            
 Mean.TemperatureF Mean.VisibilityMiles Mean.Wind.SpeedMPH
 Min.   : 8.00     Min.   :-1.000       Min.   : 4.00     
 1st Qu.:36.25     1st Qu.: 8.000       1st Qu.: 8.00     
 Median :53.50     Median :10.000       Median :10.00     
 Mean   :51.40     Mean   : 8.861       Mean   :10.68     
 3rd Qu.:68.00     3rd Qu.:10.000       3rd Qu.:13.00     
 Max.   :84.00     Max.   :10.000       Max.   :22.00     
                                                          
 MeanDew.PointF   Min.DewpointF     Min.Humidity  
 Min.   :-11.00   Min.   :-18.00   Min.   :16.00  
 1st Qu.: 24.00   1st Qu.: 16.25   1st Qu.:35.00  
 Median : 41.00   Median : 35.00   Median :46.00  
 Mean   : 38.96   Mean   : 32.25   Mean   :48.31  
 3rd Qu.: 56.00   3rd Qu.: 51.00   3rd Qu.:60.00  
 Max.   : 71.00   Max.   : 68.00   Max.   :96.00  
                                                  
 Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
 Min.   :29.16            Min.   :-3.00    Min.   : 0.000     
 1st Qu.:29.76            1st Qu.:30.00    1st Qu.: 2.000     
 Median :29.94            Median :46.00    Median :10.000     
 Mean   :29.93            Mean   :43.33    Mean   : 6.716     
 3rd Qu.:30.09            3rd Qu.:60.00    3rd Qu.:10.000     
 Max.   :30.64            Max.   :74.00    Max.   :10.000     
                                                              
 PrecipitationIn  WindDirDegrees 
 Min.   :0.0000   Min.   :  1.0  
 1st Qu.:0.0000   1st Qu.:113.0  
 Median :0.0000   Median :222.0  
 Mean   :0.1016   Mean   :200.1  
 3rd Qu.:0.0400   3rd Qu.:275.0  
 Max.   :2.9000   Max.   :360.0  
                                 
# Find row with Max.Humidity of 1000
ind <- which(weather6$Max.Humidity == 1000)

# Look at the data for that day
weather6[ind, ]

# Change 1000 to 100
weather6$Max.Humidity[ind] <- 100

You’ve discovered and repaired one obvious error in the data, but it appears that there’s another. Sometimes you get lucky and can infer the correct or intended value from the other data. For example, if you know the minimum and maximum values of a particular metric on a given day…

# Look at summary of Mean.VisibilityMiles
summary(weather6$Mean.VisibilityMiles)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.000   8.000  10.000   8.861  10.000  10.000 
# Get index of row with -1 value
ind <- which(weather6$Mean.VisibilityMiles == -1)

# Look at full row
weather6[ind, ]

# Set Mean.VisibilityMiles to the appropriate value
weather6$Mean.VisibilityMiles[ind] <- 10

In addition to dealing with obvious errors in the data, we want to see if there are other extreme values. In addition to the trusty summary() function, hist() is useful for quickly getting a feel for how different variables are distributed.

# Review summary of full data once more
summary(weather6)
      date               Events            CloudCover   
 Min.   :2014-12-01   Length:366         Min.   :0.000  
 1st Qu.:2015-03-02   Class :character   1st Qu.:3.000  
 Median :2015-06-01   Mode  :character   Median :5.000  
 Mean   :2015-06-01                      Mean   :4.708  
 3rd Qu.:2015-08-31                      3rd Qu.:7.000  
 Max.   :2015-12-01                      Max.   :8.000  
                                                        
 Max.Dew.PointF  Max.Gust.SpeedMPH  Max.Humidity   
 Min.   :-6.00   Min.   : 0.00     Min.   : 39.00  
 1st Qu.:32.00   1st Qu.:21.00     1st Qu.: 73.25  
 Median :47.50   Median :25.50     Median : 86.00  
 Mean   :45.48   Mean   :26.99     Mean   : 83.23  
 3rd Qu.:61.00   3rd Qu.:31.25     3rd Qu.: 93.00  
 Max.   :75.00   Max.   :94.00     Max.   :100.00  
                 NA's   :6                         
 Max.Sea.Level.PressureIn Max.TemperatureF Max.VisibilityMiles
 Min.   :29.58            Min.   :18.00    Min.   : 2.000     
 1st Qu.:30.00            1st Qu.:42.00    1st Qu.:10.000     
 Median :30.14            Median :60.00    Median :10.000     
 Mean   :30.16            Mean   :58.93    Mean   : 9.907     
 3rd Qu.:30.31            3rd Qu.:76.00    3rd Qu.:10.000     
 Max.   :30.88            Max.   :96.00    Max.   :10.000     
                                                              
 Max.Wind.SpeedMPH Mean.Humidity   Mean.Sea.Level.PressureIn
 Min.   : 8.00     Min.   :28.00   Min.   :29.49            
 1st Qu.:16.00     1st Qu.:56.00   1st Qu.:29.87            
 Median :20.00     Median :66.00   Median :30.03            
 Mean   :20.62     Mean   :66.02   Mean   :30.04            
 3rd Qu.:24.00     3rd Qu.:76.75   3rd Qu.:30.19            
 Max.   :38.00     Max.   :98.00   Max.   :30.77            
                                                            
 Mean.TemperatureF Mean.VisibilityMiles Mean.Wind.SpeedMPH
 Min.   : 8.00     Min.   : 1.000       Min.   : 4.00     
 1st Qu.:36.25     1st Qu.: 8.000       1st Qu.: 8.00     
 Median :53.50     Median :10.000       Median :10.00     
 Mean   :51.40     Mean   : 8.891       Mean   :10.68     
 3rd Qu.:68.00     3rd Qu.:10.000       3rd Qu.:13.00     
 Max.   :84.00     Max.   :10.000       Max.   :22.00     
                                                          
 MeanDew.PointF   Min.DewpointF     Min.Humidity  
 Min.   :-11.00   Min.   :-18.00   Min.   :16.00  
 1st Qu.: 24.00   1st Qu.: 16.25   1st Qu.:35.00  
 Median : 41.00   Median : 35.00   Median :46.00  
 Mean   : 38.96   Mean   : 32.25   Mean   :48.31  
 3rd Qu.: 56.00   3rd Qu.: 51.00   3rd Qu.:60.00  
 Max.   : 71.00   Max.   : 68.00   Max.   :96.00  
                                                  
 Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
 Min.   :29.16            Min.   :-3.00    Min.   : 0.000     
 1st Qu.:29.76            1st Qu.:30.00    1st Qu.: 2.000     
 Median :29.94            Median :46.00    Median :10.000     
 Mean   :29.93            Mean   :43.33    Mean   : 6.716     
 3rd Qu.:30.09            3rd Qu.:60.00    3rd Qu.:10.000     
 Max.   :30.64            Max.   :74.00    Max.   :10.000     
                                                              
 PrecipitationIn  WindDirDegrees 
 Min.   :0.0000   Min.   :  1.0  
 1st Qu.:0.0000   1st Qu.:113.0  
 Median :0.0000   Median :222.0  
 Mean   :0.1016   Mean   :200.1  
 3rd Qu.:0.0400   3rd Qu.:275.0  
 Max.   :2.9000   Max.   :360.0  
                                 
# Look at histogram for MeanDew.PointF
hist(weather6$MeanDew.PointF)


# Look at histogram for Min.TemperatureF
hist(weather6$Min.TemperatureF)


# Compare to histogram for Mean.TemperatureF
hist(weather6$Mean.TemperatureF)

Before officially calling our weather data clean, we want to put a couple of finishing touches on the data. These are a bit more subjective and may not be necessary for analysis, but they will make the data easier for others to interpret, which is generally a good thing.

There are a number of stylistic conventions in the R language. Depending on who you ask, these conventions may vary. Because the period (.) has special meaning in certain situations, we generally recommend using underscores (_) to separate words in variable names. We also prefer all lowercase letters so that no one has to remember which letters are uppercase or lowercase.

Finally, the events column (renamed to be all lowercase in the first instruction) contains an empty string ("") for any day on which there was no significant weather event such as rain, fog, a thunderstorm, etc. However, if it’s the first time you’re seeing these data, it may not be obvious that this is the case, so it’s best for us to be explicit and replace the empty strings with something more meaningful.

new_colnames <- c("date", "events", "cloud_cover", "max_dew_point_f",
"max_gust_speed_mph", "max_humidity", "max_sea_level_pressure_in",
"max_temperature_f", "max_visibility_miles", "max_wind_speed_mph",
"mean_humidity", "mean_sea_level_pressure", "mean_temperature_f",
"mean_visibility_miles", "mean_wind_speed_mph", "mean_dew_point_f",
"min_dew_point_f", "min_humidity", "min_sea_level_pressure_in", 
"min_temperature_f", "min_visibility_miles", "precipitation_in", 
"wind_dir_degrees")
# Clean up column names
names(weather6) <- new_colnames

# Replace empty cells in events column
weather6$events[weather6$events == ""] <- "None"
    
# Print the first 6 rows of weather6
head(weather6)
LS0tDQp0aXRsZTogIkNsZWFuaW5nIGRhdGE6IHdlYXRoZXIgZXhhbXBsZSINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQpIaXN0b3JpY2FsIHdlYXRoZXIgZGF0YSBmcm9tIEJvc3RvbiwgVVNBDQogLSAxMiBtb250aHMgYmVnaW5uaW5nIERlYyAyMDE0DQogLSBUaGUgZGF0YSBhcmUgZGlydHkNCiAtIE9iamVjdGl2ZSBpcyB0byBjbGVhbiB0aGUgZGF0YQ0KIA0KIyMjRXhwbG9yZSB0aGUgZGF0YQ0KQmVmb3JlIGRpdmluZyBpbnRvIG91ciBkYXRhIGNsZWFuaW5nIHJvdXRpbmUsIHdlIG11c3QgZmlyc3QgdW5kZXJzdGFuZCB0aGUgYmFzaWMgc3RydWN0dXJlIG9mIHRoZSBkYXRhLiBUaGlzIGludm9sdmVzIGxvb2tpbmcgYXQgdGhpbmdzIGxpa2UgdGhlIGNsYXNzKCkgb2YgdGhlIGRhdGEgb2JqZWN0IHRvIG1ha2Ugc3VyZSBpdCdzIHdoYXQgd2UgZXhwZWN0IChnZW5lcmFsbHkgYSBkYXRhLmZyYW1lKSBpbiBhZGRpdGlvbiB0byBjaGVja2luZyBpdHMgZGltZW5zaW9ucyB3aXRoIGRpbSgpIGFuZCB0aGUgY29sdW1uIG5hbWVzIHdpdGggbmFtZXMoKS4NCmBgYHtyfQ0KI2xvYWQgdGhlIGRhdGENCndlYXRoZXIgPSByZWFkLmNzdigid2VhdGhlci5jc3YiKQ0KYGBgDQoNCg0KYGBge3J9DQojIFZlcmlmeSB0aGF0IHdlYXRoZXIgaXMgYSBkYXRhLmZyYW1lDQpjbGFzcyh3ZWF0aGVyKQ0KDQojIENoZWNrIHRoZSBkaW1lbnNpb25zDQpkaW0od2VhdGhlcikNCg0KIyBWaWV3IHRoZSBjb2x1bW4gbmFtZXMNCm5hbWVzKHdlYXRoZXIpDQpgYGANCk5leHQgdXAgaXMgdG8gbG9vayBhdCBzb21lIHN1bW1hcmllcyBvZiB0aGUgZGF0YS4gVGhpcyBpcyB3aGVyZSBmdW5jdGlvbnMgbGlrZSBzdHIoKSwgZ2xpbXBzZSgpIGZyb20gZHBseXIsIGFuZCBzdW1tYXJ5KCkgY29tZSBpbiBoYW5keS4NCmBgYHtyfQ0KIyBWaWV3IHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGRhdGENCnN0cih3ZWF0aGVyKQ0KDQojIExvYWQgZHBseXIgcGFja2FnZQ0KbGlicmFyeShkcGx5cikNCg0KIyBMb29rIGF0IHRoZSBzdHJ1Y3R1cmUgdXNpbmcgZHBseXIncyBnbGltcHNlKCkNCmdsaW1wc2Uod2VhdGhlcikNCg0KIyBWaWV3IGEgc3VtbWFyeSBvZiB0aGUgZGF0YQ0Kc3VtbWFyeSh3ZWF0aGVyKQ0KYGBgDQpBZnRlciB1bmRlcnN0YW5kaW5nIHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIGRhdGEgYW5kIGxvb2tpbmcgYXQgc29tZSBicmllZiBzdW1tYXJpZXMsIGl0IG9mdGVuIGhlbHBzIHRvIHByZXZpZXcgdGhlIGFjdHVhbCBkYXRhLiBUaGUgZnVuY3Rpb25zIGhlYWQoKSBhbmQgdGFpbCgpIGFsbG93IHlvdSB0byB2aWV3IHRoZSB0b3AgYW5kIGJvdHRvbSByb3dzIG9mIHRoZSBkYXRhLCANClJlY2FsbCB5b3UnbGwgYmUgc2hvd24gNiByb3dzIGJ5IGRlZmF1bHQsIGJ1dCB5b3UgY2FuIGFsdGVyIHRoaXMgYmVoYXZpb3Igd2l0aCBhIHNlY29uZCBhcmd1bWVudCB0byB0aGUgZnVuY3Rpb24uDQpgYGB7cn0NCiMgVmlldyBmaXJzdCA2IHJvd3MNCmhlYWQod2VhdGhlcikNCg0KIyBWaWV3IGZpcnN0IDE1IHJvd3MNCmhlYWQod2VhdGhlciwgMTUpDQoNCiMgVmlldyB0aGUgbGFzdCA2IHJvd3MNCnRhaWwod2VhdGhlcikNCg0KIyBWaWV3IHRoZSBsYXN0IDEwIHJvd3MNCnRhaWwod2VhdGhlciwgMTApDQpgYGANCiMjIyBDbGVhbmluZyB0aGUgZGF0YQ0KVGhlIHdlYXRoZXIgZGF0YXNldCBzdWZmZXJzIGZyb20gb25lIG9mIHRoZSBmaXZlIG1vc3QgY29tbW9uIHN5bXB0b21zIG9mIG1lc3N5IGRhdGE6IGNvbHVtbiBuYW1lcyBhcmUgdmFsdWVzLiBJbiBwYXJ0aWN1bGFyLCB0aGUgY29sdW1uIG5hbWVzIFgxLVgzMSByZXByZXNlbnQgZGF5cyBvZiB0aGUgbW9udGgsIHdoaWNoIHNob3VsZCByZWFsbHkgYmUgdmFsdWVzIG9mIGEgbmV3IHZhcmlhYmxlIGNhbGxlZCBkYXkuDQoNClRoZSB0aWR5ciBwYWNrYWdlIHByb3ZpZGVzIHRoZSBnYXRoZXIoKSBmdW5jdGlvbiBmb3IgZXhhY3RseSB0aGlzIHNjZW5hcmlvLiANCmBgYHtyfQ0KIyBMb2FkIHRoZSB0aWR5ciBwYWNrYWdlDQpsaWJyYXJ5KHRpZHlyKQ0KDQojIEdhdGhlciB0aGUgY29sdW1ucw0Kd2VhdGhlcjIgPC0gZ2F0aGVyKHdlYXRoZXIsIGRheSwgdmFsdWUsIFgxOlgzMSwgbmEucm0gPSBUUlVFKQ0KDQojIFZpZXcgdGhlIGhlYWQNCmhlYWQod2VhdGhlcjIpDQpgYGANCk91ciBkYXRhIHN1ZmZlciBmcm9tIGEgc2Vjb25kIGNvbW1vbiBzeW1wdG9tIG9mIG1lc3N5IGRhdGE6IHZhbHVlcyBhcmUgdmFyaWFibGUgbmFtZXMuIFNwZWNpZmljYWxseSwgdmFsdWVzIGluIHRoZSBtZWFzdXJlIGNvbHVtbiBzaG91bGQgYmUgdmFyaWFibGVzIChpLmUuIGNvbHVtbiBuYW1lcykgaW4gb3VyIGRhdGFzZXQuDQoNClRoZSBzcHJlYWQoKSBmdW5jdGlvbiBmcm9tIHRpZHlyIGlzIGRlc2lnbmVkIHRvIGhlbHAgd2l0aCB0aGlzLg0KYGBge3J9DQojIEZpcnN0IHJlbW92ZSBjb2x1bW4gb2Ygcm93IG5hbWVzDQp3aXRob3V0X3ggPC0gd2VhdGhlcjJbLCAtMV0NCg0KIyBTcHJlYWQgdGhlIGRhdGENCndlYXRoZXIzIDwtIHNwcmVhZCh3aXRob3V0X3gsIG1lYXN1cmUsIHZhbHVlKQ0KDQojIFZpZXcgdGhlIGhlYWQNCmhlYWQod2VhdGhlcjMpDQpgYGANCiMjI1ByZXBhcmUgdGhlIGRhdGEgZm9yIGFuYWx5c2lzDQpOb3cgdGhhdCB0aGUgd2VhdGhlciBkYXRhc2V0IGFkaGVyZXMgdG8gdGlkeSBkYXRhIHByaW5jaXBsZXMsIHRoZSBuZXh0IHN0ZXAgaXMgdG8gcHJlcGFyZSBpdCBmb3IgYW5hbHlzaXMuIFdlJ2xsIHN0YXJ0IGJ5IGNvbWJpbmluZyB0aGUgeWVhciwgbW9udGgsIGFuZCBkYXkgY29sdW1ucyBhbmQgcmVjb2RpbmcgdGhlIHJlc3VsdGluZyBjaGFyYWN0ZXIgY29sdW1uIGFzIGEgZGF0ZS4gV2UgY2FuIHVzZSBhIGNvbWJpbmF0aW9uIG9mIGJhc2UgUiwgc3RyaW5nciwgYW5kIGx1YnJpZGF0ZSB0byBhY2NvbXBsaXNoIHRoaXMgdGFzay4NCmBgYHtyfQ0KIyBMb2FkIHRoZSBzdHJpbmdyIGFuZCBsdWJyaWRhdGUgcGFja2FnZXMNCmxpYnJhcnkoc3RyaW5ncikNCmxpYnJhcnkobHVicmlkYXRlKQ0KDQoNCiMgUmVtb3ZlIFgncyBmcm9tIGRheSBjb2x1bW4NCndlYXRoZXIzJGRheSA8LSBzdHJfcmVwbGFjZSh3ZWF0aGVyMyRkYXksICJYIiwgIiIpDQoNCiMgVW5pdGUgdGhlIHllYXIsIG1vbnRoLCBhbmQgZGF5IGNvbHVtbnMNCndlYXRoZXI0IDwtIHVuaXRlKHdlYXRoZXIzLCBkYXRlLCB5ZWFyLCBtb250aCwgZGF5LCBzZXAgPSAiLSIpDQoNCiMgQ29udmVydCBkYXRlIGNvbHVtbiB0byBwcm9wZXIgZGF0ZSBmb3JtYXQgdXNpbmcgbHVicmlkYXRlcydzIHltZCgpDQp3ZWF0aGVyNCRkYXRlIDwtIHltZCh3ZWF0aGVyNCRkYXRlKQ0KDQojIFJlYXJyYW5nZSBjb2x1bW5zIHVzaW5nIGRwbHlyJ3Mgc2VsZWN0KCkNCndlYXRoZXI1IDwtIHNlbGVjdCh3ZWF0aGVyNCwgZGF0ZSwgRXZlbnRzLCBDbG91ZENvdmVyOldpbmREaXJEZWdyZWVzKQ0KDQojIFZpZXcgdGhlIGhlYWQgb2Ygd2VhdGhlcjUNCmhlYWQod2VhdGhlcjUpDQpgYGANCkl0J3MgaW1wb3J0YW50IGZvciBhbmFseXNpcyB0aGF0IHZhcmlhYmxlcyBhcmUgY29kZWQgYXBwcm9wcmlhdGVseS4gVGhpcyBpcyBub3QgeWV0IHRoZSBjYXNlIHdpdGggb3VyIHdlYXRoZXIgZGF0YS4gUmVjYWxsIHRoYXQgZnVuY3Rpb25zIHN1Y2ggYXMgYXMubnVtZXJpYygpIGFuZCBhcy5jaGFyYWN0ZXIoKSBjYW4gYmUgdXNlZCB0byBjb2VyY2UgdmFyaWFibGVzIGludG8gZGlmZmVyZW50IHR5cGVzLg0KDQpJdCdzIGltcG9ydGFudCB0byBrZWVwIGluIG1pbmQgdGhhdCBjb2VyY2lvbnMgYXJlIG5vdCBhbHdheXMgc3VjY2Vzc2Z1bCwgcGFydGljdWxhcmx5IGlmIHRoZXJlJ3Mgc29tZSBkYXRhIGluIGEgY29sdW1uIHRoYXQgeW91IGRvbid0IGV4cGVjdC4NCmBgYHtyfQ0KIyBWaWV3IHRoZSBzdHJ1Y3R1cmUgb2Ygd2VhdGhlcjUNCnN0cih3ZWF0aGVyNSkNCg0KIyBFeGFtaW5lIHRoZSBmaXJzdCAyMCByb3dzIG9mIHdlYXRoZXI1LiBBcmUgbW9zdCBvZiB0aGUgY2hhcmFjdGVycyBudW1lcmljPw0KaGVhZCh3ZWF0aGVyNSwgMjApDQoNCiMgU2VlIHdoYXQgaGFwcGVucyBpZiB3ZSB0cnkgdG8gY29udmVydCBQcmVjaXBpdGF0aW9uSW4gdG8gbnVtZXJpYw0KYXMubnVtZXJpYyh3ZWF0aGVyNSRQcmVjaXBpdGF0aW9uSW4pDQpgYGANCiBXaGF0IHZhbHVlcyBpbiBQcmVjaXBpdGF0aW9uSW4gd291bGQgYmVjb21lIE5BIGlmIGNvZXJjZWQgdG8gbnVtYmVycz8NCiAiVCIgd2FzIHVzZWQgdG8gZGVub3RlIGEgdHJhY2UgYW1vdW50IChpLmUuIHRvbyBzbWFsbCB0byBiZSBhY2N1cmF0ZWx5IG1lYXN1cmVkKSBvZiBwcmVjaXBpdGF0aW9uIGluIHRoZSBQcmVjaXBpdGF0aW9uSW4gY29sdW1uLiBJbiBvcmRlciB0byBjb2VyY2UgdGhpcyBjb2x1bW4gdG8gbnVtZXJpYywgeW91J2xsIG5lZWQgdG8gZGVhbCB3aXRoIHRoaXMgc29tZWhvdy4gVG8ga2VlcCB0aGluZ3Mgc2ltcGxlLCB3ZSB3aWxsIGp1c3QgcmVwbGFjZSAiVCIgd2l0aCB6ZXJvLCBhcyBhIHN0cmluZyAoIjAiKS4NCmBgYHtyfQ0KIyBSZXBsYWNlICJUIiB3aXRoICIwIiAoVCA9IHRyYWNlKQ0Kd2VhdGhlcjUkUHJlY2lwaXRhdGlvbkluIDwtIHN0cl9yZXBsYWNlKHdlYXRoZXI1JFByZWNpcGl0YXRpb25JbiwgIlQiLCAiMCIpDQoNCiMgQ29udmVydCBjaGFyYWN0ZXJzIHRvIG51bWVyaWNzDQp3ZWF0aGVyNiA8LSBtdXRhdGVfYXQod2VhdGhlcjUsIHZhcnMoQ2xvdWRDb3ZlcjpXaW5kRGlyRGVncmVlcyksIGZ1bnMoYXMubnVtZXJpYykpDQoNCiMgTG9vayBhdCByZXN1bHQNCnN0cih3ZWF0aGVyNikNCmBgYA0KIyMjIE1pc3NpbmcsIGV4dHJlbWUsIGFuZCB1bmV4cGVjdGVkIHZhbHVlcw0KIyMjIyBNaXNzaW5nIHZhbHVlcw0KQmVmb3JlIGRlYWxpbmcgd2l0aCBtaXNzaW5nIHZhbHVlcyBpbiB0aGUgZGF0YSwgaXQncyBpbXBvcnRhbnQgdG8gZmluZCB0aGVtIGFuZCBmaWd1cmUgb3V0IHdoeSB0aGV5IGV4aXN0IGluIHRoZSBmaXJzdCBwbGFjZS4gSWYgeW91ciBkYXRhc2V0IGlzIHRvbyBiaWcgdG8gbG9vayBhdCBhbGwgYXQgb25jZSwgbGlrZSBpdCBpcyBoZXJlLCByZW1lbWJlciB5b3UgY2FuIHVzZSBzdW0oKSBhbmQgaXMubmEoKSB0byBxdWlja2x5IHNpemUgdXAgdGhlIHNpdHVhdGlvbiBieSBjb3VudGluZyB0aGUgbnVtYmVyIG9mIE5BIHZhbHVlcy4NCg0KVGhlIHN1bW1hcnkoKSBmdW5jdGlvbiBtYXkgYWxzbyBjb21lIGluIGhhbmR5IGZvciBpZGVudGlmeWluZyB3aGljaCB2YXJpYWJsZXMgY29udGFpbiB0aGUgbWlzc2luZyB2YWx1ZXMuIEZpbmFsbHksIHRoZSB3aGljaCgpIGZ1bmN0aW9uIGlzIHVzZWZ1bCBmb3IgbG9jYXRpbmcgdGhlIG1pc3NpbmcgdmFsdWVzIHdpdGhpbiBhIHBhcnRpY3VsYXIgY29sdW1uLg0KYGBge3J9DQojIENvdW50IG1pc3NpbmcgdmFsdWVzDQpzdW0oaXMubmEod2VhdGhlcjYpKQ0KDQojIEZpbmQgbWlzc2luZyB2YWx1ZXMNCnN1bW1hcnkod2VhdGhlcjYpDQoNCiMgRmluZCBpbmRpY2VzIG9mIE5BcyBpbiBNYXguR3VzdC5TcGVlZE1QSA0KaW5kIDwtIHdoaWNoKGlzLm5hKHdlYXRoZXI2JE1heC5HdXN0LlNwZWVkTVBIKSkNCg0KIyBMb29rIGF0IHRoZSBmdWxsIHJvd3MgZm9yIHJlY29yZHMgbWlzc2luZyBNYXguR3VzdC5TcGVlZE1QSA0Kd2VhdGhlcjZbaW5kLCBdDQpgYGANCkluIHRoaXMgc2l0dWF0aW9uIGl0J3MgdW5jbGVhciB3aHkgdGhlc2UgdmFsdWVzIGFyZSBtaXNzaW5nIGFuZCB0aGVyZSBkb2Vzbid0IGFwcGVhciB0byBiZSBhbnkgb2J2aW91cyBwYXR0ZXJuIHRvIHRoZWlyIG1pc3NpbmduZXNzLCBzbyB3ZSdsbCBsZWF2ZSB0aGVtIGFsb25lIGZvciBub3cuDQoNCiMjIyMgZXh0cmVtZSB2YWx1ZXMNCkJlc2lkZXMgbWlzc2luZyB2YWx1ZXMsIHdlIHdhbnQgdG8ga25vdyBpZiB0aGVyZSBhcmUgdmFsdWVzIGluIHRoZSBkYXRhIHRoYXQgYXJlIHRvbyBleHRyZW1lIG9yIGJpemFycmUgdG8gYmUgcGxhdXNpYmxlLiBBIGdyZWF0IHdheSB0byBzdGFydCB0aGUgc2VhcmNoIGZvciB0aGVzZSB2YWx1ZXMgaXMgd2l0aCBzdW1tYXJ5KCkuDQoNCk9uY2UgaW1wbGF1c2libGUgdmFsdWVzIGFyZSBpZGVudGlmaWVkLCB0aGV5IG11c3QgYmUgZGVhbHQgd2l0aCBpbiBhbiBpbnRlbGxpZ2VudCBhbmQgaW5mb3JtZWQgd2F5LiBTb21ldGltZXMgdGhlIGJlc3Qgd2F5IGZvcndhcmQgaXMgb2J2aW91cyBhbmQgb3RoZXIgdGltZXMgaXQgbWF5IHJlcXVpcmUgc29tZSByZXNlYXJjaCBhbmQvb3IgZGlzY3Vzc2lvbnMgd2l0aCB0aGUgb3JpZ2luYWwgY29sbGVjdG9ycyBvZiB0aGUgZGF0YS4NCmBgYHtyfQ0KIyBSZXZpZXcgZGlzdHJpYnV0aW9ucyBmb3IgYWxsIHZhcmlhYmxlcw0Kc3VtbWFyeSh3ZWF0aGVyNikNCg0KIyBGaW5kIHJvdyB3aXRoIE1heC5IdW1pZGl0eSBvZiAxMDAwDQppbmQgPC0gd2hpY2god2VhdGhlcjYkTWF4Lkh1bWlkaXR5ID09IDEwMDApDQoNCiMgTG9vayBhdCB0aGUgZGF0YSBmb3IgdGhhdCBkYXkNCndlYXRoZXI2W2luZCwgXQ0KDQojIENoYW5nZSAxMDAwIHRvIDEwMA0Kd2VhdGhlcjYkTWF4Lkh1bWlkaXR5W2luZF0gPC0gMTAwDQpgYGANCllvdSd2ZSBkaXNjb3ZlcmVkIGFuZCByZXBhaXJlZCBvbmUgb2J2aW91cyBlcnJvciBpbiB0aGUgZGF0YSwgYnV0IGl0IGFwcGVhcnMgdGhhdCB0aGVyZSdzIGFub3RoZXIuIFNvbWV0aW1lcyB5b3UgZ2V0IGx1Y2t5IGFuZCBjYW4gaW5mZXIgdGhlIGNvcnJlY3Qgb3IgaW50ZW5kZWQgdmFsdWUgZnJvbSB0aGUgb3RoZXIgZGF0YS4gRm9yIGV4YW1wbGUsIGlmIHlvdSBrbm93IHRoZSBtaW5pbXVtIGFuZCBtYXhpbXVtIHZhbHVlcyBvZiBhIHBhcnRpY3VsYXIgbWV0cmljIG9uIGEgZ2l2ZW4gZGF5Li4uDQpgYGB7cn0NCiMgTG9vayBhdCBzdW1tYXJ5IG9mIE1lYW4uVmlzaWJpbGl0eU1pbGVzDQpzdW1tYXJ5KHdlYXRoZXI2JE1lYW4uVmlzaWJpbGl0eU1pbGVzKQ0KDQojIEdldCBpbmRleCBvZiByb3cgd2l0aCAtMSB2YWx1ZQ0KaW5kIDwtIHdoaWNoKHdlYXRoZXI2JE1lYW4uVmlzaWJpbGl0eU1pbGVzID09IC0xKQ0KDQojIExvb2sgYXQgZnVsbCByb3cNCndlYXRoZXI2W2luZCwgXQ0KDQojIFNldCBNZWFuLlZpc2liaWxpdHlNaWxlcyB0byB0aGUgYXBwcm9wcmlhdGUgdmFsdWUNCndlYXRoZXI2JE1lYW4uVmlzaWJpbGl0eU1pbGVzW2luZF0gPC0gMTANCmBgYA0KSW4gYWRkaXRpb24gdG8gZGVhbGluZyB3aXRoIG9idmlvdXMgZXJyb3JzIGluIHRoZSBkYXRhLCB3ZSB3YW50IHRvIHNlZSBpZiB0aGVyZSBhcmUgb3RoZXIgZXh0cmVtZSB2YWx1ZXMuIEluIGFkZGl0aW9uIHRvIHRoZSB0cnVzdHkgc3VtbWFyeSgpIGZ1bmN0aW9uLCBoaXN0KCkgaXMgdXNlZnVsIGZvciBxdWlja2x5IGdldHRpbmcgYSBmZWVsIGZvciBob3cgZGlmZmVyZW50IHZhcmlhYmxlcyBhcmUgZGlzdHJpYnV0ZWQuDQpgYGB7cn0NCiMgUmV2aWV3IHN1bW1hcnkgb2YgZnVsbCBkYXRhIG9uY2UgbW9yZQ0Kc3VtbWFyeSh3ZWF0aGVyNikNCg0KIyBMb29rIGF0IGhpc3RvZ3JhbSBmb3IgTWVhbkRldy5Qb2ludEYNCmhpc3Qod2VhdGhlcjYkTWVhbkRldy5Qb2ludEYpDQoNCiMgTG9vayBhdCBoaXN0b2dyYW0gZm9yIE1pbi5UZW1wZXJhdHVyZUYNCmhpc3Qod2VhdGhlcjYkTWluLlRlbXBlcmF0dXJlRikNCg0KIyBDb21wYXJlIHRvIGhpc3RvZ3JhbSBmb3IgTWVhbi5UZW1wZXJhdHVyZUYNCmhpc3Qod2VhdGhlcjYkTWVhbi5UZW1wZXJhdHVyZUYpDQpgYGANCkJlZm9yZSBvZmZpY2lhbGx5IGNhbGxpbmcgb3VyIHdlYXRoZXIgZGF0YSBjbGVhbiwgd2Ugd2FudCB0byBwdXQgYSBjb3VwbGUgb2YgZmluaXNoaW5nIHRvdWNoZXMgb24gdGhlIGRhdGEuIFRoZXNlIGFyZSBhIGJpdCBtb3JlIHN1YmplY3RpdmUgYW5kIG1heSBub3QgYmUgbmVjZXNzYXJ5IGZvciBhbmFseXNpcywgYnV0IHRoZXkgd2lsbCBtYWtlIHRoZSBkYXRhIGVhc2llciBmb3Igb3RoZXJzIHRvIGludGVycHJldCwgd2hpY2ggaXMgZ2VuZXJhbGx5IGEgZ29vZCB0aGluZy4NCg0KVGhlcmUgYXJlIGEgbnVtYmVyIG9mIHN0eWxpc3RpYyBjb252ZW50aW9ucyBpbiB0aGUgUiBsYW5ndWFnZS4gRGVwZW5kaW5nIG9uIHdobyB5b3UgYXNrLCB0aGVzZSBjb252ZW50aW9ucyBtYXkgdmFyeS4gQmVjYXVzZSB0aGUgcGVyaW9kICguKSBoYXMgc3BlY2lhbCBtZWFuaW5nIGluIGNlcnRhaW4gc2l0dWF0aW9ucywgd2UgZ2VuZXJhbGx5IHJlY29tbWVuZCB1c2luZyB1bmRlcnNjb3JlcyAoXykgdG8gc2VwYXJhdGUgd29yZHMgaW4gdmFyaWFibGUgbmFtZXMuIFdlIGFsc28gcHJlZmVyIGFsbCBsb3dlcmNhc2UgbGV0dGVycyBzbyB0aGF0IG5vIG9uZSBoYXMgdG8gcmVtZW1iZXIgd2hpY2ggbGV0dGVycyBhcmUgdXBwZXJjYXNlIG9yIGxvd2VyY2FzZS4NCg0KRmluYWxseSwgdGhlIGV2ZW50cyBjb2x1bW4gKHJlbmFtZWQgdG8gYmUgYWxsIGxvd2VyY2FzZSBpbiB0aGUgZmlyc3QgaW5zdHJ1Y3Rpb24pIGNvbnRhaW5zIGFuIGVtcHR5IHN0cmluZyAoIiIpIGZvciBhbnkgZGF5IG9uIHdoaWNoIHRoZXJlIHdhcyBubyBzaWduaWZpY2FudCB3ZWF0aGVyIGV2ZW50IHN1Y2ggYXMgcmFpbiwgZm9nLCBhIHRodW5kZXJzdG9ybSwgZXRjLiBIb3dldmVyLCBpZiBpdCdzIHRoZSBmaXJzdCB0aW1lIHlvdSdyZSBzZWVpbmcgdGhlc2UgZGF0YSwgaXQgbWF5IG5vdCBiZSBvYnZpb3VzIHRoYXQgdGhpcyBpcyB0aGUgY2FzZSwgc28gaXQncyBiZXN0IGZvciB1cyB0byBiZSBleHBsaWNpdCBhbmQgcmVwbGFjZSB0aGUgZW1wdHkgc3RyaW5ncyB3aXRoIHNvbWV0aGluZyBtb3JlIG1lYW5pbmdmdWwuDQpgYGB7cn0NCm5ld19jb2xuYW1lcyA8LSBjKCJkYXRlIiwgImV2ZW50cyIsICJjbG91ZF9jb3ZlciIsICJtYXhfZGV3X3BvaW50X2YiLA0KIm1heF9ndXN0X3NwZWVkX21waCIsICJtYXhfaHVtaWRpdHkiLCAibWF4X3NlYV9sZXZlbF9wcmVzc3VyZV9pbiIsDQoibWF4X3RlbXBlcmF0dXJlX2YiLCAibWF4X3Zpc2liaWxpdHlfbWlsZXMiLCAibWF4X3dpbmRfc3BlZWRfbXBoIiwNCiJtZWFuX2h1bWlkaXR5IiwgIm1lYW5fc2VhX2xldmVsX3ByZXNzdXJlIiwgIm1lYW5fdGVtcGVyYXR1cmVfZiIsDQoibWVhbl92aXNpYmlsaXR5X21pbGVzIiwgIm1lYW5fd2luZF9zcGVlZF9tcGgiLCAibWVhbl9kZXdfcG9pbnRfZiIsDQoibWluX2Rld19wb2ludF9mIiwgIm1pbl9odW1pZGl0eSIsICJtaW5fc2VhX2xldmVsX3ByZXNzdXJlX2luIiwgDQoibWluX3RlbXBlcmF0dXJlX2YiLCAibWluX3Zpc2liaWxpdHlfbWlsZXMiLCAicHJlY2lwaXRhdGlvbl9pbiIsIA0KIndpbmRfZGlyX2RlZ3JlZXMiKQ0KIyBDbGVhbiB1cCBjb2x1bW4gbmFtZXMNCm5hbWVzKHdlYXRoZXI2KSA8LSBuZXdfY29sbmFtZXMNCg0KIyBSZXBsYWNlIGVtcHR5IGNlbGxzIGluIGV2ZW50cyBjb2x1bW4NCndlYXRoZXI2JGV2ZW50c1t3ZWF0aGVyNiRldmVudHMgPT0gIiJdIDwtICJOb25lIg0KICAgIA0KIyBQcmludCB0aGUgZmlyc3QgNiByb3dzIG9mIHdlYXRoZXI2DQpoZWFkKHdlYXRoZXI2KQ0KYGBgDQoNCg==