Historical weather data from Boston, USA - 12 months beginning Dec 2014 - The data are dirty - Objective is to clean the data
###Explore the data Before diving into our data cleaning routine, we must first understand the basic structure of the data. This involves looking at things like the class() of the data object to make sure it’s what we expect (generally a data.frame) in addition to checking its dimensions with dim() and the column names with names().
weather = read.csv("weather.csv")
# Verify that weather is a data.frame
class(weather)
[1] "data.frame"
# Check the dimensions
dim(weather)
[1] 286 35
# View the column names
names(weather)
[1] "X" "year" "month" "measure" "X1" "X2"
[7] "X3" "X4" "X5" "X6" "X7" "X8"
[13] "X9" "X10" "X11" "X12" "X13" "X14"
[19] "X15" "X16" "X17" "X18" "X19" "X20"
[25] "X21" "X22" "X23" "X24" "X25" "X26"
[31] "X27" "X28" "X29" "X30" "X31"
Next up is to look at some summaries of the data. This is where functions like str(), glimpse() from dplyr, and summary() come in handy.
# View the structure of the data
str(weather)
'data.frame': 286 obs. of 35 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
$ month : int 12 12 12 12 12 12 12 12 12 12 ...
$ measure: Factor w/ 22 levels "CloudCover","Events",..: 7 12 19 3 15 16 5 10 17 6 ...
$ X1 : Factor w/ 129 levels "","-6","0","0.01",..: 101 89 76 83 78 31 111 100 89 66 ...
$ X2 : Factor w/ 126 levels "","-5","0","0.01",..: 82 77 72 80 34 18 122 109 91 69 ...
$ X3 : Factor w/ 124 levels "","-3","0","0.01",..: 85 78 70 82 76 27 11 109 90 59 ...
$ X4 : Factor w/ 131 levels "","0","0.02",..: 82 75 49 27 23 14 108 94 77 68 ...
$ X5 : Factor w/ 119 levels "","-4","-6","-7",..: 74 66 31 69 29 14 109 96 77 61 ...
$ X6 : Factor w/ 119 levels "","-1","-5","-8",..: 75 73 69 75 72 67 11 116 111 59 ...
$ X7 : Factor w/ 120 levels "","-13","-3",..: 70 44 22 68 20 3 116 92 33 61 ...
$ X8 : Factor w/ 129 levels "","-1","-18",..: 35 27 19 33 17 43 123 106 82 65 ...
$ X9 : Factor w/ 127 levels "","0","0.01",..: 81 73 34 81 76 33 10 121 116 63 ...
$ X10 : Factor w/ 125 levels "","-2","0","0.05",..: 87 82 77 84 78 76 13 123 121 39 ...
$ X11 : Factor w/ 126 levels "","-1","0","0.01",..: 74 72 68 73 67 32 120 115 111 45 ...
$ X12 : Factor w/ 126 levels "","0","0.01",..: 77 74 68 33 32 29 117 110 100 50 ...
$ X13 : Factor w/ 128 levels "","-10","-4",..: 85 80 74 32 29 26 113 105 95 47 ...
$ X14 : Factor w/ 125 levels "","-4","-9","0",..: 85 81 75 36 32 29 115 104 92 51 ...
$ X15 : Factor w/ 129 levels "","-13","-2",..: 83 78 69 74 37 34 120 112 101 60 ...
$ X16 : Factor w/ 123 levels "","-11","-16",..: 83 80 74 82 75 53 120 114 108 61 ...
$ X17 : Factor w/ 133 levels "","-3","-7","-9",..: 90 86 83 87 83 74 12 125 113 47 ...
$ X18 : Factor w/ 122 levels "","-1","-6","0",..: 76 73 70 68 46 29 114 104 88 40 ...
$ X19 : Factor w/ 126 levels "","-2","-4","-5",..: 73 71 36 29 26 21 101 95 90 58 ...
$ X20 : Factor w/ 128 levels "","-10","-4",..: 79 74 32 54 28 23 121 117 110 70 ...
$ X21 : Factor w/ 126 levels "","-5","0","0.07",..: 73 69 49 49 31 28 115 108 100 65 ...
$ X22 : Factor w/ 128 levels "","-5","0","0.05",..: 83 78 70 78 72 27 120 115 107 66 ...
$ X23 : Factor w/ 122 levels "","-14","-6",..: 87 86 84 86 84 79 11 119 115 66 ...
$ X24 : Factor w/ 125 levels "","-15","-4",..: 81 80 78 81 80 78 10 120 119 51 ...
$ X25 : Factor w/ 130 levels "","0","0.03",..: 99 93 85 98 84 34 9 115 89 49 ...
$ X26 : Factor w/ 118 levels "","-2","0","0.02",..: 81 74 67 60 30 29 99 90 79 52 ...
$ X27 : Factor w/ 132 levels "","-2","-5","0",..: 94 86 78 75 71 37 112 102 92 59 ...
$ X28 : Factor w/ 127 levels "","-1","0","0.01",..: 91 86 81 83 75 33 112 104 92 50 ...
$ X29 : Factor w/ 113 levels "","-2","0","0.04",..: 70 64 45 27 18 6 86 76 65 53 ...
$ X30 : Factor w/ 124 levels "","-6","0","0.02",..: 50 29 24 7 73 2 81 71 29 60 ...
$ X31 : Factor w/ 95 levels "","-8","0","0.05",..: 42 23 17 86 65 6 71 61 53 51 ...
# Load dplyr package
library(dplyr)
# Look at the structure using dplyr's glimpse()
glimpse(weather)
Observations: 286
Variables: 35
$ X [3m[38;5;246m<int>[39m[23m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ year [3m[38;5;246m<int>[39m[23m 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 20...
$ month [3m[38;5;246m<int>[39m[23m 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12...
$ measure [3m[38;5;246m<fct>[39m[23m Max.TemperatureF, Mean.TemperatureF, Min.Temperatu...
$ X1 [3m[38;5;246m<fct>[39m[23m 64, 52, 39, 46, 40, 26, 74, 63, 52, 30.45, 30.13, ...
$ X2 [3m[38;5;246m<fct>[39m[23m 42, 38, 33, 40, 27, 17, 92, 72, 51, 30.71, 30.59, ...
$ X3 [3m[38;5;246m<fct>[39m[23m 51, 44, 37, 49, 42, 24, 100, 79, 57, 30.4, 30.07, ...
$ X4 [3m[38;5;246m<fct>[39m[23m 43, 37, 30, 24, 21, 13, 69, 54, 39, 30.56, 30.33, ...
$ X5 [3m[38;5;246m<fct>[39m[23m 42, 34, 26, 37, 25, 12, 85, 66, 47, 30.68, 30.59, ...
$ X6 [3m[38;5;246m<fct>[39m[23m 45, 42, 38, 45, 40, 36, 100, 93, 85, 30.42, 30.24,...
$ X7 [3m[38;5;246m<fct>[39m[23m 38, 30, 21, 36, 20, -3, 92, 61, 29, 30.69, 30.46, ...
$ X8 [3m[38;5;246m<fct>[39m[23m 29, 24, 18, 28, 16, 3, 92, 70, 47, 30.77, 30.67, 3...
$ X9 [3m[38;5;246m<fct>[39m[23m 49, 39, 29, 49, 41, 28, 100, 93, 86, 30.51, 30.04,...
$ X10 [3m[38;5;246m<fct>[39m[23m 48, 43, 38, 45, 39, 37, 100, 95, 89, 29.58, 29.5, ...
$ X11 [3m[38;5;246m<fct>[39m[23m 39, 36, 32, 37, 31, 27, 92, 87, 82, 29.81, 29.61, ...
$ X12 [3m[38;5;246m<fct>[39m[23m 39, 35, 31, 28, 27, 25, 85, 75, 64, 29.88, 29.85, ...
$ X13 [3m[38;5;246m<fct>[39m[23m 42, 37, 32, 28, 26, 24, 75, 65, 55, 29.86, 29.82, ...
$ X14 [3m[38;5;246m<fct>[39m[23m 45, 39, 33, 29, 27, 25, 82, 68, 53, 29.91, 29.83, ...
$ X15 [3m[38;5;246m<fct>[39m[23m 42, 37, 32, 33, 29, 27, 89, 75, 60, 30.15, 30.05, ...
$ X16 [3m[38;5;246m<fct>[39m[23m 44, 40, 35, 42, 36, 30, 96, 85, 73, 30.17, 30.09, ...
$ X17 [3m[38;5;246m<fct>[39m[23m 49, 45, 41, 46, 41, 32, 100, 85, 70, 29.91, 29.75,...
$ X18 [3m[38;5;246m<fct>[39m[23m 44, 40, 36, 34, 30, 26, 89, 73, 57, 29.87, 29.78, ...
$ X19 [3m[38;5;246m<fct>[39m[23m 37, 33, 29, 25, 22, 20, 69, 63, 56, 30.15, 29.98, ...
$ X20 [3m[38;5;246m<fct>[39m[23m 36, 32, 27, 30, 24, 20, 89, 79, 69, 30.31, 30.26, ...
$ X21 [3m[38;5;246m<fct>[39m[23m 36, 33, 30, 30, 27, 25, 85, 77, 69, 30.37, 30.32, ...
$ X22 [3m[38;5;246m<fct>[39m[23m 44, 39, 33, 39, 34, 25, 89, 79, 69, 30.4, 30.35, 3...
$ X23 [3m[38;5;246m<fct>[39m[23m 47, 45, 42, 45, 42, 37, 100, 91, 82, 30.31, 30.23,...
$ X24 [3m[38;5;246m<fct>[39m[23m 46, 44, 41, 46, 44, 41, 100, 98, 96, 30.13, 29.9, ...
$ X25 [3m[38;5;246m<fct>[39m[23m 59, 52, 44, 58, 43, 29, 100, 75, 49, 29.96, 29.63,...
$ X26 [3m[38;5;246m<fct>[39m[23m 50, 44, 37, 31, 29, 28, 70, 60, 49, 30.16, 30.11, ...
$ X27 [3m[38;5;246m<fct>[39m[23m 52, 45, 38, 34, 31, 29, 70, 60, 50, 30.22, 30.14, ...
$ X28 [3m[38;5;246m<fct>[39m[23m 52, 46, 40, 42, 35, 27, 76, 65, 53, 29.99, 29.87, ...
$ X29 [3m[38;5;246m<fct>[39m[23m 41, 36, 30, 26, 20, 10, 64, 51, 37, 30.22, 30.12, ...
$ X30 [3m[38;5;246m<fct>[39m[23m 30, 26, 22, 10, 4, -6, 50, 38, 26, 30.36, 30.32, 3...
$ X31 [3m[38;5;246m<fct>[39m[23m 30, 25, 20, 8, 5, 1, 57, 44, 31, 30.32, 30.25, 30....
# View a summary of the data
summary(weather)
X year month
Min. : 1.00 Min. :2014 Min. : 1.000
1st Qu.: 72.25 1st Qu.:2015 1st Qu.: 4.000
Median :143.50 Median :2015 Median : 7.000
Mean :143.50 Mean :2015 Mean : 6.923
3rd Qu.:214.75 3rd Qu.:2015 3rd Qu.:10.000
Max. :286.00 Max. :2015 Max. :12.000
measure X1 X2
CloudCover : 13 10 : 33 10 : 27
Events : 13 6 : 8 0 : 9
Max.Dew.PointF : 13 22 : 7 : 7
Max.Gust.SpeedMPH : 13 0 : 6 7 : 7
Max.Humidity : 13 30 : 6 37 : 5
Max.Sea.Level.PressureIn: 13 (Other):225 (Other):209
(Other) :208 NA's : 1 NA's : 22
X3 X4 X5 X6 X7
10 : 23 10 : 28 10 : 29 10 : 32 10 : 30
1 : 7 0 : 10 : 8 : 9 : 7
0 : 6 : 7 0 : 8 0 : 9 0 : 7
7 : 6 49 : 5 22 : 6 9 : 5 21 : 6
9 : 6 6 : 5 5 : 5 15 : 4 29 : 5
(Other):214 (Other):209 (Other):208 (Other):205 (Other):209
NA's : 24 NA's : 22 NA's : 22 NA's : 22 NA's : 22
X8 X9 X10 X11 X12
10 : 24 10 : 22 10 : 26 10 : 24 10 : 26
: 6 8 : 8 7 : 8 7 : 9 0 : 10
0 : 6 0 : 6 8 : 8 0 : 7 7 : 7
25 : 6 93 : 6 0 : 6 : 6 : 6
9 : 6 : 5 21 : 6 68 : 6 25 : 6
(Other):215 (Other):217 (Other):210 (Other):212 (Other):208
NA's : 23 NA's : 22 NA's : 22 NA's : 22 NA's : 23
X13 X14 X15 X16 X17
10 : 31 10 : 34 10 : 28 10 : 32 10 : 29
: 9 0 : 8 0 : 10 0 : 9 : 8
0 : 9 : 7 : 6 : 8 0 : 7
59 : 5 39 : 5 6 : 5 3 : 6 8 : 6
6 : 5 51 : 5 67 : 5 8 : 5 1 : 5
(Other):205 (Other):205 (Other):210 (Other):204 (Other):209
NA's : 22 NA's : 22 NA's : 22 NA's : 22 NA's : 22
X18 X19 X20 X21 X22
10 : 30 10 : 32 10 : 26 10 : 24 10 : 24
0 : 9 0 : 9 : 8 0 : 7 0 : 7
8 : 8 : 8 0 : 6 5 : 7 7 : 7
: 6 20 : 7 7 : 6 6 : 6 39 : 6
17 : 5 30 : 6 26 : 5 : 5 8 : 5
(Other):205 (Other):202 (Other):213 (Other):215 (Other):215
NA's : 23 NA's : 22 NA's : 22 NA's : 22 NA's : 22
X23 X24 X25 X26 X27
10 : 33 10 : 27 10 : 29 10 : 29 10 : 28
0 : 12 0 : 8 0 : 8 0 : 8 0 : 9
: 8 : 7 : 6 8 : 8 : 7
18 : 5 16 : 7 4 : 5 : 6 52 : 5
42 : 5 20 : 7 55 : 5 16 : 6 7 : 5
(Other):201 (Other):208 (Other):211 (Other):207 (Other):210
NA's : 22 NA's : 22 NA's : 22 NA's : 22 NA's : 22
X28 X29 X30 X31
10 : 25 10 : 28 10 : 28 10 : 17
0 : 7 0 : 9 0 : 8 0 : 7
8 : 7 : 8 : 7 : 5
Rain : 6 7 : 8 26 : 7 31 : 5
4 : 5 20 : 5 6 : 5 1 : 3
(Other):214 (Other):184 (Other):187 (Other):117
NA's : 22 NA's : 44 NA's : 44 NA's :132
After understanding the structure of the data and looking at some brief summaries, it often helps to preview the actual data. The functions head() and tail() allow you to view the top and bottom rows of the data, Recall you’ll be shown 6 rows by default, but you can alter this behavior with a second argument to the function.
# View first 6 rows
head(weather)
# View first 15 rows
head(weather, 15)
# View the last 6 rows
tail(weather)
# View the last 10 rows
tail(weather, 10)
The weather dataset suffers from one of the five most common symptoms of messy data: column names are values. In particular, the column names X1-X31 represent days of the month, which should really be values of a new variable called day.
The tidyr package provides the gather() function for exactly this scenario.
# Load the tidyr package
library(tidyr)
# Gather the columns
weather2 <- gather(weather, day, value, X1:X31, na.rm = TRUE)
attributes are not identical across measure variables;
they will be dropped
# View the head
head(weather2)
Our data suffer from a second common symptom of messy data: values are variable names. Specifically, values in the measure column should be variables (i.e. column names) in our dataset.
The spread() function from tidyr is designed to help with this.
# First remove column of row names
without_x <- weather2[, -1]
# Spread the data
weather3 <- spread(without_x, measure, value)
# View the head
head(weather3)
###Prepare the data for analysis Now that the weather dataset adheres to tidy data principles, the next step is to prepare it for analysis. We’ll start by combining the year, month, and day columns and recoding the resulting character column as a date. We can use a combination of base R, stringr, and lubridate to accomplish this task.
# Load the stringr and lubridate packages
library(stringr)
library(lubridate)
# Remove X's from day column
weather3$day <- str_replace(weather3$day, "X", "")
# Unite the year, month, and day columns
weather4 <- unite(weather3, date, year, month, day, sep = "-")
# Convert date column to proper date format using lubridates's ymd()
weather4$date <- ymd(weather4$date)
# Rearrange columns using dplyr's select()
weather5 <- select(weather4, date, Events, CloudCover:WindDirDegrees)
# View the head of weather5
head(weather5)
It’s important for analysis that variables are coded appropriately. This is not yet the case with our weather data. Recall that functions such as as.numeric() and as.character() can be used to coerce variables into different types.
It’s important to keep in mind that coercions are not always successful, particularly if there’s some data in a column that you don’t expect.
# View the structure of weather5
str(weather5)
'data.frame': 366 obs. of 23 variables:
$ date : Date, format: "2014-12-01" ...
$ Events : chr "Rain" "Rain" "Rain-Snow" "Snow" ...
$ CloudCover : chr "6" "8" "8" "7" ...
$ Max.Dew.PointF : chr "46" "45" "37" "28" ...
$ Max.Gust.SpeedMPH : chr "29" "29" "28" "21" ...
$ Max.Humidity : chr "74" "100" "92" "85" ...
$ Max.Sea.Level.PressureIn : chr "30.45" "29.58" "29.81" "29.88" ...
$ Max.TemperatureF : chr "64" "48" "39" "39" ...
$ Max.VisibilityMiles : chr "10" "10" "10" "10" ...
$ Max.Wind.SpeedMPH : chr "22" "23" "21" "16" ...
$ Mean.Humidity : chr "63" "95" "87" "75" ...
$ Mean.Sea.Level.PressureIn: chr "30.13" "29.5" "29.61" "29.85" ...
$ Mean.TemperatureF : chr "52" "43" "36" "35" ...
$ Mean.VisibilityMiles : chr "10" "3" "7" "10" ...
$ Mean.Wind.SpeedMPH : chr "13" "13" "13" "11" ...
$ MeanDew.PointF : chr "40" "39" "31" "27" ...
$ Min.DewpointF : chr "26" "37" "27" "25" ...
$ Min.Humidity : chr "52" "89" "82" "64" ...
$ Min.Sea.Level.PressureIn : chr "30.01" "29.43" "29.44" "29.81" ...
$ Min.TemperatureF : chr "39" "38" "32" "31" ...
$ Min.VisibilityMiles : chr "10" "1" "1" "7" ...
$ PrecipitationIn : chr "0.01" "0.28" "0.02" "T" ...
$ WindDirDegrees : chr "268" "357" "230" "286" ...
# Examine the first 20 rows of weather5. Are most of the characters numeric?
head(weather5, 20)
# See what happens if we try to convert PrecipitationIn to numeric
as.numeric(weather5$PrecipitationIn)
NAs introducidos por coerci昼㸳n
[1] 0.01 0.28 0.02 NA NA 0.00 0.00 NA 0.43 0.01 0.00 0.10 NA
[14] NA 0.05 0.25 0.56 0.14 0.00 0.00 0.01 0.00 0.44 0.00 0.00 0.00
[27] 0.11 1.09 0.13 0.03 2.90 0.00 0.00 0.00 0.20 0.00 NA 0.12 0.00
[40] 0.00 0.15 0.00 0.00 0.00 0.00 NA 0.00 0.71 0.00 0.10 0.95 0.01
[53] NA 0.62 0.06 0.05 0.57 0.00 0.02 NA 0.00 0.01 0.00 0.05 0.01
[66] 0.03 0.00 0.23 0.39 0.00 0.02 0.01 0.06 0.78 0.00 0.17 0.11 0.00
[79] NA 0.07 0.02 0.00 0.00 0.00 0.00 0.09 NA 0.07 0.37 0.88 0.17
[92] 0.06 0.01 0.00 0.00 0.80 0.27 0.00 0.14 0.00 0.00 0.01 0.05 0.09
[105] 0.00 0.00 0.00 0.04 0.80 0.21 0.12 0.00 0.26 NA 0.00 0.02 NA
[118] 0.00 0.00 NA 0.00 0.00 0.09 0.00 0.00 0.00 0.01 0.00 0.00 0.06
[131] 0.00 0.00 0.00 0.61 0.54 NA 0.00 NA 0.00 0.00 0.10 0.07 0.00
[144] 0.03 0.00 0.39 0.00 0.00 0.03 0.26 0.09 0.00 0.00 0.00 0.02 0.00
[157] 0.00 0.00 NA 0.00 0.00 0.27 0.00 0.00 0.00 NA 0.00 0.00 NA
[170] 0.00 0.00 NA 0.00 0.00 0.00 0.91 0.00 0.02 0.00 0.00 0.00 0.00
[183] 0.38 0.00 0.00 0.00 NA 0.00 0.40 NA 0.00 0.00 0.00 0.74 0.04
[196] 1.72 0.00 0.01 0.00 0.00 NA 0.20 1.43 NA 0.00 0.00 0.00 NA
[209] 0.09 0.00 NA NA 0.50 1.12 0.00 0.00 0.00 0.03 NA 0.00 NA
[222] 0.14 NA 0.00 NA NA 0.00 0.00 0.01 0.00 NA 0.06 0.00 0.00
[235] 0.00 0.02 0.00 NA 0.00 0.00 0.02 NA 0.15 NA 0.00 0.83 0.00
[248] 0.00 0.00 0.08 0.00 0.00 0.14 0.00 0.00 0.00 0.63 NA 0.02 NA
[261] 0.00 NA 0.00 0.00 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.00 0.00
[274] 0.00 0.00 0.17 0.66 0.01 0.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[287] NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.01 2.46 NA
[300] 0.00 0.00 0.00 0.20 0.00 NA 0.00 0.00 0.00 0.12 0.00 0.00 NA
[313] NA NA 0.00 0.08 NA 0.07 NA 0.00 0.00 0.03 0.00 0.00 0.36
[326] 0.73 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.34 NA 0.07 0.54
[339] 0.04 0.01 0.00 0.00 0.00 0.00 0.00 NA 0.00 0.86 0.00 0.30 0.04
[352] 0.00 0.00 0.00 0.00 0.21 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
[365] 0.00 0.14
What values in PrecipitationIn would become NA if coerced to numbers? “T” was used to denote a trace amount (i.e. too small to be accurately measured) of precipitation in the PrecipitationIn column. In order to coerce this column to numeric, you’ll need to deal with this somehow. To keep things simple, we will just replace “T” with zero, as a string (“0”).
# Replace "T" with "0" (T = trace)
weather5$PrecipitationIn <- str_replace(weather5$PrecipitationIn, "T", "0")
# Convert characters to numerics
weather6 <- mutate_at(weather5, vars(CloudCover:WindDirDegrees), funs(as.numeric))
funs() is soft deprecated as of dplyr 0.8.0
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
[90mThis warning is displayed once per session.[39m
# Look at result
str(weather6)
'data.frame': 366 obs. of 23 variables:
$ date : Date, format: "2014-12-01" ...
$ Events : chr "Rain" "Rain" "Rain-Snow" "Snow" ...
$ CloudCover : num 6 8 8 7 5 4 2 8 8 7 ...
$ Max.Dew.PointF : num 46 45 37 28 28 29 33 42 46 34 ...
$ Max.Gust.SpeedMPH : num 29 29 28 21 23 20 21 10 26 30 ...
$ Max.Humidity : num 74 100 92 85 75 82 89 96 100 89 ...
$ Max.Sea.Level.PressureIn : num 30.4 29.6 29.8 29.9 29.9 ...
$ Max.TemperatureF : num 64 48 39 39 42 45 42 44 49 44 ...
$ Max.VisibilityMiles : num 10 10 10 10 10 10 10 10 10 10 ...
$ Max.Wind.SpeedMPH : num 22 23 21 16 17 15 15 8 20 23 ...
$ Mean.Humidity : num 63 95 87 75 65 68 75 85 85 73 ...
$ Mean.Sea.Level.PressureIn: num 30.1 29.5 29.6 29.9 29.8 ...
$ Mean.TemperatureF : num 52 43 36 35 37 39 37 40 45 40 ...
$ Mean.VisibilityMiles : num 10 3 7 10 10 10 10 9 6 10 ...
$ Mean.Wind.SpeedMPH : num 13 13 13 11 12 10 6 4 11 14 ...
$ MeanDew.PointF : num 40 39 31 27 26 27 29 36 41 30 ...
$ Min.DewpointF : num 26 37 27 25 24 25 27 30 32 26 ...
$ Min.Humidity : num 52 89 82 64 55 53 60 73 70 57 ...
$ Min.Sea.Level.PressureIn : num 30 29.4 29.4 29.8 29.8 ...
$ Min.TemperatureF : num 39 38 32 31 32 33 32 35 41 36 ...
$ Min.VisibilityMiles : num 10 1 1 7 10 10 10 5 1 10 ...
$ PrecipitationIn : num 0.01 0.28 0.02 0 0 0 0 0 0.43 0.01 ...
$ WindDirDegrees : num 268 357 230 286 298 306 324 79 311 281 ...
Before dealing with missing values in the data, it’s important to find them and figure out why they exist in the first place. If your dataset is too big to look at all at once, like it is here, remember you can use sum() and is.na() to quickly size up the situation by counting the number of NA values.
The summary() function may also come in handy for identifying which variables contain the missing values. Finally, the which() function is useful for locating the missing values within a particular column.
# Count missing values
sum(is.na(weather6))
[1] 6
# Find missing values
summary(weather6)
date Events CloudCover
Min. :2014-12-01 Length:366 Min. :0.000
1st Qu.:2015-03-02 Class :character 1st Qu.:3.000
Median :2015-06-01 Mode :character Median :5.000
Mean :2015-06-01 Mean :4.708
3rd Qu.:2015-08-31 3rd Qu.:7.000
Max. :2015-12-01 Max. :8.000
Max.Dew.PointF Max.Gust.SpeedMPH Max.Humidity
Min. :-6.00 Min. : 0.00 Min. : 39.00
1st Qu.:32.00 1st Qu.:21.00 1st Qu.: 73.25
Median :47.50 Median :25.50 Median : 86.00
Mean :45.48 Mean :26.99 Mean : 85.69
3rd Qu.:61.00 3rd Qu.:31.25 3rd Qu.: 93.00
Max. :75.00 Max. :94.00 Max. :1000.00
NA's :6
Max.Sea.Level.PressureIn Max.TemperatureF Max.VisibilityMiles
Min. :29.58 Min. :18.00 Min. : 2.000
1st Qu.:30.00 1st Qu.:42.00 1st Qu.:10.000
Median :30.14 Median :60.00 Median :10.000
Mean :30.16 Mean :58.93 Mean : 9.907
3rd Qu.:30.31 3rd Qu.:76.00 3rd Qu.:10.000
Max. :30.88 Max. :96.00 Max. :10.000
Max.Wind.SpeedMPH Mean.Humidity Mean.Sea.Level.PressureIn
Min. : 8.00 Min. :28.00 Min. :29.49
1st Qu.:16.00 1st Qu.:56.00 1st Qu.:29.87
Median :20.00 Median :66.00 Median :30.03
Mean :20.62 Mean :66.02 Mean :30.04
3rd Qu.:24.00 3rd Qu.:76.75 3rd Qu.:30.19
Max. :38.00 Max. :98.00 Max. :30.77
Mean.TemperatureF Mean.VisibilityMiles Mean.Wind.SpeedMPH
Min. : 8.00 Min. :-1.000 Min. : 4.00
1st Qu.:36.25 1st Qu.: 8.000 1st Qu.: 8.00
Median :53.50 Median :10.000 Median :10.00
Mean :51.40 Mean : 8.861 Mean :10.68
3rd Qu.:68.00 3rd Qu.:10.000 3rd Qu.:13.00
Max. :84.00 Max. :10.000 Max. :22.00
MeanDew.PointF Min.DewpointF Min.Humidity
Min. :-11.00 Min. :-18.00 Min. :16.00
1st Qu.: 24.00 1st Qu.: 16.25 1st Qu.:35.00
Median : 41.00 Median : 35.00 Median :46.00
Mean : 38.96 Mean : 32.25 Mean :48.31
3rd Qu.: 56.00 3rd Qu.: 51.00 3rd Qu.:60.00
Max. : 71.00 Max. : 68.00 Max. :96.00
Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
Min. :29.16 Min. :-3.00 Min. : 0.000
1st Qu.:29.76 1st Qu.:30.00 1st Qu.: 2.000
Median :29.94 Median :46.00 Median :10.000
Mean :29.93 Mean :43.33 Mean : 6.716
3rd Qu.:30.09 3rd Qu.:60.00 3rd Qu.:10.000
Max. :30.64 Max. :74.00 Max. :10.000
PrecipitationIn WindDirDegrees
Min. :0.0000 Min. : 1.0
1st Qu.:0.0000 1st Qu.:113.0
Median :0.0000 Median :222.0
Mean :0.1016 Mean :200.1
3rd Qu.:0.0400 3rd Qu.:275.0
Max. :2.9000 Max. :360.0
# Find indices of NAs in Max.Gust.SpeedMPH
ind <- which(is.na(weather6$Max.Gust.SpeedMPH))
# Look at the full rows for records missing Max.Gust.SpeedMPH
weather6[ind, ]
In this situation it’s unclear why these values are missing and there doesn’t appear to be any obvious pattern to their missingness, so we’ll leave them alone for now.
Besides missing values, we want to know if there are values in the data that are too extreme or bizarre to be plausible. A great way to start the search for these values is with summary().
Once implausible values are identified, they must be dealt with in an intelligent and informed way. Sometimes the best way forward is obvious and other times it may require some research and/or discussions with the original collectors of the data.
# Review distributions for all variables
summary(weather6)
date Events CloudCover
Min. :2014-12-01 Length:366 Min. :0.000
1st Qu.:2015-03-02 Class :character 1st Qu.:3.000
Median :2015-06-01 Mode :character Median :5.000
Mean :2015-06-01 Mean :4.708
3rd Qu.:2015-08-31 3rd Qu.:7.000
Max. :2015-12-01 Max. :8.000
Max.Dew.PointF Max.Gust.SpeedMPH Max.Humidity
Min. :-6.00 Min. : 0.00 Min. : 39.00
1st Qu.:32.00 1st Qu.:21.00 1st Qu.: 73.25
Median :47.50 Median :25.50 Median : 86.00
Mean :45.48 Mean :26.99 Mean : 85.69
3rd Qu.:61.00 3rd Qu.:31.25 3rd Qu.: 93.00
Max. :75.00 Max. :94.00 Max. :1000.00
NA's :6
Max.Sea.Level.PressureIn Max.TemperatureF Max.VisibilityMiles
Min. :29.58 Min. :18.00 Min. : 2.000
1st Qu.:30.00 1st Qu.:42.00 1st Qu.:10.000
Median :30.14 Median :60.00 Median :10.000
Mean :30.16 Mean :58.93 Mean : 9.907
3rd Qu.:30.31 3rd Qu.:76.00 3rd Qu.:10.000
Max. :30.88 Max. :96.00 Max. :10.000
Max.Wind.SpeedMPH Mean.Humidity Mean.Sea.Level.PressureIn
Min. : 8.00 Min. :28.00 Min. :29.49
1st Qu.:16.00 1st Qu.:56.00 1st Qu.:29.87
Median :20.00 Median :66.00 Median :30.03
Mean :20.62 Mean :66.02 Mean :30.04
3rd Qu.:24.00 3rd Qu.:76.75 3rd Qu.:30.19
Max. :38.00 Max. :98.00 Max. :30.77
Mean.TemperatureF Mean.VisibilityMiles Mean.Wind.SpeedMPH
Min. : 8.00 Min. :-1.000 Min. : 4.00
1st Qu.:36.25 1st Qu.: 8.000 1st Qu.: 8.00
Median :53.50 Median :10.000 Median :10.00
Mean :51.40 Mean : 8.861 Mean :10.68
3rd Qu.:68.00 3rd Qu.:10.000 3rd Qu.:13.00
Max. :84.00 Max. :10.000 Max. :22.00
MeanDew.PointF Min.DewpointF Min.Humidity
Min. :-11.00 Min. :-18.00 Min. :16.00
1st Qu.: 24.00 1st Qu.: 16.25 1st Qu.:35.00
Median : 41.00 Median : 35.00 Median :46.00
Mean : 38.96 Mean : 32.25 Mean :48.31
3rd Qu.: 56.00 3rd Qu.: 51.00 3rd Qu.:60.00
Max. : 71.00 Max. : 68.00 Max. :96.00
Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
Min. :29.16 Min. :-3.00 Min. : 0.000
1st Qu.:29.76 1st Qu.:30.00 1st Qu.: 2.000
Median :29.94 Median :46.00 Median :10.000
Mean :29.93 Mean :43.33 Mean : 6.716
3rd Qu.:30.09 3rd Qu.:60.00 3rd Qu.:10.000
Max. :30.64 Max. :74.00 Max. :10.000
PrecipitationIn WindDirDegrees
Min. :0.0000 Min. : 1.0
1st Qu.:0.0000 1st Qu.:113.0
Median :0.0000 Median :222.0
Mean :0.1016 Mean :200.1
3rd Qu.:0.0400 3rd Qu.:275.0
Max. :2.9000 Max. :360.0
# Find row with Max.Humidity of 1000
ind <- which(weather6$Max.Humidity == 1000)
# Look at the data for that day
weather6[ind, ]
# Change 1000 to 100
weather6$Max.Humidity[ind] <- 100
You’ve discovered and repaired one obvious error in the data, but it appears that there’s another. Sometimes you get lucky and can infer the correct or intended value from the other data. For example, if you know the minimum and maximum values of a particular metric on a given day…
# Look at summary of Mean.VisibilityMiles
summary(weather6$Mean.VisibilityMiles)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.000 8.000 10.000 8.861 10.000 10.000
# Get index of row with -1 value
ind <- which(weather6$Mean.VisibilityMiles == -1)
# Look at full row
weather6[ind, ]
# Set Mean.VisibilityMiles to the appropriate value
weather6$Mean.VisibilityMiles[ind] <- 10
In addition to dealing with obvious errors in the data, we want to see if there are other extreme values. In addition to the trusty summary() function, hist() is useful for quickly getting a feel for how different variables are distributed.
# Review summary of full data once more
summary(weather6)
date Events CloudCover
Min. :2014-12-01 Length:366 Min. :0.000
1st Qu.:2015-03-02 Class :character 1st Qu.:3.000
Median :2015-06-01 Mode :character Median :5.000
Mean :2015-06-01 Mean :4.708
3rd Qu.:2015-08-31 3rd Qu.:7.000
Max. :2015-12-01 Max. :8.000
Max.Dew.PointF Max.Gust.SpeedMPH Max.Humidity
Min. :-6.00 Min. : 0.00 Min. : 39.00
1st Qu.:32.00 1st Qu.:21.00 1st Qu.: 73.25
Median :47.50 Median :25.50 Median : 86.00
Mean :45.48 Mean :26.99 Mean : 83.23
3rd Qu.:61.00 3rd Qu.:31.25 3rd Qu.: 93.00
Max. :75.00 Max. :94.00 Max. :100.00
NA's :6
Max.Sea.Level.PressureIn Max.TemperatureF Max.VisibilityMiles
Min. :29.58 Min. :18.00 Min. : 2.000
1st Qu.:30.00 1st Qu.:42.00 1st Qu.:10.000
Median :30.14 Median :60.00 Median :10.000
Mean :30.16 Mean :58.93 Mean : 9.907
3rd Qu.:30.31 3rd Qu.:76.00 3rd Qu.:10.000
Max. :30.88 Max. :96.00 Max. :10.000
Max.Wind.SpeedMPH Mean.Humidity Mean.Sea.Level.PressureIn
Min. : 8.00 Min. :28.00 Min. :29.49
1st Qu.:16.00 1st Qu.:56.00 1st Qu.:29.87
Median :20.00 Median :66.00 Median :30.03
Mean :20.62 Mean :66.02 Mean :30.04
3rd Qu.:24.00 3rd Qu.:76.75 3rd Qu.:30.19
Max. :38.00 Max. :98.00 Max. :30.77
Mean.TemperatureF Mean.VisibilityMiles Mean.Wind.SpeedMPH
Min. : 8.00 Min. : 1.000 Min. : 4.00
1st Qu.:36.25 1st Qu.: 8.000 1st Qu.: 8.00
Median :53.50 Median :10.000 Median :10.00
Mean :51.40 Mean : 8.891 Mean :10.68
3rd Qu.:68.00 3rd Qu.:10.000 3rd Qu.:13.00
Max. :84.00 Max. :10.000 Max. :22.00
MeanDew.PointF Min.DewpointF Min.Humidity
Min. :-11.00 Min. :-18.00 Min. :16.00
1st Qu.: 24.00 1st Qu.: 16.25 1st Qu.:35.00
Median : 41.00 Median : 35.00 Median :46.00
Mean : 38.96 Mean : 32.25 Mean :48.31
3rd Qu.: 56.00 3rd Qu.: 51.00 3rd Qu.:60.00
Max. : 71.00 Max. : 68.00 Max. :96.00
Min.Sea.Level.PressureIn Min.TemperatureF Min.VisibilityMiles
Min. :29.16 Min. :-3.00 Min. : 0.000
1st Qu.:29.76 1st Qu.:30.00 1st Qu.: 2.000
Median :29.94 Median :46.00 Median :10.000
Mean :29.93 Mean :43.33 Mean : 6.716
3rd Qu.:30.09 3rd Qu.:60.00 3rd Qu.:10.000
Max. :30.64 Max. :74.00 Max. :10.000
PrecipitationIn WindDirDegrees
Min. :0.0000 Min. : 1.0
1st Qu.:0.0000 1st Qu.:113.0
Median :0.0000 Median :222.0
Mean :0.1016 Mean :200.1
3rd Qu.:0.0400 3rd Qu.:275.0
Max. :2.9000 Max. :360.0
# Look at histogram for MeanDew.PointF
hist(weather6$MeanDew.PointF)
# Look at histogram for Min.TemperatureF
hist(weather6$Min.TemperatureF)
# Compare to histogram for Mean.TemperatureF
hist(weather6$Mean.TemperatureF)
Before officially calling our weather data clean, we want to put a couple of finishing touches on the data. These are a bit more subjective and may not be necessary for analysis, but they will make the data easier for others to interpret, which is generally a good thing.
There are a number of stylistic conventions in the R language. Depending on who you ask, these conventions may vary. Because the period (.) has special meaning in certain situations, we generally recommend using underscores (_) to separate words in variable names. We also prefer all lowercase letters so that no one has to remember which letters are uppercase or lowercase.
Finally, the events column (renamed to be all lowercase in the first instruction) contains an empty string ("") for any day on which there was no significant weather event such as rain, fog, a thunderstorm, etc. However, if it’s the first time you’re seeing these data, it may not be obvious that this is the case, so it’s best for us to be explicit and replace the empty strings with something more meaningful.
new_colnames <- c("date", "events", "cloud_cover", "max_dew_point_f",
"max_gust_speed_mph", "max_humidity", "max_sea_level_pressure_in",
"max_temperature_f", "max_visibility_miles", "max_wind_speed_mph",
"mean_humidity", "mean_sea_level_pressure", "mean_temperature_f",
"mean_visibility_miles", "mean_wind_speed_mph", "mean_dew_point_f",
"min_dew_point_f", "min_humidity", "min_sea_level_pressure_in",
"min_temperature_f", "min_visibility_miles", "precipitation_in",
"wind_dir_degrees")
# Clean up column names
names(weather6) <- new_colnames
# Replace empty cells in events column
weather6$events[weather6$events == ""] <- "None"
# Print the first 6 rows of weather6
head(weather6)