#reading in raw data1
df1 <- read.csv('Data1.csv')
head(df1)
## DateTime Sensor.1 Sensor2 Sensor.3
## 1 9/29/2015 23:00 16.46646 NA NA
## 2 9/29/2015 23:01 16.52575 NA NA
## 3 9/29/2015 23:02 16.55345 NA NA
## 4 9/29/2015 23:03 16.56724 16.56724 NA
## 5 9/29/2015 23:04 16.56958 NA NA
## 6 9/29/2015 23:05 16.56660 NA NA
#checking original data sets
glimpse(df1)
## Rows: 1,440
## Columns: 4
## $ DateTime <chr> "9/29/2015 23:00", "9/29/2015 23:01", "9/29/2015 23:02", "9/2~
## $ Sensor.1 <dbl> 16.46646, 16.52575, 16.55345, 16.56724, 16.56958, 16.56660, 1~
## $ Sensor2 <dbl> NA, NA, NA, 16.56724, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ Sensor.3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 16.48639, 16.4556~
skim(df1)
Name | df1 |
Number of rows | 1440 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
DateTime | 0 | 1 | 0 | 15 | 1340 | 99 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sensor.1 | 1361 | 0.05 | 16.53 | 0.10 | 16.30 | 16.48 | 16.57 | 16.62 | 16.64 | ▂▂▂▇▆ |
Sensor2 | 1439 | 0.00 | 16.57 | NA | 16.57 | 16.57 | 16.57 | 16.57 | 16.57 | ▁▁▇▁▁ |
Sensor.3 | 1355 | 0.06 | 16.41 | 0.14 | 16.08 | 16.31 | 16.43 | 16.52 | 16.59 | ▂▁▂▇▅ |
nrow(df1)
## [1] 1440
head(df1$DateTime)
## [1] "9/29/2015 23:00" "9/29/2015 23:01" "9/29/2015 23:02" "9/29/2015 23:03"
## [5] "9/29/2015 23:04" "9/29/2015 23:05"
There were plenty of rows filled with ” “. Therefore, removed”empty” rows using subsetting
Data set is smaller (only 100 rows of relevant data points) now b/c empty redundant rows were removed
Because Sensor2 has only 1 data point, I am electing to toss out this data point from further analysis
df1_short <- df1[c(1:100), ]
skim(df1_short)
Name | df1_short |
Number of rows | 100 |
Number of columns | 4 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
DateTime | 0 | 1 | 14 | 15 | 0 | 98 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sensor.1 | 21 | 0.79 | 16.53 | 0.10 | 16.30 | 16.48 | 16.57 | 16.62 | 16.64 | ▂▂▂▇▆ |
Sensor2 | 99 | 0.01 | 16.57 | NA | 16.57 | 16.57 | 16.57 | 16.57 | 16.57 | ▁▁▇▁▁ |
Sensor.3 | 15 | 0.85 | 16.41 | 0.14 | 16.08 | 16.31 | 16.43 | 16.52 | 16.59 | ▂▁▂▇▅ |
df1_short1 <- df1_short %>% mutate(Time=mdy_hm(DateTime))
df1_short1 <- df1_short1 %>% dplyr::select(Time, Sensor.1,Sensor.3)
#add index column to data frame in order to melt the dataframe later
df1_short1$index <- 1:nrow(df1_short1)
df1_short1 <- df1_short1 %>% dplyr::select(index,Time, Sensor.1,Sensor.3)
head(df1_short1)
## index Time Sensor.1 Sensor.3
## 1 1 2015-09-29 23:00:00 16.46646 NA
## 2 2 2015-09-29 23:01:00 16.52575 NA
## 3 3 2015-09-29 23:02:00 16.55345 NA
## 4 4 2015-09-29 23:03:00 16.56724 NA
## 5 5 2015-09-29 23:04:00 16.56958 NA
## 6 6 2015-09-29 23:05:00 16.56660 NA
class(df1_short1$Time)
## [1] "POSIXct" "POSIXt"
note: See below for visuals of proportions of missing data w.r.t each sensor
# Make dataset from wide to long format)
mdata1 <- melt(df1_short1, id=c("index","Time")) %>% dplyr::rename(sensor=variable)
madata2 <- mdata1%>% dplyr::select(Time, sensor, value)
head(madata2)
## Time sensor value
## 1 2015-09-29 23:00:00 Sensor.1 16.46646
## 2 2015-09-29 23:01:00 Sensor.1 16.52575
## 3 2015-09-29 23:02:00 Sensor.1 16.55345
## 4 2015-09-29 23:03:00 Sensor.1 16.56724
## 5 2015-09-29 23:04:00 Sensor.1 16.56958
## 6 2015-09-29 23:05:00 Sensor.1 16.56660
df1_shortTemp <- df1_short %>% mutate(Time=mdy_hm(DateTime))
res<-summary(aggr(df1_shortTemp[,2:4], sortVar=TRUE))$combinations
##
## Variables sorted by number of missings:
## Variable Count
## Sensor2 0.99
## Sensor.1 0.21
## Sensor.3 0.15
res
## Combinations Count Percent
## 1 0:0:1 1 1
## 2 0:1:0 65 65
## 3 0:1:1 13 13
## 4 1:1:0 20 20
## 5 1:1:1 1 1
madata_0 <- madata2
madata_0[is.na(madata_0)] <- max(madata_0$value,na.rm = TRUE)
tail(madata_0)
## Time sensor value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.63897
## 198 2015-09-30 00:37:00 Sensor.3 16.63897
## 199 2015-09-30 00:38:00 Sensor.3 16.63897
## 200 2015-09-30 00:39:00 Sensor.3 16.63897
sum(is.na(madata_0$value))
## [1] 0
madata3 <- madata2
madata3$value[which(is.na(madata3$value))] <- mean(madata3$value,na.rm = TRUE)
tail(madata3)
## Time sensor value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.46872
## 198 2015-09-30 00:37:00 Sensor.3 16.46872
## 199 2015-09-30 00:38:00 Sensor.3 16.46872
## 200 2015-09-30 00:39:00 Sensor.3 16.46872
madata4 <- madata2
madata4<-na.locf(na.locf(madata4),fromLast=TRUE)
tail(madata4)
## Time sensor value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.16810
## 198 2015-09-30 00:37:00 Sensor.3 16.16810
## 199 2015-09-30 00:38:00 Sensor.3 16.16810
## 200 2015-09-30 00:39:00 Sensor.3 16.16810
# Create the histogram.
hist(df1_short1$Sensor.1,main= "Sensor 1",xlab = "",col = "yellow",border = "blue")
hist(df1_short1$Sensor.3,main ="Sensor 3",xlab = "",col = "yellow",border = "blue")
From the skim function and histograms above, sensors 1 and 3 have long left tails
Because sensor 3 has slightly higher variance, it might increased accuracy a bit if we were to impute by median rather than the mean. The reason is averages are more susceptible to outliers and since sensor 3 exhibited a higher standard deviation, using median might prevent skewing
Unlike the mean, the median value doesn’t depend on all the values in the data set. Consequently, when some of the values are more extreme, the effect on the median is smaller. When you have a skewed distribution, the median is a better measure of central tendency than the mean
madata5 <- madata2
madata5$value[which(is.na(madata5$value))] <- median(madata5$value,na.rm = TRUE)
tail(madata5)
## Time sensor value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.48767
## 198 2015-09-30 00:37:00 Sensor.3 16.48767
## 199 2015-09-30 00:38:00 Sensor.3 16.48767
## 200 2015-09-30 00:39:00 Sensor.3 16.48767
series <- madata_0 %>% dplyr::select(Time,value)
#using 70/30 split of method 1 dataset
train_0 =head(series,nrow(series)*0.7)
test_0= tail(series,60 )
#create time series
#df_ts <- as.xts(x = series[, -1], order.by = series$Time)
df_ts <- ts(train_0[, 2])
#display time series plot
plot(df_ts, main="Method 1: Imputing using Max",col="darkblue", ylab="values")
#Checking for Autocorrelation of the train data
#plot autocorrelation function
acf(df_ts)
arima_model <- auto.arima(df_ts, stepwise=F, approximation = FALSE)
summary(arima_model)
## Series: df_ts
## ARIMA(0,1,0)
##
## sigma^2 estimated as 0.001334: log likelihood=262.91
## AIC=-523.83 AICc=-523.8 BIC=-520.89
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002539104 0.03639771 0.01342297 0.001299605 0.08121197 1.001634
## ACF1
## Training set 0.02514878
fcast_0 <- forecast(arima_model,h=60)# to match test data sets of 60 data
plot(fcast_0)
#Accuracy Checks & Residuals Checks
df_ts_test <- ts(test_0[, 2])
df_arima_0<-as.data.frame(fcast_0)
fcast_0_ts<- ts(df_arima_0[,1])
accuracy(fcast_0_ts,df_ts_test)
## ME RMSE MAE MPE MAPE ACF1
## Test set -0.08440702 0.1924907 0.1498097 -0.5258868 0.9203862 0.9118253
## Theil's U
## Test set 2.963144
tsdisplay(residuals(arima_model))
series_3 <- madata3 %>% dplyr::select(Time,value)
#using 70/30 split of method 1 dataset
train_3 =head(series_3,nrow(series)*0.7)
test_3= tail(series_3,60 )
#create time series
df_ts3 <- ts(train_3[, 2])
#display time series plot
plot(df_ts3, main="Method 2: Imputing using Means",col="darkgreen", ylab="values")
#Plot ACF function
acf(df_ts3)
arima_model3 <- auto.arima(df_ts3, stepwise=F, approximation = FALSE)
summary(arima_model3)
## Series: df_ts3
## ARIMA(2,1,3)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3
## -1.2578 -0.6787 1.0216 0.3901 -0.3351
## s.e. 0.2063 0.1723 0.1968 0.1708 0.0830
##
## sigma^2 estimated as 0.001253: log likelihood=269.38
## AIC=-526.77 AICc=-526.13 BIC=-509.16
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.00026761 0.03462545 0.02029339 0.001339458 0.1229694 1.288332
## ACF1
## Training set 0.004704262
fcast_3 <- forecast(arima_model3,h=60)# to match test data sets of 60 data
plot(fcast_3)
df_ts_test3 <- ts(test_3[, 2])
df_arima_3<-as.data.frame(fcast_3)
fcast_3_ts<- ts(df_arima_3[,1])
accuracy(fcast_3_ts,df_ts_test3)
## ME RMSE MAE MPE MAPE ACF1
## Test set -0.09563678 0.188365 0.1407021 -0.5934364 0.8657147 0.9565365
## Theil's U
## Test set 4.190648
tsdisplay(residuals(arima_model3))
series_4 <- madata4 %>% dplyr::select(Time,value)
#using 70/30 split of method 1 dataset
train_4 =head(series_4,nrow(series)*0.7)
test_4= tail(series_4,60 )
#create time series
df_ts4 <- ts(train_4[, 2])
#display time series plot
plot(df_ts4, main="Method 3: Imputing using Previous Values",col="red", ylab="values")
#Plot ACF function
acf(df_ts4)
arima_model4 <- auto.arima(df_ts4, stepwise=F, approximation = FALSE)
summary(arima_model4)
## Series: df_ts4
## ARIMA(1,1,0)
##
## Coefficients:
## ar1
## 0.1348
## s.e. 0.0856
##
## sigma^2 estimated as 0.0005937: log likelihood=319.81
## AIC=-635.62 AICc=-635.53 BIC=-629.75
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002298213 0.02419203 0.009898567 0.00130477 0.06001109 1.00306
## ACF1
## Training set 0.002259289
fcast_4 <- forecast(arima_model4,h=60)# to match test data sets of 60 data
plot(fcast_4)
df_ts_test4 <- ts(test_4[, 2])
df_arima_4<-as.data.frame(fcast_4)
fcast_4_ts<- ts(df_arima_4[,1])
accuracy(fcast_4_ts,df_ts_test4)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -0.1155016 0.205263 0.1606845 -0.7164398 0.9894267 0.9753037 9.073716
tsdisplay(residuals(arima_model4))
series_5 <- madata5 %>% dplyr::select(Time,value)
#using 70/30 split of method 1 dataset
train_5 =head(series_5,nrow(series)*0.7)
test_5= tail(series_5,60 )
#create time series
df_ts5 <- ts(train_5[, 2])
#display time series plot
plot(df_ts5, main="Method 4: Imputing using Median",col="purple", ylab="values")
#Plot ACF function
acf(df_ts5)
arima_model5 <- auto.arima(df_ts5, stepwise=F, approximation = FALSE)
summary(arima_model5)
## Series: df_ts5
## ARIMA(3,1,2)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## -1.4679 -1.0853 -0.2937 1.2690 0.8401
## s.e. 0.1498 0.1541 0.0931 0.1399 0.1037
##
## sigma^2 estimated as 0.001103: log likelihood=278.4
## AIC=-544.81 AICc=-544.17 BIC=-527.2
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002796718 0.03249271 0.01885023 0.001466746 0.1142239 1.260717
## ACF1
## Training set 0.005857079
fcast_5 <- forecast(arima_model5,h=60)# to match test data sets of 60 data
plot(fcast_5)
df_ts_test5 <- ts(test_5[, 2])
df_arima_5<-as.data.frame(fcast_5)
fcast_5_ts<- ts(df_arima_5[,1])
accuracy(fcast_5_ts,df_ts_test5)
## ME RMSE MAE MPE MAPE ACF1
## Test set -0.09567507 0.1889631 0.1398158 -0.5937077 0.8603979 0.9524104
## Theil's U
## Test set 4.013735
tsdisplay(residuals(arima_model5))
At first glance, raw data appeared to have 1440 rows but after further inspection, a majority of those rows were ‘empty’; filled with “”
Post wrangling, data had only 100 rows per sensor readings
Each data sets was then split into 70/30 ratios of training and testing
Sensor2 had only 1 datapoint and as a result, its datapoint of 16.567243 was excluded from further analysis. Sensor2 could have malfunctioned or covered with debris
DataTime col was converted to datetime stamp and wide data was converted to long formatted data.
4 imputation methods were implemented: Max, Means, Previous values and Median
Each method was first plotted to get “feel” of any underlying patterns. However, these plots of the training datasets did produced any descernible patterns such as trends or seasonality etc..
Because of that, ACF plots of each training dataset were then plotted to ascertain which model to implement. From the ACF plots, it was clear that the underlying data has strong autocorrelation at several lags levels
Using RSME as a prediction metrics, the following was found: Method 1: Max, RSME=0.1924907 Method 2: Mean, RSME= 0.188365 Method 3: Previous Value, RSME= 0.205263 Method 4: Median, RSME= 0.1889631
Based on RSME, both the Mean and Median were the most accurate, while using the Max came in 3rd and imputing using Previous Values came in last