Test Question 1:
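
The code in this section assumes several packages are loaded elsewhere (the original does not show the library calls); a likely set, judging from the functions used below, is:

library(dplyr)      #glimpse(), select(), mutate(), rename()
library(lubridate)  #mdy_hm()
library(skimr)      #skim()
library(reshape2)   #melt()
library(VIM)        #aggr()
library(zoo)        #na.locf()
library(forecast)   #auto.arima(), forecast(), accuracy(), tsdisplay()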

#reading in raw data1
df1 <- read.csv('Data1.csv')
head(df1)
##          DateTime Sensor.1  Sensor2 Sensor.3
## 1 9/29/2015 23:00 16.46646       NA       NA
## 2 9/29/2015 23:01 16.52575       NA       NA
## 3 9/29/2015 23:02 16.55345       NA       NA
## 4 9/29/2015 23:03 16.56724 16.56724       NA
## 5 9/29/2015 23:04 16.56958       NA       NA
## 6 9/29/2015 23:05 16.56660       NA       NA

#checking original data sets

glimpse(df1)
## Rows: 1,440
## Columns: 4
## $ DateTime <chr> "9/29/2015 23:00", "9/29/2015 23:01", "9/29/2015 23:02", "9/2~
## $ Sensor.1 <dbl> 16.46646, 16.52575, 16.55345, 16.56724, 16.56958, 16.56660, 1~
## $ Sensor2  <dbl> NA, NA, NA, 16.56724, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ Sensor.3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 16.48639, 16.4556~
skim(df1)
Data summary
Name df1
Number of rows 1440
Number of columns 4
_______________________
Column type frequency:
character 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable   n_missing   complete_rate   min   max   empty   n_unique   whitespace
DateTime                0               1     0    15    1340         99            0

Variable type: numeric

skim_variable   n_missing   complete_rate    mean     sd      p0     p25     p50     p75    p100   hist
Sensor.1             1361            0.05   16.53   0.10   16.30   16.48   16.57   16.62   16.64   ▂▂▂▇▆
Sensor2              1439            0.00   16.57     NA   16.57   16.57   16.57   16.57   16.57   ▁▁▇▁▁
Sensor.3             1355            0.06   16.41   0.14   16.08   16.31   16.43   16.52   16.59   ▂▁▂▇▅

Checking the number of rows and the data type of the DateTime column

nrow(df1)
## [1] 1440
head(df1$DateTime)   
## [1] "9/29/2015 23:00" "9/29/2015 23:01" "9/29/2015 23:02" "9/29/2015 23:03"
## [5] "9/29/2015 23:04" "9/29/2015 23:05"

Observation 1

  • Many rows were filled with "" (empty strings), so the "empty" rows were removed via subsetting

  • The data set is now much smaller (only 100 rows of relevant data points) because the empty, redundant rows were removed

  • Because Sensor2 has only one data point, I am excluding it from further analysis

#keep only the first 100 rows, which contain the non-empty records
df1_short <- df1[1:100, ]
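
The hard-coded 1:100 range works because all of the non-empty records happen to sit in the first 100 rows. A less position-dependent alternative (a sketch only; it assumes the empty rows carry a blank DateTime string, which the skim output above suggests) would be to filter on that column directly:

#keep only rows whose DateTime string is non-empty
df1_short_alt <- df1 %>% dplyr::filter(DateTime != "")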

skim(df1_short)
Data summary
Name df1_short
Number of rows 100
Number of columns 4
_______________________
Column type frequency:
character 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable   n_missing   complete_rate   min   max   empty   n_unique   whitespace
DateTime                0               1    14    15       0         98            0

Variable type: numeric

skim_variable   n_missing   complete_rate    mean     sd      p0     p25     p50     p75    p100   hist
Sensor.1               21            0.79   16.53   0.10   16.30   16.48   16.57   16.62   16.64   ▂▂▂▇▆
Sensor2                99            0.01   16.57     NA   16.57   16.57   16.57   16.57   16.57   ▁▁▇▁▁
Sensor.3               15            0.85   16.41   0.14   16.08   16.31   16.43   16.52   16.59   ▂▁▂▇▅

Converting the DateTime column to a date-time (POSIXct) object

df1_short1 <- df1_short %>%  mutate(Time=mdy_hm(DateTime))

df1_short1 <- df1_short1 %>%  dplyr::select(Time, Sensor.1,Sensor.3)

#add index column to data frame in order to melt the dataframe later
df1_short1$index <- 1:nrow(df1_short1)
df1_short1 <- df1_short1 %>%  dplyr::select(index,Time, Sensor.1,Sensor.3)
head(df1_short1)
##   index                Time Sensor.1 Sensor.3
## 1     1 2015-09-29 23:00:00 16.46646       NA
## 2     2 2015-09-29 23:01:00 16.52575       NA
## 3     3 2015-09-29 23:02:00 16.55345       NA
## 4     4 2015-09-29 23:03:00 16.56724       NA
## 5     5 2015-09-29 23:04:00 16.56958       NA
## 6     6 2015-09-29 23:05:00 16.56660       NA

Verify that the Time column was indeed converted

class(df1_short1$Time)
## [1] "POSIXct" "POSIXt"

The reason I am reshaping the data into "long" format is that the remaining sensors share similar statistical attributes:

  1. The means of the two remaining sensors are close (16.53 for Sensor 1 and 16.41 for Sensor 3)
  2. Sensor2 has only one data point, with the rest being NAs; see the visuals below (maybe a faulty sensor, or one covered by external materials?)
  3. The quantiles of the two sensors are very similar. Lastly, Sensor 1 has a slightly lower standard deviation than Sensor 3

Note: see below for visuals of the proportion of missing data for each sensor

# Reshape the data set from wide to long format
mdata1 <- melt(df1_short1, id = c("index", "Time")) %>% dplyr::rename(sensor = variable)
madata2 <- mdata1 %>% dplyr::select(Time, sensor, value)
head(madata2)
##                  Time   sensor    value
## 1 2015-09-29 23:00:00 Sensor.1 16.46646
## 2 2015-09-29 23:01:00 Sensor.1 16.52575
## 3 2015-09-29 23:02:00 Sensor.1 16.55345
## 4 2015-09-29 23:03:00 Sensor.1 16.56724
## 5 2015-09-29 23:04:00 Sensor.1 16.56958
## 6 2015-09-29 23:05:00 Sensor.1 16.56660
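
For reference, the same wide-to-long reshape can be done with tidyr instead of reshape2 (a sketch, assuming tidyr is available; the result matches madata2 up to column types and row ordering):

#equivalent reshape using tidyr::pivot_longer
madata2_alt <- df1_short1 %>%
  tidyr::pivot_longer(cols = c(Sensor.1, Sensor.3),
                      names_to = "sensor", values_to = "value") %>%
  dplyr::select(Time, sensor, value)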

The visuals confirmed what we already knew from above:

df1_shortTemp <- df1_short %>% mutate(Time = mdy_hm(DateTime))
#VIM::aggr visualizes the missingness pattern; summary()$combinations tabulates the NA combinations
res <- summary(aggr(df1_shortTemp[, 2:4], sortVar = TRUE))$combinations

## 
##  Variables sorted by number of missings: 
##  Variable Count
##   Sensor2  0.99
##  Sensor.1  0.21
##  Sensor.3  0.15
res
##   Combinations Count Percent
## 1        0:0:1     1       1
## 2        0:1:0    65      65
## 3        0:1:1    13      13
## 4        1:1:0    20      20
## 5        1:1:1     1       1
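
The same missingness proportions can also be confirmed numerically (a quick sketch, not part of the original output):

#proportion of missing values in each sensor column
colMeans(is.na(df1_short[, c("Sensor.1", "Sensor2", "Sensor.3")]))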

Method 1: Fill in missing data points with Max

madata_0 <- madata2
#replace every NA with the maximum observed value
madata_0$value[is.na(madata_0$value)] <- max(madata_0$value, na.rm = TRUE)
tail(madata_0)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.63897
## 198 2015-09-30 00:37:00 Sensor.3 16.63897
## 199 2015-09-30 00:38:00 Sensor.3 16.63897
## 200 2015-09-30 00:39:00 Sensor.3 16.63897

Check whether any NAs remain

sum(is.na(madata_0$value))
## [1] 0

Method 2: Fill in missing data points with Means

madata3 <- madata2

#replace NAs with the overall (pooled) mean of the observed values
madata3$value[is.na(madata3$value)] <- mean(madata3$value, na.rm = TRUE)
tail(madata3)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.46872
## 198 2015-09-30 00:37:00 Sensor.3 16.46872
## 199 2015-09-30 00:38:00 Sensor.3 16.46872
## 200 2015-09-30 00:39:00 Sensor.3 16.46872
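
Note that the replacement above uses a single mean pooled across both sensors. A per-sensor variant (shown only as a sketch; the analysis below sticks with the pooled mean) would impute each sensor with its own mean:

#per-sensor mean imputation (sketch; not used in the analysis below)
madata3_by_sensor <- madata2 %>%
  dplyr::group_by(sensor) %>%
  dplyr::mutate(value = ifelse(is.na(value), mean(value, na.rm = TRUE), value)) %>%
  dplyr::ungroup()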

Method 3: Fill in missing data points with Previous Values

madata4 <- madata2

#carry the last observation forward; the outer pass with fromLast = TRUE back-fills any leading NAs
madata4 <- na.locf(na.locf(madata4), fromLast = TRUE)
tail(madata4)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.16810
## 198 2015-09-30 00:37:00 Sensor.3 16.16810
## 199 2015-09-30 00:38:00 Sensor.3 16.16810
## 200 2015-09-30 00:39:00 Sensor.3 16.16810

Checking for spread and shape of sensors 1 & 3

# Create the histograms
hist(df1_short1$Sensor.1, main = "Sensor 1", xlab = "", col = "yellow", border = "blue")

hist(df1_short1$Sensor.3, main = "Sensor 3", xlab = "", col = "yellow", border = "blue")

Method 4: Fill in missing data points with Median

Observation 2

  • From the skim output and the histograms above, Sensors 1 and 3 have long left tails

  • Because Sensor 3 has a slightly higher variance, imputing with the median rather than the mean might improve accuracy a bit. Averages are more susceptible to outliers, and since Sensor 3 exhibited a higher standard deviation, using the median may prevent skewing

  • Unlike the mean, the median does not depend on every value in the data set, so extreme values affect it less. For a skewed distribution, the median is a better measure of central tendency than the mean (see the per-sensor summary sketch after this list)
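
A quick per-sensor comparison of mean, median, and standard deviation on the long-format data (a sketch, not part of the original output):

#compare mean, median, and SD for each sensor
madata2 %>%
  dplyr::group_by(sensor) %>%
  dplyr::summarise(mean = mean(value, na.rm = TRUE),
                   median = median(value, na.rm = TRUE),
                   sd = sd(value, na.rm = TRUE))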

madata5 <- madata2
#replace NAs with the overall (pooled) median of the observed values
madata5$value[is.na(madata5$value)] <- median(madata5$value, na.rm = TRUE)
tail(madata5)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.48767
## 198 2015-09-30 00:37:00 Sensor.3 16.48767
## 199 2015-09-30 00:38:00 Sensor.3 16.48767
## 200 2015-09-30 00:39:00 Sensor.3 16.48767

Method 1: Imputing using Max

series <- madata_0 %>% dplyr::select(Time, value)
#70/30 split of the Method 1 (Max-imputed) data set: 140 training rows, 60 test rows
train_0 <- head(series, nrow(series) * 0.7)
test_0 <- tail(series, 60)
#create the time series
#df_ts <- as.xts(x = series[, -1], order.by = series$Time)
df_ts <- ts(train_0[, 2])

#display time series plot
plot(df_ts, main="Method 1: Imputing using Max",col="darkblue", ylab="values")

#Checking for Autocorrelation of the train data

#plot autocorrelation function
acf(df_ts)
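
A PACF plot (not in the original) can complement the ACF when judging candidate AR orders:

#partial autocorrelation of the training series
pacf(df_ts)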

ARIMA Model

arima_model <- auto.arima(df_ts, stepwise=F, approximation = FALSE)
summary(arima_model)
## Series: df_ts 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 0.001334:  log likelihood=262.91
## AIC=-523.83   AICc=-523.8   BIC=-520.89
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE       MAPE     MASE
## Training set 0.0002539104 0.03639771 0.01342297 0.001299605 0.08121197 1.001634
##                    ACF1
## Training set 0.02514878
fcast_0 <- forecast(arima_model, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_0)

#Accuracy Checks & Residuals Checks

#compare the point forecasts against the held-out test set
df_ts_test <- ts(test_0[, 2])
df_arima_0 <- as.data.frame(fcast_0)
fcast_0_ts <- ts(df_arima_0[, 1])
accuracy(fcast_0_ts, df_ts_test)
##                   ME      RMSE       MAE        MPE      MAPE      ACF1
## Test set -0.08440702 0.1924907 0.1498097 -0.5258868 0.9203862 0.9118253
##          Theil's U
## Test set  2.963144
tsdisplay(residuals(arima_model))
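
Beyond the visual diagnostics from tsdisplay(), a Ljung-Box test (a sketch, not part of the original output; the same check applies to the other three models) formally tests for leftover autocorrelation in the residuals:

#Ljung-Box test for residual autocorrelation (from base stats)
Box.test(residuals(arima_model), lag = 10, type = "Ljung-Box")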

Method 2: Imputing using Means

series_3 <- madata3 %>% dplyr::select(Time, value)
#70/30 split of the Method 2 (Mean-imputed) data set
train_3 <- head(series_3, nrow(series_3) * 0.7)
test_3 <- tail(series_3, 60)
#create the time series

df_ts3 <- ts(train_3[, 2])

#display time series plot
plot(df_ts3, main="Method 2: Imputing using Means",col="darkgreen", ylab="values")

Repeating the same procedure as before

#Plot ACF function
acf(df_ts3)

arima_model3 <- auto.arima(df_ts3, stepwise=F, approximation = FALSE)
summary(arima_model3)
## Series: df_ts3 
## ARIMA(2,1,3) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2      ma3
##       -1.2578  -0.6787  1.0216  0.3901  -0.3351
## s.e.   0.2063   0.1723  0.1968  0.1708   0.0830
## 
## sigma^2 estimated as 0.001253:  log likelihood=269.38
## AIC=-526.77   AICc=-526.13   BIC=-509.16
## 
## Training set error measures:
##                      ME       RMSE        MAE         MPE      MAPE     MASE
## Training set 0.00026761 0.03462545 0.02029339 0.001339458 0.1229694 1.288332
##                     ACF1
## Training set 0.004704262
fcast_3 <- forecast(arima_model3, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_3)

df_ts_test3 <- ts(test_3[, 2])
df_arima_3<-as.data.frame(fcast_3)
fcast_3_ts<- ts(df_arima_3[,1])
accuracy(fcast_3_ts,df_ts_test3)
##                   ME     RMSE       MAE        MPE      MAPE      ACF1
## Test set -0.09563678 0.188365 0.1407021 -0.5934364 0.8657147 0.9565365
##          Theil's U
## Test set  4.190648
tsdisplay(residuals(arima_model3))

Method 3: Imputing using Previous Values

series_4 <- madata4 %>% dplyr::select(Time, value)
#70/30 split of the Method 3 (Previous-Value-imputed) data set
train_4 <- head(series_4, nrow(series_4) * 0.7)
test_4 <- tail(series_4, 60)
#create the time series

df_ts4 <- ts(train_4[, 2])

#display time series plot
plot(df_ts4, main="Method 3: Imputing using Previous Values",col="red", ylab="values")

#Plot ACF function
acf(df_ts4)

arima_model4 <- auto.arima(df_ts4, stepwise=F, approximation = FALSE)
summary(arima_model4)
## Series: df_ts4 
## ARIMA(1,1,0) 
## 
## Coefficients:
##          ar1
##       0.1348
## s.e.  0.0856
## 
## sigma^2 estimated as 0.0005937:  log likelihood=319.81
## AIC=-635.62   AICc=-635.53   BIC=-629.75
## 
## Training set error measures:
##                        ME       RMSE         MAE        MPE       MAPE    MASE
## Training set 0.0002298213 0.02419203 0.009898567 0.00130477 0.06001109 1.00306
##                     ACF1
## Training set 0.002259289
fcast_4 <- forecast(arima_model4, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_4)

df_ts_test4 <- ts(test_4[, 2])
df_arima_4<-as.data.frame(fcast_4)
fcast_4_ts<- ts(df_arima_4[,1])
accuracy(fcast_4_ts,df_ts_test4)
##                  ME     RMSE       MAE        MPE      MAPE      ACF1 Theil's U
## Test set -0.1155016 0.205263 0.1606845 -0.7164398 0.9894267 0.9753037  9.073716
tsdisplay(residuals(arima_model4))

Method 4: Imputing using Median

series_5 <- madata5 %>% dplyr::select(Time, value)
#70/30 split of the Method 4 (Median-imputed) data set
train_5 <- head(series_5, nrow(series_5) * 0.7)
test_5 <- tail(series_5, 60)
#create the time series

df_ts5 <- ts(train_5[, 2])

#display time series plot
plot(df_ts5, main="Method 4: Imputing using Median",col="purple", ylab="values")

#Plot ACF function
acf(df_ts5)

arima_model5 <- auto.arima(df_ts5, stepwise=F, approximation = FALSE)
summary(arima_model5)
## Series: df_ts5 
## ARIMA(3,1,2) 
## 
## Coefficients:
##           ar1      ar2      ar3     ma1     ma2
##       -1.4679  -1.0853  -0.2937  1.2690  0.8401
## s.e.   0.1498   0.1541   0.0931  0.1399  0.1037
## 
## sigma^2 estimated as 0.001103:  log likelihood=278.4
## AIC=-544.81   AICc=-544.17   BIC=-527.2
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE      MAPE     MASE
## Training set 0.0002796718 0.03249271 0.01885023 0.001466746 0.1142239 1.260717
##                     ACF1
## Training set 0.005857079
fcast_5 <- forecast(arima_model5, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_5)

df_ts_test5 <- ts(test_5[, 2])
df_arima_5<-as.data.frame(fcast_5)
fcast_5_ts<- ts(df_arima_5[,1])
accuracy(fcast_5_ts,df_ts_test5)
##                   ME      RMSE       MAE        MPE      MAPE      ACF1
## Test set -0.09567507 0.1889631 0.1398158 -0.5937077 0.8603979 0.9524104
##          Theil's U
## Test set  4.013735
tsdisplay(residuals(arima_model5))
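
To make the RMSE comparison in the summary below easier to reproduce, the four test-set RMSEs can be collected into a small table (a sketch that reuses the objects created above):

#collect the test-set RMSEs of the four imputation methods
rmse_tbl <- data.frame(
  method = c("Max", "Mean", "Previous Value", "Median"),
  rmse   = c(accuracy(fcast_0_ts, df_ts_test)["Test set", "RMSE"],
             accuracy(fcast_3_ts, df_ts_test3)["Test set", "RMSE"],
             accuracy(fcast_4_ts, df_ts_test4)["Test set", "RMSE"],
             accuracy(fcast_5_ts, df_ts_test5)["Test set", "RMSE"]))
rmse_tbl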

Summary and Conclusions

  • At first glance, the raw data appeared to have 1,440 rows, but on further inspection the majority of those rows were 'empty' (filled with "")

  • After wrangling, the data had only 100 rows of sensor readings

  • Each data set was then split 70/30 into training and test sets

  • Sensor2 had only one data point (16.567243), which was excluded from further analysis; Sensor2 could have malfunctioned or been covered with debris

  • The DateTime column was converted to a date-time stamp, and the wide data was converted to long format

  • Four imputation methods were implemented: Max, Mean, Previous Value, and Median

  • Each imputed series was first plotted to get a "feel" for any underlying patterns. However, the plots of the training data sets did not produce any discernible patterns such as trend or seasonality

  • Because of that, ACF plots of each training data set were used to help decide which model to implement. The ACF plots made it clear that the underlying data has strong autocorrelation at several lags

  • Using RMSE as the prediction metric, the following was found: Method 1 (Max), RMSE = 0.1924907; Method 2 (Mean), RMSE = 0.188365; Method 3 (Previous Value), RMSE = 0.205263; Method 4 (Median), RMSE = 0.1889631

  • Based on RMSE, the Mean and Median imputations were the most accurate, the Max came in third, and imputing with Previous Values came in last