Test Question 1:
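
The code in this section assumes several packages are loaded elsewhere (the original does not show the library calls); a likely set, judging from the functions used below, is:

library(dplyr)      #glimpse(), select(), mutate(), rename()
library(lubridate)  #mdy_hm()
library(skimr)      #skim()
library(reshape2)   #melt()
library(VIM)        #aggr()
library(zoo)        #na.locf()
library(forecast)   #auto.arima(), forecast(), accuracy(), tsdisplay()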

#reading in raw data1
df1 <- read.csv('Data1.csv')
head(df1)
##          DateTime Sensor.1  Sensor2 Sensor.3
## 1 9/29/2015 23:00 16.46646       NA       NA
## 2 9/29/2015 23:01 16.52575       NA       NA
## 3 9/29/2015 23:02 16.55345       NA       NA
## 4 9/29/2015 23:03 16.56724 16.56724       NA
## 5 9/29/2015 23:04 16.56958       NA       NA
## 6 9/29/2015 23:05 16.56660       NA       NA

#checking original data sets

glimpse(df1)
## Rows: 1,440
## Columns: 4
## $ DateTime <chr> "9/29/2015 23:00", "9/29/2015 23:01", "9/29/2015 23:02", "9/2~
## $ Sensor.1 <dbl> 16.46646, 16.52575, 16.55345, 16.56724, 16.56958, 16.56660, 1~
## $ Sensor2  <dbl> NA, NA, NA, 16.56724, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ Sensor.3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 16.48639, 16.4556~
skim(df1)
Data summary
Name df1
Number of rows 1440
Number of columns 4
_______________________
Column type frequency:
character 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable   n_missing   complete_rate   min   max   empty   n_unique   whitespace
DateTime                0               1     0    15    1340         99            0

Variable type: numeric

skim_variable   n_missing   complete_rate    mean     sd      p0     p25     p50     p75    p100   hist
Sensor.1             1361            0.05   16.53   0.10   16.30   16.48   16.57   16.62   16.64   ▂▂▂▇▆
Sensor2              1439            0.00   16.57     NA   16.57   16.57   16.57   16.57   16.57   ▁▁▇▁▁
Sensor.3             1355            0.06   16.41   0.14   16.08   16.31   16.43   16.52   16.59   ▂▁▂▇▅

Checking the number of rows and the data type of the DateTime column

nrow(df1)
## [1] 1440
head(df1$DateTime)   
## [1] "9/29/2015 23:00" "9/29/2015 23:01" "9/29/2015 23:02" "9/29/2015 23:03"
## [5] "9/29/2015 23:04" "9/29/2015 23:05"

Observation 1

  • Many rows were filled with "" (empty strings), so the "empty" rows were removed via subsetting

  • The data set is now much smaller (only 100 rows of relevant data points) because the empty, redundant rows were removed

  • Because Sensor2 has only one data point, I am excluding it from further analysis

#keep only the first 100 rows, which contain the non-empty records
df1_short <- df1[1:100, ]
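
The hard-coded 1:100 range works because all of the non-empty records happen to sit in the first 100 rows. A less position-dependent alternative (a sketch only; it assumes the empty rows carry a blank DateTime string, which the skim output above suggests) would be to filter on that column directly:

#keep only rows whose DateTime string is non-empty
df1_short_alt <- df1 %>% dplyr::filter(DateTime != "")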

skim(df1_short)
Data summary
Name df1_short
Number of rows 100
Number of columns 4
_______________________
Column type frequency:
character 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable   n_missing   complete_rate   min   max   empty   n_unique   whitespace
DateTime                0               1    14    15       0         98            0

Variable type: numeric

skim_variable   n_missing   complete_rate    mean     sd      p0     p25     p50     p75    p100   hist
Sensor.1               21            0.79   16.53   0.10   16.30   16.48   16.57   16.62   16.64   ▂▂▂▇▆
Sensor2                99            0.01   16.57     NA   16.57   16.57   16.57   16.57   16.57   ▁▁▇▁▁
Sensor.3               15            0.85   16.41   0.14   16.08   16.31   16.43   16.52   16.59   ▂▁▂▇▅

Converting the DateTime column to a date-time (POSIXct) object

df1_short1 <- df1_short %>%  mutate(Time=mdy_hm(DateTime))

df1_short1 <- df1_short1 %>%  dplyr::select(Time, Sensor.1,Sensor.3)

#add index column to data frame in order to melt the dataframe later
df1_short1$index <- 1:nrow(df1_short1)
df1_short1 <- df1_short1 %>%  dplyr::select(index,Time, Sensor.1,Sensor.3)
head(df1_short1)
##   index                Time Sensor.1 Sensor.3
## 1     1 2015-09-29 23:00:00 16.46646       NA
## 2     2 2015-09-29 23:01:00 16.52575       NA
## 3     3 2015-09-29 23:02:00 16.55345       NA
## 4     4 2015-09-29 23:03:00 16.56724       NA
## 5     5 2015-09-29 23:04:00 16.56958       NA
## 6     6 2015-09-29 23:05:00 16.56660       NA

Verify that the Time column was indeed converted

class(df1_short1$Time)
## [1] "POSIXct" "POSIXt"

The reason I am reshaping the data into "long" format is that the remaining sensors share similar statistical attributes:

  1. The means of the two remaining sensors are close (16.53 for Sensor 1 and 16.41 for Sensor 3)
  2. Sensor2 has only one data point, with the rest being NAs; see the visuals below (maybe a faulty sensor, or one covered by external materials?)
  3. The quantiles of the two sensors are very similar. Lastly, Sensor 1 has a slightly lower standard deviation than Sensor 3

Note: see below for visuals of the proportion of missing data for each sensor

# Reshape the data set from wide to long format
mdata1 <- melt(df1_short1, id = c("index", "Time")) %>% dplyr::rename(sensor = variable)
madata2 <- mdata1 %>% dplyr::select(Time, sensor, value)
head(madata2)
##                  Time   sensor    value
## 1 2015-09-29 23:00:00 Sensor.1 16.46646
## 2 2015-09-29 23:01:00 Sensor.1 16.52575
## 3 2015-09-29 23:02:00 Sensor.1 16.55345
## 4 2015-09-29 23:03:00 Sensor.1 16.56724
## 5 2015-09-29 23:04:00 Sensor.1 16.56958
## 6 2015-09-29 23:05:00 Sensor.1 16.56660
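
For reference, the same wide-to-long reshape can be done with tidyr instead of reshape2 (a sketch, assuming tidyr is available; the result matches madata2 up to column types and row ordering):

#equivalent reshape using tidyr::pivot_longer
madata2_alt <- df1_short1 %>%
  tidyr::pivot_longer(cols = c(Sensor.1, Sensor.3),
                      names_to = "sensor", values_to = "value") %>%
  dplyr::select(Time, sensor, value)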

The visuals confirmed what we already knew from above:

df1_shortTemp <- df1_short %>% mutate(Time = mdy_hm(DateTime))
#VIM::aggr visualizes the missingness pattern; summary()$combinations tabulates the NA combinations
res <- summary(aggr(df1_shortTemp[, 2:4], sortVar = TRUE))$combinations

## 
##  Variables sorted by number of missings: 
##  Variable Count
##   Sensor2  0.99
##  Sensor.1  0.21
##  Sensor.3  0.15
res
##   Combinations Count Percent
## 1        0:0:1     1       1
## 2        0:1:0    65      65
## 3        0:1:1    13      13
## 4        1:1:0    20      20
## 5        1:1:1     1       1
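
The same missingness proportions can also be confirmed numerically (a quick sketch, not part of the original output):

#proportion of missing values in each sensor column
colMeans(is.na(df1_short[, c("Sensor.1", "Sensor2", "Sensor.3")]))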

Method 1: Fill in missing data points with Max

madata_0 <- madata2
#replace every NA with the maximum observed value
madata_0$value[is.na(madata_0$value)] <- max(madata_0$value, na.rm = TRUE)
tail(madata_0)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.63897
## 198 2015-09-30 00:37:00 Sensor.3 16.63897
## 199 2015-09-30 00:38:00 Sensor.3 16.63897
## 200 2015-09-30 00:39:00 Sensor.3 16.63897

Check whether any NAs remain

sum(is.na(madata_0$value))
## [1] 0

Method 2: Fill in missing data points with Means

madata3 <- madata2

#replace NAs with the overall (pooled) mean of the observed values
madata3$value[is.na(madata3$value)] <- mean(madata3$value, na.rm = TRUE)
tail(madata3)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.46872
## 198 2015-09-30 00:37:00 Sensor.3 16.46872
## 199 2015-09-30 00:38:00 Sensor.3 16.46872
## 200 2015-09-30 00:39:00 Sensor.3 16.46872
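
Note that the replacement above uses a single mean pooled across both sensors. A per-sensor variant (shown only as a sketch; the analysis below sticks with the pooled mean) would impute each sensor with its own mean:

#per-sensor mean imputation (sketch; not used in the analysis below)
madata3_by_sensor <- madata2 %>%
  dplyr::group_by(sensor) %>%
  dplyr::mutate(value = ifelse(is.na(value), mean(value, na.rm = TRUE), value)) %>%
  dplyr::ungroup()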

Method 3: Fill in missing data points with Previous Values

madata4 <- madata2

#carry the last observation forward; the outer pass with fromLast = TRUE back-fills any leading NAs
madata4 <- na.locf(na.locf(madata4), fromLast = TRUE)
tail(madata4)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.16810
## 198 2015-09-30 00:37:00 Sensor.3 16.16810
## 199 2015-09-30 00:38:00 Sensor.3 16.16810
## 200 2015-09-30 00:39:00 Sensor.3 16.16810

Checking for spread and shape of sensors 1 & 3

# Create the histograms
hist(df1_short1$Sensor.1, main = "Sensor 1", xlab = "", col = "yellow", border = "blue")

hist(df1_short1$Sensor.3, main = "Sensor 3", xlab = "", col = "yellow", border = "blue")

Method 4: Fill in missing data points with Median

Observation 2

  • From the skim output and the histograms above, Sensors 1 and 3 have long left tails

  • Because Sensor 3 has a slightly higher variance, imputing with the median rather than the mean might improve accuracy a bit. Averages are more susceptible to outliers, and since Sensor 3 exhibited a higher standard deviation, using the median may prevent skewing

  • Unlike the mean, the median does not depend on every value in the data set, so extreme values affect it less. For a skewed distribution, the median is a better measure of central tendency than the mean (see the per-sensor summary sketch after this list)
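
A quick per-sensor comparison of mean, median, and standard deviation on the long-format data (a sketch, not part of the original output):

#compare mean, median, and SD for each sensor
madata2 %>%
  dplyr::group_by(sensor) %>%
  dplyr::summarise(mean = mean(value, na.rm = TRUE),
                   median = median(value, na.rm = TRUE),
                   sd = sd(value, na.rm = TRUE))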

madata5 <- madata2
#replace NAs with the overall (pooled) median of the observed values
madata5$value[is.na(madata5$value)] <- median(madata5$value, na.rm = TRUE)
tail(madata5)
##                    Time   sensor    value
## 195 2015-09-30 00:34:00 Sensor.3 16.16788
## 196 2015-09-30 00:35:00 Sensor.3 16.16810
## 197 2015-09-30 00:36:00 Sensor.3 16.48767
## 198 2015-09-30 00:37:00 Sensor.3 16.48767
## 199 2015-09-30 00:38:00 Sensor.3 16.48767
## 200 2015-09-30 00:39:00 Sensor.3 16.48767

Method 1: Imputing using Max

series <- madata_0 %>% dplyr::select(Time, value)
#70/30 split of the Method 1 (Max-imputed) data set: 140 training rows, 60 test rows
train_0 <- head(series, nrow(series) * 0.7)
test_0 <- tail(series, 60)
#create the time series
#df_ts <- as.xts(x = series[, -1], order.by = series$Time)
df_ts <- ts(train_0[, 2])

#display time series plot
plot(df_ts, main="Method 1: Imputing using Max",col="darkblue", ylab="values")

#Checking for Autocorrelation of the train data

#plot autocorrelation function
acf(df_ts)
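
A PACF plot (not in the original) can complement the ACF when judging candidate AR orders:

#partial autocorrelation of the training series
pacf(df_ts)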

ARIMA Model

arima_model <- auto.arima(df_ts, stepwise=F, approximation = FALSE)
summary(arima_model)
## Series: df_ts 
## ARIMA(0,1,0) 
## 
## sigma^2 estimated as 0.001334:  log likelihood=262.91
## AIC=-523.83   AICc=-523.8   BIC=-520.89
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE       MAPE     MASE
## Training set 0.0002539104 0.03639771 0.01342297 0.001299605 0.08121197 1.001634
##                    ACF1
## Training set 0.02514878
fcast_0 <- forecast(arima_model, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_0)

#Accuracy Checks & Residuals Checks

#compare the point forecasts against the held-out test set
df_ts_test <- ts(test_0[, 2])
df_arima_0 <- as.data.frame(fcast_0)
fcast_0_ts <- ts(df_arima_0[, 1])
accuracy(fcast_0_ts, df_ts_test)
##                   ME      RMSE       MAE        MPE      MAPE      ACF1
## Test set -0.08440702 0.1924907 0.1498097 -0.5258868 0.9203862 0.9118253
##          Theil's U
## Test set  2.963144
tsdisplay(residuals(arima_model))
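
Beyond the visual diagnostics from tsdisplay(), a Ljung-Box test (a sketch, not part of the original output; the same check applies to the other three models) formally tests for leftover autocorrelation in the residuals:

#Ljung-Box test for residual autocorrelation (from base stats)
Box.test(residuals(arima_model), lag = 10, type = "Ljung-Box")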

Method 2: Imputing using Means

series_3 <- madata3 %>% dplyr::select(Time, value)
#70/30 split of the Method 2 (Mean-imputed) data set
train_3 <- head(series_3, nrow(series_3) * 0.7)
test_3 <- tail(series_3, 60)
#create the time series

df_ts3 <- ts(train_3[, 2])

#display time series plot
plot(df_ts3, main="Method 2: Imputing using Means",col="darkgreen", ylab="values")

Repeating the same procedure as before

#Plot ACF function
acf(df_ts3)

arima_model3 <- auto.arima(df_ts3, stepwise=F, approximation = FALSE)
summary(arima_model3)
## Series: df_ts3 
## ARIMA(2,1,3) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2      ma3
##       -1.2578  -0.6787  1.0216  0.3901  -0.3351
## s.e.   0.2063   0.1723  0.1968  0.1708   0.0830
## 
## sigma^2 estimated as 0.001253:  log likelihood=269.38
## AIC=-526.77   AICc=-526.13   BIC=-509.16
## 
## Training set error measures:
##                      ME       RMSE        MAE         MPE      MAPE     MASE
## Training set 0.00026761 0.03462545 0.02029339 0.001339458 0.1229694 1.288332
##                     ACF1
## Training set 0.004704262
fcast_3 <- forecast(arima_model3, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_3)

df_ts_test3 <- ts(test_3[, 2])
df_arima_3<-as.data.frame(fcast_3)
fcast_3_ts<- ts(df_arima_3[,1])
accuracy(fcast_3_ts,df_ts_test3)
##                   ME     RMSE       MAE        MPE      MAPE      ACF1
## Test set -0.09563678 0.188365 0.1407021 -0.5934364 0.8657147 0.9565365
##          Theil's U
## Test set  4.190648
tsdisplay(residuals(arima_model3))

Method 3: Imputing using Previous Values

series_4 <- madata4 %>% dplyr::select(Time, value)
#70/30 split of the Method 3 (Previous-Value-imputed) data set
train_4 <- head(series_4, nrow(series_4) * 0.7)
test_4 <- tail(series_4, 60)
#create the time series

df_ts4 <- ts(train_4[, 2])

#display time series plot
plot(df_ts4, main="Method 3: Imputing using Previous Values",col="red", ylab="values")

#Plot ACF function
acf(df_ts4)

arima_model4 <- auto.arima(df_ts4, stepwise=F, approximation = FALSE)
summary(arima_model4)
## Series: df_ts4 
## ARIMA(1,1,0) 
## 
## Coefficients:
##          ar1
##       0.1348
## s.e.  0.0856
## 
## sigma^2 estimated as 0.0005937:  log likelihood=319.81
## AIC=-635.62   AICc=-635.53   BIC=-629.75
## 
## Training set error measures:
##                        ME       RMSE         MAE        MPE       MAPE    MASE
## Training set 0.0002298213 0.02419203 0.009898567 0.00130477 0.06001109 1.00306
##                     ACF1
## Training set 0.002259289
fcast_4 <- forecast(arima_model4, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_4)

df_ts_test4 <- ts(test_4[, 2])
df_arima_4<-as.data.frame(fcast_4)
fcast_4_ts<- ts(df_arima_4[,1])
accuracy(fcast_4_ts,df_ts_test4)
##                  ME     RMSE       MAE        MPE      MAPE      ACF1 Theil's U
## Test set -0.1155016 0.205263 0.1606845 -0.7164398 0.9894267 0.9753037  9.073716
tsdisplay(residuals(arima_model4))

Method 4: Imputing using Median

series_5 <- madata5 %>% dplyr::select(Time, value)
#70/30 split of the Method 4 (Median-imputed) data set
train_5 <- head(series_5, nrow(series_5) * 0.7)
test_5 <- tail(series_5, 60)
#create the time series

df_ts5 <- ts(train_5[, 2])

#display time series plot
plot(df_ts5, main="Method 4: Imputing using Median",col="purple", ylab="values")

#Plot ACF function
acf(df_ts5)

arima_model5 <- auto.arima(df_ts5, stepwise=F, approximation = FALSE)
summary(arima_model5)
## Series: df_ts5 
## ARIMA(3,1,2) 
## 
## Coefficients:
##           ar1      ar2      ar3     ma1     ma2
##       -1.4679  -1.0853  -0.2937  1.2690  0.8401
## s.e.   0.1498   0.1541   0.0931  0.1399  0.1037
## 
## sigma^2 estimated as 0.001103:  log likelihood=278.4
## AIC=-544.81   AICc=-544.17   BIC=-527.2
## 
## Training set error measures:
##                        ME       RMSE        MAE         MPE      MAPE     MASE
## Training set 0.0002796718 0.03249271 0.01885023 0.001466746 0.1142239 1.260717
##                     ACF1
## Training set 0.005857079
fcast_5 <- forecast(arima_model5, h = 60)  #h = 60 to match the 60-point test set
plot(fcast_5)

df_ts_test5 <- ts(test_5[, 2])
df_arima_5<-as.data.frame(fcast_5)
fcast_5_ts<- ts(df_arima_5[,1])
accuracy(fcast_5_ts,df_ts_test5)
##                   ME      RMSE       MAE        MPE      MAPE      ACF1
## Test set -0.09567507 0.1889631 0.1398158 -0.5937077 0.8603979 0.9524104
##          Theil's U
## Test set  4.013735
tsdisplay(residuals(arima_model5))
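
To make the RMSE comparison in the summary below easier to reproduce, the four test-set RMSEs can be collected into a small table (a sketch that reuses the objects created above):

#collect the test-set RMSEs of the four imputation methods
rmse_tbl <- data.frame(
  method = c("Max", "Mean", "Previous Value", "Median"),
  rmse   = c(accuracy(fcast_0_ts, df_ts_test)["Test set", "RMSE"],
             accuracy(fcast_3_ts, df_ts_test3)["Test set", "RMSE"],
             accuracy(fcast_4_ts, df_ts_test4)["Test set", "RMSE"],
             accuracy(fcast_5_ts, df_ts_test5)["Test set", "RMSE"]))
rmse_tbl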

Summary and Conclusions

  • At first glance, the raw data appeared to have 1,440 rows, but on further inspection the majority of those rows were 'empty' (filled with "")

  • After wrangling, the data had only 100 rows of sensor readings

  • Each data set was then split 70/30 into training and test sets

  • Sensor2 had only one data point (16.567243), which was excluded from further analysis; Sensor2 could have malfunctioned or been covered with debris

  • The DateTime column was converted to a date-time stamp, and the wide data was converted to long format

  • Four imputation methods were implemented: Max, Mean, Previous Value, and Median

  • Each imputed series was first plotted to get a "feel" for any underlying patterns. However, the plots of the training data sets did not produce any discernible patterns such as trend or seasonality

  • Because of that, ACF plots of each training data set were used to help decide which model to implement. The ACF plots made it clear that the underlying data has strong autocorrelation at several lags

  • Using RMSE as the prediction metric, the following was found: Method 1 (Max), RMSE = 0.1924907; Method 2 (Mean), RMSE = 0.188365; Method 3 (Previous Value), RMSE = 0.205263; Method 4 (Median), RMSE = 0.1889631

  • Based on RMSE, the Mean and Median imputations were the most accurate, the Max came in third, and imputing with Previous Values came in last