# Introduction

People in middle- and upper-income economies spend 80-90% of their time indoors. It should broadly be the goal of buildings to keep people healthy while they occupy them. While there is plenty of research in this area, data is often collected through lab studies or simulations. The data from this project exists to improve access to IEQ data for architecture researchers.

The data we are evaluating is spatiotemporal (ST) indoor environmental quality (IEQ) data. Every row contains a datetime, an x-y position, and either temperature (C) or illuminance (lx) data. Sometimes the lux value is NA; this is a hardware limitation, as the physical lux sensor could not read measurements as fast as the temperature sensor. (Lux is a measurement of the amount of incident light that strikes a surface and is widely used to measure brightness.)
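
For illustration, a single raw record looks roughly like the following sketch. The field names are reconstructed from the parsing code later in this document; the values themselves are made up.

# A hypothetical raw record, reconstructed from the parsing code below.
# Values are illustrative only; "lx" is absent when the lux sensor
# could not keep up with the temperature sensor.
record <- list(
  created = "2021-02-01T08:12:34Z",          # ISO 8601 timestamp
  data = list(x = 5.85, y = 0.03,            # position from a room corner
              t = 19.15,                     # temperature (C)
              lx = 5.46))                    # illuminance (lux), optional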

The full dataset comprises approximately 101,000 rows collected over five days from a single room at the University of Oregon in Winter 2021. The data was collected with a small robotic sensor that crawls on the floor and is able to capture a continuous winding path of ST data.

i. Packages

options(repos="https://cran.rstudio.com")
library(rjson)
library(parsedate)
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
library(rgl)
library(corrplot)
## corrplot 0.89 loaded
set.seed(1)

ii. Data

Parsing the original JSON (a nightmare). This code does not execute here; see the shortcut below.

 files = c('w.0.json', 'w.1.json', 'w.2.json', 'w.3.json', 'w.4.json', 
          'r.0.json', 'r.1.json', 'r.2.json', 'r.3.json', 'r.4.json', 
          'r.5.json', 
          'f.0.json', 'f.1.json', 'f.2.json', 'f.3.json', 'f.4.json', 
          'f.5.json', 'f.6.json', 
          's.0.json', 's.1.json', 's.2.json', 's.3.json', 's.4.json', 
          's.5.json', 
          'u.0.json', 'u.1.json', 'u.2.json', 'u.3.json', 'u.4.json', 
          'u.5.json', 'u.6.json', 'u.7.json')
json_data <- lapply(files, 
                    function(x) fromJSON(file=paste('/content/',x,sep='')))

#I give up with lapply...

nrows <- 0
for (i in 1:length(files)) {
  nrows <- nrows + length(json_data[[i]])
}

df <- data.frame(matrix(ncol = 5, nrow = nrows))
names(df) <- c("time", "x", "y", "temp", "lux")

idx <- 0
for (i in seq_along(files)) {
  d <- json_data[[i]]
  for (j in seq_along(d)) {
    idx <- idx + 1
    entry <- d[[j]]
    df$time[idx] <- parse_iso_8601(entry$created)
    df$x[idx]    <- entry$data$x
    df$y[idx]    <- entry$data$y
    df$temp[idx] <- entry$data$t
    # The lux sensor samples more slowly than the temperature sensor, so
    # some records have no "lx" field; store NA for those.
    df$lux[idx]  <- if (is.null(entry$data$lx)) NA_real_ else entry$data$lx
  }
}

summary(df)

iii. A Shortcut

# write.csv(df, "/Users/josephmcl/Downloads/df.csv")
df <- read.csv("/Users/josephmcl/Downloads/df.csv")

Data Cleaning

Our first step with this data is to remove any invalid or impossible data. This is the raw data from the sensor system and contains some values that do not make sense for this problem. So we remove (i) negative positions ((0,0) represents a corner of the room), (ii) negative temperatures, and (iii) negative lux readings.

df <- df[df$x >= 0 & df$y >= 0 & df$temp > 0 & df$lux >= 0, ]
df <- df[!is.na(df$time), ]
df$time <- df$time - min(df$time)
# df <- df[,-match(c("X"),names(df))]

We also split apart the day and the time-of-day.

# The offsets (in seconds) appear chosen so that each collection session
# falls within a single day bucket; 86400 is one day in seconds.
df$timeofday <- (df$time + 15000) %% 86400
df$day <- as.factor(pmin((df$time + 40000) %/% 86400, 5))
summary(df)
##        X               time              x               y        
##  Min.   :    20   Min.   :     0   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 27202   1st Qu.:103398   1st Qu.:1.750   1st Qu.:1.690  
##  Median : 53923   Median :190942   Median :3.540   Median :3.090  
##  Mean   : 54139   Mean   :206073   Mean   :3.562   Mean   :3.329  
##  3rd Qu.: 81692   3rd Qu.:335353   3rd Qu.:5.030   3rd Qu.:4.890  
##  Max.   :108476   Max.   :372103   Max.   :8.740   Max.   :7.630  
##       temp            lux             timeofday     day      
##  Min.   :18.60   Min.   :   0.000   Min.   : 3156   0: 5653  
##  1st Qu.:19.78   1st Qu.:   6.845   1st Qu.:11161   1: 8947  
##  Median :20.23   Median :  21.588   Median :18643   2:11404  
##  Mean   :20.28   Mean   :  92.907   Mean   :21296   3:10632  
##  3rd Qu.:20.71   3rd Qu.:  70.390   3rd Qu.:32049   4:12478  
##  Max.   :22.62   Max.   :2463.246   Max.   :41503
head(df)
##     X  time    x    y     temp      lux timeofday day
## 20 20 24685 5.85 0.03 19.15410 5.460672     39685   0
## 22 22 24686 5.83 0.09 19.15898 5.845824     39686   0
## 24 24 24686 5.82 0.15 19.14902 6.364800     39686   0
## 26 26 24687 5.75 0.19 19.15410 6.913152     39687   0
## 28 28 24687 5.63 0.27 19.16406 7.187328     39687   0
## 30 30 24688 5.62 0.40 19.16914 3.978816     39688   0

Data Visualization

Scatterplot pairs

Before we check the correlation matrix, let’s expand some of these plots

# pairs(~x+y+temp+lux+timeofday+day, data=df)

It’s fairly difficult to visualize ST data, but we begin with the spatial dimension. Perceived brightness scales logarithmically with lux, so to illustrate it we often convert it to log-space.

Spatial lux plot

ggplot(subset(df, lux > 1), aes(x, y, color = log10(lux))) +
  geom_point() + labs(color = 'log(lx)') + coord_fixed()

Spatial temperature plot

ggplot(df, aes(x, y, color = temp)) +
  geom_point() + labs(color = 'temp') + coord_fixed()

The entire temporal dimension is collapsed here, making it hard to make sense of these plots. Let’s also view these as a time-series.

Lux time-series plots

ggplot() + geom_point(data=df, aes(time, lux))

Temperature time-series plots

ggplot() + geom_point(data=df, aes(time, temp))

Let’s also plot the lux by the time of day, overlaying measurements taken at the same time of day on different days on top of each other.

Lux time-of-day plot

plot(lux~timeofday, df)

Temperature time-of-day plot

plot(temp~timeofday, df)

It’s interesting how the attributes compare. Temperature seems tightly bounded but irregular, while lux has significant variance yet still maintains a daily pattern. Remember the values are ST, so they can’t simply be viewed as a time-series.

Let’s get a closer look at some of the individual clusters. We’ll do the first (8:00 am) and the third (12:00 pm).

8:00 am lux time-series plots

plot(lux~timeofday, subset(df, df$timeofday < 7000))

12:00 pm lux time-series plots

plot(lux~timeofday, subset(df, df$timeofday > 17000 & df$timeofday < 19000))

8:00 am temperature time-series plots

plot(temp~timeofday, subset(df, df$timeofday < 7000))

12:00 pm temperature time-series plots

plot(temp~timeofday, subset(df, df$timeofday > 17000 & df$timeofday < 19000))

8:00 am lux spatial plots

ggplot(subset(df, lux > 1 & timeofday < 7000), aes(x, y, color = log10(lux))) +
  geom_point() + labs(color = 'log(lx)') + coord_fixed()

12:00 pm lux spatial plots

ggplot(subset(df, lux > 1 & timeofday > 17000 & timeofday < 19000),
       aes(x, y, color = log10(lux))) +
  geom_point() + labs(color = 'log(lx)') + coord_fixed()

8:00 am temperature spatial plots

ggplot(subset(df, timeofday < 7000), aes(x, y, color = temp)) +
  geom_point() + labs(color = 'temp') + coord_fixed()

12:00 pm temperature spatial plots

ggplot(subset(df, timeofday > 17000 & timeofday < 19000),
       aes(x, y, color = temp)) +
  geom_point() + labs(color = 'temp') + coord_fixed()

One more set for the sake of it. Let’s look at the same individual clusters on a single day.

ggplot(subset(df, timeofday > 17000 & timeofday < 19000 & day == 1),
       aes(x, y, color = temp)) +
  geom_point() + labs(color = 'temp') + coord_fixed()

ggplot(subset(df, timeofday < 7000 & day == 1), aes(x, y, color = temp)) +
  geom_point() + labs(color = 'temp') + coord_fixed()

Correlation matrices

After playing around with visualization, we should look at how the different columns actually correlate. Here, we can see that of all the properties we have, temperature correlates loosely with almost everything. This could suggest several things: with respect to time, that the time of day and the day influence indoor temperature. This is expected, because the weather changes from day to day along with the outdoor temperature, affecting the indoor temperature. There is correlation with x and y, though more strongly with x. This may suggest that the layout of the room maintains a temperature “lumpiness.” Lux correlates very weakly with x and y, and most strongly with temperature. This may be because temperature increases mostly during the day, when lux is also highest.

corrplot(cor(df[, sapply(df, is.numeric)]))

Correlation matrix at 8:00 am

corrplot(cor(subset(df[, sapply(df, is.numeric)], df$timeofday < 7000)))

Correlation matrix at 12:00 pm

corrplot(cor(subset(df[, sapply(df, is.numeric)], df$timeofday > 17000 & df$timeofday < 19000)))

Now we examine the same correlations over a limited time-of-day. These plots show the correlation between different features varying dramatically based on the time-of-day. This does not hold when constraining the day instead, suggesting there may be daily cyclic effects in the space. If that were the case, time of day should be a good predictor.

Modeling

Goal: given some arbitrary location in space and time, can we predict a value for the lux or temperature fields?

Why: if a device similar to the one that collected this data could survey a space at sparse intervals, but still make predictions about the space in between those intervals, we could make performance predictions about the space.
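
Concretely, the end state is a model we can query at an arbitrary ST point, along the lines of this sketch. Here fit.t is the temperature model fit below, and the query point is made up.

# Hypothetical query at an arbitrary ST location; fit.t is fit below.
newpoint <- data.frame(timeofday = 43200,    # midday, in seconds
                       day = factor(2, levels = levels(df$day)),
                       x = 4.0, y = 3.5)     # position in the room
predict(fit.t, newpoint)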

This first model just evaluates the fit with a typical randomized test set, using lux as the response and time of day, day, x, and y as predictors.

# Note: sample() normalizes prob, so c(.8, .2, .2) gives roughly a
# 2/3 train, 1/6 test, 1/6 validation split rather than 80/20/20.
idx <- sample(seq(1, 3), size = nrow(df), replace = TRUE, prob = c(.8, .2, .2))
train <- df[idx == 1,]
test  <- df[idx == 2,]
val   <- df[idx == 3,]

fit.lx<-lm(lux~timeofday*day*x*y, train)
summary(fit.lx)
## 
## Call:
## lm(formula = lux ~ timeofday * day * x * y, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -511.53  -79.52  -19.00   12.38 2191.87 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.320e+01  3.841e+01   1.906 0.056724 .  
## timeofday          -1.488e-03  1.517e-03  -0.981 0.326639    
## day1                3.547e+02  4.179e+01   8.487  < 2e-16 ***
## day2                3.053e+02  4.374e+01   6.981 3.00e-12 ***
## day3                7.569e+01  4.094e+01   1.849 0.064475 .  
## day4                3.734e+02  4.328e+01   8.627  < 2e-16 ***
## x                  -9.020e+00  7.820e+00  -1.153 0.248727    
## y                  -8.901e+00  1.008e+01  -0.883 0.377383    
## timeofday:day1     -2.178e-03  1.675e-03  -1.300 0.193633    
## timeofday:day2     -8.329e-03  1.685e-03  -4.943 7.73e-07 ***
## timeofday:day3      7.682e-03  1.671e-03   4.597 4.29e-06 ***
## timeofday:day4      3.673e-03  1.810e-03   2.030 0.042412 *  
## timeofday:x         1.842e-04  3.424e-04   0.538 0.590656    
## day1:x             -4.146e+01  8.377e+00  -4.949 7.48e-07 ***
## day2:x             -4.453e+01  8.880e+00  -5.015 5.34e-07 ***
## day3:x              7.325e+00  8.490e+00   0.863 0.388284    
## day4:x             -5.002e+01  8.626e+00  -5.798 6.76e-09 ***
## timeofday:y         2.627e-04  4.049e-04   0.649 0.516453    
## day1:y             -4.556e+01  1.119e+01  -4.072 4.67e-05 ***
## day2:y             -5.482e+01  1.157e+01  -4.738 2.17e-06 ***
## day3:y              4.684e+00  1.130e+01   0.415 0.678418    
## day4:y             -6.312e+01  1.140e+01  -5.539 3.06e-08 ***
## x:y                 1.112e+00  2.357e+00   0.472 0.637217    
## timeofday:day1:x    3.468e-04  3.719e-04   0.933 0.350996    
## timeofday:day2:x    1.322e-03  3.738e-04   3.537 0.000405 ***
## timeofday:day3:x   -1.474e-03  3.708e-04  -3.975 7.06e-05 ***
## timeofday:day4:x   -5.332e-04  3.864e-04  -1.380 0.167683    
## timeofday:day1:y    2.038e-04  4.538e-04   0.449 0.653365    
## timeofday:day2:y    1.688e-03  4.629e-04   3.647 0.000266 ***
## timeofday:day3:y   -1.503e-03  4.645e-04  -3.235 0.001218 ** 
## timeofday:day4:y   -1.986e-04  4.769e-04  -0.417 0.677010    
## timeofday:x:y      -3.244e-05  9.513e-05  -0.341 0.733077    
## day1:x:y            5.368e+00  2.615e+00   2.053 0.040121 *  
## day2:x:y            8.019e+00  2.666e+00   3.008 0.002630 ** 
## day3:x:y           -1.882e+00  2.679e+00  -0.702 0.482433    
## day4:x:y            1.202e+01  2.643e+00   4.549 5.40e-06 ***
## timeofday:day1:x:y -4.029e-05  1.076e-04  -0.375 0.707953    
## timeofday:day2:x:y -2.647e-04  1.087e-04  -2.435 0.014901 *  
## timeofday:day3:x:y  2.375e-04  1.087e-04   2.185 0.028861 *  
## timeofday:day4:x:y -8.672e-05  1.092e-04  -0.794 0.427174    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 173.6 on 32668 degrees of freedom
## Multiple R-squared:  0.1754, Adjusted R-squared:  0.1744 
## F-statistic: 178.1 on 39 and 32668 DF,  p-value: < 2.2e-16

The overall p-value is small, providing evidence against the null hypothesis, and the standard errors are low for several predictors, suggesting a decent fit. We expect this since we had previously established correlations with lux.

pre <- predict(fit.lx, val)
r2 <- 1 - (sum((pre - val$lux)^2) / sum((val$lux - mean(val$lux))^2))
r2
## [1] 0.1865492

However, the model has a low r-squared value on the validation set. This is expected because lux has a lot of variance.
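
As a rough check on that claim, we can compare the relative spread of the two measurements. This quick sketch was not part of the original analysis.

# Coefficient of variation: lux is far more dispersed relative to its
# mean than temperature is.
sd(df$lux) / mean(df$lux)
sd(df$temp) / mean(df$temp)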

The same model is repeated but with temperature as the response.

fit.t<-lm(temp~timeofday*day*x*y, train)
summary(fit.t)
## 
## Call:
## lm(formula = temp ~ timeofday * day * x * y, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.67991 -0.22647 -0.04954  0.14743  1.82053 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.034e+01  9.230e-02 220.372  < 2e-16 ***
## timeofday          -1.514e-05  3.645e-06  -4.153 3.29e-05 ***
## day1               -2.995e-01  1.004e-01  -2.982 0.002863 ** 
## day2               -5.645e-01  1.051e-01  -5.371 7.88e-08 ***
## day3               -8.599e-01  9.837e-02  -8.742  < 2e-16 ***
## day4                1.104e+00  1.040e-01  10.617  < 2e-16 ***
## x                  -1.511e-01  1.879e-02  -8.040 9.31e-16 ***
## y                  -3.601e-02  2.423e-02  -1.486 0.137197    
## timeofday:day1      3.307e-05  4.026e-06   8.213 2.23e-16 ***
## timeofday:day2      4.732e-05  4.049e-06  11.687  < 2e-16 ***
## timeofday:day3      6.008e-05  4.015e-06  14.963  < 2e-16 ***
## timeofday:day4      3.624e-05  4.348e-06   8.334  < 2e-16 ***
## timeofday:x         1.315e-06  8.227e-07   1.598 0.109981    
## day1:x              1.213e-02  2.013e-02   0.603 0.546679    
## day2:x              1.240e-01  2.134e-02   5.814 6.17e-09 ***
## day3:x              1.378e-01  2.040e-02   6.756 1.44e-11 ***
## day4:x             -9.907e-02  2.073e-02  -4.780 1.76e-06 ***
## timeofday:y         2.006e-07  9.728e-07   0.206 0.836638    
## day1:y             -4.217e-02  2.689e-02  -1.569 0.116726    
## day2:y              2.603e-02  2.780e-02   0.936 0.349183    
## day3:y              8.306e-02  2.715e-02   3.060 0.002216 ** 
## day4:y             -1.315e-01  2.738e-02  -4.804 1.56e-06 ***
## x:y                 1.955e-02  5.664e-03   3.452 0.000557 ***
## timeofday:day1:x    1.245e-06  8.935e-07   1.394 0.163458    
## timeofday:day2:x   -4.784e-06  8.982e-07  -5.327 1.01e-07 ***
## timeofday:day3:x   -5.534e-06  8.910e-07  -6.211 5.33e-10 ***
## timeofday:day4:x   -1.430e-06  9.286e-07  -1.540 0.123509    
## timeofday:day1:y    1.121e-06  1.090e-06   1.028 0.303996    
## timeofday:day2:y    9.778e-07  1.112e-06   0.879 0.379346    
## timeofday:day3:y   -4.618e-06  1.116e-06  -4.138 3.52e-05 ***
## timeofday:day4:y    7.524e-07  1.146e-06   0.657 0.511393    
## timeofday:x:y      -3.014e-08  2.286e-07  -0.132 0.895106    
## day1:x:y           -4.955e-03  6.284e-03  -0.789 0.430402    
## day2:x:y           -6.264e-03  6.405e-03  -0.978 0.328086    
## day3:x:y           -1.008e-02  6.438e-03  -1.566 0.117384    
## day4:x:y            2.757e-02  6.351e-03   4.340 1.43e-05 ***
## timeofday:day1:x:y -3.154e-07  2.584e-07  -1.221 0.222254    
## timeofday:day2:x:y -8.190e-09  2.613e-07  -0.031 0.974992    
## timeofday:day3:x:y  4.892e-07  2.611e-07   1.873 0.061049 .  
## timeofday:day4:x:y -4.078e-07  2.624e-07  -1.554 0.120178    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4171 on 32668 degrees of freedom
## Multiple R-squared:  0.6335, Adjusted R-squared:  0.6331 
## F-statistic:  1448 on 39 and 32668 DF,  p-value: < 2.2e-16
pre <- predict(fit.t, val)
r2 <- 1 - (sum((pre - val$temp)^2) / sum((val$temp - mean(val$temp))^2))
r2

Temperature as a response fits notably better (adjusted r-squared of 0.63 versus 0.17 in-sample), which we would again expect because it has lower variance.

This approach is naive and does not consider what is actually being modeled here, and whether that model helps us reach our goal. We only have ST values along the ST “path” to test against, but we want to predict points across space and time, values that stray from the path. The current approach is naive because it requires our data to be distributed evenly over space and time to result in a model that can determine a value at an arbitrary ST location. The workaround is to hold out a continuous chunk of ST values from the data and to sample the test set from this holdout. This way, we increase the likelihood that our test data is different enough from the training data, and that it may actually simulate sampling from points that are not along the path.
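
Stated generically, the workaround looks something like this sketch; chunk_holdout is a hypothetical helper, not the exact code used below.

# Generic chunked holdout: hold out whole chunks so that test points
# are not interleaved with training points along the sensor's path.
chunk_holdout <- function(data, chunk_col, n_heldout = 1) {
  chunks <- unique(data[[chunk_col]])
  held <- sample(chunks, n_heldout)
  list(train = data[!(data[[chunk_col]] %in% held), ],
       test  = data[data[[chunk_col]] %in% held, ])
}
# e.g. split <- chunk_holdout(df, "day"); lm(lux~timeofday*x*y, split$train)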

To start, we consider a chunk to be an entire day of data. We will arbitrarily leave out one day and will not use the day as a predictor.

loday <- sample(df$day, 1)
train<-subset(df, df$day != loday & !is.na(df$lux))
holdout<-subset(df, df$day == loday & !is.na(df$lux))

fit.lx<-lm(lux~timeofday*x*y, train)
summary(fit.lx)
## 
## Call:
## lm(formula = lux ~ timeofday * x * y, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -325.19  -74.24  -34.75   -2.10 2401.28 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.865e+02  7.473e+00  51.729   <2e-16 ***
## timeofday     -6.439e-03  3.132e-04 -20.555   <2e-16 ***
## x             -4.661e+01  1.542e+00 -30.218   <2e-16 ***
## y             -5.465e+01  2.240e+00 -24.399   <2e-16 ***
## timeofday:x    8.625e-04  6.356e-05  13.569   <2e-16 ***
## timeofday:y    1.194e-03  9.574e-05  12.476   <2e-16 ***
## x:y            7.135e+00  5.207e-01  13.703   <2e-16 ***
## timeofday:x:y -1.844e-04  2.171e-05  -8.495   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 173.5 on 40159 degrees of freedom
## Multiple R-squared:  0.09857,    Adjusted R-squared:  0.09842 
## F-statistic: 627.4 on 7 and 40159 DF,  p-value: < 2.2e-16

Though the p-value is low, the model explains very little of the variance (r-squared of about 0.1), which may mean that this model won’t perform the way that we expect.

test<-sample(0:4,1)
bestlx.r2<- -100
bestlx.fit<- 0
histlx.r2<-c()
for (loday in 0:4) {
  if (loday != test) {
    train<-subset(df, df$day != loday & df$day != test & !is.na(df$lux))
    val<-subset(df, df$day == loday & !is.na(df$lux))
    fit.lx<-lm(lux~timeofday*x*y, train)
    pre <- predict(fit.lx, val)
    r2 <- 1 - (sum((pre - val$lux)^2) / sum((val$lux - mean(val$lux))^2))
    if (r2 > bestlx.r2) {
      bestlx.r2<-r2
      bestlx.fit<-fit.lx
    }
    histlx.r2<-c(histlx.r2, r2)
  }
}

test<-subset(df, df$day == test & !is.na(df$lux))
pre <- predict(bestlx.fit, test)
r2 <- 1 - (sum((pre - test$lux)^2) / sum((test$lux - mean(test$lux))^2))
print(bestlx.r2)
## [1] 0.0922309
print(r2)
## [1] -20.98411
print(histlx.r2)
## [1]  0.092230902 -1.594136156 -0.009152468  0.069089281

The r-squared is very low even in the best of cases, and often negative, which means the model performs worse than simply predicting the mean. This is likely due to the day-to-day differences between lux readings, but could be due to several other factors. When the sensor explores the room it’s not guaranteed to reach a high-lux location at a specific time, even if high lux readings are more common at a specific time. This result suggests there is a more complex relationship involving the day and the time of day.
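
To make the negative values concrete, here is a minimal sketch of how the out-of-sample r-squared is computed, with made-up numbers.

# Out-of-sample r-squared compares the model's squared error to that of
# a constant predictor (the mean of the held-out values). A model worse
# than that constant predictor yields a negative value.
actual <- c(10, 20, 30)
pred   <- c(35, 5, 50)   # worse than predicting mean(actual)
1 - sum((pred - actual)^2) / sum((actual - mean(actual))^2)   # -5.25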

Here, we hold out one day as the test set and another day as the validation set. Cross-validation is not typical for linear models, because they are usually simple enough that more training data helps. However, since we have very few days to train against, this is helpful in illustrating how much variation there is between different days.

test<-sample(0:4,1)
bestt.r2<- -100
bestt.fit<-0
histt.r2<-c()
for (loday in 0:4) {
  if (loday != test) {
    train<-subset(df, df$day != loday & df$day != test & !is.na(df$lux))
    val<-subset(df, df$day == loday & !is.na(df$lux))
    fit.t<-lm(temp~timeofday*x*y, train)
    pre <- predict(fit.t, val)
    r2 <- 1 - (sum((pre - val$temp)^2) / sum((val$temp - mean(val$temp))^2))
    if (r2 > bestt.r2) {
      bestt.r2<-r2
      bestt.fit<-fit.t
    }
    histt.r2<-c(histt.r2, r2)
  }
}

test<-subset(df, df$day == test & !is.na(df$lux))
pre <- predict(bestt.fit, test)
r2 <- 1 - (sum((pre - test$temp)^2) / sum((test$temp - mean(test$temp))^2))
print(bestt.r2)
## [1] 0.3244161
print(r2)
## [1] -1.269119
print(histt.r2<-c(histt.r2, r2))  # note: this also appends the test r2
## [1] -7.66844559  0.32441607 -0.07507013  0.15674634 -1.26911935

Again, temperature performs similarly to lux. This is actually quite surprising, because temperature appeared to correlate with several other features and generally had less variability than lux. However, there are so few ways to split the data day by day that the results may not be representative of a generalized model and may be more related to the specific noise of each day.

We saw from the visualizations that the measured variables do not behave linearly with respect to time of day. It is clearly brighter and warmer at midday. Let’s introduce a second-order polynomial term for time of day to see if this helps model that relationship.

test<-sample(0:4,1)

bestlx.r2<- -100
bestlx.fit<- 0
bestt.r2<- -100
bestt.fit<- 0
histr2.lx<-c()
histr2.t<-c()
for (loday in 0:4) {
  if (loday != test) {
    # Note: unlike the previous loops, the test day is not excluded from
    # the training set here.
    train<-subset(df, df$day != loday & !is.na(df$lux))
    val<-subset(df, df$day == loday & !is.na(df$lux))
    fit.lx<-lm(lux~poly(timeofday, 2)*x*y, train)
    pre <- predict(fit.lx, val)
    r2 <- 1 - (sum((pre - val$lux)^2) / sum((val$lux - mean(val$lux))^2))
    if (r2 > bestlx.r2) {
      bestlx.r2<-r2
      bestlx.fit<-fit.lx
    }
    histr2.lx<-c(histr2.lx, r2)

    # The temperature model keeps the linear time-of-day term here.
    fit.t<-lm(temp~timeofday*x*y, train)
    pre <- predict(fit.t, val)
    r2 <- 1 - (sum((pre - val$temp)^2) / sum((val$temp - mean(val$temp))^2))
    if (r2 > bestt.r2) {
      bestt.r2<-r2
      bestt.fit<-fit.t
    }
    histr2.t<-c(histr2.t, r2)
  }
}

test<-subset(df, df$day == test & !is.na(df$lux))
pre <- predict(bestlx.fit, test)
r2 <- 1 - (sum((pre - test$lux)^2) / sum((test$lux - mean(test$lux))^2))
print(bestlx.r2)
## [1] 0.2978333
print(r2)
## [1] -32.02225
pre <- predict(bestt.fit, test)
r2 <- 1 - (sum((pre - test$temp)^2) / sum((test$temp - mean(test$temp))^2))
print(bestt.r2)
## [1] 0.5354715
print(r2)
## [1] -7.990196
print(histr2.lx)
## [1]  0.2978333 -1.4798465  0.2069770  0.2242037
print(histr2.t)
## [1]  0.02566935  0.53547153 -0.28970682 -1.32945212

We generally see poor r-squared values on the test set, though lux improves on the validation set with the polynomial term. What is also clear is that there is a lot of variance between days. Chunking the data by day will carry these uncertainties, since we are not learning global ST values but the average time of day. Stated plainly, it doesn’t accomplish our goal of predicting over space and time, only space and time of day.

We can extend this to chunking by collection period. The device collected data over regular intervals as seen here.

ggplot(df, aes(timeofday, time)) + geom_point()

Here, we chunk the data into these sections and use them to split the dataset into train and test sets. With this, we also regain the ability to predict global ST values by reintroducing day.

df$section<-((df$timeofday %/% 7200) + (as.numeric(df$day) * 10))
start<-range(df$section)
stop<-start[2]
start<-start[1]
# Five held-out sections, sampled without replacement.
samplt<-sample(start:stop, 5)
test<-subset(df, df$section %in% samplt & !is.na(df$lux))
nontest<-setdiff(start:stop, samplt)

bestlx.fit <- 0
bestt.fit  <- 0
bestlx.r2  <- -100
bestt.r2   <- -100
r2.lx      <- c()
r2.t       <- c()

for (i in 1:50) {
  samplv<-sample(nontest, 5)
  # Exclude both the validation and the held-out test sections from training.
  train<-subset(df, !(df$section %in% c(samplv, samplt)) & !is.na(df$lux))
  val<-subset(df, df$section %in% samplv & !is.na(df$lux))
  fit.lx<-lm(lux~poly(timeofday, 2)*x*y*day, train)
  fit.t<-lm(temp~poly(timeofday, 2)*x*y*day, train)

  pre<-predict(fit.lx, val)
  r2<-1 - (sum((pre - val$lux)^2) / sum((val$lux - mean(val$lux))^2))
  if (r2 > bestlx.r2) {
    bestlx.fit<-fit.lx
    bestlx.r2<-r2
  }
  r2.lx<-c(r2.lx, r2)

  pre<-predict(fit.t, val)
  r2<-1 - (sum((pre - val$temp)^2) / sum((val$temp - mean(val$temp))^2))
  if (r2 > bestt.r2) {
    bestt.fit<-fit.t
    bestt.r2<-r2
  }
  r2.t<-c(r2.t, r2)
}
print(bestlx.r2)
## [1] 0.2901521
print(bestt.r2)
## [1] 0.2994899
pre <- predict(bestlx.fit, test)
r2 <- 1 - (sum((pre - test$lux)^2) / sum((test$lux - mean(test$lux))^2))
print(r2)
## [1] 0.2388383
pre <- predict(bestt.fit, test)
r2 <- 1 - (sum((pre - test$temp)^2) / sum((test$temp - mean(test$temp))^2))
print(r2)
## [1] 0.7148606
print(summary(r2.lx))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -8404.403   -42.570    -5.481  -559.392    -0.226     0.290
print(summary(r2.t))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -994.4773  -19.3865   -3.5214  -34.3553   -0.6496    0.2995

Here we achieve a better r-squared value on the validation set than the prior model. In the case of temperature, the test-set r-squared is notably high. This is an encouraging result and suggests that there is some predictive capacity here. With the variance in the lux data, it’s unlikely a similar r-squared could be achieved, because there is a significant degree of variance between held-out chunks. The validation data heavily impacts the predictions, suggesting that this chunking method still has too much variability to be modeled effectively with the data that we have.

Splitting the dataset further for the purpose of making test and holdout sets could be done in two ways. We could split the data further in the time dimension, or we could split it spatially. The issue with these is that they would model something more similar to the first model. Where the first two chunking methods tested and validated on ST values that were both spatially and temporally distant, these methods begin to diminish this relationship, with some values in the model being spatially or temporally near. To proceed this way we should reduce the scope of the model’s goal.
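
For concreteness, a spatial version of that chunking might look like the following sketch; the 2 m tile size and the number of held-out cells are arbitrary choices we did not evaluate.

# Hypothetical spatial chunking: tile the room into ~2m x 2m cells and
# hold out whole cells, so test points are spatially distant from
# training points.
df$cell <- (df$x %/% 2) + 10 * (df$y %/% 2)
heldcells <- sample(unique(df$cell), 3)
sp.test  <- subset(df, cell %in% heldcells & !is.na(lux))
sp.train <- subset(df, !(cell %in% heldcells) & !is.na(lux))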

In this model, we sample holdout and test data from one day, simulating the following scenario: “The model has been trained on different days, but only has incomplete data for a certain day. Can the model recover the missing data on this day?” With this we can include data from several days and reduce the size of the held-out data; it just requires that we narrow our intentions from global prediction to single-day prediction.

df$tsect<-(df$X %/% 1000)

start<-range(subset(df, df$day==sample(0:4,1))$tsect)
stop<-start[2]
start<-start[1]
samplt<-sample(start:stop, (stop-start-1) %/% 4)
test<-subset(df, df$tsect %in% samplt & !is.na(df$lux))
train<-subset(df, !(df$tsect %in% samplt) & !is.na(df$lux))

fit.lx<-lm(lux~poly(timeofday, 2)*x*y*day, train)
fit.t<-lm(temp~poly(timeofday, 2)*x*y*day, train)

pre <- predict(fit.lx, test)  # use the newly fit models, not the earlier best fits
r2 <- 1 - (sum((pre - test$lux)^2) / sum((test$lux - mean(test$lux))^2))
print(r2)
## [1] 0.2419603
pre <- predict(fit.t, test)
r2 <- 1 - (sum((pre - test$temp)^2) / sum((test$temp - mean(test$temp))^2))
print(r2)
## [1] 0.4656608

In general, our fit with this model is better than the first two models, and is more consistent than the third model. However, we had to reduce the scope of our problem to accomplish this.

Discussion

These four organizational regimes suggest a few things about the data. The first conveys that a linear model can loosely model some relationship in the ST data at large. This was expected from our correlation matrix. However this has limited capacity, and we do not know if it would perform better if we had sampled test data that wasn’t already spatially and temporally close to other values.

In the second model we experiment with this understanding by testing on entire days. However, we have so few days that the variance in each day is difficult for the linear model to reconcile. Further, this limits us to only predicting the average time of day, since we cannot use the day feature in the model. We cannot use the day feature because using both time and day with every value would make our model akin to the first model, just with extremely biased test sampling.

The third model expands the temporal dimension further, recovering the day feature while still not learning any temporally near ST values. This model on average has a better r-squared than the prior model. The improved fit could be attributed to a more varied distribution in the test-set sampling. Still, in this model and in the prior model we saw too much variance between the chunks, suggesting that we don’t have enough chunks to make confident ST predictions. This may mean our model will improve if we have a larger historic catalog, though it may also mean that we have not chosen the right model.

To address the lack of chunks, we further split the data temporally. However, we took a different approach that narrowed our goal. We modified our goal as follows: recovery of ST values on just a single day. With this framing we can increase the number of chunks significantly; in practice we ended up with 109 temporally divided chunks. This model has a better r-squared than the first model, which took in all of the historic data and constructed a random test set without chunking first.

One limitation that should be noted briefly is that we are repeatedly resampling from this data, and because each method comprises a different test set, the results are not directly comparable and the validity of the results is reduced.

The success here is in the framing of the goal. The original goal, to predict the value of the lux or temperature field at an arbitrary ST location, is not broadly feasible with our data and our chosen model. The dramatic variance we demonstrate in the chunked validation sets in the second and third regimes shows this. The improved r-squared in the final model may suggest that while the ST data over several days is useful, each individual day succumbs to its own effects, which are more useful in predicting values on that day. From this it could be suggested that we should prioritize longer collection sessions over frequent, short sessions. The intuition here is that several long sessions lead to more ST overlap. The improved r-squared occurs when the model is trained on historic data that is temporally distant and in-situ data that is spatially and temporally near. Longer collection sessions may improve both of these qualities in the data. Longer sessions would also yield data that varies in its ST distance from other points. From our plot of time over time of day, we can see that the ST values are fairly evenly spaced temporally. This likely has some impact on our model, and greater variation here would likely improve it.

Impact

If we were able to predict ST values at any position with a model, we could analyze indoor spaces for performance inefficiencies (e.g. poor insulation or circulation). We could also model whether the room is arranged in a way that is comfortable for its occupants. From there we could establish inference models that could help explain which features of indoor spaces are efficient in practice and which are not. Since we cannot yet determine this with much accuracy (though we did demonstrate encouraging results), it may suggest that the problem is more complex than we have modeled it so far.