On May 30, Cassie Kozyrkov asked in this article: https://towardsdatascience.com/when-not-to-use-machine-learning-or-ai-8185650f6a29
whether we can predict the value for day 61. Do we need machine learning, or can we predict the value by analysis?

These are the data for days 1 to 60.

The input from the article

(1,28) (2,17) (3,92) (4,41) (5,9) (6,87) (7,54) (8,3) (9,78) (10,67) (11,1) (12,67) (13,78) (14,3) (15,55) (16,86) (17,8) (18,42) (19,92) (20,17) (21,29) (22,94) (23,28) (24,18) (25,93) (26,40) (27,9) (28,87) (29,53) (30,3) (31,79) (32,66) (33,1) (34,68) (35,77) (36,3) (37,56) (38,86) (39,8) (40,43) (41,92) (42,16) (43,30) (44,94) (45,27) (46,19) (47,93) (48,39) (49,10) (50,88) (51,53) (52,4) (53,80) (54,65) (55,1) (56,69) (57,77) (58,3) (59,57) (60,86)

Building a tibble

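library(tidyverse)   # tibble, ggplot2 and the %>% pipe
library(knitr)       # kable
library(kableExtra)  # kable_styling
x <- seq(1,60)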
y <- c(28,17,92,41,9,87,54,3,78,67,1,67,78,3,55,86,8,42,92,17,29,94,28,18,93,40,9,87,53,3,79,66,1,68,77,3,56,86,8,43,92,16,30,94,27,19,93,39,10,88,53,4,80,65,1,69,77,3,57,86)

dy61 <- tibble(x,y)

head(dy61)

First step

The first step to get a feeling for your data: plot it.

ggplot(dy61,aes(x,y))+ geom_point() + theme_bw()

Obviously Cassie used three sine functions to create the data: the points are drawn alternately from the three functions. To enhance the plot, I added a series label (a, b, c) to the data and plotted again, using one color per sine function.

family <- rep(c("a","b","c"),20) 
dy61 <- dy61 %>% cbind(family)
ggplot(dy61,aes(x,y))+ geom_point(aes(color=as.factor(family))) + theme_bw()

Last lines of the data

kable(dy61[55:60,])  %>% kable_styling(fixed_thead = T)
     x   y  family
55  55   1  a
56  56  69  b
57  57  77  c
58  58   3  a
59  59  57  b
60  60  86  c

From the data and the plots it's obvious that the next value (61) must come from series “a”.

First solution:

Visual extrapolation, refine the plot

# theme_bw() is a complete theme and overrides any theme() call placed before it,
# so the earlier panel.grid.minor setting had no effect and is dropped here
ggplot(dy61,aes(x,y))+ geom_point(aes(color=as.factor(family))) +
    scale_y_continuous(minor_breaks = seq(0 , 100, 5), breaks = seq(0, 100, 10)) + theme_bw()

A closer look at the minima of series b and c reveals that the second value after the turning point is a little below the second value before the turning point (8 versus 9). In series a, the second value before the minimum is 10 (day 49), so I extrapolate the value for day 61 as 9.
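The visual argument can also be checked numerically. The following is a minimal sketch in R, assuming the series are split exactly as above (every third point belongs to the same series). The helper `drop_at_min` and the variable `guess61` are names introduced here for illustration, not part of the original analysis.

```r
y <- c(28,17,92,41,9,87,54,3,78,67,1,67,78,3,55,86,8,42,92,17,29,94,28,18,
       93,40,9,87,53,3,79,66,1,68,77,3,56,86,8,43,92,16,30,94,27,19,93,39,
       10,88,53,4,80,65,1,69,77,3,57,86)

# Split the 60 values into the three alternating series a, b, c
series <- split(y, rep(c("a", "b", "c"), 20))

# Hypothetical helper: second value after the minimum minus
# second value before the minimum
drop_at_min <- function(v) {
  i <- which.min(v)
  v[i + 2] - v[i - 2]
}

# Series b and c have two observed points on both sides of their minimum
drops <- sapply(series[c("b", "c")], drop_at_min)

# Series a: day 61 is the (missing) second point after the minimum,
# so apply the average drop to the second point before the minimum
i_a <- which.min(series$a)
guess61 <- series$a[i_a - 2] + mean(drops)
print(drops)
print(guess61)
```

With this data the drop is the same in series b and c, and applying it to the value at day 49 reproduces the visual estimate.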

Second solution

Construct the sine function from the data and the plotted curves and calculate the missing value: f(x)=a⋅sin(b⋅(x−c))+d

From the data table we get for series a: the maximum is at x = 22 with y = 94, the minimum is at x = 55 with y = 1.

f(x)=a⋅sin(b⋅(x−c))+d

d = (y_max + y_min)/2 = 47.5
a = (y_max − y_min)/2 = 46.5

b = 2π/66, because maximum (x = 22) and minimum (x = 55) lie half a period, 33 days, apart.

c is estimated from the plot and the data as ≈ 5.4.
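The parameter values can be reproduced from the two extremes of series a. This is a small sketch under two assumptions that are not stated in the original: the maximum and minimum are exactly half a period apart, and the sine peaks a quarter period after x = c. The variable names (`period`, `c_est`, and so on) are chosen here for illustration.

```r
# Extremes of series a, read from the data table
x_max <- 22; y_max <- 94
x_min <- 55; y_min <- 1

d <- (y_max + y_min) / 2       # vertical offset
a <- (y_max - y_min) / 2       # amplitude

# Assumption: maximum and minimum are exactly half a period apart
period <- 2 * (x_min - x_max)
b <- 2 * pi / period

# Assumption: the sine reaches its peak a quarter period after x = c
c_est <- x_max - period / 4

print(c(a = a, b = b, c = c_est, d = d, period = period))
```

Under these assumptions the shift comes out at 5.5, close to the 5.4 estimated from the plot.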

Building a function to calculate the values

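y_new <- function(wert){
        # wert: the x value (day) to evaluate
        b <- (2*pi)/66              # frequency for a period of 66 days
        s <- sin(b * (wert-5.4))    # shift c = 5.4
        y <- s * 46.5               # amplitude a = 46.5
        y <- y + 47.5               # offset d = 47.5
        return(y)
}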

Some tests

Test for x = 1 gives 28.5867461 (dataset: 28)
Test for x = 16 gives 86.8558433 (dataset: 86)
Test for x = 37 gives 53.6791736 (dataset: 56)
Test for x = 55 gives 1.0021071 (dataset: 1)

Except for x = 37, the results are precise. Maybe this is due to estimating b visually.
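The spot checks can be extended to every point of series a. The sketch below restates the function with the same constants (46.5, 2π/66, 5.4, 47.5) under the hypothetical name `model`, and tabulates observed against predicted values. The data frame `check` and its column names are illustrative.

```r
# Series a: every third day, starting at day 1
x_a <- seq(1, 58, by = 3)
y_a <- c(28,41,54,67,78,86,92,94,93,87,79,68,56,43,30,19,10,4,1,3)

# Same constants as in y_new, written as a one-liner
model <- function(x) 46.5 * sin((2 * pi / 66) * (x - 5.4)) + 47.5

# Observed vs. predicted for all 20 points of series a
check <- data.frame(x = x_a, observed = y_a,
                    predicted = round(model(x_a), 2))
check$residual <- check$observed - check$predicted
print(check)

# Largest absolute deviation between data and model
max_abs_residual <- max(abs(check$residual))
print(max_abs_residual)
```

Looking at the full residual column shows whether x = 37 is an isolated outlier or part of a wider stretch where the hand-estimated curve runs slightly off.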

Calculated results:

The calculated result for day 61 is 8.6228101.
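As a cross-check on the hand-derived parameters, the same sine model can be fitted by nonlinear least squares. This is a swapped-in technique, not the method used above: a sketch with `nls` from base R's stats package, using the hand-derived values as starting values. The parameter names `amp`, `freq`, `shift` and `level` are chosen here for illustration and correspond to a, b, c and d.

```r
# Series a: every third day, starting at day 1
dat <- data.frame(
  x = seq(1, 58, by = 3),
  y = c(28,41,54,67,78,86,92,94,93,87,79,68,56,43,30,19,10,4,1,3)
)

# Fit y = amp * sin(freq * (x - shift)) + level,
# starting from the values estimated by hand
fit <- nls(y ~ amp * sin(freq * (x - shift)) + level,
           data  = dat,
           start = list(amp = 46.5, freq = 2 * pi / 66,
                        shift = 5.4, level = 47.5))

# Evaluate the fitted curve at day 61 from the estimated coefficients
p <- coef(fit)
pred61 <- p[["amp"]] * sin(p[["freq"]] * (61 - p[["shift"]])) + p[["level"]]
print(p)
print(pred61)
```

Because the data are rounded to integers, the fitted parameters need not match the hand-derived ones exactly. The point of the comparison is only whether both routes give a similar value for day 61.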

Conclusion

Plotting the data gives basic insight and a possible solution without machine learning.

Author:

Peter Hahn
https://www.linkedin.com/in/kphahn/

Hopefully I got it right :-)