On May 30th, Cassie Kozyrkov asked in this article: https://towardsdatascience.com/when-not-to-use-machine-learning-or-ai-8185650f6a29
whether we can predict the value for day 61. Do we need machine learning, or can we predict the value by analysis?
These are the data for days 1 to 60:
(1,28) (2,17) (3,92) (4,41) (5,9) (6,87) (7,54) (8,3) (9,78) (10,67) (11,1) (12,67) (13,78) (14,3) (15,55) (16,86) (17,8) (18,42) (19,92) (20,17) (21,29) (22,94) (23,28) (24,18) (25,93) (26,40) (27,9) (28,87) (29,53) (30,3) (31,79) (32,66) (33,1) (34,68) (35,77) (36,3) (37,56) (38,86) (39,8) (40,43) (41,92) (42,16) (43,30) (44,94) (45,27) (46,19) (47,93) (48,39) (49,10) (50,88) (51,53) (52,4) (53,80) (54,65) (55,1) (56,69) (57,77) (58,3) (59,57) (60,86)
library(tidyverse)   # tibble(), ggplot(), %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
x <- seq(1,60)
y <- c(28,17,92,41,9,87,54,3,78,67,1,67,78,3,55,86,8,42,92,17,29,94,28,18,93,40,9,87,53,3,79,66,1,68,77,3,56,86,8,43,92,16,30,94,27,19,93,39,10,88,53,4,80,65,1,69,77,3,57,86)
dy61 <- tibble(x,y)
head(dy61)
The first step to get a feeling for your data: plot it.
ggplot(dy61,aes(x,y))+ geom_point() + theme_bw()
The plot suggests that Cassie used three sine functions to create the data, with the values drawn alternately from the three functions. To enhance the plot I added a series label (a, b, c) to the data and plotted again, using one color per sine function.
family <- rep(c("a","b","c"),20)
dy61 <- dy61 %>% cbind(family)
ggplot(dy61,aes(x,y))+ geom_point(aes(color=as.factor(family))) + theme_bw()
kable(dy61[55:60,]) %>% kable_styling(fixed_thead = T)
| | x | y | family |
|---|---|---|---|
| 55 | 55 | 1 | a |
| 56 | 56 | 69 | b |
| 57 | 57 | 77 | c |
| 58 | 58 | 3 | a |
| 59 | 59 | 57 | b |
| 60 | 60 | 86 | c |
From the data and the plots it's obvious that the next value (61) must be from series “a”.
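The series membership can also be checked with a little modular arithmetic. A minimal sketch, assuming the labels a, b, c keep repeating every three days starting with "a" on day 1; the helper name `series_of` is made up for illustration:

```r
# Labels repeat every three days, starting with "a" on day 1
series_labels <- c("a", "b", "c")

# Hypothetical helper: series label for a given day number
series_of <- function(day) {
  series_labels[(day - 1) %% 3 + 1]
}

# Days 55 and 58 are series "a" in the table, and day 61 continues that pattern
series_of(c(55, 58, 61))
```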
Visual extrapolation, refine the plot
# theme_bw() replaces the whole theme, so a theme() call placed before it has no effect;
# the visible minor grid lines come from theme_bw() itself, positioned by minor_breaks
ggplot(dy61,aes(x,y))+ geom_point(aes(color=as.factor(family))) +
scale_y_continuous(minor_breaks = seq(0 , 100, 5), breaks = seq(0, 100, 10)) + theme_bw()
A closer look reveals that, in all three series, the second value after the turning point is a little below the second value before the turning point. Based on this I extrapolate a value of 9 for day 61.
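The visual argument can be roughly reproduced from the numbers. The sketch below compares the values two steps before and two steps after the completed minima of series b (x = 11) and series c (x = 33), then carries the observed offset over to series a, whose minimum is at x = 55. That the same offset applies to series a is an assumption, not something the data can confirm yet.

```r
# Data for days 1 to 60, copied from the text
y <- c(28,17,92,41,9,87,54,3,78,67,1,67,78,3,55,86,8,42,92,17,
       29,94,28,18,93,40,9,87,53,3,79,66,1,68,77,3,56,86,8,43,
       92,16,30,94,27,19,93,39,10,88,53,4,80,65,1,69,77,3,57,86)

# Completed minima: series b at x = 11, series c at x = 33
minima <- c(11, 33)

# Within one series, neighbouring points are 3 days apart, so two steps = 6 days
second_before <- y[minima - 6]
second_after  <- y[minima + 6]
offset <- second_after - second_before
offset

# Assumption: series a shows the same offset around its minimum at x = 55,
# so day 61 (two steps after) is estimated from day 49 (two steps before)
guess_61 <- y[55 - 6] + mean(offset)
guess_61
```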
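Construct the sine function from the data and the plotted curves and calculate the missing value, using the general form f(x)=a⋅sin(b⋅(x−c))+d
From the data table we get for series a: the maximum is at x = 22 with y = 94, the minimum is at x = 55 with y = 1
d = (y_max + y_min)/2 = (94 + 1)/2 = 47.5
a = (y_max − y_min)/2 = (94 − 1)/2 = 46.5
b = 2π/66, because maximum and minimum are half a period apart (55 − 22 = 33), which gives a full period of 66
c is estimated from the plot and the data as ~5.4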
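As a cross-check, the parameters can be computed directly from the two extremes of series a. This is only a sketch: it assumes the true maximum and minimum of the underlying curve fall exactly on the sampled days 22 and 55, which the data cannot guarantee. The variable names are chosen for illustration.

```r
# Extremes of series a, read from the data table
x_max <- 22; y_max <- 94
x_min <- 55; y_min <- 1

d <- (y_max + y_min) / 2        # vertical offset
a <- (y_max - y_min) / 2        # amplitude
period <- 2 * (x_min - x_max)   # maximum and minimum are half a period apart
b <- 2 * pi / period            # angular frequency

# A sine reaches its maximum a quarter period after its phase shift c,
# so under the assumption above c = x_max - period / 4
c_est <- x_max - period / 4

c(a = a, b = b, c = c_est, d = d, period = period)
```

Under these assumptions the phase shift comes out at 5.5, close to the visual estimate of ~5.4.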
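Building a function to calculate the values:
y_new <- function(wert){
# wert (German for "value") is the x position to evaluate
b <- (2*pi)/66              # angular frequency for a period of 66
s <- sin(b * (wert-5.4))    # phase shift c = 5.4
y <- s * 46.5               # amplitude a = 46.5
y <- y + 47.5               # vertical offset d = 47.5
return(y)
}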
Test for x = 1 gives 28.5867461 (dataset: 28)
Test for x = 16 gives 86.8558433 (dataset: 86)
Test for x = 37 gives 53.6791736 (dataset: 56)
Test for x = 55 gives 1.0021071 (dataset: 1)
Except for x = 37 the results are close to the data. Maybe this is due to estimating b visually.
The calculated result for point 61 is 8.6228101.
Plotting the data gives basic insight and a possible solution without machine learning.
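The spot checks can be extended to every point of series a. The sketch below uses the same parameter values as in the text (46.5, 2π/66, 5.4, 47.5) and lists observed values, fitted values and residuals; the data frame name `check` is chosen for illustration.

```r
# Data for days 1 to 60, copied from the text
y <- c(28,17,92,41,9,87,54,3,78,67,1,67,78,3,55,86,8,42,92,17,
       29,94,28,18,93,40,9,87,53,3,79,66,1,68,77,3,56,86,8,43,
       92,16,30,94,27,19,93,39,10,88,53,4,80,65,1,69,77,3,57,86)
x <- seq_along(y)

# Same model as in the text: a = 46.5, b = 2*pi/66, c = 5.4, d = 47.5
y_new <- function(wert) {
  46.5 * sin((2 * pi / 66) * (wert - 5.4)) + 47.5
}

# Series a consists of days 1, 4, 7, ..., 58
idx <- seq(1, 60, by = 3)
check <- data.frame(x = x[idx],
                    observed = y[idx],
                    fitted = round(y_new(x[idx]), 2))
check$residual <- check$observed - check$fitted
check

# Prediction for day 61
y_new(61)
```

Looking at the whole residual column rather than four hand-picked points makes it easier to see whether the deviation at x = 37 is an isolated case or part of a systematic drift.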