This project is from DQLab Data Science Course. We want to know the relation between food selling and people go to dentist using Linear Regression. The dataset can be download here dentist and food selling

libraries

library(dplyr) # data manipulation
library(Hmisc) # data analysis lib

Data Import

dt_dental <- read.csv("E:/2025/DQLAB/R/6 Regression/kunjungan_dokter_gigi_kota_x_dqlab.tsv", header = TRUE, sep = "\t")
dt_food_sell <- read.csv("E:/2025/DQLAB/R/6 Regression/tingkat_penjualan_kota_x_dqlab.tsv", header = TRUE, sep = "\t")
head(dt_dental)
##   Bulan Tahun tingkat.kunjungan.ke.dokter.gigi
## 1     1  1996                               37
## 2     2  1996                               32
## 3     3  1996                               56
## 4     4  1996                               43
## 5     5  1996                               52
## 6     6  1996                               70
head(dt_food_sell)
##   No Bulan Tahun penjualan.permen penjualan.sereal penjualan.buah.pisang
## 1  1     1  1996           200661            10324                112146
## 2  2     2  1996           297141            13150                175417
## 3  3     3  1996           235586             5398                104645
## 4  4     4  1996           299498             7609                112283
## 5  5     5  1996           395824             5190                137083
## 6  6     6  1996           247212            14571                198130
# combine both data
# by.x and by.y is for unique key in both data frame 
dt <- merge(dt_dental, dt_food_sell, by.x = c("Bulan", "Tahun"), by.y = c("Bulan", "Tahun"), sort = FALSE)
str(dt)
## 'data.frame':    24 obs. of  7 variables:
##  $ Bulan                           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Tahun                           : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ tingkat.kunjungan.ke.dokter.gigi: int  37 32 56 43 52 70 64 88 74 98 ...
##  $ No                              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ penjualan.permen                : int  200661 297141 235586 299498 395824 247212 273899 184405 141211 292225 ...
##  $ penjualan.sereal                : int  10324 13150 5398 7609 5190 14571 13407 8914 3663 14756 ...
##  $ penjualan.buah.pisang           : int  112146 175417 104645 112283 137083 198130 194685 100091 194972 133166 ...

Data Wrangling and Visualization

summary(dt)
##      Bulan           Tahun      tingkat.kunjungan.ke.dokter.gigi
##  Min.   : 1.00   Min.   :1996   Min.   : 32.00                  
##  1st Qu.: 3.75   1st Qu.:1996   1st Qu.: 55.00                  
##  Median : 6.50   Median :1996   Median : 75.00                  
##  Mean   : 6.50   Mean   :1996   Mean   : 75.17                  
##  3rd Qu.: 9.25   3rd Qu.:1997   3rd Qu.: 90.50                  
##  Max.   :12.00   Max.   :1997   Max.   :132.00                  
##        No        penjualan.permen penjualan.sereal penjualan.buah.pisang
##  Min.   : 1.00   Min.   :141211   Min.   : 2372    Min.   :100091       
##  1st Qu.: 6.75   1st Qu.:256775   1st Qu.: 5143    1st Qu.:112249       
##  Median :12.50   Median :298320   Median : 9366    Median :139996       
##  Mean   :12.50   Mean   :310964   Mean   : 9282    Mean   :140682       
##  3rd Qu.:18.25   3rd Qu.:393837   3rd Qu.:13711    3rd Qu.:157232       
##  Max.   :24.00   Max.   :467572   Max.   :16209    Max.   :198130

People who consumed candies and bananas higher than who consumed cereal. The maximum doctor appointment is quite high, there are 132 times people going to dentist.

Not everyone will get toothache after consuming candies, bananas, or cereal. Then we should search for the delay effect, the time gap between we eating food and get toothache.

delayed <- data.frame(month = dt$Bulan, year = dt$Tahun,
                      dentist_appointment = dt$tingkat.kunjungan.ke.dokter.gigi,
                      
                      # candy
                      candy_sell = dt$penjualan.permen,
                      # lag, n for monthly
                      candy_sell1 = lag(dt$penjualan.permen),
                      candy_sell2 = lag(dt$penjualan.permen,2),
                      candy_sell3 = lag(dt$penjualan.permen,3),
                      candy_sell4 = lag(dt$penjualan.permen,4),
                      candy_sell5 = lag(dt$penjualan.permen,5),
                      candy_sell6 = lag(dt$penjualan.permen,6),
                      
                      # bananas
                      banana_sell = dt$penjualan.buah.pisang,
                      banana_sell1 = lag(dt$penjualan.buah.pisang),
                      banana_sell2 = lag(dt$penjualan.buah.pisang,2),
                      banana_sell3 = lag(dt$penjualan.buah.pisang,3),
                      banana_sell4 = lag(dt$penjualan.buah.pisang,4),
                      banana_sell5 = lag(dt$penjualan.buah.pisang,5),
                      banana_sell6 = lag(dt$penjualan.buah.pisang,6),
                      
                      # cereal
                      cereal_sell = dt$penjualan.sereal,
                      cereal_sell1 = lag(dt$penjualan.sereal),
                      cereal_sell2 = lag(dt$penjualan.sereal,2),
                      cereal_sell3 = lag(dt$penjualan.sereal,3),
                      cereal_sell4 = lag(dt$penjualan.sereal,4),
                      cereal_sell5 = lag(dt$penjualan.sereal,5),
                      cereal_sell6 = lag(dt$penjualan.sereal,6))
delayed
##    month year dentist_appointment candy_sell candy_sell1 candy_sell2
## 1      1 1996                  37     200661          NA          NA
## 2      2 1996                  32     297141      200661          NA
## 3      3 1996                  56     235586      297141      200661
## 4      4 1996                  43     299498      235586      297141
## 5      5 1996                  52     395824      299498      235586
## 6      6 1996                  70     247212      395824      299498
## 7      7 1996                  64     273899      247212      395824
## 8      8 1996                  88     184405      273899      247212
## 9      9 1996                  74     141211      184405      273899
## 10    10 1996                  98     292225      141211      184405
## 11    11 1996                  80     227864      292225      141211
## 12    12 1996                  46     259963      227864      292225
## 13     1 1997                  66     433783      259963      227864
## 14     2 1997                  78     343034      433783      259963
## 15     3 1997                  46     419255      343034      433783
## 16     4 1997                  62     393175      419255      343034
## 17     5 1997                 124     304954      393175      419255
## 18     6 1997                 106     439380      304954      393175
## 19     7 1997                 110     270706      439380      304954
## 20     8 1997                  98     319472      270706      439380
## 21     9 1997                  76     262688      319472      270706
## 22    10 1997                 132     322140      262688      319472
## 23    11 1997                  82     467572      322140      262688
## 24    12 1997                  84     431484      467572      322140
##    candy_sell3 candy_sell4 candy_sell5 candy_sell6 banana_sell banana_sell1
## 1           NA          NA          NA          NA      112146           NA
## 2           NA          NA          NA          NA      175417       112146
## 3           NA          NA          NA          NA      104645       175417
## 4       200661          NA          NA          NA      112283       104645
## 5       297141      200661          NA          NA      137083       112283
## 6       235586      297141      200661          NA      198130       137083
## 7       299498      235586      297141      200661      194685       198130
## 8       395824      299498      235586      297141      100091       194685
## 9       247212      395824      299498      235586      194972       100091
## 10      273899      247212      395824      299498      133166       194972
## 11      184405      273899      247212      395824      155598       133166
## 12      141211      184405      273899      247212      104563       155598
## 13      292225      141211      184405      273899      104216       104563
## 14      227864      292225      141211      184405      142908       104216
## 15      259963      227864      292225      141211      135994       142908
## 16      433783      259963      227864      292225      154659       135994
## 17      343034      433783      259963      227864      122983       154659
## 18      419255      343034      433783      259963      104657       122983
## 19      393175      419255      343034      433783      163008       104657
## 20      304954      393175      419255      343034      149116       163008
## 21      439380      304954      393175      419255      144727       149116
## 22      270706      439380      304954      393175      148500       144727
## 23      319472      270706      439380      304954      120676       148500
## 24      262688      319472      270706      439380      162135       120676
##    banana_sell2 banana_sell3 banana_sell4 banana_sell5 banana_sell6 cereal_sell
## 1            NA           NA           NA           NA           NA       10324
## 2            NA           NA           NA           NA           NA       13150
## 3        112146           NA           NA           NA           NA        5398
## 4        175417       112146           NA           NA           NA        7609
## 5        104645       175417       112146           NA           NA        5190
## 6        112283       104645       175417       112146           NA       14571
## 7        137083       112283       104645       175417       112146       13407
## 8        198130       137083       112283       104645       175417        8914
## 9        194685       198130       137083       112283       104645        3663
## 10       100091       194685       198130       137083       112283       14756
## 11       194972       100091       194685       198130       137083        3883
## 12       133166       194972       100091       194685       198130        5720
## 13       155598       133166       194972       100091       194685       16209
## 14       104563       155598       133166       194972       100091        5955
## 15       104216       104563       155598       133166       194972       14789
## 16       142908       104216       104563       155598       133166       11150
## 17       135994       142908       104216       104563       155598        4930
## 18       154659       135994       142908       104216       104563       15041
## 19       122983       154659       135994       142908       104216       14210
## 20       104657       122983       154659       135994       142908        5002
## 21       163008       104657       122983       154659       135994        3172
## 22       149116       163008       104657       122983       154659        9818
## 23       144727       149116       163008       104657       122983       13545
## 24       148500       144727       149116       163008       104657        2372
##    cereal_sell1 cereal_sell2 cereal_sell3 cereal_sell4 cereal_sell5
## 1            NA           NA           NA           NA           NA
## 2         10324           NA           NA           NA           NA
## 3         13150        10324           NA           NA           NA
## 4          5398        13150        10324           NA           NA
## 5          7609         5398        13150        10324           NA
## 6          5190         7609         5398        13150        10324
## 7         14571         5190         7609         5398        13150
## 8         13407        14571         5190         7609         5398
## 9          8914        13407        14571         5190         7609
## 10         3663         8914        13407        14571         5190
## 11        14756         3663         8914        13407        14571
## 12         3883        14756         3663         8914        13407
## 13         5720         3883        14756         3663         8914
## 14        16209         5720         3883        14756         3663
## 15         5955        16209         5720         3883        14756
## 16        14789         5955        16209         5720         3883
## 17        11150        14789         5955        16209         5720
## 18         4930        11150        14789         5955        16209
## 19        15041         4930        11150        14789         5955
## 20        14210        15041         4930        11150        14789
## 21         5002        14210        15041         4930        11150
## 22         3172         5002        14210        15041         4930
## 23         9818         3172         5002        14210        15041
## 24        13545         9818         3172         5002        14210
##    cereal_sell6
## 1            NA
## 2            NA
## 3            NA
## 4            NA
## 5            NA
## 6            NA
## 7         10324
## 8         13150
## 9          5398
## 10         7609
## 11         5190
## 12        14571
## 13        13407
## 14         8914
## 15         3663
## 16        14756
## 17         3883
## 18         5720
## 19        16209
## 20         5955
## 21        14789
## 22        11150
## 23         4930
## 24        15041
# scatter for relation candy variable
plot(delayed$dentist_appointment, delayed$candy_sell1, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 1st month", 
     col = "red")

plot(delayed$dentist_appointment, delayed$candy_sell2, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 2nd month", 
     col = "orange")

plot(delayed$dentist_appointment, delayed$candy_sell3, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 3rd month", 
     col = "yellowgreen")

plot(delayed$dentist_appointment, delayed$candy_sell4, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 4th month", 
     col = "green")

plot(delayed$dentist_appointment, delayed$candy_sell5, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 5th month", 
     col = "blue")

plot(delayed$dentist_appointment, delayed$candy_sell6, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 6th month", 
     col = "purple")

We can see from the graph, the 4th month starting to show the effect selling candy with dentist appointment. We’ll see for the other variable on the 4th month.

plot(delayed$dentist_appointment, delayed$cereal_sell6, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 6th month", 
     col = "gray")

plot(delayed$dentist_appointment, delayed$banana_sell6, 
     pch = 20, 
     xlab = "Dentist Appoinment", 
     ylab = "Candies selling 6th month", 
     col = "steelblue")

The other variable also starting to show the relation on the 4th month, next analysis will start from the 4th month.

Regression Model

# Regression using basic R
month_4th <- data.frame(month = delayed$month,
                  year = delayed$year,
                  dentist_appointment = delayed$dentist_appointment,
                  candies_selling = delayed$candy_sell4,
                  bananas_selling = delayed$banana_sell4,
                  cereal_selling = delayed$cereal_sell4) %>% na.omit()
month_4th
##    month year dentist_appointment candies_selling bananas_selling
## 5      5 1996                  52          200661          112146
## 6      6 1996                  70          297141          175417
## 7      7 1996                  64          235586          104645
## 8      8 1996                  88          299498          112283
## 9      9 1996                  74          395824          137083
## 10    10 1996                  98          247212          198130
## 11    11 1996                  80          273899          194685
## 12    12 1996                  46          184405          100091
## 13     1 1997                  66          141211          194972
## 14     2 1997                  78          292225          133166
## 15     3 1997                  46          227864          155598
## 16     4 1997                  62          259963          104563
## 17     5 1997                 124          433783          104216
## 18     6 1997                 106          343034          142908
## 19     7 1997                 110          419255          135994
## 20     8 1997                  98          393175          154659
## 21     9 1997                  76          304954          122983
## 22    10 1997                 132          439380          104657
## 23    11 1997                  82          270706          163008
## 24    12 1997                  84          319472          149116
##    cereal_selling
## 5           10324
## 6           13150
## 7            5398
## 8            7609
## 9            5190
## 10          14571
## 11          13407
## 12           8914
## 13           3663
## 14          14756
## 15           3883
## 16           5720
## 17          16209
## 18           5955
## 19          14789
## 20          11150
## 21           4930
## 22          15041
## 23          14210
## 24           5002
# Regression with linear model
# dentist appointment is our target (dependent variable) and candies_selling is independent variable
mdl_lm <- lm(dentist_appointment ~ candies_selling + bananas_selling + cereal_selling, data = month_4th)
summary(mdl_lm)
## 
## Call:
## lm(formula = dentist_appointment ~ candies_selling + bananas_selling + 
##     cereal_selling, data = month_4th)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.188  -7.019  -2.460   9.383  20.098 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.004e+00  2.001e+01  -0.350  0.73086    
## candies_selling  2.112e-04  4.220e-05   5.005  0.00013 ***
## bananas_selling  9.421e-05  9.987e-05   0.943  0.35952    
## cereal_selling   1.286e-03  7.718e-04   1.667  0.11500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.47 on 16 degrees of freedom
## Multiple R-squared:  0.734,  Adjusted R-squared:  0.6842 
## F-statistic: 14.72 on 3 and 16 DF,  p-value: 7.304e-05

Interpreting the Model

From the model above, the most impactful for the dentist appointment is candies selling (shown by ***), meanwhile the other predictors (banana sell and cereal) have pvalue > 0.05 are not significant to our model.

Formulas we get from the model result for the candy selling is:

y_hat = -7.004 + 0.0002112 . x

Intercept (b0) is a constant, slope (b1) is coefficient of candies_selling. The intercept value is negative, which means it is not realistic for dentist appointments number. It may indicate that people visit dentist for another reasone beside sugar causes or other missing factor like general health habits or insurances.

For example, the candies selling is 100,000, then

y_hat = -7.004 + 0.0002112 . 100000

y_hat = -7.004 + 21.12

y_hat = 14.116

This means for every candy sales increased by 100k will affect the dentist appointment. The dentist appointment will increase 14 people for the next 4 months.

The Residual Standard Error is 13.47, means that the average of an error in prediction is 13 appointments. The R-squared is about 73% of dentist appointment explained by the predictors.

predict_data <- data.frame(
     month = c(1,2,3,4,5),
     year = c(1998,1998,1998,1998,1998),
     candies_selling = c(345646,454344,346987,209854,254634),
     bananas_selling = c(102331,234123,902313,222135,213563),
     cereal_selling = c(4231234,213452,242556,890982,234155)
     )

predict(mdl_lm, predict_data)
##         1         2         3         4         5 
## 5518.6449  385.5959  463.3135 1204.3948  368.1111

This is prediction dentist appointment for 5 month (Jan to May) using linear regression. Many people are going to dentist in January and April due to high selling of candy, banana, and cereal.