This project is from DQLab Data Science Course. We want to know the relation between food selling and people go to dentist using Linear Regression. The dataset can be download here dentist and food selling
Data Import
dt_dental <- read.csv("E:/2025/DQLAB/R/6 Regression/kunjungan_dokter_gigi_kota_x_dqlab.tsv", header = TRUE, sep = "\t")
dt_food_sell <- read.csv("E:/2025/DQLAB/R/6 Regression/tingkat_penjualan_kota_x_dqlab.tsv", header = TRUE, sep = "\t")## Bulan Tahun tingkat.kunjungan.ke.dokter.gigi
## 1 1 1996 37
## 2 2 1996 32
## 3 3 1996 56
## 4 4 1996 43
## 5 5 1996 52
## 6 6 1996 70
## No Bulan Tahun penjualan.permen penjualan.sereal penjualan.buah.pisang
## 1 1 1 1996 200661 10324 112146
## 2 2 2 1996 297141 13150 175417
## 3 3 3 1996 235586 5398 104645
## 4 4 4 1996 299498 7609 112283
## 5 5 5 1996 395824 5190 137083
## 6 6 6 1996 247212 14571 198130
# combine both data
# by.x and by.y is for unique key in both data frame
dt <- merge(dt_dental, dt_food_sell, by.x = c("Bulan", "Tahun"), by.y = c("Bulan", "Tahun"), sort = FALSE)
str(dt)## 'data.frame': 24 obs. of 7 variables:
## $ Bulan : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Tahun : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
## $ tingkat.kunjungan.ke.dokter.gigi: int 37 32 56 43 52 70 64 88 74 98 ...
## $ No : int 1 2 3 4 5 6 7 8 9 10 ...
## $ penjualan.permen : int 200661 297141 235586 299498 395824 247212 273899 184405 141211 292225 ...
## $ penjualan.sereal : int 10324 13150 5398 7609 5190 14571 13407 8914 3663 14756 ...
## $ penjualan.buah.pisang : int 112146 175417 104645 112283 137083 198130 194685 100091 194972 133166 ...
Data Wrangling and Visualization
## Bulan Tahun tingkat.kunjungan.ke.dokter.gigi
## Min. : 1.00 Min. :1996 Min. : 32.00
## 1st Qu.: 3.75 1st Qu.:1996 1st Qu.: 55.00
## Median : 6.50 Median :1996 Median : 75.00
## Mean : 6.50 Mean :1996 Mean : 75.17
## 3rd Qu.: 9.25 3rd Qu.:1997 3rd Qu.: 90.50
## Max. :12.00 Max. :1997 Max. :132.00
## No penjualan.permen penjualan.sereal penjualan.buah.pisang
## Min. : 1.00 Min. :141211 Min. : 2372 Min. :100091
## 1st Qu.: 6.75 1st Qu.:256775 1st Qu.: 5143 1st Qu.:112249
## Median :12.50 Median :298320 Median : 9366 Median :139996
## Mean :12.50 Mean :310964 Mean : 9282 Mean :140682
## 3rd Qu.:18.25 3rd Qu.:393837 3rd Qu.:13711 3rd Qu.:157232
## Max. :24.00 Max. :467572 Max. :16209 Max. :198130
People who consumed candies and bananas higher than who consumed cereal. The maximum doctor appointment is quite high, there are 132 times people going to dentist.
Not everyone will get toothache after consuming candies, bananas, or cereal. Then we should search for the delay effect, the time gap between we eating food and get toothache.
delayed <- data.frame(month = dt$Bulan, year = dt$Tahun,
dentist_appointment = dt$tingkat.kunjungan.ke.dokter.gigi,
# candy
candy_sell = dt$penjualan.permen,
# lag, n for monthly
candy_sell1 = lag(dt$penjualan.permen),
candy_sell2 = lag(dt$penjualan.permen,2),
candy_sell3 = lag(dt$penjualan.permen,3),
candy_sell4 = lag(dt$penjualan.permen,4),
candy_sell5 = lag(dt$penjualan.permen,5),
candy_sell6 = lag(dt$penjualan.permen,6),
# bananas
banana_sell = dt$penjualan.buah.pisang,
banana_sell1 = lag(dt$penjualan.buah.pisang),
banana_sell2 = lag(dt$penjualan.buah.pisang,2),
banana_sell3 = lag(dt$penjualan.buah.pisang,3),
banana_sell4 = lag(dt$penjualan.buah.pisang,4),
banana_sell5 = lag(dt$penjualan.buah.pisang,5),
banana_sell6 = lag(dt$penjualan.buah.pisang,6),
# cereal
cereal_sell = dt$penjualan.sereal,
cereal_sell1 = lag(dt$penjualan.sereal),
cereal_sell2 = lag(dt$penjualan.sereal,2),
cereal_sell3 = lag(dt$penjualan.sereal,3),
cereal_sell4 = lag(dt$penjualan.sereal,4),
cereal_sell5 = lag(dt$penjualan.sereal,5),
cereal_sell6 = lag(dt$penjualan.sereal,6))
delayed## month year dentist_appointment candy_sell candy_sell1 candy_sell2
## 1 1 1996 37 200661 NA NA
## 2 2 1996 32 297141 200661 NA
## 3 3 1996 56 235586 297141 200661
## 4 4 1996 43 299498 235586 297141
## 5 5 1996 52 395824 299498 235586
## 6 6 1996 70 247212 395824 299498
## 7 7 1996 64 273899 247212 395824
## 8 8 1996 88 184405 273899 247212
## 9 9 1996 74 141211 184405 273899
## 10 10 1996 98 292225 141211 184405
## 11 11 1996 80 227864 292225 141211
## 12 12 1996 46 259963 227864 292225
## 13 1 1997 66 433783 259963 227864
## 14 2 1997 78 343034 433783 259963
## 15 3 1997 46 419255 343034 433783
## 16 4 1997 62 393175 419255 343034
## 17 5 1997 124 304954 393175 419255
## 18 6 1997 106 439380 304954 393175
## 19 7 1997 110 270706 439380 304954
## 20 8 1997 98 319472 270706 439380
## 21 9 1997 76 262688 319472 270706
## 22 10 1997 132 322140 262688 319472
## 23 11 1997 82 467572 322140 262688
## 24 12 1997 84 431484 467572 322140
## candy_sell3 candy_sell4 candy_sell5 candy_sell6 banana_sell banana_sell1
## 1 NA NA NA NA 112146 NA
## 2 NA NA NA NA 175417 112146
## 3 NA NA NA NA 104645 175417
## 4 200661 NA NA NA 112283 104645
## 5 297141 200661 NA NA 137083 112283
## 6 235586 297141 200661 NA 198130 137083
## 7 299498 235586 297141 200661 194685 198130
## 8 395824 299498 235586 297141 100091 194685
## 9 247212 395824 299498 235586 194972 100091
## 10 273899 247212 395824 299498 133166 194972
## 11 184405 273899 247212 395824 155598 133166
## 12 141211 184405 273899 247212 104563 155598
## 13 292225 141211 184405 273899 104216 104563
## 14 227864 292225 141211 184405 142908 104216
## 15 259963 227864 292225 141211 135994 142908
## 16 433783 259963 227864 292225 154659 135994
## 17 343034 433783 259963 227864 122983 154659
## 18 419255 343034 433783 259963 104657 122983
## 19 393175 419255 343034 433783 163008 104657
## 20 304954 393175 419255 343034 149116 163008
## 21 439380 304954 393175 419255 144727 149116
## 22 270706 439380 304954 393175 148500 144727
## 23 319472 270706 439380 304954 120676 148500
## 24 262688 319472 270706 439380 162135 120676
## banana_sell2 banana_sell3 banana_sell4 banana_sell5 banana_sell6 cereal_sell
## 1 NA NA NA NA NA 10324
## 2 NA NA NA NA NA 13150
## 3 112146 NA NA NA NA 5398
## 4 175417 112146 NA NA NA 7609
## 5 104645 175417 112146 NA NA 5190
## 6 112283 104645 175417 112146 NA 14571
## 7 137083 112283 104645 175417 112146 13407
## 8 198130 137083 112283 104645 175417 8914
## 9 194685 198130 137083 112283 104645 3663
## 10 100091 194685 198130 137083 112283 14756
## 11 194972 100091 194685 198130 137083 3883
## 12 133166 194972 100091 194685 198130 5720
## 13 155598 133166 194972 100091 194685 16209
## 14 104563 155598 133166 194972 100091 5955
## 15 104216 104563 155598 133166 194972 14789
## 16 142908 104216 104563 155598 133166 11150
## 17 135994 142908 104216 104563 155598 4930
## 18 154659 135994 142908 104216 104563 15041
## 19 122983 154659 135994 142908 104216 14210
## 20 104657 122983 154659 135994 142908 5002
## 21 163008 104657 122983 154659 135994 3172
## 22 149116 163008 104657 122983 154659 9818
## 23 144727 149116 163008 104657 122983 13545
## 24 148500 144727 149116 163008 104657 2372
## cereal_sell1 cereal_sell2 cereal_sell3 cereal_sell4 cereal_sell5
## 1 NA NA NA NA NA
## 2 10324 NA NA NA NA
## 3 13150 10324 NA NA NA
## 4 5398 13150 10324 NA NA
## 5 7609 5398 13150 10324 NA
## 6 5190 7609 5398 13150 10324
## 7 14571 5190 7609 5398 13150
## 8 13407 14571 5190 7609 5398
## 9 8914 13407 14571 5190 7609
## 10 3663 8914 13407 14571 5190
## 11 14756 3663 8914 13407 14571
## 12 3883 14756 3663 8914 13407
## 13 5720 3883 14756 3663 8914
## 14 16209 5720 3883 14756 3663
## 15 5955 16209 5720 3883 14756
## 16 14789 5955 16209 5720 3883
## 17 11150 14789 5955 16209 5720
## 18 4930 11150 14789 5955 16209
## 19 15041 4930 11150 14789 5955
## 20 14210 15041 4930 11150 14789
## 21 5002 14210 15041 4930 11150
## 22 3172 5002 14210 15041 4930
## 23 9818 3172 5002 14210 15041
## 24 13545 9818 3172 5002 14210
## cereal_sell6
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 10324
## 8 13150
## 9 5398
## 10 7609
## 11 5190
## 12 14571
## 13 13407
## 14 8914
## 15 3663
## 16 14756
## 17 3883
## 18 5720
## 19 16209
## 20 5955
## 21 14789
## 22 11150
## 23 4930
## 24 15041
# scatter for relation candy variable
plot(delayed$dentist_appointment, delayed$candy_sell1,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 1st month",
col = "red")plot(delayed$dentist_appointment, delayed$candy_sell2,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 2nd month",
col = "orange")plot(delayed$dentist_appointment, delayed$candy_sell3,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 3rd month",
col = "yellowgreen")plot(delayed$dentist_appointment, delayed$candy_sell4,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 4th month",
col = "green")plot(delayed$dentist_appointment, delayed$candy_sell5,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 5th month",
col = "blue")plot(delayed$dentist_appointment, delayed$candy_sell6,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 6th month",
col = "purple")We can see from the graph, the 4th month starting to show the effect selling candy with dentist appointment. We’ll see for the other variable on the 4th month.
plot(delayed$dentist_appointment, delayed$cereal_sell6,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 6th month",
col = "gray")plot(delayed$dentist_appointment, delayed$banana_sell6,
pch = 20,
xlab = "Dentist Appoinment",
ylab = "Candies selling 6th month",
col = "steelblue")The other variable also starting to show the relation on the 4th month, next analysis will start from the 4th month.
Regression Model
# Regression using basic R
month_4th <- data.frame(month = delayed$month,
year = delayed$year,
dentist_appointment = delayed$dentist_appointment,
candies_selling = delayed$candy_sell4,
bananas_selling = delayed$banana_sell4,
cereal_selling = delayed$cereal_sell4) %>% na.omit()
month_4th## month year dentist_appointment candies_selling bananas_selling
## 5 5 1996 52 200661 112146
## 6 6 1996 70 297141 175417
## 7 7 1996 64 235586 104645
## 8 8 1996 88 299498 112283
## 9 9 1996 74 395824 137083
## 10 10 1996 98 247212 198130
## 11 11 1996 80 273899 194685
## 12 12 1996 46 184405 100091
## 13 1 1997 66 141211 194972
## 14 2 1997 78 292225 133166
## 15 3 1997 46 227864 155598
## 16 4 1997 62 259963 104563
## 17 5 1997 124 433783 104216
## 18 6 1997 106 343034 142908
## 19 7 1997 110 419255 135994
## 20 8 1997 98 393175 154659
## 21 9 1997 76 304954 122983
## 22 10 1997 132 439380 104657
## 23 11 1997 82 270706 163008
## 24 12 1997 84 319472 149116
## cereal_selling
## 5 10324
## 6 13150
## 7 5398
## 8 7609
## 9 5190
## 10 14571
## 11 13407
## 12 8914
## 13 3663
## 14 14756
## 15 3883
## 16 5720
## 17 16209
## 18 5955
## 19 14789
## 20 11150
## 21 4930
## 22 15041
## 23 14210
## 24 5002
# Regression with linear model
# dentist appointment is our target (dependent variable) and candies_selling is independent variable
mdl_lm <- lm(dentist_appointment ~ candies_selling + bananas_selling + cereal_selling, data = month_4th)
summary(mdl_lm)##
## Call:
## lm(formula = dentist_appointment ~ candies_selling + bananas_selling +
## cereal_selling, data = month_4th)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.188 -7.019 -2.460 9.383 20.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.004e+00 2.001e+01 -0.350 0.73086
## candies_selling 2.112e-04 4.220e-05 5.005 0.00013 ***
## bananas_selling 9.421e-05 9.987e-05 0.943 0.35952
## cereal_selling 1.286e-03 7.718e-04 1.667 0.11500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.47 on 16 degrees of freedom
## Multiple R-squared: 0.734, Adjusted R-squared: 0.6842
## F-statistic: 14.72 on 3 and 16 DF, p-value: 7.304e-05
Interpreting the Model
From the model above, the most impactful for the dentist appointment is candies selling (shown by ***), meanwhile the other predictors (banana sell and cereal) have pvalue > 0.05 are not significant to our model.
Formulas we get from the model result for the candy selling is:
y_hat = -7.004 + 0.0002112 . x
Intercept (b0) is a constant, slope (b1) is coefficient of candies_selling. The intercept value is negative, which means it is not realistic for dentist appointments number. It may indicate that people visit dentist for another reasone beside sugar causes or other missing factor like general health habits or insurances.
For example, the candies selling is 100,000, then
y_hat = -7.004 + 0.0002112 . 100000
y_hat = -7.004 + 21.12
y_hat = 14.116
This means for every candy sales increased by 100k will affect the dentist appointment. The dentist appointment will increase 14 people for the next 4 months.
The Residual Standard Error is 13.47, means that the average of an error in prediction is 13 appointments. The R-squared is about 73% of dentist appointment explained by the predictors.
predict_data <- data.frame(
month = c(1,2,3,4,5),
year = c(1998,1998,1998,1998,1998),
candies_selling = c(345646,454344,346987,209854,254634),
bananas_selling = c(102331,234123,902313,222135,213563),
cereal_selling = c(4231234,213452,242556,890982,234155)
)
predict(mdl_lm, predict_data)## 1 2 3 4 5
## 5518.6449 385.5959 463.3135 1204.3948 368.1111
This is prediction dentist appointment for 5 month (Jan to May) using linear regression. Many people are going to dentist in January and April due to high selling of candy, banana, and cereal.