Introduction

Background

Divorce rates reflect important demographic and social dynamics.
This report investigates divorce trends in Kazakhstan using a dataset containing:

  • year — year of observation
  • value — number of divorces
  • area_code — regional identifier
  • area_type — region type (urban or rural)

Objectives

  • Visualize the data
  • Identify missing values
  • Impute missing data using MICE
  • Fit linear models
  • Check for nonlinearity and heterogeneity
  • Build improved models

The dataset used in this analysis can be accessed here:
Divorce and Marriage in Kazakhstan — Kaggle dataset

I began by loading the dataset and reading it.

df <- read_csv("Divorce.csv")
## Rows: 614 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (50): termNames/0, termNames/1, periods/0/name, periods/0/date, periods/...
## dbl (26): terms/0, terms/1, periods/0/value, periods/1/value, periods/2/valu...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 76
##   `terms/0` `terms/1` `termNames/0`          `termNames/1`      `periods/0/name`
##       <dbl>     <dbl> <chr>                  <chr>              <chr>           
## 1    741880    533590 РЕСПУБЛИКА КАЗАХСТАН   сельская местность 2019 год        
## 2    741880    741917 РЕСПУБЛИКА КАЗАХСТАН   Всего              2019 год        
## 3    258742    741917 КОСТАНАЙСКАЯ ОБЛАСТЬ   Всего              2019 год        
## 4    256619    741917 КАРАГАНДИНСКАЯ ОБЛАСТЬ Всего              2019 год        
## 5    247783    741917 АКМОЛИНСКАЯ ОБЛАСТЬ    Всего              2019 год        
## 6    256619    533589 КАРАГАНДИНСКАЯ ОБЛАСТЬ городская местнос… 2019 год        
## # ℹ 71 more variables: `periods/0/date` <chr>, `periods/0/value` <dbl>,
## #   `periods/1/name` <chr>, `periods/1/date` <chr>, `periods/1/value` <dbl>,
## #   `periods/2/name` <chr>, `periods/2/date` <chr>, `periods/2/value` <dbl>,
## #   `periods/3/name` <chr>, `periods/3/date` <chr>, `periods/3/value` <dbl>,
## #   `periods/4/name` <chr>, `periods/4/date` <chr>, `periods/4/value` <dbl>,
## #   `periods/5/name` <chr>, `periods/5/date` <chr>, `periods/5/value` <dbl>,
## #   `periods/6/name` <chr>, `periods/6/date` <chr>, `periods/6/value` <dbl>, …

Next i changed it into a long format suitable for analysis, as data was raw:

library(dplyr)
df_long <- df %>%
  pivot_longer(
    cols = starts_with("periods"),
    names_to = c("period", ".value"),
    names_pattern = "periods/(\\d+)/(.*)"
  ) %>%
  rename(
    region = `termNames/0`,
    region_type = `termNames/1`
  ) %>%
  mutate(
    period = as.numeric(period),
    value  = as.numeric(value),
    date   = dmy(date)
  )

Visualization

I created visual plot using gglpot to see the relationship between time and divorce values

ggplot(df_long, 
       aes(x = date, y = value)) +
  geom_point() +
  theme_minimal()
## Warning: Removed 6278 rows containing missing values or values outside the scale range
## (`geom_point()`).

Its visible nonlinearity.

Next i decided to check whether the pattern differs across Urban and Rural regions

ggplot(df_long, 
       aes(x = date, y = value, color = region_type)) +
  geom_smooth(se = FALSE, linewidth = 1.5) +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal(base_size = 15) +
  labs(
    title = "Urban vs Rural — Trend Over Time",
    subtitle = "Smooth curves with color contrast",
    x = "Date",
    y = "Value",
    color = "Region Type"
  ) +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "top"
  )
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 6278 rows containing non-finite outside the scale range
## (`stat_smooth()`).

We can see that urban is higher that rural areas.

Exploring missing data

To check if any data is missinf in dataset, used summary of if NA is in dataset.

sum(is.na(df_long))
## [1] 18834

Imputing missing data with MICE

we see that data is missing in dataset, so i used MICE as it discussed to be good method in order not to lose the big amount of data

df_mice <- df_long %>%
  mutate(
    date_num = as.numeric(date)
  ) %>%
  select(-date)

R was giving errors, so i changed date to numeric, to run the MICE accurately.Method was used Predictive Mean Matching.

str(df_long)
## tibble [14,736 × 8] (S3: tbl_df/tbl/data.frame)
##  $ terms/0    : num [1:14736] 741880 741880 741880 741880 741880 ...
##  $ terms/1    : num [1:14736] 533590 533590 533590 533590 533590 ...
##  $ region     : chr [1:14736] "РЕСПУБЛИКА КАЗАХСТАН" "РЕСПУБЛИКА КАЗАХСТАН" "РЕСПУБЛИКА КАЗАХСТАН" "РЕСПУБЛИКА КАЗАХСТАН" ...
##  $ region_type: chr [1:14736] "сельская местность" "сельская местность" "сельская местность" "сельская местность" ...
##  $ period     : num [1:14736] 0 1 2 3 4 5 6 7 8 9 ...
##  $ name       : chr [1:14736] "2019 год" "2018 год" "2009 год" "2011 год" ...
##  $ date       : Date[1:14736], format: "2019-12-31" "2018-12-31" ...
##  $ value      : num [1:14736] 15715 13675 9303 11070 4638 ...
df_mice <- df_long %>%
  mutate(across(where(is.character), as.factor),
    date_num = as.numeric(date)
  ) %>%
  select(-date)

str(df_mice)
## tibble [14,736 × 8] (S3: tbl_df/tbl/data.frame)
##  $ terms/0    : num [1:14736] 741880 741880 741880 741880 741880 ...
##  $ terms/1    : num [1:14736] 533590 533590 533590 533590 533590 ...
##  $ region     : Factor w/ 239 levels "АБАЙСКИЙ РАЙОН",..: 178 178 178 178 178 178 178 178 178 178 ...
##  $ region_type: Factor w/ 3 levels "Всего","городская местность",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ period     : num [1:14736] 0 1 2 3 4 5 6 7 8 9 ...
##  $ name       : Factor w/ 24 levels "2000 год","2001 год",..: 20 19 10 12 1 6 3 2 8 4 ...
##  $ value      : num [1:14736] 15715 13675 9303 11070 4638 ...
##  $ date_num   : num [1:14736] 18261 17896 14609 15339 11322 ...
df_mice$name <- NULL
pred <- quickpred(df_mice, minpuc = 0.2)
mice_result <- mice(df_mice, m = 5, method = "pmm", pred = pred)
## 
##  iter imp variable
##   1   1  value  date_num
##   1   2  value  date_num
##   1   3  value  date_num
##   1   4  value  date_num
##   1   5  value  date_num
##   2   1  value  date_num
##   2   2  value  date_num
##   2   3  value  date_num
##   2   4  value  date_num
##   2   5  value  date_num
##   3   1  value  date_num
##   3   2  value  date_num
##   3   3  value  date_num
##   3   4  value  date_num
##   3   5  value  date_num
##   4   1  value  date_num
##   4   2  value  date_num
##   4   3  value  date_num
##   4   4  value  date_num
##   4   5  value  date_num
##   5   1  value  date_num
##   5   2  value  date_num
##   5   3  value  date_num
##   5   4  value  date_num
##   5   5  value  date_num

Model Fit

I used the lm() function to fit a linear regression model to one of the completed MICE datasets. I then used autoplot() to automatically generate diagnostic plots, which allow me to visually inspect model assumptions such as linearity, homoscedasticity, normality of residuals.

completed <- complete(mice_result, 1)
lm1 <- lm(value ~ period, data = completed)
autoplot(lm1)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
##   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Urban vs Rural Wanted to see model for Urban vs Rural,also used autolot to see the relation.

lm2 <- lm(value ~ period + region_type, data = completed)
summary(lm2)
## 
## Call:
## lm(formula = value ~ period + region_type, data = completed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19922  -8113  -3223   2384  63656 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -3859.80     238.95 -16.153   <2e-16 ***
## period                          1040.82      15.24  68.297   <2e-16 ***
## region_typeгородская местность   308.28     279.55   1.103    0.270    
## region_typeсельская местность    154.71     238.79   0.648    0.517    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12810 on 14732 degrees of freedom
## Multiple R-squared:  0.2405, Adjusted R-squared:  0.2404 
## F-statistic:  1555 on 3 and 14732 DF,  p-value: < 2.2e-16
autoplot(lm2)

The plots from both models show that the simple linear regression is not appropriate for this dataset.Taken together, these patterns demonstrate that linearity, normality are violated. So i applied a log transformation and using polynomial terms to improve model.

Transforming the Model

Method by Log

completed$log_value <- log(completed$value + 1)

lm_log <- lm(log_value ~ period + region_type, data = completed)
autoplot(lm_log)

summary(lm_log)
## 
## Call:
## lm(formula = log_value ~ period + region_type, data = completed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8563 -1.2529  0.0284  1.2096  7.1363 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.862384   0.034613 111.587   <2e-16 ***
## period                          0.253107   0.002208 114.655   <2e-16 ***
## region_typeгородская местность  0.396281   0.040494   9.786   <2e-16 ***
## region_typeсельская местность  -0.421344   0.034591 -12.181   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.855 on 14732 degrees of freedom
## Multiple R-squared:  0.479,  Adjusted R-squared:  0.4789 
## F-statistic:  4516 on 3 and 14732 DF,  p-value: < 2.2e-16

The log transformation improved linearity, making the model more appropriate than the original linear fit.

Method by Polynomials

lm_poly <- lm(value ~ poly(period, 2) + region_type, data = completed)
autoplot(lm_poly)

The polynomial model helps address some of the nonlinearity detected in the earlier models, but it does not fully correct issues.

Conclusion

Overall, there appears to be a clear relationship between divorce rates and time, but the pattern is not perfectly linear. In some regions, divorce rates increase or decrease steadily, while in others the trend levels off or shifts direction over time. This suggests that divorce patterns are influenced by more than just the passing of years, but also other factors play roles. We also see that region is an important factor. Some regions consistently have higher divorce rates, while others remain lower or more stable. This means the relationship between time and divorce is not the same everywhere. To be more confident in these findings, we would need a bigger and more detailed dataset that covers more years and more areas.