Background
Divorce rates reflect important demographic and social dynamics.
This report investigates divorce trends in Kazakhstan using a dataset containing:
- year — year of observation
- value — number of divorces
- area_code — regional identifier
- area_type — region type (urban or rural)
Objectives
- Visualize the data
- Identify missing values
- Impute missing data using MICE
- Fit linear models
- Check for nonlinearity and heterogeneity
- Build improved models
The dataset used in this analysis can be accessed here:
Divorce
and Marriage in Kazakhstan — Kaggle dataset
I began by loading the dataset and reading it.
df <- read_csv("Divorce.csv")
## Rows: 614 Columns: 76
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (50): termNames/0, termNames/1, periods/0/name, periods/0/date, periods/...
## dbl (26): terms/0, terms/1, periods/0/value, periods/1/value, periods/2/valu...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 76
## `terms/0` `terms/1` `termNames/0` `termNames/1` `periods/0/name`
## <dbl> <dbl> <chr> <chr> <chr>
## 1 741880 533590 РЕСПУБЛИКА КАЗАХСТАН сельская местность 2019 год
## 2 741880 741917 РЕСПУБЛИКА КАЗАХСТАН Всего 2019 год
## 3 258742 741917 КОСТАНАЙСКАЯ ОБЛАСТЬ Всего 2019 год
## 4 256619 741917 КАРАГАНДИНСКАЯ ОБЛАСТЬ Всего 2019 год
## 5 247783 741917 АКМОЛИНСКАЯ ОБЛАСТЬ Всего 2019 год
## 6 256619 533589 КАРАГАНДИНСКАЯ ОБЛАСТЬ городская местнос… 2019 год
## # ℹ 71 more variables: `periods/0/date` <chr>, `periods/0/value` <dbl>,
## # `periods/1/name` <chr>, `periods/1/date` <chr>, `periods/1/value` <dbl>,
## # `periods/2/name` <chr>, `periods/2/date` <chr>, `periods/2/value` <dbl>,
## # `periods/3/name` <chr>, `periods/3/date` <chr>, `periods/3/value` <dbl>,
## # `periods/4/name` <chr>, `periods/4/date` <chr>, `periods/4/value` <dbl>,
## # `periods/5/name` <chr>, `periods/5/date` <chr>, `periods/5/value` <dbl>,
## # `periods/6/name` <chr>, `periods/6/date` <chr>, `periods/6/value` <dbl>, …
Next i changed it into a long format suitable for analysis, as data was raw:
library(dplyr)
df_long <- df %>%
pivot_longer(
cols = starts_with("periods"),
names_to = c("period", ".value"),
names_pattern = "periods/(\\d+)/(.*)"
) %>%
rename(
region = `termNames/0`,
region_type = `termNames/1`
) %>%
mutate(
period = as.numeric(period),
value = as.numeric(value),
date = dmy(date)
)
I created visual plot using gglpot to see the relationship between time and divorce values
ggplot(df_long,
aes(x = date, y = value)) +
geom_point() +
theme_minimal()
## Warning: Removed 6278 rows containing missing values or values outside the scale range
## (`geom_point()`).
Its visible nonlinearity.
Next i decided to check whether the pattern differs across Urban and Rural regions
ggplot(df_long,
aes(x = date, y = value, color = region_type)) +
geom_smooth(se = FALSE, linewidth = 1.5) +
scale_color_brewer(palette = "Dark2") +
theme_minimal(base_size = 15) +
labs(
title = "Urban vs Rural — Trend Over Time",
subtitle = "Smooth curves with color contrast",
x = "Date",
y = "Value",
color = "Region Type"
) +
theme(
plot.title = element_text(face = "bold"),
legend.position = "top"
)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 6278 rows containing non-finite outside the scale range
## (`stat_smooth()`).
We can see that urban is higher that rural areas.
To check if any data is missinf in dataset, used summary of if NA is in dataset.
sum(is.na(df_long))
## [1] 18834
we see that data is missing in dataset, so i used MICE as it discussed to be good method in order not to lose the big amount of data
df_mice <- df_long %>%
mutate(
date_num = as.numeric(date)
) %>%
select(-date)
R was giving errors, so i changed date to numeric, to run the MICE accurately.Method was used Predictive Mean Matching.
str(df_long)
## tibble [14,736 × 8] (S3: tbl_df/tbl/data.frame)
## $ terms/0 : num [1:14736] 741880 741880 741880 741880 741880 ...
## $ terms/1 : num [1:14736] 533590 533590 533590 533590 533590 ...
## $ region : chr [1:14736] "РЕСПУБЛИКА КАЗАХСТАН" "РЕСПУБЛИКА КАЗАХСТАН" "РЕСПУБЛИКА КАЗАХСТАН" "РЕСПУБЛИКА КАЗАХСТАН" ...
## $ region_type: chr [1:14736] "сельская местность" "сельская местность" "сельская местность" "сельская местность" ...
## $ period : num [1:14736] 0 1 2 3 4 5 6 7 8 9 ...
## $ name : chr [1:14736] "2019 год" "2018 год" "2009 год" "2011 год" ...
## $ date : Date[1:14736], format: "2019-12-31" "2018-12-31" ...
## $ value : num [1:14736] 15715 13675 9303 11070 4638 ...
df_mice <- df_long %>%
mutate(across(where(is.character), as.factor),
date_num = as.numeric(date)
) %>%
select(-date)
str(df_mice)
## tibble [14,736 × 8] (S3: tbl_df/tbl/data.frame)
## $ terms/0 : num [1:14736] 741880 741880 741880 741880 741880 ...
## $ terms/1 : num [1:14736] 533590 533590 533590 533590 533590 ...
## $ region : Factor w/ 239 levels "АБАЙСКИЙ РАЙОН",..: 178 178 178 178 178 178 178 178 178 178 ...
## $ region_type: Factor w/ 3 levels "Всего","городская местность",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ period : num [1:14736] 0 1 2 3 4 5 6 7 8 9 ...
## $ name : Factor w/ 24 levels "2000 год","2001 год",..: 20 19 10 12 1 6 3 2 8 4 ...
## $ value : num [1:14736] 15715 13675 9303 11070 4638 ...
## $ date_num : num [1:14736] 18261 17896 14609 15339 11322 ...
df_mice$name <- NULL
pred <- quickpred(df_mice, minpuc = 0.2)
mice_result <- mice(df_mice, m = 5, method = "pmm", pred = pred)
##
## iter imp variable
## 1 1 value date_num
## 1 2 value date_num
## 1 3 value date_num
## 1 4 value date_num
## 1 5 value date_num
## 2 1 value date_num
## 2 2 value date_num
## 2 3 value date_num
## 2 4 value date_num
## 2 5 value date_num
## 3 1 value date_num
## 3 2 value date_num
## 3 3 value date_num
## 3 4 value date_num
## 3 5 value date_num
## 4 1 value date_num
## 4 2 value date_num
## 4 3 value date_num
## 4 4 value date_num
## 4 5 value date_num
## 5 1 value date_num
## 5 2 value date_num
## 5 3 value date_num
## 5 4 value date_num
## 5 5 value date_num
I used the lm() function to fit a linear regression model to one of the completed MICE datasets. I then used autoplot() to automatically generate diagnostic plots, which allow me to visually inspect model assumptions such as linearity, homoscedasticity, normality of residuals.
completed <- complete(mice_result, 1)
lm1 <- lm(value ~ period, data = completed)
autoplot(lm1)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggfortify package.
## Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Urban vs Rural Wanted to see model for Urban vs Rural,also used autolot to see the relation.
lm2 <- lm(value ~ period + region_type, data = completed)
summary(lm2)
##
## Call:
## lm(formula = value ~ period + region_type, data = completed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19922 -8113 -3223 2384 63656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3859.80 238.95 -16.153 <2e-16 ***
## period 1040.82 15.24 68.297 <2e-16 ***
## region_typeгородская местность 308.28 279.55 1.103 0.270
## region_typeсельская местность 154.71 238.79 0.648 0.517
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12810 on 14732 degrees of freedom
## Multiple R-squared: 0.2405, Adjusted R-squared: 0.2404
## F-statistic: 1555 on 3 and 14732 DF, p-value: < 2.2e-16
autoplot(lm2)
The plots from both models show that the simple linear regression is not appropriate for this dataset.Taken together, these patterns demonstrate that linearity, normality are violated. So i applied a log transformation and using polynomial terms to improve model.
Method by Log
completed$log_value <- log(completed$value + 1)
lm_log <- lm(log_value ~ period + region_type, data = completed)
autoplot(lm_log)
summary(lm_log)
##
## Call:
## lm(formula = log_value ~ period + region_type, data = completed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8563 -1.2529 0.0284 1.2096 7.1363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.862384 0.034613 111.587 <2e-16 ***
## period 0.253107 0.002208 114.655 <2e-16 ***
## region_typeгородская местность 0.396281 0.040494 9.786 <2e-16 ***
## region_typeсельская местность -0.421344 0.034591 -12.181 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.855 on 14732 degrees of freedom
## Multiple R-squared: 0.479, Adjusted R-squared: 0.4789
## F-statistic: 4516 on 3 and 14732 DF, p-value: < 2.2e-16
The log transformation improved linearity, making the model more appropriate than the original linear fit.
lm_poly <- lm(value ~ poly(period, 2) + region_type, data = completed)
autoplot(lm_poly)
The polynomial model helps address some of the nonlinearity detected in the earlier models, but it does not fully correct issues.
Overall, there appears to be a clear relationship between divorce rates and time, but the pattern is not perfectly linear. In some regions, divorce rates increase or decrease steadily, while in others the trend levels off or shifts direction over time. This suggests that divorce patterns are influenced by more than just the passing of years, but also other factors play roles. We also see that region is an important factor. Some regions consistently have higher divorce rates, while others remain lower or more stable. This means the relationship between time and divorce is not the same everywhere. To be more confident in these findings, we would need a bigger and more detailed dataset that covers more years and more areas.