Prashant Kumar (S3790132), Varshil Jain (S3803751)
Last updated: 27 October, 2019
- Divorce cases are increasing rapidly in the modern world, and it is hard to determine the cause of these rising divorce rates. What we can examine directly is who files for divorce first in a marriage.
- In this report we use open-source data from Kaggle, obtained from https://www.kaggle.com/osbornep/uk-marriage-and-divorce-figures
- We use this data to compare the number of divorces filed first by husbands and by wives in the UK.
- We test whether there is a difference between husbands and wives in who files for divorce first.
- Traditionally, the trend in the UK is said to be that the wife files for divorce first. This investigation determines whether there is a statistically significant difference between the two filing rates.
- Who files for divorce first in a marriage in the UK? Are wives more likely to file first?
- For this investigation we first use descriptive statistics and visualize the data with bar plots and box plots.
- After this initial investigation we use hypothesis testing, stating null and alternative hypotheses, to reach a conclusion.
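The null and alternative hypotheses can be written compactly as below. The symbol \mu_d is notation introduced here for illustration (the report does not name it): it stands for the mean yearly difference between the number of divorces filed first by husbands and by wives. The two-sided form matches the alternative reported in the test output later in the report.

```latex
H_0 : \mu_d = 0
\qquad \text{versus} \qquad
H_A : \mu_d \neq 0
```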
- We collected this open-source data from Kaggle: https://www.kaggle.com/osbornep/uk-marriage-and-divorce-figures
- The data was obtained online from Kaggle; judging from the data, a cluster sampling method appears to have been used for collection.
- The data originally had 116 observations and 39 variables, but only 4 variables are needed for this investigation.
- We kept only the required variables (the first four columns) using the filter function together with column indexing.
# Load the packages used below: readr for read_csv(), dplyr for filter().
library(readr)
library(dplyr)
# Importing data to R.
UKDivorceAndMarriage <- read_csv("E:/RMIT SEMESTER1/INTRO TO STATS/assigment 3/UKDivorceAndMarriage.csv")
# Keep the first four columns; filter() with no condition keeps every row.
UK_Divorce <- filter(UKDivorceAndMarriage[, 1:4])
# Converting data to numeric datatype.
UK_Divorce$`Husband Submitted Divorce` <- as.numeric(UK_Divorce$`Husband Submitted Divorce`)
UK_Divorce$`Wife Submitted Divorce` <- as.numeric(UK_Divorce$`Wife Submitted Divorce`)
str(UK_Divorce)
## Classes 'tbl_df', 'tbl' and 'data.frame': 116 obs. of 4 variables:
## $ Year : num 2016 2015 2014 2013 2012 ...
## $ Total Number of Divorces : num 106959 101055 111169 114720 118140 ...
## $ Husband Submitted Divorce: num 41669 38490 41364 40635 41601 ...
## $ Wife Submitted Divorce : num 65290 62565 69803 74076 76490 ...
- The four variables used for this investigation were:
1. Year: the year of data collection; numeric.
2. Total Number of Divorces: the total number of divorces filed in that year; numeric.
3. Husband Submitted Divorce: the number of divorces in which the husband filed first; numeric.
4. Wife Submitted Divorce: the number of divorces in which the wife filed first; numeric.
- There are no factors among these variables.
- The Year variable ranges from 1906 to 2016.
- Husband Submitted Divorce ranges from 283 to 47580.
- Wife Submitted Divorce ranges from 194 to 118401.
- To pre-process the data we kept the first 4 columns of the complete dataset.
- Because of separators in the original dataset, R read the numeric values as characters, so we used the as.numeric function to convert them to numeric.
- There were 4 missing values in our observations; we replaced each of them with the mean of its column.
- After replacing the missing values, sum(is.na()) was used to confirm that none remained.
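# Husband and Wife are assumed to be the two count columns as plain vectors.
Husband <- UK_Divorce$`Husband Submitted Divorce`
Wife <- UK_Divorce$`Wife Submitted Divorce`
# Replacing NA values by the column mean.
Husband[is.na(Husband)] <- mean(Husband, na.rm = TRUE)
Wife[is.na(Wife)] <- mean(Wife, na.rm = TRUE)
# Checking for missing values.
sum(is.na(Husband))
## [1] 0
sum(is.na(Wife))
## [1] 0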
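The pre-processing steps described above (stripping separators, converting to numeric, mean imputation) can be sketched on a small toy vector. This is only an illustration under assumptions: the values are made up, the separator is assumed to be a comma, and `raw_counts` and `counts` are hypothetical names that do not appear in the report.

```r
# Toy character vector standing in for a count column read as text
# (hypothetical values, comma assumed as the thousands separator).
raw_counts <- c("41,669", "38,490", NA, "283")

# Remove the separators, then convert to numeric.
counts <- as.numeric(gsub(",", "", raw_counts))

# Replace missing values with the mean of the observed values.
counts[is.na(counts)] <- mean(counts, na.rm = TRUE)

print(counts)
print(sum(is.na(counts)))
```

Note that calling as.numeric directly on a string such as "41,669" returns NA, which is why the separators are removed first in this sketch.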
- We used the summary function to summarize the data.
- The mean number of divorces filed first per year is 20759 for husbands and 42923 for wives.
- Between 1906 and 2016, the minimum filed first in a single year was 283 for husbands and 194 for wives; the maximum was 47580 for husbands and 118401 for wives.
- The medians were 14936 (husband submitted first) and 18691 (wife submitted first).
- These data show a marked increase in wives filing for divorce first in recent years.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     283    1678   14936   20759   41610   47580
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     194    2147   18691   42923   89527  118401
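# Side-by-side boxplots of the yearly counts for husbands and wives.
boxplot(Husband, Wife, ylab = "Number of divorces filed", col = c("skyblue", "pink"))
# Label the two boxes on the x-axis, then add the title.
axis(1, at = 1:2, labels = c("Husband", "Wife"))
title(main = "Boxplot comparing divorces filed by husband and wife")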
- The important variables in our plot are Husband Submitted Divorce and Wife Submitted Divorce.
- The boxplot for husbands shows that the maximum number of divorces filed first by husbands in a single year is 47580, compared with 118401 for wives.
- The minimum number of divorces filed first in a single year is 283 for husbands and 194 for wives.
- The IQR for the husband boxplot is Q3 (41610) - Q1 (1678) = 39932.
- The IQR for the wife boxplot is Q3 (89527) - Q1 (2147) = 87380.
- Both boxplots are asymmetric, which suggests that the data are not normally distributed.
- The boxplots also show greater variability in the wife data, whose counts are considerably higher than those for husbands.
- Since the median lies in the lower half of each box, both series appear to have risen sharply in later years. (This may be due to factors such as population growth and World War I and World War II, which fall within this period.)
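The IQR arithmetic above can be checked with a few lines of R, using only the quartiles and maxima reported by summary(). The helper names `iqr_of` and `upper_fence` are hypothetical, and the 1.5 x IQR fence is the conventional boxplot rule, used here as an approximate check rather than an exact reproduction of what boxplot() computes from the raw data.

```r
# Quartiles and maxima as reported by summary() in this report.
husband_q <- c(q1 = 1678, q3 = 41610, max = 47580)
wife_q    <- c(q1 = 2147, q3 = 89527, max = 118401)

# Hypothetical helpers: interquartile range and upper 1.5 x IQR fence.
iqr_of      <- function(q) unname(q["q3"] - q["q1"])
upper_fence <- function(q) unname(q["q3"] + 1.5 * iqr_of(q))

husband_iqr <- iqr_of(husband_q)
wife_iqr    <- iqr_of(wife_q)
print(c(husband = husband_iqr, wife = wife_iqr))

# TRUE means the maximum lies inside the upper fence (no high outliers).
husband_inside <- unname(husband_q["max"]) <= upper_fence(husband_q)
wife_inside    <- unname(wife_q["max"]) <= upper_fence(wife_q)
print(c(husband = husband_inside, wife = wife_inside))
```

Under this rule both maxima fall inside the upper fence, so the large counts in later years stretch the boxes themselves rather than appearing as isolated outliers.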
Year <- UK_Divorce$Year
Husband1 <- data.frame(Husband, Year)
Wife1 <- data.frame(Wife, Year)
# Bar plot of divorces filed first by husbands, one bar per year.
barplot(Husband1$Husband, names.arg = Husband1$Year, ylim = c(0, 120000), col = "skyblue", xlab = "Years", ylab = "Number of divorces filed by husbands")
- The bar plot for husbands shows a marked increase in husbands filing for divorce first after 1972, with the typical yearly count rising from approximately 20000 to 40000.
- The bar plot for wives shows a marked increase in wives filing for divorce first after 1972, with the typical yearly count rising from approximately 40000 to over 100000 by 1994 and then falling to approximately 60000 by 2016.
Divorce_difference <- Wife - Husband
# qqplot(x, y) plots the sorted differences against the sorted years.
qqplot(UK_Divorce$Year, Divorce_difference, col = "red", ylab = "Difference in number of divorce filed", xlab = "Years")
- The Q-Q plot shows that the differences are not centred on zero and that the points follow a strongly non-linear pattern, which suggests the differences are not normally distributed.
- Up to about 1960 the difference between divorces filed first by husbands and by wives was comparatively small, which suggests that both filed first at similar rates in the UK and that neither was clearly filing more than the other.
- After 1960 the difference increases, which suggests that one of the two was filing first considerably more often. From the boxplot and bar plot visualizations we can see that it was wives who filed first more often than husbands after 1960.
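A normal Q-Q check of the kind discussed above can be sketched with base R's qqnorm(), qqline() and shapiro.test(). The data here are simulated purely for illustration (a low early regime followed by a high later regime, loosely imitating the pattern described); `diff_demo` is a hypothetical name and the numbers are not taken from the dataset.

```r
set.seed(1)

# Simulated yearly differences: 60 small early values, 56 large later values.
diff_demo <- c(rnorm(60, mean = 500, sd = 200),
               rnorm(56, mean = 45000, sd = 8000))

# Normal Q-Q plot: sample quantiles against theoretical normal quantiles.
qqnorm(diff_demo, col = "red", main = "Normal Q-Q plot (simulated differences)")
qqline(diff_demo)

# Shapiro-Wilk test as a numerical complement to the plot.
sw <- shapiro.test(diff_demo)
print(sw)
```

Points that bend away from the reference line, together with a small Shapiro-Wilk p-value, would indicate a departure from normality.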
- Levene's test for homogeneity of variance gives, on 110 degrees of freedom, an F-value of 8.7269e+61 and a p-value below 2.2e-16, far smaller than the significance level alpha = 0.05, so the variances of the husband and wife data should not be assumed equal.
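For reference, a median-centred Levene test (the Brown-Forsythe variant) for two groups can be sketched in base R as a one-way ANOVA on absolute deviations from each group's median. This is not the code that produced the figures above: the data are simulated, and `husband_demo`, `wife_demo`, `abs_dev` and `levene_tab` are hypothetical names.

```r
set.seed(42)

# Simulated yearly counts with clearly different spreads (illustration only).
husband_demo <- rnorm(116, mean = 20000, sd = 5000)
wife_demo    <- rnorm(116, mean = 40000, sd = 20000)

values <- c(husband_demo, wife_demo)
group  <- factor(rep(c("Husband", "Wife"), each = 116))

# Absolute deviation of each value from its own group's median.
abs_dev <- abs(values - ave(values, group, FUN = median))

# Levene's statistic is the F statistic of a one-way ANOVA on the deviations.
levene_tab <- anova(lm(abs_dev ~ group))
levene_F   <- levene_tab$`F value`[1]
levene_p   <- levene_tab$`Pr(>F)`[1]
print(levene_tab)
```

With two groups of 116 observations, this formulation has 1 and 230 degrees of freedom.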
- For this investigation a paired t-test was performed, with the husband and wife counts paired by year.
- The test assumes that the 4 missing values can be replaced by the mean of their column.
- A simple linear regression of Husband on Wife was also fitted to the data; its summary is:
##
## Call:
## lm(formula = Husband ~ Wife, data = UK_Divorce)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -6219  -2827  -1114   2839  12192
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.143e+03  4.990e+02   6.298 5.82e-09 ***
## Wife        4.104e-01  8.138e-03  50.432  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3838 on 114 degrees of freedom
## Multiple R-squared: 0.9571, Adjusted R-squared: 0.9567
## F-statistic: 2543 on 1 and 114 DF, p-value: < 2.2e-16
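To illustrate how the fitted line above is read, the sketch below plugs the reported coefficients into the regression equation for the 2016 row shown in the str() output. The function name `predict_husband` is hypothetical, and the coefficients are the rounded values printed in the summary, so the result is approximate.

```r
# Coefficients as printed in the regression summary above.
intercept <- 3.143e+03
slope     <- 4.104e-01

# Hypothetical helper: fitted husband-filed count for a given wife-filed count.
predict_husband <- function(wife_count) intercept + slope * wife_count

# 2016 values taken from the str() output earlier in the report.
wife_2016    <- 65290
husband_2016 <- 41669

fitted_2016   <- predict_husband(wife_2016)
residual_2016 <- husband_2016 - fitted_2016
print(c(fitted = fitted_2016, residual = residual_2016))
```

The slope of about 0.41 means that, in this model, each additional divorce filed first by wives is associated with roughly 0.41 additional divorces filed first by husbands.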
##
## Paired t-test
##
## data: Husband and Wife
## t = -9.1079, df = 115, p-value = 3.106e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -26985.23 -17344.34
## sample estimates:
## mean of the differences
## -22164.79
## [1] 3.106032e-15
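Output of this form is what base R's t.test() prints when called with paired = TRUE. The sketch below runs such a test on simulated data (not the report's data; `husband_demo` and `wife_demo` are hypothetical names) and shows that it is equivalent to a one-sample t-test on the yearly differences.

```r
set.seed(123)

# Simulated paired yearly counts (illustration only).
wife_demo    <- round(runif(30, min = 1000, max = 100000))
husband_demo <- round(0.4 * wife_demo + rnorm(30, mean = 3000, sd = 2000))

# Paired t-test: each husband count is matched with the same year's wife count.
paired <- t.test(husband_demo, wife_demo, paired = TRUE)

# Equivalent formulation: one-sample t-test on the differences.
one_sample <- t.test(husband_demo - wife_demo, mu = 0)

print(paired)
print(paired$p.value)
```

Pairing by year is what distinguishes this test from an independent two-sample t-test, which would ignore the shared year-to-year trend in both series.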
- The paired t-test gives t = -9.1079 on 115 degrees of freedom.
- The 95% confidence interval for the mean difference (Husband minus Wife) runs from -26985.23 to -17344.34.
- The sample mean difference (-22164.79) lies inside this interval, but the hypothesised value of 0 does not. Thus we have evidence to reject the null hypothesis H0.
- The p-value is 3.106032e-15, far smaller than the significance level alpha = 0.05, which also indicates that we should reject the null hypothesis H0.
- Based on this evidence we reject the null hypothesis: there is a statistically significant difference between the numbers of divorces filed first by husbands and by wives in the UK.
- It means that in the UK the wife is more likely than the husband to file for divorce first.
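The reported confidence interval and p-value can be recovered, up to rounding, from the printed mean difference, t statistic and degrees of freedom. The sketch below is a consistency check on the reported numbers only; the variable names are hypothetical.

```r
# Values copied from the paired t-test output.
mean_diff <- -22164.79
t_stat    <- -9.1079
df        <- 115

# Standard error implied by t = mean_diff / se.
se <- mean_diff / t_stat

# 95% confidence interval: mean_diff +/- t quantile * se.
margin <- qt(0.975, df) * abs(se)
ci <- c(lower = mean_diff - margin, upper = mean_diff + margin)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

print(ci)
print(p_value)
```

Because the whole interval lies below zero, the null value of 0 is excluded, which is the reason for rejecting H0 at the 5% level.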
- The initial visualization of the data with the boxplot, bar plot and Q-Q plot suggested a clear difference between the numbers of divorces filed first by husbands and by wives, with wives filing first in considerably higher numbers each year. Hypothesis testing confirmed this: the null hypothesis does not stand, and we rejected it.
- The strength of our investigation lies in the data, which covers 1906-2016, about 110 years, and is sufficient for this investigation.
- During these 110 years many factors, such as population growth and the World Wars, affected the overall data and this investigation.
- For future investigation we could use data in which the reason for divorce is specified, with 3-4 general reasons such as domestic violence, financial problems or extramarital affairs listed, for a more in-depth investigation.
- Finally, we can say that if a couple from the UK are not doing well in their marriage, it is more likely that the wife will be the one to file for divorce.
- https://www.kaggle.com/osbornep/uk-marriage-and-divorce-figures
- https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot
- https://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm
- https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/descriptive-statistics/box-plot/
- https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51