Assignment 2

Assignment 2: Swiss Fertility and Socioeconomic Indicators (1888) Data

Display & Explain Data:

data(swiss)
str(swiss)

## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

head(swiss)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Data Explanation: The swiss dataset contains 47 observations (French-speaking provinces of Switzerland) on 6 variables:

Fertility: Common standardized fertility measure (Ig)
Agriculture: Percentage of males involved in agriculture as occupation
Examination: Percentage of draftees receiving the highest mark on the army examination
Education: Percentage of draftees with education beyond primary school
Catholic: Percentage Catholic (as opposed to Protestant)
Infant.Mortality: Percentage of live births who live less than 1 year

The unit of observation is the province, and the sample size is 47 provinces.

Data Source: The swiss dataset is built into R (datasets package) and was originally taken from Swiss demographic data from about 1888.

Data Check & Descriptive Stats

# Check for missing values in each variable
print(colSums(is.na(swiss)))

##        Fertility      Agriculture      Examination        Education 
##                0                0                0                0 
##         Catholic Infant.Mortality 
##                0                0

# Descriptive statistics
summary(swiss)

##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60

# Calculate additional statistics
library(psych)
describe(swiss)

##                  vars  n  mean    sd median trimmed   mad   min   max range
## Fertility           1 47 70.14 12.49  70.40   70.66 10.23 35.00  92.5 57.50
## Agriculture         2 47 50.66 22.71  54.10   51.16 23.87  1.20  89.7 88.50
## Examination         3 47 16.49  7.98  16.00   16.08  7.41  3.00  37.0 34.00
## Education           4 47 10.98  9.62   8.00    9.38  5.93  1.00  53.0 52.00
## Catholic            5 47 41.14 41.70  15.14   39.12 18.65  2.15 100.0 97.85
## Infant.Mortality    6 47 19.94  2.91  20.00   19.98  2.82 10.80  26.6 15.80
##                   skew kurtosis   se
## Fertility        -0.46     0.26 1.82
## Agriculture      -0.32    -0.89 3.31
## Examination       0.45    -0.14 1.16
## Education         2.27     6.14 1.40
## Catholic          0.48    -1.67 6.08
## Infant.Mortality -0.33     0.78 0.42

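# Visualization: one histogram per variable, each drawn in its own colour
par(mfrow = c(2, 3), bg = "ivory1")
colour <- c("lightblue", "forestgreen", "yellow", "palevioletred", "plum1", "orange", "khaki")
for (i in 1:6) {
  hist(swiss[, i],
       main = names(swiss)[i],
       xlab = names(swiss)[i],
       col = colour[i])  # index the palette so the bars of one histogram share a colour
}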

Statistical Parameters:

No variable has missing values.

Fertility:
- Range: 35.00 to 92.50
- Median: 70.40
- Mean: 70.14
- The middle 50% of values fall between 64.70 (1st quartile) and 78.45 (3rd quartile); the distribution is roughly symmetric with a mild left skew (skew = -0.46)
Agriculture:
- Range: 1.20% to 89.70%
- Median: 54.10%
- Mean: 50.66%
- The middle 50% of values fall between 35.90% and 67.65%; values are spread widely across the range (flatter than a normal distribution, kurtosis = -0.89)
Examination:
- Range: 3.00% to 37.00%
- Median: 16.00%
- Mean: 16.49%
- The middle 50% of values fall between 12.00% and 22.00%
Education:
- Range: 1.00% to 53.00%
- Median: 8.00%
- Mean: 10.98%
- The middle 50% of values fall between 6.00% and 12.00%
- The large maximum (53.00%) compared to the median (8.00%), together with skew = 2.27, indicates a strongly right-skewed distribution with high outliers
Catholic:
- Range: 2.15% to 100.00%
- Median: 15.14%
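- Mean: 41.14%
- The middle 50% of values fall between 5.195% and 93.125%; the large gap between mean and median and the strongly negative kurtosis (-1.67) suggest a bimodal distribution, with provinces either mostly Protestant or mostly Catholic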
Infant.Mortality:
- Range: 10.80 to 26.60
- Median: 20.00
- Mean: 19.94
- The middle 50% of values fall between 18.15 and 21.70
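
The quartile ranges quoted above are interquartile ranges, so each one covers roughly the middle half of the provinces rather than "most" of them. A minimal sketch of how this could be checked is shown here; the names quartile_summary and share_in_IQR are illustrative and not part of the original analysis.

```r
# Sketch: recompute the quartiles and the share of provinces inside each
# interquartile range. Names below are illustrative assumptions.
data(swiss)

quartile_summary <- t(sapply(swiss[, 1:6], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  c(Q1 = q[[1]],
    Q3 = q[[2]],
    IQR = q[[2]] - q[[1]],
    # proportion of provinces with Q1 <= value <= Q3
    share_in_IQR = mean(x >= q[[1]] & x <= q[[2]]))
}))

round(quartile_summary, 2)
```

The share_in_IQR column should sit near 0.5 for each variable, with small deviations caused by ties at the quartile boundaries.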

Research Question & Hypothesis Testing

RQ: Is there a significant difference in fertility rates between provinces that are predominantly Catholic, versus those that are predominantly Protestant?

H0: There is no difference in mean fertility between provinces that are predominantly Catholic and those that are predominantly Protestant. μ_Catholic = μ_Protestant

HA: There is a difference in mean fertility between provinces that are predominantly Catholic and those that are predominantly Protestant. μ_Catholic ≠ μ_Protestant

# Create groups based on Catholic percentage
swiss$religious_group <- ifelse(swiss$Catholic > 50, "Predominantly Catholic", "Predominantly Protestant")
swiss$religious_group <- factor(swiss$religious_group)

# Explore the groups
table(swiss$religious_group)

## 
##   Predominantly Catholic Predominantly Protestant 
##                       18                       29

by(swiss$Fertility, swiss$religious_group, summary)

## swiss$religious_group: Predominantly Catholic
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.80   71.75   79.35   76.46   83.62   92.50 
## ------------------------------------------------------------ 
## swiss$religious_group: Predominantly Protestant
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   61.70   65.70   66.22   72.00   85.80

# Visualize
par(bg="ivory1")
boxplot(Fertility ~ religious_group, data = swiss, 
        main = "Fertility by Religious Majority",
        ylab = "Fertility", 
        xlab = "Religious Group",
        col = colour)

# Check assumptions for T-test
# 1. Normality within groups
shapiro.test(swiss$Fertility[swiss$religious_group == "Predominantly Catholic"])

## 
##  Shapiro-Wilk normality test
## 
## data:  swiss$Fertility[swiss$religious_group == "Predominantly Catholic"]
## W = 0.8576, p-value = 0.01118

shapiro.test(swiss$Fertility[swiss$religious_group == "Predominantly Protestant"])

## 
##  Shapiro-Wilk normality test
## 
## data:  swiss$Fertility[swiss$religious_group == "Predominantly Protestant"]
## W = 0.94021, p-value = 0.1015

# 2. Homogeneity of variances
var.test(Fertility ~ religious_group, data = swiss)

## 
##  F test to compare two variances
## 
## data:  Fertility by religious_group
## F = 2.18, num df = 17, denom df = 28, p-value = 0.06538
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9509789 5.4908530
## sample estimates:
## ratio of variances 
##           2.179993

# T-test
t_test_result <- t.test(Fertility ~ religious_group, data = swiss)
print(t_test_result)

## 
##  Welch Two Sample t-test
## 
## data:  Fertility by religious_group
## t = 2.7004, df = 26.742, p-value = 0.01186
## alternative hypothesis: true difference in means between group Predominantly Catholic and group Predominantly Protestant is not equal to 0
## 95 percent confidence interval:
##   2.455904 18.024939
## sample estimates:
##   mean in group Predominantly Catholic mean in group Predominantly Protestant 
##                               76.46111                               66.22069

# Non-parametric test (Wilcoxon rank-sum test / Mann-Whitney U test)
wilcox_test_result <- wilcox.test(Fertility ~ religious_group, data = swiss)

## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties

print(wilcox_test_result)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Fertility by religious_group
## W = 409.5, p-value = 0.0012
## alternative hypothesis: true location shift is not equal to 0

Interpretation and Conclusion

Assumption Testing

Variance Homogeneity (F-test):

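F = 2.18, p-value = 0.06538
- The p-value is just above the conventional significance level of 0.05, so equal variances cannot be rejected, although the evidence is borderline
- The 95% confidence interval for the variance ratio (0.95 to 5.49) includes 1 but extends well above it, so the Catholic group's variance may still be considerably larger
- The t-test reported here is Welch's version (the default of t.test), which does not assume equal variances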
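
As a rough illustration of what var.test reports, the F statistic is simply the ratio of the two sample variances, and the two-sided p-value comes from the F distribution with (n1 - 1, n2 - 1) degrees of freedom. The sketch below recomputes both by hand; the variable names (group, f_ratio, f_p_value) are illustrative assumptions, and the grouping rule is the same Catholic > 50 cutoff used earlier.

```r
# Sketch: reproduce the F test for equal variances by hand.
data(swiss)
group <- factor(ifelse(swiss$Catholic > 50,
                       "Predominantly Catholic",
                       "Predominantly Protestant"))

group_var <- tapply(swiss$Fertility, group, var)     # sample variance per group
group_n   <- tapply(swiss$Fertility, group, length)  # group sizes (18 and 29)

# Ratio of variances, Catholic group in the numerator (first factor level)
f_ratio <- group_var[["Predominantly Catholic"]] /
           group_var[["Predominantly Protestant"]]

# Two-sided p-value from the F distribution
lower_tail <- pf(f_ratio,
                 group_n[["Predominantly Catholic"]] - 1,
                 group_n[["Predominantly Protestant"]] - 1)
f_p_value <- 2 * min(lower_tail, 1 - lower_tail)

c(F = f_ratio, p = f_p_value)
```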

Normality:

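The Shapiro-Wilk test rejects normality for the predominantly Catholic group (W = 0.8576, p-value = 0.01118) but not for the predominantly Protestant group (W = 0.94021, p-value = 0.1015).
The boxplot is consistent with this: both groups have low outliers.
- The predominantly Catholic group has one notable low outlier at its minimum of 42.80
- The predominantly Protestant group has one at its minimum of 35.00
Because normality is doubtful for the Catholic group, the Wilcoxon rank sum test is reported alongside the t-test.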
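
A normal Q-Q plot per group is one common way to look at the normality assumption graphically, in addition to the boxplot and the Shapiro-Wilk tests. The following is only a sketch; the names groups and shapiro_p are illustrative, and the grouping again assumes the Catholic > 50 cutoff.

```r
# Sketch: Q-Q plots and Shapiro-Wilk p-values for fertility within each group.
data(swiss)
groups <- split(swiss$Fertility,
                ifelse(swiss$Catholic > 50,
                       "Predominantly Catholic",
                       "Predominantly Protestant"))

old_par <- par(mfrow = c(1, 2))   # two panels side by side
for (g in names(groups)) {
  qqnorm(groups[[g]], main = g)   # sample quantiles against normal quantiles
  qqline(groups[[g]])             # reference line through the quartiles
}
par(old_par)                      # restore the previous plot layout

# Shapiro-Wilk p-value for each group
shapiro_p <- sapply(groups, function(x) shapiro.test(x)$p.value)
round(shapiro_p, 4)
```

Points that bend away from the reference line at the lower end would correspond to the low outliers seen in the boxplot.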

Parametric Test Results

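t = 2.7004, df = 26.742, p-value = 0.01186 (Welch two-sample t-test)
- The p-value is less than 0.05, so the null hypothesis of equal means is rejected
- Mean fertility in predominantly Catholic provinces (76.46) is higher than in predominantly Protestant provinces (66.22), a difference of 10.24 with a 95% confidence interval of 2.46 to 18.02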
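
To make the Welch result less of a black box, the statistic, the Welch-Satterthwaite degrees of freedom, and the confidence interval can be recomputed from the group means and variances. This is a sketch under the same Catholic > 50 grouping; all variable names are illustrative.

```r
# Sketch: Welch two-sample t-test computed by hand.
data(swiss)
catholic   <- swiss$Fertility[swiss$Catholic > 50]
protestant <- swiss$Fertility[swiss$Catholic <= 50]

# Squared standard error of each group mean
se2_c <- var(catholic) / length(catholic)
se2_p <- var(protestant) / length(protestant)

mean_diff <- mean(catholic) - mean(protestant)
t_stat    <- mean_diff / sqrt(se2_c + se2_p)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_c + se2_p)^2 /
  (se2_c^2 / (length(catholic) - 1) + se2_p^2 / (length(protestant) - 1))

t_p_value <- 2 * pt(-abs(t_stat), df_welch)  # two-sided p-value
ci <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * sqrt(se2_c + se2_p)

c(t = t_stat, df = df_welch, p = t_p_value, lower = ci[1], upper = ci[2])
```

The non-integer degrees of freedom (26.742) are the visible sign that the unequal-variance version of the test was used.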

Non-parametric Test (Wilcoxon rank sum test)

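W = 409.5, p-value = 0.0012
- The p-value is considerably smaller than that of the t-test (0.01186) and well below 0.05
- Because the data contain ties, R reports a normal-approximation p-value with continuity correction rather than an exact one
- This strongly supports the alternative hypothesis that there is a location shift between the two groups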
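
The W statistic can also be read as a count: it is the number of (Catholic, Protestant) province pairs in which the Catholic province has the higher fertility, with ties counted as one half. The sketch below recomputes W from ranks and divides by the number of pairs to get a simple effect-size measure (often called the probability of superiority); this measure is an addition for illustration, not part of the original analysis, and the variable names are assumptions.

```r
# Sketch: Wilcoxon rank sum statistic by hand, plus a pairwise effect size.
data(swiss)
catholic   <- swiss$Fertility[swiss$Catholic > 50]
protestant <- swiss$Fertility[swiss$Catholic <= 50]

n_c <- length(catholic)
n_p <- length(protestant)

# Rank all observations together (ties receive average ranks)
all_ranks <- rank(c(catholic, protestant))

# W = rank sum of the first group minus its minimum possible value
w_stat <- sum(all_ranks[seq_len(n_c)]) - n_c * (n_c + 1) / 2

# Share of cross-group pairs in which the Catholic province ranks higher
prob_superiority <- w_stat / (n_c * n_p)

c(W = w_stat, prob_superiority = prob_superiority)
```

With W = 409.5 and 18 x 29 = 522 pairs, the share comes out at about 0.78, which gives a more interpretable sense of the size of the shift than the p-value alone.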

Conclusion

Based on the statistical analysis, the research question: “Is there a significant difference in fertility rates between predominantly Catholic and predominantly Protestant provinces in Switzerland?” can be answered as:

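Yes, there is a statistically significant difference in fertility rates between predominantly Catholic and predominantly Protestant provinces in Switzerland. Both the Welch t-test (p-value = 0.01186) and the Wilcoxon rank sum test (p-value = 0.0012) reject the null hypothesis. Predominantly Catholic provinces show higher fertility, with a mean about 10.24 units higher on the standardized fertility measure (95% confidence interval 2.46 to 18.02) than predominantly Protestant provinces.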