In this assignment, we’ll use the “swiss” dataset available in R. This dataset gives us information about different aspects of Swiss provinces in the late 1800s. Using multiple regression, we’ll look at how factors like farming, education, and religion relate to fertility rates.
To begin, we’ll load the dataset, explore its structure, and then check summary statistics.
# Load the dataset
data(swiss)
# Explore the structure of the dataset
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
# Check summary statistics
summary(swiss)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
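Before fitting a model, it can help to see how strongly the variables move together. The following is a minimal sketch of such a check, not part of the original assignment; the name `cor_mat` is an illustrative choice.

```r
# Load the dataset (ships with base R)
data(swiss)

# Pairwise Pearson correlations between all six variables,
# rounded to two decimals for readability
cor_mat <- round(cor(swiss), 2)
print(cor_mat)

# Correlation of each predictor with Fertility, sorted from
# most negative to most positive (the first column is Fertility itself)
sort(cor_mat["Fertility", -1])
```

Strong correlations between predictors are worth noting, because they make individual coefficients in a multiple regression harder to interpret.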
Next, we will fit a multiple regression model to the “swiss” dataset, aiming to predict fertility rates using agricultural, educational, and religious variables, along with a quadratic term for education and an interaction term between agriculture and Catholicism. Then, we’ll print a summary of the model’s findings.
# Fit the multiple regression model
model <- lm(Fertility ~ Agriculture + Education + I(Education^2) + Catholic + Agriculture:Catholic, data = swiss)
# Print the summary of the model
summary(model)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + I(Education^2) +
## Catholic + Agriculture:Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.136 -6.214 1.376 5.712 14.437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.1518268 6.5838484 12.782 6.88e-16 ***
## Agriculture -0.1832766 0.0934643 -1.961 0.0567 .
## Education -0.9174241 0.3961951 -2.316 0.0257 *
## I(Education^2) -0.0033018 0.0076312 -0.433 0.6675
## Catholic 0.1684727 0.0986089 1.708 0.0951 .
## Agriculture:Catholic -0.0003612 0.0015433 -0.234 0.8161
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.892 on 41 degrees of freedom
## Multiple R-squared: 0.6442, Adjusted R-squared: 0.6008
## F-statistic: 14.85 on 5 and 41 DF, p-value: 2.581e-08
Based on the multiple regression analysis of the “swiss” dataset, only one predictor is statistically significant at the 5% level. Education, which in this dataset measures the percentage of draftees educated beyond primary school, has a negative relationship with fertility: its linear coefficient is about -0.92 (p = 0.026), so provinces with more education tend to have lower fertility. The quadratic education term is small and non-significant (p = 0.67), so the model gives no evidence that this relationship weakens or strengthens as education increases. Agriculture has a negative coefficient and Catholic a positive one, but both are only marginally significant (p = 0.057 and p = 0.095), and the interaction between agriculture and Catholicism is not significant (p = 0.82). Overall, the model explains about 64% of the variability in fertility rates (R-squared 0.644, adjusted 0.601), and the overall F-test is highly significant (p = 2.6e-08). These are associations in observational data and do not by themselves show that any variable causes changes in fertility.
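Because neither the quadratic term nor the interaction term is individually significant, one natural follow-up is to test them jointly against a simpler model. The sketch below is an addition to the assignment, with `full`, `reduced`, and `cmp` as illustrative names; it compares the two nested models with an F-test via `anova()`.

```r
data(swiss)

# Full model: the same formula as fitted above
full <- lm(Fertility ~ Agriculture + Education + I(Education^2) +
             Catholic + Agriculture:Catholic, data = swiss)

# Reduced model: drops the quadratic and interaction terms
reduced <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)

# F-test of the two extra terms jointly; a large p-value means the
# simpler model fits about as well as the full one
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared penalizes extra terms, so it is a fairer
# comparison between the two models than plain R-squared
c(reduced = summary(reduced)$adj.r.squared,
  full    = summary(full)$adj.r.squared)
```

If the joint test is not significant, the reduced model is usually preferred because it is easier to interpret.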
Now, we will conduct a residual analysis to assess the validity of our multiple regression model. Called on a fitted lm object, the plot() function generates four diagnostic plots by default: (1) residuals vs. fitted values, (2) a normal Q-Q plot of the standardized residuals, (3) a scale-location plot, and (4) residuals vs. leverage, with Cook’s distance contours. These plots will help us evaluate the assumptions of the regression model (linearity, normality of residuals, and constant variance) and identify any patterns, outliers, or influential observations.
# Residual analysis
par(mfrow=c(2,2))
plot(model)
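To follow up on influential observations numerically, Cook’s distances can be extracted directly from the fitted model. This is a minimal sketch added to the assignment: the cutoff of 4/n is a common rule of thumb rather than a formal test, and `cd`, `threshold`, and `flagged` are illustrative names.

```r
data(swiss)

# Refit the model from above so this block runs on its own
model <- lm(Fertility ~ Agriculture + Education + I(Education^2) +
              Catholic + Agriculture:Catholic, data = swiss)

# Cook's distance for each province (named by row name)
cd <- cooks.distance(model)

# Assumed rule-of-thumb cutoff: 4 divided by the number of observations
threshold <- 4 / nrow(swiss)

# Provinces whose Cook's distance exceeds the cutoff, largest first
flagged <- sort(cd[cd > threshold], decreasing = TRUE)
print(flagged)
```

Calling `plot(model, which = 4)` draws the same distances as a bar plot, one bar per observation. Flagged provinces are not necessarily errors; they are points worth inspecting before trusting the coefficients.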