Discussion 12

Noori Selina

In this assignment, we’ll use the “swiss” dataset available in R. This dataset gives us information about different aspects of 47 Swiss provinces in the late 1800s. Using multiple regression, we’ll look at how factors like farming, education, and religion relate to fertility rates.

To begin, we’ll load the dataset. After loading the dataset, we’ll explore its structure and then check summary statistics.

# Load the dataset
data(swiss)

# Explore the structure of the dataset
str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
# Check summary statistics
summary(swiss)
##    Fertility      Agriculture     Examination      Education    
##  Min.   :35.00   Min.   : 1.20   Min.   : 3.00   Min.   : 1.00  
##  1st Qu.:64.70   1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00  
##  Median :70.40   Median :54.10   Median :16.00   Median : 8.00  
##  Mean   :70.14   Mean   :50.66   Mean   :16.49   Mean   :10.98  
##  3rd Qu.:78.45   3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00  
##  Max.   :92.50   Max.   :89.70   Max.   :37.00   Max.   :53.00  
##     Catholic       Infant.Mortality
##  Min.   :  2.150   Min.   :10.80   
##  1st Qu.:  5.195   1st Qu.:18.15   
##  Median : 15.140   Median :20.00   
##  Mean   : 41.144   Mean   :19.94   
##  3rd Qu.: 93.125   3rd Qu.:21.70   
##  Max.   :100.000   Max.   :26.60
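
Before fitting a model, it can help to see how the variables move together, since strongly correlated predictors make individual coefficients harder to interpret. The following is a minimal sketch using base R's cor(); the object name cor_matrix is introduced here for illustration and does not appear in the original analysis.

```r
# Load the dataset so this block runs on its own
data(swiss)

# Pairwise Pearson correlations between all six variables
cor_matrix <- cor(swiss)

# Round for readability
print(round(cor_matrix, 2))

# Correlation of each predictor with the response, Fertility
print(round(cor_matrix["Fertility", colnames(cor_matrix) != "Fertility"], 2))
```

The row for Fertility gives a first look at which predictors are most strongly associated with the response before any other variables are accounted for.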

Next, we will fit a multiple regression model to the “swiss” dataset, predicting Fertility from Agriculture, Education, and Catholic, along with a quadratic term in Education and an interaction term between Agriculture and Catholic. Then, we’ll print a summary of the model’s findings.

# Fit the multiple regression model
model <- lm(Fertility ~ Agriculture + Education + I(Education^2) + Catholic + Agriculture:Catholic, data = swiss)

# Print the summary of the model
summary(model)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + I(Education^2) + 
##     Catholic + Agriculture:Catholic, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.136  -6.214   1.376   5.712  14.437 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          84.1518268  6.5838484  12.782 6.88e-16 ***
## Agriculture          -0.1832766  0.0934643  -1.961   0.0567 .  
## Education            -0.9174241  0.3961951  -2.316   0.0257 *  
## I(Education^2)       -0.0033018  0.0076312  -0.433   0.6675    
## Catholic              0.1684727  0.0986089   1.708   0.0951 .  
## Agriculture:Catholic -0.0003612  0.0015433  -0.234   0.8161    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.892 on 41 degrees of freedom
## Multiple R-squared:  0.6442, Adjusted R-squared:  0.6008 
## F-statistic: 14.85 on 5 and 41 DF,  p-value: 2.581e-08
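
Because the model contains a quadratic term and an interaction, the slope for Education depends on the level of Education, and the slope for Agriculture depends on the level of Catholic. The sketch below computes those implied slopes from the fitted coefficients. The evaluation points in edu_levels and cath_levels are illustrative values chosen for this example, not part of the original analysis.

```r
# Refit the model from the text so this block runs on its own
data(swiss)
model <- lm(Fertility ~ Agriculture + Education + I(Education^2) +
              Catholic + Agriculture:Catholic, data = swiss)
b <- coef(model)

# Implied slope of Fertility with respect to Education:
#   b[Education] + 2 * b[I(Education^2)] * Education
edu_levels <- c(5, 10, 20)  # illustrative values (assumption)
edu_slope <- b["Education"] + 2 * b["I(Education^2)"] * edu_levels
names(edu_slope) <- paste0("Education=", edu_levels)
print(round(edu_slope, 3))

# Implied slope of Fertility with respect to Agriculture:
#   b[Agriculture] + b[Agriculture:Catholic] * Catholic
cath_levels <- c(0, 50, 100)  # illustrative values (assumption)
agr_slope <- b["Agriculture"] + b["Agriculture:Catholic"] * cath_levels
names(agr_slope) <- paste0("Catholic=", cath_levels)
print(round(agr_slope, 3))
```

Since the quadratic coefficient is negative, the estimated Education slope becomes more negative as Education rises, although that coefficient is far from significant and the change is small.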

Based on the multiple regression analysis of the “swiss” dataset, Education is the only predictor that is statistically significant at the 5% level (p = 0.026). Note that Education in this dataset measures the percentage of draftees educated beyond primary school, not education expenditure. Its coefficient is negative, so provinces with a higher share of educated draftees tend to have lower fertility. The quadratic term I(Education^2) is not significant (p = 0.67), so the model gives no evidence that this relationship curves; its estimate is also negative, which would not point to a diminishing effect in any case. Agriculture (p = 0.057) and Catholic (p = 0.095) are only marginally significant: a higher proportion of Catholic residents is associated with slightly higher fertility, and a higher agricultural share with slightly lower fertility. The interaction between Agriculture and Catholic is not significant (p = 0.82). Overall, the model explains a moderate portion of the variability in fertility rates (R-squared = 0.644, adjusted R-squared = 0.601), and the overall F-test is highly significant (p = 2.581e-08).
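
Since neither the quadratic term nor the interaction term is individually significant, one way to check whether they are needed at all is to compare the model against a simpler one that drops both. The sketch below uses a partial F-test via anova(); the names full, reduced, and comparison are introduced here for illustration.

```r
# Refit both models so this block runs on its own
data(swiss)
full <- lm(Fertility ~ Agriculture + Education + I(Education^2) +
             Catholic + Agriculture:Catholic, data = swiss)

# Simpler model without the quadratic and interaction terms
reduced <- lm(Fertility ~ Agriculture + Education + Catholic, data = swiss)

# Partial F-test: do the two extra terms jointly improve the fit?
comparison <- anova(reduced, full)
print(comparison)

# Compare adjusted R-squared, which penalizes the extra terms
print(c(reduced = summary(reduced)$adj.r.squared,
        full    = summary(full)$adj.r.squared))
```

A large p-value in the second row of the anova table would suggest that the simpler model fits about as well as the full one.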

Now, we will conduct a residual analysis to assess the validity of our multiple regression model. The plot() function will generate four diagnostic plots: (1) residuals vs. fitted values, (2) a Q-Q plot of residuals, (3) a scale-location plot, and (4) residuals vs. leverage, with Cook’s distance contours. These plots will help us evaluate the assumptions of the regression model and identify any patterns, outliers, or influential observations.

# Residual analysis: arrange the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(model)
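
The diagnostic plots can be complemented with a few numeric checks. The sketch below lists observations with comparatively large Cook's distances and runs a Shapiro-Wilk test on the residuals. The 4/n cutoff is a common rule of thumb assumed here for illustration, not a threshold taken from the original analysis.

```r
# Refit the model from the text so this block runs on its own
data(swiss)
model <- lm(Fertility ~ Agriculture + Education + I(Education^2) +
              Catholic + Agriculture:Catholic, data = swiss)

# Cook's distance for every province
cooks_d <- cooks.distance(model)

# Flag provinces above the 4/n rule-of-thumb cutoff (assumption)
cutoff <- 4 / nrow(swiss)
influential <- sort(cooks_d[cooks_d > cutoff], decreasing = TRUE)
print(influential)

# Shapiro-Wilk test of normality on the residuals
normality <- shapiro.test(residuals(model))
print(normality)
```

Flagged provinces are worth a closer look rather than automatic removal, and the normality test should be read alongside the Q-Q plot rather than in place of it.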