week7_discussion

# Load iris dataset 
data(iris)

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Fit the multiple regression model
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species, data = iris)

# Summarize the model
summary(model)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + 
##     Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.79424 -0.21874  0.00899  0.20255  0.73103 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
## Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
## Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
## Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
## Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
## Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared:  0.8673, Adjusted R-squared:  0.8627 
## F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16

# Evaluate residuals
residuals <- resid(model)

# Plot residuals for diagnostic checks
par(mfrow = c(2, 2))
plot(model)

Interpretation of Coefficients:

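Intercept: The intercept is 2.17, which is the estimated Sepal.Length for a setosa flower (the reference level of Species) when Sepal.Width, Petal.Length, and Petal.Width are all zero. No real flower has zero measurements, so the intercept anchors the fit rather than describing a plausible observation.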
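As a quick check on this reading, the intercept can be recovered as the model's prediction for a hypothetical flower with every measurement at zero. This is a minimal sketch that refits the model above; the names `fit` and `zero_flower` are introduced here for illustration, and the flower is assumed to be setosa, the first factor level and therefore R's default reference.

```r
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

# Hypothetical flower: every numeric predictor at zero, species at the
# reference level (setosa)
zero_flower <- data.frame(
  Sepal.Width  = 0,
  Petal.Length = 0,
  Petal.Width  = 0,
  Species      = factor("setosa", levels = levels(iris$Species))
)

intercept_from_coef    <- unname(coef(fit)["(Intercept)"])
intercept_from_predict <- unname(predict(fit, newdata = zero_flower))

# The two values should agree up to floating-point error
all.equal(intercept_from_coef, intercept_from_predict)
```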

Coefficients for Predictors:

Sepal.Width: For each unit increase in Sepal.Width, Sepal.Length is expected to increase by 0.49589 units, holding the other predictors constant. This effect is statistically significant (p = 4.87e-08).

Petal.Length: For each unit increase in Petal.Length, Sepal.Length is expected to increase by 0.82924 units.

Petal.Width: For each unit increase in Petal.Width, Sepal.Length is expected to decrease by 0.31516 units (significant at p = 0.03889).
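One way to read the sign and significance of each slope without relying on the printed stars is to look at confidence intervals. The sketch below refits the same model and uses `confint()` from base R; `fit` and `coef_table` are names chosen here for illustration, and the 95% level is an assumption.

```r
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

# Point estimates next to their 95% confidence limits
coef_table <- cbind(Estimate = coef(fit), confint(fit, level = 0.95))
round(coef_table, 3)
```

An interval that excludes zero corresponds to a p-value below 0.05 in the summary table, and the sign of the interval gives the direction of the effect.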

Species (versicolor and virginica): The coefficients represent the difference in expected Sepal.Length compared to the baseline species (setosa), holding the other predictors constant. Compared to setosa, versicolor is associated with a decrease of 0.72356 units in Sepal.Length (p = 0.00306), and virginica with a decrease of 1.02350 units (p = 0.00258). Both differences are significant.
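The species coefficients can be seen as gaps between predictions for flowers that differ only in species. The following sketch assumes three hypothetical flowers with identical measurements (the values 3, 4, and 1.2 are arbitrary, and `fit`, `same_size`, and `pred` are illustrative names).

```r
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

# Three hypothetical flowers that differ only in species
same_size <- data.frame(
  Sepal.Width  = 3,
  Petal.Length = 4,
  Petal.Width  = 1.2,
  Species      = factor(c("setosa", "versicolor", "virginica"),
                        levels = levels(iris$Species))
)
pred <- unname(predict(fit, newdata = same_size))

# Differences from the setosa prediction reproduce the Species coefficients
versicolor_gap <- pred[2] - pred[1]
virginica_gap  <- pred[3] - pred[1]
c(versicolor = versicolor_gap, virginica = virginica_gap)
```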

Residuals: These are the differences between the observed values of Sepal.Length and the values predicted by the model.
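The definition above can be checked directly by subtracting the fitted values from the observed ones. This is a small sketch that refits the model; `fit`, `manual_resid`, and `model_resid` are illustrative names.

```r
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

# Residual = observed value minus fitted (predicted) value
manual_resid <- iris$Sepal.Length - unname(fitted(fit))
model_resid  <- unname(resid(fit))

all.equal(manual_resid, model_resid)

# Five-number summary, comparable to the "Residuals" block of summary(fit)
round(quantile(model_resid), 5)
```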

Residual standard error: This is an estimate of the standard deviation of the residuals. Here it is 0.3068 on 144 degrees of freedom, so observed Sepal.Length values typically deviate from the predicted values by about 0.31 units.
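To make the quantity concrete, the residual standard error can be computed by hand as the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below refits the model and compares the manual value with `sigma()` from base R; `fit`, `rss`, and `rse_manual` are illustrative names.

```r
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

rss        <- sum(resid(fit)^2)           # residual sum of squares
rse_manual <- sqrt(rss / df.residual(fit)) # 150 observations - 6 coefficients

all.equal(rse_manual, sigma(fit))
```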

Multiple R-squared: This measures the proportion of the variance in Sepal.Length that is predictable from the independent variables (Sepal.Width, Petal.Length, Petal.Width, and Species). An R-squared of 0.8673 indicates that approximately 86.73% of the variance in Sepal.Length is explained by the predictors in the model.
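R-squared can likewise be reproduced from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. This is a sketch under the same model; `fit`, `rss`, `tss`, and `r2_manual` are names introduced here.

```r
data(iris)
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species,
          data = iris)

rss <- sum(resid(fit)^2)                                      # unexplained variation
tss <- sum((iris$Sepal.Length - mean(iris$Sepal.Length))^2)   # total variation
r2_manual <- 1 - rss / tss

all.equal(r2_manual, summary(fit)$r.squared)
```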

So overall this multiple regression model indicates that Sepal.Width, Petal.Length, Petal.Width, and Species (versicolor and virginica compared to setosa) are significant predictors of Sepal.Length in the iris dataset. The model provides insights into how each predictor influences Sepal.Length, and the residuals help assess the model’s goodness of fit by examining the difference between observed and predicted values.

# Load the 'iris' dataset
data(iris)

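# Convert Sepal.Length to a three-level categorical variable
# (short: <= 5, medium: > 5 and <= 6, long: > 6)
iris$Sepal.Length.category <- cut(iris$Sepal.Length, breaks = c(0, 5, 6, Inf), labels = c("short", "medium", "long"))

# Fit a logistic regression model
# Note: with family = "binomial" and a factor response, glm() treats the first
# level ("short") as failure and every other level as success, so this is a
# binary model of "medium or long" vs. "short", not a multinomial model
model <- glm(Sepal.Length.category ~ Species + Sepal.Width + Petal.Length + Petal.Width, data = iris, family = "binomial")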

# Summary of the model
summary(model)

## 
## Call:
## glm(formula = Sepal.Length.category ~ Species + Sepal.Width + 
##     Petal.Length + Petal.Width, family = "binomial", data = iris)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -40.94866   10.86174  -3.770 0.000163 ***
## Speciesversicolor   0.09454    4.69795   0.020 0.983945    
## Speciesvirginica   -7.59422    7.39397  -1.027 0.304381    
## Sepal.Width         9.45599    2.75797   3.429 0.000607 ***
## Petal.Length        5.37351    2.01581   2.666 0.007683 ** 
## Petal.Width         0.64334    3.45795   0.186 0.852408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 155.502  on 149  degrees of freedom
## Residual deviance:  42.673  on 144  degrees of freedom
## AIC: 54.673
## 
## Number of Fisher Scoring iterations: 9

Coefficients: This section displays the estimated coefficients for each variable in the model. Because the model was fit with family = "binomial", glm() treats the first level of Sepal.Length.category ("short") as the reference outcome and groups "medium" and "long" together as the event. Each coefficient is therefore a change in the log-odds that a flower is not "short" (Sepal.Length greater than 5). The Species coefficients give the difference in log-odds for versicolor and virginica relative to the reference species (setosa). The coefficients for Sepal.Width, Petal.Length, and Petal.Width give the change in log-odds for a one-unit increase in that predictor, holding the others constant.
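It can help to make explicit what the binomial fit is modelling. With a factor response, `glm()` documents that the first level counts as failure and all other levels as success, so the sketch below recodes the response as 0/1 by hand and confirms the coefficients are unchanged, then converts them to odds ratios. `iris_cat`, `not_short`, `fit_glm`, and `fit_binary` are names introduced here for illustration.

```r
data(iris)
iris_cat <- iris
iris_cat$Sepal.Length.category <- cut(iris_cat$Sepal.Length,
                                      breaks = c(0, 5, 6, Inf),
                                      labels = c("short", "medium", "long"))

# The model as fitted in the text
fit_glm <- glm(Sepal.Length.category ~ Species + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = iris_cat, family = "binomial")

# Explicit 0/1 coding of the same outcome: 1 = "medium" or "long", 0 = "short"
iris_cat$not_short <- as.integer(iris_cat$Sepal.Length.category != "short")
fit_binary <- glm(not_short ~ Species + Sepal.Width +
                    Petal.Length + Petal.Width,
                  data = iris_cat, family = "binomial")

all.equal(coef(fit_glm), coef(fit_binary))

# Exponentiated coefficients are odds ratios for being "not short"
odds_ratios <- exp(coef(fit_glm))
signif(odds_ratios, 3)
```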

Standard Errors, z-value, and p-value: These columns provide information about the statistical significance of the estimated coefficients. The standard errors indicate the variability of the estimated coefficients. The z-value is the estimated coefficient divided by its standard error. The p-value is the probability of observing a z-value at least as extreme as the one obtained if the null hypothesis (that the true coefficient is zero) holds. Lower p-values indicate stronger evidence against the null hypothesis.
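The z-values and p-values can be rebuilt from the first two columns of the coefficient table. The sketch below refits the binomial model on a copy of the data and assumes the usual two-sided normal test; `iris_cat`, `fit_glm`, `z_manual`, and `p_manual` are illustrative names.

```r
data(iris)
iris_cat <- iris
iris_cat$Sepal.Length.category <- cut(iris_cat$Sepal.Length,
                                      breaks = c(0, 5, 6, Inf),
                                      labels = c("short", "medium", "long"))
fit_glm <- glm(Sepal.Length.category ~ Species + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = iris_cat, family = "binomial")

coef_summary <- coef(summary(fit_glm))

# z = estimate / standard error
z_manual <- coef_summary[, "Estimate"] / coef_summary[, "Std. Error"]

# Two-sided p-value from the standard normal distribution
p_manual <- 2 * pnorm(abs(z_manual), lower.tail = FALSE)

all.equal(z_manual, coef_summary[, "z value"])
all.equal(p_manual, coef_summary[, "Pr(>|z|)"])
```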

The significance codes are provided to indicate the level of significance of the coefficients. The symbols reflect the p-value of the coefficient: three stars (***) indicate p < 0.001, two stars (**) indicate p < 0.01, one star (*) indicates p < 0.05, a dot (.) indicates p < 0.1, and no symbol indicates p of 0.1 or more. In this model the intercept and Sepal.Width carry three stars and Petal.Length carries two, while the Species and Petal.Width coefficients are not statistically significant.

The null deviance and the residual deviance are measures of the goodness of fit of the model. The null deviance (155.502 on 149 degrees of freedom) is the deviance when only the intercept is included in the model. The residual deviance (42.673 on 144 degrees of freedom) is the deviance of the fitted model with all predictors included. The smaller the residual deviance relative to the null deviance, the more the predictors improve the fit.
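The drop from the null deviance to the model's deviance can be turned into a likelihood-ratio test of whether the predictors jointly improve on an intercept-only model. This is a sketch assuming the chi-squared approximation is adequate; `iris_cat`, `fit_glm`, `lr_stat`, `lr_df`, and `lr_p` are names chosen here.

```r
data(iris)
iris_cat <- iris
iris_cat$Sepal.Length.category <- cut(iris_cat$Sepal.Length,
                                      breaks = c(0, 5, 6, Inf),
                                      labels = c("short", "medium", "long"))
fit_glm <- glm(Sepal.Length.category ~ Species + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = iris_cat, family = "binomial")

# Reduction in deviance and the number of parameters that bought it
lr_stat <- fit_glm$null.deviance - fit_glm$deviance
lr_df   <- fit_glm$df.null - fit_glm$df.residual

# Upper-tail chi-squared probability for the likelihood-ratio statistic
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df, p.value = lr_p)
```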

AIC and BIC: The AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are measures of model fit that penalize the number of parameters in the model. The summary reports only the AIC (54.673); the BIC can be obtained separately with BIC(model). Lower AIC and BIC values indicate a better fit when comparing models fit to the same data.

Number of Fisher Scoring iterations: This is the number of iterations the estimation algorithm needed to reach the maximum likelihood estimates (9 for this model).
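For a binary response like this one, both criteria can be written as the model deviance plus a penalty on the number of estimated coefficients. The sketch below checks that relationship against base R's `AIC()` and `BIC()`; `iris_cat`, `fit_glm`, `k`, `aic_manual`, and `bic_manual` are illustrative names.

```r
data(iris)
iris_cat <- iris
iris_cat$Sepal.Length.category <- cut(iris_cat$Sepal.Length,
                                      breaks = c(0, 5, 6, Inf),
                                      labels = c("short", "medium", "long"))
fit_glm <- glm(Sepal.Length.category ~ Species + Sepal.Width +
                 Petal.Length + Petal.Width,
               data = iris_cat, family = "binomial")

k <- length(coef(fit_glm))   # number of estimated coefficients
n <- nobs(fit_glm)           # number of observations

# With a 0/1 outcome the deviance equals -2 * log-likelihood
aic_manual <- deviance(fit_glm) + 2 * k
bic_manual <- deviance(fit_glm) + log(n) * k

all.equal(aic_manual, AIC(fit_glm))
all.equal(bic_manual, BIC(fit_glm))
```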

By examining the coefficients, their standard errors, z-values, and p-values, you can determine the significance and direction of the relationships between the predictor variables (Species, Sepal.Width, Petal.Length, Petal.Width) and the response variable (Sepal.Length.category, modeled as "short" versus "medium or long"). The residual deviance and the AIC help assess the overall goodness of fit of the model, while the significance codes apply to individual coefficients only.
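If the goal is to model all three length categories rather than a binary split, one option is a multinomial logit. The sketch below swaps in `multinom()` from the nnet package, which is not used in the original analysis and is assumed to be installed (it ships as a recommended package with most R distributions). `iris_cat`, `fit_multi`, and `class_probs` are names introduced here, and no output is shown because it was not part of the original run.

```r
library(nnet)  # assumption: nnet is available

data(iris)
iris_cat <- iris
iris_cat$Sepal.Length.category <- cut(iris_cat$Sepal.Length,
                                      breaks = c(0, 5, 6, Inf),
                                      labels = c("short", "medium", "long"))

# Multinomial logit with "short" (the first level) as the reference category
fit_multi <- multinom(Sepal.Length.category ~ Species + Sepal.Width +
                        Petal.Length + Petal.Width,
                      data = iris_cat, trace = FALSE)

# One row of coefficients per non-reference category ("medium", "long")
coef(fit_multi)

# Predicted probabilities for each category; every row sums to 1
class_probs <- predict(fit_multi, type = "probs")
head(round(class_probs, 3))
```

Each row of the coefficient matrix holds log-odds of that category relative to "short", which matches the "moving from the reference category" reading more closely than the binary fit does.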

Reuben

2024-06-25