Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
#Loading data from CSV file
data <- read.csv("C:/Users/aleja/Desktop/forestfires.csv")
#View of the first few rows of dataset
head(data)
##   X Y month day FFMC  DMC    DC  ISI temp RH wind rain area
## 1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0
## 2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0
## 3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0
## 4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0
## 5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0
## 6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0
#Structure of the dataset
str(data)
## 'data.frame':    517 obs. of  13 variables:
##  $ X    : int  7 7 7 8 8 8 8 8 8 7 ...
##  $ Y    : int  5 4 4 6 6 6 6 6 6 5 ...
##  $ month: chr  "mar" "oct" "oct" "mar" ...
##  $ day  : chr  "fri" "tue" "sat" "fri" ...
##  $ FFMC : num  86.2 90.6 90.6 91.7 89.3 92.3 92.3 91.5 91 92.5 ...
##  $ DMC  : num  26.2 35.4 43.7 33.3 51.3 ...
##  $ DC   : num  94.3 669.1 686.9 77.5 102.2 ...
##  $ ISI  : num  5.1 6.7 6.7 9 9.6 14.7 8.5 10.7 7 7.1 ...
##  $ temp : num  8.2 18 14.6 8.3 11.4 22.2 24.1 8 13.1 22.8 ...
##  $ RH   : int  51 33 33 97 99 29 27 86 63 40 ...
##  $ wind : num  6.7 0.9 1.3 4 1.8 5.4 3.1 2.2 5.4 4 ...
##  $ rain : num  0 0 0 0.2 0 0 0 0 0 0 ...
##  $ area : num  0 0 0 0 0 0 0 0 0 0 ...
#Fitting the multiple regression model with log-transformed response variable
model <- lm(log(area + 1) ~ FFMC + DMC + DC + ISI + temp + RH + wind + rain, data = data)
#Summary of the model
summary(model)
##
## Call:
## lm(formula = log(area + 1) ~ FFMC + DMC + DC + ISI + temp + RH +
##     wind + rain, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.5203 -1.1129 -0.6158  0.8787  5.7121
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.2224140  1.3604350   0.163    0.870
## FFMC         0.0077082  0.0144884   0.532    0.595
## DMC          0.0011915  0.0014642   0.814    0.416
## DC           0.0002737  0.0003570   0.767    0.444
## ISI         -0.0239494  0.0169248  -1.415    0.158
## temp         0.0024618  0.0172593   0.143    0.887
## RH          -0.0051729  0.0051889  -0.997    0.319
## wind         0.0757669  0.0366155   2.069    0.039 *
## rain         0.0965122  0.2121461   0.455    0.649
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.395 on 508 degrees of freedom
## Multiple R-squared:  0.01988,    Adjusted R-squared:  0.004446
## F-statistic: 1.288 on 8 and 508 DF,  p-value: 0.2472
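Because the response is log(area + 1), each coefficient is an additive change on the log scale, and exponentiating it gives the multiplicative change in area + 1 for a one-unit increase in that predictor with the others held fixed. A brief sketch for the wind estimate reported above (the names `b_wind`, `ratio`, and `pct_change` are illustrative, not part of the original analysis):

```r
# Wind coefficient copied from the summary output above
b_wind <- 0.0757669

# Multiplicative change in (area + 1) per one-unit increase in wind,
# holding the other predictors fixed
ratio <- exp(b_wind)

# The same quantity expressed as a percent change
pct_change <- 100 * (ratio - 1)
pct_change
```

This works out to roughly a 7.9% increase in area + 1 per unit of wind. The same transformation applies to the other coefficients, although their estimates are not distinguishable from zero in this fit.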
Residual analysis
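The diagnostic plots and tests behind this section are not reproduced in the text. A minimal sketch of how such checks might be run in base R follows. It assumes a small synthetic stand-in (`sim`, `fit`) instead of the real data, so its output says nothing about the forest fires model; with the real data, `fit` would be replaced by `model`. The Breusch-Pagan-style check is one possible choice and is not necessarily the test the original analysis used.

```r
# Synthetic stand-in data (NOT the forest fires data), for illustration only
set.seed(1)
n <- 200
sim <- data.frame(
  wind = runif(n, 0.5, 9),
  temp = runif(n, 2, 33)
)
# Zero-heavy, right-skewed response loosely mimicking a burned-area variable
sim$area <- rbinom(n, 1, 0.5) * rexp(n, rate = 0.1)

fit <- lm(log(area + 1) ~ wind + temp, data = sim)
res <- residuals(fit)
fitted_vals <- fitted(fit)

# Residuals vs fitted values: a funnel or band pattern suggests
# non-constant variance or a misspecified mean
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points far from the line suggest non-normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality (small p-value: evidence against normality)
sw <- shapiro.test(res)

# Breusch-Pagan-style check (studentized form): regress squared residuals
# on the predictors and compare n * R^2 to a chi-square distribution
sim$res2 <- res^2
aux <- lm(res2 ~ wind + temp, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_stat = bp_stat, bp_p = bp_p)
```

The degrees of freedom in `pchisq` equal the number of predictors in the auxiliary regression (two here, eight for the model fitted above).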
Conclusion:
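The multiple regression fitted to the forest fires dataset found that wind was the only predictor with a statistically significant effect on the log-transformed burned area (estimate 0.0758, p = 0.039); none of the other predictors was significant at the 0.05 level. The model as a whole explained very little of the variation (R-squared 0.0199, adjusted R-squared 0.0044), and the overall F-test was not significant (p = 0.2472). The residual analysis also revealed violations of key linear regression assumptions, namely heteroscedasticity and non-normal residuals, so the linear model, as specified, does not appear to be appropriate for these data.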
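The assignment calls for a quadratic term, a dichotomous term, and a dichotomous vs. quantitative interaction, none of which appear in the formula fitted above. One way these could be specified in an R formula is sketched below. It again uses synthetic stand-in data, and the `summer` indicator (1 for June through September, 0 otherwise), the choice of `temp` for the quadratic term, and the name `fit2` are all assumptions made for illustration.

```r
# Synthetic stand-in data (NOT the forest fires data), for illustration only
set.seed(42)
n <- 300
sim <- data.frame(
  month = sample(c("mar", "aug", "sep", "oct"), n, replace = TRUE),
  temp  = runif(n, 2, 33),
  wind  = runif(n, 0.5, 9)
)
sim$area <- rbinom(n, 1, 0.5) * rexp(n, rate = 0.1)

# Hypothetical dichotomous predictor: 1 for June-September, 0 otherwise
sim$summer <- as.integer(sim$month %in% c("jun", "jul", "aug", "sep"))

# I(temp^2)   : quadratic term in temperature
# summer      : dichotomous term
# summer:wind : dichotomous vs. quantitative interaction
fit2 <- lm(log(area + 1) ~ temp + I(temp^2) + summer + wind + summer:wind,
           data = sim)
coef(fit2)
```

In a model of this form, the `summer` coefficient is the shift in the expected log response for summer months when wind is zero, `summer:wind` is the difference in the wind slope between summer and non-summer months, and `I(temp^2)` allows the temperature effect to curve instead of staying linear.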