#a) Plot a pairwise scatter plot of "crim", "zn", "indus", "age" and "medv". State your observations
library(MASS)
## Warning: package 'MASS' was built under R version 4.5.3
data1 <- Boston[,c("crim","zn","indus","age","medv")]
pairs(data1, main="Pairwise Scatter Plot of Boston Data")

#medv vs. age: There is a negative association. As age (the proportion of owner-occupied units built before 1940) increases, medv tends to decrease. This suggests that older neighborhoods generally have lower property values in this dataset.
#medv vs. crim: The relationship is strongly non-linear. Most high-value areas (medv > 30) have near-zero crime rates. As crim increases, medv drops sharply and stays low.
#medv vs. indus: There is a noticeable negative trend. Areas with a higher proportion of non-retail business acres (indus) have lower median home values.
#age vs. indus: There is a clear positive association. Areas with more industrial land tend to have a higher proportion of older homes.
#crim vs. age: Crime rates are higher in areas with a high proportion of older homes (age near 100). The crim values only start to spread out and increase once age is high.
#crim and zn: Both variables are heavily right-skewed. Most crim values are clustered near zero, with a few extreme values. Similarly, zn (the proportion of residential land zoned for large lots) has many zeros, indicating that most areas in this dataset have no land zoned for large residential lots.
#medv cap: There is a line of points at the maximum value (medv = 50). Since medv is recorded in $1000s, this indicates that the data were likely censored (capped) at $50,000.
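#The visual observations above can be backed up with numbers. The sketch below is not part of the original answer: it computes Pearson and Spearman correlation matrices (Spearman is used here on the assumption that the skew in crim and zn makes rank correlation more informative) and counts the observations sitting at the medv cap. The object names cor_pearson, cor_spearman and n_capped are illustrative.

```r
library(MASS)

# Same subset of the Boston data as used for the scatter plot matrix
data1 <- Boston[, c("crim", "zn", "indus", "age", "medv")]

# Pearson correlations measure linear association
cor_pearson <- round(cor(data1), 2)

# Spearman rank correlations are less sensitive to skewed variables
cor_spearman <- round(cor(data1, method = "spearman"), 2)

print(cor_pearson)
print(cor_spearman)

# Number of observations at the apparent cap of medv = 50
n_capped <- sum(data1$medv == 50)
print(n_capped)
```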

#b) Fit a multiple linear regression by using "medv" as the response variable. Interpret the model

model <- lm(medv~crim+zn+indus+age, data=data1)
summary(model)
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + age, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.089  -4.739  -1.652   2.677  32.510 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.752147   1.248655  22.226  < 2e-16 ***
## crim        -0.246045   0.044424  -5.539 4.93e-08 ***
## zn           0.055864   0.018727   2.983  0.00299 ** 
## indus       -0.403492   0.070717  -5.706 1.98e-08 ***
## age         -0.006875   0.017308  -0.397  0.69136    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.759 on 501 degrees of freedom
## Multiple R-squared:  0.2939, Adjusted R-squared:  0.2883 
## F-statistic: 52.14 on 4 and 501 DF,  p-value: < 2.2e-16
#The overall F-test (F = 52.14, p < 2.2e-16) shows that the model is statistically significant, meaning at least one predictor is related to medv. The model explains about 29.4% of the variation in medv (adjusted R-squared: 28.8%), so most of the variation remains unexplained.
#Holding the other predictors fixed, a one-unit increase in crim is associated with a decrease of about 0.246 in medv (roughly $246, since medv is in $1000s), and a one-unit increase in indus with a decrease of about 0.403. A one-unit increase in zn (the proportion of residential land zoned for large lots) is associated with an increase of about 0.056. All three effects are significant at the 5% level.
#The coefficient of age (-0.007, p = 0.691) is not significant: once crim, zn and indus are accounted for, the proportion of older homes has almost no independent association with medv.
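#To make the interpretation concrete, the following sketch pulls the fit statistics out of the summary object and rescales the slopes to dollars. It refits the same model so that it runs on its own. The names fit_stats and effect_usd are illustrative, and the factor of 1000 relies on medv being recorded in $1000s.

```r
library(MASS)

data1 <- Boston[, c("crim", "zn", "indus", "age", "medv")]
model <- lm(medv ~ crim + zn + indus + age, data = data1)
s <- summary(model)

# R-squared and adjusted R-squared as a named vector
fit_stats <- c(r_squared = s$r.squared, adj_r_squared = s$adj.r.squared)
print(round(fit_stats, 4))

# Change in median home value (in dollars) per one-unit increase
# in each predictor, holding the others fixed
effect_usd <- coef(model)[-1] * 1000
print(round(effect_usd, 1))
```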

#c) Determine the significant predictors using confidence interval and p-value.
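#The significant predictors in this model are crim (p = 4.93e-08), zn (p = 0.00299) and indus (p = 1.98e-08): each p-value is below 0.05, so the corresponding 95% confidence interval excludes zero. The variable age is not significant (p = 0.691, so its 95% confidence interval contains zero) and can be considered for removal to simplify the model.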
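#The question asks for confidence intervals as well as p-values. A minimal sketch of how both could be tabulated is given below; it refits the model from part (b) so that it runs on its own. The data frame sig and its column names are illustrative, and a predictor is flagged as significant when its 95% interval excludes zero.

```r
library(MASS)

data1 <- Boston[, c("crim", "zn", "indus", "age", "medv")]
model <- lm(medv ~ crim + zn + indus + age, data = data1)

# 95% confidence intervals for all coefficients
ci <- confint(model, level = 0.95)

# p-values from the coefficient table
pvals <- summary(model)$coefficients[, "Pr(>|t|)"]

# Combine both criteria in one table
sig <- data.frame(
  lower = ci[, 1],
  upper = ci[, 2],
  p_value = pvals,
  significant = ci[, 1] > 0 | ci[, 2] < 0
)
print(sig)
```

#At the 5% level the two criteria always agree: an interval excludes zero exactly when the p-value is below 0.05.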

#d) Fit the model with the significant predictors only and compare with the model in part (b)
#Reduced model fitted here: the full model without indus (age is kept)
model2 <- lm(medv ~ crim + zn + age, data = data1)
anova(model2,model)
## Analysis of Variance Table
## 
## Model 1: medv ~ crim + zn + age
## Model 2: medv ~ crim + zn + indus + age
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    502 32120                                  
## 2    501 30160  1    1959.8 32.555 1.985e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
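#We compared the reduced model (medv ~ crim + zn + age, i.e. the full model without indus) with the full model using an ANOVA F-test. The resulting p-value (1.985e-08) is far below 0.05, indicating that the full model provides a significantly better fit to the data. Therefore, despite our goal of parsimony, the variable indus should be retained, as its removal significantly reduces the model's explanatory power.
#Note that this comparison tests indus, not age. The reduced model implied by part (c) is medv ~ crim + zn + indus. Because only one term is dropped, the partial F-test for age is equivalent to its t-test in part (b) (p = 0.691), so removing age does not significantly worsen the fit.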
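#For reference, the comparison that part (d) asks for, keeping only the significant predictors from part (c), could be run as sketched below. This was not run in the original answer, so no output is shown. The names model3 and cmp are illustrative.

```r
library(MASS)

data1 <- Boston[, c("crim", "zn", "indus", "age", "medv")]

# Full model from part (b)
model <- lm(medv ~ crim + zn + indus + age, data = data1)

# Reduced model with the significant predictors only (age dropped)
model3 <- lm(medv ~ crim + zn + indus, data = data1)

# Partial F-test: does adding age significantly improve the fit?
# With a single term dropped, this is equivalent to the t-test for age.
cmp <- anova(model3, model)
print(cmp)

# Adjusted R-squared of both models side by side
adj_r2 <- c(reduced = summary(model3)$adj.r.squared,
            full = summary(model)$adj.r.squared)
print(round(adj_r2, 4))
```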

#e) Perform residual analysis on the reduced model and interpret the result
par(mfrow=c(2,2))
plot(model2)

#Residual analysis of model2 (the fit without indus) suggests that the model's assumptions are approximately, though not perfectly, met. The Residuals vs Fitted plot indicates a largely linear relationship. The Normal Q-Q plot shows some deviation from normality in the tails, suggesting the presence of extreme values or outliers. The Scale-Location plot suggests that the variance of the errors is roughly constant, and the Residuals vs Leverage plot shows no single observation exerting excessive influence on the model's coefficients. Overall, the diagnostics do not rule out the linear model, but the heavy tails mean that inference relying on normally distributed errors should be treated with some caution.
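#The plots can be complemented with numerical checks. The sketch below is an addition, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and flags observations by Cook's distance. The cutoff of 4/n is a common rule of thumb assumed here, not a formal test, and the names sw, cd and influential are illustrative.

```r
library(MASS)

data1 <- Boston[, c("crim", "zn", "indus", "age", "medv")]
model2 <- lm(medv ~ crim + zn + age, data = data1)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
res <- residuals(model2)
sw <- shapiro.test(res)
print(sw)

# Cook's distance for every observation
cd <- cooks.distance(model2)

# Flag observations above the 4/n rule-of-thumb cutoff
influential <- which(cd > 4 / length(cd))
print(length(influential))

# Largest Cook's distance; values near or above 1 would be a concern
print(max(cd))
```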