library(ISLR)
library(rpart)
library(rpart.plot)
library(randomForest)

1. What is the estimated average causal effect of having a female politician on the number of new (or repaired) drinking water facilities?

a. Fit a linear model

setwd("~/Desktop/DSS") 
india = read.csv("~/Desktop/DSS/Datasets/india.csv") 
fit <- lm(water ~ female, data = india)

b. What is the estimated slope coefficient, β?

summary(fit)$coefficients["female", "Estimate"]

## [1] 9.252423

The estimated slope coefficient, β, is approximately 9.25.

c. What is the estimated average treatment effect?

ate <- summary(fit)$coefficients["female", "Estimate"]

The estimated average treatment effect of having a female politician on the number of new or repaired drinking water facilities is approximately 9.25. Here the treatment is having a female politician (female = 1), and the outcome is the number of new or repaired drinking water facilities (water). In a simple linear regression of water on the binary female indicator, the slope coefficient equals the difference in average outcomes between villages with female politicians and those without, so the estimated slope of 9.25 is also the estimated average treatment effect. This interpretation relies on the assumption that the treatment is as-good-as-randomly assigned, so that villages with and without a female politician are comparable on average and any difference in outcomes can be attributed to the presence of a female politician. If this assumption holds, we can interpret the result as a causal effect: villages with female politicians had, on average, 9.25 more new or repaired drinking water facilities than those with male politicians.
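The claim that the slope on a binary indicator equals the difference in group means can be checked on simulated data. The sketch below is illustrative only: the data frame `sim`, the sample size, and the effect size of 9 are assumptions made up for this example, not values taken from india.csv.

```r
# Simulated data (hypothetical, not the india dataset)
set.seed(1)
n <- 200
sim <- data.frame(female = rbinom(n, 1, 0.5))
sim$water <- 15 + 9 * sim$female + rnorm(n, sd = 30)

# Slope from the simple linear regression of the outcome on the indicator
slope <- unname(coef(lm(water ~ female, data = sim))["female"])

# Difference in mean outcomes between the treated and control groups
diff_means <- mean(sim$water[sim$female == 1]) - mean(sim$water[sim$female == 0])

# With a single binary regressor the two quantities coincide
all.equal(slope, diff_means)
```

The same comparison could be run on the india data by replacing `sim` with `india`.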

2. Is the effect statistically significant at the 5% level?

a. Let’s start by specifying the null and alternative hypotheses.

Null hypothesis (H₀): β₁ = 0. Having a female politician has no effect on the number of new or repaired drinking water facilities; the slope on female is zero.

Alternative hypothesis (H₁): β₁ ≠ 0. Having a female politician has an effect on the number of new or repaired drinking water facilities; the slope on female is not zero.

To test these hypotheses at the 5% level, we examine the p-value for the coefficient on female in the regression output. If the p-value is less than 0.05, we reject the null hypothesis and conclude that the effect is statistically significant at the 5% level. If the p-value is 0.05 or greater, we fail to reject the null hypothesis and conclude that the effect is not statistically significant at that level.

b. What is the value of the observed test statistic, z_obs?

# Fit the linear model
fit <- lm(water ~ female, data = india)

# Extract the coefficient and standard error for 'female'
beta_hat <- summary(fit)$coefficients["female", "Estimate"]
std_error <- summary(fit)$coefficients["female", "Std. Error"]

# Calculate the observed test statistic (z_obs)
z_obs <- beta_hat / std_error

# Display the result
z_obs

## [1] 2.343723

The observed test statistic, z_obs, is approximately 2.34, which means the estimated coefficient on female is about 2.34 standard errors away from zero. In the summary() output of lm(), this same ratio is reported in the "t value" column.

c. What is the associated p-value? Is the effect significant at the 5% level?

# Extract the p-value for the 'female' variable from the linear model summary
p_value <- summary(fit)$coefficients["female", "Pr(>|t|)"]

# Display the p-value
p_value

## [1] 0.01970398

# Check if the p-value is less than 0.05 and print the significance result
if (p_value < 0.05) {
  print("The effect is statistically significant at the 5% level.")
} else {
  print("The effect is not statistically significant at the 5% level.")
}

## [1] "The effect is statistically significant at the 5% level."

The p-value for the coefficient on female is 0.0197. Since this p-value is less than the 5% significance level (0.05), we reject the null hypothesis and conclude that the effect is statistically significant at the 5% level. In other words, a difference this large in the number of new or repaired drinking water facilities would be unlikely if having a female politician truly had no effect.
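The link between the test statistic and the p-value can be made explicit. The sketch below recomputes a two-sided p-value from z_obs in two ways: under a standard normal approximation and under a t distribution. The residual degrees of freedom (320) are an assumption, based on the 322 observations implied by the confusion matrix in question 5.

```r
# Test statistic copied from the output above
z_obs <- 2.343723

# Assumption: 322 villages, so residual degrees of freedom are 322 - 2 = 320
df_resid <- 320

# Two-sided p-value under the standard normal approximation
p_normal <- 2 * pnorm(-abs(z_obs))

# Two-sided p-value under the t distribution, as reported by summary(lm)
p_t <- 2 * pt(-abs(z_obs), df = df_resid)

c(normal = p_normal, t = p_t)
```

Both versions fall below 0.05, so the conclusion does not depend on which reference distribution is used.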

3. Use the Hitters data in the ISLR package to fit a regression tree that predicts ‘salary’.

library(ISLR)
library(rpart)
library(rpart.plot)
data(Hitters)
Hitters <- na.omit(Hitters)
# Fit the regression tree
tree_model <- rpart(Salary ~ ., data = Hitters, method = "anova")
# Plot the regression tree
rpart.plot(tree_model, type = 3, fallen.leaves = TRUE, cex = 0.7)

The regression tree predicts player salary based on performance metrics, starting with Career Hits (CHits) as the most influential variable. Players with fewer than 450 CHits are further split by Walks, CRBI, and AtBat, resulting in lower average salaries. Players with 450 or more CHits are divided by Walks, AtBat, RBI, PutOuts, and Career Runs, generally leading to higher salary predictions. Each path in the tree ends in a leaf node representing the average salary and percentage of players in that group. The tree highlights how a combination of high performance stats—particularly CHits, Walks, and RBIs—correlates with significantly higher salaries.

4. Use cross-validation to determine the number of variables you need for a good fit.

library(rpart)
tree_model <- rpart(Salary ~ ., data = Hitters, method = "anova", cp = 0.001)

# View cross-validation results
printcp(tree_model)

## 
## Regression tree:
## rpart(formula = Salary ~ ., data = Hitters, method = "anova", 
##     cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] AtBat   CAtBat  CHits   CHmRun  CRBI    CRuns   CWalks  Hits    PutOuts
## [10] RBI     Walks   Years  
## 
## Root node error: 53319113/263 = 202734
## 
## n= 263 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.3751526      0   1.00000 1.00371 0.138389
## 2  0.1202660      1   0.62485 0.66005 0.114887
## 3  0.0447760      2   0.50458 0.59845 0.099874
## 4  0.0395069      3   0.45981 0.57851 0.104502
## 5  0.0189058      4   0.42030 0.56694 0.106595
## 6  0.0156460      5   0.40139 0.58323 0.109591
## 7  0.0141210      7   0.37010 0.59194 0.109634
## 8  0.0140507      8   0.35598 0.59335 0.109662
## 9  0.0090608      9   0.34193 0.59587 0.111624
## 10 0.0087486     10   0.33287 0.60659 0.111611
## 11 0.0070271     11   0.32412 0.60827 0.111588
## 12 0.0061551     12   0.31709 0.61120 0.110778
## 13 0.0045893     13   0.31094 0.60333 0.107047
## 14 0.0034490     14   0.30635 0.60319 0.107115
## 15 0.0029108     15   0.30290 0.60540 0.107515
## 16 0.0028538     16   0.29999 0.60422 0.107503
## 17 0.0017889     17   0.29713 0.60352 0.107560
## 18 0.0010000     18   0.29535 0.59791 0.102855

# Plot the cross-validation results
plotcp(tree_model)

library(rpart.plot)
# Get the CP value with the lowest cross-validation error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"]

# Prune the tree using the best CP
pruned_tree <- prune(tree_model, cp = best_cp)

# Plot the pruned tree
rpart.plot(pruned_tree, type = 3, fallen.leaves = TRUE, cex = 0.7)

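Cross-validation favors a small tree. The cross-validated error (xerror) reaches its minimum of about 0.567 at 4 splits (CP ≈ 0.0189); further splits keep lowering the training error (rel error) but not xerror. The pruned tree therefore needs at most four variables, which makes it a good fit in terms of both interpretability and cross-validated error. Because the cross-validation folds are drawn at random, these values can change slightly between runs unless a seed is set.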
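The selection step can be traced by hand from the printcp() table. The sketch below retypes the first seven rows of that table into a data frame (the name `cp_tab` is made up for this example) and applies the minimum-xerror rule used above. It also shows the one-standard-error rule, a common alternative that this report does not use, for comparison.

```r
# First seven rows of the printcp() table above, retyped by hand
cp_tab <- data.frame(
  CP     = c(0.3751526, 0.1202660, 0.0447760, 0.0395069,
             0.0189058, 0.0156460, 0.0141210),
  nsplit = c(0, 1, 2, 3, 4, 5, 7),
  xerror = c(1.00371, 0.66005, 0.59845, 0.57851,
             0.56694, 0.58323, 0.59194),
  xstd   = c(0.138389, 0.114887, 0.099874, 0.104502,
             0.106595, 0.109591, 0.109634)
)

# Rule used in the report: the row with the smallest cross-validated error
i_min <- which.min(cp_tab$xerror)
cp_tab[i_min, c("CP", "nsplit")]

# Alternative (not used in the report): one-standard-error rule, i.e. the
# smallest tree whose xerror is within one xstd of the minimum
threshold <- cp_tab$xerror[i_min] + cp_tab$xstd[i_min]
i_1se <- which(cp_tab$xerror <= threshold)[1]
cp_tab[i_1se, c("CP", "nsplit")]
```

On this table the minimum-xerror rule selects the tree with 4 splits, while the one-standard-error rule would select the tree with a single split.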

5. Fit a random forest with 500 trees and 3 variables in each tree. Show variable importance.

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

# Load dataset
dataset <- read.csv("~/Desktop/DSS/Datasets/india.csv")

# --- Classification Model ---
# Create classification target from 'water'
dataset$target <- as.factor(ifelse(dataset$water > mean(dataset$water, na.rm = TRUE), "High", "Low"))

# Fit classification random forest
rf_model_classification <- randomForest(target ~ . -water, data = dataset, ntree = 500, mtry = 3, importance = TRUE)

cat("Classification Model Summary:\n")

## Classification Model Summary:

print(rf_model_classification)

## 
## Call:
##  randomForest(formula = target ~ . - water, data = dataset, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 30.75%
## Confusion matrix:
##      High Low class.error
## High   40  50   0.5555556
## Low    49 183   0.2112069

cat("\nClassification Model Variable Importance:\n")

## 
## Classification Model Variable Importance:

classification_importance <- importance(rf_model_classification)
print(classification_importance)

##                 High       Low MeanDecreaseAccuracy MeanDecreaseGini
## village    14.933130 13.095361            17.137471         97.27998
## female      7.208754 -1.645557             2.584108          6.05522
## irrigation  4.994437  7.115440             8.238371         25.83182

varImpPlot(rf_model_classification, main = "Classification Variable Importance")

# --- Regression Model ---
dataset$water <- as.numeric(dataset$water)

# Fit regression random forest
rf_model_regression <- randomForest(water ~ . -target, data = dataset, ntree = 500, mtry = 3, importance = TRUE)

cat("\nRegression Model Summary:\n")

## 
## Regression Model Summary:

print(rf_model_regression)

## 
## Call:
##  randomForest(formula = water ~ . - target, data = dataset, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 774.4321
##                     % Var explained: 31.51

cat("\nRegression Model Variable Importance:\n")

## 
## Regression Model Variable Importance:

regression_importance <- importance(rf_model_regression)
print(regression_importance)

##              %IncMSE IncNodePurity
## village    23.703881    202229.182
## female      5.271946      8445.525
## irrigation 27.750795    132405.650

varImpPlot(rf_model_regression, main = "Regression Variable Importance")

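A random forest with 500 trees and 3 variables tried at each split was fitted to the india data for two tasks: classification (whether water is above or below its mean) and regression (water itself). Because the data contain only three predictors (village, female, irrigation), mtry = 3 means every split considers all of them, which is equivalent to bagging. In the classification model (OOB error rate 30.75%), village is the most important variable on both MeanDecreaseAccuracy (17.14) and MeanDecreaseGini (97.28), followed by irrigation and then female. In the regression model (31.51% of variance explained), the two measures disagree on the top predictor: irrigation ranks first on %IncMSE (27.75 versus 23.70 for village), while village ranks first on IncNodePurity. In both models female is the least important variable. Overall, village and irrigation carry most of the predictive power, and female adds comparatively little.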
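The error rates in the classification summary follow directly from the confusion matrix printed above. The sketch below retypes that matrix (the object name `conf_mat` is made up for this example) and recomputes the overall OOB error and the per-class errors.

```r
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
conf_mat <- matrix(c(40, 50,
                     49, 183),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("High", "Low"),
                                   predicted = c("High", "Low")))

# Overall OOB error: share of off-diagonal (misclassified) observations
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

round(100 * oob_error, 2)
round(class_error, 4)
```

The per-class errors show that the model misclassifies more than half of the High villages, so the overall error rate is driven mostly by the larger Low class.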

6. Draw the partial plot for one of the variables. Can salary be predicted linearly with this variable?

library(ISLR)
library(randomForest)

# Load the dataset and remove rows with missing values
data(Hitters)
Hitters <- na.omit(Hitters)

# Fit the random forest model
rf_model <- randomForest(Salary ~ ., data = Hitters, ntree = 500, mtry = 3, importance = TRUE)

# Print model summary
print(rf_model)

## 
## Call:
##  randomForest(formula = Salary ~ ., data = Hitters, ntree = 500,      mtry = 3, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 78677.97
##                     % Var explained: 61.19

# Draw partial plot for "Years"
partialPlot(rf_model, pred.data = Hitters, x.var = "Years")

A random forest with 500 trees and 3 variables tried at each split was fitted to the Hitters data to predict Salary. The model summary reports a mean of squared residuals of 78677.97 and 61.19% of variance explained, both computed from out-of-bag predictions. The partial dependence plot for “Years” shows how the predicted salary changes with a player’s experience after averaging over the other predictors. Salary can be predicted linearly with Years only if this curve is close to a straight line; bends or plateaus in the curve indicate a nonlinear relationship that a single linear term would not capture.
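The idea behind a partial dependence curve can be sketched by hand: for each value on a grid, set the variable of interest to that value in every row, predict, and average the predictions. The example below is a minimal sketch under stated assumptions. It uses simulated data and a linear model with a quadratic term as a stand-in for the random forest so that it runs in base R; the names `sim`, `model`, `grid`, and `pd` are hypothetical.

```r
# Simulated data (hypothetical): y depends nonlinearly on x1 and linearly on x2
set.seed(1)
n <- 300
sim <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
sim$y <- 5 * sqrt(sim$x1) + 2 * sim$x2 + rnorm(n)

# Stand-in model (assumption: lm with a quadratic term instead of a
# random forest, to keep the example free of extra packages)
model <- lm(y ~ x1 + I(x1^2) + x2, data = sim)

# Partial dependence on x1: for each grid value, overwrite x1 in every row,
# predict, and average the predictions over the data
grid <- seq(0, 10, by = 1)
pd <- sapply(grid, function(v) {
  tmp <- sim
  tmp$x1 <- v
  mean(predict(model, newdata = tmp))
})

# A curved profile suggests that a single linear term in x1 is not enough
plot(grid, pd, type = "l", xlab = "x1", ylab = "Average prediction")
```

Reading the plot works the same way as for the Years curve: a straight line supports a linear term, while curvature argues against it.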

Final Project

Raveen Brar

2025-04-14