library(ISLR)
library(rpart)
library(rpart.plot)
library(randomForest)
setwd("~/Desktop/DSS")
india = read.csv("~/Desktop/DSS/Datasets/india.csv")
fit <- lm(water ~ female, data = india)
summary(fit)$coefficients["female", "Estimate"]
## [1] 9.252423
The estimated slope coefficient, β₁, is approximately 9.25.
ate <- summary(fit)$coefficients["female", "Estimate"]
The estimated average treatment effect of having a female politician on the number of new or repaired drinking water facilities is approximately 9.25. Here the treatment is having a female politician and the outcome is the number of new or repaired drinking water facilities. Because female is a binary indicator, the slope from the simple linear regression of water on female equals the difference in average outcomes between villages with female politicians and those without. Interpreting that difference causally relies on the assumption that the treatment is as-good-as-randomly assigned, so that the two groups of villages are comparable in everything except the politician's gender. If this assumption holds, the result is a causal effect: having a female politician increased the number of new or repaired drinking water facilities by 9.25 on average.
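The link between the regression slope and the difference in group means can be checked directly. The sketch below uses simulated data, not india.csv: the data frame `sim`, its coefficients, and the noise level are all made up for illustration, and it assumes the treatment indicator is coded 0/1.

```r
# Simulated stand-in for the real data (hypothetical values, not india.csv)
set.seed(42)
n <- 200
female <- rbinom(n, 1, 0.5)                       # 0/1 treatment indicator
water <- 15 + 9 * female + rnorm(n, sd = 30)      # outcome with a made-up effect of 9
sim <- data.frame(water = water, female = female)

# Slope from the simple linear regression
fit_sim <- lm(water ~ female, data = sim)
slope <- unname(coef(fit_sim)["female"])

# Difference in mean outcomes between treated and untreated units
diff_means <- mean(sim$water[sim$female == 1]) - mean(sim$water[sim$female == 0])

# The two quantities agree up to floating-point error
isTRUE(all.equal(slope, diff_means))
```

If female is coded 0/1 in india.csv, applying the same difference-in-means calculation to india should reproduce the slope reported above.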
Null hypothesis (H₀): β₁ = 0. Having a female politician has no effect on the number of new or repaired drinking water facilities, so the slope is zero.
Alternative hypothesis (H₁): β₁ ≠ 0. Having a female politician has an effect, positive or negative, on the number of new or repaired drinking water facilities, so the slope is not zero.
To determine whether the effect of having a female politician on the number of new or repaired drinking water facilities is statistically significant at the 5% level, we first specify the null and alternative hypotheses. The null hypothesis (H₀) is that there is no effect, which can be written as H₀: β₁ = 0. This means that the slope of the regression line is zero, indicating no relationship between the predictor variable (female) and the response variable (water). The alternative hypothesis (H₁) is that there is an effect, which can be written as H₁: β₁ ≠ 0, meaning the slope is not zero and the two variables are related. To test these hypotheses, we examine the p-value for the coefficient on female. If the p-value is less than 0.05, we reject the null hypothesis and conclude that the effect is statistically significant at the 5% level. Conversely, if the p-value is greater than or equal to 0.05, we fail to reject the null hypothesis and conclude that the effect is not statistically significant at that level.
# Fit the linear model
fit <- lm(water ~ female, data = india)
# Extract the coefficient and standard error for 'female'
beta_hat <- summary(fit)$coefficients["female", "Estimate"]
std_error <- summary(fit)$coefficients["female", "Std. Error"]
# Calculate the observed test statistic (z_obs)
z_obs <- beta_hat / std_error
# Display the result
z_obs
## [1] 2.343723
The observed test statistic, z_obs, is 2.343723, meaning the estimated coefficient on female is about 2.34 standard errors away from zero. This is the same quantity that summary(fit) reports in the "t value" column for female.
# Extract the p-value for the 'female' variable from the linear model summary
p_value <- summary(fit)$coefficients["female", "Pr(>|t|)"]
# Display the p-value
p_value
## [1] 0.01970398
# Check if the p-value is less than 0.05 and print the significance result
if (p_value < 0.05) {
  print("The effect is statistically significant at the 5% level.")
} else {
  print("The effect is not statistically significant at the 5% level.")
}
## [1] "The effect is statistically significant at the 5% level."
The p-value for the coefficient on female is 0.0197. Since this is below the 5% significance level (0.05), we reject the null hypothesis and conclude that the estimated effect of having a female politician on the number of new or repaired drinking water facilities is statistically significant at the 5% level. Statistical significance means only that an estimate this far from zero would be unlikely if the true effect were zero; it does not by itself say whether the effect is large in practical terms.
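The test statistic and p-value extracted from the model summary can also be rebuilt by hand. The sketch below is again on simulated data (the data frame `sim` and its numbers are hypothetical) and assumes the usual two-sided t-test that `summary()` applies to an `lm` fit.

```r
# Simulated stand-in for the real data (hypothetical values, not india.csv)
set.seed(42)
n <- 200
female <- rbinom(n, 1, 0.5)
water <- 15 + 9 * female + rnorm(n, sd = 30)
sim <- data.frame(water = water, female = female)
fit_sim <- lm(water ~ female, data = sim)
coefs <- summary(fit_sim)$coefficients

# Test statistic: estimate divided by its standard error
t_obs <- coefs["female", "Estimate"] / coefs["female", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 residual degrees of freedom
p_manual <- 2 * pt(-abs(t_obs), df = df.residual(fit_sim))

# Both hand-built quantities match the columns of the summary table
isTRUE(all.equal(t_obs, coefs["female", "t value"]))
isTRUE(all.equal(p_manual, coefs["female", "Pr(>|t|)"]))

# A 95% confidence interval gives the same decision: it excludes 0 exactly when p < 0.05
confint(fit_sim, "female", level = 0.95)
```

Reporting the confidence interval alongside the p-value shows the range of effect sizes compatible with the data, not only whether zero can be rejected.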
library(ISLR)
library(rpart)
library(rpart.plot)
data(Hitters)
Hitters <- na.omit(Hitters)
# Fit the regression tree
tree_model <- rpart(Salary ~ ., data = Hitters, method = "anova")
# Plot the regression tree
rpart.plot(tree_model, type = 3, fallen.leaves = TRUE, cex = 0.7)
The regression tree predicts player salary (Salary, recorded in thousands of dollars) from performance metrics, with career hits (CHits) as the first and most influential split. Players with fewer than 450 CHits are further split by Walks, CRBI, and AtBat, and end up with lower average salaries. Players with 450 or more CHits are divided by Walks, AtBat, RBI, PutOuts, and career runs (CRuns), and generally receive higher salary predictions. Each path in the tree ends in a leaf node that reports the average salary and the percentage of players in that group. The tree shows how a combination of strong performance statistics, particularly CHits, Walks, and RBI, is associated with substantially higher predicted salaries.
library(rpart)
tree_model <- rpart(Salary ~ ., data = Hitters, method = "anova", cp = 0.001)
# View cross-validation results
printcp(tree_model)
##
## Regression tree:
## rpart(formula = Salary ~ ., data = Hitters, method = "anova",
## cp = 0.001)
##
## Variables actually used in tree construction:
## [1] AtBat CAtBat CHits CHmRun CRBI CRuns CWalks Hits PutOuts
## [10] RBI Walks Years
##
## Root node error: 53319113/263 = 202734
##
## n= 263
##
##           CP nsplit rel error  xerror     xstd
## 1  0.3751526      0   1.00000 1.00371 0.138389
## 2  0.1202660      1   0.62485 0.66005 0.114887
## 3  0.0447760      2   0.50458 0.59845 0.099874
## 4  0.0395069      3   0.45981 0.57851 0.104502
## 5  0.0189058      4   0.42030 0.56694 0.106595
## 6  0.0156460      5   0.40139 0.58323 0.109591
## 7  0.0141210      7   0.37010 0.59194 0.109634
## 8  0.0140507      8   0.35598 0.59335 0.109662
## 9  0.0090608      9   0.34193 0.59587 0.111624
## 10 0.0087486     10   0.33287 0.60659 0.111611
## 11 0.0070271     11   0.32412 0.60827 0.111588
## 12 0.0061551     12   0.31709 0.61120 0.110778
## 13 0.0045893     13   0.31094 0.60333 0.107047
## 14 0.0034490     14   0.30635 0.60319 0.107115
## 15 0.0029108     15   0.30290 0.60540 0.107515
## 16 0.0028538     16   0.29999 0.60422 0.107503
## 17 0.0017889     17   0.29713 0.60352 0.107560
## 18 0.0010000     18   0.29535 0.59791 0.102855
# Plot the cross-validation results
plotcp(tree_model)
library(rpart.plot)
# Get the CP value with the lowest cross-validation error
best_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"]
# Prune the tree using the best CP
pruned_tree <- prune(tree_model, cp = best_cp)
# Plot the pruned tree
rpart.plot(pruned_tree, type = 3, fallen.leaves = TRUE, cex = 0.7)
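The pruned tree is a good fit in terms of both interpretability and cross-validation error. In the table above, the lowest cross-validated error (xerror = 0.56694) occurs at 4 splits (CP = 0.0189058), so pruning at that CP reduces the 18-split tree to a much smaller one while keeping the best cross-validated performance. The xerror values of the larger trees differ from this minimum by far less than one standard error (xstd is about 0.11), so the extra splits are not clearly helpful.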
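The pruning step picks the CP with the lowest xerror. A common alternative is the one-standard-error rule: choose the simplest tree whose xerror is within one xstd of the minimum. The sketch below applies that rule to the first six rows of the printcp() table; `pick_cp_1se` is a hypothetical helper written for this illustration and is not part of rpart.

```r
# Hypothetical helper: simplest tree whose xerror is within one SE of the minimum
pick_cp_1se <- function(cptable) {
  best <- which.min(cptable[, "xerror"])
  threshold <- cptable[best, "xerror"] + cptable[best, "xstd"]
  # Rows run from fewest to most splits, so the first row under the threshold is the simplest
  chosen <- which(cptable[, "xerror"] <= threshold)[1]
  cptable[chosen, ]
}

# First six rows copied from the printcp() output (the global minimum is in row 5)
cp_rows <- cbind(
  CP     = c(0.3751526, 0.1202660, 0.0447760, 0.0395069, 0.0189058, 0.0156460),
  nsplit = c(0, 1, 2, 3, 4, 5),
  xerror = c(1.00371, 0.66005, 0.59845, 0.57851, 0.56694, 0.58323),
  xstd   = c(0.138389, 0.114887, 0.099874, 0.104502, 0.106595, 0.109591)
)

choice <- pick_cp_1se(cp_rows)
choice  # selects the row with nsplit = 1 and CP = 0.120266
```

On these numbers the threshold is 0.56694 + 0.106595 = 0.673535, and the one-split tree (xerror = 0.66005) already falls under it. On a fitted model the same helper could be called as pick_cp_1se(tree_model$cptable), with the resulting CP passed to prune(). Because xerror comes from random cross-validation folds, calling set.seed() before rpart() is needed to make either choice reproducible.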
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
# Load dataset
dataset <- read.csv("~/Desktop/DSS/Datasets/india.csv")
# --- Classification Model ---
# Create classification target from 'water'
dataset$target <- as.factor(ifelse(dataset$water > mean(dataset$water, na.rm = TRUE), "High", "Low"))
# Fit classification random forest
rf_model_classification <- randomForest(target ~ . -water, data = dataset, ntree = 500, mtry = 3, importance = TRUE)
cat("Classification Model Summary:\n")
## Classification Model Summary:
print(rf_model_classification)
##
## Call:
## randomForest(formula = target ~ . - water, data = dataset, ntree = 500, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 30.75%
## Confusion matrix:
##      High Low class.error
## High   40  50   0.5555556
## Low    49 183   0.2112069
cat("\nClassification Model Variable Importance:\n")
##
## Classification Model Variable Importance:
classification_importance <- importance(rf_model_classification)
print(classification_importance)
##                 High       Low MeanDecreaseAccuracy MeanDecreaseGini
## village    14.933130 13.095361            17.137471         97.27998
## female      7.208754 -1.645557             2.584108          6.05522
## irrigation  4.994437  7.115440             8.238371         25.83182
varImpPlot(rf_model_classification, main = "Classification Variable Importance")
# --- Regression Model ---
dataset$water <- as.numeric(dataset$water)
# Fit regression random forest
rf_model_regression <- randomForest(water ~ . -target, data = dataset, ntree = 500, mtry = 3, importance = TRUE)
cat("\nRegression Model Summary:\n")
##
## Regression Model Summary:
print(rf_model_regression)
##
## Call:
## randomForest(formula = water ~ . - target, data = dataset, ntree = 500, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 774.4321
## % Var explained: 31.51
cat("\nRegression Model Variable Importance:\n")
##
## Regression Model Variable Importance:
regression_importance <- importance(rf_model_regression)
print(regression_importance)
##              %IncMSE IncNodePurity
## village    23.703881    202229.182
## female      5.271946      8445.525
## irrigation 27.750795    132405.650
varImpPlot(rf_model_regression, main = "Regression Variable Importance")
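Random forests with 500 trees and mtry = 3 were trained for both a classification task (water above or below its mean) and a regression task (water itself). The data contain only three predictors (village, female, and irrigation), so mtry = 3 considers every predictor at each split, which makes both models equivalent to bagging rather than a random forest with random feature subsetting. For classification, village is the most important variable on both MeanDecreaseAccuracy (17.14) and MeanDecreaseGini (97.28), followed by irrigation and then female. For regression, the two measures disagree on the top predictor: irrigation ranks first on %IncMSE (27.75 versus 23.70 for village), while village ranks first on IncNodePurity. In both models female is the least important predictor. Overall, village and irrigation carry most of the predictive power. The classification model has an out-of-bag (OOB) error rate of 30.75%, and the regression model explains 31.51% of the out-of-bag variance in water.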
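The error rates in the classification output can be recomputed from the printed confusion matrix, which also allows a comparison against a trivial baseline. The sketch below copies the four counts from the output above and assumes the usual randomForest layout, with rows as actual classes and columns as OOB predictions.

```r
# Confusion matrix copied from the printed OOB output (rows = actual, columns = predicted)
conf <- matrix(c(40,  50,
                 49, 183),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("High", "Low"),
                               predicted = c("High", "Low")))

n <- sum(conf)                                  # 322 villages in total

# Overall OOB error: share of off-diagonal (misclassified) cases
oob_error <- 1 - sum(diag(conf)) / n

# Per-class error: misclassified share within each actual class
class_error <- 1 - diag(conf) / rowSums(conf)

# Baseline: always predict the majority class ("Low")
baseline_error <- 1 - max(rowSums(conf)) / n

round(100 * c(oob = oob_error, baseline = baseline_error), 2)
round(class_error, 7)
```

The overall error works out to 99/322, or 30.75%, matching the printed OOB estimate, while always predicting "Low" would be wrong for 90/322, or 27.95%, of villages. On these counts the classifier does not beat the majority-class baseline on overall error, mainly because it misclassifies more than half of the "High" villages.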
library(ISLR)
library(randomForest)
# Load the dataset and remove rows with missing values
data(Hitters)
Hitters <- na.omit(Hitters)
# Fit the random forest model
rf_model <- randomForest(Salary ~ ., data = Hitters, ntree = 500, mtry = 3, importance = TRUE)
# Print model summary
print(rf_model)
##
## Call:
## randomForest(formula = Salary ~ ., data = Hitters, ntree = 500, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 78677.97
## % Var explained: 61.19
# Draw partial plot for "Years"
partialPlot(rf_model, pred.data = Hitters, x.var = "Years")
A random forest was fit to the Hitters dataset to predict player salaries, using 500 trees and 3 variables tried at each split. The model summary reports an out-of-bag mean of squared residuals of 78677.97 and 61.19% of the variance in Salary explained. The partial dependence plot for “Years” shows how the predicted salary changes with a player’s years in the major leagues after averaging over the observed values of all other predictors. It therefore describes the model's average prediction at each level of experience, not a causal effect of experience on earnings.
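The averaging behind a partial dependence curve can be written out by hand. The sketch below is not the Hitters analysis: it uses the built-in mtcars data with wt standing in for Years, an arbitrary seed and ntree, and a grid of observed values, whereas partialPlot() may use a different grid.

```r
library(randomForest)

# Stand-in model on a built-in dataset (mtcars), used only for illustration
set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# Grid of values for the predictor of interest
grid <- sort(unique(mtcars$wt))

# For each grid value: set wt to that value in every row, predict, and average
pd <- sapply(grid, function(v) {
  d <- mtcars
  d$wt <- v                      # overwrite the predictor; all other columns stay as observed
  mean(predict(rf, newdata = d))
})

# The partial dependence curve: average prediction as a function of wt
plot(grid, pd, type = "l", xlab = "wt", ylab = "Average predicted mpg")
```

Each point on the curve is an average over the whole dataset, which is why the plot reflects the model's behavior across realistic combinations of the other predictors and does not literally hold them fixed at one value.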