Your Name: Anjum Mehnaz Akumalla
UTA ID#: 1002170913

Problem Statement

# Installing packages
# install.packages("dplyr")
# install.packages("ggplot2")
# install.packages("corrplot")
# Loading required libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(datasets)
# Loading the data into new dataframe after removing unnecessary columns
df <- read.csv("C:/Users/aameh/OneDrive - UT Arlington/My Files/IPS/Assignment #4/boston.csv", header = TRUE)
df <- df %>% select(-ZN, -CHAS, -B, -INDUS, -LSTAT, -RM, -CRIM)
write.csv(df, file = "boston_new.csv", row.names = FALSE)
df1 <- read.csv("C:/Users/aameh/OneDrive - UT Arlington/My Files/IPS/Assignment #4/boston_new.csv", header = TRUE)

#*1. NOX: nitric oxides concentration - parts per 10 million - parts/10M*
#*2. AGE: proportion of owner-occupied units built prior to 1940*
#*3. DIS: weighted distances to five Boston employment centres*
#*4. RAD: index of accessibility to radial highways*
#*5. TAX: full-value property-tax rate per $10,000 - $/10k*
#*6. PTRATIO: pupil-teacher ratio by town*
#*7. MEDV: Median value of owner-occupied homes in $1000's - k$*
# Initial Data inspection
head(df1)

##     NOX  AGE    DIS RAD TAX PTRATIO MEDV
## 1 0.538 65.2 4.0900   1 296    15.3 24.0
## 2 0.469 78.9 4.9671   2 242    17.8 21.6
## 3 0.469 61.1 4.9671   2 242    17.8 34.7
## 4 0.458 45.8 6.0622   3 222    18.7 33.4
## 5 0.458 54.2 6.0622   3 222    18.7 36.2
## 6 0.458 58.7 6.0622   3 222    18.7 28.7

tail(df1)

##       NOX  AGE    DIS RAD TAX PTRATIO MEDV
## 501 0.585 79.7 2.4982   6 391    19.2 16.8
## 502 0.573 69.1 2.4786   1 273    21.0 22.4
## 503 0.573 76.7 2.2875   1 273    21.0 20.6
## 504 0.573 91.0 2.1675   1 273    21.0 23.9
## 505 0.573 89.3 2.3889   1 273    21.0 22.0
## 506 0.573 80.8 2.5050   1 273    21.0 11.9

str(df1)

## 'data.frame':    506 obs. of  7 variables:
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(df1)

##       NOX              AGE              DIS              RAD        
##  Min.   :0.3850   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:0.4490   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :0.5380   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :0.5547   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:0.6240   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :0.8710   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       TAX           PTRATIO           MEDV      
##  Min.   :187.0   Min.   :12.60   Min.   : 5.00  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:17.02  
##  Median :330.0   Median :19.05   Median :21.20  
##  Mean   :408.2   Mean   :18.46   Mean   :22.53  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.00   Max.   :50.00

# Data Cleaning and Pre-processing
# Missing values
any(is.na(df1))

## [1] FALSE

# Duplicates: check whether any row is an exact copy of an earlier row
any(duplicated(df1))

## [1] FALSE

# Removing duplicates
df1 <- df1[!duplicated(df1), ]

3. Identifying the shape of data

# Data Visualization
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.3.2

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(GGally)

## Warning: package 'GGally' was built under R version 4.3.2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.3.2

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(MASS)

## Warning: package 'MASS' was built under R version 4.3.2

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

boxplot(df1[["NOX"]], main="Nitric oxides concentration (NOX)", col="lightblue")

boxplot(df1[["AGE"]], main="Proportion of units built prior to 1940 (AGE)", col="cyan")

boxplot(df1[["DIS"]], main="Weighted distances to employment centres (DIS)", col="green")

boxplot(df1[["RAD"]], main="Index of accessibility to radial highways (RAD)", col="violet")

boxplot(df1[["TAX"]], main="Property-tax rate per $10,000 (TAX)", col="yellow")

boxplot(df1[["PTRATIO"]], main="Pupil-teacher ratio (PTRATIO)", col="orange")

boxplot(df1[["MEDV"]], main="Median home value in $1000's (MEDV)", col="pink")

# Function to remove outliers based on IQR
remove_outliers <- function(data_frame, threshold = 1.5) {
  # Identify outliers for each column
  outliers <- sapply(data_frame, function(col) {
    q <- quantile(col, c(0.25, 0.75), na.rm = TRUE)
    iqr <- IQR(col, na.rm = TRUE)
    lower_bound <- q[1] - threshold * iqr
    upper_bound <- q[2] + threshold * iqr
    col < lower_bound | col > upper_bound
  })

  # Remove rows with outliers
  data_frame[!apply(outliers, 1, any), , drop = FALSE]
}

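# Remove outliers from the data frame (kept separately for comparison)
df_no_outliers <- remove_outliers(df1)
# The analysis below uses the full data set (all 506 rows), not the filtered one
df <- as.data.frame(df1)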
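To see how much data the IQR rule would discard, the sketch below counts flagged values per column and the rows that would survive. It is only an illustration: it assumes `MASS::Boston` (with column names upper-cased) is an acceptable stand-in for `boston.csv`, and the names `boston`, `is_outlier` and `rows_kept` are introduced here for the example.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
cols <- c("NOX", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "MEDV")

# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] in each column
is_outlier <- sapply(boston[cols], function(col) {
  q <- quantile(col, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  col < q[1] - fence | col > q[2] + fence
})

# Number of flagged values per column
print(colSums(is_outlier))

# Rows with no flagged value in any column
rows_kept <- sum(!apply(is_outlier, 1, any))
cat("Rows kept:", rows_kept, "of", nrow(boston), "\n")
```

A large drop in rows would be a reason to prefer a transformation over deleting observations.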

Pair Plot

# Create a pairplot using GGally
ggpairs(df, title="Pair plot of the data")

SCATTER PLOT

scatter1 <- ggplot(df, aes(x=NOX, y=AGE)) + geom_point()
scatter1

scatter2 <- ggplot(df, aes(x=DIS, y=MEDV)) + geom_point()
scatter2

scatter3 <- ggplot(df, aes(x=NOX, y=PTRATIO)) + geom_point()
scatter3

Multiple Regression Analysis

model <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = df)

#Summary
summary(model)

## 
## Call:
## lm(formula = TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -129.78  -34.09  -11.27   28.55  364.59 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 173.96108   56.66840   3.070  0.00226 ** 
## NOX         199.57152   49.23844   4.053 5.86e-05 ***
## AGE          -0.02902    0.16683  -0.174  0.86198    
## DIS          -1.74242    2.42353  -0.719  0.47250    
## RAD          14.87723    0.46350  32.098  < 2e-16 ***
## PTRATIO       1.73782    1.73469   1.002  0.31692    
## MEDV         -1.86234    0.40468  -4.602 5.31e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.81 on 499 degrees of freedom
## Multiple R-squared:  0.8584, Adjusted R-squared:  0.8567 
## F-statistic:   504 on 6 and 499 DF,  p-value: < 2.2e-16

# Extract coefficients
coeff <- coef(model)

# Print coefficients
print(coeff)

##  (Intercept)          NOX          AGE          DIS          RAD      PTRATIO 
## 173.96108055 199.57152100  -0.02902002  -1.74241739  14.87723430   1.73782113 
##         MEDV 
##  -1.86234472

cat("\nEquation of the line:\n TAX = ", coeff[1], "+", coeff[2], "* NOX + ", coeff[3], "* AGE + ", coeff[4], "* DIS + ", coeff[5], "* RAD + ", coeff[6], "* PTRATIO + ", coeff[7], "* MEDV")

## 
## Equation of the line:
##  TAX =  173.9611 + 199.5715 * NOX +  -0.02902002 * AGE +  -1.742417 * DIS +  14.87723 * RAD +  1.737821 * PTRATIO +  -1.862345 * MEDV
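As a quick check of how the equation is used, the sketch below plugs the first row shown by `head(df1)` into the fitted equation by hand. The coefficient values are copied from the output above; the variable names (`x`, `predicted_tax`, `residual`) are introduced only for this illustration.

```r
# Coefficients copied from the printed model output
coeff <- c("(Intercept)" = 173.96108055, NOX = 199.57152100, AGE = -0.02902002,
           DIS = -1.74241739, RAD = 14.87723430, PTRATIO = 1.73782113,
           MEDV = -1.86234472)

# First row of head(df1); the leading 1 multiplies the intercept
x <- c(1, NOX = 0.538, AGE = 65.2, DIS = 4.09, RAD = 1, PTRATIO = 15.3, MEDV = 24.0)

predicted_tax <- sum(coeff * x)   # fitted TAX for this observation
observed_tax  <- 296              # TAX value in the same row
residual      <- observed_tax - predicted_tax

cat("Predicted TAX:", round(predicted_tax, 2), "\n")
cat("Residual:", round(residual, 2), "\n")
```

The same number should come out of `predict(model, newdata = df[1, ])` if the model object is available.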

# Extract more detailed information about coefficients
coeff_summary <- summary(model)$coeff
print(coeff_summary)

##                 Estimate Std. Error    t value      Pr(>|t|)
## (Intercept) 173.96108055 56.6684033  3.0698073  2.258631e-03
## NOX         199.57152100 49.2384411  4.0531649  5.859990e-05
## AGE          -0.02902002  0.1668300 -0.1739497  8.619755e-01
## DIS          -1.74241739  2.4235327 -0.7189577  4.725035e-01
## RAD          14.87723430  0.4634964 32.0978391 1.928481e-123
## PTRATIO       1.73782113  1.7346905  1.0018047  3.169235e-01
## MEDV         -1.86234472  0.4046834 -4.6019800  5.314599e-06

How do you interpret the intercept? (Watch out for the units in which it was measured.)

The estimated intercept is 173.96.

Significance:

The t-value for the intercept is 3.070, and the corresponding p-value is 0.00226, denoted by the double asterisks (**).
The small p-value suggests that the intercept is statistically significant.

Interpretation:

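If all the predictor variables (NOX, AGE, DIS, RAD, PTRATIO, MEDV) were zero, the estimated value of TAX would be 173.96, in the units of TAX: dollars of property tax per $10,000 of full property value.
In other words, when there are no nitric oxides (NOX), the proportion of owner-occupied units built prior to 1940 (AGE) is zero, the weighted distances to five Boston employment centers (DIS) are zero, the index of accessibility to radial highways (RAD) is zero, the pupil-teacher ratio by town (PTRATIO) is zero, and the median value of owner-occupied homes (MEDV) is zero, the estimated full-value property-tax rate (TAX) is $173.96 per $10,000. This is an extrapolation: in the data NOX is never below 0.385 and PTRATIO is never below 12.6, so no observation is close to having all predictors at zero, and the intercept mainly anchors the fitted equation rather than describing a realistic case.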

How do you interpret the slope? (Watch out for the units in which it was measured.)

The slopes (coefficients) for each predictor variable in a multiple regression model indicate the estimated change in the dependent variable for a one-unit change in the corresponding predictor variable, holding other variables constant. The slopes for each variable in the model are interpreted below:

1. NOX (Nitric Oxides Concentration):

- The estimated slope for NOX is 199.57.
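- Interpretation: Holding other variables constant, a one-unit increase in nitric oxides concentration (1 part per 10 million) is associated with an estimated increase of 199.57 in the full-value property-tax rate per $10,000. Because NOX only ranges from 0.385 to 0.871 in the data, a more realistic step is 0.1 parts per 10 million, which corresponds to an estimated increase of about 19.96.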

2. AGE (Proportion of Owner-Occupied Units Built Prior to 1940):

- The estimated slope for AGE is -0.02902.
- Interpretation: Holding other variables constant, a one-percentage-point increase in the share of owner-occupied units built prior to 1940 (AGE is recorded on a 0 to 100 scale) is associated with an estimated decrease of 0.02902 in the full-value property-tax rate per $10,000.

3. DIS (Weighted Distances to Five Boston Employment Centres):

- The estimated slope for DIS is -1.74242.
- Interpretation: Holding other variables constant, a one-unit increase in the weighted distances to five Boston employment centers is associated with an estimated decrease of 1.74242 in the full-value property-tax rate per $10,000.

4. RAD (Index of Accessibility to Radial Highways):

- The estimated slope for RAD is 14.87723.
- Interpretation: Holding other variables constant, a one-unit increase in the index of accessibility to radial highways is associated with an estimated increase of 14.87723 in the full-value property-tax rate per $10,000.

5. PTRATIO (Pupil-Teacher Ratio by Town):

- The estimated slope for PTRATIO is 1.73782.
- Interpretation: Holding other variables constant, a one-unit increase in the pupil-teacher ratio by town is associated with an estimated increase of 1.73782 in the full-value property-tax rate per $10,000.

6. MEDV (Median Value of Owner-Occupied Homes):

- The estimated slope for MEDV is -1.86234.
- Interpretation: Holding other variables constant, a one-unit increase in the median value of owner-occupied homes (in $1000's) is associated with an estimated decrease of 1.86234 in the full-value property-tax rate per $10,000.

These interpretations provide insights into the direction and magnitude of the relationship between each predictor variable and the response variable (TAX), while accounting for the other variables in the model.

6. Are the coefficients statistically significant?

The asterisks in the “Pr(>|t|)” column indicate the level of statistical significance:

***: p-value < 0.001
**: 0.001 ≤ p-value < 0.01
*: 0.01 ≤ p-value < 0.05

Here are the conclusions for each coefficient:

1. Intercept: The intercept is statistically significant (p-value = 0.00226), denoted by two asterisks (**).

2. NOX: The coefficient for NOX is statistically significant (p-value = 5.86e-05), denoted by three asterisks (***).

3. AGE, DIS, PTRATIO: These coefficients are not statistically significant (p-values > 0.05), indicated by no asterisks.

4. RAD, MEDV: The coefficients for RAD and MEDV are highly statistically significant (p-values < 0.001), denoted by three asterisks (***).

In summary, the intercept, NOX, RAD, and MEDV coefficients are statistically significant based on their p-values. The other coefficients (AGE, DIS, PTRATIO) are not statistically significant at the commonly used significance level of 0.05.

(a) What is the null and alternative hypothesis that you are testing?

In multiple regression analysis, each coefficient has its own hypothesis test, with a null hypothesis (Ho) and an alternative hypothesis (Ha). These hypotheses are formulated to assess the significance of each individual coefficient. The general form of these hypotheses is as follows:

Null Hypothesis (Ho):
- Ho asserts that the true population coefficient for that predictor variable is equal to zero.
Alternative Hypothesis (Ha):
- Ha asserts that the true population coefficient for that predictor variable is not equal to zero.

For this model, which relates the dependent variable (TAX) to six predictor variables, the null and alternative hypotheses for each coefficient are as follows:

1. Intercept: - Ho: βIntercept = 0 - Ha: βIntercept ≠ 0

2. NOX (Nitric Oxides Concentration): - Ho: βNOX = 0 - Ha: βNOX ≠ 0

3. AGE (Proportion of Owner-Occupied Units Built Prior to 1940): - Ho: βAGE = 0 - Ha: βAGE ≠ 0

4. DIS (Weighted Distances to Five Boston Employment Centres): - Ho: βDIS = 0 - Ha: βDIS ≠ 0

5. RAD (Index of Accessibility to Radial Highways): - Ho: βRAD = 0 - Ha: βRAD ≠ 0

6. PTRATIO (Pupil-Teacher Ratio by Town): - Ho: βPTRATIO = 0 - Ha: βPTRATIO ≠ 0

7. MEDV (Median Value of Owner-Occupied Homes): - Ho: βMEDV = 0 - Ha: βMEDV ≠ 0

Each of these hypotheses is testing whether the corresponding predictor variable has a statistically significant effect on the dependent variable (TAX). The alternative hypothesis (Ha) suggests that the variable contributes to explaining the variation in TAX, while the null hypothesis (Ho) posits that the variable does not have a significant effect.
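The test behind each pair of hypotheses can be reproduced by hand. The sketch below does this for NOX, using the estimate and standard error printed in the coefficient table and the 499 residual degrees of freedom from the summary; the variable names are introduced only for this illustration.

```r
estimate  <- 199.57152100   # NOX estimate from the coefficient table
std_error <- 49.2384411     # its standard error
df_resid  <- 499            # residual degrees of freedom (506 rows - 7 coefficients)

# t statistic under Ho: coefficient = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

cat("t value:", round(t_value, 3), "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

Both values should agree with the `t value` and `Pr(>|t|)` columns for NOX in the summary.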

(b) What are your conclusions, and why?

Based on the statistical significance of the coefficients in the multiple regression model, we can draw the following conclusions:

1. Intercept:

- **Conclusion:** The intercept is statistically significant (p-value = 0.00226), indicating that the estimated value of the dependent variable (TAX) is significantly different from zero when all predictor variables are zero.
- **Reason:** The intercept represents the estimated value of TAX when all predictor variables are zero. The statistical significance suggests that this estimate is different from zero.

2. NOX (Nitric Oxides Concentration):

- **Conclusion:** The coefficient for NOX is statistically significant (p-value = 5.86e-05), suggesting that there is strong evidence that nitric oxides concentration has a significant effect on the full-value property-tax rate (TAX).
- **Reason:** A low p-value indicates that the estimated effect of NOX on TAX is unlikely to be zero. The coefficient is positive, suggesting that an increase in nitric oxides concentration is associated with an increase in the tax rate.

3. AGE (Proportion of Owner-Occupied Units Built Prior to 1940), DIS (Weighted Distances to Five Boston Employment Centres), PTRATIO (Pupil-Teacher Ratio by Town):

- **Conclusion:** The coefficients for AGE, DIS, and PTRATIO are not statistically significant (p-values > 0.05), suggesting that there is not enough evidence to conclude that these variables have a significant effect on TAX.
- **Reason:** The higher p-values indicate that the estimated effects of these variables on TAX could be consistent with a zero effect. The evidence for their impact is not strong.

4. RAD (Index of Accessibility to Radial Highways), MEDV (Median Value of Owner-Occupied Homes):

- **Conclusion:** The coefficients for RAD and MEDV are highly statistically significant (p-values < 0.001), indicating strong evidence of a significant effect on TAX.
- **Reason:** The low p-values suggest that the estimated effects of RAD and MEDV on TAX are unlikely to be zero. The positive coefficient for RAD suggests that increased accessibility to radial highways is associated with a higher tax rate, while the negative coefficient for MEDV indicates that higher median home values are associated with a lower tax rate.

In summary, the statistical significance of the coefficients provides evidence for the presence or absence of significant relationships between each predictor variable and the dependent variable (TAX). These conclusions help in understanding the factors that influence the tax rate in the given context.

7. What is the variance of the model?

The variance of the model, often referred to as the residual variance, is the estimated variance of the error term. Its square root, the residual standard error, measures the typical deviation of the observed values from the predicted values and is reported in the model summary output.

The "Residual standard error" is the estimate of the standard deviation of the residuals (the differences between observed and predicted values). In this case, the residual standard error is approximately 63.81 on 499 degrees of freedom.

The residual standard error is expressed in the same units as the dependent variable (TAX), so 63.81 is in the units of the full-value property-tax rate per $10,000. The variance is expressed in the square of those units.

Variance of the model = (Residual standard error)^2 = (63.81)^2 ≈ 4,071.72
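The same quantity can be computed directly from a fitted model instead of squaring the rounded value. This is a sketch under the assumption that `MASS::Boston` (column names upper-cased) stands in for `boston.csv`; `fit`, `rse` and `resid_var` are names introduced for the example.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
fit <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# Residual standard error as reported by summary()
rse <- sigma(fit)

# Residual variance: sum of squared residuals over residual degrees of freedom
resid_var <- sum(residuals(fit)^2) / df.residual(fit)

cat("Residual standard error:", rse, "\n")
cat("Residual variance:", resid_var, "\n")
```

The two are linked by `resid_var == rse^2` up to floating-point error.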

8. Plot the regression fitted line on the scatterplot.

ggplot(df, aes(x = TAX, y = NOX)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "NOX")

## `geom_smooth()` using formula = 'y ~ x'

ggplot(df, aes(x = TAX, y = RAD)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "RAD")

## `geom_smooth()` using formula = 'y ~ x'

9. Model assumptions & 10. What are the model assumptions? & 11. How do you test for them? Do they hold?

# Data Visualization
library(ggplot2)
library(gridExtra)

# Scatter plot of NOX vs. TAX
ggplot1 <- ggplot(df, aes(x = TAX, y = NOX)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "NOX")
# Scatter plot of AGE vs. TAX
ggplot2 <- ggplot(df, aes(x = TAX, y = AGE)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "AGE")
# Scatter plot of DIS vs. TAX
ggplot3 <- ggplot(df, aes(x = TAX, y = DIS)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "DIS")
# Scatter plot of RAD vs. TAX
ggplot4 <- ggplot(df, aes(x = TAX, y = RAD)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "RAD")
# Scatter plot of PTRATIO vs. TAX
ggplot5 <- ggplot(df, aes(x = TAX, y = PTRATIO)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "PTRATIO")
# Scatter plot of MEDV vs. TAX
ggplot6 <- ggplot(df, aes(x = TAX, y = MEDV)) +
  geom_point(color = "blue") +          # Scatterplot points
  geom_smooth(method = "lm", col = "red") +  # Regression line
  labs(title = "Scatterplot with Regression Line",
       x = "TAX", y = "MEDV")
ggplot1

## `geom_smooth()` using formula = 'y ~ x'

ggplot2

## `geom_smooth()` using formula = 'y ~ x'

ggplot3

## `geom_smooth()` using formula = 'y ~ x'

ggplot4

## `geom_smooth()` using formula = 'y ~ x'

ggplot5

## `geom_smooth()` using formula = 'y ~ x'

ggplot6

## `geom_smooth()` using formula = 'y ~ x'

Based on the output of the linear regression model, here are the model assumptions and diagnostic checks:

1. Linearity:

Interpretation:

The linearity assumption states that a change in an independent variable is associated with a constant change in the dependent variable, regardless of the level of that variable.

Conclusion: Linearity does not hold

In the scatter plots of the dependent variable (TAX) against each independent variable (NOX, AGE, DIS, RAD, PTRATIO, MEDV), the points do not follow the fitted straight lines closely, so linearity does not hold between the dependent and the independent variables. Partial regression plots can also be considered to check each predictor while holding the others constant.

2. Normality of Residuals:

Examining a histogram and Q-Q plot of residuals:

# Extract residuals from the model
residuals <- residuals(model)

# Create a histogram
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals", col = "lightblue", border = "black")

# Create a Q-Q plot
qqnorm(residuals)
qqline(residuals, col = 2)

1. Normal Q-Q Plot:

Interpretation:

A linear Q-Q plot indicates that the distribution of residuals is close to normal. Points lying roughly along a straight line suggest that the residuals follow a normal distribution. Deviations from the line at the extremes might indicate departures from normality.

Conclusion:

The points in the normal Q-Q plot lie close to a straight line over most of their range, which suggests that the assumption of normality for the residuals is reasonably met.

2. Histogram of Residuals:

Interpretation:

A right-skewed histogram means that there is a concentration of residuals on the left side (lower values) with a tail extending towards higher values. The right skewness indicates that there are some positive outliers or extreme positive residuals.

Conclusion:

While the histogram is right-skewed, the Q-Q plot suggests that the majority of residuals follow a normal distribution. The skewness in the histogram could be driven by a few extreme positive residuals.

Conclusion: Normality partially holds for the model, because a number of outliers pull the residual distribution away from normal.
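A formal test can complement the visual check. The sketch below applies the Shapiro-Wilk test from base R to the residuals; it assumes `MASS::Boston` (column names upper-cased) as a stand-in for `boston.csv`, and it is a suggestion rather than part of the original analysis.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
fit <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# Shapiro-Wilk test: Ho is that the residuals come from a normal distribution
sw <- shapiro.test(residuals(fit))

cat("W statistic:", sw$statistic, "\n")
cat("p-value:", sw$p.value, "\n")
```

A p-value below 0.05 would argue against normal residuals, although with several hundred observations the test can flag even mild departures.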

3. Homoscedasticity (Constant Variance):

Extract residuals and fitted values from the model:

# Extract residuals and fitted values from the model
residuals <- residuals(model)
fitted_values <- fitted(model)

# Create a plot of residuals vs. fitted values
plot(fitted_values, residuals,
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values",
     ylab = "Residuals",
     col = "blue",
     pch = 16)


# Add a horizontal line at y = 0
abline(h = 0, col = "red", lty = 1)

1. Constant Spread:

Look for a consistent vertical spread of points along the horizontal axis. In a homoscedastic scenario, you would expect the spread of residuals to be relatively uniform across all levels of the fitted values.

2. No Clear Patterns:

Check for any systematic patterns or trends in the plot. If there is a cone shape or a funnel-like pattern, it could indicate heteroscedasticity (varying spread).

3. Residuals Should Be Randomly Scattered:

The points on the plot should be randomly scattered, indicating that the variability of the residuals does not depend on the level of the predicted values.

Conclusion: Judged against the criteria above, the residual plot shows that Constant Variance does not hold.
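The visual judgement can be backed by a test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test with base R only: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. It assumes `MASS::Boston` (column names upper-cased) as a stand-in for `boston.csv`; the test was not part of the original analysis.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
fit <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# Auxiliary regression of squared residuals on the same predictors
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

n <- nrow(boston)   # number of observations
k <- 6              # number of predictors in the auxiliary regression

# Studentized Breusch-Pagan statistic and its chi-squared p-value
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = k, lower.tail = FALSE)

cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")
```

A small p-value would support the conclusion that the residual variance is not constant.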

4. Independence of Residuals:

# Durbin-Watson test on the residuals extracted above
# (when given a numeric vector, the function returns only the test statistic)
durbinWatsonTest(residuals)

## [1] 0.4336651

The durbinWatsonTest function from the car package performs the Durbin-Watson test on the residuals of the regression model. The test result will include the test statistic.

Interpretation:

The Durbin-Watson statistic tests the null hypothesis that the residuals are not autocorrelated. The test statistic ranges from 0 to 4.
A test statistic close to 2 suggests no autocorrelation. A test statistic well below 2 indicates positive autocorrelation, and one well above 2 indicates negative autocorrelation. When a p-value is reported and it is less than the significance level (commonly 0.05), the null hypothesis of no autocorrelation is rejected. Here only the statistic was returned, so the conclusion rests on its distance from 2.

Durbin-Watson Test Statistic (d): 0.43.
The test statistic (0.43) is far below 2. This suggests the presence of positive autocorrelation in the residuals.

Conclusion: Independence does not hold.

The low Durbin-Watson test statistic may indicate a systematic pattern or correlation among the residuals, suggesting that adjacent residuals are not independent.
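The statistic itself is simple to reproduce: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it with base R, assuming `MASS::Boston` (column names upper-cased) as a stand-in for `boston.csv`. Note that the value depends on the order of the rows, so it is only meaningful when that order carries information (for example, neighbouring areas listed together).

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
fit <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2)
e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)

cat("Durbin-Watson statistic:", dw, "\n")
```

Passing the fitted model object rather than the residual vector to `durbinWatsonTest()` from the car package should additionally report a p-value.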

Final Conclusion for Model Assumptions:

1. Normality partially holds for the model

2. Linearity does not hold.

3. Constant Variance does not hold.

4. Independence does not hold.

12. Do you see outliers?

The histogram and the normal Q-Q plot both show that outliers exist in the model: the distribution of the residuals is right-skewed, and in the Q-Q plot the outliers appear at the extreme right of the graph.
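Outliers can also be listed numerically instead of being read off the plots. The sketch below flags observations by standardized residual and by Cook's distance using base R. The cut-offs (an absolute standardized residual above 3, and a Cook's distance above 4/n) are common rules of thumb assumed here, not values taken from the analysis, and `MASS::Boston` (column names upper-cased) is assumed as a stand-in for `boston.csv`.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
fit <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# Standardized residuals and Cook's distances for every observation
std_res <- rstandard(fit)
cooks   <- cooks.distance(fit)

# Rule-of-thumb cut-offs (assumptions, adjust as needed)
flag_res  <- which(abs(std_res) > 3)
flag_cook <- which(cooks > 4 / nrow(boston))

cat("Large standardized residuals:", length(flag_res), "\n")
cat("Influential by Cook's distance:", length(flag_cook), "\n")
```

Rows flagged by both criteria are the first candidates to inspect before deciding whether to remove or transform anything.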

13. What does the Box-Cox transformation suggest you do?

# Use the boxcox() function on the regression formula (TAX is strictly positive, as Box-Cox requires)
result <- MASS::boxcox(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = df)

# Access the optimal lambda from the result
optimal_lambda <- result$x[which.max(result$y)]

cat("Optimal Lambda:", optimal_lambda, "\n")

## Optimal Lambda: 1.030303

# Apply the Box-Cox transformation to 'TAX'
transformed_TAX <- if (optimal_lambda == 0) {
  log(df1$TAX)  # lambda = 0 corresponds to the log transformation
} else {
  ((df1$TAX)^optimal_lambda - 1) / optimal_lambda
}

model_transformed <- lm(transformed_TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = df)

#Summary
summary(model_transformed)

## 
## Call:
## lm(formula = transformed_TAX ~ NOX + AGE + DIS + RAD + PTRATIO + 
##     MEDV, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.63  -40.68  -13.37   33.64  440.00 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 193.63682   67.82755   2.855  0.00449 ** 
## NOX         238.03223   58.93448   4.039 6.22e-05 ***
## AGE          -0.03495    0.19968  -0.175  0.86112    
## DIS          -2.15035    2.90078  -0.741  0.45886    
## RAD          17.93650    0.55477  32.332  < 2e-16 ***
## PTRATIO       2.13312    2.07629   1.027  0.30474    
## MEDV         -2.22454    0.48437  -4.593 5.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.37 on 499 degrees of freedom
## Multiple R-squared:  0.8598, Adjusted R-squared:  0.8582 
## F-statistic: 510.2 on 6 and 499 DF,  p-value: < 2.2e-16
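The optimal lambda of about 1.03 is very close to 1, and lambda = 1 leaves the response unchanged apart from a shift of one unit, so the result would usually be read as a suggestion not to transform TAX. The refitted model agrees: its R-squared (0.8598) is almost identical to the original (0.8584). One way to check whether 1 is a plausible value is to see if it lies inside the approximate 95% confidence interval for lambda, sketched below. The finer lambda grid and the use of `MASS::Boston` (column names upper-cased) as a stand-in for `boston.csv` are assumptions of this example.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))

# Profile log-likelihood over a fine grid of lambda values, without plotting
bc <- MASS::boxcox(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston,
                   lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood is within
# qchisq(0.95, 1) / 2 of the maximum
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y > cutoff])

cat("Lambda estimate:", lambda_hat, "\n")
cat("Approximate 95% CI:", ci[1], "to", ci[2], "\n")
cat("Contains 1:", ci[1] <= 1 && 1 <= ci[2], "\n")
```

If the interval contains 1, the data give no clear reason to transform the response.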

14. What is the correlation between the independent and dependent variable?

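The Durbin-Watson statistic reported above (d = 0.43) measures autocorrelation among the residuals, not the correlation between the independent variables and the dependent variable. Because it is far below 2, it points to positive autocorrelation in the residuals rather than supporting the null hypothesis that the residuals are not autocorrelated.
For the relationship between TAX and the predictors taken together, the multiple correlation coefficient is the square root of the Multiple R-squared from the model summary: R = sqrt(0.8584) ≈ 0.93, which indicates a strong positive correlation between the observed and fitted values of TAX.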
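The correlation between each independent variable and TAX can be obtained directly with `cor()`. The sketch below computes the pairwise Pearson correlations and the multiple correlation, assuming `MASS::Boston` (column names upper-cased) as a stand-in for `boston.csv`; no values are quoted here because they were not part of the original output.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))
predictors <- c("NOX", "AGE", "DIS", "RAD", "PTRATIO", "MEDV")

# Pearson correlation of each predictor with the dependent variable TAX
cor_with_tax <- sapply(boston[predictors], cor, y = boston$TAX)
print(sort(cor_with_tax, decreasing = TRUE))

# Multiple correlation: correlation between observed and fitted TAX
fit <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)
multiple_r <- cor(boston$TAX, fitted(fit))
cat("Multiple R:", multiple_r, "\n")
```

The square of `multiple_r` equals the Multiple R-squared of the model.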

15. What is the value of the coefficient of determination and how do you interpret it?

The coefficient of determination, often denoted as R^2, is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model.

Multiple R-squared ( R^2 ): 0.8584
Adjusted R-squared: 0.8567

Interpretation:

R^2 (Multiple R-squared):
- The R^2 value indicates the proportion of the variance in the dependent variable (TAX) that is explained by the independent variables (NOX, AGE, DIS, RAD, PTRATIO, MEDV) in the regression model.
- In this model, approximately 85.84% of the variance in the ‘TAX’ variable is explained by the independent variables.
Adjusted R-squared:
- The adjusted R^2 value takes into account the number of independent variables in the model and penalizes the R^2 value if additional variables do not significantly contribute to explaining the variance.
- In this case, the adjusted R^2 (0.8567) is only slightly lower than the R^2 (0.8584), so the penalty for using six independent variables is small, even though some of them (AGE, DIS, PTRATIO) add little explanatory power to the model.

Conclusion:

- Adjusted R^2 is often used when comparing models with different numbers of independent variables. If it is significantly lower than R^2, it may suggest that some variables are not contributing meaningfully to the model.

Note: R^2 alone does not provide information about the causal relationship between variables.  It is crucial to consider other factors like statistical significance, model assumptions, and the overall context of the analysis.
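The link between the two values can be verified with the usual adjustment formula, 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below uses the figures from the model summary (506 observations, 6 predictors); the variable names are introduced only for this check.

```r
r2 <- 0.8584   # Multiple R-squared from the summary
n  <- 506      # number of observations
p  <- 6        # number of predictors (excluding the intercept)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
```

The result should match the Adjusted R-squared printed in the summary.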

16. Do you have any other suggestions to improve your model?

Outlier Handling:

Outliers can have a significant impact on the model. Depending on the nature of the data, they can be removed, transformed, or handled in some other way.

Feature Scaling:

We can standardize or normalize the numerical features so that they are on comparable scales and their coefficients can be compared directly. This is especially useful when the scales of different features vary significantly.

Variable Transformation:

The analysis shows that assumptions such as linearity and normality do not hold for this dataset. In this situation we can transform the independent and dependent variables, for example with Box-Cox transformations, which may improve the model's predictive accuracy.

Non-linear relationships:

A few variables show a non-linear relationship with the dependent variable (TAX), which suggests using non-linear regression techniques, for example by adding polynomial terms.
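As an illustration of the polynomial idea, the sketch below adds a quadratic term for one predictor and compares the two nested models. Choosing NOX for the quadratic term is an arbitrary assumption made for the example, as is the use of `MASS::Boston` (column names upper-cased) in place of `boston.csv`.

```r
# Stand-in data (assumption): MASS::Boston with upper-cased column names
boston <- MASS::Boston
names(boston) <- toupper(names(boston))

# Original linear specification
fit_linear <- lm(TAX ~ NOX + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# Same model with a second-degree polynomial in NOX (hypothetical choice)
fit_quad <- lm(TAX ~ poly(NOX, 2) + AGE + DIS + RAD + PTRATIO + MEDV, data = boston)

# F-test for the added quadratic term, plus AIC for both models
print(anova(fit_linear, fit_quad))
print(c(linear = AIC(fit_linear), quadratic = AIC(fit_quad)))
```

A significant F-test together with a lower AIC would favour keeping the quadratic term. The same comparison can be repeated for the other predictors.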

Assignment 04

11/30/2023
