library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- read.csv("C:\\Users\\Krishna\\Downloads\\productivity+prediction+of+garment+employees\\garments_worker_productivity.csv")
# This is the one I used in the previous lab
model <- lm(actual_productivity ~ smv, data = data)
summary(model)
## 
## Call:
## lm(formula = actual_productivity ~ smv, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52095 -0.08825  0.04300  0.11334  0.37991 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7644124  0.0085220  89.699  < 2e-16 ***
## smv         -0.0019467  0.0004578  -4.252 2.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1733 on 1195 degrees of freedom
## Multiple R-squared:  0.01491,    Adjusted R-squared:  0.01408 
## F-statistic: 18.08 on 1 and 1195 DF,  p-value: 2.281e-05
# Expand the regression model with additional predictors
expanded_model <- lm(actual_productivity ~ smv + department + no_of_workers, data = data)
summary(expanded_model)
## 
## Call:
## lm(formula = actual_productivity ~ smv + department + no_of_workers, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53694 -0.09234  0.04605  0.10418  0.40693 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.7057471  0.0127218  55.475  < 2e-16 ***
## smv                  -0.0061760  0.0011070  -5.579 2.99e-08 ***
## departmentfinishing   0.0592290  0.0151505   3.909 9.78e-05 ***
## departmentsweing     -0.0505433  0.0301982  -1.674   0.0944 .  
## no_of_workers         0.0040112  0.0007726   5.192 2.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1704 on 1192 degrees of freedom
## Multiple R-squared:  0.04989,    Adjusted R-squared:  0.0467 
## F-statistic: 15.65 on 4 and 1192 DF,  p-value: 1.744e-12
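The coefficient table above lists a `departmentfinishing` row next to `departmentsweing`, which implies that `department` holds three distinct labels. One possible explanation, which is an assumption and not something the output confirms, is that some labels differ only by stray whitespace (for example `"finishing "` versus `"finishing"`). A minimal sketch for checking and normalizing labels, using a small made-up vector (`dept`) and not the lab's data:

```r
# Hypothetical toy labels; the real column may look different.
dept <- c("sweing", "finishing ", "finishing", "sweing", "finishing ")

# Count each distinct label; near-duplicates show up as separate entries.
raw_counts <- table(dept)
print(raw_counts)

# Strip leading/trailing whitespace, then convert to a factor.
dept_clean <- factor(trimws(dept))
clean_counts <- table(dept_clean)
print(clean_counts)
```

If the real column behaves the same way, cleaning it before calling `lm()` would leave one department coefficient instead of two.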
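# Note: department is a text column, so as.numeric() cannot parse its labels
# and returns NA for them (hence the warning and the NA entries in the matrix below)
data$department <- as.numeric(data$department)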
## Warning: NAs introduced by coercion
data$no_of_workers <- as.numeric(data$no_of_workers)
# Check for multicollinearity
cor_matrix <- cor(data[, c("smv", "department", "no_of_workers")])
print(cor_matrix)
##                     smv department no_of_workers
## smv           1.0000000         NA     0.9121763
## department           NA          1            NA
## no_of_workers 0.9121763         NA     1.0000000
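Because `department` is categorical, a correlation matrix can only cover the numeric predictors. A minimal sketch of an alternative check follows, run on hypothetical toy data (the `toy` data frame and the helper `vif_one` are illustrative names, not part of the lab). It computes the correlation of the numeric columns and a hand-rolled variance inflation factor, 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others:

```r
set.seed(1)
n <- 200
# Hypothetical toy data standing in for the garment data set.
toy <- data.frame(
  department = factor(sample(c("finishing", "sweing"), n, replace = TRUE))
)
toy$no_of_workers <- ifelse(toy$department == "sweing", 55, 10) + rnorm(n, sd = 5)
toy$smv <- 0.4 * toy$no_of_workers + rnorm(n, sd = 2)

# Correlation only makes sense for numeric columns, so leave the factor out.
num_cor <- cor(toy[, c("smv", "no_of_workers")])
print(num_cor)

# Variance inflation factor for one predictor: regress it on the other
# predictors (lm dummy-codes the factor) and compute 1 / (1 - R^2).
vif_one <- function(target, others, df) {
  fit <- lm(reformulate(others, response = target), data = df)
  1 / (1 - summary(fit)$r.squared)
}

vif_smv <- vif_one("smv", c("department", "no_of_workers"), toy)
vif_workers <- vif_one("no_of_workers", c("department", "smv"), toy)
print(c(smv = vif_smv, no_of_workers = vif_workers))
```

Values well above roughly 5 to 10 are often read as a sign of problematic collinearity, though that cut-off is a convention and not a strict rule.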
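# Caution: rows whose department became NA above are dropped here.
# expanded_model was fitted before this step, so the plots below still use the full data.
data <- na.omit(data)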

# Diagnostic plots for evaluating the expanded model
par(mfrow = c(2, 3))

# Residuals vs. Fitted Values Plot
plot(expanded_model, which = 1, main = "Residuals vs. Fitted Values")

# Normal Q-Q Plot
plot(expanded_model, which = 2, main = "Normal Q-Q Plot")

# Scale-Location Plot
plot(expanded_model, which = 3, main = "Scale-Location Plot")

# Residuals vs. Leverage Plot
plot(expanded_model, which = 5, main = "Residuals vs. Leverage")

# Cook's Distance Plot
plot(expanded_model, which = 4, main = "Cook's Distance Plot")

1) Residuals vs Fitted Plot:

Indications of issues: A curved trend in the points suggests non-linearity; a spread that widens or narrows across the fitted values (a funnel shape) suggests heteroscedasticity.

Support for assumptions: Ideally, the points scatter randomly around the horizontal line at 0, with roughly constant spread across the range of fitted values. This supports the assumptions of linearity and homoscedasticity.

2) Normal Q-Q Plot:

Indications of issues: Points that depart from the diagonal line, especially in the tails, suggest that the residuals are not normally distributed.

Support for assumptions: Ideally, the points fall approximately along the diagonal line, which supports the assumption of normally distributed residuals.

3) Scale-Location Plot (Square Root of Absolute Standardized Residuals vs Fitted Values):

Indications of issues: A trend in the smoothed line, or an uneven spread of points across the fitted values, suggests heteroscedasticity.

Support for assumptions: Ideally, the smoothed line is roughly horizontal and the points are evenly spread around it, indicating constant variance of the residuals across the range of fitted values. This supports the assumption of homoscedasticity.

4) Residuals vs Leverage Plot:

Indications of issues: Points that combine high leverage with a large standardized residual may be influential observations, particularly those lying beyond the dashed Cook’s distance contours.

Support for assumptions: Ideally, all points lie inside the dashed Cook’s distance contours, indicating that no single observation substantially affects the regression coefficients.

5) Cook’s Distance Plot:

Indications of issues: Observations whose Cook’s distance stands well above the rest are potentially influential and worth inspecting individually.

Support for assumptions: Ideally, all Cook’s distances are small and of similar size, indicating that no observation has an outsized effect on the regression coefficients.
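The visual checks for normality and constant variance can be complemented with simple numeric tests. The sketch below is run on hypothetical toy data whose noise grows with `x` by construction, so it is not a statement about the garment model. It applies `shapiro.test()` to the residuals and runs an auxiliary regression of squared residuals on fitted values, in the spirit of a Breusch-Pagan test (the exact variant used here is an assumption chosen for simplicity):

```r
set.seed(42)
n <- 300
x <- runif(n, 1, 10)
# Hypothetical toy response whose noise grows with x (heteroscedastic by design).
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test (a small p-value suggests non-normality).
sw <- shapiro.test(res)

# Constant variance: regress squared residuals on the fitted values.
# n * R^2 is compared with a chi-squared distribution on 1 degree of freedom.
aux <- lm(I(res^2) ~ fitted(fit))
lm_stat <- n * summary(aux)$r.squared
p_het <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

print(sw$p.value)
print(p_het)
```

For the lab's model, the same steps would use `expanded_model` in place of `fit`. With a sample of roughly 1,200 rows, even mild departures can produce small p-values, so the plots remain the primary guide.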

INSIGHTS

  1. Residuals vs. Fitted Values Plot:

    This plot helps assess the assumption of homoscedasticity, i.e., constant variance of the residuals across the range of fitted values, and also reveals non-linearity.

    Ideally, the plot should show a random scatter of points around the horizontal line at y = 0. A noticeable pattern (e.g., a funnel shape) suggests heteroscedasticity, meaning the variance of the model’s errors changes systematically with the fitted values.

  2. Normal Q-Q Plot:

    This plot assesses the normality of residuals by comparing their distribution to a theoretical normal distribution.

    Ideally, the points on the Q-Q plot should fall close to the diagonal line. Deviations from the line indicate departures from normality.

  3. Scale-Location Plot:

    Also known as the spread-location plot, it examines whether the spread of the residuals is consistent across the fitted values.

    The plot shows the square root of the absolute standardized residuals against the fitted values.

    Ideally, the points should be randomly scattered around a horizontal line, indicating consistent spread of residuals.

  4. Residuals vs. Leverage Plot:

    This plot helps identify influential observations, i.e., data points that combine high leverage with a large residual.

    Leverage measures how unusual an observation’s predictor values are; high-leverage observations can have a disproportionate impact on the regression coefficients.

    Points outside the dashed Cook’s distance contours are likely influential and should be investigated further.

  5. Cook’s Distance Plot:

    Cook’s distance measures the influence of each observation on the regression coefficients.

    As a common rule of thumb, observations with Cook’s distance greater than 1 are considered influential; stricter cut-offs such as 4/n are also used.

    This plot helps identify influential data points that may significantly affect the regression model’s coefficients.
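The leverage and Cook’s distance values behind the last two plots can also be extracted and screened directly. The sketch below uses hypothetical toy data with one deliberately planted extreme point; the cut-offs (`2 * p / m` for leverage, `1` and `4 / m` for Cook’s distance) are common conventions assumed here for illustration, not limits taken from the lab:

```r
set.seed(7)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
# Append one hypothetical point that is far out in x and off the trend.
x <- c(x, 8)
y <- c(y, -5)
fit <- lm(y ~ x)

lev <- hatvalues(fit)        # leverage of each observation
cd  <- cooks.distance(fit)   # Cook's distance of each observation
p   <- length(coef(fit))     # number of estimated coefficients
m   <- length(cd)            # number of observations

# Rule-of-thumb cut-offs; these are conventions, not hard limits.
high_leverage <- which(lev > 2 * p / m)
influential   <- which(cd > 1)
worth_a_look  <- which(cd > 4 / m)

print(high_leverage)
print(influential)
```

For the lab's model, the same calls would take `expanded_model` in place of `fit`; flagged rows are candidates for inspection, not automatic removal.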