Eduardo Ogawa Cardoso (IDUSP 10864890) - M.Sc Student, Marketing
Maria Carolina Dias Cavalcante (IDUSP 12436263) - P.hD Candidate, Marketing
eogawac@usp.br; mcarolinadias@usp.br
The objective of linear regression is to predict the value of a dependent variable Y based on one or more predictor variables X. The objective is to build a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, such that we can use this formula to estimate the value of the response variable Y when just the values of the predictor variables (Xs) are known.
The objective of linear regression is to describe a continuous variable Y as a mathematical function of one or more X variables, so that this regression model may be used to predict Y when only X is known. Following is a generalization of this mathematical equation:
Y = β1 + β2X + ϵ
where, β1 is the intercept and β2 is the slope. Collectively, they are called regression coefficients. ϵ is the error term, the part of Y the regression model is unable to explain.
install.packages('corrplot')
install.packages("psych")
install.packages("ggplot")
install.packages("car")
install.packages("olsrr")
library(car)
library(corrplot)
library(psych)
library(ggplot2)
library(olsrr)
The database was created with records of behavior of the urban traffic of the city of Sao Paulo in Brazil from December 14, 2009 to December 18, 2009 (From Monday to Friday). Registered from 7:00 to 20:00 every 30 minutes.
The intention of this study is to investigate whether the independent variables present in the dataset can be used as predictor in a linear regression model (logistic regression).
Creators original owner and donors: Ricardo Pinto Ferreira (1), Andrea Martiniano (2) and Renato Jose Sassi (3).
df = sp_urban_traffic
names(df) = gsub("\\.", "_", names(df))
df = subset(df, select = -c(Hour__Coded_) )
head(df)
Display the name of the columns
names(df)
[1] "Immobilized_bus" "Broken_Truck" "Vehicle_excess"
[4] "Accident_victim" "Running_over" "Fire_vehicles"
[7] "Occurrence_involving_freight" "Incident_involving_dangerous_freight" "Lack_of_electricity"
[10] "Fire" "Point_of_flooding" "Manifestations"
[13] "Defect_in_the_network_of_trolleybuses" "Tree_on_the_road" "Semaphore_off"
[16] "Intermittent_Semaphore" "Slowness_in_traffic____"
Now we perform a summary of all columns in the dataset. This summary helps understand the main statistical characteristics of the dataset, such as Min, Max, Median, Mean and etc.
summary(df)
Immobilized_bus Broken_Truck Vehicle_excess Accident_victim Running_over Fire_vehicles
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.000000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000000
Median :0.0000 Median :1.0000 Median :0.00000 Median :0.0000 Median :0.0000 Median :0.000000
Mean :0.3407 Mean :0.8741 Mean :0.02963 Mean :0.4222 Mean :0.1185 Mean :0.007407
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000000
Max. :4.0000 Max. :5.0000 Max. :1.00000 Max. :3.0000 Max. :2.0000 Max. :1.000000
Occurrence_involving_freight Incident_involving_dangerous_freight Lack_of_electricity Fire Point_of_flooding
Min. :0.000000 Min. :0.000000 Min. :0.0000 Min. :0.000000 Min. :0.0000
1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000
Median :0.000000 Median :0.000000 Median :0.0000 Median :0.000000 Median :0.0000
Mean :0.007407 Mean :0.007407 Mean :0.1185 Mean :0.007407 Mean :0.1185
3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:0.0000
Max. :1.000000 Max. :1.000000 Max. :4.0000 Max. :1.000000 Max. :7.0000
Manifestations Defect_in_the_network_of_trolleybuses Tree_on_the_road Semaphore_off Intermittent_Semaphore
Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
Mean :0.05185 Mean :0.2296 Mean :0.04444 Mean :0.1259 Mean :0.01481
3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000
Max. :1.00000 Max. :8.0000 Max. :1.00000 Max. :4.0000 Max. :1.00000
Slowness_in_traffic____
Min. : 3.40
1st Qu.: 7.40
Median : 9.00
Mean :10.05
3rd Qu.:11.85
Max. :23.40
Generally, any datapoint that lies outside the 1.5 * interquartile-range (1.5 * IQR) is considered an outlier, where, IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
Analyzing the results in the box plot we can observe that are a large number of outlier present in the dataset, however, by conducting a qualitative analysis we observed that this is an important characteristic to describe the target variable.
A correlation matrix is merely a table containing the correlation coefficients of several variables. The matrix illustrates the relationship between every conceivable pair of values in a table. It is a strong tool for summarizing massive datasets and identifying and visualizing data trends.
The use of correlation matrix may help us understand the relationship between all the independent variables, also, how the variables relate to the dependent variable.
corr = round(cor(df),2)
corrplot(corr, type = "upper", tl.pos = "td",
method = "circle", tl.cex = 0.5, tl.col = 'black',
order = "hclust", diag = FALSE)
We now separate the data into X and Y variable to facilitate further analysis. In the X variable we selected all the independent variables in the dataset, while in the Y variable we stored the target (or dependent variable)
X = subset(df, select = -c(Slowness_in_traffic____))
Y = df$Slowness_in_traffic____
The Kaiser-Meyer-Olkin (KMO) used to measure sampling adequacy is a better measure of factorability.
KMO(r=cor(X))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(X))
Overall MSA = 0.48
MSA for each item =
Immobilized_bus Broken_Truck Vehicle_excess
0.43 0.59 0.36
Accident_victim Running_over Fire_vehicles
0.52 0.42 0.35
Occurrence_involving_freight Incident_involving_dangerous_freight Lack_of_electricity
0.41 0.39 0.51
Fire Point_of_flooding Manifestations
0.52 0.64 0.45
Defect_in_the_network_of_trolleybuses Tree_on_the_road Semaphore_off
0.33 0.44 0.55
Intermittent_Semaphore
0.53
Bartlett’s Test of Sphericity compares a correlation matrix observed to the identity matrix. Essentially, it examines whether there is duplication between the variables that may be summarized by a small number of elements.
cortest.bartlett(X)
R was not square, finding R from data
$chisq
[1] 205.5523
$p.value
[1] 0.000001919332
$df
[1] 120
det(cor(X))
[1] 0.2002936
The Kaiser-Meyer Olkin (KMO) and Bartlett’s Test measure of sampling adequacy were used to examine the appropriateness of Factor Analysis. The approximate of Chi-square is 205 with 120 degrees of freedom, which is significant at 0.05 Level of significance. The KMO statistic of 0.48 is lower than the expected 0.5, however, due the presence of outliers in the sample, Factor Analysis is considered as an appropriate technique for further analysis of the data.
Examining the “scree” plot of the successive eigenvalues is one method for identifying the number of variables or components in a data matrix or a correlation matrix. Sharp splits in the graph indicate the right amount of elements or components to extract. The scree plot depicts the Eigenvalue as a function of each factor.
The graph reveals that at factor 7, the scree plot’s curvature undergoes an abrupt change. This demonstrates that the total variance accounts for a lower amount after factor 7.
Selection of elements from the scree plot can be based on the following criteria:
1.According to the Kaiser-Guttman normalization criterion, we must select all factors having an eigenvalue larger than 1
2.Bend elbow rule
fafitfree = fa(df,nfactors = ncol(X), rotate = "none")
n_factors = length(fafitfree$e.values)
scree = data.frame(
Factor_n = as.factor(1:n_factors),
Eigenvalue = fafitfree$e.values)
ggplot(scree, aes(x = Factor_n, y = Eigenvalue, group = 1)) +
geom_point() + geom_line() +
xlab("Number of factors") +
ylab("Initial eigenvalue") +
labs( title = "Scree Plot",
subtitle = "(Based on the unreduced correlation matrix)")
parallel = fa.parallel(X)
Parallel analysis suggests that the number of factors = 2 and the number of components = 2
fa.none = fa(r=X, nfactors = 7, covar = FALSE, SMC = TRUE, max.iter=100,rotate="varimax")
Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
An ultra-Heywood case was detected. Examine the results carefully
print(fa.none)
Factor Analysis using method = minres
Call: fa(r = X, nfactors = 7, rotate = "varimax", SMC = TRUE, covar = FALSE,
max.iter = 100)
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 MR5 MR3 MR7 MR4 MR6
SS loadings 1.45 1.15 1.11 1.11 1.07 1.02 0.98
Proportion Var 0.09 0.07 0.07 0.07 0.07 0.06 0.06
Cumulative Var 0.09 0.16 0.23 0.30 0.37 0.43 0.49
Proportion Explained 0.18 0.15 0.14 0.14 0.14 0.13 0.12
Cumulative Proportion 0.18 0.33 0.47 0.61 0.75 0.88 1.00
Mean item complexity = 1.8
Test of the hypothesis that 7 factors are sufficient.
The degrees of freedom for the null model are 120 and the objective function was 1.61 with Chi Square of 205.55
The degrees of freedom for the model are 29 and the objective function was 0.08
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.03
The harmonic number of observations is 135 with the empirical chi square 8.19 with prob < 1
The total number of observations was 135 with Likelihood Chi Square = 9.35 with prob < 1
Tucker Lewis Index of factoring reliability = 2.042
RMSEA index = 0 and the 90 % confidence intervals are 0 0
BIC = -132.9
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy
MR1 MR2 MR5 MR3 MR7 MR4 MR6
Correlation of (regression) scores with factors 0.89 0.99 1.00 1.00 0.98 1.00 1
Multiple R square of scores with factors 0.79 0.97 0.99 1.00 0.95 0.99 1
Minimum correlation of possible factor scores 0.59 0.95 0.99 0.99 0.91 0.99 1
factanal.none = factanal(X, factors=7, scores = c("regression"), rotation = "varimax")
print(factanal.none)
Call:
factanal(x = X, factors = 7, scores = c("regression"), rotation = "varimax")
Uniquenesses:
Immobilized_bus Broken_Truck Vehicle_excess
0.005 0.649 0.950
Accident_victim Running_over Fire_vehicles
0.735 0.005 0.983
Occurrence_involving_freight Incident_involving_dangerous_freight Lack_of_electricity
0.561 0.709 0.085
Fire Point_of_flooding Manifestations
0.996 0.826 0.005
Defect_in_the_network_of_trolleybuses Tree_on_the_road Semaphore_off
0.005 0.968 0.452
Intermittent_Semaphore
0.968
Loadings:
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
Immobilized_bus 0.301 0.159 0.926 0.108
Broken_Truck 0.124 0.133 0.553
Vehicle_excess 0.213
Accident_victim -0.107 0.489
Running_over 0.979 0.158
Fire_vehicles 0.105
Occurrence_involving_freight 0.656
Incident_involving_dangerous_freight 0.488 0.175 0.110
Lack_of_electricity 0.942 -0.115
Fire
Point_of_flooding 0.370 0.101
Manifestations 0.875 -0.114 0.454
Defect_in_the_network_of_trolleybuses 0.991
Tree_on_the_road 0.114
Semaphore_off 0.663 0.219 0.200
Intermittent_Semaphore 0.152
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
SS loadings 1.517 1.097 1.076 1.027 0.982 0.739 0.660
Proportion Var 0.095 0.069 0.067 0.064 0.061 0.046 0.041
Cumulative Var 0.095 0.163 0.231 0.295 0.356 0.402 0.444
Test of the hypothesis that 7 factors are sufficient.
The chi square statistic is 8 on 29 degrees of freedom.
The p-value is 1
Typically, factor analysis results are interpreted in terms of each factor’s principal loadings. These structures may be represented as a table of loadings or graphically, where any loadings with an absolute value greater than a specified cut point are displayed as edges (path).
fa.diagram(fa.none)
head(factanal.none$scores)
Factor1 Factor2 Factor3 Factor4 Factor5 Factor6 Factor7
1 -0.2826382 -0.2786558 -0.2494845 -0.2441464 -0.3487159 -0.1358222 -0.5288559
2 -0.2826382 -0.2786558 -0.2494845 -0.2441464 -0.3487159 -0.1358222 -0.5288559
3 -0.2826382 -0.2786558 -0.2494845 -0.2441464 -0.3487159 -0.1358222 -0.5288559
4 -0.2826382 -0.2786558 -0.2494845 -0.2441464 -0.3487159 -0.1358222 -0.5288559
5 -0.2826382 -0.2786558 -0.2494845 -0.2441464 -0.3487159 -0.1358222 -0.5288559
6 -0.2826382 -0.2786558 -0.2494845 -0.2441464 -0.3487159 -0.1358222 -0.5288559
Now we generate a new dataset by combining the factor values (obtained via the factanal approach) with the Y (target variable) of our previous dataset.
fac_data = cbind(df["Slowness_in_traffic____"], factanal.none$scores)
names(fac_data) = c("Target_Y", "F1", "F2", "F3", "F4", "F5", "F6", "F7")
head(fac_data)
fac_model = lm(Target_Y~., fac_data)
summary(regression_model)
Call:
lm(formula = Target_Y ~ ., data = reg_data)
Residuals:
Min 1Q Median 3Q Max
-6.1093 -2.5435 -0.8632 1.9718 10.5456
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.0518 0.3320 30.277 < 0.0000000000000002 ***
F1 1.9855 0.3476 5.712 0.0000000752 ***
F2 -0.0129 0.3396 -0.038 0.970
F3 0.2543 0.3781 0.673 0.502
F4 -0.8638 0.3348 -2.580 0.011 *
F5 0.2299 0.3347 0.687 0.493
F6 0.1660 0.4684 0.354 0.724
F7 0.8825 0.4812 1.834 0.069 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.857 on 127 degrees of freedom
Multiple R-squared: 0.2593, Adjusted R-squared: 0.2184
F-statistic: 6.35 on 7 and 127 DF, p-value: 0.000002097
We found an Ajudsted R-squared of 0.22 and a Multiple R-squared of 0.26 when using the factor as a predictor of the dependent variable. This outcome can be regarded inadequate.
all_model = lm(Slowness_in_traffic____~., df)
summary(all_model)
Call:
lm(formula = Slowness_in_traffic____ ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-6.2722 -2.6275 -0.5169 1.7329 10.5966
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.31709 0.47272 19.710 < 0.0000000000000002 ***
Immobilized_bus 0.73447 0.53725 1.367 0.17419
Broken_Truck -0.01366 0.31467 -0.043 0.96544
Vehicle_excess -1.14578 1.93553 -0.592 0.55500
Accident_victim 0.41350 0.49102 0.842 0.40142
Running_over 0.18428 0.98104 0.188 0.85132
Fire_vehicles 6.34860 3.73014 1.702 0.09139 .
Occurrence_involving_freight -1.40080 4.15978 -0.337 0.73690
Incident_involving_dangerous_freight -2.55804 4.21938 -0.606 0.54551
Lack_of_electricity 1.91319 0.84562 2.262 0.02550 *
Fire -1.60343 3.70287 -0.433 0.66579
Point_of_flooding 2.00539 0.48893 4.102 0.0000758 ***
Manifestations 1.48582 1.74176 0.853 0.39536
Defect_in_the_network_of_trolleybuses -1.13632 0.40422 -2.811 0.00578 **
Tree_on_the_road -1.26297 1.56574 -0.807 0.42150
Semaphore_off 1.29074 0.92773 1.391 0.16676
Intermittent_Semaphore -3.96964 2.65756 -1.494 0.13792
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.674 on 118 degrees of freedom
Multiple R-squared: 0.3757, Adjusted R-squared: 0.291
F-statistic: 4.438 on 16 and 118 DF, p-value: 0.0000008591
As the predictors in our prior models were unable to adequately explain the variation of the target, we are now employing the Stewise Forward approach to address this issue. Now, we are attempting to choose a new subset of variables that may provide a better fit to the target.
The stepwise regression (or stepwise selection) comprises of iteratively adding and removing predictors from the predictive model to discover the subset of variables in the data set that results in the model with the lowest prediction error.
j = ols_step_forward_p(all_model)
j
Selection Summary
------------------------------------------------------------------------------------------------------
Variable Adj.
Step Entered R-Square R-Square C(p) AIC RMSE
------------------------------------------------------------------------------------------------------
1 Lack_of_electricity 0.1906 0.1845 21.9836 757.3317 3.9402
2 Point_of_flooding 0.2752 0.2642 7.9943 744.4288 3.7427
3 Defect_in_the_network_of_trolleybuses 0.3075 0.2917 3.8828 740.2678 3.6722
4 Fire_vehicles 0.3274 0.3067 2.1196 738.3294 3.6329
5 Semaphore_off 0.3407 0.3152 1.6108 737.6384 3.6108
6 Intermittent_Semaphore 0.3522 0.3218 1.4452 737.2716 3.5933
7 Immobilized_bus 0.3624 0.3273 1.5045 737.1148 3.5787
------------------------------------------------------------------------------------------------------
plot(j)
The stepwise identified 7 variables that may represent a better subset as dependent variables. Using these new subset, we now run a regression analysis and visualize the results.
foward_model = lm(Slowness_in_traffic____~ Lack_of_electricity + Point_of_flooding + Defect_in_the_network_of_trolleybuses + Fire_vehicles + Semaphore_off + Intermittent_Semaphore + Immobilized_bus, df)
summary(foward_model)
Call:
lm(formula = Slowness_in_traffic____ ~ Lack_of_electricity +
Point_of_flooding + Defect_in_the_network_of_trolleybuses +
Fire_vehicles + Semaphore_off + Intermittent_Semaphore +
Immobilized_bus, data = df)
Residuals:
Min 1Q Median 3Q Max
-6.0366 -2.5691 -0.6016 1.6134 10.4634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.4366 0.3685 25.609 < 0.0000000000000002 ***
Lack_of_electricity 1.8977 0.8050 2.357 0.01993 *
Point_of_flooding 2.0730 0.4679 4.431 0.00002 ***
Defect_in_the_network_of_trolleybuses -1.1122 0.3896 -2.855 0.00503 **
Fire_vehicles 6.6862 3.6085 1.853 0.06622 .
Semaphore_off 1.4185 0.8561 1.657 0.09998 .
Intermittent_Semaphore -3.9752 2.5539 -1.557 0.12208
Immobilized_bus 0.6772 0.4735 1.430 0.15513
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.579 on 127 degrees of freedom
Multiple R-squared: 0.3624, Adjusted R-squared: 0.3273
F-statistic: 10.31 on 7 and 127 DF, p-value: 0.0000000003389
Compared to the prior results, the Adjusted R-squared and Multiple R-squared for this new subset were 0.33 and 0.36, respectively. This novel model yields findings that are demonstrably superior to those obtained employing all variables or factors as independent variables.
The stepwise forward method provides a better fit for the data in this instance. However, R2 (0.36) is still poor in comparison to contemporary approaches. This may imply that the dataset is not of optimal quality or that linear regression is not an effective prediction model.