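We started with a first model that used all the variables. On review we judged that some of them were not important, so the second model drops fuel, seller_type, transmission, and max_power. We then tried to validate the second model, but it did not meet the required assumptions: the Breusch-Pagan, Jarque-Bera, and Durbin-Watson tests all rejected their null hypotheses. Next we applied a logarithm to the variables, but the Durbin-Watson test still rejected. We tried another transformation, squaring the variables, and that did not help either. We decided to keep the second model: its R^2 (0.4805) is higher than that of the squared model (0.2542), while the R^2 of the log model (0.7821) is computed on log(selling_price) and so is not directly comparable.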
In this model we identified influential points and noticed that removing the outliers improved the R^2. We repeated this process four times (the sample shrank from 7,906 to 5,755 observations and the R^2 rose from 0.4805 to 0.6927) and found that the variable seats was no longer significant (p = 0.0717). We fitted a new model without it. However, this new model did not pass validation either. In the end we kept this last model, called m9, knowing that it would not predict the original problem accurately.
Call:
lm(formula = selling_price ~ year + fuel + seller_type + transmission +
km_driven + mileage + engine + max_power + seats + owner,
data = Car)
Residuals:
Min 1Q Median 3Q Max
-2575253 -201071 -7713 139515 7894634
Coefficients:
                                    Estimate    Std. Error  t value              Pr(>|t|)
(Intercept)                  -104514694.4704  4092124.0535  -25.540  < 0.0000000000000002 ***
year                              52569.6649     2063.1654   25.480  < 0.0000000000000002 ***
fuelDiesel                       201381.0552   196309.0758    1.026               0.30500
fuelLPG                          299157.2954   192749.2970    1.552               0.12069
fuelPetrol                       117496.3849   202848.8640    0.579               0.56245
seller_typeIndividual           -318367.2402    18701.8122  -17.023  < 0.0000000000000002 ***
seller_typeTrustmark Dealer     -581737.0308    38017.4979  -15.302  < 0.0000000000000002 ***
transmissionManual              -778972.2446    21126.6380  -36.872  < 0.0000000000000002 ***
km_driven                            -1.2490        0.1237  -10.097  < 0.0000000000000002 ***
mileage                             941.9220     1859.7851    0.506               0.61254
engine                              748.0859       20.8666   35.851  < 0.0000000000000002 ***
max_power                            -4.8035        1.8209   -2.638               0.00836 **
seats                           -171308.4678     8413.9990  -20.360  < 0.0000000000000002 ***
ownerFourth & Above Owner         47746.8214    43641.3818    1.094               0.27396
ownerSecond Owner                -49285.4077    15184.3445   -3.246               0.00118 **
ownerTest Drive Car             2361207.5778   233760.8136   10.101  < 0.0000000000000002 ***
ownerThird Owner                    703.2975    26121.2574    0.027               0.97852
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 521000 on 7889 degrees of freedom
Multiple R-squared: 0.5908, Adjusted R-squared: 0.5899
F-statistic: 711.7 on 16 and 7889 DF, p-value: < 0.00000000000000022
Call:
lm(formula = selling_price ~ year + km_driven + engine + seats +
mileage + owner, data = Car)
Residuals:
Min 1Q Median 3Q Max
-2798645 -268317 -103386 103435 8107774
Coefficients:
                                   Estimate    Std. Error  t value              Pr(>|t|)
(Intercept)                 -135717165.9720  4363555.2271  -31.102  < 0.0000000000000002 ***
year                             67790.6994     2167.4412   31.277  < 0.0000000000000002 ***
km_driven                           -1.9205        0.1347  -14.260  < 0.0000000000000002 ***
engine                            1100.0004       17.1200   64.252  < 0.0000000000000002 ***
seats                          -291213.1515     8794.2820  -33.114  < 0.0000000000000002 ***
mileage                          -1616.2773      742.2055   -2.178                0.0295 *
ownerFourth & Above Owner        38002.3268    49083.2723    0.774                0.4388
ownerSecond Owner               -96032.8651    16965.8957   -5.660          0.0000000156 ***
ownerTest Drive Car            2851019.0284   262758.9283   10.850  < 0.0000000000000002 ***
ownerThird Owner                -35763.3204    29313.7362   -1.220                0.2225
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 586700 on 7896 degrees of freedom
Multiple R-squared: 0.4805, Adjusted R-squared: 0.4799
F-statistic: 811.4 on 9 and 7896 DF, p-value: < 0.00000000000000022
GVIF Df GVIF^(1/(2*Df))
year 1.610277 1 1.268967
km_driven 1.343299 1 1.159008
engine 1.708775 1 1.307201
seats 1.633904 1 1.278243
mileage 1.148388 1 1.071629
owner 1.386978 4 1.041738
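As a reading aid for the last column: for a term with Df degrees of freedom, GVIF^(1/(2*Df)) rescales the generalized VIF to a per-coefficient scale, comparable to the square root of an ordinary VIF. The following is only a worked check using two rows printed above; it introduces no new data.

```latex
\begin{align*}
\text{year } (Df = 1):&\quad 1.610277^{1/2} \approx 1.268967 \\
\text{owner } (Df = 4):&\quad 1.386978^{1/8} \approx 1.041738
\end{align*}
```

All rescaled values are close to 1, which by the usual rules of thumb suggests that multicollinearity is not a concern among these predictors.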
Analysis of Variance Table
Response: selling_price
             Df            Sum Sq           Mean Sq    F value                 Pr(>F)
year          1   889478158860597   889478158860597  2583.6488  < 0.00000000000000022 ***
km_driven     1    13250390714890    13250390714890    38.4881        0.0000000005786 ***
engine        1  1173632954213440  1173632954213440  3409.0274  < 0.00000000000000022 ***
seats         1   382653564140254   382653564140254  1111.4859  < 0.00000000000000022 ***
mileage       1     2070478028500     2070478028500     6.0141                0.01421 *
owner         4    52995097393699    13248774348425    38.4834  < 0.00000000000000022 ***
Residuals  7896  2718372359604428      344272082017
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
studentized Breusch-Pagan test
data: m2
BP = 850.16, df = 9, p-value < 0.00000000000000022
Jarque Bera Test
data: Car$residual
X-squared = 164241, df = 2, p-value < 0.00000000000000022
Durbin-Watson test
data: m2
DW = 1.475, p-value < 0.00000000000000022
alternative hypothesis: true autocorrelation is greater than 0
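The three tests above check constant error variance (Breusch-Pagan), normality of the residuals (Jarque-Bera), and first-order autocorrelation (Durbin-Watson). Their headers appear to match `bptest` and `dwtest` from lmtest and `jarque.bera.test` from tseries. The sketch below shows how the three statistics are computed, using base R only and simulated data. It is not the report's code: the `Car` data are not reproduced here, and `sim`, `fit`, `aux`, and the `*_stat` names are illustrative assumptions.

```r
# Simulated heteroscedastic data; a stand-in for the Car data (assumption).
set.seed(1)
n <- 500
sim <- data.frame(x1 = runif(n, 1, 10), x2 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - sim$x2 + rnorm(n, sd = sim$x1)  # error sd grows with x1

fit <- lm(y ~ x1 + x2, data = sim)
e <- residuals(fit)

# Studentized (Koenker) Breusch-Pagan: n * R^2 from regressing e^2 on the predictors.
sim$e2 <- e^2
aux <- lm(e2 ~ x1 + x2, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Jarque-Bera: n/6 * (S^2 + (K - 3)^2 / 4), with moment-based skewness S and kurtosis K.
mom2 <- mean((e - mean(e))^2)
S <- mean((e - mean(e))^3) / mom2^1.5
K <- mean((e - mean(e))^4) / mom2^2
jb_stat <- n / 6 * (S^2 + (K - 3)^2 / 4)
jb_p <- pchisq(jb_stat, df = 2, lower.tail = FALSE)

# Durbin-Watson statistic: values near 2 indicate no first-order autocorrelation.
dw_stat <- sum(diff(e)^2) / sum(e^2)

print(c(BP = bp_stat, BP_p = bp_p, JB = jb_stat, JB_p = jb_p, DW = dw_stat))
```

One caveat when reading the Durbin-Watson result: the statistic depends on the order of the rows, so for data without a natural time ordering its value may reflect how the file happens to be sorted.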
Call:
lm(formula = log(selling_price) ~ log(year) + log(km_driven) +
log(engine) + log(seats), data = Car)
Residuals:
Min 1Q Median 3Q Max
-2.25332 -0.22954 0.01144 0.21834 2.37841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2017.32266 20.76397 -97.16 <0.0000000000000002 ***
log(year) 265.65682 2.72654 97.43 <0.0000000000000002 ***
log(km_driven) -0.09961 0.00609 -16.35 <0.0000000000000002 ***
log(engine) 1.59216 0.01682 94.66 <0.0000000000000002 ***
log(seats) -0.71256 0.03477 -20.50 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3865 on 7901 degrees of freedom
Multiple R-squared: 0.7821, Adjusted R-squared: 0.782
F-statistic: 7089 on 4 and 7901 DF, p-value: < 0.00000000000000022
Durbin-Watson test
data: m3
DW = 1.7476, p-value < 0.00000000000000022
alternative hypothesis: true autocorrelation is greater than 0
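Because both the response and the predictors are logged, each slope in this model reads as an elasticity. A worked reading of two printed coefficients (arithmetic only):

```latex
\begin{align*}
\text{engine } +1\%:&\quad 1.01^{1.59216} \approx 1.0160 \;\Rightarrow\; \text{selling\_price} \approx +1.6\% \\
\text{km\_driven } +10\%:&\quad 1.10^{-0.09961} \approx 0.9906 \;\Rightarrow\; \text{selling\_price} \approx -0.9\%
\end{align*}
```

The R-squared of 0.7821 measures variance explained in log(selling_price), so it cannot be compared directly with the R-squared of the models fitted to selling_price itself.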
Call:
lm(formula = (selling_price)^2 ~ (year)^2 + (km_driven)^2 + (engine)^2 +
(seats)^2, data = Car)
Residuals:
Min 1Q Median 3Q Max
-12178978685540 -1370347041685 -630606862111 512847288024 93598807252400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -369316319119938 23194540450977 -15.92 <0.0000000000000002 ***
year 184931401333 11514790927 16.06 <0.0000000000000002 ***
km_driven -9375775 807123 -11.62 <0.0000000000000002 ***
engine 4514600372 100641317 44.86 <0.0000000000000002 ***
seats -1474274875036 53012386826 -27.81 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3545000000000 on 7901 degrees of freedom
Multiple R-squared: 0.2542, Adjusted R-squared: 0.2538
F-statistic: 673.2 on 4 and 7901 DF, p-value: < 0.00000000000000022
Durbin-Watson test
data: m1
DW = 1.5779, p-value < 0.00000000000000022
alternative hypothesis: true autocorrelation is greater than 0
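The coefficient names in the output above are year, km_driven, engine, and seats, not squared terms. In R's formula language, `^` on the right-hand side denotes crossing of terms, so `(year)^2` reduces to `year`; only the response was actually squared in this fit. Arithmetic powers on predictors need `I()`. A small sketch on simulated data (the names `d`, `x`, `y`, and `fit_*` are illustrative, not from the report):

```r
set.seed(42)
d <- data.frame(x = runif(200, 1, 5))
d$y <- 1 + 0.5 * d$x^2 + rnorm(200, sd = 0.2)

fit_plain <- lm(y ~ x, data = d)       # linear in x
fit_paren <- lm(y ~ (x)^2, data = d)   # formula operator: same model as y ~ x
fit_power <- lm(y ~ I(x^2), data = d)  # arithmetic: regress y on x squared

print(names(coef(fit_paren)))  # "(Intercept)" "x"
print(names(coef(fit_power)))  # "(Intercept)" "I(x^2)"
print(all.equal(coef(fit_plain), coef(fit_paren)))
```

Under this reading, a squared-predictor version of the model above would be written with terms such as `I(year^2)`; whether that would improve the fit is not something this output can tell us.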
Stepwise Summary
----------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
----------------------------------------------------------------------------------
0 Base Model 237627.973 237641.923 215190.296 0.00000 0.00000
1 engine 235789.876 235810.802 213352.010 0.20765 0.20755
2 year 233967.822 233995.723 211530.467 0.37090 0.37074
3 seats 232838.919 232873.796 210402.291 0.45475 0.45454
4 km_driven 232617.303 232659.155 210180.860 0.46995 0.46969
5 owner 232471.500 232541.254 210029.252 0.48017 0.47964
6 mileage 232468.753 232545.482 210026.515 0.48048 0.47989
----------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
-----------------------------------------------------------------------------
R 0.693 RMSE 586375.840
R-Squared 0.480 MSE 344272082016.771
Adj. R-Squared 0.480 Coef. Var 90.295
Pred R-Squared 0.472 AIC 232468.753
MAE 334673.055 SBC 232545.482
-----------------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
-------------------------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
-------------------------------------------------------------------------------------------
Regression 2514080643351445.000 9 279342293705716.125 811.4 0.0000
Residual 2718372359604426.000 7896 344272082016.771
Total 5232453002955871.000 7905
-------------------------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------------------------------------------
(Intercept) -135717165.972 4363555.227 -31.102 0.000 -144270888.244 -127163443.700
engine 1100.000 17.120 0.681 64.252 0.000 1066.441 1133.560
year 67790.699 2167.441 0.322 31.277 0.000 63541.941 72039.457
seats -291213.151 8794.282 -0.343 -33.114 0.000 -308452.270 -273974.033
km_driven -1.920 0.135 -0.134 -14.260 0.000 -2.184 -1.656
ownerFourth & Above Owner 38002.327 49083.272 0.007 0.774 0.439 -58213.868 134218.522
ownerSecond Owner -96032.865 16965.896 -0.051 -5.660 0.000 -129290.508 -62775.223
ownerTest Drive Car 2851019.028 262758.928 0.088 10.850 0.000 2335942.037 3366096.020
ownerThird Owner -35763.320 29313.736 -0.011 -1.220 0.222 -93225.996 21699.355
mileage -1616.277 742.205 -0.019 -2.178 0.029 -3071.196 -161.358
--------------------------------------------------------------------------------------------------------------------------------
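The summary figures of the final stepwise model can be recovered from its ANOVA table, which is a useful consistency check. With n = 7906 observations and p = 9 estimated slopes:

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{reg}}}{SS_{\text{total}}}
     = \frac{2514080643351445}{5232453002955871} \approx 0.48048 \\
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
     = 1 - 0.51952 \times \frac{7905}{7896} \approx 0.47989 \\
F &= \frac{MS_{\text{reg}}}{MS_{\text{res}}}
     = \frac{279342293705716.125}{344272082016.771} \approx 811.4
\end{align*}
```

These agree with the last row of the stepwise summary and with the earlier `lm` output for the same formula.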
      rstudent unadjusted p-value Bonferroni p
165  14.002181         5.0674e-44   4.0063e-40
3378 10.042409         1.3742e-23   1.0864e-19
2847  8.148899         4.2372e-16   3.3499e-12
7143  6.799553         1.1259e-11   8.9013e-08
134   6.755091         1.5287e-11   1.2086e-07
7308  6.529677         6.9989e-11   5.5333e-07
7713  6.323173         2.7018e-10   2.1360e-06
373   6.242808         4.5199e-10   3.5734e-06
411   6.242808         4.5199e-10   3.5734e-06
643   6.242808         4.5199e-10   3.5734e-06
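The Bonferroni column is the unadjusted p-value multiplied by the number of observations. For row 165, the unadjusted value of about 5.0674e-44 times n = 7906 gives about 4.0063e-40. The layout appears to match `outlierTest` from the car package. The sketch below reproduces the calculation in base R on simulated data; `d`, `fit`, and the planted outlier are assumptions made for illustration.

```r
set.seed(7)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$y[1] <- d$y[1] + 10            # plant one gross outlier in row 1

fit <- lm(y ~ x, data = d)
t_i <- rstudent(fit)             # externally studentized residuals
df_t <- df.residual(fit) - 1     # degrees of freedom of each residual's t distribution

p_unadj <- 2 * pt(abs(t_i), df = df_t, lower.tail = FALSE)
p_bonf <- pmin(1, n * p_unadj)   # Bonferroni: multiply by the number of tests, cap at 1

flagged <- which(p_bonf < 0.05)
print(data.frame(row = flagged, rstudent = t_i[flagged], bonferroni_p = p_bonf[flagged]))
```

This sketch caps the adjusted value at 1. The NA shown in the Bonferroni column for the last model is consistent with the product exceeding 1 there (0.0041779 times 5755 is about 24).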
Call:
lm(formula = selling_price ~ year + km_driven + engine + seats +
mileage + owner, data = Car_nueva)
Residuals:
Min 1Q Median 3Q Max
-918884 -152293 -33137 93115 1972556
Coefficients:
                           Estimate     Std. Error  t value              Pr(>|t|)
(Intercept)        -100151434.15093  2536977.29567  -39.477  < 0.0000000000000002 ***
year                    49970.58398     1265.09702   39.499  < 0.0000000000000002 ***
km_driven                  -1.64344        0.09661  -17.011  < 0.0000000000000002 ***
engine                    662.32632       10.54042   62.837  < 0.0000000000000002 ***
seats                 -125206.05501     5020.55055  -24.939  < 0.0000000000000002 ***
mileage                 -4588.65019     1078.06747   -4.256          0.0000210372 ***
ownerSecond Owner      -46643.24703     8440.69751   -5.526          0.0000000339 ***
ownerThird Owner       -28288.15748    15362.80437   -1.841                0.0656 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 285600 on 7311 degrees of freedom
Multiple R-squared: 0.5781, Adjusted R-squared: 0.5777
F-statistic: 1431 on 7 and 7311 DF, p-value: < 0.00000000000000022
rstudent unadjusted p-value Bonferroni p
1027 6.947624 0.0000000000040348 0.000000029530
3961 6.947624 0.0000000000040348 0.000000029530
2619 6.841905 0.0000000000084469 0.000000061823
7298 6.202527 0.0000000005859000 0.000004288200
663 6.121944 0.0000000009721900 0.000007115500
1781 6.077536 0.0000000012817000 0.000009380900
6043 6.077536 0.0000000012817000 0.000009380900
1768 6.006661 0.0000000019846000 0.000014525000
5492 6.006661 0.0000000019846000 0.000014525000
5873 5.958742 0.0000000026599000 0.000019468000

Call:
lm(formula = selling_price ~ year + km_driven + engine + seats +
mileage + owner, data = Car_nueva2)
Residuals:
Min 1Q Median 3Q Max
-583062 -115269 -19926 87556 867794
Coefficients:
                           Estimate     Std. Error  t value              Pr(>|t|)
(Intercept)         -95335472.10865  1989492.72767  -47.919  < 0.0000000000000002 ***
year                    47396.78283      992.66460   47.747  < 0.0000000000000002 ***
km_driven                  -0.86617        0.07726  -11.211  < 0.0000000000000002 ***
engine                    504.79908        8.45762   59.686  < 0.0000000000000002 ***
seats                  -57907.77714     4067.38551  -14.237  < 0.0000000000000002 ***
mileage                  2489.17916      880.46994    2.827               0.00471 **
ownerSecond Owner      -36391.10715     5976.59664   -6.089          0.0000000012 ***
ownerThird Owner       -46329.55721    20545.24618   -2.255               0.02417 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 195000 on 6559 degrees of freedom
Multiple R-squared: 0.6322, Adjusted R-squared: 0.6318
F-statistic: 1611 on 7 and 6559 DF, p-value: < 0.00000000000000022
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
4251 4.461397 0.0000082787 0.054366

Call:
lm(formula = selling_price ~ year + km_driven + engine + seats +
mileage + owner, data = Car_nueva3)
Residuals:
Min 1Q Median 3Q Max
-414174 -97490 -10583 83158 579420
Coefficients:
                           Estimate     Std. Error  t value              Pr(>|t|)
(Intercept)         -89419749.40398  1589393.37195  -56.260  < 0.0000000000000002 ***
year                    44368.88596      793.05611   55.947  < 0.0000000000000002 ***
km_driven                  -0.61167        0.06256   -9.778  < 0.0000000000000002 ***
engine                    376.29352        7.25425   51.872  < 0.0000000000000002 ***
seats                   -6307.79881     3459.87034   -1.823                0.0683 .
mileage                  4963.47305      689.64817    7.197     0.000000000000688 ***
ownerSecond Owner      -31456.46493     4672.05176   -6.733     0.000000000018141 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 148400 on 6140 degrees of freedom
Multiple R-squared: 0.6652, Adjusted R-squared: 0.6649
F-statistic: 2033 on 6 and 6140 DF, p-value: < 0.00000000000000022
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
7486 3.913781 0.000091845 0.56457

Call:
lm(formula = selling_price ~ year + km_driven + engine + seats +
mileage + owner, data = Car_nueva4)
Residuals:
Min 1Q Median 3Q Max
-331067 -90300 -3422 82590 366624
Coefficients:
                           Estimate     Std. Error  t value              Pr(>|t|)
(Intercept)         -85056433.04165  1466144.74700  -58.014  < 0.0000000000000002 ***
year                    42161.93317      731.55496   57.633  < 0.0000000000000002 ***
km_driven                  -0.54348        0.05819   -9.339  < 0.0000000000000002 ***
engine                    351.93006        6.84799   51.392  < 0.0000000000000002 ***
seats                    6152.80776     3415.36089    1.802                0.0717 .
mileage                  6651.51845      617.39276   10.774  < 0.0000000000000002 ***
ownerSecond Owner      -27249.66549     4205.31801   -6.480       0.0000000000995 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 128100 on 5748 degrees of freedom
Multiple R-squared: 0.6927, Adjusted R-squared: 0.6924
F-statistic: 2160 on 6 and 5748 DF, p-value: < 0.00000000000000022
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
5246 2.865559 0.0041779 NA
Jarque Bera Test
data: Car_nueva4$residual
X-squared = 658.29, df = 2, p-value < 0.00000000000000022
Analysis of Variance Table
Response: selling_price
             Df           Sum Sq          Mean Sq    F value                 Pr(>F)
year          1  131364557585545  131364557585545  7999.2324  < 0.00000000000000022 ***
km_driven     1    3612344450074    3612344450074   219.9679  < 0.00000000000000022 ***
engine        1   75205141083708   75205141083708  4579.4955  < 0.00000000000000022 ***
seats         1       4848783098       4848783098     0.2953                 0.5869
mileage       1    1931502744103    1931502744103   117.6157  < 0.00000000000000022 ***
owner         1     689532213077     689532213077    41.9879       0.00000000009948 ***
Residuals  5748   94394492260782      16422145487
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
studentized Breusch-Pagan test
data: m8
BP = 556.6, df = 6, p-value < 0.00000000000000022
Call:
lm(formula = selling_price ~ year + km_driven + engine + mileage +
owner, data = Car_nueva4)
Residuals:
Min 1Q Median 3Q Max
-329449 -90469 -3252 82676 362970
Coefficients:
                           Estimate     Std. Error  t value              Pr(>|t|)
(Intercept)         -85453001.04103  1449808.08393  -58.941  < 0.0000000000000002 ***
year                    42372.61131      722.28813   58.664  < 0.0000000000000002 ***
km_driven                  -0.53552        0.05803   -9.227  < 0.0000000000000002 ***
engine                    358.57096        5.77227   62.120  < 0.0000000000000002 ***
mileage                  6414.24284      603.29810   10.632  < 0.0000000000000002 ***
ownerSecond Owner      -27228.26753     4206.12240   -6.473        0.000000000104 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 128200 on 5749 degrees of freedom
Multiple R-squared: 0.6926, Adjusted R-squared: 0.6923
F-statistic: 2590 on 5 and 5749 DF, p-value: < 0.00000000000000022
Analysis of Variance Table
Response: selling_price
             Df           Sum Sq          Mean Sq    F value                 Pr(>F)
year          1  131364557585545  131364557585545   7996.109  < 0.00000000000000022 ***
km_driven     1    3612344450074    3612344450074    219.882  < 0.00000000000000022 ***
engine        1   75205141083708   75205141083708   4577.708  < 0.00000000000000022 ***
mileage       1    1884131451611    1884131451611    114.686  < 0.00000000000000022 ***
owner         1     688455212480     688455212480     41.906        0.0000000001037 ***
Residuals  5749   94447789336969      16428559634
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
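A note on how the decision to drop seats can be checked from the printed numbers. The sequential ANOVA F for seats in the previous model (0.2953) tests seats after only year, km_driven, and engine. The test that matches the decision is the partial F, which compares the residual sums of squares of m9 and of the model with seats (labelled m8 in the Breusch-Pagan output). It should equal the square of the t value for seats, 1.802:

```latex
\begin{align*}
F &= \frac{(RSS_{m9} - RSS_{m8}) / 1}{RSS_{m8} / 5748}
   = \frac{94447789336969 - 94394492260782}{16422145487}
   = \frac{53297076187}{16422145487} \approx 3.25 \\
t^2 &= 1.802^2 \approx 3.25
\end{align*}
```

The two agree, and the corresponding p-value (0.0717) is above 0.05, which is consistent with removing seats. The R-squared changes only from 0.6927 to 0.6926.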
studentized Breusch-Pagan test
data: m9
BP = 490.84, df = 5, p-value < 0.00000000000000022
Jarque Bera Test
data: Car_nueva4$residual
X-squared = 51.296, df = 2, p-value = 0.000000000007263
Durbin-Watson test
data: m9
DW = 1.8611, p-value = 0.0000000673
alternative hypothesis: true autocorrelation is greater than 0