Renta de Bicis

1. Importar la base de datos

# file.choose()
df <- read.csv("/Users/dannaleal/Downloads/rentadebicis.csv")

2. Entender la base de datos

summary(df)
##       hora            dia              mes              año      
##  Min.   : 0.00   Min.   : 1.000   Min.   : 1.000   Min.   :2011  
##  1st Qu.: 6.00   1st Qu.: 5.000   1st Qu.: 4.000   1st Qu.:2011  
##  Median :12.00   Median :10.000   Median : 7.000   Median :2012  
##  Mean   :11.54   Mean   : 9.993   Mean   : 6.521   Mean   :2012  
##  3rd Qu.:18.00   3rd Qu.:15.000   3rd Qu.:10.000   3rd Qu.:2012  
##  Max.   :23.00   Max.   :19.000   Max.   :12.000   Max.   :2012  
##     estacion     dia_de_la_semana     asueto         temperatura   
##  Min.   :1.000   Min.   :1.000    Min.   :0.00000   Min.   : 0.82  
##  1st Qu.:2.000   1st Qu.:2.000    1st Qu.:0.00000   1st Qu.:13.94  
##  Median :3.000   Median :4.000    Median :0.00000   Median :20.50  
##  Mean   :2.507   Mean   :4.014    Mean   :0.02857   Mean   :20.23  
##  3rd Qu.:4.000   3rd Qu.:6.000    3rd Qu.:0.00000   3rd Qu.:26.24  
##  Max.   :4.000   Max.   :7.000    Max.   :1.00000   Max.   :41.00  
##  sensacion_termica    humedad       velocidad_del_viento
##  Min.   : 0.76     Min.   :  0.00   Min.   : 0.000      
##  1st Qu.:16.66     1st Qu.: 47.00   1st Qu.: 7.002      
##  Median :24.24     Median : 62.00   Median :12.998      
##  Mean   :23.66     Mean   : 61.89   Mean   :12.799      
##  3rd Qu.:31.06     3rd Qu.: 77.00   3rd Qu.:16.998      
##  Max.   :45.45     Max.   :100.00   Max.   :56.997      
##  rentas_de_no_registrados rentas_de_registrados rentas_totales 
##  Min.   :  0.00           Min.   :  0.0         Min.   :  1.0  
##  1st Qu.:  4.00           1st Qu.: 36.0         1st Qu.: 42.0  
##  Median : 17.00           Median :118.0         Median :145.0  
##  Mean   : 36.02           Mean   :155.6         Mean   :191.6  
##  3rd Qu.: 49.00           3rd Qu.:222.0         3rd Qu.:284.0  
##  Max.   :367.00           Max.   :886.0         Max.   :977.0

Observaciones:
1. Los días llegan hasta el 19 y no hasta el 31.
2. ¿Cuál es la relación de las estaciones? 1 es primavera, 2 es verano, 3 otoño y 4 invierno.
3. ¿Cuál es la relación de los días de la semana? 1 es domingo, 2 es lunes … y 7 es sábado.

3. Generar la regresión lineal

regresion <- lm(rentas_totales ~ hora + dia + mes + año + estacion + dia_de_la_semana
                + asueto + temperatura + sensacion_termica + humedad + velocidad_del_viento, 
                data = df)
summary(regresion)
## 
## Call:
## lm(formula = rentas_totales ~ hora + dia + mes + año + estacion + 
##     dia_de_la_semana + asueto + temperatura + sensacion_termica + 
##     humedad + velocidad_del_viento, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -305.52  -93.64  -27.70   61.85  649.10 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.661e+05  5.496e+03 -30.217  < 2e-16 ***
## hora                  7.735e+00  2.070e-01  37.368  < 2e-16 ***
## dia                   3.844e-01  2.482e-01   1.549  0.12150    
## mes                   9.996e+00  1.682e+00   5.943 2.89e-09 ***
## año                   8.258e+01  2.732e+00  30.225  < 2e-16 ***
## estacion             -7.774e+00  5.177e+00  -1.502  0.13324    
## dia_de_la_semana      4.393e-01  6.918e-01   0.635  0.52545    
## asueto               -4.864e+00  8.365e+00  -0.582  0.56089    
## temperatura           1.582e+00  1.038e+00   1.524  0.12752    
## sensacion_termica     4.748e+00  9.552e-01   4.971 6.76e-07 ***
## humedad              -2.115e+00  7.884e-02 -26.827  < 2e-16 ***
## velocidad_del_viento  5.582e-01  1.809e-01   3.086  0.00203 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.7 on 10874 degrees of freedom
## Multiple R-squared:  0.3891, Adjusted R-squared:  0.3885 
## F-statistic: 629.6 on 11 and 10874 DF,  p-value: < 2.2e-16

4. Ajustar la regresión lineal

regresion <- lm(rentas_totales ~ hora + mes + año + sensacion_termica + humedad + velocidad_del_viento, data = df)
summary(regresion)
## 
## Call:
## lm(formula = rentas_totales ~ hora + mes + año + sensacion_termica + 
##     humedad + velocidad_del_viento, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -308.60  -93.85  -28.34   61.05  648.09 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.662e+05  5.496e+03 -30.250  < 2e-16 ***
## hora                  7.734e+00  2.070e-01  37.364  < 2e-16 ***
## mes                   7.574e+00  4.207e-01  18.002  < 2e-16 ***
## año                   8.266e+01  2.732e+00  30.258  < 2e-16 ***
## sensacion_termica     6.172e+00  1.689e-01  36.539  < 2e-16 ***
## humedad              -2.121e+00  7.858e-02 -26.988  < 2e-16 ***
## velocidad_del_viento  6.208e-01  1.771e-01   3.506 0.000457 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141.7 on 10879 degrees of freedom
## Multiple R-squared:  0.3886, Adjusted R-squared:  0.3883 
## F-statistic:  1153 on 6 and 10879 DF,  p-value: < 2.2e-16

5. Construir un modelo predictivo

datos <- data.frame(hora = 11.54, mes = 1:12, año = 2013, sensacion_termica = 23.66, 
                    humedad = 61.89, velocidad_del_viento = 12.799)
predict(regresion, datos)
##        1        2        3        4        5        6        7        8 
## 273.6001 281.1738 288.7475 296.3213 303.8950 311.4687 319.0424 326.6161 
##        9       10       11       12 
## 334.1898 341.7635 349.3372 356.9110

Conclusiones

El modelo predictivo nos muestra las bicicletas rentadas por hora por mes durante el próximo año (2013), considerando las demás variables como promedio, con una R-cuadrada ajustada del 39%.

Renta de Casas

1. Importar la base de datos

# file.choose()
bd <- read.csv("/Users/dannaleal/Downloads/HousePriceData.csv")

2. Entender la base de datos

summary(bd)
##   Observation      Dist_Taxi      Dist_Market    Dist_Hospital  
##  Min.   :  1.0   Min.   :  146   Min.   : 1666   Min.   : 3227  
##  1st Qu.:237.0   1st Qu.: 6477   1st Qu.: 9367   1st Qu.:11302  
##  Median :469.0   Median : 8228   Median :11149   Median :13189  
##  Mean   :468.4   Mean   : 8235   Mean   :11022   Mean   :13091  
##  3rd Qu.:700.0   3rd Qu.: 9939   3rd Qu.:12675   3rd Qu.:14855  
##  Max.   :932.0   Max.   :20662   Max.   :20945   Max.   :23294  
##                                                                 
##      Carpet         Builtup        Parking          City_Category     
##  Min.   :  775   Min.   :  932   Length:905         Length:905        
##  1st Qu.: 1317   1st Qu.: 1579   Class :character   Class :character  
##  Median : 1478   Median : 1774   Mode  :character   Mode  :character  
##  Mean   : 1511   Mean   : 1794                                        
##  3rd Qu.: 1654   3rd Qu.: 1985                                        
##  Max.   :24300   Max.   :12730                                        
##  NA's   :7                                                            
##     Rainfall       House_Price       
##  Min.   :-110.0   Min.   :  1492000  
##  1st Qu.: 600.0   1st Qu.:  4623000  
##  Median : 780.0   Median :  5860000  
##  Mean   : 786.9   Mean   :  6083992  
##  3rd Qu.: 970.0   3rd Qu.:  7200000  
##  Max.   :1560.0   Max.   :150000000  
## 
# install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
count(bd, Parking, sort = TRUE)
##        Parking   n
## 1         Open 355
## 2 Not Provided 225
## 3      Covered 184
## 4   No Parking 141
count(bd, City_Category, sort = TRUE)
##   City_Category   n
## 1         CAT B 351
## 2         CAT A 320
## 3         CAT C 234
boxplot(bd$House_Price, horizontal = TRUE)

Observaciones:
1. Tenemos NA en la base de datos en la categoría “Carpet”.
2. Tenemos valores negativos en la categoría “Rainfall”.
3. El precio máximo de casa está con datos atípicos.

3. Limpiar la base de datos

# ¿Cuántos NA tengo en la base de datos?
sum(is.na(bd))
## [1] 7
# ¿Cuántos NA tengo por variable?
sapply(bd, function(x) sum(is.na(x)))
##   Observation     Dist_Taxi   Dist_Market Dist_Hospital        Carpet 
##             0             0             0             0             7 
##       Builtup       Parking City_Category      Rainfall   House_Price 
##             0             0             0             0             0
# Eliminar NA
bd <- na.omit(bd)

# ¿Cuántos NA tengo por variable?
sapply(bd, function(x) sum(is.na(x)))
##   Observation     Dist_Taxi   Dist_Market Dist_Hospital        Carpet 
##             0             0             0             0             0 
##       Builtup       Parking City_Category      Rainfall   House_Price 
##             0             0             0             0             0
# Eliminar el registro de precio alto
bd <- bd[bd$House_Price<15000000,]
boxplot(bd$House_Price, horizontal = TRUE)

# Eliminar registro negativo de Rainfall
bd <- bd <- bd[bd$Rainfall>0,]
boxplot(bd$Rainfall, horizontal = TRUE)

4. Generar la regresión lineal

regresion <- lm(House_Price ~ Dist_Taxi + Dist_Market + Dist_Hospital + Carpet + Builtup + Parking
                + City_Category + Rainfall, data = bd)
summary(regresion)
## 
## Call:
## lm(formula = House_Price ~ Dist_Taxi + Dist_Market + Dist_Hospital + 
##     Carpet + Builtup + Parking + City_Category + Rainfall, data = bd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3572286  -803711   -64861   759084  4399052 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.611e+06  3.681e+05  15.246  < 2e-16 ***
## Dist_Taxi            3.041e+01  2.684e+01   1.133   0.2575    
## Dist_Market          1.248e+01  2.083e+01   0.599   0.5492    
## Dist_Hospital        4.862e+01  3.009e+01   1.616   0.1065    
## Carpet              -7.734e+02  3.478e+03  -0.222   0.8241    
## Builtup              1.315e+03  2.902e+03   0.453   0.6506    
## ParkingNo Parking   -6.046e+05  1.390e+05  -4.351 1.52e-05 ***
## ParkingNot Provided -4.898e+05  1.236e+05  -3.963 8.00e-05 ***
## ParkingOpen         -2.635e+05  1.126e+05  -2.340   0.0195 *  
## City_CategoryCAT B  -1.875e+06  9.607e+04 -19.517  < 2e-16 ***
## City_CategoryCAT C  -2.890e+06  1.059e+05 -27.291  < 2e-16 ***
## Rainfall            -1.260e+02  1.558e+02  -0.809   0.4187    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1223000 on 883 degrees of freedom
## Multiple R-squared:  0.5005, Adjusted R-squared:  0.4943 
## F-statistic: 80.43 on 11 and 883 DF,  p-value: < 2.2e-16

5. Construir un modelo predictivo

datos <- data.frame(Dist_Taxi = 8278, Dist_Market = 16251, Dist_Hospital = 13857, 
                    Carpet = 1455, Builtup = 1764, Parking = "Covered", City_Category = "CAT A", 
                    Rainfall = 390)
predict(regresion, datos)
##       1 
## 7884599

Conclusiones

El modelo predictivo nos muestra el precio de la casa, considerando las demás variables como datos de entrada, con un R-cuadrada ajustada del 49%.

LS0tCnRpdGxlOiAiQWN0aXZpZGFkIDQuMyIKYXV0aG9yOiAiRGFubmEgTGVhbCBBMDA4MzE2OTgiCmRhdGU6ICIyMS85LzIwMjMiCm91dHB1dDogCiAgICAgIGh0bWxfZG9jdW1lbnQ6CiAgICAgICAgdG9jOiBUUlVFCiAgICAgICAgdG9jX2Zsb2F0OiBUUlVFCiAgICAgICAgY29kZV9kb3dubG9hZDogVFJVRQogICAgICAgIHRoZW1lOiAic2ltcGxleCIKICAgICAgICBoaWdobGlnaHQ6ICJtb25vY2hyb21lIgotLS0KIyBSZW50YSBkZSBCaWNpcwohW10oL1VzZXJzL2Rhbm5hbGVhbC9EZXNrdG9wL2JpY2ljbGV0YSBnaWYuZ2lmKQoKIyMgMS4gSW1wb3J0YXIgbGEgYmFzZSBkZSBkYXRvcwpgYGB7cn0KIyBmaWxlLmNob29zZSgpCmRmIDwtIHJlYWQuY3N2KCIvVXNlcnMvZGFubmFsZWFsL0Rvd25sb2Fkcy9yZW50YWRlYmljaXMuY3N2IikKYGBgCgojIyAyLiBFbnRlbmRlciBsYSBiYXNlIGRlIGRhdG9zCmBgYHtyfQpzdW1tYXJ5KGRmKQpgYGAKT2JzZXJ2YWNpb25lczogIAoxLiBMb3MgZMOtYXMgbGxlZ2FuIGhhc3RhIGVsIDE5IHkgbm8gaGFzdGEgZWwgMzEuICAKMi4gwr9DdcOhbCBlcyBsYSByZWxhY2nDs24gZGUgbGFzIGVzdGFjaW9uZXM/IDEgZXMgcHJpbWF2ZXJhLCAyIGVzIHZlcmFubywgMyBvdG/DsW8KeSA0IGludmllcm5vLiAgCjMuIMK/Q3XDoWwgZXMgbGEgcmVsYWNpw7NuIGRlIGxvcyBkw61hcyBkZSBsYSBzZW1hbmE/IDEgZXMgZG9taW5nbywgMiBlcyBsdW5lcyAuLi4KeSA3IGVzIHPDoWJhZG8uCgojIyAzLiBHZW5lcmFyIGxhIHJlZ3Jlc2nDs24gbGluZWFsCmBgYHtyfQpyZWdyZXNpb24gPC0gbG0ocmVudGFzX3RvdGFsZXMgfiBob3JhICsgZGlhICsgbWVzICsgYcOxbyArIGVzdGFjaW9uICsgZGlhX2RlX2xhX3NlbWFuYQogICAgICAgICAgICAgICAgKyBhc3VldG8gKyB0ZW1wZXJhdHVyYSArIHNlbnNhY2lvbl90ZXJtaWNhICsgaHVtZWRhZCArIHZlbG9jaWRhZF9kZWxfdmllbnRvLCAKICAgICAgICAgICAgICAgIGRhdGEgPSBkZikKc3VtbWFyeShyZWdyZXNpb24pCmBgYAojIyA0LiBBanVzdGFyIGxhIHJlZ3Jlc2nDs24gbGluZWFsCmBgYHtyfQpyZWdyZXNpb24gPC0gbG0ocmVudGFzX3RvdGFsZXMgfiBob3JhICsgbWVzICsgYcOxbyArIHNlbnNhY2lvbl90ZXJtaWNhICsgaHVtZWRhZCArIHZlbG9jaWRhZF9kZWxfdmllbnRvLCBkYXRhID0gZGYpCnN1bW1hcnkocmVncmVzaW9uKQpgYGAKIyMgNS4gQ29uc3RydWlyIHVuIG1vZGVsbyBwcmVkaWN0aXZvCmBgYHtyfQpkYXRvcyA8LSBkYXRhLmZyYW1lKGhvcmEgPSAxMS41NCwgbWVzID0gMToxMiwgYcOxbyA9IDIwMTMsIHNlbnNhY2lvbl90ZXJtaWNhID0gMjMuNjYsIAogICAgICAgICAgICAgICAgICAgIGh1bWVkYWQgPSA2MS44OSwgdmVsb2NpZGFkX2RlbF92aWVudG8gPSAxMi43OTkpCnByZWRpY3QocmVncmVzaW9uLCBkYXRvcykKYGBgCiMjIENvbmNsdXNpb25lcwpFbCBtb2RlbG8gcHJlZGljdGl2byBub3MgbXVlc3RyYSBsYXMgYmljaWNsZXRhcyByZW50YWRhcyBwb3IgaG9yYSBwb3IgbWVzIGR1cmFudGUgZWwgcHLDs3hpbW8gYcOxbyAoMjAxMyksIGNvbnNpZGVyYW5kbyBsYXMgZGVtw6FzIHZhcmlhYmxlcyBjb21vIHByb21lZGlvLCBjb24gdW5hIFItY3VhZHJhZGEgYWp1c3RhZGEgZGVsIDM5JS4KCgoKCiMgUmVudGEgZGUgQ2FzYXMKCiFbXSgvVXNlcnMvZGFubmFsZWFsL0Rlc2t0b3AvaG91c2VnaWYuZ2lmKQoKIyMgMS4gSW1wb3J0YXIgbGEgYmFzZSBkZSBkYXRvcwpgYGB7cn0KIyBmaWxlLmNob29zZSgpCmJkIDwtIHJlYWQuY3N2KCIvVXNlcnMvZGFubmFsZWFsL0Rvd25sb2Fkcy9Ib3VzZVByaWNlRGF0YS5jc3YiKQpgYGAKCiMjIDIuIEVudGVuZGVyIGxhIGJhc2UgZGUgZGF0b3MKYGBge3J9CnN1bW1hcnkoYmQpCiMgaW5zdGFsbC5wYWNrYWdlcygiZHBseXIiKQpsaWJyYXJ5KGRwbHlyKQpjb3VudChiZCwgUGFya2luZywgc29ydCA9IFRSVUUpCmNvdW50KGJkLCBDaXR5X0NhdGVnb3J5LCBzb3J0ID0gVFJVRSkKYm94cGxvdChiZCRIb3VzZV9QcmljZSwgaG9yaXpvbnRhbCA9IFRSVUUpCmBgYAoKT2JzZXJ2YWNpb25lczogIAoxLiBUZW5lbW9zIE5BIGVuIGxhIGJhc2UgZGUgZGF0b3MgZW4gbGEgY2F0ZWdvcsOtYSAiQ2FycGV0Ii4gIAoyLiBUZW5lbW9zIHZhbG9yZXMgbmVnYXRpdm9zIGVuIGxhIGNhdGVnb3LDrWEgIlJhaW5mYWxsIi4gIAozLiBFbCBwcmVjaW8gbcOheGltbyBkZSBjYXNhIGVzdMOhIGNvbiBkYXRvcyBhdMOtcGljb3MuCgojIyAzLiBMaW1waWFyIGxhIGJhc2UgZGUgZGF0b3MKYGBge3J9CiMgwr9DdcOhbnRvcyBOQSB0ZW5nbyBlbiBsYSBiYXNlIGRlIGRhdG9zPwpzdW0oaXMubmEoYmQpKQoKIyDCv0N1w6FudG9zIE5BIHRlbmdvIHBvciB2YXJpYWJsZT8Kc2FwcGx5KGJkLCBmdW5jdGlvbih4KSBzdW0oaXMubmEoeCkpKQoKIyBFbGltaW5hciBOQQpiZCA8LSBuYS5vbWl0KGJkKQoKIyDCv0N1w6FudG9zIE5BIHRlbmdvIHBvciB2YXJpYWJsZT8Kc2FwcGx5KGJkLCBmdW5jdGlvbih4KSBzdW0oaXMubmEoeCkpKQoKIyBFbGltaW5hciBlbCByZWdpc3RybyBkZSBwcmVjaW8gYWx0bwpiZCA8LSBiZFtiZCRIb3VzZV9QcmljZTwxNTAwMDAwMCxdCmJveHBsb3QoYmQkSG91c2VfUHJpY2UsIGhvcml6b250YWwgPSBUUlVFKQoKIyBFbGltaW5hciByZWdpc3RybyBuZWdhdGl2byBkZSBSYWluZmFsbApiZCA8LSBiZCA8LSBiZFtiZCRSYWluZmFsbD4wLF0KYm94cGxvdChiZCRSYWluZmFsbCwgaG9yaXpvbnRhbCA9IFRSVUUpCgpgYGAKCiMjIDQuIEdlbmVyYXIgbGEgcmVncmVzacOzbiBsaW5lYWwKYGBge3J9CnJlZ3Jlc2lvbiA8LSBsbShIb3VzZV9QcmljZSB+IERpc3RfVGF4aSArIERpc3RfTWFya2V0ICsgRGlzdF9Ib3NwaXRhbCArIENhcnBldCArIEJ1aWx0dXAgKyBQYXJraW5nCiAgICAgICAgICAgICAgICArIENpdHlfQ2F0ZWdvcnkgKyBSYWluZmFsbCwgZGF0YSA9IGJkKQpzdW1tYXJ5KHJlZ3Jlc2lvbikKYGBgCgojIyA1LiBDb25zdHJ1aXIgdW4gbW9kZWxvIHByZWRpY3Rpdm8KYGBge3J9CmRhdG9zIDwtIGRhdGEuZnJhbWUoRGlzdF9UYXhpID0gODI3OCwgRGlzdF9NYXJrZXQgPSAxNjI1MSwgRGlzdF9Ib3NwaXRhbCA9IDEzODU3LCAKICAgICAgICAgICAgICAgICAgICBDYXJwZXQgPSAxNDU1LCBCdWlsdHVwID0gMTc2NCwgUGFya2luZyA9ICJDb3ZlcmVkIiwgQ2l0eV9DYXRlZ29yeSA9ICJDQVQgQSIsIAogICAgICAgICAgICAgICAgICAgIFJhaW5mYWxsID0gMzkwKQpwcmVkaWN0KHJlZ3Jlc2lvbiwgZGF0b3MpCmBgYAoKIyMgQ29uY2x1c2lvbmVzCkVsIG1vZGVsbyBwcmVkaWN0aXZvIG5vcyBtdWVzdHJhIGVsIHByZWNpbyBkZSBsYSBjYXNhLCBjb25zaWRlcmFuZG8gbGFzIGRlbcOhcyB2YXJpYWJsZXMgY29tbyBkYXRvcyBkZSBlbnRyYWRhLCBjb24gdW4gUi1jdWFkcmFkYSBhanVzdGFkYSBkZWwgNDklLgoK