Final project

Data coding

Introduction

The dataset analysed contains information on 2,930 properties in Ames, Iowa, over the period 2006-2010, including columns on:

  • House characteristics (bedrooms, garage, fireplace, pool, porch, etc.)
  • Location (neighborhood)
  • Lot information (zoning, shape, size, etc.)
  • Condition and quality ratings
  • Sale price

The modeling goal is to see how the sale price of a house (Sale_Price) varies with the other information available, such as its characteristics and its location.
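The code in the following sections works on a data frame called ames. A minimal sketch of how it could be loaded, assuming it is the copy shipped with the modeldata package (an assumption: the text does not name the source, but the column names used later, such as Three_season_porch, Longitude and Latitude, match that version):

```r
# Assumption: `ames` is the data frame provided by the modeldata package.
library(modeldata)
data(ames, package = "modeldata")

# Quick check of the size and of the response variable
dim(ames)
summary(ames$Sale_Price)
```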

Descriptive statistics

# Count the missing values in each column
na_count <- sapply(ames, function(column) sum(is.na(column)))
na_count
       MS_SubClass          MS_Zoning       Lot_Frontage           Lot_Area 
                 0                  0                  0                  0 
            Street              Alley          Lot_Shape       Land_Contour 
                 0                  0                  0                  0 
         Utilities         Lot_Config         Land_Slope       Neighborhood 
                 0                  0                  0                  0 
       Condition_1        Condition_2          Bldg_Type        House_Style 
                 0                  0                  0                  0 
      Overall_Cond         Year_Built     Year_Remod_Add         Roof_Style 
                 0                  0                  0                  0 
         Roof_Matl       Exterior_1st       Exterior_2nd       Mas_Vnr_Type 
                 0                  0                  0                  0 
      Mas_Vnr_Area         Exter_Cond         Foundation          Bsmt_Cond 
                 0                  0                  0                  0 
     Bsmt_Exposure     BsmtFin_Type_1       BsmtFin_SF_1     BsmtFin_Type_2 
                 0                  0                  0                  0 
      BsmtFin_SF_2        Bsmt_Unf_SF      Total_Bsmt_SF            Heating 
                 0                  0                  0                  0 
        Heating_QC        Central_Air         Electrical       First_Flr_SF 
                 0                  0                  0                  0 
     Second_Flr_SF        Gr_Liv_Area     Bsmt_Full_Bath     Bsmt_Half_Bath 
                 0                  0                  0                  0 
         Full_Bath          Half_Bath      Bedroom_AbvGr      Kitchen_AbvGr 
                 0                  0                  0                  0 
     TotRms_AbvGrd         Functional         Fireplaces        Garage_Type 
                 0                  0                  0                  0 
     Garage_Finish        Garage_Cars        Garage_Area        Garage_Cond 
                 0                  0                  0                  0 
       Paved_Drive       Wood_Deck_SF      Open_Porch_SF     Enclosed_Porch 
                 0                  0                  0                  0 
Three_season_porch       Screen_Porch          Pool_Area            Pool_QC 
                 0                  0                  0                  0 
             Fence       Misc_Feature           Misc_Val            Mo_Sold 
                 0                  0                  0                  0 
         Year_Sold          Sale_Type     Sale_Condition         Sale_Price 
                 0                  0                  0                  0 
         Longitude           Latitude 
                 0                  0 
library(summarytools)
descr(ames)
Descriptive Statistics  
ames  
N: 2930  

                    Bedroom_AbvGr   Bsmt_Full_Bath   Bsmt_Half_Bath   Bsmt_Unf_SF   BsmtFin_SF_1
----------------- --------------- ---------------- ---------------- ------------- --------------
             Mean            2.85             0.43             0.06        559.07           4.18
          Std.Dev            0.83             0.52             0.25        439.54           2.23
              Min            0.00             0.00             0.00          0.00           0.00
               Q1            2.00             0.00             0.00        219.00           3.00
           Median            3.00             0.00             0.00        465.50           3.00
               Q3            3.00             1.00             0.00        802.00           7.00
              Max            8.00             3.00             2.00       2336.00           7.00
              MAD            0.00             0.00             0.00        414.39           2.97
              IQR            1.00             1.00             0.00        582.75           4.00
               CV            0.29             1.22             4.01          0.79           0.53
         Skewness            0.31             0.62             3.94          0.92           0.09
      SE.Skewness            0.05             0.05             0.05          0.05           0.05
         Kurtosis            1.88            -0.75            14.90          0.40          -1.51
          N.Valid         2930.00          2930.00          2930.00       2930.00        2930.00
        Pct.Valid          100.00           100.00           100.00        100.00         100.00

Table: Table continues below

 

                    BsmtFin_SF_2   Enclosed_Porch   Fireplaces   First_Flr_SF   Full_Bath
----------------- -------------- ---------------- ------------ -------------- -----------
             Mean          49.71            23.01         0.60        1159.56        1.57
          Std.Dev         169.14            64.14         0.65         391.89        0.55
              Min           0.00             0.00         0.00         334.00        0.00
               Q1           0.00             0.00         0.00         876.00        1.00
           Median           0.00             0.00         1.00        1084.00        2.00
               Q3           0.00             0.00         1.00        1384.00        2.00
              Max        1526.00          1012.00         4.00        5095.00        4.00
              MAD           0.00             0.00         1.48         349.89        0.00
              IQR           0.00             0.00         1.00         507.75        1.00
               CV           3.40             2.79         1.08           0.34        0.35
         Skewness           4.14             4.01         0.74           1.47        0.17
      SE.Skewness           0.05             0.05         0.05           0.05        0.05
         Kurtosis          18.74            28.42         0.10           6.95       -0.54
          N.Valid        2930.00          2930.00      2930.00        2930.00     2930.00
        Pct.Valid         100.00           100.00       100.00         100.00      100.00

Table: Table continues below

 

                    Garage_Area   Garage_Cars   Gr_Liv_Area   Half_Bath   Kitchen_AbvGr   Latitude
----------------- ------------- ------------- ------------- ----------- --------------- ----------
             Mean        472.66          1.77       1499.69        0.38            1.04      42.03
          Std.Dev        215.19          0.76        505.51        0.50            0.21       0.02
              Min          0.00          0.00        334.00        0.00            0.00      41.99
               Q1        320.00          1.00       1126.00        0.00            1.00      42.02
           Median        480.00          2.00       1442.00        0.00            1.00      42.03
               Q3        576.00          2.00       1743.00        1.00            1.00      42.05
              Max       1488.00          5.00       5642.00        2.00            3.00      42.06
              MAD        182.36          0.00        461.09        0.00            0.00       0.02
              IQR        256.00          1.00        616.75        1.00            0.00       0.03
               CV          0.46          0.43          0.34        1.32            0.20       0.00
         Skewness          0.24         -0.22          1.27        0.70            4.31      -0.49
      SE.Skewness          0.05          0.05          0.05        0.05            0.05       0.05
         Kurtosis          0.94          0.24          4.12       -1.03           19.82      -0.18
          N.Valid       2930.00       2930.00       2930.00     2930.00         2930.00    2930.00
        Pct.Valid        100.00        100.00        100.00      100.00          100.00     100.00

Table: Table continues below

 

                    Longitude    Lot_Area   Lot_Frontage   Mas_Vnr_Area   Misc_Val   Mo_Sold
----------------- ----------- ----------- -------------- -------------- ---------- ---------
             Mean      -93.64    10147.92          57.65         101.10      50.64      6.22
          Std.Dev        0.03     7880.02          33.50         178.63     566.34      2.71
              Min      -93.69     1300.00           0.00           0.00       0.00      1.00
               Q1      -93.66     7440.00          43.00           0.00       0.00      4.00
           Median      -93.64     9436.50          63.00           0.00       0.00      6.00
               Q3      -93.62    11556.00          78.00         163.00       0.00      8.00
              Max      -93.58   215245.00         313.00        1600.00   17000.00     12.00
              MAD        0.03     3024.50          25.20           0.00       0.00      2.97
              IQR        0.04     4115.00          35.00         162.75       0.00      4.00
               CV        0.00        0.78           0.58           1.77      11.18      0.44
         Skewness       -0.31       12.81           0.03           2.62      21.98      0.19
      SE.Skewness        0.05        0.05           0.05           0.05       0.05      0.05
         Kurtosis       -0.94      264.39           2.15           9.34     564.85     -0.46
          N.Valid     2930.00     2930.00        2930.00        2930.00    2930.00   2930.00
        Pct.Valid      100.00      100.00         100.00         100.00     100.00    100.00

Table: Table continues below

 

                    Open_Porch_SF   Pool_Area   Sale_Price   Screen_Porch   Second_Flr_SF
----------------- --------------- ----------- ------------ -------------- ---------------
             Mean           47.53        2.24    180796.06          16.00          335.46
          Std.Dev           67.48       35.60     79886.69          56.09          428.40
              Min            0.00        0.00     12789.00           0.00            0.00
               Q1            0.00        0.00    129500.00           0.00            0.00
           Median           27.00        0.00    160000.00           0.00            0.00
               Q3           70.00        0.00    213500.00           0.00          704.00
              Max          742.00      800.00    755000.00         576.00         2065.00
              MAD           40.03        0.00     54856.20           0.00            0.00
              IQR           70.00        0.00     84000.00           0.00          703.75
               CV            1.42       15.87         0.44           3.51            1.28
         Skewness            2.53       16.92         1.74           3.95            0.87
      SE.Skewness            0.05        0.05         0.05           0.05            0.05
         Kurtosis           10.92      299.06         5.10          17.81           -0.42
          N.Valid         2930.00     2930.00      2930.00        2930.00         2930.00
        Pct.Valid          100.00      100.00       100.00         100.00          100.00

Table: Table continues below

 

                    Three_season_porch   Total_Bsmt_SF   TotRms_AbvGrd   Wood_Deck_SF   Year_Built
----------------- -------------------- --------------- --------------- -------------- ------------
             Mean                 2.59         1051.26            6.44          93.75      1971.36
          Std.Dev                25.14          440.97            1.57         126.36        30.25
              Min                 0.00            0.00            2.00           0.00      1872.00
               Q1                 0.00          793.00            5.00           0.00      1954.00
           Median                 0.00          990.00            6.00           0.00      1973.00
               Q3                 0.00         1302.00            7.00         168.00      2001.00
              Max               508.00         6110.00           15.00        1424.00      2010.00
              MAD                 0.00          349.89            1.48           0.00        37.06
              IQR                 0.00          508.50            2.00         168.00        47.00
               CV                 9.70            0.42            0.24           1.35         0.02
         Skewness                11.39            1.15            0.75           1.84        -0.60
      SE.Skewness                 0.05            0.05            0.05           0.05         0.05
         Kurtosis               149.63            9.08            1.15           6.73        -0.50
          N.Valid              2930.00         2930.00         2930.00        2930.00      2930.00
        Pct.Valid               100.00          100.00          100.00         100.00       100.00

Table: Table continues below

 

                    Year_Remod_Add   Year_Sold
----------------- ---------------- -----------
             Mean          1984.27     2007.79
          Std.Dev            20.86        1.32
              Min          1950.00     2006.00
               Q1          1965.00     2007.00
           Median          1993.00     2008.00
               Q3          2004.00     2009.00
              Max          2010.00     2010.00
              MAD            20.76        1.48
              IQR            39.00        2.00
               CV             0.01        0.00
         Skewness            -0.45        0.13
      SE.Skewness             0.05        0.05
         Kurtosis            -1.34       -1.16
          N.Valid          2930.00     2930.00
        Pct.Valid           100.00      100.00

The descriptive statistics show that lot area (Lot_Area) ranges from 1,300 to about 215,000 square feet, that the houses were built between 1872 and 2010, and that above-ground living area (Gr_Liv_Area) ranges from 334 to 5,642 square feet, with a mean of about 1,500 square feet. On average, a house has about one and a half full bathrooms (mean Full_Bath 1.57), one kitchen (mean Kitchen_AbvGr 1.04) and about three bedrooms (mean Bedroom_AbvGr 2.85).
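Two rows of the descr() table are less familiar than the others: MAD and CV. The sketch below shows how they can be computed in base R, assuming descr() follows the usual definitions (the table is consistent with this: for Fireplaces, MAD is 1.48 and CV is 0.65 / 0.60 = 1.08). The vector x is made up for illustration.

```r
# Made-up counts, for illustration only (not the Ames data)
x <- c(0, 0, 1, 1, 1, 2, 4)

mad_x <- mad(x)            # median absolute deviation, scaled by 1.4826 (R default)
cv_x  <- sd(x) / mean(x)   # coefficient of variation: standard deviation / mean

c(Mean = mean(x), Std.Dev = sd(x), MAD = mad_x, CV = cv_x)
```

A CV well above 1, as for Pool_Area or Misc_Val in the table, signals a variable that is zero for most houses and large for a few.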

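Distribution of Sale_Price

library(ggplot2)
library(dplyr)

# Histogram of the raw sale price (right-skewed: skewness 1.74 in the table above)
ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50, col = "white", fill = "lightblue")

# Same histogram with the x-axis on a log10 scale
ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50, col = "white", fill = "lightblue") +
  scale_x_log10()

# From here on, Sale_Price holds the log10 of the sale price
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))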
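The effect of the log10 transformation on skewness can be checked numerically. The sketch below uses a hand-written skewness function and synthetic right-skewed values (both are illustrative assumptions, not part of the original analysis), and also shows how a value on the log10 scale is converted back to dollars.

```r
# Sample skewness (third standardized moment)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Synthetic right-skewed "prices", for illustration only (not the Ames data)
set.seed(2024)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

skew_raw <- skewness(price)          # clearly positive: long right tail
skew_log <- skewness(log10(price))   # much closer to zero after the transform
c(raw = skew_raw, log10 = skew_log)

# A value on the log10 scale converts back to the original scale with 10^x
10^5.2   # about 158,489
```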

Sale_Price by neighborhood

Sale_Price by number of bedrooms

Sale_Price by condition of the house
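Comparisons of this kind can be drawn as one boxplot per group. A minimal sketch with ggplot2, using a small made-up data frame in place of ames (the values are illustrative only):

```r
library(ggplot2)

# Tiny made-up data frame, for illustration only (not the Ames data)
toy <- data.frame(
  Neighborhood = rep(c("A", "B", "C"), each = 4),
  Sale_Price   = c(5.0, 5.1, 5.2, 5.1, 5.3, 5.4, 5.2, 5.3, 4.9, 5.0, 5.1, 4.8)
)

# One boxplot of Sale_Price per neighborhood, flipped so the labels stay readable
p <- ggplot(toy, aes(x = Neighborhood, y = Sale_Price)) +
  geom_boxplot() +
  coord_flip()
print(p)
```

With the real data, toy would be replaced by ames; for a numeric grouping column such as Bedroom_AbvGr, wrapping it in factor() gives one box per value.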

Correlation between variables

Correlation matrix in long format, showing only the pairs of variables whose correlation coefficient exceeds 0.5 in absolute value, sorted by decreasing absolute value.

               Var1           Var2       Freq
2250 BsmtFin_Type_1   BsmtFin_SF_1  0.9991444
4050    Garage_Cars    Garage_Area  0.8898660
1650   Exterior_1st   Exterior_2nd  0.8654165
3594    Gr_Liv_Area  TotRms_AbvGrd  0.8077721
2921  Total_Bsmt_SF   First_Flr_SF  0.8004287
1037    MS_SubClass      Bldg_Type  0.7188418
2976    House_Style  Second_Flr_SF  0.7175417
2400 BsmtFin_Type_2   BsmtFin_SF_2 -0.7113410
5296    Gr_Liv_Area     Sale_Price  0.6958623
5308    Garage_Cars     Sale_Price  0.6748777
3599  Bedroom_AbvGr  TotRms_AbvGrd  0.6726472
3075  Second_Flr_SF    Gr_Liv_Area  0.6552512
5309    Garage_Area     Sale_Price  0.6507663
1942     Year_Built     Foundation  0.6366324
3298    Gr_Liv_Area      Full_Bath  0.6303208
5289  Total_Bsmt_SF     Sale_Price  0.6256220
5272     Year_Built     Sale_Price  0.6154845
1350     Year_Built Year_Remod_Add  0.6120953
3371  Second_Flr_SF      Half_Bath  0.6116337
5294   First_Flr_SF     Sale_Price  0.6026285
5273 Year_Remod_Add     Sale_Price  0.5861531
3593  Second_Flr_SF  TotRms_AbvGrd  0.5852137
3346    House_Style      Half_Bath  0.5850323
5299      Full_Bath     Sale_Price  0.5773341
4725      Pool_Area        Pool_QC -0.5699490
3074   First_Flr_SF    Gr_Liv_Area  0.5621658
3792     Year_Built    Garage_Type -0.5430286
3940     Year_Built    Garage_Cars  0.5379817
3597      Full_Bath  TotRms_AbvGrd  0.5285992
3446    Gr_Liv_Area  Bedroom_AbvGr  0.5168075
5306    Garage_Type     Sale_Price -0.5047736
3445  Second_Flr_SF  Bedroom_AbvGr  0.5046506
2683 Year_Remod_Add     Heating_QC -0.5036757
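A table with this layout (columns Var1, Var2 and Freq) is what as.data.frame(as.table(...)) returns for a correlation matrix. The sketch below shows one way to build it. It assumes that categorical columns are replaced by their integer codes, which would explain why factors such as Exterior_1st appear in the table; the helper name strong_correlations and the toy data are hypothetical.

```r
# Hypothetical helper: list the pairs of columns with |correlation| > threshold
strong_correlations <- function(df, threshold = 0.5) {
  # data.matrix() converts factors to their integer codes (an assumption about
  # how categorical columns were handled in the original analysis)
  m <- cor(data.matrix(df))
  # Keep each pair once: blank out the diagonal and the lower triangle
  m[lower.tri(m, diag = TRUE)] <- NA
  pairs <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
  pairs <- pairs[!is.na(pairs$Freq) & abs(pairs$Freq) > threshold, ]
  # Sort by decreasing absolute correlation
  pairs[order(-abs(pairs$Freq)), ]
}

# Made-up data, for illustration only (not the Ames data)
set.seed(1)
toy <- data.frame(a = 1:20)
toy$b <- 2 * toy$a + rnorm(20, sd = 0.1)        # almost perfectly correlated with a
toy$c <- rnorm(20)                              # unrelated noise
toy$d <- factor(rep(c("x", "y"), each = 10))    # factor, converted to codes 1 and 2

res <- strong_correlations(toy)
res
```

Correlations computed on integer codes of unordered factors depend on the arbitrary order of the levels, so those rows should be read with caution.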

Sale_Price by year built

Sale_Price by total basement area

Sale_Price by garage type

Sale_Price by garage area

Regression

Lasso and Ridge
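Lasso and ridge regression differ only in the penalty applied to the coefficients. A minimal sketch with the glmnet package, which is an assumption here (the text does not say which package was used), on synthetic data:

```r
library(glmnet)

# Synthetic predictors and response, for illustration only (not the Ames data)
set.seed(123)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- 5 + 0.3 * x[, 1] - 0.2 * x[, 2] + rnorm(n, sd = 0.1)

# alpha = 1 gives the lasso (L1) penalty, alpha = 0 the ridge (L2) penalty.
# cv.glmnet() chooses the penalty strength lambda by k-fold cross-validation.
lasso_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
ridge_fit <- cv.glmnet(x, y, alpha = 0, nfolds = 10)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# In-sample predictions at the lambda with the lowest cross-validated error
lasso_pred <- as.numeric(predict(lasso_fit, newx = x, s = "lambda.min"))
ridge_pred <- as.numeric(predict(ridge_fit, newx = x, s = "lambda.min"))
c(lasso = rmse(y, lasso_pred), ridge = rmse(y, ridge_pred))

# The lasso can set coefficients exactly to zero; ridge only shrinks them
coef(lasso_fit, s = "lambda.min")
```

glmnet needs a numeric matrix of predictors, so with ames the factors would first have to be expanded into dummy variables, for example with model.matrix(Sale_Price ~ ., data = ames)[, -1].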

Cross-validation and model selection

Random Forest

Model                           RMSE
----------------------------- ------------
Multiple linear regression      0.07736809
Lasso regression                0.3616007
Ridge regression                0.3566065
K-fold cross-validation         0.07481052
Random Forest                   0.01971177
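Because Sale_Price was replaced by its log10 before modeling, these RMSE values are presumably on the log10 scale (an assumption: the text does not state the scale of each entry). The sketch below defines the RMSE, estimates it for a linear model by k-fold cross-validation in base R, and converts a log10 error into a multiplicative error on the price. The helper cv_rmse_lm and the toy data are hypothetical.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Hypothetical helper: k-fold cross-validated RMSE of a linear model
cv_rmse_lm <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k folds of (nearly) equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  errors <- vapply(seq_len(k), function(i) {
    # Fit on all folds except fold i, then evaluate on fold i
    fit <- lm(formula, data = data[fold != i, , drop = FALSE])
    held_out <- data[fold == i, , drop = FALSE]
    rmse(held_out[[response]], predict(fit, newdata = held_out))
  }, numeric(1))
  mean(errors)
}

# Synthetic data, for illustration only (not the Ames data)
set.seed(42)
toy <- data.frame(area = runif(200, 500, 3000))
toy$log_price <- 4.5 + 0.0003 * toy$area + rnorm(200, sd = 0.05)

cv_error <- cv_rmse_lm(log_price ~ area, toy, k = 5)
cv_error

# An RMSE of e on the log10 scale is a typical multiplicative error of 10^e
10^0.07736809   # about 1.195, i.e. predictions off by roughly 20% in price terms
```

An error computed on held-out folds is comparable across models only if every model is evaluated the same way; an RMSE computed on the training data is usually lower than a cross-validated one.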