The Train.csv file is ~100 MB, which makes loading it through pandas every time impractical; saving the dataframe in the feather format is a convenient shortcut. The save command used in the course notebook threw an error for me, but a small change fixed it.
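A typical version of this save/reload pattern (a sketch assuming the course's tmp/ layout; feather needs the pyarrow or feather-format package installed, and a missing dependency or a non-default index are common causes of the error):

import os
import pandas as pd

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')          # save once after the slow CSV load
df_raw = pd.read_feather('tmp/bulldozers-raw')   # near-instant reload in later sessions

Printing the reloaded dataframe's summary with print(df_raw.info()) gives the output below; the trailing None is info()'s return value: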
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401125 entries, 0 to 401124
Data columns (total 53 columns):
SalesID 401125 non-null int64
SalePrice 401125 non-null int64
MachineID 401125 non-null int64
ModelID 401125 non-null int64
datasource 401125 non-null int64
auctioneerID 380989 non-null float64
YearMade 401125 non-null int64
MachineHoursCurrentMeter 142765 non-null float64
UsageBand 69639 non-null object
saledate 401125 non-null datetime64[ns]
fiModelDesc 401125 non-null object
fiBaseModel 401125 non-null object
fiSecondaryDesc 263934 non-null object
fiModelSeries 56908 non-null object
fiModelDescriptor 71919 non-null object
ProductSize 190350 non-null object
fiProductClassDesc 401125 non-null object
state 401125 non-null object
ProductGroup 401125 non-null object
ProductGroupDesc 401125 non-null object
Drive_System 104361 non-null object
Enclosure 400800 non-null object
Forks 192077 non-null object
Pad_Type 79134 non-null object
Ride_Control 148606 non-null object
Stick 79134 non-null object
Transmission 183230 non-null object
Turbocharged 79134 non-null object
Blade_Extension 25219 non-null object
Blade_Width 25219 non-null object
Enclosure_Type 25219 non-null object
Engine_Horsepower 25219 non-null object
Hydraulics 320570 non-null object
Pushblock 25219 non-null object
Ripper 104137 non-null object
Scarifier 25230 non-null object
Tip_Control 25219 non-null object
Tire_Size 94718 non-null object
Coupler 213952 non-null object
Coupler_System 43458 non-null object
Grouser_Tracks 43362 non-null object
Hydraulics_Flow 43362 non-null object
Track_Type 99153 non-null object
Undercarriage_Pad_Width 99872 non-null object
Stick_Length 99218 non-null object
Thumb 99288 non-null object
Pattern_Changer 99218 non-null object
Grouser_Type 99153 non-null object
Backhoe_Mounting 78672 non-null object
Blade_Type 79833 non-null object
Travel_Controls 79834 non-null object
Differential_Type 69411 non-null object
Steering_Controls 69369 non-null object
dtypes: datetime64[ns](1), float64(2), int64(6), object(44)
memory usage: 162.2+ MB
None
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
dtype='object')
Below, the last rows of the dataframe are transposed so that all 53 columns are visible:
401122 ... 401124
SalesID 6333338 ... 6333342
SalePrice 11500 ... 7750
MachineID 1887659 ... 1926965
ModelID 21439 ... 21435
datasource 149 ... 149
auctioneerID 1 ... 2
YearMade 2005 ... 2005
MachineHoursCurrentMeter NaN ... NaN
UsageBand None ... None
saledate 2011-11-02 00:00:00 ... 2011-10-25 00:00:00
fiModelDesc 35NX2 ... 30NX
fiBaseModel 35 ... 30
fiSecondaryDesc NX ... NX
fiModelSeries 2 ... None
fiModelDescriptor None ... None
ProductSize Mini ... Mini
fiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... ... Hydraulic Excavator, Track - 2.0 to 3.0 Metric...
state Maryland ... Florida
ProductGroup TEX ... TEX
ProductGroupDesc Track Excavators ... Track Excavators
Drive_System None ... None
Enclosure EROPS ... EROPS
Forks None ... None
Pad_Type None ... None
Ride_Control None ... None
Stick None ... None
Transmission None ... None
Turbocharged None ... None
Blade_Extension None ... None
Blade_Width None ... None
Enclosure_Type None ... None
Engine_Horsepower None ... None
Hydraulics Auxiliary ... Standard
Pushblock None ... None
Ripper None ... None
Scarifier None ... None
Tip_Control None ... None
Tire_Size None ... None
Coupler None or Unspecified ... None or Unspecified
Coupler_System None ... None
Grouser_Tracks None ... None
Hydraulics_Flow None ... None
Track_Type Steel ... Steel
Undercarriage_Pad_Width None or Unspecified ... None or Unspecified
Stick_Length None or Unspecified ... None or Unspecified
Thumb None or Unspecified ... None or Unspecified
Pattern_Changer None or Unspecified ... None or Unspecified
Grouser_Type Double ... Double
Backhoe_Mounting None ... None
Blade_Type None ... None
Travel_Controls None ... None
Differential_Type None ... None
Steering_Controls None ... None
[53 rows x 3 columns]
Note: scikit-learn's random forest cannot accept data with missing values, so the fraction of missing values in each column is worth checking.
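The per-column missing fractions below (sorted by column name) can be computed with a one-liner, presumably something like:

df_raw.isnull().sum().sort_index() / len(df_raw)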
Backhoe_Mounting 0.803872
Blade_Extension 0.937129
Blade_Type 0.800977
Blade_Width 0.937129
Coupler 0.466620
Coupler_System 0.891660
Differential_Type 0.826959
Drive_System 0.739829
Enclosure 0.000810
Enclosure_Type 0.937129
Engine_Horsepower 0.937129
Forks 0.521154
Grouser_Tracks 0.891899
Grouser_Type 0.752813
Hydraulics 0.200823
Hydraulics_Flow 0.891899
MachineHoursCurrentMeter 0.644089
MachineID 0.000000
ModelID 0.000000
Pad_Type 0.802720
Pattern_Changer 0.752651
ProductGroup 0.000000
ProductGroupDesc 0.000000
ProductSize 0.525460
Pushblock 0.937129
Ride_Control 0.629527
Ripper 0.740388
SalePrice 0.000000
SalesID 0.000000
Scarifier 0.937102
Steering_Controls 0.827064
Stick 0.802720
Stick_Length 0.752651
Thumb 0.752476
Tip_Control 0.937129
Tire_Size 0.763869
Track_Type 0.752813
Transmission 0.543210
Travel_Controls 0.800975
Turbocharged 0.802720
Undercarriage_Pad_Width 0.751020
UsageBand 0.826391
YearMade 0.000000
auctioneerID 0.050199
datasource 0.000000
fiBaseModel 0.000000
fiModelDesc 0.000000
fiModelDescriptor 0.820707
fiModelSeries 0.858129
fiProductClassDesc 0.000000
fiSecondaryDesc 0.342016
saledate 0.000000
state 0.000000
dtype: float64
As the evaluation metric is root mean squared logarithmic error (RMSLE), the target variable needs to be log-transformed; ordinary RMSE on the transformed target is then equivalent to RMSLE on the original prices.
Before:
0 66000
1 57000
2 10000
3 38500
4 11000
Name: SalePrice, dtype: int64
After:
0 11.097410
1 10.950807
2 9.210340
3 10.558414
4 9.305651
Name: SalePrice, dtype: float64
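The transform is a one-liner with numpy's natural log, which reproduces the values above (log(66000) ≈ 11.0974):

import numpy as np

df_raw.SalePrice = np.log(df_raw.SalePrice)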
train_cats function

Converts string columns to pandas categoricals (it relies on is_string_dtype from pandas.api.types):
def train_cats(df):
"""Change any columns of strings in a panda's dataframe to a column of
categorical values. This applies the changes inplace.
Parameters:
-----------
df: A pandas dataframe. Any columns of strings will be changed to
categorical values.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
note the type of col2 is string
>>> train_cats(df)
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
now the type of col2 is category
"""
for n,c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()

After train_cats (and with SalePrice already log-transformed), the same transposed tail now shows NaN instead of None in the string columns:

                                     401122  ...               401124
SalesID 6333338 ... 6333342
SalePrice 9.3501 ... 8.95545
MachineID 1887659 ... 1926965
ModelID 21439 ... 21435
datasource 149 ... 149
auctioneerID 1 ... 2
YearMade 2005 ... 2005
MachineHoursCurrentMeter NaN ... NaN
UsageBand NaN ... NaN
saledate 2011-11-02 00:00:00 ... 2011-10-25 00:00:00
fiModelDesc 35NX2 ... 30NX
fiBaseModel 35 ... 30
fiSecondaryDesc NX ... NX
fiModelSeries 2 ... NaN
fiModelDescriptor NaN ... NaN
ProductSize Mini ... Mini
fiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... ... Hydraulic Excavator, Track - 2.0 to 3.0 Metric...
state Maryland ... Florida
ProductGroup TEX ... TEX
ProductGroupDesc Track Excavators ... Track Excavators
Drive_System NaN ... NaN
Enclosure EROPS ... EROPS
Forks NaN ... NaN
Pad_Type NaN ... NaN
Ride_Control NaN ... NaN
Stick NaN ... NaN
Transmission NaN ... NaN
Turbocharged NaN ... NaN
Blade_Extension NaN ... NaN
Blade_Width NaN ... NaN
Enclosure_Type NaN ... NaN
Engine_Horsepower NaN ... NaN
Hydraulics Auxiliary ... Standard
Pushblock NaN ... NaN
Ripper NaN ... NaN
Scarifier NaN ... NaN
Tip_Control NaN ... NaN
Tire_Size NaN ... NaN
Coupler None or Unspecified ... None or Unspecified
Coupler_System NaN ... NaN
Grouser_Tracks NaN ... NaN
Hydraulics_Flow NaN ... NaN
Track_Type Steel ... Steel
Undercarriage_Pad_Width None or Unspecified ... None or Unspecified
Stick_Length None or Unspecified ... None or Unspecified
Thumb None or Unspecified ... None or Unspecified
Pattern_Changer None or Unspecified ... None or Unspecified
Grouser_Type Double ... Double
Backhoe_Mounting NaN ... NaN
Blade_Type NaN ... NaN
Travel_Controls NaN ... NaN
Differential_Type NaN ... NaN
Steering_Controls NaN ... NaN
[53 rows x 3 columns]
For example, the UsageBand variable has three levels:
Index(['High', 'Low', 'Medium'], dtype='object')
To set the order in which the values will be treated:
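Following the course notebook, the ordering can presumably be set with:

df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)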
Feature engineering can also extract more information from existing fields; for example, parsing the saledate field yields many date components.
add_datepart function

The add_datepart function (it assumes re, numpy as np, and pandas as pd are imported):

def add_datepart(df, fldname, drop=True, time=False):
"""add_datepart converts a column of df from a datetime64 to many columns containing
the information from the date. This applies changes inplace.
Parameters:
-----------
df: A pandas data frame. df gain several new columns.
fldname: A string that is the name of the date column you wish to expand.
If it is not a datetime64 series, it will be converted to one with pd.to_datetime.
drop: If true then the original date column will be removed.
time: If true time features: Hour, Minute, Second will be added.
Examples:
---------
>>> df = pd.DataFrame({ 'A' : pd.to_datetime(['3/11/2000', '3/12/2000', '3/13/2000'], infer_datetime_format=False) })
>>> df
A
0 2000-03-11
1 2000-03-12
2 2000-03-13
>>> add_datepart(df, 'A')
>>> df
AYear AMonth AWeek ADay ADayofweek ADayofyear AIs_month_end AIs_month_start AIs_quarter_end AIs_quarter_start AIs_year_end AIs_year_start AElapsed
0 2000 3 10 11 5 71 False False False False False False 952732800
1 2000 3 10 12 6 72 False False False False False False 952819200
2 2000 3 11 13 0 73 False False False False False False 952905600
"""
fld = df[fldname]
fld_dtype = fld.dtype
if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
fld_dtype = np.datetime64
if not np.issubdtype(fld_dtype, np.datetime64):
df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
targ_pre = re.sub('[Dd]ate$', '', fldname)
attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
if time: attr = attr + ['Hour', 'Minute', 'Second']
for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)

Notice that after add_datepart(df_raw, 'saledate') runs, the saledate column has disappeared, replaced by several new columns with the sale prefix:
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state',
'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure',
'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type',
'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier',
'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
'saleElapsed'],
dtype='object')
proc_df

At this point, a few more things need to be done to make the data ready for the random forest algorithm: the categorical columns must be converted to their numeric codes, missing values in continuous columns must be filled in, and the response variable must be split off. The proc_df function does all of this at once and produces three objects:

- df: the now entirely numeric dataframe
- y: the response variable
- nas: a dictionary mapping each continuous column that had missing values to the median used to fill it

proc_df is built on two helpers, fix_missing and numericalize.

fix_missing

def fix_missing(df, col, name, na_dict):
""" Fill missing data in a column of df with the median, and add a {name}_na column
which specifies if the data was missing.
Parameters:
-----------
df: The data frame that will be changed.
col: The column of data to fix by filling in missing data.
name: The name of the new filled column in df.
    na_dict: A dictionary of column names and the values to fill them with. If
        name is not a key of na_dict, the median will fill any missing data. Also,
        if name is not a key of na_dict and there is no missing data in col,
        no {name}_na column is created.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, np.NaN, 3], 'col2' : [5, 2, 2]})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> fix_missing(df, df['col1'], 'col1', {})
>>> df
col1 col2 col1_na
0 1 5 False
1 2 2 True
2 3 2 False
>>> df = pd.DataFrame({'col1' : [1, np.NaN, 3], 'col2' : [5, 2, 2]})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> fix_missing(df, df['col2'], 'col2', {})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> df = pd.DataFrame({'col1' : [1, np.NaN, 3], 'col2' : [5, 2, 2]})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> fix_missing(df, df['col1'], 'col1', {'col1' : 500})
>>> df
col1 col2 col1_na
0 1 5 False
1 500 2 True
2 3 2 False
"""
if is_numeric_dtype(col):
if pd.isnull(col).sum() or (name in na_dict):
df[name+'_na'] = pd.isnull(col)
filler = na_dict[name] if name in na_dict else col.median()
df[name] = col.fillna(filler)
na_dict[name] = filler
    return na_dict

numericalize

def numericalize(df, col, name, max_n_cat):
    """ Changes the column col from a categorical type to its integer codes.
Parameters:
-----------
df: A pandas dataframe. df[name] will be filled with the integer codes from
col.
col: The column you wish to change into the categories.
name: The column name you wish to insert into df. This column will hold the
integer codes.
    max_n_cat: If col has more categories than max_n_cat, it will not be changed
        to its integer codes. If max_n_cat is None, then col will always be
        converted.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
note the type of col2 is string
>>> train_cats(df)
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
now the type of col2 is category { a : 1, b : 2}
>>> numericalize(df, df['col2'], 'col3', None)
col1 col2 col3
0 1 a 1
1 2 b 2
2 3 a 1
"""
if not is_numeric_dtype(col) and ( max_n_cat is None or len(col.cat.categories)>max_n_cat):
        df[name] = col.cat.codes+1

proc_df

def proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None,
preproc_fn=None, max_n_cat=None, subset=None, mapper=None):
""" proc_df takes a data frame df and splits off the response variable, and
changes the df into an entirely numeric dataframe. For each column of df
which is not in skip_flds nor in ignore_flds, na values are replaced by the
median value of the column.
Parameters:
-----------
df: The data frame you wish to process.
y_fld: The name of the response variable
    skip_flds: A list of fields that are dropped from df.
    ignore_flds: A list of fields that are ignored during processing.
    do_scale: If True, standardizes each column in df.
na_dict: a dictionary of na columns to add. Na columns are also added if there
are any missing values.
preproc_fn: A function that gets applied to df.
max_n_cat: The maximum number of categories to break into dummy values, instead
of integer codes.
subset: Takes a random subset of size subset from df.
mapper: If do_scale is set as True, the mapper variable
calculates the values used for scaling of variables during training time (mean and standard deviation).
Returns:
--------
[x, y, nas, mapper(optional)]:
x: x is the transformed version of df. x will not have the response variable
and is entirely numeric.
y: y is the response variable
nas: returns a dictionary of which nas it created, and the associated median.
    mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continuous
        variables, which is then used for scaling during test time.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
note the type of col2 is string
>>> train_cats(df)
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
now the type of col2 is category { a : 1, b : 2}
>>> x, y, nas = proc_df(df, 'col1')
>>> x
col2
0 1
1 2
2 1
    >>> data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    ...                      'children': [4., 6, 3, 3, 2, 3, 5, 4],
    ...                      'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})
    >>> mapper = DataFrameMapper([('pet', LabelBinarizer()),
    ...                           (['children'], StandardScaler())])
    >>> np.round(mapper.fit_transform(data.copy()), 2)
1.0 0.0 0.0 0.21
0.0 1.0 0.0 1.88
0.0 1.0 0.0 -0.63
0.0 0.0 1.0 -0.63
1.0 0.0 0.0 -1.46
0.0 1.0 0.0 -0.63
1.0 0.0 0.0 1.04
0.0 0.0 1.0 0.21
"""
if not ignore_flds: ignore_flds=[]
if not skip_flds: skip_flds=[]
if subset: df = get_sample(df,subset)
else: df = df.copy()
ignored_flds = df.loc[:, ignore_flds]
df.drop(ignore_flds, axis=1, inplace=True)
if preproc_fn: preproc_fn(df)
if y_fld is None: y = None
else:
if not is_numeric_dtype(df[y_fld]): df[y_fld] = df[y_fld].cat.codes
y = df[y_fld].values
skip_flds += [y_fld]
df.drop(skip_flds, axis=1, inplace=True)
if na_dict is None: na_dict = {}
else: na_dict = na_dict.copy()
na_dict_initial = na_dict.copy()
for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
if len(na_dict_initial.keys()) > 0:
df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
if do_scale: mapper = scale_vars(df, mapper)
for n,c in df.items(): numericalize(df, c, n, max_n_cat)
df = pd.get_dummies(df, dummy_na=True)
df = pd.concat([ignored_flds, df], axis=1)
res = [df, y, na_dict]
if do_scale: res = res + [mapper]
    return res

After proc_df runs, the dataframe is entirely numeric; note the new auctioneerID_na and MachineHoursCurrentMeter_na indicator columns at the end of the transposed tail:

                          401122   401123   401124
SalesID 6333338 6333341 6333342
MachineID 1887659 1903570 1926965
ModelID 21439 21435 21435
datasource 149 149 149
auctioneerID 1 2 2
YearMade 2005 2005 2005
MachineHoursCurrentMeter 0 0 0
UsageBand 0 0 0
fiModelDesc 657 483 483
fiBaseModel 207 159 159
fiSecondaryDesc 106 106 106
fiModelSeries 63 0 0
fiModelDescriptor 0 0 0
ProductSize 5 5 5
fiProductClassDesc 17 13 13
state 20 9 9
ProductGroup 4 4 4
ProductGroupDesc 4 4 4
Drive_System 0 0 0
Enclosure 1 1 1
Forks 0 0 0
Pad_Type 0 0 0
Ride_Control 0 0 0
Stick 0 0 0
Transmission 0 0 0
Turbocharged 0 0 0
Blade_Extension 0 0 0
Blade_Width 0 0 0
Enclosure_Type 0 0 0
Engine_Horsepower 0 0 0
... ... ... ...
Coupler 3 3 3
Coupler_System 0 0 0
Grouser_Tracks 0 0 0
Hydraulics_Flow 0 0 0
Track_Type 2 2 2
Undercarriage_Pad_Width 19 19 19
Stick_Length 29 29 29
Thumb 3 3 3
Pattern_Changer 2 2 2
Grouser_Type 1 1 1
Backhoe_Mounting 0 0 0
Blade_Type 0 0 0
Travel_Controls 0 0 0
Differential_Type 0 0 0
Steering_Controls 0 0 0
saleYear 2011 2011 2011
saleMonth 11 10 10
saleWeek 44 43 43
saleDay 2 25 25
saleDayofweek 2 1 1
saleDayofyear 306 298 298
saleIs_month_end False False False
saleIs_month_start False False False
saleIs_quarter_end False False False
saleIs_quarter_start False False False
saleIs_year_end False False False
saleIs_year_start False False False
saleElapsed 1320192000 1319500800 1319500800
auctioneerID_na False False False
MachineHoursCurrentMeter_na True True True
[66 rows x 3 columns]
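The df, y, and nas objects used below are the outputs of proc_df; matching the docstring example, the invocation was presumably:

df, y, nas = proc_df(df_raw, 'SalePrice')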
## Define function to split data set
def split_vals(a,n):
return a[:n].copy(), a[n:].copy()
## Split dataset
n_valid = 12000 # Same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
print(X_train.shape, y_train.shape, X_valid.shape)

(389125, 66) (389125,) (12000, 66)
## Define function to calculate RMSE
def rmse(x,y):
return math.sqrt(((x-y)**2).mean())
## Define function to print out RMSE and R2 values for training and validation sets
def print_score(m):
res = [rmse(m.predict(X_train), y_train),
rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train),
m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

[0.09070972618403471, 0.2535174519364779, 0.9828034020373368, 0.8852206670848883]
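print_score reports [training RMSE, validation RMSE, training R², validation R², and the OOB score when the model has one]. The four-value result above presumably comes from a forest with default settings, along these lines (a sketch; the original cell isn't reproduced):

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)   # default n_estimators, which is 10 in this scikit-learn version
m.fit(X_train, y_train)
print_score(m)

To see why an ensemble helps, the next model is a single tree with bootstrapping turned off: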
m = RandomForestRegressor(n_estimators=1,
bootstrap=False,
n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

[5.329070518200751e-17, 0.47670897220102537, 1.0, 0.594160208846394]

A single tree grown without bootstrapping memorizes the training set (training RMSE ≈ 0, training R² = 1.0) but generalizes poorly (validation R² ≈ 0.59). The next score line presumably comes from bagging several such trees, which restores generalization:
[0.11199686542985562, 0.3605562719427123, 0.9724270771971114, 0.7678364247358116]
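The next three outputs are the ten trees' individual predictions for the first validation row, their mean next to the true value, and the shape of the stacked prediction array; presumably produced along these lines:

import numpy as np

preds = np.stack([t.predict(X_valid) for t in m.estimators_])   # one row of predictions per tree
print(preds[:, 0])                        # each tree's prediction for the first validation row
print(np.mean(preds[:, 0]), y_valid[0])   # ensemble mean vs. the actual value
print(preds.shape)                        # (number of trees, number of validation rows)

The preds array is also what the R² plot below is computed from.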
[9.30565055 9.04782144 8.9226583 9.10497986 8.9226583 9.15904708
9.30565055 9.35010231 9.54681261 8.98719682]
9.165257782260593 9.104979856318357
(10, 12000)
import matplotlib.pyplot as plt
from sklearn import metrics   # needed for r2_score; this import is assumed but not shown earlier

plt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(10)])
plt.show()

Adding more trees does not increase the model performance significantly:
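The three score lines below presumably come from stepping up n_estimators; 20, 40, and 80 trees would match the course's progression (assumed values):

from sklearn.ensemble import RandomForestRegressor

for n in (20, 40, 80):   # assumed progression of tree counts
    m = RandomForestRegressor(n_estimators=n, n_jobs=-1)
    m.fit(X_train, y_train)
    print_score(m)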
[0.10208552551172624, 0.349211430210069, 0.9770913547606447, 0.7822165493173546]
[0.09726308467086937, 0.35386511652111163, 0.979204606281339, 0.7763733905004107]
[0.09567778194716048, 0.3507615331500072, 0.979876974999839, 0.780278834860096]
Recall the print_score function, where the last item printed is the out-of-bag score (when the model has one):
def print_score(m):
res = [rmse(m.predict(X_train), y_train),
rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train),
m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

To get the OOB score, add one more parameter to the model constructor:
m = RandomForestRegressor(n_estimators=40,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.0973458316202628, 0.3492358738061539, 0.9791692077050513, 0.782186060068996, 0.8521597322650637]
Here the training R² (0.979) is far above the validation R² (0.782), so the model is over-fitting; and since the OOB score (0.852) is also well above the validation score, the temporal difference between the training and validation sets clearly has an additional impact.
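fastai's set_rf_samples speeds up experimentation by making each tree train on a random subsample instead of a full bootstrap (the reset_rf_samples function quoted later undoes this). The results below presumably follow a call along these lines, with the course's 20,000-row sample size assumed:

from fastai.structured import set_rf_samples   # fastai 0.7-era import path (assumed)

set_rf_samples(20000)   # assumed sample size: each tree now sees 20k random rows
m = RandomForestRegressor(n_jobs=-1, oob_score=True)   # default n_estimators, hence the FutureWarning below
m.fit(X_train, y_train)
print_score(m)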
/Users/nancy/anaconda3/envs/r-reticulate/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
[0.2418279065948944, 0.2799781522487347, 0.8777784860917807, 0.860010242879805, 0.8650363056265487]
Because each tree now trains on only a subsample, adding trees lets the ensemble see more of the data in total, so increasing the number of trees improves performance:
m = RandomForestRegressor(n_estimators=40,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.22659425812627826, 0.2615041844731935, 0.8926918698148494, 0.8778748080643042, 0.8812056869050565]
Tree growth can also be limited by setting min_samples_leaf. To see its effect, first revert to using a full bootstrap with fastai's reset_rf_samples (forest here is sklearn.ensemble.forest) and fit an unconstrained baseline:

def reset_rf_samples():
    """ Undoes the changes produced by set_rf_samples.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))

## First revert to using a full bootstrap
reset_rf_samples()
A helper to compute the maximum depth of a fitted decision tree:

def dectree_max_depth(tree):
children_left = tree.children_left
children_right = tree.children_right
def walk(node_id):
if (children_left[node_id] != children_right[node_id]):
left_max = 1 + walk(children_left[node_id])
right_max = 1 + walk(children_right[node_id])
return max(left_max, right_max)
else: # leaf
return 1
root_node_id = 0
    return walk(root_node_id)

m = RandomForestRegressor(n_estimators=40,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.07850136700663472, 0.23982842268856314, 0.9871207887971841, 0.8972813562954464, 0.9081813562258134]
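The 48 below is the maximum depth of one of the forest's trees, presumably obtained with the helper above:

t = m.estimators_[0].tree_   # underlying sklearn tree of the first estimator (illustrative choice)
dectree_max_depth(t)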
48
m = RandomForestRegressor(n_estimators=40,
min_samples_leaf=5,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.14066900750693243, 0.23378407318232006, 0.9586446526694946, 0.9023937076672716, 0.9069995123114761]
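With min_samples_leaf=5, the same depth check on the first tree reports a shallower tree: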
38
m = RandomForestRegressor(n_estimators=40,
min_samples_leaf=3,
max_features=0.5,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.11908905343482946, 0.22811149301231815, 0.9703599786073658, 0.9070729163026368, 0.9117730171652707]