The Train.csv file is ~100 MB, which makes loading it through pandas every time impractical; saving the dataframe in the feather format is a convenient shortcut. The save command used in the course notebook threw an error for me, but a small change fixed it.
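A typical version of this save/reload pattern (a sketch assuming the course's tmp/ layout; feather needs the pyarrow or feather-format package installed, and a missing dependency or a non-default index are common causes of the error):

import os
import pandas as pd

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')          # save once after the slow CSV load
df_raw = pd.read_feather('tmp/bulldozers-raw')   # near-instant reload in later sessions

Printing the reloaded dataframe's summary with print(df_raw.info()) gives the output below; the trailing None is info()'s return value: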
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401125 entries, 0 to 401124
Data columns (total 53 columns):
SalesID 401125 non-null int64
SalePrice 401125 non-null int64
MachineID 401125 non-null int64
ModelID 401125 non-null int64
datasource 401125 non-null int64
auctioneerID 380989 non-null float64
YearMade 401125 non-null int64
MachineHoursCurrentMeter 142765 non-null float64
UsageBand 69639 non-null object
saledate 401125 non-null datetime64[ns]
fiModelDesc 401125 non-null object
fiBaseModel 401125 non-null object
fiSecondaryDesc 263934 non-null object
fiModelSeries 56908 non-null object
fiModelDescriptor 71919 non-null object
ProductSize 190350 non-null object
fiProductClassDesc 401125 non-null object
state 401125 non-null object
ProductGroup 401125 non-null object
ProductGroupDesc 401125 non-null object
Drive_System 104361 non-null object
Enclosure 400800 non-null object
Forks 192077 non-null object
Pad_Type 79134 non-null object
Ride_Control 148606 non-null object
Stick 79134 non-null object
Transmission 183230 non-null object
Turbocharged 79134 non-null object
Blade_Extension 25219 non-null object
Blade_Width 25219 non-null object
Enclosure_Type 25219 non-null object
Engine_Horsepower 25219 non-null object
Hydraulics 320570 non-null object
Pushblock 25219 non-null object
Ripper 104137 non-null object
Scarifier 25230 non-null object
Tip_Control 25219 non-null object
Tire_Size 94718 non-null object
Coupler 213952 non-null object
Coupler_System 43458 non-null object
Grouser_Tracks 43362 non-null object
Hydraulics_Flow 43362 non-null object
Track_Type 99153 non-null object
Undercarriage_Pad_Width 99872 non-null object
Stick_Length 99218 non-null object
Thumb 99288 non-null object
Pattern_Changer 99218 non-null object
Grouser_Type 99153 non-null object
Backhoe_Mounting 78672 non-null object
Blade_Type 79833 non-null object
Travel_Controls 79834 non-null object
Differential_Type 69411 non-null object
Steering_Controls 69369 non-null object
dtypes: datetime64[ns](1), float64(2), int64(6), object(44)
memory usage: 162.2+ MB
None
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
dtype='object')
Below, the last rows of the dataframe are transposed so that all 53 columns are visible:
401122 ... 401124
SalesID 6333338 ... 6333342
SalePrice 11500 ... 7750
MachineID 1887659 ... 1926965
ModelID 21439 ... 21435
datasource 149 ... 149
auctioneerID 1 ... 2
YearMade 2005 ... 2005
MachineHoursCurrentMeter NaN ... NaN
UsageBand None ... None
saledate 2011-11-02 00:00:00 ... 2011-10-25 00:00:00
fiModelDesc 35NX2 ... 30NX
fiBaseModel 35 ... 30
fiSecondaryDesc NX ... NX
fiModelSeries 2 ... None
fiModelDescriptor None ... None
ProductSize Mini ... Mini
fiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... ... Hydraulic Excavator, Track - 2.0 to 3.0 Metric...
state Maryland ... Florida
ProductGroup TEX ... TEX
ProductGroupDesc Track Excavators ... Track Excavators
Drive_System None ... None
Enclosure EROPS ... EROPS
Forks None ... None
Pad_Type None ... None
Ride_Control None ... None
Stick None ... None
Transmission None ... None
Turbocharged None ... None
Blade_Extension None ... None
Blade_Width None ... None
Enclosure_Type None ... None
Engine_Horsepower None ... None
Hydraulics Auxiliary ... Standard
Pushblock None ... None
Ripper None ... None
Scarifier None ... None
Tip_Control None ... None
Tire_Size None ... None
Coupler None or Unspecified ... None or Unspecified
Coupler_System None ... None
Grouser_Tracks None ... None
Hydraulics_Flow None ... None
Track_Type Steel ... Steel
Undercarriage_Pad_Width None or Unspecified ... None or Unspecified
Stick_Length None or Unspecified ... None or Unspecified
Thumb None or Unspecified ... None or Unspecified
Pattern_Changer None or Unspecified ... None or Unspecified
Grouser_Type Double ... Double
Backhoe_Mounting None ... None
Blade_Type None ... None
Travel_Controls None ... None
Differential_Type None ... None
Steering_Controls None ... None
[53 rows x 3 columns]
Note: scikit-learn's random forest cannot accept data with missing values, so the fraction of missing values in each column is worth checking.
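The per-column missing fractions below (sorted by column name) can be computed with a one-liner, presumably something like:

df_raw.isnull().sum().sort_index() / len(df_raw)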
Backhoe_Mounting 0.803872
Blade_Extension 0.937129
Blade_Type 0.800977
Blade_Width 0.937129
Coupler 0.466620
Coupler_System 0.891660
Differential_Type 0.826959
Drive_System 0.739829
Enclosure 0.000810
Enclosure_Type 0.937129
Engine_Horsepower 0.937129
Forks 0.521154
Grouser_Tracks 0.891899
Grouser_Type 0.752813
Hydraulics 0.200823
Hydraulics_Flow 0.891899
MachineHoursCurrentMeter 0.644089
MachineID 0.000000
ModelID 0.000000
Pad_Type 0.802720
Pattern_Changer 0.752651
ProductGroup 0.000000
ProductGroupDesc 0.000000
ProductSize 0.525460
Pushblock 0.937129
Ride_Control 0.629527
Ripper 0.740388
SalePrice 0.000000
SalesID 0.000000
Scarifier 0.937102
Steering_Controls 0.827064
Stick 0.802720
Stick_Length 0.752651
Thumb 0.752476
Tip_Control 0.937129
Tire_Size 0.763869
Track_Type 0.752813
Transmission 0.543210
Travel_Controls 0.800975
Turbocharged 0.802720
Undercarriage_Pad_Width 0.751020
UsageBand 0.826391
YearMade 0.000000
auctioneerID 0.050199
datasource 0.000000
fiBaseModel 0.000000
fiModelDesc 0.000000
fiModelDescriptor 0.820707
fiModelSeries 0.858129
fiProductClassDesc 0.000000
fiSecondaryDesc 0.342016
saledate 0.000000
state 0.000000
dtype: float64
As the evaluation metric is root mean squared logarithmic error (RMSLE), the target variable needs to be log-transformed; ordinary RMSE on the transformed target is then equivalent to RMSLE on the original prices.
Before:
0 66000
1 57000
2 10000
3 38500
4 11000
Name: SalePrice, dtype: int64
After:
0 11.097410
1 10.950807
2 9.210340
3 10.558414
4 9.305651
Name: SalePrice, dtype: float64
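The transform is a one-liner with numpy's natural log, which reproduces the values above (log(66000) ≈ 11.0974):

import numpy as np

df_raw.SalePrice = np.log(df_raw.SalePrice)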
train_cats function

Converts string columns to pandas categoricals (it relies on is_string_dtype from pandas.api.types):
def train_cats(df):
"""Change any columns of strings in a panda's dataframe to a column of
categorical values. This applies the changes inplace.
Parameters:
-----------
df: A pandas dataframe. Any columns of strings will be changed to
categorical values.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
note the type of col2 is string
>>> train_cats(df)
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
now the type of col2 is category
"""
for n,c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()

After train_cats (and with SalePrice already log-transformed), the same transposed tail now shows NaN instead of None in the string columns:

                                     401122  ...               401124
SalesID 6333338 ... 6333342
SalePrice 9.3501 ... 8.95545
MachineID 1887659 ... 1926965
ModelID 21439 ... 21435
datasource 149 ... 149
auctioneerID 1 ... 2
YearMade 2005 ... 2005
MachineHoursCurrentMeter NaN ... NaN
UsageBand NaN ... NaN
saledate 2011-11-02 00:00:00 ... 2011-10-25 00:00:00
fiModelDesc 35NX2 ... 30NX
fiBaseModel 35 ... 30
fiSecondaryDesc NX ... NX
fiModelSeries 2 ... NaN
fiModelDescriptor NaN ... NaN
ProductSize Mini ... Mini
fiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... ... Hydraulic Excavator, Track - 2.0 to 3.0 Metric...
state Maryland ... Florida
ProductGroup TEX ... TEX
ProductGroupDesc Track Excavators ... Track Excavators
Drive_System NaN ... NaN
Enclosure EROPS ... EROPS
Forks NaN ... NaN
Pad_Type NaN ... NaN
Ride_Control NaN ... NaN
Stick NaN ... NaN
Transmission NaN ... NaN
Turbocharged NaN ... NaN
Blade_Extension NaN ... NaN
Blade_Width NaN ... NaN
Enclosure_Type NaN ... NaN
Engine_Horsepower NaN ... NaN
Hydraulics Auxiliary ... Standard
Pushblock NaN ... NaN
Ripper NaN ... NaN
Scarifier NaN ... NaN
Tip_Control NaN ... NaN
Tire_Size NaN ... NaN
Coupler None or Unspecified ... None or Unspecified
Coupler_System NaN ... NaN
Grouser_Tracks NaN ... NaN
Hydraulics_Flow NaN ... NaN
Track_Type Steel ... Steel
Undercarriage_Pad_Width None or Unspecified ... None or Unspecified
Stick_Length None or Unspecified ... None or Unspecified
Thumb None or Unspecified ... None or Unspecified
Pattern_Changer None or Unspecified ... None or Unspecified
Grouser_Type Double ... Double
Backhoe_Mounting NaN ... NaN
Blade_Type NaN ... NaN
Travel_Controls NaN ... NaN
Differential_Type NaN ... NaN
Steering_Controls NaN ... NaN
[53 rows x 3 columns]
For example, the UsageBand variable has three levels:
Index(['High', 'Low', 'Medium'], dtype='object')
To set the order in which the values will be treated:
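Following the course notebook, the ordering can presumably be set with:

df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)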
Feature engineering can also extract more information from existing fields; for example, parsing the saledate field yields many date components.
add_datepart function

The add_datepart function (it assumes re, numpy as np, and pandas as pd are imported):

def add_datepart(df, fldname, drop=True, time=False):
"""add_datepart converts a column of df from a datetime64 to many columns containing
the information from the date. This applies changes inplace.
Parameters:
-----------
df: A pandas data frame. df gain several new columns.
fldname: A string that is the name of the date column you wish to expand.
If it is not a datetime64 series, it will be converted to one with pd.to_datetime.
drop: If true then the original date column will be removed.
time: If true time features: Hour, Minute, Second will be added.
Examples:
---------
>>> df = pd.DataFrame({ 'A' : pd.to_datetime(['3/11/2000', '3/12/2000', '3/13/2000'], infer_datetime_format=False) })
>>> df
A
0 2000-03-11
1 2000-03-12
2 2000-03-13
>>> add_datepart(df, 'A')
>>> df
AYear AMonth AWeek ADay ADayofweek ADayofyear AIs_month_end AIs_month_start AIs_quarter_end AIs_quarter_start AIs_year_end AIs_year_start AElapsed
0 2000 3 10 11 5 71 False False False False False False 952732800
1 2000 3 10 12 6 72 False False False False False False 952819200
2 2000 3 11 13 0 73 False False False False False False 952905600
"""
fld = df[fldname]
fld_dtype = fld.dtype
if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
fld_dtype = np.datetime64
if not np.issubdtype(fld_dtype, np.datetime64):
df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
targ_pre = re.sub('[Dd]ate$', '', fldname)
attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
if time: attr = attr + ['Hour', 'Minute', 'Second']
for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop: df.drop(fldname, axis=1, inplace=True)

Notice that after add_datepart(df_raw, 'saledate') runs, the saledate column has disappeared, replaced by several new columns with the sale prefix:
Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state',
'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure',
'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type',
'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier',
'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
'saleElapsed'],
dtype='object')
proc_df

At this point, a few more things need to be done to make the data ready for the random forest algorithm: the categorical columns must be converted to their numeric codes, missing values in continuous columns must be filled in, and the response variable must be split off. The proc_df function does all of this at once and produces three objects:

- df: the now entirely numeric dataframe
- y: the response variable
- nas: a dictionary mapping each continuous column that had missing values to the median used to fill it

proc_df is built on two helpers, fix_missing and numericalize.

fix_missing

def fix_missing(df, col, name, na_dict):
""" Fill missing data in a column of df with the median, and add a {name}_na column
which specifies if the data was missing.
Parameters:
-----------
df: The data frame that will be changed.
col: The column of data to fix by filling in missing data.
name: The name of the new filled column in df.
    na_dict: A dictionary of column names and the values to fill them with. If
        name is not a key of na_dict, the median will fill any missing data. Also,
        if name is not a key of na_dict and there is no missing data in col,
        no {name}_na column is created.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, np.NaN, 3], 'col2' : [5, 2, 2]})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> fix_missing(df, df['col1'], 'col1', {})
>>> df
col1 col2 col1_na
0 1 5 False
1 2 2 True
2 3 2 False
>>> df = pd.DataFrame({'col1' : [1, np.NaN, 3], 'col2' : [5, 2, 2]})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> fix_missing(df, df['col2'], 'col2', {})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> df = pd.DataFrame({'col1' : [1, np.NaN, 3], 'col2' : [5, 2, 2]})
>>> df
col1 col2
0 1 5
1 nan 2
2 3 2
>>> fix_missing(df, df['col1'], 'col1', {'col1' : 500})
>>> df
col1 col2 col1_na
0 1 5 False
1 500 2 True
2 3 2 False
"""
if is_numeric_dtype(col):
if pd.isnull(col).sum() or (name in na_dict):
df[name+'_na'] = pd.isnull(col)
filler = na_dict[name] if name in na_dict else col.median()
df[name] = col.fillna(filler)
na_dict[name] = filler
    return na_dict

numericalize

def numericalize(df, col, name, max_n_cat):
    """ Changes the column col from a categorical type to its integer codes.
Parameters:
-----------
df: A pandas dataframe. df[name] will be filled with the integer codes from
col.
col: The column you wish to change into the categories.
name: The column name you wish to insert into df. This column will hold the
integer codes.
    max_n_cat: If col has more categories than max_n_cat, it will not be changed
        to its integer codes. If max_n_cat is None, then col will always be
        converted.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
note the type of col2 is string
>>> train_cats(df)
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
now the type of col2 is category { a : 1, b : 2}
>>> numericalize(df, df['col2'], 'col3', None)
col1 col2 col3
0 1 a 1
1 2 b 2
2 3 a 1
"""
if not is_numeric_dtype(col) and ( max_n_cat is None or len(col.cat.categories)>max_n_cat):
        df[name] = col.cat.codes+1

proc_df

def proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None,
preproc_fn=None, max_n_cat=None, subset=None, mapper=None):
""" proc_df takes a data frame df and splits off the response variable, and
changes the df into an entirely numeric dataframe. For each column of df
which is not in skip_flds nor in ignore_flds, na values are replaced by the
median value of the column.
Parameters:
-----------
df: The data frame you wish to process.
y_fld: The name of the response variable
    skip_flds: A list of fields that are dropped from df.
    ignore_flds: A list of fields that are ignored during processing.
    do_scale: If True, standardizes each column in df.
na_dict: a dictionary of na columns to add. Na columns are also added if there
are any missing values.
preproc_fn: A function that gets applied to df.
max_n_cat: The maximum number of categories to break into dummy values, instead
of integer codes.
subset: Takes a random subset of size subset from df.
mapper: If do_scale is set as True, the mapper variable
calculates the values used for scaling of variables during training time (mean and standard deviation).
Returns:
--------
[x, y, nas, mapper(optional)]:
x: x is the transformed version of df. x will not have the response variable
and is entirely numeric.
y: y is the response variable
nas: returns a dictionary of which nas it created, and the associated median.
    mapper: A DataFrameMapper which stores the mean and standard deviation of the corresponding continuous
        variables, which is then used for scaling during test time.
Examples:
---------
>>> df = pd.DataFrame({'col1' : [1, 2, 3], 'col2' : ['a', 'b', 'a']})
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
note the type of col2 is string
>>> train_cats(df)
>>> df
col1 col2
0 1 a
1 2 b
2 3 a
now the type of col2 is category { a : 1, b : 2}
>>> x, y, nas = proc_df(df, 'col1')
>>> x
col2
0 1
1 2
2 1
    >>> data = pd.DataFrame({'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    ...                      'children': [4., 6, 3, 3, 2, 3, 5, 4],
    ...                      'salary':   [90, 24, 44, 27, 32, 59, 36, 27]})
    >>> mapper = DataFrameMapper([('pet', LabelBinarizer()),
    ...                           (['children'], StandardScaler())])
    >>> np.round(mapper.fit_transform(data.copy()), 2)
1.0 0.0 0.0 0.21
0.0 1.0 0.0 1.88
0.0 1.0 0.0 -0.63
0.0 0.0 1.0 -0.63
1.0 0.0 0.0 -1.46
0.0 1.0 0.0 -0.63
1.0 0.0 0.0 1.04
0.0 0.0 1.0 0.21
"""
if not ignore_flds: ignore_flds=[]
if not skip_flds: skip_flds=[]
if subset: df = get_sample(df,subset)
else: df = df.copy()
ignored_flds = df.loc[:, ignore_flds]
df.drop(ignore_flds, axis=1, inplace=True)
if preproc_fn: preproc_fn(df)
if y_fld is None: y = None
else:
if not is_numeric_dtype(df[y_fld]): df[y_fld] = df[y_fld].cat.codes
y = df[y_fld].values
skip_flds += [y_fld]
df.drop(skip_flds, axis=1, inplace=True)
if na_dict is None: na_dict = {}
else: na_dict = na_dict.copy()
na_dict_initial = na_dict.copy()
for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
if len(na_dict_initial.keys()) > 0:
df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
if do_scale: mapper = scale_vars(df, mapper)
for n,c in df.items(): numericalize(df, c, n, max_n_cat)
df = pd.get_dummies(df, dummy_na=True)
df = pd.concat([ignored_flds, df], axis=1)
res = [df, y, na_dict]
if do_scale: res = res + [mapper]
    return res

After proc_df runs, the dataframe is entirely numeric; note the new auctioneerID_na and MachineHoursCurrentMeter_na indicator columns at the end of the transposed tail:

                          401122   401123   401124
SalesID 6333338 6333341 6333342
MachineID 1887659 1903570 1926965
ModelID 21439 21435 21435
datasource 149 149 149
auctioneerID 1 2 2
YearMade 2005 2005 2005
MachineHoursCurrentMeter 0 0 0
UsageBand 0 0 0
fiModelDesc 657 483 483
fiBaseModel 207 159 159
fiSecondaryDesc 106 106 106
fiModelSeries 63 0 0
fiModelDescriptor 0 0 0
ProductSize 5 5 5
fiProductClassDesc 17 13 13
state 20 9 9
ProductGroup 4 4 4
ProductGroupDesc 4 4 4
Drive_System 0 0 0
Enclosure 1 1 1
Forks 0 0 0
Pad_Type 0 0 0
Ride_Control 0 0 0
Stick 0 0 0
Transmission 0 0 0
Turbocharged 0 0 0
Blade_Extension 0 0 0
Blade_Width 0 0 0
Enclosure_Type 0 0 0
Engine_Horsepower 0 0 0
... ... ... ...
Coupler 3 3 3
Coupler_System 0 0 0
Grouser_Tracks 0 0 0
Hydraulics_Flow 0 0 0
Track_Type 2 2 2
Undercarriage_Pad_Width 19 19 19
Stick_Length 29 29 29
Thumb 3 3 3
Pattern_Changer 2 2 2
Grouser_Type 1 1 1
Backhoe_Mounting 0 0 0
Blade_Type 0 0 0
Travel_Controls 0 0 0
Differential_Type 0 0 0
Steering_Controls 0 0 0
saleYear 2011 2011 2011
saleMonth 11 10 10
saleWeek 44 43 43
saleDay 2 25 25
saleDayofweek 2 1 1
saleDayofyear 306 298 298
saleIs_month_end False False False
saleIs_month_start False False False
saleIs_quarter_end False False False
saleIs_quarter_start False False False
saleIs_year_end False False False
saleIs_year_start False False False
saleElapsed 1320192000 1319500800 1319500800
auctioneerID_na False False False
MachineHoursCurrentMeter_na True True True
[66 rows x 3 columns]
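The df, y, and nas objects used below are the outputs of proc_df; matching the docstring example, the invocation was presumably:

df, y, nas = proc_df(df_raw, 'SalePrice')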
## Define function to split data set
def split_vals(a,n):
return a[:n].copy(), a[n:].copy()
## Split dataset
n_valid = 12000 # Same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
print(X_train.shape, y_train.shape, X_valid.shape)

(389125, 66) (389125,) (12000, 66)
## Define function to calculate RMSE
def rmse(x,y):
return math.sqrt(((x-y)**2).mean())
## Define function to print out RMSE and R2 values for training and validation sets
def print_score(m):
res = [rmse(m.predict(X_train), y_train),
rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train),
m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

[0.09070972618403471, 0.2535174519364779, 0.9828034020373368, 0.8852206670848883]
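print_score reports [training RMSE, validation RMSE, training R², validation R², and the OOB score when the model has one]. The four-value result above presumably comes from a forest with default settings, along these lines (a sketch; the original cell isn't reproduced):

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)   # default n_estimators, which is 10 in this scikit-learn version
m.fit(X_train, y_train)
print_score(m)

To see why an ensemble helps, the next model is a single tree with bootstrapping turned off: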
m = RandomForestRegressor(n_estimators=1,
bootstrap=False,
n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

[5.329070518200751e-17, 0.47670897220102537, 1.0, 0.594160208846394]

A single tree grown without bootstrapping memorizes the training set (training RMSE ≈ 0, training R² = 1.0) but generalizes poorly (validation R² ≈ 0.59). The next score line presumably comes from bagging several such trees, which restores generalization:
[0.11199686542985562, 0.3605562719427123, 0.9724270771971114, 0.7678364247358116]
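The next three outputs are the ten trees' individual predictions for the first validation row, their mean next to the true value, and the shape of the stacked prediction array; presumably produced along these lines:

import numpy as np

preds = np.stack([t.predict(X_valid) for t in m.estimators_])   # one row of predictions per tree
print(preds[:, 0])                        # each tree's prediction for the first validation row
print(np.mean(preds[:, 0]), y_valid[0])   # ensemble mean vs. the actual value
print(preds.shape)                        # (number of trees, number of validation rows)

The preds array is also what the R² plot below is computed from.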
[9.30565055 9.04782144 8.9226583 9.10497986 8.9226583 9.15904708
9.30565055 9.35010231 9.54681261 8.98719682]
9.165257782260593 9.104979856318357
(10, 12000)
import matplotlib.pyplot as plt
from sklearn import metrics   # needed for r2_score; this import is assumed but not shown earlier

plt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(10)])
plt.show()

Adding more trees does not increase the model performance significantly:
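The three score lines below presumably come from stepping up n_estimators; 20, 40, and 80 trees would match the course's progression (assumed values):

from sklearn.ensemble import RandomForestRegressor

for n in (20, 40, 80):   # assumed progression of tree counts
    m = RandomForestRegressor(n_estimators=n, n_jobs=-1)
    m.fit(X_train, y_train)
    print_score(m)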
[0.10208552551172624, 0.349211430210069, 0.9770913547606447, 0.7822165493173546]
[0.09726308467086937, 0.35386511652111163, 0.979204606281339, 0.7763733905004107]
[0.09567778194716048, 0.3507615331500072, 0.979876974999839, 0.780278834860096]
Recall the print_score function, where the last item printed is the out-of-bag score (when the model has one):
def print_score(m):
res = [rmse(m.predict(X_train), y_train),
rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train),
m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

To get the OOB score, add one more parameter to the model constructor:
m = RandomForestRegressor(n_estimators=40,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.0973458316202628, 0.3492358738061539, 0.9791692077050513, 0.782186060068996, 0.8521597322650637]
Here the training R² (0.979) is far above the validation R² (0.782), so the model is over-fitting; and since the OOB score (0.852) is also well above the validation score, the temporal difference between the training and validation sets clearly has an additional impact.
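fastai's set_rf_samples speeds up experimentation by making each tree train on a random subsample instead of a full bootstrap (the reset_rf_samples function quoted later undoes this). The results below presumably follow a call along these lines, with the course's 20,000-row sample size assumed:

from fastai.structured import set_rf_samples   # fastai 0.7-era import path (assumed)

set_rf_samples(20000)   # assumed sample size: each tree now sees 20k random rows
m = RandomForestRegressor(n_jobs=-1, oob_score=True)   # default n_estimators, hence the FutureWarning below
m.fit(X_train, y_train)
print_score(m)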
/Users/nancy/anaconda3/envs/r-reticulate/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
[0.2418279065948944, 0.2799781522487347, 0.8777784860917807, 0.860010242879805, 0.8650363056265487]
Because each tree now trains on only a subsample, adding trees lets the ensemble see more of the data in total, so increasing the number of trees improves performance:
m = RandomForestRegressor(n_estimators=40,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.22659425812627826, 0.2615041844731935, 0.8926918698148494, 0.8778748080643042, 0.8812056869050565]
Tree growth can also be limited by setting min_samples_leaf. To see its effect, first revert to using a full bootstrap with fastai's reset_rf_samples (forest here is sklearn.ensemble.forest) and fit an unconstrained baseline:

def reset_rf_samples():
    """ Undoes the changes produced by set_rf_samples.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))

## First revert to using a full bootstrap
reset_rf_samples()
A helper to compute the maximum depth of a fitted decision tree:

def dectree_max_depth(tree):
children_left = tree.children_left
children_right = tree.children_right
def walk(node_id):
if (children_left[node_id] != children_right[node_id]):
left_max = 1 + walk(children_left[node_id])
right_max = 1 + walk(children_right[node_id])
return max(left_max, right_max)
else: # leaf
return 1
root_node_id = 0
    return walk(root_node_id)

m = RandomForestRegressor(n_estimators=40,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.07850136700663472, 0.23982842268856314, 0.9871207887971841, 0.8972813562954464, 0.9081813562258134]
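The 48 below is the maximum depth of one of the forest's trees, presumably obtained with the helper above:

t = m.estimators_[0].tree_   # underlying sklearn tree of the first estimator (illustrative choice)
dectree_max_depth(t)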
48
m = RandomForestRegressor(n_estimators=40,
min_samples_leaf=5,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.14066900750693243, 0.23378407318232006, 0.9586446526694946, 0.9023937076672716, 0.9069995123114761]
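With min_samples_leaf=5, the same depth check on the first tree reports a shallower tree: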
38
m = RandomForestRegressor(n_estimators=40,
min_samples_leaf=3,
max_features=0.5,
n_jobs=-1,
oob_score=True)
m.fit(X_train, y_train)
print_score(m)

[0.11908905343482946, 0.22811149301231815, 0.9703599786073658, 0.9070729163026368, 0.9117730171652707]