The linear model is one of the simplest machine learning algorithms. People are often drawn to more advanced models such as neural networks or gradient boosting because of the hype and the predictive performance. However, for most day-to-day business cases, a linear model is good enough. A linear model also comes with the benefit of being interpretable, in contrast to a black-box neural network. On this occasion, we will build a linear model with regularization to analyze the data while still getting great predictive performance.
All source code for this article is provided in my GitHub repo.
# Data Wrangling
import pandas as pd
import numpy as np
# Regex
import re
# Statistics
import scipy.stats as stats
# Data Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
# Model Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# Machine Learning Model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import ElasticNetCV
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

On this occasion, we will try to understand what makes the price of a laptop increase by building a linear model. A computer geek or someone who manufactures laptops may already know the production cost of each component. However, for lay people like us who only know how to use a laptop, exploring this dataset and building a machine learning model around it will help us compare laptops with various specifications built by various companies. We may also see some intangible factors that can affect the price, such as the value of a brand like Apple or the CPU component, for example Intel Core vs AMD.
The data comes from the Laptop Company Price List dataset.
laptop = pd.read_csv('data/laptop_price.csv')
laptop.head()## laptop_ID Company Product ... OpSys Weight Price_euros
## 0 1 Apple MacBook Pro ... macOS 1.37kg 1339.69
## 1 2 Apple Macbook Air ... macOS 1.34kg 898.94
## 2 3 HP 250 G6 ... No OS 1.86kg 575.00
## 3 4 Apple MacBook Pro ... macOS 1.83kg 2537.45
## 4 5 Apple MacBook Pro ... macOS 1.37kg 1803.60
##
## [5 rows x 13 columns]
Let’s check the dimension of the data.
laptop.shape## (1303, 13)
Let’s check the data type of each column. See if there is any incompatible column data type.
laptop.dtypes## laptop_ID int64
## Company object
## Product object
## TypeName object
## Inches float64
## ScreenResolution object
## Cpu object
## Ram object
## Memory object
## Gpu object
## OpSys object
## Weight object
## Price_euros float64
## dtype: object
Some columns, such as Ram and Memory, should be numeric. They are still in string format and contain non-numeric characters. We will clean the data later, before building the model.
Let’s check if there is any duplicated data.
laptop[ laptop.duplicated()].shape## (0, 13)
Based on the result, there are no duplicated observations in the data.
Let’s check if there is any missing value from each column.
laptop.isnull().sum(axis = 0)## laptop_ID 0
## Company 0
## Product 0
## TypeName 0
## Inches 0
## ScreenResolution 0
## Cpu 0
## Ram 0
## Memory 0
## Gpu 0
## OpSys 0
## Weight 0
## Price_euros 0
## dtype: int64
Based on the result, there is no missing value in any column of our dataset.
Although the information given in the dataset is quite comprehensive, we need to transform the data into a proper format before building a machine learning model.
The first thing we do is remove the weight unit (kg) from the Weight column and transform the value into a float/numeric.
# Remove "kg" from the weight
laptop['Weight'] = list(map(lambda x: float(re.sub('kg', '', x)), laptop['Weight']))
laptop.head()## laptop_ID Company Product ... OpSys Weight Price_euros
## 0 1 Apple MacBook Pro ... macOS 1.37 1339.69
## 1 2 Apple Macbook Air ... macOS 1.34 898.94
## 2 3 HP 250 G6 ... No OS 1.86 575.00
## 3 4 Apple MacBook Pro ... macOS 1.83 2537.45
## 4 5 Apple MacBook Pro ... macOS 1.37 1803.60
##
## [5 rows x 13 columns]
Let’s check if there is any missing value as a result of our data wrangling process on the Weight column.
laptop['Weight'].isnull().sum()## 0
The next thing we do is remove the GB unit from the Ram column and transform the value into an integer.
laptop['Ram'] = list(map(lambda x: int(re.sub('GB', '', x)), laptop['Ram']))
laptop.head()## laptop_ID Company Product ... OpSys Weight Price_euros
## 0 1 Apple MacBook Pro ... macOS 1.37 1339.69
## 1 2 Apple Macbook Air ... macOS 1.34 898.94
## 2 3 HP 250 G6 ... No OS 1.86 575.00
## 3 4 Apple MacBook Pro ... macOS 1.83 2537.45
## 4 5 Apple MacBook Pro ... macOS 1.37 1803.60
##
## [5 rows x 13 columns]
Now we will separate the Memory column into 3 different columns: SSD, HDD, and Flash based on the type of the storage system. The first thing we do is to find the specific storage type, for example SSD, and extract the value. If a laptop does not have any SSD, the value will be empty.
Here I use the regex pattern \d+, which matches digits (0-9), followed by the storage unit (GB/TB) and ending with SSD, to indicate that we are only looking for SSD storage.
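As a quick illustration on a made-up Memory string (the string below is hypothetical, not taken from the dataset), the pattern captures only the SSD part and ignores the HDD part:

# Hypothetical example: only the SSD portion matches the pattern
re.findall('\d+GB SSD|\d+TB SSD', '256GB SSD +  1TB HDD') # returns ['256GB SSD']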
temp_ssd = list(map(lambda x: re.findall('\d+GB SSD|\d+TB SSD', x), laptop['Memory']))
temp_ssd[0:10]## [['128GB SSD'], [], ['256GB SSD'], ['512GB SSD'], ['256GB SSD'], [], [], [], ['512GB SSD'], ['256GB SSD']]
The next thing we do is convert the string into a proper numeric storage size. If we find a laptop with storage measured in TB, we convert it into GB by multiplying the value by 1000. Some laptops may have 2 separate SSDs embedded, such as 256GB SSD + 256GB SSD. To simplify the problem, we just sum the values.
final_ssd = []
for i in range(len(temp_ssd)):
    for j in range(len(temp_ssd[i])):
        if re.search('TB', temp_ssd[i][j]):
            temp_ssd[i][j] = int(re.sub('TB SSD', '', temp_ssd[i][j]))*1000 # Convert TB to GB
        else:
            temp_ssd[i][j] = int(re.sub('GB SSD', '', temp_ssd[i][j]))
    final_ssd.append( np.sum(temp_ssd[i])) # Sum the SSD Memory Storage
final_ssd[0:10]## [128, 0.0, 256, 512, 256, 0.0, 0.0, 0.0, 512, 256]
We will do the same thing with the HDD and Flash Storage.
# HDD Storage
temp_hdd = list(map(lambda x: re.findall('\d+GB HDD|\d+TB HDD', x), laptop['Memory']))
final_hdd = []
for i in range(len(temp_hdd)):
    for j in range(len(temp_hdd[i])):
        if re.search('TB', temp_hdd[i][j]):
            temp_hdd[i][j] = int(re.sub('TB HDD', '', temp_hdd[i][j]))*1000 # Convert TB to GB
        else:
            temp_hdd[i][j] = int(re.sub('GB HDD', '', temp_hdd[i][j]))
    final_hdd.append( np.sum(temp_hdd[i])) # Sum the total hdd Memory Storage
# Flash Storage
temp_flash = list(map(lambda x: re.findall('\d+GB Flash|\d+TB Flash', x), laptop['Memory']))
final_flash = []
for i in range(len(temp_flash)):
    for j in range(len(temp_flash[i])):
        if re.search('TB', temp_flash[i][j]):
            temp_flash[i][j] = int(re.sub('TB Flash', '', temp_flash[i][j]))*1000 # Convert TB to GB
        else:
            temp_flash[i][j] = int(re.sub('GB Flash', '', temp_flash[i][j]))
    final_flash.append( np.sum(temp_flash[i])) # Sum the total flash Memory Storage

Finally, we will attach the processed lists to the initial laptop dataframe.
laptop['ssd'] = final_ssd
laptop['hdd'] = final_hdd
laptop['flash'] = final_flash
laptop.head()## laptop_ID Company Product TypeName ... Price_euros ssd hdd flash
## 0 1 Apple MacBook Pro Ultrabook ... 1339.69 128.0 0.0 0.0
## 1 2 Apple Macbook Air Ultrabook ... 898.94 0.0 0.0 128.0
## 2 3 HP 250 G6 Notebook ... 575.00 256.0 0.0 0.0
## 3 4 Apple MacBook Pro Ultrabook ... 2537.45 512.0 0.0 0.0
## 4 5 Apple MacBook Pro Ultrabook ... 1803.60 256.0 0.0 0.0
##
## [5 rows x 16 columns]
The next thing we do is transforming the Cpu column. We will separate the processor type and the processor clock speed.
The processor clock speed is indicated by the number followed by the GigaHertz (GHz) unit. Let’s check the CPU types and their respective frequencies in the data.
laptop.value_counts('Cpu')## Cpu
## Intel Core i5 7200U 2.5GHz 190
## Intel Core i7 7700HQ 2.8GHz 146
## Intel Core i7 7500U 2.7GHz 134
## Intel Core i7 8550U 1.8GHz 73
## Intel Core i5 8250U 1.6GHz 72
## ...
## Intel Atom Z8350 1.92GHz 1
## Intel Core i7 2.2GHz 1
## Intel Core i7 2.7GHz 1
## Intel Core i7 2.8GHz 1
## Samsung Cortex A72&A53 2.0GHz 1
## Length: 118, dtype: int64
To simplify the processor/CPU type and prevent getting too many categorical classes, we will only consider the general type. For example, Intel Core i5 and Intel Core i5 7200U will be considered the same type of CPU. Let’s check the result of the CPU type name cleansing process. We expect a general CPU type and try not to be too specific, to reduce the number of new features.
# CPU type
laptop_cpu = list(map(lambda x: re.findall('.*? \d+', x)[0].strip(), laptop['Cpu']))
laptop_cpu = list(map(lambda x: re.sub(' \d+.*', '', x), laptop_cpu)) # Remove string started with numbers after whitespace
laptop_cpu = list(map(lambda x: re.sub('[-].*', '', x), laptop_cpu)) # Remove type extension such as x-Z090 into x
laptop_cpu = list(map(lambda x: re.sub(' [A-Z]\d+.*', '', x), laptop_cpu)) # Remove string started with capital letters followed by numbers after whitespace
pd.DataFrame(laptop_cpu).value_counts()## Intel Core i7 527
## Intel Core i5 423
## Intel Core i3 136
## Intel Celeron Dual Core 80
## AMD 47
## Intel Pentium Quad Core 27
## Intel Core M 16
## Intel Atom x5 10
## AMD E 9
## Intel Celeron Quad Core 8
## Intel Xeon 4
## AMD Ryzen 4
## Intel Pentium Dual Core 3
## Intel Atom 3
## Intel Core M m3 2
## AMD FX 2
## Intel Core M m7 1
## Samsung Cortex 1
## dtype: int64
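To make the chain of substitutions above more concrete, here is a hypothetical walk-through on a single CPU string (not taken from a specific row of the data):

# Hypothetical walk-through on the string 'Intel Core i5 7200U 2.5GHz'
example = re.findall('.*? \d+', 'Intel Core i5 7200U 2.5GHz')[0].strip() # 'Intel Core i5 7200'
example = re.sub(' \d+.*', '', example)                                  # 'Intel Core i5'
example = re.sub('[-].*', '', example)                                   # unchanged, no '-' present
example = re.sub(' [A-Z]\d+.*', '', example)                             # unchanged, 'Intel Core i5'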
We will continue by extracting the CPU clock speed.
# CPU clock speed
laptop_cpu_clock = list(map(lambda x: float(re.sub('GHz', '', re.findall('\d+GHz|\d+[.]\d+.*GHz', x)[0])), laptop['Cpu']))
laptop_cpu_clock[0:10]## [2.3, 1.8, 2.5, 2.7, 3.1, 3.0, 2.2, 1.8, 1.8, 1.6]
After we have collected the list for the processor type and the processor clock speed, we attach them to the initial dataset.
laptop['cpu_type'] = laptop_cpu
laptop['cpu_clock'] = laptop_cpu_clock
laptop.head(10)## laptop_ID Company Product ... flash cpu_type cpu_clock
## 0 1 Apple MacBook Pro ... 0.0 Intel Core i5 2.3
## 1 2 Apple Macbook Air ... 128.0 Intel Core i5 1.8
## 2 3 HP 250 G6 ... 0.0 Intel Core i5 2.5
## 3 4 Apple MacBook Pro ... 0.0 Intel Core i7 2.7
## 4 5 Apple MacBook Pro ... 0.0 Intel Core i5 3.1
## 5 6 Acer Aspire 3 ... 0.0 AMD 3.0
## 6 7 Apple MacBook Pro ... 256.0 Intel Core i7 2.2
## 7 8 Apple Macbook Air ... 256.0 Intel Core i5 1.8
## 8 9 Asus ZenBook UX430UN ... 0.0 Intel Core i7 1.8
## 9 10 Acer Swift 3 ... 0.0 Intel Core i5 1.6
##
## [10 rows x 18 columns]
The GPU is also an important component, especially for people looking for a better gaming experience. Since there are a lot of GPU variants, we will only extract the first 2 words from the GPU type, for example Intel Iris or Intel HD.
# gpu_type = list(map(lambda x: re.findall('.*? ', x)[0].strip(), laptop['Gpu']))
gpu_type = list(map(lambda x: " ".join(x.split()[0:2]), laptop['Gpu']))
laptop['gpu_type'] = gpu_type
laptop.head()## laptop_ID Company Product ... cpu_type cpu_clock gpu_type
## 0 1 Apple MacBook Pro ... Intel Core i5 2.3 Intel Iris
## 1 2 Apple Macbook Air ... Intel Core i5 1.8 Intel HD
## 2 3 HP 250 G6 ... Intel Core i5 2.5 Intel HD
## 3 4 Apple MacBook Pro ... Intel Core i7 2.7 AMD Radeon
## 4 5 Apple MacBook Pro ... Intel Core i5 3.1 Intel Iris
##
## [5 rows x 19 columns]
Next, we will extract information from the ScreenResolution column. If the laptop has a touchscreen, we will assign a value of 1.
touch_screen = []
for i in range(len(laptop['ScreenResolution'])):
    if re.search('Touchscreen', laptop['ScreenResolution'][i]):
        touch_screen.append(1)
    else:
        touch_screen.append(0)
touch_screen[0:20]## [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Now we will extract the screen width. A special case is when the screen resolution is 4K, where the dimensions are not explicitly stated. To handle this, we will assume that every laptop with a 4K resolution has an aspect ratio of 16:9, or 3840x2160, which is the most common 4K resolution according to PC Monitors.
screen_width_str = list(map(lambda x: re.sub('x', '', re.findall('\d+.*?x', x)[0]), laptop['ScreenResolution']))
screen_width = []
for i in range(len(screen_width_str)):
    if re.search('4K', screen_width_str[i]):
        screen_width.append(3840)
    else:
        screen_width.append(int(screen_width_str[i]))
screen_width[0:10]## [2560, 1440, 1920, 2880, 2560, 1366, 2880, 1440, 1920, 1920]
We will continue by extracting the screen height.
screen_height_str = list(map(lambda x: re.sub('x', '', re.findall('x.*\d+', x)[0]), laptop['ScreenResolution']))
screen_height = []
for i in range(len(screen_height_str)):
    if re.search('4K', screen_height_str[i]):
        screen_height.append(2160)
    else:
        screen_height.append(int(screen_height_str[i]))
screen_height[0:10]## [1600, 900, 1080, 1800, 1600, 768, 1800, 900, 1080, 1080]
We will also extract the monitor type. If an observation doesn’t state any monitor type and only shows the screen resolution, we will fill the monitor type with others.
monitor_type = list(map(lambda x: re.sub('\d+.*x.*', '', x), laptop['ScreenResolution']))
monitor_type = list(map(lambda x: re.sub('Touchscreen', '', x), monitor_type))
monitor_type = list(map(lambda x: re.sub('[/]', '', x).strip(), monitor_type))
for i in range(len(monitor_type)):
    if monitor_type[i] == '':
        monitor_type[i] = 'others'
pd.DataFrame(monitor_type).value_counts()## Full HD 555
## others 364
## IPS Panel Full HD 288
## IPS Panel 49
## Quad HD+ 19
## IPS Panel Retina Display 17
## IPS Panel Quad HD+ 11
## dtype: int64
Finally, we attach the screen information to the initial dataset.
laptop['touchscreen'] = touch_screen
laptop['screen_width'] = screen_width
laptop['screen_height'] = screen_height
laptop['monitor_type'] = monitor_type
laptop.head()## laptop_ID Company ... screen_height monitor_type
## 0 1 Apple ... 1600 IPS Panel Retina Display
## 1 2 Apple ... 900 others
## 2 3 HP ... 1080 Full HD
## 3 4 Apple ... 1800 IPS Panel Retina Display
## 4 5 Apple ... 1600 IPS Panel Retina Display
##
## [5 rows x 23 columns]
After we have completed the data wrangling process, we will continue exploring the information from the dataset. Understanding the data is crucial before we start to build the machine learning model.
To simplify the dataset, we will drop some columns that are not necessary for building the model.
laptop_clean = laptop.copy()
laptop_clean.drop(['laptop_ID', 'Product', 'TypeName', 'ScreenResolution', 'Cpu', 'Memory', 'Gpu'], axis = 1, inplace = True)
laptop_clean.head()## Company Inches Ram ... screen_width screen_height monitor_type
## 0 Apple 13.3 8 ... 2560 1600 IPS Panel Retina Display
## 1 Apple 13.3 8 ... 1440 900 others
## 2 HP 15.6 8 ... 1920 1080 Full HD
## 3 Apple 15.4 16 ... 2880 1800 IPS Panel Retina Display
## 4 Apple 13.3 8 ... 2560 1600 IPS Panel Retina Display
##
## [5 rows x 16 columns]
Let’s explore the correlation between the numeric features, especially how strongly each of them correlates with the laptop price.
corr_mat = laptop_clean.drop('touchscreen', axis = 1).corr()
plt.pcolor(corr_mat, cmap = 'RdBu')
plt.colorbar()
plt.xticks(range(len(list(corr_mat.columns))), labels = list(corr_mat.columns), rotation = 90)
plt.yticks(range(len(list(corr_mat.columns))), labels = list(corr_mat.columns))
plt.xlabel('')
plt.show()
plt.close()

Based on the correlation matrix, we can see that the price (Price_euros) has a relatively strong correlation with the RAM, while the other features have a lower correlation with the price.
Next, we will check the number of each variant of the operating system (OS) of the laptop.
laptop_agg = laptop_clean[['Price_euros', 'OpSys']].groupby('OpSys').count().\
rename(columns = {'Price_euros':'Frequency'}).sort_values(by = 'Frequency', ascending = False)
plt.bar(x = laptop_agg.index, height = laptop_agg['Frequency'])
# Insert text
for i in range(laptop_agg.shape[0]):
    plt.text(laptop_agg.index[i], laptop_agg['Frequency'][i], laptop_agg['Frequency'][i])
plt.xticks(rotation = 90)
plt.xlabel('OS')
plt.ylabel('Frequency')
plt.title('Operating System')
plt.show()
plt.close()

Windows 10 dominates the dataset with 1072 laptops, followed by No OS (66), Linux (62), and Windows 7 (45). Next, we will check the frequency of each processor based on the general CPU type.
laptop_agg = laptop_clean[['Price_euros', 'cpu_type']].groupby('cpu_type').count().\
rename(columns = {'Price_euros':'Frequency'}).sort_values(by = 'Frequency', ascending = False)
plt.bar(x = laptop_agg.index, height = laptop_agg['Frequency'])
# Insert text
for i in range(laptop_agg.shape[0]):
    plt.text(laptop_agg.index[i], laptop_agg['Frequency'][i], laptop_agg['Frequency'][i])
plt.xticks(rotation = 90)
plt.xlabel('CPU')
plt.ylabel('Frequency')
plt.title('CPU Type by Frequency')
plt.show()
plt.close()

The Intel Core series are the most frequent processors in the market. Some CPU types have only 1 or 2 observations, such as the Samsung Cortex. We will label the CPU type as others for CPUs with only one observation in the data.
low_cpu = list(laptop_agg[ laptop_agg['Frequency'] == 1].index)
id_pos = list(laptop_clean['cpu_type'][laptop_clean['cpu_type'].isin(low_cpu)].index)
laptop_clean.loc[id_pos, 'cpu_type'] = '1_others'
laptop_clean[['cpu_type']].value_counts()## cpu_type
## Intel Core i7 527
## Intel Core i5 423
## Intel Core i3 136
## Intel Celeron Dual Core 80
## AMD 47
## Intel Pentium Quad Core 27
## Intel Core M 16
## Intel Atom x5 10
## AMD E 9
## Intel Celeron Quad Core 8
## Intel Xeon 4
## AMD Ryzen 4
## Intel Atom 3
## Intel Pentium Dual Core 3
## Intel Core M m3 2
## AMD FX 2
## 1_others 2
## dtype: int64
Let’s check the price distribution for each CPU type.
sns.boxplot(data = laptop_clean, y = 'cpu_type', x = 'Price_euros')
plt.ylabel('CPU Type')
plt.xlabel('Price in Euro')
plt.show()
plt.close()

Based on the boxplot, we can see that Intel Xeon, Intel Core i7, and AMD Ryzen have a higher median price compared to the other processors. Judging by the outliers, the most expensive laptops are built with an Intel Core i7.
We will continue by checking the type of the GPU.
laptop_agg = laptop_clean[['Price_euros', 'gpu_type']].groupby('gpu_type').count().\
rename(columns = {'Price_euros':'Frequency'}).sort_values(by = 'Frequency', ascending = False)
plt.bar(x = laptop_agg.index, height = laptop_agg['Frequency'])
# Insert text
for i in range(laptop_agg.shape[0]):
    plt.text(laptop_agg.index[i], laptop_agg['Frequency'][i], laptop_agg['Frequency'][i])
plt.xticks(rotation = 90)
plt.xlabel('GPU')
plt.ylabel('Frequency')
plt.title('GPU Type by Frequency')
plt.show()
plt.close()

Intel HD is the most common GPU (639 laptops), followed by the Nvidia GeForce series (368). We will also group all GPUs that have only 1 observation as others.
low_gpu = list(laptop_agg[ laptop_agg['Frequency'] == 1].index)
id_pos = list(laptop_clean[ laptop_clean['gpu_type'].isin(low_gpu)].index)
laptop_clean.loc[id_pos, 'gpu_type'] = '1_others'
laptop_clean[['gpu_type']].value_counts()## gpu_type
## Intel HD 639
## Nvidia GeForce 368
## AMD Radeon 173
## Intel UHD 68
## Nvidia Quadro 31
## Intel Iris 14
## 1_others 5
## AMD FirePro 5
## dtype: int64
We will also check the price distribution of each GPU type.
sns.boxplot(data = laptop_clean, y = 'gpu_type', x = 'Price_euros')
plt.ylabel('GPU Vendor')
plt.xlabel('Price in Euro')
plt.show()
plt.close()

Based on the distribution of each boxplot, laptops with an Nvidia GPU are slightly pricier than those from other vendors. From the outliers, combined with the information from the previous CPU price distribution, we can see that laptops with an Intel processor and an Nvidia GPU have higher prices. This is somewhat expected, since most gaming laptops tend to have an Nvidia GPU and an Intel processor. To check this argument, we will look at the laptops priced higher than 3000 Euros and see the combination of CPU and GPU.
laptop[['Company', 'Product', 'Cpu', 'Gpu']][ laptop['Price_euros'] > 3000]## Company ... Gpu
## 196 Razer ... Nvidia GeForce GTX 1080
## 204 Dell ... Nvidia Quadro M1200
## 238 Asus ... Nvidia GeForce GTX 1080
## 530 Dell ... Nvidia GeForce GTX 1070
## 610 Lenovo ... Nvidia Quadro M2200M
## 659 Dell ... Nvidia GeForce GTX 1070
## 723 Dell ... Nvidia GeForce GTX 1070
## 744 Lenovo ... Nvidia Quadro M520M
## 749 HP ... Nvidia Quadro M2000M
## 780 Dell ... Nvidia GeForce GTX 1070M
## 830 Razer ... Nvidia GeForce GTX 1080
## 841 Dell ... Nvidia GeForce GTX 1070
## 911 HP ... Intel HD Graphics 515
## 955 Dell ... Nvidia GeForce GTX 1070
## 968 Dell ... Nvidia GeForce GTX 1070
## 1066 Asus ... Nvidia GeForce GTX 980
## 1081 Lenovo ... Nvidia GeForce GTX 980M
## 1136 HP ... Nvidia Quadro M3000M
## 1231 Razer ... Nvidia GeForce GTX 1060
##
## [19 rows x 4 columns]
We can see that nearly all of the laptops priced above 3000 Euros use an Nvidia GPU and are built with an Intel Core i7 or higher processor. We have finished the exploratory data analysis to understand our data; now we will start building the machine learning model.
Before we split the data into training and testing sets, we will convert the categorical variables into dummy features using one-hot encoding so that they can be processed by the machine learning model.
The following columns will be transformed: cpu_type, gpu_type, OpSys, and Company.
First, we convert the categories into integer numbers using label encoding. For example, AMD will be 0, Intel will be 1, and so on. After the data is converted, we apply one-hot encoding and convert the result into an array. The drop='first' argument removes the first category from the encoding, so if we have 5 different categories in a column, we only get 4 new columns from the one-hot encoding. Since we named the others category 1_others, it is the one dropped during the encoding process, which lets us treat any new category that is not present in the current dataset as others.
# Convert Category into Integer
cpu_label = LabelEncoder().fit(laptop_clean['cpu_type']).transform(laptop_clean['cpu_type'])
cpu_label = cpu_label.reshape(len(laptop_clean['cpu_type']), 1)
# Convert Label into One Hot Encoding
cpu_label = OneHotEncoder(drop = 'first').fit(cpu_label).transform(cpu_label).toarray()
cpu_label## array([[0., 0., 0., ..., 0., 0., 0.],
## [0., 0., 0., ..., 0., 0., 0.],
## [0., 0., 0., ..., 0., 0., 0.],
## ...,
## [0., 0., 0., ..., 0., 0., 0.],
## [0., 0., 0., ..., 0., 0., 0.],
## [0., 0., 0., ..., 0., 0., 0.]])
To help with model interpretation, we will collect the CPU categories as column names for later use. One-hot encoding uses alphabetical order every time it converts categorical data.
cpu_name = list(set(laptop_clean['cpu_type']))
cpu_name.sort()
cpu_name = list(map(lambda x: 'cpu_' + x, cpu_name))
cpu_name = cpu_name[ 1:len(cpu_name) ]
cpu_onehot = pd.DataFrame(cpu_label, columns = cpu_name)
cpu_onehot.head()## cpu_AMD cpu_AMD E ... cpu_Intel Pentium Quad Core cpu_Intel Xeon
## 0 0.0 0.0 ... 0.0 0.0
## 1 0.0 0.0 ... 0.0 0.0
## 2 0.0 0.0 ... 0.0 0.0
## 3 0.0 0.0 ... 0.0 0.0
## 4 0.0 0.0 ... 0.0 0.0
##
## [5 rows x 16 columns]
Finally, we create a dataframe from the one-hot encoding and add it to the dataset.
laptop_clean = pd.concat([laptop_clean.reset_index(drop = True), cpu_onehot], axis = 1)
laptop_clean.head()## Company Inches ... cpu_Intel Pentium Quad Core cpu_Intel Xeon
## 0 Apple 13.3 ... 0.0 0.0
## 1 Apple 13.3 ... 0.0 0.0
## 2 HP 15.6 ... 0.0 0.0
## 3 Apple 15.4 ... 0.0 0.0
## 4 Apple 13.3 ... 0.0 0.0
##
## [5 rows x 32 columns]
We will do the same thing for the GPU type, the OS, and the company.
# Convert Category into Integer
gpu_label = LabelEncoder().fit(laptop_clean['gpu_type']).transform(laptop_clean['gpu_type'])
gpu_label = gpu_label.reshape(len(laptop_clean['gpu_type']), 1)
# Convert Label into One Hot Encoding
gpu_label = OneHotEncoder(drop = 'first').fit(gpu_label).transform(gpu_label).toarray()
# Create Column name
gpu_name = list(set(laptop_clean['gpu_type']))
gpu_name.sort()
gpu_name = list(map(lambda x: 'gpu_' + x, gpu_name))
gpu_name = gpu_name[ 1:len(gpu_name) ]
# Concat the Column
gpu_onehot = pd.DataFrame(gpu_label, columns = gpu_name)
laptop_clean = pd.concat([laptop_clean.reset_index(drop = True), gpu_onehot], axis = 1)
laptop_clean.head()## Company Inches Ram ... gpu_Intel UHD gpu_Nvidia GeForce gpu_Nvidia Quadro
## 0 Apple 13.3 8 ... 0.0 0.0 0.0
## 1 Apple 13.3 8 ... 0.0 0.0 0.0
## 2 HP 15.6 8 ... 0.0 0.0 0.0
## 3 Apple 15.4 16 ... 0.0 0.0 0.0
## 4 Apple 13.3 8 ... 0.0 0.0 0.0
##
## [5 rows x 39 columns]
# Convert Category into Integer
os_label = LabelEncoder().fit(laptop_clean['OpSys']).transform(laptop_clean['OpSys'])
os_label = os_label.reshape(len(laptop_clean['OpSys']), 1)
# Convert Label into One Hot Encoding
os_label = OneHotEncoder(drop = 'first').fit(os_label).transform(os_label).toarray()
# Create Column name
os_name = list(set(laptop_clean['OpSys']))
os_name.sort()
os_name = list(map(lambda x: 'os_' + x, os_name))
os_name = os_name[ 1:len(os_name) ]
# Concat the Column
os_onehot = pd.DataFrame(os_label, columns = os_name)
laptop_clean = pd.concat([laptop_clean.reset_index(drop = True), os_onehot], axis = 1)
laptop_clean.head()## Company Inches Ram ... os_Windows 10 S os_Windows 7 os_macOS
## 0 Apple 13.3 8 ... 0.0 0.0 1.0
## 1 Apple 13.3 8 ... 0.0 0.0 1.0
## 2 HP 15.6 8 ... 0.0 0.0 0.0
## 3 Apple 15.4 16 ... 0.0 0.0 1.0
## 4 Apple 13.3 8 ... 0.0 0.0 1.0
##
## [5 rows x 47 columns]
# Convert Category into Integer
company_label = LabelEncoder().fit(laptop_clean['Company']).transform(laptop_clean['Company'])
company_label = company_label.reshape(len(laptop_clean['Company']), 1)
# Convert Label into One Hot Encoding
company_label = OneHotEncoder(drop = 'first').fit(company_label).transform(company_label).toarray()
# Create Column name
company_name = list(set(laptop_clean['Company']))
company_name.sort()
company_name = list(map(lambda x: 'company_' + x, company_name))
company_name = company_name[ 1:len(company_name) ]
# Concat the Column
company_onehot = pd.DataFrame(company_label, columns = company_name)
laptop_clean = pd.concat([laptop_clean.reset_index(drop = True), company_onehot], axis = 1)
laptop_clean.head()## Company Inches Ram ... company_Toshiba company_Vero company_Xiaomi
## 0 Apple 13.3 8 ... 0.0 0.0 0.0
## 1 Apple 13.3 8 ... 0.0 0.0 0.0
## 2 HP 15.6 8 ... 0.0 0.0 0.0
## 3 Apple 15.4 16 ... 0.0 0.0 0.0
## 4 Apple 13.3 8 ... 0.0 0.0 0.0
##
## [5 rows x 65 columns]
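As a side note, the four encoding blocks above repeat the same pattern, so a small helper function could reduce the duplication. The following is a minimal sketch (onehot_dataframe is a hypothetical helper, not part of the original code), reusing the LabelEncoder and OneHotEncoder imported earlier and taking the column names from the fitted label encoder.

# Minimal sketch of a reusable encoder (hypothetical helper)
def onehot_dataframe(df, column, prefix):
    labels = LabelEncoder().fit(df[column])
    encoded = labels.transform(df[column]).reshape(-1, 1)
    onehot = OneHotEncoder(drop = 'first').fit(encoded)
    # classes_ is sorted alphabetically; the first category is dropped by drop='first'
    names = [prefix + '_' + c for c in labels.classes_[1:]]
    return pd.DataFrame(onehot.transform(encoded).toarray(), columns = names)

# Example usage, equivalent to the cpu_type block above:
# cpu_onehot = onehot_dataframe(laptop_clean, 'cpu_type', 'cpu')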
Finally, we will once again drop the unnecessary columns and keep only the numeric columns.
laptop_clean = laptop_clean.select_dtypes(include = 'number')
laptop_clean.columns = list(map(lambda x: re.sub(' ', '_', x), laptop_clean.columns))
laptop_clean.head()## Inches Ram Weight ... company_Toshiba company_Vero company_Xiaomi
## 0 13.3 8 1.37 ... 0.0 0.0 0.0
## 1 13.3 8 1.34 ... 0.0 0.0 0.0
## 2 15.6 8 1.86 ... 0.0 0.0 0.0
## 3 15.4 16 1.83 ... 0.0 0.0 0.0
## 4 13.3 8 1.37 ... 0.0 0.0 0.0
##
## [5 rows x 60 columns]
We will now split the data into training and testing datasets, using 20% of the data as the testing set.
x_laptop = laptop_clean.drop('Price_euros', axis = 1)
y_laptop = laptop_clean['Price_euros']
x_train, x_test, y_train, y_test = train_test_split(x_laptop, y_laptop, test_size = 0.2, random_state = 100)
print("Number of Data Train: " + str(x_train.shape[0]))## Number of Data Train: 1042
print("Number of Data Test: " + str(x_test.shape[0]))## Number of Data Test: 261
We will now start building the machine learning models. We will build the following models and compare their predictive performance: ordinary least squares (OLS) linear regression, Lasso regression, Ridge regression, and Elastic Net.
First, we fit the OLS (Ordinary Least Squares) linear regression to the training dataset. OLS finds the best coefficient for the intercept and each feature by minimizing the Sum of Squared Errors (SSE) as the loss function.
\[ SSE = \Sigma_{i=1}^N (y_i - \hat y_i)^2 \]
lm_model = LinearRegression().fit(x_train, y_train)

Let’s check the estimated coefficient of each feature to see how strongly it is associated with the laptop price. Due to limited visualization space, we will only highlight the features with the highest (top 10) and lowest (bottom 10) coefficient values.
coef_lm = pd.DataFrame({'features': x_train.columns, 'estimate':lm_model.coef_}).sort_values(by = 'estimate', ascending = False)
top_last10 = coef_lm.iloc[np.r_[0:10, -10:0]]
sns.barplot(data = top_last10, x = 'estimate', y = 'features')
plt.xlabel('Estimate Coefficients')
plt.ylabel('Features')
plt.show()
plt.close()

Based on the 10 highest and 10 lowest coefficients, we can see that certain types of CPU lower the predicted price because of their negative coefficients. For example, a laptop with an AMD FX or Intel Pentium Dual Core will have a lower predicted price than a laptop with an Intel Xeon. If the laptop is built by Razer, the predicted price increases by around 1250 Euros.
Let’s check the prediction performance of the linear regression. We will use R-squared (R2 score) and the error as measured by the Root Mean Squared Error (RMSE). RMSE is a good measure for evaluating regression problems because it punishes the model more heavily for observations with large errors.
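For reference, writing \(\hat y_i\) for the predicted price and \(\bar y\) for the mean of the observed prices in the test set, the two metrics are defined as:

\[ RMSE = \sqrt{\frac{1}{N} \Sigma_{i=1}^N (y_i - \hat y_i)^2} \qquad R^2 = 1 - \frac{\Sigma_{i=1}^N (y_i - \hat y_i)^2}{\Sigma_{i=1}^N (y_i - \bar y)^2} \]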
pred_lm = lm_model.predict(x_test)
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_lm), 3)))## R2 Score: 0.793
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_lm)), 3)) )## RMSE: 312.12
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))## Price Standard Deviation: 686.157
We can compare the RMSE with the standard deviation of the price variable from the testing dataset. According to Bowles, if the RMSE is lower than the standard deviation, then we can conclude that the model has a good performance. A good model should, on average, have better predictions than the naive estimate of the mean for all predictions.
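To make that comparison concrete, here is a minimal sketch of the naive baseline, which predicts the mean training price for every laptop in the test set; its RMSE should come out close to (and slightly above) the standard deviation of the test prices.

# Naive baseline: predict the mean of the training prices for every test observation
baseline_pred = np.full(len(y_test), y_train.mean())
print('Baseline RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, baseline_pred)), 3)))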
Lasso regression is a variant of linear regression that adds a penalty to the loss function to regularize the model and reduce its variance. A model with less variance will be better at predicting new data. The idea is to penalize complexity by adding a regularization term such that, as the regularization parameter increases, the weights are reduced.
As you may have learned before, linear regression tries to find the best estimates for the model intercept and the slope of each feature by minimizing the Sum of Squared Errors (SSE).
\[ SSE = \Sigma_{i=1}^N (y_i - \hat y_i)^2 \]
Lasso regression adds an L1 penalty with a \(\lambda\) constant to the loss function. If \(\lambda\) equals zero, lasso regression becomes identical to ordinary linear regression.
\[ SSE = \Sigma_{i=1}^N (y_i - \hat y_i)^2 + \lambda\ \Sigma_{j=1}^n |\beta_j| \]
The benefit of using Lasso is that it can function as a feature selection method. This model will shrink and sometimes remove features so that we only have the features that affect the target data. To fit a Lasso model, we need to scale all features. The features need to have the same scale so that the coefficient values are chosen based only on which attribute is most useful, not on the basis of which one has the most favorable scale.
# Scale Features
x_scaler = StandardScaler().fit(x_train.to_numpy())
# Transform Data
x_train_norm = x_scaler.transform(x_train.to_numpy())
x_test_norm = x_scaler.transform(x_test.to_numpy())

The first thing we need to do to build a Lasso model is choose an appropriate value of \(\lambda\) as the penalty constant. Luckily, the sklearn package has a built-in estimator that can help us find the optimal hyper-parameter (in this case, \(\lambda\) or \(\alpha\)) using cross-validation to evaluate the model.
In the following step, we use 10-fold cross-validation to fit and evaluate the model and try 1000 different alpha (\(\lambda\)) values as the penalty constant. The estimator will give us the best alpha to choose.
lasso_model_cv = LassoCV(cv = 10, n_alphas = 1000).fit(x_train_norm, y_train)
print('Best alpha: ' + str(np.round(lasso_model_cv.alpha_, 5)))## Best alpha: 0.80411
We can directly predict new data using the previously fitted model. Let’s evaluate the model on the unseen testing dataset.
pred_lasso = lasso_model_cv.predict(x_test_norm)
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_lasso), 3)))## R2 Score: 0.794
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_lasso)), 3)) )## RMSE: 311.765
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))## Price Standard Deviation: 686.157
You can also refit the data with a new Lasso model using the best alpha as input. The result is the same.
lasso_model = Lasso(alpha = lasso_model_cv.alpha_).fit(x_train_norm, y_train)
pred_lasso = lasso_model.predict(x_test_norm)
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_lasso), 3)))
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_lasso)), 3)) )
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))

Let’s visualize how different values of \(\lambda\) affect the estimated coefficient of each feature. Here we use \(\lambda\) values from 0.0001 to 600 and fit a lasso regression for each.
nbins = 1000
list_lambda = np.linspace(1e-4, 600, nbins)
list_coef = np.zeros((nbins, x_train.shape[1]))
for i in range(len(list_lambda)):
    lasso_reg = Lasso(alpha = list_lambda[i]).fit(x_train_norm, y_train)
    list_coef[i, :] = lasso_reg.coef_
## /home/argaadya/anaconda3/envs/learning/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 50776583.72232635, tolerance: 51104.05075618397
## model = cd_fast.enet_coordinate_descent(
df_coef = pd.DataFrame(list_coef, columns = x_train.columns)
df_coef.head()## Inches Ram ... company_Vero company_Xiaomi
## 0 -105.488959 227.462027 ... -0.015091 16.500042
## 1 -100.040799 229.084431 ... -0.000000 14.306614
## 2 -94.583023 231.504600 ... -0.000000 11.952901
## 3 -89.181879 234.128459 ... -0.000000 9.596973
## 4 -84.696358 236.023662 ... -0.000000 8.073195
##
## [5 rows x 59 columns]
Now we will visualize the result.
for i in df_coef.columns:
    plt.plot(list_lambda, df_coef[i])
plt.xlabel('Lambda Hyper-Parameter')
plt.ylabel('Standardized Coefficients')
plt.show()
plt.close()

With a bigger \(\lambda\), more features are omitted (their estimated coefficient becomes 0) and only the most important features are retained. With \(\lambda\) = 400, only 1 feature remains.
Now let’s check the remaining features for \(\lambda\) > 100. Note that the coefficients come from the standardized features.
print('lambda: ' + str(list_lambda[200]))## lambda: 120.1202001001001
print('\nRemaining features')##
## Remaining features
df_coef.iloc[200][ np.abs(df_coef.iloc[200]) > 0]## Ram 264.659767
## ssd 133.845728
## cpu_clock 21.985854
## screen_height 65.203670
## cpu_Intel_Core_i7 27.852797
## gpu_Nvidia_Quadro 30.892689
## Name: 200, dtype: float64
Let’s check the remaining features for an even larger \(\lambda\).
print('lambda: ' + str(list_lambda[340]))## lambda: 204.2042701701702
print('\nRemaining features')##
## Remaining features
df_coef.iloc[300][ np.abs(df_coef.iloc[300]) > 0]## Ram 251.009581
## ssd 114.966493
## screen_height 32.800928
## cpu_Intel_Core_i7 8.158653
## Name: 300, dtype: float64
Ridge regression is similar to Lasso in that it adds a penalty to the loss function. The difference is that ridge regression squares the coefficients for the penalty instead of taking their absolute values. A larger value of \(\lambda\) makes the coefficients smaller, but in ridge regression they never reach exactly 0.
\[ SSE = \Sigma_{i=1}^N (y_i - \hat y_i)^2 + \lambda\ \Sigma_{j=1}^n \beta_j^2 \]
In the following process, I set the possible alpha values from 0.0001 to 100 with different steps.
alpha_range = [1e-4, 1e-3, 1e-2, 0.1, 1]
alpha_range.extend(np.arange(10, 100, 1))
ridge_model_cv = RidgeCV(cv = 10, alphas = alpha_range).fit(x_train_norm, y_train)
ridge_model_cv.alpha_## 63.0
Let’s evaluate the model.
pred_ridge = ridge_model_cv.predict(x_test_norm)
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_ridge), 3)))## R2 Score: 0.796
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_ridge)), 3)) )## RMSE: 309.696
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))## Price Standard Deviation: 686.157
While lasso regression can remove unnecessary features by setting their coefficients to 0 one by one, ridge regression shrinks all coefficients but never makes them exactly zero.
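To see this contrast numerically, a small sketch (assuming the lasso_model fitted earlier and the scaled training data are still in memory) can count how many coefficients each model sets exactly to zero; Ridge will typically report none.

# Compare sparsity: Lasso zeroes out coefficients, Ridge only shrinks them
ridge_model = Ridge(alpha = ridge_model_cv.alpha_).fit(x_train_norm, y_train)
print('Zero coefficients (Lasso): ' + str(np.sum(lasso_model.coef_ == 0)))
print('Zero coefficients (Ridge): ' + str(np.sum(ridge_model.coef_ == 0)))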
nbins = 1000
list_lambda = np.linspace(1e-4, 1e5, nbins)
list_coef = np.zeros((nbins, x_train.shape[1]))
for i in range(len(list_lambda)):
    ridge_reg = Ridge(alpha = list_lambda[i]).fit(x_train_norm, y_train)
    list_coef[i, :] = ridge_reg.coef_
df_coef = pd.DataFrame(list_coef, columns = x_train.columns)
for i in df_coef.columns:
    plt.plot(list_lambda, df_coef[i])
plt.xlabel('Alpha Hyper-Parameter')
plt.ylabel('Standardized Coefficients')
plt.show()
plt.close()

Elastic Net combines the L1 and L2 penalties into a single formula. This combination allows the model to learn a sparse solution in which only a few of the weights are non-zero, as in Lasso, while still maintaining the regularization properties of Ridge.
In the following example, we can set the ratio between the L1 and the L2 penalty. If l1_ratio = 0, the model is a Ridge regression, while if l1_ratio = 1 the model becomes a Lasso regression.
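Using the same simplified notation as the Lasso and Ridge formulas above, and glossing over the exact scaling constants that scikit-learn applies internally, the Elastic Net loss with mixing ratio \(\rho\) (the l1_ratio) can be written as:

\[ SSE = \Sigma_{i=1}^N (y_i - \hat y_i)^2 + \lambda\ \left( \rho\ \Sigma_{j=1}^n |\beta_j| + (1 - \rho)\ \Sigma_{j=1}^n \beta_j^2 \right) \]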
elastic_model_cv = ElasticNetCV(l1_ratio = 0.5, n_alphas = 1000).fit(x_train_norm, y_train)
elastic_model_cv.alpha_## 1.0331308855759713
Let’s evaluate the model.
pred_elastic = elastic_model_cv.predict(x_test_norm)
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_elastic), 3)))## R2 Score: 0.787
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_elastic)), 3)) )## RMSE: 316.605
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))## Price Standard Deviation: 686.157
Different l1_ratio values will give us different model performance.
elastic_model_cv = ElasticNetCV(l1_ratio = 0.8, n_alphas = 1000).fit(x_train_norm, y_train)
pred_elastic = elastic_model_cv.predict(x_test_norm)
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_elastic), 3)))## R2 Score: 0.797
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_elastic)), 3)) )## RMSE: 309.048
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))## Price Standard Deviation: 686.157
We can also pass a list of multiple l1_ratio values to try.
elastic_model_cv = ElasticNetCV(l1_ratio = [0.05, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 0.95], n_alphas = 1000).fit(x_train_norm, y_train)
pred_elastic = elastic_model_cv.predict(x_test_norm)
print('Chosen L1 ratio: ', elastic_model_cv.l1_ratio_)## Chosen L1 ratio: 0.9
print('R2 Score: ' + str(np.round(r2_score(y_test, pred_elastic), 3)))## R2 Score: 0.796
print('RMSE: ' + str(np.round(np.sqrt(mean_squared_error(y_test, pred_elastic)), 3)) )## RMSE: 309.797
print('Price Standard Deviation: ' + str(np.round(np.std(y_test), 3)))## Price Standard Deviation: 686.157
Based on our results, all regularization methods work better than the vanilla linear regression, with the Elastic Net achieving the lowest error on the testing dataset. We also see that even with a linear model we can achieve good results, as the RMSE of each model is well below the standard deviation of the testing dataset. We have also learned how Lasso and Ridge regression remove or shrink the coefficients of the features.