Blog3: PCA Linear Model

Jie Zou

2022-05-18

My Thought

I originally planned to use PCA only as a data-transformation step, but I ended up going a little further and built a PCA linear model on a laptop dataset from Kaggle.

Data Structure Overview

There are 896 observations and 22 features in the data. Most features are stored as characters or integers, and there do not appear to be any missing values (i.e. NA).
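The overview below comes from `dplyr::glimpse()`. A minimal sketch of the loading step, assuming the Kaggle file is saved locally as `laptops.csv` and read into an object I call `laptop` (both names are my placeholders):

```r
library(dplyr)

# read the Kaggle laptop data (file name is a placeholder)
laptop <- read.csv("laptops.csv", stringsAsFactors = FALSE)

# quick look at rows, columns, types, and first values
glimpse(laptop)
```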

## Rows: 896
## Columns: 22
## $ brand           <chr> "ASUS", "ASUS", "ASUS", "HP", "HP", "Lenovo", "HP", "A…
## $ model           <chr> "Celeron", "VivoBook", "Vivobook", "Core", "Core", "Id…
## $ processor_brand <chr> "Intel", "Intel", "Intel", "Intel", "Intel", "Intel", …
## $ processor_name  <chr> "Celeron Dual", "Core i3", "Core i3", "Core i3", "Core…
## $ processor_gnrtn <chr> "Missing", "10th", "10th", "11th", "11th", "10th", "Mi…
## $ ram_gb          <chr> "4", "8", "8", "8", "8", "8", "8", "4", "8", "8", "8",…
## $ ram_type        <chr> "DDR4", "DDR4", "DDR4", "DDR4", "DDR4", "DDR4", "DDR4"…
## $ ssd             <int> 0, 512, 0, 512, 512, 0, 256, 256, 256, 256, 512, 256, …
## $ hdd             <int> 1024, 0, 1024, 0, 0, 1024, 1024, 0, 1024, 0, 0, 0, 0, …
## $ os              <chr> "Windows", "Windows", "Windows", "Windows", "Windows",…
## $ os_bit          <int> 64, 64, 64, 64, 64, 64, 64, 64, 64, 32, 64, 64, 64, 64…
## $ graphic_card_gb <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, …
## $ weight          <chr> "Casual", "Casual", "Casual", "ThinNlight", "ThinNligh…
## $ display_size    <chr> "15.6", "15.6", "14.1", "15.6", "15.6", "15.6", "15.6"…
## $ warranty        <int> 1, 1, 1, 1, 0, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 0, 1, …
## $ Touchscreen     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ msoffice        <chr> "No", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Ye…
## $ latest_price    <int> 23990, 37990, 32890, 42990, 54990, 35990, 41999, 34429…
## $ discount        <int> 11, 25, 30, 25, 21, 37, 13, 31, 24, 34, 23, 17, 17, 28…
## $ star_rating     <dbl> 3.8, 4.3, 3.9, 4.4, 4.2, 4.0, 4.3, 4.1, 5.0, 4.3, 4.1,…
## $ ratings         <int> 15279, 990, 28, 158, 116, 2124, 3524, 37, 7, 2080, 206…
## $ reviews         <int> 1947, 108, 4, 18, 15, 233, 432, 6, 5, 235, 17, 241, 53…

However, the data types of some features are not appropriate, so a few conversions between integer, double, and character are necessary.

For example (a conversion sketch follows the list):

  • ssd: should be character, because a laptop either has an SSD or not, and when it does, the size comes from a fixed set of values.
  • hdd: same as ssd
  • os_bit: the size is fixed (32 or 64 bit)
  • graphic_card_gb: same as ssd
  • discount: a percentage, so it is converted to a proportion (double)
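A minimal sketch of these conversions with `dplyr`, using the `laptop` object assumed above:

```r
laptop <- laptop %>%
  mutate(
    # fixed-size specs read better as categories than as numbers
    ssd             = as.character(ssd),
    hdd             = as.character(hdd),
    os_bit          = as.character(os_bit),
    graphic_card_gb = as.character(graphic_card_gb),
    # discount is a percentage, so store it as a proportion (double)
    discount        = discount / 100
  )
```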

Data Structure After Fixing

## Rows: 896
## Columns: 22
## $ brand           <chr> "ASUS", "ASUS", "ASUS", "HP", "HP", "Lenovo", "HP", "A…
## $ model           <chr> "Celeron", "VivoBook", "Vivobook", "Core", "Core", "Id…
## $ processor_brand <chr> "Intel", "Intel", "Intel", "Intel", "Intel", "Intel", …
## $ processor_name  <chr> "Celeron Dual", "Core i3", "Core i3", "Core i3", "Core…
## $ processor_gnrtn <chr> "Missing", "10th", "10th", "11th", "11th", "10th", "Mi…
## $ ram_gb          <chr> "4", "8", "8", "8", "8", "8", "8", "4", "8", "8", "8",…
## $ ram_type        <chr> "DDR4", "DDR4", "DDR4", "DDR4", "DDR4", "DDR4", "DDR4"…
## $ ssd             <chr> "0", "512", "0", "512", "512", "0", "256", "256", "256…
## $ hdd             <chr> "1024", "0", "1024", "0", "0", "1024", "1024", "0", "1…
## $ os              <chr> "Windows", "Windows", "Windows", "Windows", "Windows",…
## $ os_bit          <chr> "64", "64", "64", "64", "64", "64", "64", "64", "64", …
## $ graphic_card_gb <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
## $ weight          <chr> "Casual", "Casual", "Casual", "ThinNlight", "ThinNligh…
## $ display_size    <chr> "15.6", "15.6", "14.1", "15.6", "15.6", "15.6", "15.6"…
## $ warranty        <int> 1, 1, 1, 1, 0, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 0, 1, …
## $ Touchscreen     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", …
## $ msoffice        <chr> "No", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Ye…
## $ latest_price    <int> 23990, 37990, 32890, 42990, 54990, 35990, 41999, 34429…
## $ discount        <dbl> 0.11, 0.25, 0.30, 0.25, 0.21, 0.37, 0.13, 0.31, 0.24, …
## $ star_rating     <dbl> 3.8, 4.3, 3.9, 4.4, 4.2, 4.0, 4.3, 4.1, 5.0, 4.3, 4.1,…
## $ ratings         <int> 15279, 990, 28, 158, 116, 2124, 3524, 37, 7, 2080, 206…
## $ reviews         <int> 1947, 108, 4, 18, 15, 233, 432, 6, 5, 235, 17, 241, 53…

Examination of Numerical Variables

Collinearity

Only ratings and reviews are highly correlated, so only one of the two is kept.
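One way to check this is a plain correlation matrix of the numeric columns; a sketch, assuming the same `laptop` data frame:

```r
# correlation matrix of the numeric variables
laptop %>%
  select(where(is.numeric)) %>%
  cor() %>%
  round(2)
```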

Distributions

The distributions of ratings and reviews show the same pattern, which suggests again that they are highly correlated.
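The histograms behind that observation could be drawn roughly like this with `ggplot2` (the long format and bin count are my choices, not the original settings):

```r
library(ggplot2)
library(tidyr)

# histograms of ratings and reviews, each on its own scale
laptop %>%
  select(ratings, reviews) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")
```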

Examination of Categorical Variables

distribution of raw data

The categorical variables look pretty messy. Even though the initial check showed no NAs, there are plenty of missing values in processor generation and os that are recorded as the string “Missing”. Meanwhile, examining the laptop model names reveals typos throughout the records, and in my opinion a laptop model is defined by its configuration anyway, or else it is probably just a name. As a result, I am going to leave the model column out.

Factors are better than characters for further analysis, so I am going to convert all character variables into factors.
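A sketch of that tidying step, which also drops the `model` column mentioned above:

```r
# drop the noisy model column and turn every character column into a factor
laptop <- laptop %>%
  select(-model) %>%
  mutate(across(where(is.character), as.factor))
```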

distribution of tidy data

Now it looks much better.

Data Splitting

## [1] "dimension of training set: (672,44)"
## [1] "dimension of testing set: (224,44)"

Modeling

PCA model in training set

## Linear Regression 
## 
## 672 samples
##  43 predictor
## 
## Pre-processing: centered (43), scaled (43), principal component
##  signal extraction (43) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 672, 672, 672, 672, 672, 672, ... 
## Resampling results:
## 
##   RMSE      Rsquared  MAE     
##   29302.26  0.658741  17974.88
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
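The summary above is consistent with a `caret::train()` call along these lines; a sketch, assuming the `training` frame from the split sketch (the seed is a placeholder, and `preProcess = "pca"` keeps components up to caret's default 95% variance threshold):

```r
set.seed(1)  # placeholder seed

pca_lm <- train(
  latest_price ~ .,
  data       = training,
  method     = "lm",
  # center and scale the predictors, then project onto principal components
  preProcess = c("center", "scale", "pca"),
  trControl  = trainControl(method = "boot", number = 25)
)

pca_lm
```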

Model Performance

PCA model evaluation performance in testing set

## [1] "RMSE:24309.07"
## [1] "R-squared:0.630293"

Conclusion

PCA modeling is pretty useful when dealing with lots of features. I started with only 22 features, which is less than half of the total number of features used in the model. The number of features grows because dummy variables are created for all categorical variables, which I think makes the model easier to build. PCA then helps reduce the dimension of the data: as I examined earlier, even with 48 features, only 37 components are needed to capture 95% of the variance.
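That 95%-variance check can be reproduced with `prcomp` on the dummy-encoded predictors; a sketch, assuming the `training` frame from earlier (any constant dummy columns would need to be dropped before scaling):

```r
# PCA on the dummy-encoded predictors, excluding the response
X <- training[, setdiff(names(training), "latest_price")]
pca_fit <- prcomp(X, center = TRUE, scale. = TRUE)

# cumulative proportion of variance explained
cum_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)
which(cum_var >= 0.95)[1]  # number of components needed to reach 95%
```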

I also checked the marginal model plots. Most of them look fine; only 2 or 3 components fail to fit the model at the two ends. As a result, a PCA linear model may not be enough for this dataset, and other PCA-based models are worth trying.