1. Load dataset ggplot2::diamonds, perform a log transformation on the price, and create a new variable logPrice. Split your diamonds dataset into training (70%) and testing (30%) sets by setting random seed set.seed(1).

train_idx stores a sample of row indices; since the training set should be 70% of the data, the sample size is nrow(diamonds) * 0.7.
nrow() returns the number of rows in diamonds, so multiplying by 0.7 gives 70% of the rows.
The key function here is setdiff(): it takes the set difference, i.e. it picks out the rows that are not in the 70% training sample and assigns them to the testing set.

library(ggplot2)
diamonds = data.frame(diamonds)          # drop the tibble class so base indexing behaves as expected
diamonds$logPrice = log(diamonds$price)  # log-transformed outcome

set.seed(1); train_idx = sample(1:nrow(diamonds), nrow(diamonds) * 0.7)   # sample 70% of the row indices
train_d = diamonds[train_idx, ]                                           # training set (70%)
test_d = diamonds[setdiff(1:nrow(diamonds), train_idx), ]; rm(diamonds)   # remaining 30% as testing set

2. Create two histogram or density plots for “price” and “logPrice”. Do you see anything interesting? If you’d like to fit a general linear model to predict the price, should we do the log transformation on the outcome “price”?
ggplot(train_d, aes(x = price)) + geom_histogram(color = 'blue', bins = 100) 

ggplot(train_d, aes(x = logPrice)) + geom_histogram(color = 'red', bins = 100)

A general linear model assumes that the distribution of the outcome (price) is approximately normal.
The raw price is heavily right-skewed, while the log-transformed price is much closer to normal, so taking the log of the outcome before fitting the model looks like the better choice.
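As a quick extra check (not in the original write-up), normal QQ plots make the same point; a minimal base-R sketch:

# Quick normality check: price deviates strongly from the reference line,
# while logPrice follows it much more closely
qqnorm(train_d$price, main = "QQ plot: price");       qqline(train_d$price)
qqnorm(train_d$logPrice, main = "QQ plot: logPrice"); qqline(train_d$logPrice)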


3. Use any modeling techniques you know to create models (from the training set) that better predict your logPrice. Do any transformation on your input (X) if you’d like. Please report your RMSE on the testing set.

The main purpose of create_lm_transformer is to return a list that stores, for every column other than targetName (the outcome y), a simple lm linear model of y on that single x.
For example, suppose a data.frame df has four variables a, b, c and d, and I want to understand the effect of b, c and d on a.

Calling create_lm_transformer(df, "a") returns a list containing these three linear models: lm(a ~ b), lm(a ~ c) and lm(a ~ d). A toy illustration follows below.
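For instance, a toy illustration (hypothetical data; it relies on the create_lm_transformer defined further below):

# Hypothetical toy data: one simple model per predictor, all against a
df_toy = data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100), d = rnorm(100))
toy_lms = create_lm_transformer(df_toy, "a")
names(toy_lms)   # "b" "c" "d"  -- i.e. lm(a ~ b), lm(a ~ c), lm(a ~ d)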

Next, here is how the function works internally.

This article explains very clearly what non-standard evaluation is about, and that concept is used inside this function!

setdiff() removes targetName from the column names and stores the remaining predictor names in X_colnames.
lapply() then feeds each name in X_colnames into lm() to fit a simple linear model of y on that predictor.
Because lms is a list, names() is used to label each fitted model with the name of its predictor.

# Remove "price"
train_d$price = NULL; test_d$price = NULL;

# Function to create linear model transformer (transformation "plan" object) as y_i = f(x_i)
create_lm_transformer = function(df, targetName){
  X_colnames = setdiff(colnames(df), targetName)
  # Create linear models y ~ x_i 
  lms = lapply(X_colnames, function(x) eval(parse(text = paste("lm(", targetName, " ~ ", x , ", data = df)", sep = "") )))
  names(lms) = X_colnames
  return(lms)
}
# Function to apply plan object to dataset
predict_lm_transformer = function(df, targetName, lm_transformer){
  X_colnames = setdiff(colnames(df), targetName)
  convertedXY = sapply(X_colnames, function(x) predict( lm_transformer[[x]], newdata = df[, X_colnames]))
  return(convertedXY)
}
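
As a side note on the design: the string-building plus eval(parse(text = ...)) trick works, but base R's reformulate() can construct the same formula directly. A sketch of an equivalent version (an alternative, not the code used for the results below):

# Alternative sketch: build the formula with reformulate() instead of parsing a string
create_lm_transformer_v2 = function(df, targetName){
  X_colnames = setdiff(colnames(df), targetName)
  lms = lapply(X_colnames, function(x) lm(reformulate(x, response = targetName), data = df))
  names(lms) = X_colnames
  return(lms)
}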

# Create lm transformation plan object
dia_lm_transformer = create_lm_transformer(train_d, "logPrice")
# Convert X of training and testing dataset into new feature spaces
train_d_t = predict_lm_transformer(train_d, "logPrice", dia_lm_transformer)
train_d_t = as.data.frame(train_d_t)
train_d_t$logPrice = train_d$logPrice
head(train_d_t)  # inspect the transformed training features
##          carat      cut    color  clarity    depth    table        x
## 14322 9.186193 8.114171 8.144895 8.165505 7.792173 7.895712 8.998311
## 20072 8.714339 7.952502 7.790557 7.764232 7.786155 7.966462 8.825046
## 30899 7.023528 7.636722 7.616861 7.848618 7.789091 7.471212 6.927784
## 48987 6.925225 7.636722 7.911107 7.848618 7.788063 7.471212 6.780508
## 10878 8.281806 7.952502 7.763981 7.848618 7.788797 7.895712 8.391881
## 48455 7.298777 7.797603 7.763981 7.732115 7.786449 7.711762 7.395602
##              y        z logPrice
## 14322 8.972414 9.235546 8.665786
## 20072 8.867529 8.717817 9.053570
## 30899 6.962119 7.071703 6.612041
## 48987 6.804791 6.885851 6.287859
## 10878 8.369325 8.439039 8.492696
## 48455 7.434101 7.403580 7.588324
test_d_t = predict_lm_transformer(test_d, "logPrice", dia_lm_transformer)
test_d_t = as.data.frame(test_d_t)
test_d_t$logPrice = test_d$logPrice
head(test_d_t)  # inspect the transformed testing features
##       carat      cut    color  clarity    depth    table        x        y
## 2  6.630316 7.952502 7.577313 7.848618 7.784394 8.037212 6.191404 6.131779
## 4  6.787601 7.952502 8.039615 7.764232 7.788210 7.824962 6.459966 6.472656
## 8  6.728619 7.797603 7.911107 7.848618 7.787476 7.612712 6.347343 6.367771
## 9  6.649977 8.114171 7.577313 7.764232 7.792173 8.037212 6.174077 6.079337
## 13 6.649977 7.952502 7.763981 7.848618 7.785274 8.037212 6.182741 6.131779
## 15 6.610656 7.952502 7.577313 8.165505 7.784981 8.107962 6.104771 6.053116
##           z logPrice
## 2  6.155720 5.786897
## 4  6.580523 5.811141
## 8  6.447772 5.820083
## 9  6.394672 5.820083
## 13 6.182270 5.834811
## 15 6.102619 5.843544

General linear model on all raw X

lm_logPrice = lm(logPrice ~ ., data = train_d) 
yhat = as.numeric(predict(lm_logPrice, newdata = test_d))
round(sqrt(mean( (test_d$logPrice - yhat)  ^ 2 )), 4) 
## [1] 0.3584

General linear model on the transformed X. It doesn't help: the RMSE is slightly worse than with the raw X.

lm_logPrice = lm(logPrice ~ ., data = train_d_t) 
yhat = as.numeric(predict(lm_logPrice, newdata = test_d_t))
round(sqrt(mean( (test_d$logPrice - yhat)  ^ 2 )), 4) 
## [1] 0.3603

Doing some feature engineering may help; here I add the interaction terms x:y and carat:clarity (a sketch of actual variable selection follows the output below).

lm_logPrice = lm(logPrice ~ . + x:y + carat:clarity, data = train_d) # add interaction term x:y and carat:clarity
yhat = as.numeric(predict(lm_logPrice, newdata = test_d))
round(sqrt(mean( (test_d$logPrice - yhat)  ^ 2 )), 4) 
## [1] 0.1869
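
If you also want genuine variable selection, here is a minimal sketch using backward stepwise AIC selection with base R's step() (an extra idea, not part of the reported results; strictly speaking the selection should be validated on data other than the final testing set):

# Sketch: backward stepwise selection by AIC, starting from the full model
full_fit = lm(logPrice ~ ., data = train_d)
step_fit = step(full_fit, direction = "backward", trace = 0)
yhat_step = as.numeric(predict(step_fit, newdata = test_d))
round(sqrt(mean((test_d$logPrice - yhat_step) ^ 2)), 4)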

Using KNN regression on the transformed X looks promising.
You may try different k (a sketch follows the output below).

yhat = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[,-10], y = train_d_t[, 10], k = 6)
round(sqrt(mean( (test_d$logPrice - yhat$pred)  ^ 2 )), 4) 
## [1] 0.1189
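
The k = 6 above was picked by hand; a minimal sketch for comparing a few k values (ideally this tuning would use a validation split rather than the testing set):

# Sketch: evaluate a few candidate k values
for (k in c(2, 4, 6, 8, 10)) {
  pred_k = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[, -10],
                        y = train_d_t[, 10], k = k)$pred
  cat("k =", k, "RMSE =", round(sqrt(mean((test_d$logPrice - pred_k) ^ 2)), 4), "\n")
}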

RandomForest using package “ranger”. Try to beat the power rangers? :)

library(ranger)
rg = ranger(logPrice ~ ., data = train_d, num.trees = 200, seed = 1)
yhat = predict(rg, data = test_d[, -10])
y_yhat = test_d$logPrice - yhat$predictions   # residuals on the testing set
round(sqrt(mean(y_yhat ^ 2)), 4)
## [1] 0.091
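
To push the forest a bit further, here is a sketch that varies mtry, the number of candidate variables per split (again, proper tuning would use a validation split rather than the testing set):

# Sketch: try a few mtry values with the same number of trees and seed
for (m in c(2, 3, 5)) {
  rg_m = ranger(logPrice ~ ., data = train_d, num.trees = 200, mtry = m, seed = 1)
  pred_m = predict(rg_m, data = test_d[, -10])$predictions
  cat("mtry =", m, "RMSE =", round(sqrt(mean((test_d$logPrice - pred_m) ^ 2)), 4), "\n")
}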

Note that non-linear feature transformations work quite well in this particular case (predicting diamond prices).
You may try other kinds of feature representations/transformations and see if they work.
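
For instance, carat and price are roughly linear on a log-log scale, so one more transformation worth trying is log(carat); a sketch (its RMSE is not part of the results above):

# Sketch: replace carat with log(carat) in the linear model
train_d$logCarat = log(train_d$carat); test_d$logCarat = log(test_d$carat)
lm_logcarat = lm(logPrice ~ . - carat, data = train_d)
yhat_lc = as.numeric(predict(lm_logcarat, newdata = test_d))
round(sqrt(mean((test_d$logPrice - yhat_lc) ^ 2)), 4)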