Using ggplot2::diamonds, perform a log transformation on price and create a new variable logPrice. Split the diamonds dataset into training (70%) and testing (30%) sets, setting the random seed with set.seed(1).
train_idx stores the sampled row indices; since the training set should be 70% of the data, the sample size is nrow(diamonds) * 0.7.
nrow() returns the number of rows in diamonds, so multiplying by 0.7 gives 70% of the data.
The key function here is setdiff(), which takes the set difference and picks out the rows that are not in the 70% training sample.
library(ggplot2)
diamonds = data.frame(diamonds)
diamonds$logPrice = log(diamonds$price)
set.seed(1); train_idx = sample(1:nrow(diamonds), nrow(diamonds) * 0.7 )
train_d = diamonds[train_idx, ]
test_d = diamonds[setdiff( 1:nrow(diamonds), train_idx), ]; rm(diamonds)
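A quick sanity check of the split (a sketch, not part of the original code): the two pieces should be roughly 70/30 of the 53,940 rows and share no rows.
nrow(train_d); nrow(test_d)                               # roughly 70% and 30% of the data
length(intersect(rownames(train_d), rownames(test_d)))    # expect 0, i.e., no overlap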
ggplot(train_d, aes(x = price)) + geom_histogram(color = 'blue', bins = 100)
ggplot(train_d, aes(x = log(price))) + geom_histogram(color = 'red', bins = 100)
A general linear model assumes that the distribution of the outcome, price, is approximately normal.
Judging from the two histograms, the log-transformed price is much closer to normal, so modeling logPrice works better.
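As a quick numeric companion to the histograms, the sample skewness of price versus logPrice can be compared on the training set. skew() below is a small helper written only for this sketch; it is not part of the original analysis.
skew = function(v) mean((v - mean(v)) ^ 3) / sd(v) ^ 3   # rough sample skewness
skew(train_d$price)      # strongly right-skewed (large positive value)
skew(train_d$logPrice)   # much closer to 0, i.e., closer to symmetric / normal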
Next, predict logPrice. Do any transformation on your input (X) if you'd like. Please report your RMSE on the testing set.
The main purpose of create_lm_transformer is to return a list that stores, for every input column other than targetName (i.e., y), the lm linear model of y on that single x.
Suppose I have a data.frame df with four variables a, b, c, and d, and I want to understand how b, c, and d relate to a.
Calling create_lm_transformer(df, "a") returns a list containing:
the three linear models lm(a ~ b), lm(a ~ c), and lm(a ~ d). (A runnable toy version of this example appears right after the two function definitions below.)
Next, let's walk through how this function works internally.
This article explains very clearly what non-standard evaluation is about, and that concept is used inside this function!
setdiff() drops targetName from the column names and stores the remaining names in X_colnames.
lapply() feeds each name in X_colnames into lm() to fit a linear model of y on that single x.
names() then labels each fitted model, since lms is a list.
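To make the string-evaluation step concrete, here is the same pattern on a single column, together with reformulate(), a base-R alternative that builds the formula without parsing a string. This is only a side-by-side sketch; the function below keeps the eval(parse(...)) approach.
# Build the model call as text, parse it into an expression, then evaluate it.
eval(parse(text = "lm(logPrice ~ carat, data = train_d)"))
# Equivalent model via reformulate(), avoiding string parsing.
lm(reformulate("carat", response = "logPrice"), data = train_d)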
# Remove "price"
train_d$price = NULL; test_d$price = NULL;
# Function to create linear model transformer (transformation "plan" object) as y_i = f(x_i)
create_lm_transformer = function(df, targetName){
X_colnames = setdiff(colnames(df), targetName)
# Create linear models y ~ x_i
lms = lapply(X_colnames, function(x) eval(parse(text = paste("lm(", targetName, " ~ ", x , ", data = df)", sep = "") )))
names(lms) = X_colnames
return(lms)
}
# Function to apply plan object to dataset
predict_lm_transformer = function(df, targetName, lm_transformer){
X_colnames = setdiff(colnames(df), targetName)
convertedXY = sapply(X_colnames, function(x) predict( lm_transformer[[x]], newdata = df[, X_colnames]))
return(convertedXY)
}
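As promised above, here is the toy a/b/c/d example run end-to-end. The data.frame df_toy and its rnorm() values are made up purely for illustration and are not part of the diamonds analysis.
df_toy = data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20), d = rnorm(20))
toy_transformer = create_lm_transformer(df_toy, "a")
names(toy_transformer)                                    # "b" "c" "d": one lm(a ~ x_i) per input column
toy_converted = predict_lm_transformer(df_toy, "a", toy_transformer)
head(toy_converted)                                       # each column holds the fitted values of a from that single x_i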
# Create lm transformation plan object
dia_lm_transformer = create_lm_transformer(train_d, "logPrice")
# Convert X of training and testing dataset into new feature spaces
train_d_t = predict_lm_transformer(train_d, "logPrice", dia_lm_transformer)
train_d_t = as.data.frame(train_d_t)
train_d_t$logPrice = train_d$logPrice
head(train_d_t)   # take a look at train_d_t
## carat cut color clarity depth table x
## 14322 9.186193 8.114171 8.144895 8.165505 7.792173 7.895712 8.998311
## 20072 8.714339 7.952502 7.790557 7.764232 7.786155 7.966462 8.825046
## 30899 7.023528 7.636722 7.616861 7.848618 7.789091 7.471212 6.927784
## 48987 6.925225 7.636722 7.911107 7.848618 7.788063 7.471212 6.780508
## 10878 8.281806 7.952502 7.763981 7.848618 7.788797 7.895712 8.391881
## 48455 7.298777 7.797603 7.763981 7.732115 7.786449 7.711762 7.395602
## y z logPrice
## 14322 8.972414 9.235546 8.665786
## 20072 8.867529 8.717817 9.053570
## 30899 6.962119 7.071703 6.612041
## 48987 6.804791 6.885851 6.287859
## 10878 8.369325 8.439039 8.492696
## 48455 7.434101 7.403580 7.588324
test_d_t = predict_lm_transformer(test_d, "logPrice", dia_lm_transformer)
test_d_t = as.data.frame(test_d_t)
test_d_t$logPrice = test_d$logPrice
head(test_d_t)   # take a look at test_d_t
## carat cut color clarity depth table x y
## 2 6.630316 7.952502 7.577313 7.848618 7.784394 8.037212 6.191404 6.131779
## 4 6.787601 7.952502 8.039615 7.764232 7.788210 7.824962 6.459966 6.472656
## 8 6.728619 7.797603 7.911107 7.848618 7.787476 7.612712 6.347343 6.367771
## 9 6.649977 8.114171 7.577313 7.764232 7.792173 8.037212 6.174077 6.079337
## 13 6.649977 7.952502 7.763981 7.848618 7.785274 8.037212 6.182741 6.131779
## 15 6.610656 7.952502 7.577313 8.165505 7.784981 8.107962 6.104771 6.053116
## z logPrice
## 2 6.155720 5.786897
## 4 6.580523 5.811141
## 8 6.447772 5.820083
## 9 6.394672 5.820083
## 13 6.182270 5.834811
## 15 6.102619 5.843544
General linear model on all raw X.
lm_logPrice = lm(logPrice ~ ., data = train_d)
yhat = as.numeric(predict(lm_logPrice, newdata = test_d))
round(sqrt(mean( (test_d$logPrice - yhat) ^ 2 )), 4)
## [1] 0.3584
General linear model on transformed X. It does slightly worse than using the raw X.
lm_logPrice = lm(logPrice ~ ., data = train_d_t)
yhat = as.numeric(predict(lm_logPrice, newdata = test_d_t))
round(sqrt(mean( (test_d$logPrice - yhat) ^ 2 )), 4)
## [1] 0.3603
Doing some feature engineering, e.g., adding interaction terms, may help.
lm_logPrice = lm(logPrice ~ . + x:y + carat:clarity, data = train_d) # add interaction term x:y and carat:clarity
yhat = as.numeric(predict(lm_logPrice, newdata = test_d))
round(sqrt(mean( (test_d$logPrice - yhat) ^ 2 )), 4)
## [1] 0.1869
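As a rough check that the interaction terms genuinely add explanatory power, one option is a nested-model F-test against the plain additive model. This is only a sketch; with roughly 38,000 training rows almost any added term comes out significant, so the RMSE comparison above is the more meaningful evidence.
lm_base = lm(logPrice ~ ., data = train_d)    # additive model without interactions
anova(lm_base, lm_logPrice)                   # F-test comparing the nested models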
Using KNN for regression on transformed X. Looks promising.
You may try different k.
yhat = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[,-10], y = train_d_t[, 10], k = 6)
round(sqrt(mean( (test_d$logPrice - yhat$pred) ^ 2 )), 4)
## [1] 0.1189
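The k = 6 above is somewhat arbitrary. A simple loop over a few candidate values shows how sensitive the result is; strictly speaking k should be tuned on a separate validation split rather than on the test set, so treat this purely as a sketch.
for (k in c(3, 6, 10, 20)) {
  fit = FNN::knn.reg(train = train_d_t[, -10], test = test_d_t[, -10], y = train_d_t[, 10], k = k)
  cat("k =", k, "RMSE =", round(sqrt(mean((test_d_t$logPrice - fit$pred) ^ 2)), 4), "\n")
}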
Random forest using the package “ranger”. Try to beat the power rangers? :)
library(ranger);
rg = ranger(logPrice ~ ., data = train_d, num.trees = 200, seed = 1)
yhat = predict(rg, data = test_d[, -10])
y_yhat = test_d$logPrice - yhat$predictions
round(sqrt(mean( y_yhat ^ 2 )), 4)
## [1] 0.091
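num.trees = 200 with the default mtry is not necessarily optimal. The main knobs worth experimenting with in ranger are num.trees and mtry; the values below are arbitrary examples, not tuned settings.
rg2 = ranger(logPrice ~ ., data = train_d, num.trees = 500, mtry = 5, seed = 1)   # larger forest, explicit mtry
yhat2 = predict(rg2, data = test_d[, -10])
round(sqrt(mean( (test_d$logPrice - yhat2$predictions) ^ 2 )), 4)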
Note that non-linear feature transformations work great in this particular case (predicting diamond prices).
You may try other kinds of feature representations/transformations and see if they work.
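As one example of another representation, simple polynomial terms on carat can be added on top of the raw features in the linear model. This is only a quick sketch with arbitrarily chosen powers, not a tuned model.
lm_poly = lm(logPrice ~ . + I(carat^2) + I(carat^3), data = train_d)   # add squared and cubed carat
yhat_poly = predict(lm_poly, newdata = test_d)
round(sqrt(mean( (test_d$logPrice - yhat_poly) ^ 2 )), 4)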