Lecture notes compiled from Professor 莊皓鈞's course
Regression is a statistical method to assess relationships among variables.
The objective is to find a model for predicting the dependent (response) variable given one or more independent (predictor) variables. Being proficient in regression modeling will enable you to answer many questions with data formally and will provide a solid foundation for business data analytics.
Given limited time, I will try to minimize the mathematical foundations of statistical models in the remaining lectures. Without compromising your learning, I will set some technical details aside and focus on applied statistical modeling in R.
For a dataset with n observations, a simple linear regression model is defined by
\[y_i = \beta_0 + \beta_1x_i+\varepsilon_i,\quad i=1,2,...,n\]
where \(\varepsilon_i\) is a random error term with mean zero and variance \(\sigma^2\).
The word simple means that there is one and only one independent variable in the model.
As we collect a data set of sample size \(n\), in most cases we estimate \(\beta_0\) (intercept) and \(\beta_1\) (slope) using ordinary least squares (OLS), which basically solves the optimization problem:
\[\min_{\hat{\beta}_0,\hat{\beta}_1} \sum_{i=1}^n(y_i-\hat{y}_i)^2, \quad \text{where } \hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i\]
That is, OLS minimizes the discrepancy between the observed and fitted values. We can write the objective function out as:
\[\min\;(y_1 - \hat{\beta}_0-\hat{\beta}_1x_1)^2 + (y_2 - \hat{\beta}_0-\hat{\beta}_1x_2)^2+...+(y_n - \hat{\beta}_0-\hat{\beta}_1x_n)^2\]
Using some calculus, we can show that:
\[\hat{\beta_1}=\frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2}\]
and \[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]
data(cars); attach(cars)
plot(cars)
This plot shows the relationship between a car's stopping distance (dist) and its speed. Suppose we want to estimate the relationship with the following model:
\[dist_i = \beta_0+\beta_1\,speed_i+\varepsilon_i,\quad i=1,2,...,50\]
L1=lm(dist ~ speed)
summary(L1)
Call:
lm(formula = dist ~ speed)
Residuals:
Min 1Q Median 3Q Max
-29.07 -9.53 -2.27 9.21 43.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.579 6.758 -2.60 0.012 *
speed 3.932 0.416 9.46 0.0000000000015 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.4 on 48 degrees of freedom
Multiple R-squared: 0.651, Adjusted R-squared: 0.644
F-statistic: 89.6 on 1 and 48 DF, p-value: 0.00000000000149
From the fitted linear model we obtain
\(\hat{\beta}_0=-17.5791\), \(\hat{\beta}_1=3.9324\).
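As a quick check, the closed-form formulas above reproduce these numbers (a short sketch; b0 and b1 are names introduced here):
# verify the closed-form OLS formulas against the lm() estimates above
x = cars$speed; y = cars$dist
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 = mean(y) - b1 * mean(x)
c(b0, b1)   # approximately -17.579 and 3.932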
Essentially, parameter estimates come with uncertainty, which is quantified for both \(\hat{\beta}_0\) and \(\hat{\beta}_1\) by a standard error (SE). The SE has two main uses: building confidence intervals and testing hypotheses about the coefficients.
Essentially, we want to determine whether \(\hat{\beta_1}\) is sufficiently far from zero so that we can be confident that \(\beta_1\) is non-zero. In practice, we compute a t-statistic
\[t=\frac{\hat{\beta_1}-0}{SE(\hat{\beta_1})}\]
which measures the number of standard deviations that \(\hat{\beta_1}\) is away from 0.
The p-value is nothing but the probability of observing a value \(\ge |t|\) in absolute terms, assuming \(\beta_1=0\). A small p-value indicates that the observed association between x and y is probably not due to chance. We reject H0 and declare that a relationship between x and y exists if the p-value is small enough.
t-value = estimate / std error
Interpretation of the t statistic: the numerator is the estimate (our best guess, the one giving the best fit), and the denominator is the standard error (a measure of uncertainty).
- High uncertainty -> large SE -> small t-value -> large p-value
- Low uncertainty -> small SE -> large t-value -> small p-value -> reject H0
A quick check against the summary(L1) output is shown below.
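For the speed coefficient, for instance (a short check; coefs is a name introduced here):
# t value for speed = estimate / standard error
coefs = summary(L1)$coefficients
coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]   # about 9.46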
In addition to the estimated coefficients \(\hat{\beta}_0\) and \(\hat{\beta}_1\), we are interested in the estimated residuals.
\[\hat{\varepsilon}_i = y_i-\hat{y}_i\]
res = residuals(L1)
mean(res)
[1] 0.00000000000000000919
Residuals allow us to detect potential outliers that the model fails to predict well (DON’T throw outliers away if you have a large sample) and help us assess model specification errors.
yhat = fitted(L1)
plot(yhat, res)
# Do you notice outliers?
lm() is a very powerful function and we will see it many times. Below are functions you can apply to a fitted object such as L1 (Kleiber & Zeileis, 2008).
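A partial list of such extractor functions (all standard generics that work on a fitted lm object; see Kleiber & Zeileis, 2008, for the full table):
coef(L1)                              # estimated coefficients
confint(L1)                           # confidence intervals for the coefficients
fitted(L1)                            # fitted values
residuals(L1)                         # residuals
predict(L1, data.frame(speed = 21))   # prediction for a new observation
anova(L1)                             # analysis-of-variance table
vcov(L1)                              # covariance matrix of the estimates
AIC(L1)                               # Akaike information criterion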
Multiple linear regression refers to a model with more than one independent variable:
\[y_i = \beta_0 + \beta_1x_{1i}+...+\beta_kx_{ki}+\varepsilon_i,\quad i=1,2,...,n\]
Again, what OLS does is solve the same kind of optimization problem: minimize the sum of squared errors.
The objective function in this optimization problem is called the SSE (sum of squared errors). We can also rescale it into the MSE (mean squared error):
\[MSE = \frac{1}{n}\sum^n_{i=1}(y_i-\hat{y}_i)^2\]
errorMSE = function(par){
  obj = 0
  for (i in 1:nrow(cars)) {
    # add the squared error of observation i under intercept par[1] and slope par[2]
    obj = (dist[i] - (par[1] + par[2]*speed[i]))^2 + obj
  }
  obj  # note: this returns the SSE; minimizing SSE or MSE = SSE/n gives the same estimates
}
nlminb(c(-5,1), errorMSE)
$par
[1] -17.58 3.93
$objective
[1] 11354
$convergence
[1] 0
$iterations
[1] 6
$evaluations
function gradient
10 18
$message
[1] "relative convergence (4)"
errorMLE = function(par){
  loglik = 0
  for (i in 1:nrow(cars)) {
    # log-density of dist[i] with mean par[1] + par[2]*speed[i] and sd par[3]
    loglik = dnorm(dist[i], (par[1] + par[2]*speed[i]), par[3], log=TRUE) + loglik
  }
  -loglik  # return the negative log-likelihood so nlminb() minimizes it
}
nlminb(c(-5,1,2), errorMLE, lower=c(-Inf,-Inf,1e-5))
Error in dist[i] : object of type 'closure' is not subsettable
This error means that dist was resolved to the base R function stats::dist rather than to the dist column of cars (which happens when cars is not attached in the current session). Referring to cars$dist and cars$speed explicitly avoids the masking problem.
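A version of the same negative log-likelihood that avoids the masking problem by referencing the columns explicitly (a sketch; errorMLE2 is a name introduced here, and its minimizers should essentially match the OLS estimates, with \(\hat{\sigma}\) as the third parameter):
# same negative log-likelihood, written without relying on attach()
errorMLE2 = function(par){
  mu = par[1] + par[2]*cars$speed                          # conditional mean
  -sum(dnorm(cars$dist, mean=mu, sd=par[3], log=TRUE))     # negative log-likelihood
}
nlminb(c(-5, 1, 2), errorMLE2, lower=c(-Inf, -Inf, 1e-5))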
\[E[\mathrm{MSE}]=(\text{Model Bias})^2 + \text{Model Variance} + \text{Irreducible Error}\]
The first term is the squared bias that reflects how close the functional form of the model can get to the true relationship between the predictors and the outcome.
The second term is the model variance. Here the famous bias-variance trade-off arises: complex models tend to have high variance and over-fit, whereas simple models tend to have high bias and under-fit.
💡 As model complexity increases, variance increases and bias decreases.
As mentioned earlier, we impose assumptions on the error term \(\varepsilon_i\). Essentially we assume the errors are independent and identically distributed (iid) with \(\varepsilon_i \sim N(0,\sigma^2)\).
A direct consequence of this assumption is that the response variable y must be continuous and iid. This assumption is so critical but often ignored by many analysts (including a lot of professors). On some occasions you will have to deal with non-continuous and/or non-iid y observations, which require a set of different techniques.
Let’s analyze the dataset wine.df. In March 1990, Orley Ashenfelter, a Princeton economics professor, claimed he could predict wine quality without tasting the wine. Ashenfelter used a linear regression model:
click to download: wine.csv
detach(cars)
#Use the wine.csv file
wine.df = read.csv("wine.csv")
str(wine.df)
'data.frame': 25 obs. of 7 variables:
$ Year : int 1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
$ Price : num 7.5 8.04 7.69 6.98 6.78 ...
$ WinterRain : int 600 690 502 420 582 485 763 830 697 608 ...
$ AGST : num 17.1 16.7 17.1 16.1 16.4 ...
$ HarvestRain: int 160 80 130 110 187 187 290 38 52 155 ...
$ Age : int 31 30 28 26 25 24 23 22 21 20 ...
$ FrancePop : num 43184 43495 44218 45152 45654 ...
summary(wine.df)
Year Price WinterRain AGST HarvestRain
Min. :1952 Min. :6.20 Min. :376 Min. :15.0 Min. : 38
1st Qu.:1960 1st Qu.:6.52 1st Qu.:536 1st Qu.:16.2 1st Qu.: 89
Median :1966 Median :7.12 Median :600 Median :16.5 Median :130
Mean :1966 Mean :7.07 Mean :605 Mean :16.5 Mean :149
3rd Qu.:1972 3rd Qu.:7.50 3rd Qu.:697 3rd Qu.:17.1 3rd Qu.:187
Max. :1978 Max. :8.49 Max. :830 Max. :17.6 Max. :292
Age FrancePop
Min. : 5.0 Min. :43184
1st Qu.:11.0 1st Qu.:46584
Median :17.0 Median :50255
Mean :17.2 Mean :49694
3rd Qu.:23.0 3rd Qu.:52894
Max. :31.0 Max. :54602
#relationship between variables
library(gpairs)
gpairs(wine.df[2:7])
cor(wine.df[2:7])
Price WinterRain AGST HarvestRain Age FrancePop
Price 1.000 0.13665 0.6596 -0.5633 0.448 -0.46686
WinterRain 0.137 1.00000 -0.3211 -0.2754 -0.017 -0.00162
AGST 0.660 -0.32109 1.0000 -0.0645 0.247 -0.25916
HarvestRain -0.563 -0.27544 -0.0645 1.0000 -0.028 0.04126
Age 0.448 -0.01697 0.2469 -0.0280 1.000 -0.99449
FrancePop -0.467 -0.00162 -0.2592 0.0413 -0.994 1.00000
library(corrplot)
corrplot.mixed(corr=cor(wine.df[2:7], use="complete.obs"),
upper="ellipse", tl.pos="lt")
Consider a multiple linear regression model with two predictors:
m1 = lm(Price ~ AGST + HarvestRain, data=wine.df)
summary(m1)
Call:
lm(formula = Price ~ AGST + HarvestRain, data = wine.df)
Residuals:
Min 1Q Median 3Q Max
-0.8832 -0.1960 0.0618 0.1538 0.5972
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.20265 1.85443 -1.19 0.24759
AGST 0.60262 0.11128 5.42 0.000019 ***
HarvestRain -0.00457 0.00101 -4.52 0.00017 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.367 on 22 degrees of freedom
Multiple R-squared: 0.707, Adjusted R-squared: 0.681
F-statistic: 26.6 on 2 and 22 DF, p-value: 0.00000135
R-squared: \(R^2 \approx \mathrm{corr}(y, \hat{y})^2\); the correlation coefficient lies in \([-1, 1]\), so \(R^2\) lies in \([0, 1]\).
yhat_m1 = fitted(m1)
(cor(wine.df$Price, yhat_m1))^2
[1] 0.707
This matches the Multiple R-squared reported by lm(); the higher the value, the better the in-sample fit.
F-statistic
H0: \(\beta_1 =\beta_2=...=\beta_k=0\)
When the p-value is large and the F-statistic is small, you cannot even reject this null hypothesis -> the model is no better than the intercept-only model \(y=\beta_0\).
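The overall F statistic can also be reproduced by explicitly comparing the fitted model with the intercept-only model (a short sketch; m0 is a name introduced here):
# compare m1 against the intercept-only model; the F test in this comparison
# equals the overall F statistic reported by summary(m1)
m0 = lm(Price ~ 1, data = wine.df)
anova(m0, m1)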
m2 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine.df)
summary(m2)
Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age +
FrancePop, data = wine.df)
Residuals:
Min 1Q Median 3Q Max
-0.4818 -0.2466 -0.0073 0.2201 0.5199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4503989 10.1888839 -0.04 0.96520
AGST 0.6012239 0.1030203 5.84 0.000013 ***
HarvestRain -0.0039581 0.0008751 -4.52 0.00023 ***
WinterRain 0.0010425 0.0005310 1.96 0.06442 .
Age 0.0005847 0.0790031 0.01 0.99417
FrancePop -0.0000495 0.0001667 -0.30 0.76958
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.302 on 19 degrees of freedom
Multiple R-squared: 0.829, Adjusted R-squared: 0.784
F-statistic: 18.5 on 5 and 19 DF, p-value: 0.00000104
Model m2 has 5 predictors -> R-squared is about 0.83. In general, adding more predictors never decreases R-squared.
m3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine.df)
summary(m3)
Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine.df)
Residuals:
Min 1Q Median 3Q Max
-0.4547 -0.2427 0.0075 0.1977 0.5364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.429980 1.765898 -1.94 0.06631 .
AGST 0.607209 0.098702 6.15 0.0000052 ***
HarvestRain -0.003972 0.000854 -4.65 0.00015 ***
WinterRain 0.001076 0.000507 2.12 0.04669 *
Age 0.023931 0.008097 2.96 0.00782 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.295 on 20 degrees of freedom
Multiple R-squared: 0.829, Adjusted R-squared: 0.794
F-statistic: 24.2 on 4 and 20 DF, p-value: 0.000000204
Model m3 has 4 predictors -> Age becomes significant and its coefficient grows, while R-squared stays around 0.83.
What we should really compare is the adjusted R-squared: 0.794 for m3 versus 0.784 for m2, so m3 is the better model. m3 also matches our expectation that wine age is related to price.
💡 Adjusted R-squared penalizes the number of predictors.
\[\bar{R}^2=1-(1-R^2)\frac{n-1}{n-p-1}\]
With n fixed, as the number of predictors p grows, the denominator \(n-p-1\) shrinks and the penalty factor \(\frac{n-1}{n-p-1}\) grows, so \(\bar{R}^2\) increases only if \(R^2\) improves enough to offset the penalty.
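As a check with the m3 output above (\(R^2\approx 0.829\), \(n=25\), \(p=4\)):
\[\bar{R}^2 = 1-(1-0.829)\frac{25-1}{25-4-1} = 1-0.171\times 1.2 \approx 0.795,\]
which matches the reported adjusted R-squared of 0.794 up to rounding.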
Note that Age and FrancePop are strongly negatively correlated: a higher Age means an earlier vintage, when France's population was smaller -> FrancePop is redundant information.
library(jtools)  # provides plot_summs()
plot_summs(m3, scale=TRUE)
Age is in fact a categorical variable. Yet, in the prior models, we treated it as a numerical variable. We now show how to correctly model a categorical variable (e.g., gender, cities) using dummy variables. The most common way to code dummy variables is to create j-1 dummies for j categories, in which the excluded category serves as the base level (including j dummies for j categories alongside an intercept makes the regression model NON-identifiable).
We first categorize Age: ageCat = 1 if Age <= 10; ageCat = 2 if 10 < Age <= 20; ageCat = 3 if Age > 20.
ageCat = rep(0, length(wine.df$Age))
for (i in 1:length(wine.df$Age)) {
  if (wine.df$Age[i] > 20) {
    ageCat[i] = 3
  } else if (wine.df$Age[i] <= 10) {
    ageCat[i] = 1
  } else {
    ageCat[i] = 2
  }
}
wine.new.df=cbind(wine.df,ageCat)
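An equivalent, more idiomatic way to build the three categories is base R's cut() (a sketch; ageCat.cut is a name introduced here for comparison):
# same three categories via cut(): (-Inf,10] -> 1, (10,20] -> 2, (20,Inf) -> 3
ageCat.cut = cut(wine.df$Age, breaks = c(-Inf, 10, 20, Inf), labels = c(1, 2, 3))
table(ageCat, ageCat.cut)   # check that the two codings agree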
m4 = lm(Price ~ AGST + HarvestRain + WinterRain+as.factor(ageCat), data=wine.new.df)
summary(m4)
Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + as.factor(ageCat),
data = wine.new.df)
Residuals:
Min 1Q Median 3Q Max
-0.3712 -0.2486 -0.0156 0.1999 0.4672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.673549 1.874337 -1.96 0.06484 .
AGST 0.633222 0.103321 6.13 0.0000068 ***
HarvestRain -0.004007 0.000854 -4.69 0.00016 ***
WinterRain 0.000955 0.000512 1.87 0.07752 .
as.factor(ageCat)2 0.311788 0.155597 2.00 0.05956 .
as.factor(ageCat)3 0.497198 0.158081 3.15 0.00533 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.294 on 19 degrees of freedom
Multiple R-squared: 0.839, Adjusted R-squared: 0.796
F-statistic: 19.7 on 5 and 19 DF, p-value: 0.000000625
as.factor(ageCat)2 -> 1 if ageCat = 2, 0 otherwise
as.factor(ageCat)3 -> 1 if ageCat = 3, 0 otherwise
How many dummy variables are included? Ans: 2
Which category is the base level? Ans: ageCat = 1
#what as.factor does
age2 = ifelse(wine.new.df$ageCat==2,1,0)
age3 = ifelse(wine.new.df$ageCat==3,1,0)
m5 = lm(Price ~ AGST + HarvestRain + WinterRain+age2+age3, data=wine.new.df)
summary(m5)
Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + age2 +
age3, data = wine.new.df)
Residuals:
Min 1Q Median 3Q Max
-0.3712 -0.2486 -0.0156 0.1999 0.4672
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.673549 1.874337 -1.96 0.06484 .
AGST 0.633222 0.103321 6.13 0.0000068 ***
HarvestRain -0.004007 0.000854 -4.69 0.00016 ***
WinterRain 0.000955 0.000512 1.87 0.07752 .
age2 0.311788 0.155597 2.00 0.05956 .
age3 0.497198 0.158081 3.15 0.00533 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.294 on 19 degrees of freedom
Multiple R-squared: 0.839, Adjusted R-squared: 0.796
F-statistic: 19.7 on 5 and 19 DF, p-value: 0.000000625
m5 has exactly the same coefficients as m4.
💡 Regression does not, by itself, imply any causal relationship!
Despite its simplicity, linear regression (lm()) is probably the most popular model used to learn associations between the x variables and y. The R2 metric reflects how capable the x variables are of explaining variation in y (in-sample variation). However, R2 increases as the number of variables increases, so a model with higher R2 may not be good at out-of-sample prediction.
\[y_i = \beta_0 + \beta_1x_{1i}+...+\beta_kx_{ki}+\varepsilon_i,\quad i=n+1, n+2,...,N\]
Before introducing the use of linear regression for prediction, we must distinguish between in-sample explanation and out-of-sample prediction. The former is about explaining or quantifying the average effect of inputs on an outcome, whereas the latter is about predicting the outcome value for new records, given their input values.
Please keep the distinction between in-sample fit and out-of-sample prediction in mind. Below is an example of predictive modeling through regression. The table below lists the key variables used for analysis. Our task is to help a dealer predict the price he/she will get for used Toyota Corollas. We use the first 1000 records from the dataset ToyotaCorolla.csv.
So far our regression models have only been evaluated in-sample; real predictive analytics is about out-of-sample performance, on data the model has never seen.
Prediction vs. Explanation
click to download: ToyotaCorolla.csv
##Analyze the ToyotaCorolla file
car.df = read.csv("ToyotaCorolla.csv")
# use first 1000 rows of data
car.df = car.df[1:1000, ]
# select variables for regression
selected.var = c(3, 4, 7, 8, 9, 10, 12, 13, 14, 17, 18)
A good predictive model can have a looser fit to the data on which it is built, and a good descriptive model can have low prediction accuracy. Striking a balance between over- and under-fitting is a challenging task for prediction. Luckily, we have k-fold cross validation, an extremely useful technique for assessing the predictive power of any model (not limited to regression). This technique helps us mitigate over-fitting and has become a standard practice in predictive data analytics. Below is an example of cross validation.
💡 k-fold cross validation guides how you select variables and design the model.
Two models – car.lma (all variables), car.lmb (interaction) – are fitted. We use a common trick: adding interaction terms between variables.
\[Price_i = \beta_0+\beta_1 Age_i + \beta_2 Automatic_i + \beta_3Age_i * Automatic_i+...+\varepsilon_i, i=1,2,...,n\]
##cross validation: Toyota example
car.df.new = car.df[,c(3, 4, 7, 8, 9, 10, 12, 13, 14, 17, 18)]
names(car.df.new) #examine data
[1] "Price" "Age_08_04" "KM" "Fuel_Type" "HP"
[6] "Met_Color" "Automatic" "CC" "Doors" "Quarterly_Tax"
[11] "Weight"
table(car.df.new$Automatic) #check variable
0 1
952 48
car.lma = lm(Price~.,data=car.df.new )
summary(car.lma)
Call:
lm(formula = Price ~ ., data = car.df.new)
Residuals:
Min 1Q Median 3Q Max
-10690 -832 -42 796 7040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6419.29475 1398.20235 -4.59 0.000004977252140 ***
Age_08_04 -132.41047 3.76506 -35.17 < 0.0000000000000002 ***
KM -0.01865 0.00178 -10.48 < 0.0000000000000002 ***
Fuel_TypeDiesel 654.47182 426.53374 1.53 0.13
Fuel_TypePetrol 2536.47473 420.27180 6.04 0.000000002241419 ***
HP 35.06966 3.92364 8.94 < 0.0000000000000002 ***
Met_Color 58.86125 93.98117 0.63 0.53
Automatic 239.57173 208.52260 1.15 0.25
CC -0.02102 0.09449 -0.22 0.82
Doors -69.01729 48.27076 -1.43 0.15
Quarterly_Tax 15.91235 2.00734 7.93 0.000000000000006 ***
Weight 17.40015 1.34783 12.91 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1360 on 988 degrees of freedom
Multiple R-squared: 0.87, Adjusted R-squared: 0.868
F-statistic: 599 on 11 and 988 DF, p-value: <0.0000000000000002
#interaction
car.lmb=lm(Price~ Age_08_04+Automatic+Age_08_04:Automatic+KM+Fuel_Type+
HP+Met_Color+CC+Doors+Quarterly_Tax+Weight,data=car.df.new)
summary(car.lmb)
Call:
lm(formula = Price ~ Age_08_04 + Automatic + Age_08_04:Automatic +
KM + Fuel_Type + HP + Met_Color + CC + Doors + Quarterly_Tax +
Weight, data = car.df.new)
Residuals:
Min 1Q Median 3Q Max
-10676 -829 -24 802 7040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6400.65341 1399.45871 -4.57 0.000005401572628 ***
Age_08_04 -132.12799 3.82401 -34.55 < 0.0000000000000002 ***
Automatic 485.45415 611.22843 0.79 0.43
KM -0.01869 0.00178 -10.49 < 0.0000000000000002 ***
Fuel_TypeDiesel 660.21663 426.92125 1.55 0.12
Fuel_TypePetrol 2528.81664 420.82624 6.01 0.000000002620392 ***
HP 35.15778 3.93066 8.94 < 0.0000000000000002 ***
Met_Color 58.82933 94.02008 0.63 0.53
CC -0.02908 0.09638 -0.30 0.76
Doors -68.80555 48.29326 -1.42 0.15
Quarterly_Tax 15.88724 2.00902 7.91 0.000000000000007 ***
Weight 17.38266 1.34900 12.89 < 0.0000000000000002 ***
Age_08_04:Automatic -5.07461 11.85731 -0.43 0.67
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1360 on 987 degrees of freedom
Multiple R-squared: 0.87, Adjusted R-squared: 0.868
F-statistic: 549 on 12 and 987 DF, p-value: <0.0000000000000002
The interaction term makes the model more complex by letting the effect of one variable depend on the other.
library(DAAG)
# cross validation for Linear Regression
cv.lm(data=car.df.new, car.lma, m=2, printit=F)
cv.lm(data=car.df.new, car.lmb, m=2, printit=F)
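For intuition, here is a hand-rolled k-fold cross validation sketch that mirrors what cv.lm() does for car.lma (k, fold, and cv.rmse are names introduced here):
# hand-rolled k-fold cross validation for the all-variable model
k = 5
set.seed(5566)
fold = sample(rep(1:k, length.out = nrow(car.df.new)))    # random fold labels
cv.rmse = sapply(1:k, function(j) {
  fit = lm(Price ~ ., data = car.df.new[fold != j, ])     # train on k-1 folds
  pred = predict(fit, newdata = car.df.new[fold == j, ])  # predict held-out fold
  sqrt(mean((car.df.new$Price[fold == j] - pred)^2))      # out-of-fold RMSE
})
mean(cv.rmse)   # average out-of-fold RMSE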
💡 Model A's average prediction error on held-out folds is smaller than Model B's. Two lessons from this example:
1. You can always create many models; judging whether a model is good requires out-of-sample evaluation.
2. A stronger, more complex model is not necessarily more accurate; it may overfit.
set.seed(5566) # set seed for reproducing the partition
train.index = sample(c(1:1000), 600, replace=FALSE)
train.df = car.df[train.index, selected.var]
valid.df = car.df[-train.index, selected.var]
# use lm() to run a linear regression of Price on all 11 predictors in the
# training set.
# use . after ~ to include all the remaining columns in train.df as predictors.
car.lm = lm(Price ~ ., data = train.df)
# use options() to ensure numbers are not displayed in scientific notation.
options(scipen = 999)
summary(car.lm)
Call:
lm(formula = Price ~ ., data = train.df)
Residuals:
Min 1Q Median 3Q Max
-6963 -797 -69 720 6375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21456.14372 2260.50997 -9.49 < 0.0000000000000002 ***
Age_08_04 -116.76022 4.78608 -24.40 < 0.0000000000000002 ***
KM -0.01547 0.00207 -7.49 0.00000000000026 ***
Fuel_TypeDiesel -250.67447 506.14087 -0.50 0.6206
Fuel_TypePetrol 3717.17617 490.49978 7.58 0.00000000000014 ***
HP 19.26602 4.67796 4.12 0.00004360787847 ***
Met_Color 14.50101 111.87370 0.13 0.8969
Automatic 74.04368 259.44406 0.29 0.7754
CC -0.06081 0.08806 -0.69 0.4901
Doors -183.91791 60.08247 -3.06 0.0023 **
Quarterly_Tax 15.62252 2.31551 6.75 0.00000000003623 ***
Weight 31.69292 2.24725 14.10 < 0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1250 on 588 degrees of freedom
Multiple R-squared: 0.889, Adjusted R-squared: 0.887
F-statistic: 427 on 11 and 588 DF, p-value: <0.0000000000000002
#evaluate prediction accuracy
library(forecast)
# use predict() to make predictions on a new set.
car.lm.pred = predict(car.lm, valid.df)
# use accuracy() to compute common accuracy measures.
accuracy(car.lm.pred, valid.df$Price)
ME RMSE MAE MPE MAPE
Test set -107 1696 1086 -1.91 9.64
Percentage-based measures run into trouble when the denominator (the actual value) is 0, in which case they can blow up -> a common evaluation metric in Kaggle competitions is RMSLE.
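A minimal sketch of RMSLE (root mean squared log error), assuming non-negative predictions and actuals (rmsle is a helper defined here, not a forecast function):
# RMSLE: penalizes relative rather than absolute errors
rmsle = function(pred, actual) sqrt(mean((log1p(pred) - log1p(actual))^2))
rmsle(car.lm.pred, valid.df$Price)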
The last piece I want to discuss here is quantile regression, which has gradually become an attractive alternative to the ordinary least squares (OLS) regression.
\[y_i=\beta_0+\beta_1x_{1i}+...+\beta_kx_{ki}+\varepsilon_i,\quad i=1,2,...,n\] The OLS linear regression models the response y as a conditional mean given x:
\[E(y_i|x_i)=\beta_0+\beta_1x_{1i}+...+\beta_kx_{ki}\] Suppose \(0<\tau<1\) indicates the proportion of the population with y values below the \(\tau\)th quantile.
Quantile regression instead models a conditional quantile of y given x:
\[Q_\tau(y_i|x_i)=\beta_0(\tau)+\beta_1(\tau)x_{1i}+...+\beta_k(\tau)x_{ki}\]
\(\tau\) specifies which quantile (percentile) of y is modeled.
💡 Why does quantile regression use (asymmetrically weighted) absolute errors instead of squared errors? Minimizing squared errors recovers the conditional mean, whereas minimizing the check loss below recovers the conditional \(\tau\)th quantile (the median when \(\tau=0.5\)).
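Concretely, for a given \(\tau\) the estimator solves
\[\min_{\beta}\sum_{i=1}^n \rho_\tau\!\left(y_i - x_i^\top\beta\right), \qquad \rho_\tau(e)=e\left(\tau-\mathbf{1}\{e<0\}\right),\]
which puts weight \(\tau\) on positive residuals and \(1-\tau\) on negative residuals; this is exactly the loss coded in errorMAE further below.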
💡 Rolling validation: for time-ordered data, use a rolling (forward-chaining) validation scheme instead of random k-fold splits.
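The output below uses the quantreg package and the CPS1988 wage data from the AER package (Kleiber & Zeileis, 2008); a minimal setup sketch for the objects cps_f and cps_lad:
# setup assumed for the quantile-regression examples below
library(quantreg)                     # provides rq()
data("CPS1988", package = "AER")      # 1988 CPS wage data
cps_f = log(wage) ~ experience + I(experience^2) + education
cps_lad = rq(cps_f, data = CPS1988)   # default tau = 0.5: median (LAD) regression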
summary(cps_lad)
Call: rq(formula = cps_f, data = CPS1988)
tau: [1] 0.5
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 4.24088 0.02190 193.67805 0.00000
experience 0.07744 0.00115 67.50041 0.00000
I(experience^2) -0.00130 0.00003 -49.97891 0.00000
education 0.09429 0.00140 67.57171 0.00000
The default is \(\tau=0.5\), which gives the median (LAD) regression: \(\widehat{\log(wage)} = 4.24+0.077\,experience-0.0013\,experience^2+0.094\,education\).
Quantile regression is extremely useful for modeling several quantiles of the response y simultaneously. Let’s consider the first (\(\tau=0.25\)) and third (\(\tau=0.75\)) quartiles.
If we are interested in the 25th and 75th percentiles, we simply pass both quantiles to rq():
cps_rq=rq(cps_f, tau=c(0.25, 0.75), data=CPS1988)
summary(cps_rq)
Call: rq(formula = cps_f, tau = c(0.25, 0.75), data = CPS1988)
tau: [1] 0.25
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 3.78227 0.02866 131.95189 0.00000
experience 0.09156 0.00152 60.26474 0.00000
I(experience^2) -0.00164 0.00004 -45.39065 0.00000
education 0.09321 0.00185 50.32520 0.00000
Call: rq(formula = cps_f, tau = c(0.25, 0.75), data = CPS1988)
tau: [1] 0.75
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 4.66005 0.02023 230.39734 0.00000
experience 0.06377 0.00097 65.41364 0.00000
I(experience^2) -0.00099 0.00002 -44.15591 0.00000
education 0.09434 0.00134 70.65855 0.00000
- \(\tau=0.25\): \(3.78+0.092\,experience-0.0016\,experience^2+0.093\,education\)
- \(\tau=0.50\): \(4.24+0.077\,experience-0.0013\,experience^2+0.094\,education\)
- \(\tau=0.75\): \(4.66+0.064\,experience-0.0010\,experience^2+0.094\,education\)
So for high earners (upper quantiles), experience has a smaller effect on log wage than it does for low earners.
A natural question is whether the regression lines are parallel, that is, whether the effects of the regressors are uniform across quantiles.
cps_rq25=rq(cps_f, tau=0.25, data=CPS1988)
cps_rq75=rq(cps_f, tau=0.75, data=CPS1988)
anova(cps_rq25, cps_rq75)
Quantile Regression Analysis of Deviance Table
Model: log(wage) ~ experience + I(experience^2) + education
Joint Test of Equality of Slopes: tau in { 0.25 0.75 }
Df Resid Df F value Pr(>F)
1 3 56307 115 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ANOVA tests whether the slopes at the 25th and 75th percentiles differ significantly; based on the result, they do.
What is the key finding here? We can further visualize the results from quantile regression modeling.
cps_rqbig=rq(cps_f, tau=seq(0.05, 0.95, 0.05), data=CPS1988)
cps_rqbigs=summary(cps_rqbig)
18 non-positive fis
plot(cps_rqbigs)
The x-axis shows different values of \(\tau\) and the y-axis the corresponding coefficient estimates. The red line is the OLS estimate (a flat line, since the conditional-mean estimate is the same regardless of the quantile), and the dashed lines are confidence bands.
Can you see the drawback of relying merely on the OLS regression (lm())? You should now feel confident performing regression analysis for cross-sectional data in R.
Hand-coding quantile regression ourselves
attach(CPS1988)
tau=0.75
errorMAE = function(par) {
  obj = 0  # start from 0
  for (i in 1:nrow(CPS1988)) {
    # residual of observation i under the candidate coefficients
    ei = log(wage[i]) - (par[1] + par[2]*experience[i] + par[3]*experience[i]^2
                         + par[4]*education[i])
    # check loss: weight tau on positive residuals, 1 - tau on negative residuals
    obj = (tau*ei*(ei>0) + (1-tau)*abs(ei)*(ei<0)) + obj
  }
  obj
}
fit=optim(c(0.5,0.5,0.5,0.5),errorMAE)
errorMAE(rep(0.5,4))
[1] 1837907
errorMAE(fit$par)
[1] 4710
coef(cps_rq75)
(Intercept) experience I(experience^2) education
4.660054 0.063768 -0.000992 0.094340
errorMAE(coef(cps_rq75))
(Intercept)
4710
Why use optim()? Its default search method does not require derivatives, and derivatives are unfriendly to an objective function involving absolute values, so it suits our objective. The resulting parameters essentially match those rq() reports for the same \(\tau\) (here 0.75).
With essentially the same loss there can be many near-optimal solutions: once you throw many variables into the model (higher dimensionality), many different parameter vectors may yield nearly the same loss.