# Clear the workspace, set numeric display options, and load packages
rm(list = ls(all.names = TRUE))
options(digits = 4, scipen = 12)
library(dplyr)
library(ggplot2)
Introduction
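Topic: Use a song's attributes to predict whether it will reach the Top 10 of the pop music chart.
Learning objectives:
- Splitting the data by time
- Writing model formulas
- Choosing between highly correlated (collinear) independent variables
- What accuracy, sensitivity and specificity mean in practice
- How to adjust the probability threshold to trade off TPR (sensitivity) against FPR (1 - specificity)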
1 Basic Data Processing: Understanding the Data
【1.1】How many observations (songs) are from the year 2010?
【1.2】How many songs does the dataset include for which the artist name is “Michael Jackson”?
【1.3】Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.
【1.4】(a) What are the values of timesignature that occur in our dataset? (b) Which timesignature value is the most frequent among songs in our dataset?
【1.5】 Which of the following songs has the highest tempo?
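Questions 1.1 to 1.5 are all simple filter, count and lookup operations. The following is a minimal sketch in base R on an invented toy data frame. The object `songs_toy` and the column names `year`, `songtitle`, `artistname` and `Top10` are assumptions for illustration (only `timesignature` and tempo are named in the text), so the real dataset may need different names.

```r
# Toy stand-in for the songs data; every value here is invented.
songs_toy <- data.frame(
  year          = c(2009, 2010, 2010, 2001, 1995),
  songtitle     = c("A", "B", "C", "D", "E"),
  artistname    = c("X", "Y", "Michael Jackson", "Michael Jackson", "Z"),
  timesignature = c(4, 4, 3, 4, 5),
  tempo         = c(120.5, 98.2, 244.3, 110.0, 87.9),
  Top10         = c(0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# 1.1: number of rows from a given year
n_2010 <- sum(songs_toy$year == 2010)

# 1.2 and 1.3: songs by one artist, then those that reached the Top 10
mj      <- subset(songs_toy, artistname == "Michael Jackson")
mj_hits <- mj$songtitle[mj$Top10 == 1]

# 1.4: values of timesignature that occur, and the most frequent one
ts_counts <- table(songs_toy$timesignature)
ts_mode   <- names(which.max(ts_counts))

# 1.5: title of the song with the highest tempo
fastest <- songs_toy$songtitle[which.max(songs_toy$tempo)]
```

The same steps could be written with `dplyr::filter()` and `dplyr::count()`; base R is used here only to keep the sketch free of dependencies.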
2 Creating Our Prediction Model
【2.1 Splitting the data by time】How many observations (songs) are in the training set?
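Splitting by time means that every training observation is older than every test observation, so the model is evaluated on songs it could not have seen. A minimal sketch follows, assuming a `year` column and a cutoff of 2009 (the cutoff is an assumption, chosen because the validation questions refer to hits in 2010). The toy data is invented.

```r
# Toy data; only the year and the outcome are needed to show the split.
songs_toy <- data.frame(
  year  = c(2007, 2008, 2009, 2009, 2010, 2010),
  Top10 = c(0, 1, 0, 0, 1, 0)
)

cutoff_year <- 2009  # assumed cutoff: last year that belongs to the training set

train <- subset(songs_toy, year <= cutoff_year)  # older songs: used for fitting
test  <- subset(songs_toy, year >  cutoff_year)  # newer songs: used for validation

nrow(train)  # number of training observations
nrow(test)   # number of test observations
```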
【2.2 Building the model and model summary】What is the value of the Akaike Information Criterion (AIC)?
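The following sketch shows one way to write the model formula for a logistic regression and read off the AIC. It runs on simulated toy data, so the numbers it prints say nothing about the real songs. The names `train_toy` and `model1` and the columns `year` and `Top10` are assumptions for illustration.

```r
set.seed(1)  # make the simulated toy data reproducible
n <- 200
train_toy <- data.frame(
  year     = sample(2000:2009, n, replace = TRUE),
  loudness = rnorm(n, mean = -8, sd = 3),
  energy   = runif(n),
  tempo    = rnorm(n, mean = 110, sd = 20)
)
# Simulated outcome: louder toy songs are somewhat more likely to be hits.
train_toy$Top10 <- rbinom(n, 1, plogis(-1 + 0.2 * (train_toy$loudness + 8)))

# In a formula, "." means "every other column in data";
# "- year" then removes a column that should not act as a predictor.
model1 <- glm(Top10 ~ . - year, data = train_toy, family = binomial)

summary(model1)  # coefficient table; the AIC is printed near the bottom
AIC(model1)      # the same AIC, extracted directly
```

Writing `Top10 ~ . - year` rather than listing every predictor keeps the formula short when the data frame has many attribute columns, at the cost of silently including any column that is not explicitly removed.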
【2.3 Interpreting model coefficients】Choose LOWER or HIGHER: the ______ our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10.
【2.4 Drawing inferences】What does Model 1 suggest in terms of complexity?
【2.5 Checking for anomalous coefficients】(a) By inspecting the coefficient of the variable “loudness”, what does Model 1 suggest? (b) By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?
3 Handling Collinearity: Beware of Multicollinearity Issues!
【3.1 Checking the correlation coefficient】What is the correlation between loudness and energy in the training set?
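A pairwise correlation can be checked with `cor()`. The sketch below uses simulated toy values in which energy is built as a noisy linear function of loudness, so a high correlation is expected by construction and says nothing about the real data. The name `train_toy` is hypothetical.

```r
set.seed(2)  # make the simulated toy data reproducible
n <- 200
loudness <- rnorm(n, mean = -8, sd = 3)
# Toy energy depends linearly on loudness plus a little noise.
energy <- 0.5 + 0.05 * (loudness + 8) + rnorm(n, sd = 0.05)
train_toy <- data.frame(loudness, energy)

r <- cor(train_toy$loudness, train_toy$energy)  # Pearson correlation of one pair
r

cor(train_toy)  # correlation matrix for all numeric columns at once
```

When two predictors are this strongly correlated, their individual coefficients become unstable, which is why the next step refits the model with one of them left out.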
【3.2 Rebuilding the model and checking coefficients】Look at the summary of SongsLog2, and inspect the coefficient of the variable “energy”. What do you observe?
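One way to rebuild a model without one of two collinear predictors is to subtract the term in the formula. The sketch below does this on simulated toy data and compares the `energy` row before and after. The names `train_toy`, `model_full` and `model_no_loudness` are hypothetical, and which variable the document's SongsLog2 omits is not assumed here.

```r
set.seed(3)  # make the simulated toy data reproducible
n <- 200
loudness <- rnorm(n, mean = -8, sd = 3)
energy   <- 0.5 + 0.05 * (loudness + 8) + rnorm(n, sd = 0.05)  # collinear by construction
Top10    <- rbinom(n, 1, plogis(-1 + 0.2 * (loudness + 8)))
train_toy <- data.frame(loudness, energy, Top10)

model_full <- glm(Top10 ~ ., data = train_toy, family = binomial)

# ". ~ ." keeps the existing formula; "- loudness" drops that one term.
model_no_loudness <- update(model_full, . ~ . - loudness)

coef(summary(model_full))["energy", ]         # energy while loudness is in the model
coef(summary(model_no_loudness))["energy", ]  # energy after loudness is removed
```

Comparing the sign, size and standard error of the remaining coefficient across the two fits is the check that question 3.2 asks for.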
【3.3 Choosing a model】With Model 3, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
4 Validating Our Model
【4.1 Accuracy】What is the accuracy of Model 3 on the test set, using a threshold of 0.45?
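Accuracy at a threshold is the share of test observations whose thresholded prediction matches the actual label. The sketch below uses eight hand-written probabilities instead of a fitted model; with a real model they would typically come from `predict(model, newdata = test, type = "response")`. All values are invented.

```r
# Invented test labels and predicted probabilities.
actual <- c(1, 0, 0, 1, 0, 0, 1, 0)
prob   <- c(0.80, 0.10, 0.50, 0.30, 0.20, 0.05, 0.46, 0.44)

threshold <- 0.45
predicted <- as.integer(prob > threshold)  # 1 = predicted Top 10 hit

# Confusion matrix: rows are actual classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)
confusion

accuracy <- mean(predicted == actual)  # share of correct predictions
accuracy  # 0.75 on this toy data
```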
【4.2 Baseline accuracy】What would the accuracy of the baseline model be on the test set?
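A common baseline predicts the most frequent outcome for every observation. The sketch below assumes the majority class is taken from the training labels and then scored on the test labels. Both label vectors are invented.

```r
# Invented labels: 1 = Top 10 hit, 0 = not a hit.
train_actual <- c(0, 0, 0, 1, 0, 1, 0, 0)
test_actual  <- c(1, 0, 0, 1, 0, 0, 1, 0)

# Most frequent class in the training labels.
majority <- as.integer(names(which.max(table(train_actual))))

# The baseline predicts that class for every test song.
baseline_accuracy <- mean(test_actual == majority)
baseline_accuracy  # 0.625 on this toy data
```

Any fitted model is judged against this number: an accuracy only slightly above the baseline can still be valuable if the model identifies some of the rare positive cases that the baseline never predicts.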
【4.3 Accuracy vs. recognition rate】How many songs does Model 3 correctly predict as Top 10 hits in 2010? How many non-hit songs does Model 3 predict will be Top 10 hits?
【Q】Can a model that does not substantially improve accuracy still be useful? Why?
【4.4 Sensitivity & specificity】What are the sensitivity and specificity of Model 3 on the test set, using a threshold of 0.45?
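Sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). Lowering the threshold raises the first at the expense of the second. The sketch below computes both at two thresholds on invented values; `rates()` is a hypothetical helper, not a library function.

```r
# Invented test labels and predicted probabilities.
actual <- c(1, 0, 0, 1, 0, 0, 1, 0)
prob   <- c(0.80, 0.10, 0.50, 0.30, 0.20, 0.05, 0.46, 0.44)

# Hypothetical helper: sensitivity and specificity at a given threshold.
rates <- function(prob, actual, threshold) {
  predicted <- prob > threshold
  tp <- sum(predicted  & actual == 1)  # hits predicted as hits
  fn <- sum(!predicted & actual == 1)  # hits missed
  tn <- sum(!predicted & actual == 0)  # non-hits predicted as non-hits
  fp <- sum(predicted  & actual == 0)  # non-hits predicted as hits
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

at_045 <- rates(prob, actual, 0.45)  # sensitivity 2/3, specificity 0.8
at_025 <- rates(prob, actual, 0.25)  # sensitivity 1,   specificity 0.6
```

On this toy data, moving the threshold from 0.45 to 0.25 catches every hit but mislabels one more non-hit, which is the trade-off between TPR and FPR in miniature.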
【4.5 Conclusion】What conclusions can you make about our model?
【Q】What do we learn from this conclusion?