rm(list=ls(all=T))
options(digits=4, scipen=12)
library(dplyr); library(ggplot2)
Introduction
議題:使用歌曲的屬性,預測它會不會進入流行歌曲排行榜的前10名
學習重點:
- 依時間分割資料
- model formula 的寫法
- 高相關(共線性)自變數之間的選擇
- accuracy, sensitivity, specificity的實際意義
- 如何調整臨界機率來權衡:TFR/sensitivity vs. FPR/specificity
1 基本的資料處理 Understanding the Data
setwd("~/big data/unit3/code_data/data")
The working directory was changed to C:/Users/Administrator/Documents/big data/unit3/code_data/data inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
songs=read.csv("songs.csv")
【1.1】How many observations (songs) are from the year 2010?
nrow(songs[which(songs$year==2010),])
[1] 373
【1.2】How many songs does the dataset include for which the artist name is “Michael Jackson”?
nrow(songs[which(songs$artistname=="Michael Jackson"),])
[1] 18
【1.3】Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.
as.character(songs[which(songs$artistname=="Michael Jackson" & songs$Top10==1),2])
[1] "You Rock My World" "You Are Not Alone" "Black or White"
[4] "Remember the Time" "In The Closet"
【1.4】(a) What are the values of timesignature that occur in our dataset? (b) Which timesignature value is the most frequent among songs in our dataset?
names(table(songs$timesignature))
[1] "0" "1" "3" "4" "5" "7"
names(which.max(table(songs$timesignature)))
[1] "4"
【1.5】 Which of the following songs has the highest tempo?
as.character(songs[which.max(songs$tempo),2])
[1] "Wanna Be Startin' Somethin'"
2 建立模型 Creating Our Prediction Model
【2.1 依時間分割資料】How many observations (songs) are in the training set?
SongsTrain=subset(songs,year<=2009)
SongsTest=subset(songs,year==2010)
nrow(SongsTrain)
[1] 7201
【2.2 建立模型、模型摘要】What is the value of the Akaike Information Criterion (AIC)?
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")
SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)
summary(SongsLog1)$aic
[1] 4827
【2.3 模型係數判讀】The LOWER or HIGHER our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
c("Ans:The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 正确")
[1] "Ans:The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10 正确"
Warning message:
In strsplit(code, "\n", fixed = TRUE) :
input string 1 is invalid in this locale
【2.4 進行推論】What does Model 1 suggest in terms of complexity?
c("Ans:Mainstream listeners tend to prefer less complex songs")
[1] "Ans:Mainstream listeners tend to prefer less complex songs"
【2.5 檢查異常係數】 (a) By inspecting the coefficient of the variable “loudness”, what does Model 1 suggest? (b) By inspecting the coefficient of the variable “energy”, do we draw the same conclusions as above?
c("Ans:Mainstream listeners prefer songs with heavy instrumentation")
[1] "Ans:Mainstream listeners prefer songs with heavy instrumentation"
3 處理共線性 Beware of Multicollinearity Issues!
【3.1 檢查相關係數】What is the correlation between loudness and energy in the training set?
cor(SongsTrain$loudness,SongsTrain$energy)
[1] 0.7399
【3.2 重新建立模型、檢查係數】Look at the summary of SongsLog2, and inspect the coefficient of the variable “energy”. What do you observe?
SongsLog2 = glm(Top10 ~ . - loudness, data=SongsTrain, family=binomial)
summary(SongsLog2)$coefficients["energy",1]
[1] 0.1813
【3.3 選擇模型】 do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?
SongsLog3 = glm(Top10 ~ . - energy, data=SongsTrain, family=binomial)
summary(SongsLog3)$coefficients["loudness",1]
[1] 0.2306
c("Ans:YES")
[1] "Ans:YES"
4 驗證模型 Validating Our Model
【4.1 正確性】What is the accuracy of Model 3 on the test set, using a threshold of 0.45?
testPredict = predict(SongsLog3, newdata=SongsTest, type="response")
table(SongsTest$Top10, testPredict >= 0.45)
FALSE TRUE
0 309 5
1 40 19
(309+19)/(309+5+40+19)
[1] 0.8794
【4.2 底線正確率】What would the accuracy of the baseline model be on the test set? ?
table(SongsTest$Top10)
0 1
314 59
314/(314+59)
[1] 0.8418
【4.3 正確性 vs. 辨識率】How many songs does Model 3 correctly predict as Top 10 hits in 2010? How many non-hit songs does Model 3 predict will be Top 10 hits?
table(SongsTest$Top10, testPredict >= 0.45)[,2]
0 1
5 19
【Q】不能大幅度增加正確性的模型也會有用嗎?為甚麼?
Ans: 有用,因為辨識率比正確性重要,正確率的臨界值可以改動,但清楚的區分兩種狀況更能有助於進行決策。
【4.4 敏感性 & 明確性】What is the sensitivity and specificity of Model 3 on the test set, using a threshold of 0.45?
table(SongsTest$Top10, testPredict >= 0.45)
FALSE TRUE
0 309 5
1 40 19
Warning message:
In strsplit(code, "\n", fixed = TRUE) :
input string 1 is invalid in this locale
19/(19+40)
[1] 0.322
309/(309+5)
[1] 0.9841
【4.5 結論】What conclusions can you make about our model?
c("Ans:Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits")
[1] "Ans:Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits"
【Q】從這個結論我們學到什麼?
Ans: 辨識率可以幫助我們明確的區分兩部分狀況,除此之外,可以試當下情境需要,選擇你要比較關注敏感性或者明確性。像是MIT的個案需求,就比較偏好明確性大於敏感性。 - - -
LS0tDQp0aXRsZTogIkFTMy0xIFBvcHVsYXJpdHkgb2YgbXVzaWMgcmVjb3JkcyBHcm91cDMiDQphdXRob3I6ICJHcm91cDMiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQpgYGB7ciBlY2hvPVQsIG1lc3NhZ2U9RiwgY2FjaGU9Riwgd2FybmluZz1GfQ0Kcm0obGlzdD1scyhhbGw9VCkpDQpvcHRpb25zKGRpZ2l0cz00LCBzY2lwZW49MTIpDQpsaWJyYXJ5KGRwbHlyKTsgbGlicmFyeShnZ3Bsb3QyKQ0KYGBgDQoNCi0gLSAtDQoNCiMjIyBJbnRyb2R1Y3Rpb24NCg0K6K2w6aGM77ya5L2/55So5q2M5puy55qE5bGs5oCn77yM6aCQ5ris5a6D5pyD5LiN5pyD6YCy5YWl5rWB6KGM5q2M5puy5o6S6KGM5qac55qE5YmNMTDlkI0NCg0K5a2457+S6YeN6bue77yaDQoNCisg5L6d5pmC6ZaT5YiG5Ymy6LOH5paZDQorIG1vZGVsIGZvcm11bGEg55qE5a+r5rOVDQorIOmrmOebuOmXnCjlhbHnt5rmgKcp6Ieq6K6K5pW45LmL6ZaT55qE6YG45pOHDQorIGFjY3VyYWN5LCBzZW5zaXRpdml0eSwgc3BlY2lmaWNpdHnnmoTlr6bpmpvmhI/nvqkgDQorIOWmguS9leiqv+aVtOiHqOeVjOapn+eOh+S+huasiuihoe+8mlRGUi9zZW5zaXRpdml0eSB2cy4gRlBSL3NwZWNpZmljaXR5IA0KDQo8YnI+DQoNCi0gLSAtDQoNCiMjIyAxIOWfuuacrOeahOizh+aWmeiZleeQhiBVbmRlcnN0YW5kaW5nIHRoZSBEYXRhDQpgYGB7cn0NCnNldHdkKCJ+L2JpZyBkYXRhL3VuaXQzL2NvZGVfZGF0YS9kYXRhIikNCnNvbmdzPXJlYWQuY3N2KCJzb25ncy5jc3YiKQ0KYGBgDQoNCuOAkCoqMS4xKirjgJFIb3cgbWFueSBvYnNlcnZhdGlvbnMgKHNvbmdzKSBhcmUgZnJvbSB0aGUgeWVhciAyMDEwPw0KYGBge3J9DQpucm93KHNvbmdzW3doaWNoKHNvbmdzJHllYXI9PTIwMTApLF0pDQpgYGANCg0K44CQKioxLjIqKuOAkUhvdyBtYW55IHNvbmdzIGRvZXMgdGhlIGRhdGFzZXQgaW5jbHVkZSBmb3Igd2hpY2ggdGhlIGFydGlzdCBuYW1lIGlzICJNaWNoYWVsIEphY2tzb24iPw0KYGBge3J9DQpucm93KHNvbmdzW3doaWNoKHNvbmdzJGFydGlzdG5hbWU9PSJNaWNoYWVsIEphY2tzb24iKSxdKQ0KYGBgDQoNCuOAkCoqMS4zKirjgJFXaGljaCBvZiB0aGVzZSBzb25ncyBieSBNaWNoYWVsIEphY2tzb24gbWFkZSBpdCB0byB0aGUgVG9wIDEwPyBTZWxlY3QgYWxsIHRoYXQgYXBwbHkuDQpgYGB7cn0NCmFzLmNoYXJhY3Rlcihzb25nc1t3aGljaChzb25ncyRhcnRpc3RuYW1lPT0iTWljaGFlbCBKYWNrc29uIiAmIHNvbmdzJFRvcDEwPT0xKSwyXSkNCmBgYA0KDQrjgJAqKjEuNCoq44CRKGEpIFdoYXQgYXJlIHRoZSB2YWx1ZXMgb2YgYHRpbWVzaWduYXR1cmVgIHRoYXQgb2NjdXIgaW4gb3VyIGRhdGFzZXQ/IChiKSBXaGljaCB0aW1lc2lnbmF0dXJlIHZhbHVlIGlzIHRoZSBtb3N0IGZyZXF1ZW50IGFtb25nIHNvbmdzIGluIG91ciBkYXRhc2V0PyANCmBgYHtyfQ0KbmFtZXModGFibGUoc29uZ3MkdGltZXNpZ25hdHVyZSkpDQpuYW1lcyh3aGljaC5tYXgodGFibGUoc29uZ3MkdGltZXNpZ25hdHVyZSkpKQ0KYGBgDQoNCuOAkCoqMS41KirjgJEgV2hpY2ggb2YgdGhlIGZvbGxvd2luZyBzb25ncyBoYXMgdGhlIGhpZ2hlc3QgdGVtcG8/DQpgYGB7cn0NCmFzLmNoYXJhY3Rlcihzb25nc1t3aGljaC5tYXgoc29uZ3MkdGVtcG8pLDJdKQ0KYGBgDQo8YnI+DQoNCi0gLSAtDQoNCiMjIyAyIOW7uueri+aooeWeiyBDcmVhdGluZyBPdXIgUHJlZGljdGlvbiBNb2RlbA0KDQrjgJAqKjIuMSDkvp3mmYLplpPliIblibLos4fmlpkqKuOAkUhvdyBtYW55IG9ic2VydmF0aW9ucyAoc29uZ3MpIGFyZSBpbiB0aGUgdHJhaW5pbmcgc2V0Pw0KYGBge3J9DQpTb25nc1RyYWluPXN1YnNldChzb25ncyx5ZWFyPD0yMDA5KQ0KU29uZ3NUZXN0PXN1YnNldChzb25ncyx5ZWFyPT0yMDEwKQ0KbnJvdyhTb25nc1RyYWluKQ0KYGBgDQoNCuOAkCoqMi4yIOW7uueri+aooeWei+OAgeaooeWei+aRmOimgSoq44CRV2hhdCBpcyB0aGUgdmFsdWUgb2YgdGhlIEFrYWlrZSBJbmZvcm1hdGlvbiBDcml0ZXJpb24gKEFJQyk/DQpgYGB7cn0NCm5vbnZhcnMgPSBjKCJ5ZWFyIiwgInNvbmd0aXRsZSIsICJhcnRpc3RuYW1lIiwgInNvbmdJRCIsICJhcnRpc3RJRCIpDQpTb25nc1RyYWluID0gU29uZ3NUcmFpblsgLCAhKG5hbWVzKFNvbmdzVHJhaW4pICVpbiUgbm9udmFycykgXQ0KU29uZ3NUZXN0ID0gU29uZ3NUZXN0WyAsICEobmFtZXMoU29uZ3NUZXN0KSAlaW4lIG5vbnZhcnMpIF0NClNvbmdzTG9nMSA9IGdsbShUb3AxMCB+IC4sIGRhdGE9U29uZ3NUcmFpbiwgZmFtaWx5PWJpbm9taWFsKQ0Kc3VtbWFyeShTb25nc0xvZzEpJGFpYw0KYGBgDQoNCuOAkCoqMi4zIOaooeWei+S/guaVuOWIpOiugCoq44CRVGhlIGBMT1dFUmAgb3IgYEhJR0hFUmAgb3VyIGNvbmZpZGVuY2UgYWJvdXQgdGltZSBzaWduYXR1cmUsIGtleSBhbmQgdGVtcG8sIHRoZSBtb3JlIGxpa2VseSB0aGUgc29uZyBpcyB0byBiZSBpbiB0aGUgVG9wIDEwDQpgYGB7cn0NCmMoIkFuczpUaGUgaGlnaGVyIG91ciBjb25maWRlbmNlIGFib3V0IHRpbWUgc2lnbmF0dXJlLCBrZXkgYW5kIHRlbXBvLCB0aGUgbW9yZSBsaWtlbHkgdGhlIHNvbmcgaXMgdG8gYmUgaW4gdGhlIFRvcCAxMCDmraPnoa4iKQ0KYGBgDQoNCuOAkCoqMi40IOmAsuihjOaOqOirlioq44CRV2hhdCBkb2VzIE1vZGVsIDEgc3VnZ2VzdCBpbiB0ZXJtcyBvZiBjb21wbGV4aXR5Pw0KYGBge3J9DQpjKCJBbnM6TWFpbnN0cmVhbSBsaXN0ZW5lcnMgdGVuZCB0byBwcmVmZXIgbGVzcyBjb21wbGV4IHNvbmdzIikNCmBgYA0KDQrjgJAqKjIuNSDmqqLmn6XnlbDluLjkv4LmlbgqKuOAkSAoYSkgQnkgaW5zcGVjdGluZyB0aGUgY29lZmZpY2llbnQgb2YgdGhlIHZhcmlhYmxlICJsb3VkbmVzcyIsIHdoYXQgZG9lcyBNb2RlbCAxIHN1Z2dlc3Q/IChiKSBCeSBpbnNwZWN0aW5nIHRoZSBjb2VmZmljaWVudCBvZiB0aGUgdmFyaWFibGUgImVuZXJneSIsIGRvIHdlIGRyYXcgdGhlIHNhbWUgY29uY2x1c2lvbnMgYXMgYWJvdmU/DQpgYGB7cn0NCmMoIkFuczpNYWluc3RyZWFtIGxpc3RlbmVycyBwcmVmZXIgc29uZ3Mgd2l0aCBoZWF2eSBpbnN0cnVtZW50YXRpb24iKQ0KYGBgDQo8YnI+DQoNCi0gLSAtDQoNCiMjIyAzIOiZleeQhuWFsee3muaApyBCZXdhcmUgb2YgTXVsdGljb2xsaW5lYXJpdHkgSXNzdWVzIQ0KDQrjgJAqKjMuMSDmqqLmn6Xnm7jpl5zkv4LmlbgqKuOAkVdoYXQgaXMgdGhlIGNvcnJlbGF0aW9uIGJldHdlZW4gYGxvdWRuZXNzYCBhbmQgYGVuZXJneWAgaW4gdGhlIHRyYWluaW5nIHNldD8NCmBgYHtyfQ0KY29yKFNvbmdzVHJhaW4kbG91ZG5lc3MsU29uZ3NUcmFpbiRlbmVyZ3kpDQpgYGANCg0K44CQKiozLjIg6YeN5paw5bu656uL5qih5Z6L44CB5qqi5p+l5L+C5pW4KirjgJFMb29rIGF0IHRoZSBzdW1tYXJ5IG9mIFNvbmdzTG9nMiwgYW5kIGluc3BlY3QgdGhlIGNvZWZmaWNpZW50IG9mIHRoZSB2YXJpYWJsZSAiZW5lcmd5Ii4gV2hhdCBkbyB5b3Ugb2JzZXJ2ZT8NCmBgYHtyfQ0KU29uZ3NMb2cyID0gZ2xtKFRvcDEwIH4gLiAtIGxvdWRuZXNzLCBkYXRhPVNvbmdzVHJhaW4sIGZhbWlseT1iaW5vbWlhbCkNCnN1bW1hcnkoU29uZ3NMb2cyKSRjb2VmZmljaWVudHNbImVuZXJneSIsMV0NCmBgYA0KDQrjgJAqKjMuMyDpgbjmk4fmqKHlnosqKuOAkSBkbyB3ZSBtYWtlIHRoZSBzYW1lIG9ic2VydmF0aW9uIGFib3V0IHRoZSBwb3B1bGFyaXR5IG9mIGhlYXZ5IGluc3RydW1lbnRhdGlvbiBhcyB3ZSBkaWQgd2l0aCBNb2RlbCAyPw0KYGBge3J9DQpTb25nc0xvZzMgPSBnbG0oVG9wMTAgfiAuIC0gZW5lcmd5LCBkYXRhPVNvbmdzVHJhaW4sIGZhbWlseT1iaW5vbWlhbCkNCnN1bW1hcnkoU29uZ3NMb2czKSRjb2VmZmljaWVudHNbImxvdWRuZXNzIiwxXQ0KYygiQW5zOllFUyIpDQpgYGANCjxicj4NCg0KLSAtIC0NCg0KIyMjIDQg6amX6K2J5qih5Z6LIFZhbGlkYXRpbmcgT3VyIE1vZGVsDQoNCuOAkCoqNC4xIOato+eiuuaApyoq44CRV2hhdCBpcyB0aGUgYWNjdXJhY3kgb2YgTW9kZWwgMyBvbiB0aGUgdGVzdCBzZXQsIHVzaW5nIGEgdGhyZXNob2xkIG9mIDAuNDU/IA0KYGBge3J9DQp0ZXN0UHJlZGljdCA9IHByZWRpY3QoU29uZ3NMb2czLCBuZXdkYXRhPVNvbmdzVGVzdCwgdHlwZT0icmVzcG9uc2UiKQ0KdGFibGUoU29uZ3NUZXN0JFRvcDEwLCB0ZXN0UHJlZGljdCA+PSAwLjQ1KQ0KKDMwOSsxOSkvKDMwOSs1KzQwKzE5KQ0KYGBgDQoNCuOAkCoqNC4yIOW6lee3muato+eiuueOhyoq44CRV2hhdCB3b3VsZCB0aGUgYWNjdXJhY3kgb2YgdGhlIGJhc2VsaW5lIG1vZGVsIGJlIG9uIHRoZSB0ZXN0IHNldD8gPyANCmBgYHtyfQ0KdGFibGUoU29uZ3NUZXN0JFRvcDEwKQ0KMzE0LygzMTQrNTkpDQpgYGANCg0K44CQKio0LjMg5q2j56K65oCnIHZzLiDovqjorZjnjocqKuOAkUhvdyBtYW55IHNvbmdzIGRvZXMgTW9kZWwgMyBjb3JyZWN0bHkgcHJlZGljdCBhcyBUb3AgMTAgaGl0cyBpbiAyMDEwPyAgSG93IG1hbnkgbm9uLWhpdCBzb25ncyBkb2VzIE1vZGVsIDMgcHJlZGljdCB3aWxsIGJlIFRvcCAxMCBoaXRzPw0KYGBge3J9DQp0YWJsZShTb25nc1Rlc3QkVG9wMTAsIHRlc3RQcmVkaWN0ID49IDAuNDUpWywyXQ0KYGBgDQoNCuOAkCoqUSoq44CR5LiN6IO95aSn5bmF5bqm5aKe5Yqg5q2j56K65oCn55qE5qih5Z6L5Lmf5pyD5pyJ55So5ZeO77yf54K655Sa6bq877yfDQoNCkFuczog5pyJ55So77yM5Zug54K66L6o6K2Y546H5q+U5q2j56K65oCn6YeN6KaB77yM5q2j56K6546H55qE6Ieo55WM5YC85Y+v5Lul5pS55YuV77yM5L2G5riF5qWa55qE5Y2A5YiG5YWp56iu54uA5rOB5pu06IO95pyJ5Yqp5pa86YCy6KGM5rG6562W44CCDQoNCuOAkCoqNC40IOaVj+aEn+aApyAmIOaYjueiuuaApyoq44CRV2hhdCBpcyB0aGUgYHNlbnNpdGl2aXR5YCBhbmQgYHNwZWNpZmljaXR5YCBvZiBNb2RlbCAzIG9uIHRoZSB0ZXN0IHNldCwgdXNpbmcgYSB0aHJlc2hvbGQgb2YgMC40NT8NCmBgYHtyfQ0KdGFibGUoU29uZ3NUZXN0JFRvcDEwLCB0ZXN0UHJlZGljdCA+PSAwLjQ1KQ0KMTkvKDE5KzQwKQ0KMzA5LygzMDkrNSkNCmBgYA0KDQrjgJAqKjQuNSDntZDoq5YqKuOAkVdoYXQgY29uY2x1c2lvbnMgY2FuIHlvdSBtYWtlIGFib3V0IG91ciBtb2RlbD8NCmBgYHtyfQ0KYygiQW5zOk1vZGVsIDMgcHJvdmlkZXMgY29uc2VydmF0aXZlIHByZWRpY3Rpb25zLCBhbmQgcHJlZGljdHMgdGhhdCBhIHNvbmcgd2lsbCBtYWtlIGl0IHRvIHRoZSBUb3AgMTAgdmVyeSByYXJlbHkuIFNvIHdoaWxlIGl0IGRldGVjdHMgbGVzcyB0aGFuIGhhbGYgb2YgdGhlIFRvcCAxMCBzb25ncywgd2UgY2FuIGJlIHZlcnkgY29uZmlkZW50IGluIHRoZSBzb25ncyB0aGF0IGl0IGRvZXMgcHJlZGljdCB0byBiZSBUb3AgMTAgaGl0cyIpDQpgYGANCg0K44CQKipRKirjgJHlvp7pgJnlgIvntZDoq5bmiJHlgJHlrbjliLDku4DpurzvvJ8NCg0KQW5zOiDovqjorZjnjoflj6/ku6XluavliqnmiJHlgJHmmI7norrnmoTljYDliIblhanpg6jliIbni4Dms4HvvIzpmaTmraTkuYvlpJbvvIzlj6/ku6XoqabnlbbkuIvmg4XlooPpnIDopoHvvIzpgbjmk4fkvaDopoHmr5TovIPpl5zms6jmlY/mhJ/mgKfmiJbogIXmmI7norrmgKfjgILlg4/mmK/vvK3vvKnvvLTnmoTlgIvmoYjpnIDmsYLvvIzlsLHmr5TovIPlgY/lpb3mmI7norrmgKflpKfmlrzmlY/mhJ/mgKfjgIINCi0gLSAtDQoNCjxicj48YnI+PGJyPg0K