Update on 2025-09-25

Background: An Age of Strife (Chaos)

Statistics

Machine learning

Data science

Artificial intelligence

Data mining

Pattern recognition

Deep learning

Getting Started: Hearing of All the World's Martial Arts (Recognise algorithms)

Learning the Moves (A simple case)

Major League Baseball Data (263 players) from the 1986 and 1987 seasons. Salary is in thousands of dollars.

Player             Years  Hits  Salary
-Alan Ashby           14    81   475.0
-Alvin Davis           3   130   480.0
-Andre Dawson         11   141   500.0
-Andres Galarraga      2    87    91.5
-Alfredo Griffin      11   169   750.0
-Al Newman             2    37    70.0
-Argenis Salazar       3    73   100.0
-Andres Thomas         2    81    75.0

Learning the Moves (Model selection)

Learning the Moves (Model setup, training and evaluation)

Model setup

\[ \mathbf y = F_m(\mathbf X; \boldsymbol \beta) \rightarrow \mathbf X \boldsymbol \beta. \]

Training

\[ \hat{\boldsymbol \beta} = \underset{\boldsymbol \beta}{\rm argmin}\ L(\mathbf X_{\rm train}, \mathbf y_{\rm train}, \boldsymbol \beta) \rightarrow \underset{\boldsymbol \beta}{\rm argmin}\ \| \mathbf y_{\rm train} - \mathbf X_{\rm train} \boldsymbol \beta \|_2^2. \]

Prediction

\[ \hat{\mathbf y} = F_m(\mathbf X_{\rm test}; \hat{\boldsymbol \beta}) \rightarrow \mathbf X_{\rm test} \hat{\boldsymbol \beta}. \]

Evaluation

\[ {\rm mse\ (Mean\ squared\ error)} = \frac{1}{n_{\rm test}} \sum_{i=1}^{n_{\rm test}} (\hat y_i - y_{{\rm test},i})^2, \\ {\rm acc\ (Accuracy)} = \frac{\rm Correctly\ predicted\ samples}{\rm Total\ samples}. \]
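The four steps above (setup, training, prediction, evaluation) can be sketched in Python for the linear case. This is only an illustration under stated assumptions: it uses the eight sample rows from the baseball table, adds an intercept column, and splits them arbitrarily into six training rows and two test rows. The split and the helper name `accuracy` are not from the slides.

```python
import numpy as np

# The eight sample rows shown earlier: Years, Hits, Salary.
data = np.array([
    [14,  81, 475.0],
    [ 3, 130, 480.0],
    [11, 141, 500.0],
    [ 2,  87,  91.5],
    [11, 169, 750.0],
    [ 2,  37,  70.0],
    [ 3,  73, 100.0],
    [ 2,  81,  75.0],
])

# Model setup: y = X beta, with a column of ones for the intercept.
X = np.column_stack([np.ones(len(data)), data[:, :2]])
y = data[:, 2]

# Assumed split for illustration only: first six rows train, last two test.
X_train, y_train = X[:6], y[:6]
X_test, y_test = X[6:], y[6:]

# Training: minimise ||y_train - X_train beta||_2^2 by least squares.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Prediction on the held-out rows.
y_hat = X_test @ beta_hat

# Evaluation: mean squared error on the test rows.
mse = np.mean((y_hat - y_test) ** 2)
print("beta_hat:", beta_hat)
print("test MSE:", mse)


def accuracy(y_pred, y_true):
    """Fraction of correctly predicted labels (for classification tasks)."""
    return np.mean(np.asarray(y_pred) == np.asarray(y_true))
```

With so few rows the fitted coefficients are not meaningful; the point is only to show how each formula maps to one line of code.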

Learning the Moves (Model evaluation)

Learning the Moves (Model comparison)

Choosing a Weapon (Programming)

Choosing a Weapon (Python packages)

Choosing a Weapon (R packages)

Choosing a Weapon (Py & R)

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

R

library(data.table)
library(dplyr)
library(ggplot2)
library(mlr3)

Comparing Weapons (Platform and versions)

## Platform:  macOS-15.7-arm64-arm-64bit
## Platform:  Apple M4 Pro

Python

import platform
print("Python version:", platform.python_version())
## Python version: 3.12.10

R

print(R.version$version.string)
## [1] "R version 4.5.1 (2025-06-13)"

Comparing Weapons (Matrix multiplication)

Python

import time
a = np.random.standard_normal([5000, 5000])
start = time.time()
b = a @ a.T
end = time.time()
time_ela = end - start
print(time_ela)
## 0.19492793083190918

R

a <- matrix(rnorm(5000 * 5000), 5000)
system.time( b <- tcrossprod(a))
##    user  system elapsed 
##   0.350   0.014   0.204
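A single wall-clock measurement like the ones above can be noisy. One possible refinement, sketched here with a hypothetical helper `best_time` that is not part of the slides, is to repeat the call a few times with `time.perf_counter` and keep the fastest run. The matrix is deliberately smaller than 5000 x 5000 so the sketch finishes quickly.

```python
import time

import numpy as np


def best_time(fn, repeats=3):
    """Return the smallest wall-clock time in seconds over `repeats` calls of fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


rng = np.random.default_rng(0)
# Assumption: a 500 x 500 matrix stands in for the 5000 x 5000 benchmark.
a_small = rng.standard_normal((500, 500))

elapsed = best_time(lambda: a_small @ a_small.T)
print("a_small @ a_small.T:", elapsed)
```

Taking the minimum rather than the mean is a common choice for micro-benchmarks, since background activity can only make a run slower.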

Comparing Advanced Weapons (Python extensions)

Python

import torch

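a = torch.randn([5000, 5000], device = torch.device('mps'))
t1 = time.time()
b_t = a @ a.T
# MPS kernels are launched asynchronously, so without a call to
# torch.mps.synchronize() here the timer mostly measures launch overhead.
print('torch @:', time.time() - t1)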
## torch @: 0.0041158199310302734
import mlx.core as mx

a = mx.random.normal([5000, 5000], stream = mx.gpu)
t1 = time.time()
b_m = a @ a.T
# MLX evaluates lazily: mx.eval forces the computation to run before the timer is read.
mx.eval(b_m)
print('mlx @:', time.time() - t1)
## mlx @: 0.06186795234680176

Comparing Weapons (Inverse matrix)

Python

start = time.time()
inv = np.linalg.inv(b)
end = time.time()
time_ela = end - start
print(time_ela)
## 0.797569990158081

R

system.time(inv <- solve(b))
##    user  system elapsed 
##   1.655   0.041   1.039

Comparing Weapons (Eigen decomposition)

Python

start = time.time()
uv = np.linalg.eigh(b)
end = time.time()
time_ela = end - start
print(time_ela)
## 7.239617109298706

R

system.time(uv <- eigen(b))
##    user  system elapsed 
##  10.713   0.128   9.472
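The inverse and eigen decomposition benchmarks are linked: for a symmetric matrix, `eigh` returns eigenvalues `w` and orthonormal eigenvectors `u` with b = U diag(w) U^T, so the inverse is U diag(1/w) U^T. The sketch below checks both identities on a small matrix. As an assumption made for numerical stability, it builds `b` from a wide 50 x 200 matrix rather than the square matrix used in the benchmark, so that `b` is well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: a wide matrix, so b = a a^T is symmetric positive definite
# and well conditioned (the benchmark itself uses a square 5000 x 5000 matrix).
a = rng.standard_normal((50, 200))
b = a @ a.T

inv = np.linalg.inv(b)
w, u = np.linalg.eigh(b)  # eigenvalues in ascending order, eigenvectors in columns

# Reconstruction: b = U diag(w) U^T
recon_ok = np.allclose(u @ np.diag(w) @ u.T, b)

# Inverse from the decomposition: b^{-1} = U diag(1 / w) U^T
inv_ok = np.allclose(u @ np.diag(1.0 / w) @ u.T, inv)

print("reconstruction matches:", recon_ok)
print("inverse matches:", inv_ok)
```

Checks like these are a cheap way to confirm that a timed routine also returned the right answer.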

Many Weapons, All Mastered (Combine R and Python)

Python

from nycflights13 import flights
pd.set_option('display.max_columns', 20)
print(flights.iloc[0:7, 0:10])
##    year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
## 0  2013      1    1     517.0             515        2.0     830.0   
## 1  2013      1    1     533.0             529        4.0     850.0   
## 2  2013      1    1     542.0             540        2.0     923.0   
## 3  2013      1    1     544.0             545       -1.0    1004.0   
## 4  2013      1    1     554.0             600       -6.0     812.0   
## 5  2013      1    1     554.0             558       -4.0     740.0   
## 6  2013      1    1     555.0             600       -5.0     913.0   
## 
##    sched_arr_time  arr_delay carrier  
## 0             819       11.0      UA  
## 1             830       20.0      UA  
## 2             850       33.0      AA  
## 3            1022      -18.0      B6  
## 4             837      -25.0      DL  
## 5             728       12.0      UA  
## 6             854       19.0      B6

Many Weapons, All Mastered (Combine R and Python)

R

kable(head(py$flights[, 1 : 10], 7), align = "c")
year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  sched_arr_time  arr_delay  carrier
2013      1    1       517             515          2       830             819         11       UA
2013      1    1       533             529          4       850             830         20       UA
2013      1    1       542             540          2       923             850         33       AA
2013      1    1       544             545         -1      1004            1022        -18       B6
2013      1    1       554             600         -6       812             837        -25       DL
2013      1    1       554             558         -4       740             728         12       UA
2013      1    1       555             600         -5       913             854         19       B6

Cultivating Inner Strength (Books)

Roaming the Jianghu, Taking On Challenges (Kaggle)