Introduction

Since it first appeared nearly 40 years ago, Basic Econometrics has proven to be one of the best-loved and most widely used econometrics textbooks. Its mark and influence can be seen in both econometrics textbooks of NEU and EUH (especially the first edition, the yellow-covered book by Nguyễn Quang Dong). A Vietnamese translation also circulates internally and is used in the Fulbright economics teaching program.

What if you cannot get hold of the translation? Then read the English original (link to the book above). The author uses only about 6,300 English words (words found in an English dictionary) to write the book, considerably fewer than, say, Wooldridge's Introductory Econometrics (understandable, since the author is an Indian who became a naturalized US citizen). But that is not the main point: 500 words account for about 70% of all word occurrences. In other words, mastering those 500 words is enough to read the book reasonably well (the writer, who only studied French at school, actually learned this way, albeit at a different level, and got good results).

This post shows how to find that list of 500 words, which together make up roughly 70% of the word occurrences in the book.
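
The idea of "the top N words cover X% of the text" can be sketched on a toy example before touching the book. The word counts below are made up purely for illustration and the variable names are hypothetical; only the technique (sort, accumulate, find the first point that crosses the threshold) mirrors what the post does later.

```r
# Toy word counts (invented for illustration; not taken from the book).
freq <- c(model = 40, data = 25, test = 15, value = 10, rare = 6, odd = 4)

# Sort from most to least frequent, then accumulate each word's share
# of all occurrences.
freq <- sort(freq, decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)

# Smallest number of top words whose cumulative share reaches 70%.
n_needed <- which(coverage >= 0.70)[1]

print(round(coverage, 2))
print(n_needed)
```

In this toy case three words are enough to pass 70%; on the book the same calculation is applied to thousands of words.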

The post has two parts. In Part 1 we read in more than 1 million English words from the linguistics database of SIL International, then use them for comparison with the vocabulary Gujarati uses in his book.

Part 2 analyzes the vocabulary used in the book in more detail.

Part 1

eng <- read.table("http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt", stringsAsFactors = FALSE)
library(tidyverse)
library(stringr)

eng <- eng %>% mutate(n_w = str_count(V1))
theme_set(theme_minimal())

# Distribution of the number of letters per English word: 
eng %>% 
  group_by(n_w) %>% 
  count() %>% 
  ggplot(aes(n_w, n)) + geom_col() + 
  labs(x = NULL, y = NULL)
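
For readers without the tidyverse installed, the same word-length tabulation can be approximated in base R. This is a minimal sketch on a toy vector (the `words` sample is an assumption, not the SIL list); for plain ASCII words, `nchar()` should give the same counts as `str_count(V1)` with its default pattern.

```r
# Toy sample standing in for eng$V1 (hypothetical data).
words <- c("a", "of", "data", "model", "value", "regression")

# Number of letters in each word.
n_letters <- nchar(words)

# How many words there are of each length.
length_table <- table(n_letters)
print(length_table)
```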

# The vowels: 
vowels <- c("a", "e", "i", "o", "u")
num_vowels <- vector(mode = "integer", length = 5)
for (j in seq_along(vowels)) {
  # Count occurrences of the j-th vowel in every word, then total them:
  num_aux <- str_count(eng$V1, vowels[j])
  num_vowels[j] <- sum(num_aux)
}

df1 <- data.frame(N = num_vowels, nguyen_am = vowels)
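
The loop above can also be written without an explicit index. The sketch below is one possible base R alternative on a toy vector (the `words` sample and the name `letters_all` are assumptions for illustration): split every word into letters once, then count matches per vowel with `vapply()`.

```r
# Toy sample standing in for eng$V1 (hypothetical data).
words <- c("model", "data", "regression")
vowels <- c("a", "e", "i", "o", "u")

# Split all words into single letters once.
letters_all <- unlist(strsplit(words, ""))

# For each vowel, count how many letters equal it.
num_vowels <- vapply(vowels, function(v) sum(letters_all == v), integer(1))
print(num_vowels)
```

The result is a named integer vector, so it can go straight into `data.frame(N = num_vowels, nguyen_am = names(num_vowels))`.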

# Frequency of each vowel: 
library(hrbrthemes)
df1 %>% 
  ggplot(aes(reorder(nguyen_am, N), N)) +  
  geom_col() + labs(x = NULL, y = NULL) + coord_flip() + 
  scale_y_continuous(breaks = seq(0, 110000, by = 10000)) + 
  theme_ipsum(grid = "X")

# Frequency of each letter in English: 
let <- eng$V1 %>% strsplit("") %>% unlist()
let <- data.frame(l = let)

let %>% group_by(l) %>% count() %>% 
  mutate(per = 100*n / nrow(let)) %>% 
  mutate_if(is.numeric, function(x) round(x, 2)) %>% 
  ggplot(aes(reorder(l, per), per)) + geom_col() + 
  coord_flip() + 
  labs(x = NULL, y = NULL) + 
  theme_ipsum(grid = "X") + 
  scale_y_continuous(breaks = seq(1, 12, by = 1))
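
The percentages behind this chart can be checked by hand on a small input. A minimal base R sketch, assuming a two-word toy sample rather than the SIL list, uses `table()` and `prop.table()` to turn letter counts into shares:

```r
# Toy sample (hypothetical data, not the SIL word list).
words <- c("model", "data")

# One element per letter, then the share of each letter among all letters.
letters_all <- unlist(strsplit(words, ""))
letter_share <- prop.table(table(letters_all))

# Percentages, most frequent letter first.
letter_pct <- round(100 * sort(letter_share, decreasing = TRUE), 2)
print(letter_pct)
```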

Part 2

First, download the PDF of the book and put it in a folder named gujarati on drive E of your computer.

For R to read data from a PDF, some preparation is needed. First download xpdfbin-win-3.04.zip. Unzip it, pick the two files pdftotext.exe and pdfinfo.exe, and copy them to drive E. To repeat: copy them to drive E.

Next, right-click the Windows Start button, choose System, then Advanced system settings, then the Advanced tab. After that choose Environment Variables, select Path, and click Edit. Finally, type the exact path to the bin64 folder extracted from the zip file you just downloaded, which on my machine is C:10ls-win-3.0464, and click OK.

If your machine runs 32-bit Windows, replace bin64 with bin32.
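
Before loading any packages, it can help to confirm that R actually sees the two command-line tools after the Path change. The following is a small sketch using base R's `Sys.which()`; the variable names are hypothetical, and R may need to be restarted for a new Path entry to become visible.

```r
# The two xpdf tools mentioned above.
tools <- c("pdftotext", "pdfinfo")

# Sys.which() returns the full path of each program, or "" if it is not found.
found <- Sys.which(tools)
missing_tools <- names(found)[found == ""]

if (length(missing_tools) > 0) {
  message("Not found on the search path: ", paste(missing_tools, collapse = ", "))
} else {
  message("Both tools are visible to R.")
}
```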

# Load the packages: 
pakg <- c("tm", 
          "SnowballC", 
          "wordcloud", 
          "RColorBrewer")

lapply(pakg, require, character.only = TRUE)

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE

setwd("E:/")
mk <- Corpus(DirSource("gujarati/", pattern = "pdf"), 
             readerControl = list(reader = readPDF))


#----------------------
#  Process the raw data
#----------------------

# Convert everything to lower case: 
mk <- mk %>%
  tm_map(tolower) 

# Remove all punctuation such as periods and commas: 
mk <- mk %>% tm_map(removePunctuation) 

# Remove English stop words: 
mk <- mk %>% tm_map(removeWords, stopwords("english")) 

# Also remove words such as cengage and learning, because cengage learning
# is the name of the publisher. Counting them in the analysis would make no sense: 
mk <- mk %>% tm_map(removeWords, c("cengage", "learning"))

# Remove numbers: 
mk <- mk %>% tm_map(removeNumbers) 

# Some further clean-up: 
mk <- mk %>% tm_map(stripWhitespace) 
mk <- mk %>% tm_map(PlainTextDocument)

# Convert to a term-document matrix: 
dtm <- TermDocumentMatrix(mk)
dtm

## <<TermDocumentMatrix (terms: 9078, documents: 1)>>
## Non-/sparse entries: 9078/0
## Sparsity           : 0%
## Maximal term length: 188
## Weighting          : term frequency (tf)
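
To see what these cleaning steps do to the text, here is a base R imitation on a single made-up sentence. It does not call tm; the sentence, the two-word stop list, and the variable names are all assumptions for illustration. It does show one side effect worth knowing: once punctuation is stripped, a hyphenated term such as "two-variable" collapses into "twovariable", which is why joined-up words of that kind appear in the frequency tables further down.

```r
# A made-up sentence (not from the book).
raw <- "The two-variable model: OLS estimates, 2 models and the data."

# Tiny hypothetical stop list; tm's stopwords("english") is far longer.
stop_words <- c("the", "and")

txt <- tolower(raw)                    # lower case
txt <- gsub("[[:punct:]]+", "", txt)   # drop punctuation, hyphens included
txt <- gsub("[[:digit:]]+", "", txt)   # drop numbers

# Split on whitespace and remove the stop words.
tokens <- strsplit(trimws(txt), "[[:space:]]+")[[1]]
tokens <- tokens[!tokens %in% stop_words]

# Term frequencies, most frequent first.
term_freq <- sort(table(tokens), decreasing = TRUE)
print(term_freq)
```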

m <- as.matrix(dtm)

v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# d is a familiar data frame: 
str(d)

## 'data.frame':    9078 obs. of  2 variables:
##  $ word: Factor w/ 9078 levels "aaa","aaaa","aappendix",..: 5177 6739 986 1842 5186 8615 8630 8084 5668 8626 ...
##  $ freq: num  2717 2535 1585 1522 1237 ...

# Convert the factor to character: 
d$word <- as.character(d$word)
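
The factor-to-character conversion can be avoided by asking `data.frame()` not to create factors in the first place. A minimal sketch on a toy frequency vector (the values and the name `d_toy` are made up for illustration):

```r
# Toy named frequency vector, sorted the same way as v above.
v <- sort(c(model = 3, data = 5, test = 1), decreasing = TRUE)

# stringsAsFactors = FALSE keeps the word column as character.
d_toy <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)
str(d_toy)
```

In R 4.0.0 and later `stringsAsFactors` already defaults to `FALSE`, so the explicit argument mainly matters on older installations like the one that produced the output above.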

# The 500 most frequent words in Gujarati's book,
# with their frequencies: 
knitr::kable(d %>% slice(1:500))

word	freq
model	2717
regression	2535
can	1585
data	1522
models	1237
value	1231
variables	1203
test	1134
one	1083
variable	1072
will	1053
chapter	1020
time	997
table	975
given	968
values	897
see	887
example	874
may	831
two	829
error	724
estimated	698
linear	688
hypothesis	681
following	672
series	663
sample	661
function	651
income	622
ols	609
coefficient	605
mean	605
rate	594
coefficients	581
figure	572
results	572
obtain	552
term	530
estimate	524
also	519
variance	517
probability	510
use	509
econometrics	507
note	497
var	471
analysis	469
equation	468
therefore	466
distribution	459
part	456
estimators	451
now	444
since	442
zero	439
standard	433
number	429
matrix	422
true	421
expenditure	411
errors	407
shown	401
consider	399
follows	398
consumption	397
change	395
first	395
method	395
percent	390
statistical	390
using	387
obtained	383
thus	382
case	380
observations	366
estimation	356
assumption	355
correlation	351
residuals	350
autocorrelation	346
economic	345
new	345
problem	343
random	339
appendix	335
known	325
level	319
dependent	313
find	313
discussed	311
heteroscedasticity	311
parameters	308
estimator	307
average	300
regressors	300
assumptions	299
section	298
null	292
significant	290
testing	289
tests	289
period	286
price	282
terms	280
statistically	277
significance	276
explanatory	274
although	272
squares	271
say	269
normal	262
classical	261
stochastic	260
intercept	259
statistic	259
population	258
per	252
let	251
used	248
equations	247
york	247
econometric	246
interval	245
slope	243
multicollinearity	242
distributed	241
log	238
relationship	238
specification	237
suppose	237
preceding	236
dummy	234
methods	234
show	233
constant	232
demand	231
exercise	223
theory	223
total	223
large	222
confidence	220
three	218
money	216
gives	213
difference	212
reject	212
size	212
sum	212
least	210
noted	208
different	207
unit	205
equal	202
form	202
several	201
properties	199
states	199
real	197
based	195
shows	194
gdp	191
called	190
critical	190
means	190
estimates	189
partial	188
stationary	188
supply	187
year	187
however	186
present	186
assume	185
interest	184
twovariable	183
lag	182
positive	182
singleequation	182
procedure	181
small	181
dollars	179
increases	179
effect	178
reader	177
various	177
output	176
unbiased	176
whether	174
lagged	173
vol	173
correlated	171
regressions	171
well	171
discussion	168
index	168
university	168
approach	166
multiple	166
expected	164
know	164
point	163
fact	162
simultaneousequation	162
fixed	160
negative	158
individual	157
normally	157
panel	156
usual	155
ratio	153
high	152
source	151
course	150
labor	150
statistics	150
logit	149
take	149
conditional	147
general	147
growth	146
important	146
observation	145
practice	144
empirical	143
line	142
regressand	142
economics	141
estimating	141
increase	140
study	140
nature	139
even	138
press	138
order	137
included	136
often	136
rss	136
autoregressive	135
defined	135
happens	135
process	135
united	135
variances	135
computed	134
hence	134
original	134
regressor	134
trend	134
want	134
another	132
stock	132
appropriate	131
less	131
must	131
durbinwatson	130
information	130
actual	129
fit	128
four	128
likelihood	128
prices	128
applied	127
cost	127
years	127
bias	126
continued	126
greater	126
much	126
simple	126
structural	125
result	124
samples	124
basis	122
education	122
leastsquares	121
john	120
possible	120
criterion	119
capital	118
discuss	118
disturbance	118
question	118
give	117
nonlinear	117
independent	115
personal	115
whereas	115
effects	114
just	114
recall	114
assumed	113
changes	113
endogenous	113
normality	113
collinearity	112
disturbances	112
generally	112
make	112
topics	112
adjusted	111
unemployment	109
relaxing	108
run	108
capita	107
expectations	107
like	107
sales	107
forecasting	106
interpret	106
measure	106
probit	106
problems	106
related	106
chisquare	105
curve	105
found	105
qualitative	105
second	105
among	104
answer	104
eqs	104
introduction	104
previous	104
refer	104
causality	103
examples	103
way	103
elasticity	102
need	102
cov	101
likely	101
plot	101
residual	101
consistent	100
condition	99
current	99
root	99
type	99
simply	98
white	98
book	97
many	97
measurement	97
scale	97
identified	96
savings	96
short	96
squared	96
cit	94
investment	94
made	94
production	94
topic	94
wage	94
write	94
corresponding	93
easily	93
family	93
notice	93
wealth	93
compare	92
hypotheses	92
lags	91
nonstationary	91
respect	91
sons	91
vector	91
age	90
alternative	90
differences	90
periods	90
regress	90
response	90
return	90
set	90
step	90
gls	89
highly	89
measured	89
parameter	89
reducedform	89
serial	89
wages	89
account	88
identification	88
illustrate	88
matter	88
namely	88
relation	88
basic	87
deviation	87
practical	87
prob	87
respectively	87
seen	87
taking	87
expressed	86
for	86
prediction	86
product	86
rates	86
review	86
shall	86
variation	86
concepts	85
inference	85
standardized	85
apply	84
column	84
covariance	84
due	84
latter	84
otherwise	84
words	84
obtaining	83
sense	83
units	83
already	82
billions	82
get	82
journal	82
priori	82
system	82
without	82
choose	81
individually	81
previously	81
rather	81
depends	80
inflation	80
particular	80
pgnp	80
quarter	80
xki	80
exogenous	79
formula	79
marginal	79
modeling	79
notation	79
percentage	79
wiley	79
better	78
distributedlag	78
introduced	78
pdf	78
quite	78
underlying	78
assuming	77
consumer	77
crosssectional	77
exercises	77
presence	77
situation	77
uncorrelated	77
gnp	76
necessary	76
prf	76
suggests	76
summary	76
earlier	75
joint	75
low	75
market	75
mind	75
reasons	75
theorem	75
additional	74
blue	74
illustrative	74
measures	74
packages	74
similar	74
transformation	74
application	73
higher	73
maximum	73
points	73
biased	72
include	72
rank	72
cases	71
decide	71
density	71
directly	71
expectation	71
female	71
firstorder	71
involved	71
larger	71
reason	71
rsquared	71
workers	71
written	71
best	70
chapters	70
interpretation	70
look	70
lpm	70
might	70
theoretical	70
times	70

# These 500 words account for 64.79% of all word occurrences
# in the book: 
sum(d$freq[1:500]) / sum(d$freq)

## [1] 0.6479166

# The top 1000 words together account for 77.99%, so the next
# 500 words add only 13.2%: 
sum(d$freq[1:1000]) / sum(d$freq)

## [1] 0.779916
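
The 13.2% figure is the difference between the two cumulative shares printed above. As a quick check, using only the two reported numbers (the variable names are hypothetical):

```r
# Cumulative shares reported in the output above.
share_top_500  <- 0.6479166
share_top_1000 <- 0.779916

# Share contributed by words ranked 501 to 1000.
share_next_500 <- share_top_1000 - share_top_500
print(round(100 * share_next_500, 1))
```

On the real data the same quantity could be computed directly as `sum(d$freq[501:1000]) / sum(d$freq)`.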

#-----------------------------------------------------
# Below we split the words into two separate data sets. 
# The first holds words found in the English dictionary. 
# The second holds words that are not. 
#-----------------------------------------------------

# The first set consists mostly of common words: 

gu_eng <- d %>% filter(word  %in% eng$V1) 

# The second set is mostly Statistics / Econometrics jargon: 
gu_non_eng <- dplyr::setdiff(d, gu_eng) 

# The book uses 9078 distinct words in total: 
nrow(d)

## [1] 9078

# Of these, the number of words found in the English dictionary is: 
nrow(gu_eng)

## [1] 6344

# And the number of words not found in the English dictionary (mostly
# technical terms, or words related to this field) is: 
nrow(gu_non_eng)

## [1] 2734

nrow(gu_non_eng) / nrow(d)

## [1] 0.3011677
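
The split into dictionary and non-dictionary words is a plain membership test, and the two parts must always add up to the whole. A minimal base R sketch on toy data (the `book_words` frame and the `dictionary` vector are hypothetical stand-ins for `d` and `eng$V1`), followed by a check of the counts reported above:

```r
# Toy stand-ins for d and eng$V1 (hypothetical data).
book_words <- data.frame(
  word = c("model", "ols", "data", "probit", "test"),
  freq = c(5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)
dictionary <- c("model", "data", "test", "value")

# TRUE where the book word is also a dictionary word.
in_dict <- book_words$word %in% dictionary
dict_words     <- book_words[in_dict, ]
non_dict_words <- book_words[!in_dict, ]

# The two parts partition the original rows.
stopifnot(nrow(dict_words) + nrow(non_dict_words) == nrow(book_words))

# The same identity holds for the counts reported in the post.
stopifnot(6344 + 2734 == 9078)
print(round(2734 / 9078, 4))
```

Note that this 30% is a share of distinct words, not of word occurrences.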

# Statistics / Econometrics terms that appear
# at least 100 times: 
knitr::kable(gu_non_eng %>% filter(freq >= 100))

word	freq
ols	609
econometrics	507
var	471
autocorrelation	346
heteroscedasticity	311
stochastic	260
econometric	246
multicollinearity	242
gdp	191
twovariable	183
singleequation	182
simultaneousequation	162
logit	149
regressand	142
rss	136
autoregressive	135
durbinwatson	130
leastsquares	121
collinearity	112
probit	106
chisquare	105
eqs	104
cov	101

# As we can see, although about 30% of the distinct words the author
# uses are not in the English dictionary, most of them are things
# like author names (and other cases). This is easier to see by
# listing the 200 most frequent words in the
# non-dictionary set: 
knitr::kable(head(gu_non_eng, 200))

word	freq
ols	609
econometrics	507
var	471
autocorrelation	346
heteroscedasticity	311
stochastic	260
econometric	246
multicollinearity	242
gdp	191
twovariable	183
singleequation	182
simultaneousequation	162
logit	149
regressand	142
rss	136
autoregressive	135
durbinwatson	130
leastsquares	121
collinearity	112
probit	106
chisquare	105
eqs	104
cov	101
nonstationary	91
gls	89
reducedform	89
prob	87
covariance	84
pgnp	80
xki	80
wiley	79
distributedlag	78
pdf	78
crosssectional	77
uncorrelated	77
gnp	76
prf	76
firstorder	71
rsquared	71
eviews	68
flr	66
koyck	66
crosssection	61
sig	61
fstatistic	59
anova	56
kvariable	56
sls	56
plim	55
std	55
ecm	54
lgdp	54
loglinear	52
nonstochastic	52
clrm	49
mpc	47
srf	47
threevariable	47
mcgrawhill	46
eyi	45
tss	44
cointegration	43
correlogram	43
resid	43
schwarz	43
akaike	42
stationarity	42
website	42
arima	41
eui	40
firstdifference	39
shortrun	39
econometrica	38
homoscedastic	38
stata	38
cobbdouglas	37
hausman	37
phillips	36
ppce	35
englewood	34
homoscedasticity	34
socalled	34
durbin	33
recursive	33
andor	32
goldberger	32
maddala	32
wagesproductivity	31
davidson	30
largesample	30
demandandsupply	29
garch	29
heteroscedastic	29
ils	29
macmillan	29
nonnested	29
vif	29
antilog	28
nonconstant	28
overidentified	28
unbiasedness	28
almon	27
cdf	27
minitab	27
poisson	27
acf	26
gaussmarkov	26
ith	26
lsdv	26
savingsincome	26
scattergram	26
uit	26
nlrm	25
theil	25
autoregression	24
exp	24
gujarati	24
mackinnon	24
tstatistic	24
wls	24
boxjenkins	23
cointegrated	23
greene	23
hac	23
jan	23
variancecovariance	23
adf	22
autocorrelated	22
covariances	22
glejser	22
keynesian	22
kmenta	22
tion	22
aic	21
egls	21
eut	21
exogeneity	21
kurtosis	21
mse	21
ppdi	21
righthand	21
dickeyfuller	20
doublelog	20
jarquebera	20
lgdpt	20
mit	20
obs	20
twostage	20
twostep	20
varcov	20
cnlrm	19
gdpt	19
hannanquinn	19
lagrange	19
rwm	19
sims	19
wageseducation	19
autocorrelations	18
bernoulli	18
capm	18
johnston	18
linearity	18
pce	18
tobit	18
xti	18
zeroorder	18
arma	17
cofactor	17
criter	17
cunr	17
dpi	17
micronumerosity	17
neweywest	17
onetail	17
pacf	17
uin	17
wellknown	17
ahe	16
bivariate	16
eyt	16
gpdi	16
misspecification	16
nyse	16
timeinvariant	16
charemza	15
cpit	15
dpt	15
lpcet	15
microeconomics	15
piecewise	15
ppcet	15
quandt	15
semilog	15
boxcox	14
confidenceinterval	14
eds	14
elgar	14
firstdifferenced	14
gnpt	14
gpa	14

# This leads to a bold conclusion: the author uses only
# about 6344 English words. It also means that the 500 most
# frequent dictionary words account for nearly 70% of all
# dictionary-word occurrences in the book: 
sum(gu_eng$freq[1:500]) / sum(gu_eng$freq)

## [1] 0.6958463

#------------------------------------
#        Draw the word clouds
#------------------------------------

# For the most frequent words in set 1 (at most 200): 
par(bg = "black") 
set.seed(1709)
wordcloud(words = gu_eng$word, 
          freq = gu_eng$freq, 
          # Only show words that appear at least 100 times:
          min.freq = 100,
          # Show at most 200 words in the word cloud: 
          max.words = 200, 
          # Plot words in decreasing frequency, not in random order: 
          random.order = FALSE, 
          # 35% of the words are drawn vertically:
          rot.per = 0.35, 
          # Use a bold font face:
          font = 2,
          # Colors for the words: 
          colors = brewer.pal(8, "Dark2"))

# For the most frequent words in set 2 (at most 200): 
set.seed(29)
wordcloud(words = gu_non_eng$word, 
          freq = gu_non_eng$freq, 
          # Only show words that appear at least 10 times:
          min.freq = 10,
          # Show at most 200 words in the word cloud: 
          max.words = 200, 
          # Plot words in decreasing frequency, not in random order: 
          random.order = FALSE, 
          # 35% of the words are drawn vertically:
          rot.per = 0.35, 
          # Use a bold font face:
          font = 2,
          # Colors for the words: 
          colors = brewer.pal(8, "Dark2"))

Text Mining: The number of English words used for Basic Econometrics by Gujarati

Especially to TT

Nguyen Chi Dung
