Vershynin, Roman, High-dimensional probability. An introduction with applications in data science
\[ |z-1| \leq\left|z^{2}-1\right| \text { for all } z \geq 0 \;\Rightarrow\; \mathbb{E}\left(\frac{\|X\|_{2}}{\sqrt{n}}-1\right)^{2} \leq \mathbb{E}\left(\frac{\|X\|_{2}^{2}}{n}-1\right)^{2} \] \[ \sqrt{n \pm O(\sqrt{n})}=\sqrt{n} \pm O(1) \]
\[ |z-1| \geq \delta \text { implies }\left|z^{2}-1\right| \geq \max \left(\delta, \delta^{2}\right), \quad\text{hence}\quad \mathbb{P}\left\{\left|\frac{1}{\sqrt{n}}\|X\|_{2}-1\right| \geq \delta\right\} \leq \mathbb{P}\left\{\left|\frac{1}{n}\|X\|_{2}^{2}-1\right| \geq \max \left(\delta, \delta^{2}\right)\right\} \]
\[ e^{z} \leq 1+z+\frac{z^{2} / 2}{1-|z| / 3} \text { that is valid provided }|z|<3 \] \[ e^{x} \leq x+e^{x^{2}},\quad \mathbb{E} e^{\lambda X} \leq \mathbb{E}\left[\lambda X+e^{\lambda^{2} X^{2}}\right] \] \[ |x|^{p} \leq p^{p}\left(e^{x}+e^{-x}\right), \mathbb{E}|X|^{p} \leq p^{p}\left(\mathbb{E} e^{X}+\mathbb{E} e^{-X}\right) \]
\[ 1 /(1-x) \leq e^{2 x},x \in[0,1 / 2] \]
Stirling's formula: \( k ! \approx \sqrt{2 \pi k}\,(k / e)^{k} \)
\[ \int_{t}^{\infty}\left(1-3 x^{-4}\right) e^{-x^{2} / 2} d x=\left(\frac{1}{t}-\frac{1}{t^{3}}\right) e^{-t^{2} / 2} \quad\text{(the trick is to insert the term } -3x^{-4}\text{)} \] \[ \mathbb{E}|X|^{p}=\int_{0}^{\infty} p t^{p-1} \mathbb{P}\{|X|>t\} d t \] \[ \|A\|=\max _{x \in S^{n-1}, y \in S^{m-1}}\langle A x, y\rangle \quad\text{(by the Cauchy–Schwarz inequality)} \] \[ \begin{aligned} &\max \left(|z-1|,|z-1|^{2}\right) \leq\left|z^{2}-1\right|, \quad z \geq 0\\ &\left|\|A x\|_{2}^{2}-1\right| \leq \max \left(\delta, \delta^{2}\right) \Rightarrow \left|\|A x\|_{2}-1\right| \leq \delta \end{aligned} \]
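As a quick numerical sanity check (my own, not from the book), the Gaussian tail identity above can be verified with a crude Riemann sum; the truncation point and grid size below are arbitrary choices.

```python
import numpy as np

# Check: int_t^infty (1 - 3x^{-4}) e^{-x^2/2} dx = (1/t - 1/t^3) e^{-t^2/2}.
# Since 1 - 3x^{-4} <= 1, this is how one gets a lower bound on the Gaussian tail.
t = 2.0
x = np.linspace(t, 40.0, 400_000)                     # truncate the infinite upper limit
dx = x[1] - x[0]
lhs = np.sum((1 - 3 * x**-4) * np.exp(-x**2 / 2)) * dx
rhs = (1 / t - 1 / t**3) * np.exp(-t**2 / 2)
print(lhs, rhs)                                       # agree to ~4 decimal places
```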
Exercise 2.2.9 (Robust estimation of the mean). We want to estimate the mean \(\mu\) of a random variable \(X\) from a sample \(X_{1}, \ldots, X_{N}\) drawn independently from the distribution of \(X\). We want an \(\varepsilon\)-accurate estimate, i.e. one that falls in the interval \((\mu-\varepsilon, \mu+\varepsilon)\).
Exercise (Bounds on the expectation of the maximum of samples from a Gaussian). Let \(Y=\max _{1 \leq i \leq n} X_{i},\) where the \(X_{i} \sim \mathcal{N}\left(0, \sigma^{2}\right)\) are i.i.d. random variables. Then \[ \frac{1}{\sqrt{\pi \log 2}}\, \sigma \sqrt{\log n} \leq \mathbb{E}[Y] \leq \sqrt{2}\, \sigma \sqrt{\log n} \]
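Since the bound is only stated here, a quick Monte Carlo check is reassuring (my own sketch; the values of \(n\), \(\sigma\) and the number of trials are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of  sigma*sqrt(log n)/sqrt(pi*log 2) <= E[max_i X_i] <= sqrt(2)*sigma*sqrt(log n)
rng = np.random.default_rng(0)
n, sigma, trials = 1000, 2.0, 5000
samples = rng.normal(0.0, sigma, size=(trials, n))
emp = samples.max(axis=1).mean()                      # empirical estimate of E[Y]
lower = sigma * np.sqrt(np.log(n)) / np.sqrt(np.pi * np.log(2))
upper = np.sqrt(2) * sigma * np.sqrt(np.log(n))
print(lower, emp, upper)                              # lower <= E[Y] <= upper
```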
Exercise 2.6.5 (Khintchine’s inequality). Let \(X_{1}, \ldots, X_{N}\) be independent sub-gaussian random variables with zero means and unit variances, and let \(a=\left(a_{1}, \ldots, a_{N}\right) \in \mathbb{R}^{N}\). Prove that for every \(p \in[2, \infty)\) we have \[ \left(\sum_{i=1}^{N} a_{i}^{2}\right)^{1 / 2} \leq\left\|\sum_{i=1}^{N} a_{i} X_{i}\right\|_{L^{p}} \leq C K \sqrt{p}\left(\sum_{i=1}^{N} a_{i}^{2}\right)^{1 / 2} \] where \(K=\max _{i}\left\|X_{i}\right\|_{\psi_{2}}\) and \(C\) is an absolute constant.
Answer: use the equivalent (moment-based) definition of the sub-gaussian norm.
Exercise 3.1.4 (Expectation of the norm).
Can \(C K^{2}\) be replaced by \(o(1),\) a quantity that vanishes as \(n \rightarrow \infty ?\)
Show that \[ \operatorname{Var}\left(\|X\|_{2}\right) \leq C K^{4} \]
Answer 1 uses only a finite fourth moment; Answer 2 uses sub-gaussianity. Start with \(\operatorname{Var}\|X\|_{2}=\mathbb{E}\left(\|X\|_{2}-\mathbb{E}\|X\|_{2}\right)^{2} \leq \mathbb{E}\left(\|X\|_{2}-\sqrt{n}\right)^{2}\), because the mean \(\mathbb{E}\|X\|_{2}\) minimizes the mean squared error.
Exercise 3.3.7 (Normal and spherical distributions). Let us represent \(g \sim N\left(0, I_{n}\right)\) in polar form as \[ g=r \theta \] where \(r=\|g\|_{2}\) is the length and \(\theta=g /\|g\|_{2}\) is the direction of \(g\). Prove the following:
The length \(r\) and direction \(\theta\) are independent random variables.
The direction \(\theta\) is uniformly distributed on the unit sphere \(S^{n-1}\).
Exercise 3.4.3. This exercise clarifies the role of independence of the coordinates in Lemma 3.4.2.
Let \(X=\left(X_{1}, \ldots, X_{n}\right) \in \mathbb{R}^{n}\) be a random vector with sub-gaussian coordinates \(X_{i}\). Show that \(X\) is a sub-gaussian random vector.
Nevertheless, find an example of a random vector \(X\) with \[ \|X\|_{\psi_{2}} \gg \max _{i \leq n}\left\|X_{i}\right\|_{\psi_{2}} \]
Answer 1: try to prove that \(Y:=\langle X, x\rangle\) satisfies \(\|Y\|_{L^p} \leq K_{2} \sqrt{p}\) for all \(p \geq 1\).
Lemma 3.6.6 (Grothendieck’s identity). Consider a random vector \(g \sim N\left(0, I_{n}\right)\). Then, for any fixed vectors \(u, v \in S^{n-1},\) we have \[ \mathbb{E} \operatorname{sign}\langle g, u\rangle \operatorname{sign}\langle g, v\rangle=\frac{2}{\pi} \arcsin \langle u, v\rangle \]
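A quick simulation makes the identity concrete (my own sketch; the dimension and sample size are arbitrary):

```python
import numpy as np

# Monte Carlo check of E[sign<g,u> sign<g,v>] = (2/pi) arcsin(<u,v>) for g ~ N(0, I_n).
rng = np.random.default_rng(1)
n, trials = 5, 200_000
u = rng.normal(size=n); u /= np.linalg.norm(u)        # two fixed unit vectors
v = rng.normal(size=n); v /= np.linalg.norm(v)
g = rng.normal(size=(trials, n))
empirical = np.mean(np.sign(g @ u) * np.sign(g @ v))
exact = (2 / np.pi) * np.arcsin(u @ v)
print(empirical, exact)                               # agree up to ~1/sqrt(trials) Monte Carlo error
```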
The book begins with an appetizer: using the probabilistic method to cover a geometric set. The classical Caratheodory theorem tells us that any point in the convex hull of a set in \(\mathbb{R}^n\) can be written as a convex combination of at most \(n+1\) points of the set. If we only want to approximate a point of the convex hull, we actually get a dimension-free result: for any integer \(k\) independent of \(n\), there exist \(k\) points \(x_1,\cdots, x_k\) such that \[ \left\|x-\frac{1}{k} \sum_{j=1}^{k} x_{j}\right\|_{2} \leq \frac{1}{\sqrt{k}} \] The idea of using the probabilistic method to show a statement of the form “there exists …” is simple: we replace the \(x_j\) by random variables \(Z_j\) and show that \(\left\|x-\frac{1}{k} \sum_{j=1}^{k} Z_{j}\right\|_{2}^{2} \leq 1/k\) holds with positive probability; then there must be a realization of \(Z_1,\cdots ,Z_k\) that makes the inequality happen. One way to show positive probability is simply to prove \(\mathbb{E}\left\|x-\frac{1}{k} \sum_{j=1}^{k} Z_{j}\right\|_{2}^{2} \leq 1/k\).
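The following sketch (mine, not from the book) runs exactly this sampling argument; I assume the set lies in the unit ball, which is the normalization under which \(\mathbb{E}\left\|x-\frac{1}{k}\sum_j Z_j\right\|_2^2 \leq 1/k\) holds.

```python
import numpy as np

# Approximate Caratheodory by sampling: x in conv(T) is approximated by an average of
# k i.i.d. points Z_j drawn from T with the convex-combination weights, so that E Z_j = x.
rng = np.random.default_rng(2)
n, N, k = 500, 2000, 10
T = rng.normal(size=(N, n))
T /= np.linalg.norm(T, axis=1, keepdims=True)         # points of T on the unit sphere (||t||_2 <= 1)

weights = rng.dirichlet(np.ones(N))                   # x in conv(T) as a convex combination
x = weights @ T

sq_errors = []
for _ in range(200):
    idx = rng.choice(N, size=k, p=weights)            # Z_1, ..., Z_k
    sq_errors.append(np.linalg.norm(x - T[idx].mean(axis=0)) ** 2)
print(np.mean(sq_errors), 1 / k)                      # E||x - (1/k) sum Z_j||^2 <= 1/k
```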
Ch1 covers some preliminaries on random variables; two things were unfamiliar to me: first, the Hölder inequality \(|\mathbb{E} X Y| \leq\|X\|_{L^{p}}\|Y\|_{L^{q}}\); second, the integral identity (three versions) for the expectation \[ \begin{aligned} & X \geq 0: \quad \mathbb{E} X=\int_{0}^{\infty} \mathbb{P}\{X>t\} d t\\ & \mathbb{E} X=\int_{0}^{\infty} \mathbb{P}\{X>t\} d t-\int_{-\infty}^{0} \mathbb{P}\{X<t\} d t\\ & \mathbb{E}|X|^{p}=\int_{0}^{\infty} p t^{p-1} \mathbb{P}\{|X|>t\} d t \end{aligned} \]
Two common limit theorems for Bernoulli random variables are presented: the first is the i.i.d. case, which corresponds to the CLT; the second is the Poisson limit theorem, which holds when the sum of the success probabilities converges to a finite value \(\lambda\).
Ch2 travels through concentration of sums of independent random variables. We do not apply the Berry-Esseen CLT due to its asymptotic nature; instead we focus on non-asymptotic results such as Hoeffding’s inequality for sub-gaussian and Bernstein’s inequality for sub-exponential random variables. Using Chernoff’s inequality on the Erdős–Rényi graph \(G(n,p)\), we can show that dense graphs (\(d \geq C \log n\)) are almost regular, which means that the degree of each vertex is approximately equal to \((n-1)p=d\). However, the conclusion for sparse graphs is different; I’d better try to do those exercises in the future.
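A small simulation (my own; \(n\) and \(p\) are arbitrary, chosen so that \(d \gg \log n\)) shows the almost-regular behaviour in the dense regime:

```python
import numpy as np

# Dense Erdos-Renyi graph G(n, p): every degree should be close to d = (n-1)p.
rng = np.random.default_rng(3)
n, p = 2000, 0.2
d = (n - 1) * p
upper = (rng.random((n, n)) < p).astype(int)
A = np.triu(upper, k=1)                               # independent edges, no self-loops
A = A + A.T                                           # symmetric adjacency matrix
degrees = A.sum(axis=1)
print(d, degrees.min(), degrees.max())                # min and max degree both close to d ~ 400
```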
One frequently used equivalent property of sub-gaussians is the moment bound \(\|X\|_{L^{p}}=\left(\mathbb{E}|X|^{p}\right)^{1 / p} \leq K_{2} \sqrt{p}\); this is often easy to check using the integral identity. The sub-gaussian norm is a convenient way to characterize sub-gaussianity, and for a sum of independent sub-gaussians we have \(\left\|\sum_{i=1}^{N} X_{i}\right\|_{\psi_{2}}^{2} \leq C \sum_{i=1}^{N}\left\|X_{i}\right\|_{\psi_{2}}^{2}\). Since expectation is closely connected with tail probabilities, we can derive a moment inequality, Khintchine’s inequality, from the general Hoeffding inequality. (It is worthwhile to remember the general version of Hoeffding’s inequality, since it recovers the classical one.) We also see that centering does no harm to sub-gaussianity, because \(\|X-\mathbb{E} X\|_{\psi_{2}} \leq C\|X\|_{\psi_{2}}\). For sub-exponential variables, we have the following relations with sub-gaussians \[ \begin{aligned} &\left\|X^{2}\right\|_{\psi_{1}}=\|X\|_{\psi_{2}}^{2}, \quad X \text{ sub-gaussian}\\ &\|X Y\|_{\psi_{1}} \leq\|X\|_{\psi_{2}}\|Y\|_{\psi_{2}}, \quad \text{the product of sub-gaussians is sub-exponential} \end{aligned} \] Recall that the Bernstein-type inequality contains a term of the form \(\min(\cdot,\cdot)\). The reason is that the bound on the MGF of a sub-exponential variable does not hold for all \(\lambda>0\); the difference appears when we pick the optimal value of \(\lambda\) in the Chernoff argument.
McDiarmid’s inequality and Bennett’s inequality are generalizations of Hoeffding’s and Chernoff’s inequalities respectively, but they both involve some boundedness, e.g. \(\left|X_{i}-\mathbb{E} X_{i}\right| \leq K\).
Ch3 moves to high-dimensional random vectors; most results deal with isotropic random vectors. In high-dimensional space, independent isotropic random vectors tend to be almost orthogonal, as we can see from \(\bar{X}:=X/{\|X\|_{2}},\ \bar{Y}:=Y/{\|Y\|_{2}},\ |\langle\bar{X}, \bar{Y}\rangle| \asymp 1/{\sqrt{n}}\). Besides the multivariate Gaussian, the spherically distributed random vector \(X\sim\) Unif\(\left(\sqrt{n} S^{n-1}\right)\) is also isotropic. Instead of concentrating in a ball, a high-dimensional Gaussian vector concentrates in a thin spherical shell, since \(\|g\|_{2} \approx \sqrt{n}\) with high probability. High-dimensional random vectors can also be “sub-gaussian”: the Orlicz norm of a random vector is defined as the maximal sub-gaussian norm over all one-dimensional projections. Examples include the multivariate Gaussian, the multivariate Bernoulli, the uniform distribution on the sphere and the uniform distribution on the Euclidean ball. For Unif\(\left(\sqrt{n} S^{n-1}\right)\), the marginals become asymptotically normal as \(n\) increases, which is called the projective limit theorem.
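Both phenomena are easy to see numerically; the sketch below (mine) picks an arbitrary dimension and sample size.

```python
import numpy as np

# (1) Independent isotropic vectors are nearly orthogonal: |<X/||X||, Y/||Y||>| is about 1/sqrt(n).
# (2) Thin shell: ||g||_2 concentrates around sqrt(n) with O(1) fluctuations.
rng = np.random.default_rng(4)
n, trials = 10_000, 500
X = rng.normal(size=(trials, n))
Y = rng.normal(size=(trials, n))
cosines = np.abs(np.sum(X * Y, axis=1)) / (np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))
print(np.mean(cosines) * np.sqrt(n))                  # O(1), so the cosine itself is ~ 1/sqrt(n)
norms = np.linalg.norm(X, axis=1)
print(norms.mean() / np.sqrt(n), norms.std())         # ratio ~ 1, standard deviation O(1)
```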
The most refreshing application I learned in this chapter is Grothendieck’s inequality, which can be used to quantify the performance of semidefinite programming. It arose originally in functional analysis and was proved in a different way; here Roman gives a probabilistic proof of Grothendieck’s inequality. The integer optimization problem (NP-hard) corresponds to the assumption of Grothendieck’s inequality, and the SDP corresponds to its conclusion. The general strategy for SDP relaxations is to replace problem 1 with problem 2 below: \[ \begin{aligned} & 1: \text { maximize } \sum_{i, j=1}^{n} A_{i j} x_{i} x_{j}: \quad x_{i}=\pm 1 \text { for } i=1, \ldots, n\\ & 2: \text { maximize } \sum_{i, j=1}^{n} A_{i j}\left\langle X_{i}, X_{j}\right\rangle: \quad\left\|X_{i}\right\|_{2}=1 \text { for } i=1, \ldots, n \end{aligned} \] In another application, maximum cut of a graph, we do a similar replacement: \[ \begin{aligned} & 1: \text { MAX-CUT }(G)=\frac{1}{4} \max \left\{\sum_{i, j=1}^{n} A_{i j}\left(1-x_{i} x_{j}\right): x_{i}=\pm 1 \text { for all } i\right\}\\ & 2: \operatorname{SDP}(G):=\frac{1}{4} \max \left\{\sum_{i, j=1}^{n} A_{i j}\left(1-\left\langle X_{i}, X_{j}\right\rangle\right): X_{i} \in \mathbb{R}^{n},\left\|X_{i}\right\|_{2}=1 \text { for all } i\right\} \end{aligned} \] The last part of this chapter is devoted to the tight constant in Grothendieck’s inequality, with an emphasis on the kernel trick. Grothendieck’s identity is the first step of the proof: for \(g \sim N\left(0, I_{n}\right)\) and fixed vectors \(u, v \in S^{n-1}\) we have \(\mathbb{E} \operatorname{sign}\langle g, u\rangle \operatorname{sign}\langle g, v\rangle=\frac{2}{\pi} \arcsin \langle u, v\rangle\). The second step uses RKHS machinery to build the connection \(\frac{2}{\pi} \arcsin \langle\Phi(u), \Psi(v)\rangle=\beta\langle u, v\rangle\). The philosophy in machine learning is: to compute the inner product \(\langle\Phi(u), \Phi(v)\rangle\) in \(H\), one does not need to know \(\Phi\); an identity allows us to compute \(K(u,v)\) instead \[ \langle\Phi(u), \Phi(v)\rangle=K(u, v) \quad \text { for all } u, v \in \mathcal{X} \]
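To make relaxation 2 concrete, here is a minimal sketch of the MAX-CUT SDP plus the random-hyperplane rounding that Grothendieck’s identity analyzes; it assumes the cvxpy package (with a default SDP-capable solver) is available, and the toy graph is my own choice, not an example from the book.

```python
import numpy as np
import cvxpy as cp

# SDP relaxation of MAX-CUT on a small random graph, followed by random-hyperplane rounding.
rng = np.random.default_rng(5)
n = 12
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1) + np.triu(A, 1).T                   # adjacency matrix of a toy graph

# Encode the unit vectors X_i through their Gram matrix G_ij = <X_i, X_j>: G >> 0, G_ii = 1.
G = cp.Variable((n, n), symmetric=True)
constraints = [G >> 0] + [G[i, i] == 1 for i in range(n)]
sdp_value = cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, 1 - G)) / 4), constraints).solve()

# Recover unit vectors X_i from the Gram matrix and round with a random hyperplane.
eigval, eigvec = np.linalg.eigh(G.value)
V = eigvec * np.sqrt(np.clip(eigval, 0, None))        # rows V_i satisfy <V_i, V_j> = G_ij
g = rng.normal(size=n)
x = np.sign(V @ g)                                    # x_i = sign<g, X_i>, as in Grothendieck's identity
cut_value = np.sum(A * (1 - np.outer(x, x))) / 4
print(sdp_value, cut_value)                           # SDP(G) >= MAX-CUT(G) >= value of the rounded cut
```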
Ch4 gives some non-asymptotic results for random matrices. Similar to the approximate orthogonality of independent high-dimensional random vectors, a random matrix satisfies an approximate isometry property, i.e. \(A^{\top} A\approx I_{n}\) or \((1-\delta)\|x\|_{2} \leq\|A x\|_{2} \leq(1+\delta)\|x\|_{2}\). Using this isometry property we can preserve the distances between points when we do dimension reduction. A good connection between projections and isometries (with \(Q^{\top} Q=I_{n}\), \(P:=Q Q^{\top}\)): a matrix \(A\) is an isometric embedding of \(\mathbb{R}^{n}\) into \(\mathbb{R}^{m}\) if and only if \(A^{\top}\) is a projection from \(\mathbb{R}^{m}\) onto \(\mathbb{R}^{n}\).
Geometric concepts such as covering, packing and \(\epsilon\)-nets are useful tools for discretization before we apply the union bound. A common choice for the sphere \(S^{n-1}\) is \[ \epsilon=\frac{1}{4}, \quad |\mathcal{N}| \leq 9^{n} \] Several clustering applications were introduced; these problems can be approached through spectral decomposition, so the method is called spectral clustering. Examples: community detection in the stochastic block model and clustering of Gaussian mixtures. Since the cluster labels are related to a population eigenvector, the problem reduces to estimating the principal eigenvector. Perturbation theory such as the Davis–Kahan theorem is used to quantify the error between the estimator and the population version.
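A compact illustration of spectral clustering on a two-community stochastic block model (my own toy parameters, dense regime):

```python
import numpy as np

# Spectral clustering for the stochastic block model: the sign pattern of the eigenvector
# of the second-largest eigenvalue of the adjacency matrix recovers the two communities.
rng = np.random.default_rng(6)
n, p, q = 600, 0.30, 0.05                             # within- and between-community edge probabilities
labels = np.repeat([1, -1], n // 2)                   # ground-truth community labels
P = np.where(np.equal.outer(labels, labels), p, q)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1) + np.triu(A, 1).T                   # symmetric adjacency matrix

eigval, eigvec = np.linalg.eigh(A)                    # eigenvalues in ascending order
v2 = eigvec[:, -2]                                    # eigenvector of the second-largest eigenvalue
predicted = np.sign(v2)
accuracy = max(np.mean(predicted == labels), np.mean(predicted == -labels))
print(accuracy)                                       # close to 1 in this dense regime
```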
Exercise 5.1.10 Let \(M\) be a median of \(f(X)\). Show that concentration about the median implies concentration about the mean: \[ \|f(X)-M\|_{\psi_{2}} \leq C \Rightarrow \|f(X)-\mathbb{E} f(X)\|_{\psi_{2}} \leq C_1 \] Hint: \((f-M)-\mathbb{E}(f-M)=f-\mathbb{E} f\), and the centering Lemma 2.6.8.
Exercise 5.1.13 Concentration about expectation and median are equivalent. Consider a random variable \(Z\) with median \(M\). Show that \[ c\|Z-\mathbb{E} Z\|_{\psi_{2}} \leq\|Z-M\|_{\psi_{2}} \leq C\|Z-\mathbb{E} Z\|_{\psi_{2}} \] where \(c, C>0\) are some absolute constants.
Hint: the first inequality was proved in the previous exercise; for the second inequality, use the definition of the median and Jensen’s inequality.
Exercise 5.1.15 (Exponential set of mutually almost orthogonal points) From linear algebra, we know that any set of orthonormal vectors in \(\mathbb{R}^{n}\) must contain at most \(n\) vectors. However, if we allow the vectors to be almost orthogonal, there can be exponentially many of them! Prove this counterintuitive fact as follows. Fix \(\varepsilon \in(0,1)\). Show that there exists a set \(\left\{x_{1}, \ldots, x_{N}\right\}\) of unit vectors in \(\mathbb{R}^{n}\) which are mutually almost orthogonal: \[ \left|\left\langle x_{i}, x_{j}\right\rangle\right| \leq \varepsilon \quad \text { for all } i \neq j \] and the set is exponentially large in \(n\) : \[ N \geq \exp (c(\varepsilon) n) \]
Hint: “there exists” means that we can use the probabilistic method to show the event happens with positive probability. First choose \(k = \exp(\epsilon^2 n/4)\) vectors \(v_1, \dots, v_k\) by choosing each coordinate to be \(\pm 1\) with probability \(1/2\) each. Then define \(u_i = v_i/\sqrt{n}\). A Chernoff (Hoeffding) bound shows that the probability of \(|\langle u_i, u_j \rangle| \geq \epsilon\) is at most \(2 \exp(-(\epsilon^2/2)n)\). This equals \(2/k^2\) by the choice of \(k\), and hence one can take a union bound over the at most \(\binom{k}{2} < k^2/2\) pairs \((i,j)\) to show that there is a positive probability of \(|\langle u_i, u_j \rangle| < \epsilon\) holding for all \(i \neq j\).
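The hint translates directly into a few lines of code (my own sketch; \(\varepsilon\) and \(n\) are arbitrary, and the check succeeds with high probability rather than always):

```python
import numpy as np

# Exponentially many almost-orthogonal unit vectors from random +-1/sqrt(n) coordinates.
rng = np.random.default_rng(7)
eps, n = 0.25, 400
k = int(np.exp(eps**2 * n / 4))                       # ~ 518 vectors, exponential in n
U = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(n) # unit vectors u_i
G = U @ U.T
max_offdiag = np.abs(G - np.eye(k)).max()
print(k, max_offdiag)                                 # with high probability max_{i != j} |<u_i, u_j>| < eps
```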
Exercise 5.2.3 Let \(X \sim N\left(0, I_{n}\right)\) and let \(f\) be a Lipschitz function; prove that \(\|f(X)-\mathbb{E} f(X)\|_{\psi_{2}} \leq C\|f\|_{\text{Lip}}\).
Hint: use the Gaussian isoperimetric inequality; since the Gaussian measure is rotation-invariant, we can rotate the half-space and focus only on the first coordinate.
Exercise 5.2.4 Replace \(\mathbb{E} f(X)\) by \(\left(\mathbb{E} f^{p}\right)^{1 / p}\): for a non-negative Lipschitz function \(f\), show a similar result, where the constants may depend on \(p\).
Hint: Let \(Z=f(X) \geq 0\). Then \(\|Z-\mathbb{E} Z\|_{p} \geq\|Z\|_{p}-\|\mathbb{E} Z\|_{p}=\|Z\|_{p}-\mathbb{E} Z\), and \[ \begin{aligned} \left\|Z-\|Z\|_{p}\right\|_{\psi_{2}} &\leq\|Z-\mathbb{E} Z\|_{\psi_{2}}+\left\|\mathbb{E} Z-\|Z\|_{p}\right\|_{\psi_{2}} \leq C\|f\|_{\mathrm{Lip}}+C^{\prime}\left|\mathbb{E} Z-\|Z\|_{p}\right| \\ \left|\mathbb{E} Z-\|Z\|_{p}\right| &\leq\|\mathbb{E} Z-Z\|_{p} \leq C^{\prime \prime} \sqrt{p}\,\|Z-\mathbb{E} Z\|_{\psi_{2}} \leq C^{\prime \prime} C \sqrt{p}\,\|f\|_{\mathrm{Lip}} \end{aligned} \]
Ch5 focuses on the concentration of Lipschitz functions; the Lipschitz property guarantees that the function values do not oscillate too much. Most of the time people turn to Lipschitz functions when they want to extend results from linear to nonlinear functions. In this chapter we also relax the independence assumption somewhat: for example we look at the uniform distributions on the sphere and the ball, whose coordinates obviously are not independent, but which have good geometric properties.
The concentration argument proceeds in two steps. First we need an isoperimetric property on \(\mathbb{R}^n\) (or on the sphere), which says that among all sets of a given measure, the ball/spherical cap has the \(\epsilon\)-neighborhood of minimal volume, or the half-space in the Gaussian case. The second step is to show a “blow-up” phenomenon for the \(\epsilon\)-neighborhood: even if we enlarge the set just a little bit, this small neighborhood contains almost all of the volume/probability.
Some material in this chapter is hard for me; I need to learn some Haar measure in order to better understand concentration on the symmetric group, the orthogonal group and the Grassmannian. Nevertheless, the concentration inequalities for Lipschitz functions are elegant in those special cases; other examples include the continuous cube, the Euclidean ball and the Hamming cube. The result can be generalized to other densities of the form \(f(x)=e^{-U(x)}\): if \(U\) is a general function whose curvature is at least like that of \(\|x\|_{2}^{2}\), then we should expect at least Gaussian concentration. Mathematically, \(\operatorname{Hess} U(x) \succeq \kappa I_{n}\) implies \(\|f(X)-\mathbb{E} f(X)\|_{\psi_{2}} \leq C\|f\|_{\text {Lip }}/{\sqrt{\kappa}}\).
Applying the result for the Grassmannian, the Johnson-Lindenstrauss lemma states that random projection is a promising tool to reduce the dimension of the data without sacrificing too much of its geometry. This chapter also spends a lot of time deriving the matrix Bernstein inequality; applying it, we get applications to community detection in sparse networks and to the analysis of covariance matrices for general distributions.
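The random-projection statement is easy to try out; the sketch below (mine, with arbitrary \(n\), \(N\), \(\varepsilon\) and a heuristic constant in the choice of \(m\)) checks that all pairwise distances are preserved up to a factor close to 1.

```python
import numpy as np

# Johnson-Lindenstrauss: a random Gaussian projection to m ~ log(N)/eps^2 dimensions
# approximately preserves all pairwise distances among N points, whatever the ambient dimension n is.
rng = np.random.default_rng(8)
n, N, eps = 2000, 50, 0.2
m = int(8 * np.log(N) / eps**2)                       # heuristic constant 8
X = rng.normal(size=(N, n)) + rng.random(n)           # N arbitrary points in R^n
P = rng.normal(size=(m, n)) / np.sqrt(m)              # scaled Gaussian projection (approximate isometry)
Y = X @ P.T

def pairwise(Z):
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

i, j = np.triu_indices(N, k=1)
ratio = pairwise(Y)[i, j] / pairwise(X)[i, j]
print(ratio.min(), ratio.max())                       # all ratios within roughly 1 +- eps
```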
Exercise 6.4.5 Let \(F: \mathbb{R}_{+} \rightarrow \mathbb{R}\) be an increasing, convex function. Show that the same inequalities as in Lemma 6.4.2 hold if the norm \(\|\cdot\|\) is replaced with \(F(\|\cdot\|)\), namely
\[ \mathbb{E} F\left(\frac{1}{2}\left\|\sum_{i=1}^{N} \varepsilon_{i} X_{i}\right\|\right) \leq \mathbb{E} F\left(\left\|\sum_{i=1}^{N} X_{i}\right\|\right) \leq \mathbb{E} F\left(2\left\|\sum_{i=1}^{N} \varepsilon_{i} X_{i}\right\|\right) \]
Hint: notice that \(F(\|\cdot\|)\) is still convex, so we can apply \(\mathbb{E} F(Y) \leq \mathbb{E} F(Y+Z)\) for convex \(F\) and mean-zero \(Z\). Also use the distributional identity \(\varepsilon_{i}\left(X_{i}-X_{i}^{\prime}\right) \overset{d}{=} X_{i}-X_{i}^{\prime}\), and the triangle inequality.
Exercise 6.4.6 Let \(X_{1}, \ldots, X_{N}\) be independent, mean zero random variables. Show that their sum \(\sum_{i} X_{i}\) is sub-gaussian if and only if \(\sum_{i} \varepsilon_{i} X_{i}\) is subgaussian, and \[ c\left\|\sum_{i=1}^{N} \varepsilon_{i} X_{i}\right\|_{\psi_{2}} \leq\left\|\sum_{i=1}^{N} X_{i}\right\|_{\psi_{2}} \leq C\left\|\sum_{i=1}^{N} \varepsilon_{i} X_{i}\right\|_{\psi_{2}} \]
Hint: Use the result of Exercise 6.4.5 with \(F(x)=\exp (\lambda x)\) to bound the moment generating function, or with \(F(x)=\exp \left(c x^{2}\right)\). Alternatively, the sub-gaussian norm is also convex.
Exercise 6.7.3 (Contraction principle for general distributions) Let \(X_{1}, \ldots, X_{N}\) be independent, mean zero random vectors in a normed space, and let \(a=\left(a_{1}, \ldots, a_{N}\right) \in \mathbb{R}^{N}\). Then \[ \mathbb{E}\left\|\sum_{i=1}^{N} a_{i} X_{i}\right\| \leq 4\|a\|_{\infty} \cdot \mathbb{E}\left\|\sum_{i=1}^{N} X_{i}\right\| \]
Proof: use the contraction principle, symmetrization and decoupling. Apply the following inequalities: \[ \mathbb{E}\left\|\sum_{i=1}^{N} a_{i} \varepsilon_{i} x_{i}\right\| \leq\|a\|_{\infty} \cdot \mathbb{E}\left\|\sum_{i=1}^{N} \varepsilon_{i} x_{i}\right\| \] \[ \frac{1}{2} \mathbb{E}\left\|\sum_{i=1}^{N} \varepsilon_{i} X_{i}\right\| \leq \mathbb{E}\left\|\sum_{i=1}^{N} X_{i}\right\| \leq 2 \mathbb{E}\left\|\sum_{i=1}^{N} \varepsilon_{i} X_{i}\right\| \]
Exercise 6.7.7 (Talagrand’s contraction principle) Consider a bounded subset \(T \subset \mathbb{R}^{n},\) and let \(\varepsilon_{1}, \ldots, \varepsilon_{n}\) be independent symmetric Bernoulli random variables. Let \(\phi_{i}: \mathbb{R} \rightarrow \mathbb{R}\) be contractions, i.e. Lipschitz functions with \(\left\|\phi_{i}\right\|_{\text {Lip }} \leq 1\). Then \[ \mathbb{E} \sup _{t \in T} \sum_{i=1}^{n} \varepsilon_{i} \phi_{i}\left(t_{i}\right) \leq \mathbb{E} \sup _{t \in T} \sum_{i=1}^{n} \varepsilon_{i} t_{i} \] Hint: (a) First let \(n=2\). Consider a subset \(T \subset \mathbb{R}^{2}\) and a contraction \(\phi: \mathbb{R} \rightarrow \mathbb{R}\), and check that \[ \sup _{t \in T}\left(t_{1}+\phi\left(t_{2}\right)\right)+\sup _{t \in T}\left(t_{1}-\phi\left(t_{2}\right)\right) \leq \sup _{t \in T}\left(t_{1}+t_{2}\right)+\sup _{t \in T}\left(t_{1}-t_{2}\right) \] (b) Use induction: condition on \(\varepsilon_{1}, \ldots, \varepsilon_{n-1}\) and apply part (a).
This chapter covers some proof techniques such as decoupling and symmetrization; they are related to convex functions and independent copies.
Decoupling focuses on chaos, i.e. quadratic forms. It bounds the MGF of a chaos by replacing one \(X\) with an independent copy \(X^{\prime}\); then, by a comparison lemma \(\mathbb{E} \exp \left(\lambda X^{\top} A X^{\prime}\right) \leq \mathbb{E} \exp \left(C K^{2} \lambda g^{\top} A g^{\prime}\right)\), we can derive the Hanson-Wright inequality by treating the diagonal and off-diagonal terms separately. One inequality that plays an important role in the decoupling proof (as well as in symmetrization) is \(\mathbb{E} F(Y) \leq \mathbb{E} F(Y+Z)\) for a convex function \(F\) and \(\mathbb{E}Z=0\). Besides this inequality, we need to make use of the independent copy: the idea is to use the tower property to condition on one copy \(X^{\prime}\), so that the problem reduces to the vector case. The applications are flexible; they depend on whether you can transform your problem into a quadratic form, e.g. \(\mathbb{E}\|B X\|_{2}^{2}=\|B\|_{F}^{2}\) and \(\left(\mathbb{E} \operatorname{dist}(X, E)^{2}\right)^{1 / 2}=\sqrt{n-d}\), where \(E\) is a \(d\)-dimensional subspace of \(\mathbb{R}^n\).
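Both quadratic-form identities at the end of the paragraph are easy to confirm by simulation (my own sketch with arbitrary sizes):

```python
import numpy as np

# Check E||BX||_2^2 = ||B||_F^2 and E dist(X, E)^2 = n - d for X ~ N(0, I_n)
# and a (here randomly chosen) d-dimensional subspace E of R^n.
rng = np.random.default_rng(9)
n, d, m, trials = 200, 50, 100, 20_000
B = rng.normal(size=(m, n))
X = rng.normal(size=(trials, n))
print(np.mean(np.linalg.norm(X @ B.T, axis=1) ** 2), np.linalg.norm(B, 'fro') ** 2)

Q, _ = np.linalg.qr(rng.normal(size=(n, d)))          # orthonormal basis of E
dist2 = np.linalg.norm(X - (X @ Q) @ Q.T, axis=1) ** 2
print(dist2.mean(), n - d)                            # both ~ 150
```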
The symmetrization technique multiplies the random variables by symmetric Bernoulli/Rademacher random variables to make them symmetric. The resulting inequality can be generalized from the norm to any increasing convex function of the norm (see the exercise above). One advantage of symmetrization is that, because Rademacher sums admit good bounds, we can condition on the \(X_i\) first and apply a moment inequality (e.g. Khintchine’s inequality). Applications of symmetrization include bounding norms of random matrices with non-identically distributed entries, and further matrix completion.
The contraction principle was discussed briefly; it can be used to prove the symmetrization lemma with Gaussians. Talagrand’s contraction principle, with Lipschitz functions and symmetric Bernoulli variables, is shown in an exercise. For now I cannot fully appreciate the contraction principle; perhaps more applications are investigated in the next chapter.
Exercise 7.3.2 For any vectors \(u, w \in S^{n-1}\) and \(v, z \in S^{m-1},\) we have \[ \left\|u v^{\top}-w z^{\top}\right\|_{F}^{2} \leq\|u-w\|_{2}^{2}+\|v-z\|_{2}^{2} \]
Exercise 7.3.5 Consider a symmetric \(n \times n\) Gaussian random matrix \(A\) whose entries above the diagonal are independent \(N(0,1)\) random variables and whose diagonal entries are independent \(N(0,2)\) random variables. We are expected to use Sudakov-Fernique’s inequality or Gordon’s inequality to derive a bound on the operator norm of \(A\).
Exercise 7.6.1 Equivalence of Gaussian width and its square version, \(g \sim N\left(0, I_{n}\right)\) \[ w(T):=\mathbb{E} \sup _{x \in T}\langle g, x\rangle,\quad h(T):=\sqrt{\mathbb{E} \sup _{t \in T}\langle g, t\rangle^{2}} \] Prove \[ w(T-T) \leq h(T-T) \leq w(T-T)+C_{1} \operatorname{diam}(T) \leq C w(T-T) \] In particular, we have \[ 2 w(T) \leq h(T-T) \leq 2 C w(T) \]
Answer: Let \(F: g \mapsto \sup _{x, y \in T}\langle g, x-y\rangle\); prove that \(F\) is \(\operatorname{diam}(T)\)-Lipschitz, i.e. \(|F(g)-F(g^{\prime})| \leq \operatorname{diam}(T)\|g-g^{\prime}\|_{2}\) for \(g, g^{\prime} \in \mathbb{R}^{n}\).
Exercise 7.6.6 Let \(A\) be an \(m \times n\) matrix, and let \(B_{2}^{n}\) denote the unit Euclidean ball. Let ellipsoid \(T=A B_{2}^{n}\), show that \[ d(T):=\frac{h(T-T)^{2}}{\operatorname{diam}(T)^{2}}=\frac{\|A\|_{F}^{2}}{\|A\|^{2}_{op}} \]
Hint: draw the domain and reduce the problem from a sup over \(x,y\) to a sup over one variable. This problem is special because the Euclidean ball is perfectly symmetric.
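Once the sup is reduced to one variable, \(\sup_{t\in T-T}\langle g,t\rangle = 2\|A^{\top}g\|_2\) and \(\operatorname{diam}(T)=2\|A\|\), so the identity can also be checked numerically (my own sketch, arbitrary matrix size):

```python
import numpy as np

# Monte Carlo check of d(T) = ||A||_F^2 / ||A||^2 for the ellipsoid T = A B_2^n.
rng = np.random.default_rng(10)
m, n, trials = 30, 60, 50_000
A = rng.normal(size=(m, n))
g = rng.normal(size=(trials, m))
h_sq = np.mean((2 * np.linalg.norm(g @ A, axis=1)) ** 2)   # estimate of h(T-T)^2 = 4 ||A||_F^2
diam_sq = (2 * np.linalg.norm(A, 2)) ** 2                  # diam(T)^2 = 4 ||A||_op^2
print(h_sq / diam_sq, np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2)
```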
Several comparison inequalities are covered in the first part: Slepian’s inequality, the Sudakov-Fernique inequality and Gordon’s inequality (Gaussian min-max). The proofs of these inequalities take advantage of the Gaussian integration-by-parts (Stein) identity and Gaussian interpolation, applied to a smoothed indicator function or a log-partition function. The proof is elegant once we construct the desired twice-differentiable function. Applying a comparison inequality is conceptually straightforward: we need to rewrite the problem into the corresponding form, which is the hard part; the remaining computation is easier. Examples: a sharp bound on the minimum singular value of a Gaussian matrix, and the operator norm of a GOE matrix. Another application is the proof of Sudakov’s minoration, combined with the lower bound for the expected maximum of Gaussians.
Other important geometric concepts are the Gaussian width, spherical width, Gaussian complexity and stable dimension. These concepts can be used to describe the size of a random projection of a bounded set. (Recall that the Johnson-Lindenstrauss lemma is for finite sets; for general sets we need the Gaussian/spherical width.)
Exercise 8.3.21 (Dimension reduction for covering numbers) Let \(\mathcal{F}\) be a class of functions on a probability space \((\Omega, \Sigma, \mu)\), which are all bounded by 1 in absolute value. Let \(\varepsilon \in(0,1)\). Show that there exists a number \(n \leq C \varepsilon^{-4} \log \mathcal{N}\left(\mathcal{F}, L^{2}(\mu), \varepsilon\right)\) and an \(n\)-point subset \(\Omega_{n} \subset \Omega\) such that \[ \mathcal{N}\left(\mathcal{F}, L^{2}(\mu), \varepsilon\right) \leq \mathcal{N}\left(\mathcal{F}, L^{2}\left(\mu_{n}\right), \varepsilon / 4\right) \] where \(\mu_{n}\) denotes the uniform probability measure on \(\Omega_{n}\).
Hint: Apply Lemma 8.3.19 and use following equivalence of packing and covering number: \[ \mathcal{P}(K, d, 2 \varepsilon) \leq \mathcal{N}(K, d, \varepsilon) \leq \mathcal{P}(K, d, \varepsilon) \]
Exercise 8.3.24 (Symmetrization for empirical processes) Prove that \[ \mathbb{E} \sup _{f \in \mathcal{F}}\left|\frac{1}{n} \sum_{i=1}^{n} f\left(X_{i}\right)-\mathbb{E} f(X)\right| \leq 2 \mathbb{E} \sup _{f \in \mathcal{F}}\left|\frac{1}{n} \sum_{i=1}^{n} \varepsilon_{i} f\left(X_{i}\right)\right| \] where \(\varepsilon_{1}, \varepsilon_{2}, \ldots\) are independent symmetric Bernoulli random variables.
Exercise 8.5.7 (Dudley’s integral vs. \(\gamma_{2}\) functional). Show that the \(\gamma_2\) functional is bounded by Dudley’s integral. Namely, show that for any metric space \((T, d)\), one has \[ \gamma_{2}(T, d) \leq C \int_{0}^{\infty} \sqrt{\log \mathcal{N}(T, d, \varepsilon)} d \varepsilon \] Hint: \[ \sup _{t \in T} \sum_{n \geq 0} 2^{n / 2} d\left(t, T_{n}\right) \leq \sum_{n \geq 0} 2^{n / 2} \sup _{t \in T} d\left(t, T_{n}\right) \]
Argue that the RHS is less than Dudley’s integral using the definition of the covering number and the entropy numbers. See Talagrand’s 2014 book, (2.37).
Exercise 8.6.6 (Higher moments of the deviation) Suppose for every \(u \geq 0\) we have \[ \sup _{x \in T}\left|X_{x}\right| \leq C K(w(T)+u \cdot \operatorname{rad}(T)) \] with probability at least \(1-2 \exp \left(-u^{2}\right)\). Then check that \[ \left(\mathbb{E} \sup _{x \in T}\left|X_{x}\right|^{p}\right)^{1 / p} \leq C \sqrt{p} K \gamma(T) \] Hint: \(\operatorname{rad}(\mathrm{T}) \lesssim \gamma(\mathrm{T})\), and integrate the tail probability directly.
Exercise 8.7.2 Let \(A\) be an \(m \times n\) random matrix whose entries \(A_{i j}\) are independent \(N(0,1)\) random variables. Let \(T \subset \mathbb{R}^{n}\) and \(S \subset \mathbb{R}^{m}\) be arbitrary bounded sets. Show that the reverse of Chevet’s inequality holds: \[ \mathbb{E} \sup _{x \in T, y \in S}\langle A x, y\rangle \geq c[w(T) \operatorname{rad}(S)+w(S) \operatorname{rad}(T)] \] Hint: Take one of the sup outside the expectation, normalize one vector to be unit norm.
This chapter focuses on random processes with sub-gaussian increments, which means that the Orlicz norm of the increment \(X_t-X_s\) is bounded by \(K d(t,s)\). The distance between two points depends on the metric space in question, for example \(\left\|X_{t}-X_{s}\right\|_{L^{2}}\); for function classes we may use \(\|\cdot\|_{\infty}\) or \(\|f-g\|_{L^{2}(\mu)}\) in the VC-dimension case. The good news is that whenever we have a metric space \((T,d)\) we have covering numbers, and the theorems in this chapter are rather general since we do not need to know the exact form of the distance. Dudley’s integral inequality gives an upper bound on the expected supremum of a random process in terms of the log covering number (the metric entropy). Dudley’s inequality can be used to quantify the performance of Monte Carlo integration over a Lipschitz function class, i.e. a uniform law of large numbers for the empirical process.
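A toy uniform law of large numbers over a small Lipschitz class (my own example, a one-parameter family of cosines; the class and the grid are arbitrary choices) shows the uniform deviation shrinking at roughly the \(1/\sqrt{n}\) rate such bounds predict:

```python
import numpy as np

# F = { f_theta(x) = cos(theta * x) : theta in [0, 10] },  X ~ Unif[0, 1].
# sup_theta |empirical mean - E f_theta(X)| should decay like ~ 1/sqrt(n).
rng = np.random.default_rng(11)
thetas = np.linspace(0.0, 10.0, 200)
safe = np.where(thetas == 0, 1.0, thetas)
true_means = np.where(thetas == 0, 1.0, np.sin(thetas) / safe)   # E cos(theta X) = sin(theta)/theta

for n in [100, 1_000, 10_000]:
    X = rng.random(n)
    emp_means = np.cos(np.outer(thetas, X)).mean(axis=1)
    print(n, np.max(np.abs(emp_means - true_means)))   # shrinks by roughly sqrt(10) per line
```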
VC dimension is a measure of complexity for a class of Boolean functions; it is a combinatorial complexity because it is defined through the number of sets the class can shatter. The Sauer-Shelah lemma and Pajor’s lemma give an upper bound on the cardinality. There is an interesting lemma (8.3.19), somehow related to Johnson-Lindenstrauss reduction: for a probability space \(\Omega\), if there is a class \(\mathcal{F}\) of \(\epsilon\)-separated functions with cardinality \(N\), then there exists a smaller space \(\Omega_n\) with cardinality \(\approx \log N\) that preserves the geometry of the class (i.e. preserves the covering numbers). The upper bound on the covering number also indicates that covering numbers scale exponentially with the VC dimension. Applying the bounds relating VC dimension and covering numbers, we can bound empirical processes indexed by classes of Boolean functions (when we see Boolean functions, we may think of using something related to VC dimension). We can also use the VC dimension to see whether the empirical risk is a good substitute for the true risk.
Generic chaining is a technique for giving a sharper bound on the expected supremum of a random process; it relies on Talagrand’s \(\gamma_{2}\) functional. The general strategy is the following: 1. chaining set-up; 2. controlling the increments (union bound and a geometric sequence); 3. summing up the increments (telescoping sum). For Gaussian processes, the lower and upper bounds are both of order \(\gamma_{2}(T, d)\); this is Talagrand’s majorizing measure theorem. Based on this, we have Talagrand’s comparison inequality, which generalizes Sudakov-Fernique’s inequality to sub-gaussian processes. One application is a uniform bound for a random bilinear form, \(\sup _{x \in T, y \in S}\langle A x, y\rangle\) (Chevet’s inequality).
One technique is how to transform a bound in expectation into a high-probability tail bound; we can use Theorem 8.5.5, although that theorem itself is hard to establish. I think one can consult the references provided there to learn more.
Ch9 focuses on the matrix deviation inequality and its applications \[ \mathbb{E} \sup _{x \in T}\left|\|A x\|_{2}-\sqrt{m}\|x\|_{2}\right| \leq C K^{2} \gamma(T) \] Instead of a random matrix \(A\), we can use a random projection, with an extra factor \(1/\sqrt{n}\) on the right-hand side. Tail bounds are also available, and they are very useful when deriving high-probability results. Besides Johnson-Lindenstrauss for infinite sets, two important results are the \(M^{*}\) bound and the escape theorem: the \(M^{*}\) bound gives an upper bound for the perturbation, and the escape theorem is used to establish exact recovery in the sparse linear model.
Ch10 first focuses on the sparse recovery problem without noise (optimization with a constraint); under this scenario we just apply the \(M^*\) bound and compute the Gaussian width of the set where the truth lies. For exact recovery we show that \(h=\hat{x}-x^* \neq 0\) leads to a contradiction via the escape theorem. The above discussion is based on \(A\) being a nice random matrix; for a general matrix we need to assume some condition on the design matrix, e.g. the restricted isometry property (RIP). Through a peeling analysis we can show that RIP guarantees exact recovery. In the analysis we often use a cone condition, \(\left\|h_{S^{c}}\right\|_{1} \leq c \left\|h_{S}\right\|_{1}\); this condition has an optimization nature: in the noiseless case we simply derive it from a basic inequality, and in the noisy case we can specify a value of \(\lambda\) to obtain the cone condition. Also note the peeling argument: let \(I_{0}\) be the support of \(x\), decompose \(h_{I_{0}^{c}}\) into many pieces, each of length \(s\) and decreasing in magnitude; we generally combine \(I_{0,1}=I_{0} \cup I_{1}\).
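A minimal noiseless-recovery sketch (mine; it assumes the cvxpy package is available and uses arbitrary sizes in the regime \(m \gtrsim s\log(n/s)\)):

```python
import numpy as np
import cvxpy as cp

# Basis pursuit: minimize ||x||_1 subject to A x = y, with A Gaussian and x* s-sparse.
rng = np.random.default_rng(12)
n, m, s = 200, 80, 8
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_true = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_true[support] = rng.normal(size=s)
y = A @ x_true

x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()
print(np.linalg.norm(x.value - x_true))               # ~ 0: exact recovery up to solver tolerance
```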
This chapter focuses on the Dvoretzky-Milman theorem, which explains the phase transition phenomenon for random projections of sets. Basically, it tells us that the diameter of the projected set no longer changes once the rank of the projection matrix is small. Together with the Johnson-Lindenstrauss lemma, the phase transition can be fully explained. If we use a random \(m \times n\) Gaussian matrix to project \(T\subset \mathbb{R}^n\) into \(\mathbb{R}^m\) with \(m \ll d(T)\), the projected set looks like a ball of radius about \(w(T)\). That is to say, if \(m\) is below the stable dimension of the original set \(T\), the shape of the projected set does not change any more. In contrast, in the Johnson-Lindenstrauss regime the projection shrinks distances by a factor \(\sqrt{m/n}\) and simultaneously preserves the geometry of the original set.
To sum up this chapter: when \(m \gtrsim d(T)\), the additive Johnson-Lindenstrauss lemma that we studied in Section 9.3.2 shows that the random projection approximately preserves the geometry of \(T\). When \(m \lesssim d(T)\), the geometry can no longer be preserved: the projected set becomes a “round ball”.
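A rough illustration of the \(m \lesssim d(T)\) regime (my own sketch): project the cross-polytope \(B_1^n\), whose stable dimension is of order \(\log n\), by an \(m\times n\) Gaussian matrix with small \(m\); the support function of the image is nearly the same in every direction, i.e. the image looks like a round ball of radius about \(w(B_1^n)\approx\sqrt{2\log n}\).

```python
import numpy as np

# G(B_1^n) = conv(+- columns of G); its support function at theta is max_i |<G e_i, theta>|.
rng = np.random.default_rng(13)
n, m = 50_000, 5
G = rng.normal(size=(m, n))
thetas = rng.normal(size=(100, m))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)       # 100 random directions in R^m
support = np.max(np.abs(thetas @ G), axis=1)
print(support.min(), support.max(), np.sqrt(2 * np.log(n)))   # nearly constant across directions
```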
Sparse graph: Exercise 2.4.2—Exercise 2.4.5
Normal and spherical distributions: Exercise 3.3.7
Discrete distributions (e.g. the coordinate distribution) are bad sub-gaussians, with the symmetric Bernoulli as an exception: Exercises 3.4.4—3.4.5
Formulate the NP-hard problem (with a non-symmetric matrix \(A\)) as a semidefinite program: Exercise 3.5.7
Monotonicity of the covering number does not hold in general: Exercise 4.2.10
Upper bound for covariance matrix estimation: Exercise 4.7.3
Blow-up of exponentially small sets: Exercise 5.1.9
Optimality of Johnson-Lindenstrauss: Exercise 5.3.4
Higher-dimensional Hanson-Wright inequality for the off-diagonal sum: Exercise 6.2.7
\[ \|D B\|_{F} \leq\|D\|\|B\|_{F} \] Dual norm of the nuclear norm: \[ \|X\|_{*}=\max \{|\langle X, U\rangle|: U \in O(d)\}=\sup _{\sigma_{1}(Q) \leq 1}\langle Q, X\rangle \] The SVD expansion tells us that \[ \langle X, Y\rangle \leq\|X\|_{*} \cdot\|Y\|, \quad \|X\|_{F}^{2} \leq\|X\|_{*} \cdot\|X\| \]
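These facts are easy to verify numerically (my own sketch; the orthogonal maximizer is \(U=VW^{\top}\) from the SVD \(X=V\,\mathrm{diag}(s)\,W^{\top}\)):

```python
import numpy as np

# Check ||DB||_F <= ||D|| ||B||_F,  the O(d) characterization of ||X||_*,  and <X,Y> <= ||X||_* ||Y||.
rng = np.random.default_rng(14)
d = 6
D, B, X, Y = (rng.normal(size=(d, d)) for _ in range(4))
print(np.linalg.norm(D @ B, 'fro'), np.linalg.norm(D, 2) * np.linalg.norm(B, 'fro'))

V, s, Wt = np.linalg.svd(X)
U = V @ Wt                                            # orthogonal matrix attaining the maximum
print(np.sum(X * U), s.sum())                         # <X, U> = ||X||_* (sum of singular values)
print(np.sum(X * Y), s.sum() * np.linalg.norm(Y, 2))  # <X, Y> <= ||X||_* ||Y||
```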