Happy times slip away as quickly as a bell rolling down a slope, leaving only a lingering echo. The uneventful seasons change while no one is paying attention.

Harvard CS 226/MIT 6.889

# Overview

Big data is data so large that it does not fit in the main memory of a single machine. The need to process big data by space-efficient algorithms arises in Internet search, machine learning, network traffic monitoring, scientific computing, signal processing, and other areas.

This course will cover mathematically rigorous models for developing such algorithms, as well as some provable limitations of algorithms operating in those models. Some topics covered include:

  • Dimensionality reduction. General techniques and impossibility results for reducing data dimension while still preserving geometric structure.
  • Numerical linear algebra. Algorithms for big matrices (e.g. a user/product rating matrix for Netflix or Amazon). Regression, low rank approximation, matrix completion, etc.
  • Compressed sensing. Recovery of (approximately) sparse signals based on few linear measurements.
  • Sparse Fourier Transform. Fast algorithms for (approximately) computing the Fourier Transform of signals which are (approximately) sparse in the frequency domain.

# Lecture 1.

# Topics Covered in the Course

  1. Sketching.
  2. Streaming.
  3. Dimensionality Reduction.
  4. Large-Scale Matrix Computations.
  5. Compressed Sensing.
  6. Sparse Fourier Transform.

# An example of sketching: Approximate Counting

Problem: Design an algorithm that monitors a sequence of events and upon request can output (an estimate of) the number of events thus far. More formally, create a data structure that maintains a single integer $n$ and supports the following operations.

  • init(): set $n\leftarrow 0$.
  • update(): increment $n\leftarrow n+1$.
  • query(): output $n$ or an estimate of $n$.

Before any operations are performed, we assume that $n=0$.

A trivial algorithm stores $n$ as a sequence of $\lceil \log n\rceil=O(\log n)$ bits (a counter). If we want query() to return the exact value of $n$, this is the best we can do. Suppose such an algorithm uses $f(n)$ bits to store the integer $n$. There are $2^{f(n)}$ possible configurations of these bits, so for the algorithm to be able to store the exact value of every integer up to $n$, the number of configurations must be at least $n$. Hence,

$$2^{f(n)}\geq n\Rightarrow f(n)\geq\log n$$

The goal is to use much less space than $O(\log n)$, and so we must instead answer query() with some estimate $\tilde{n}$ of $n$. We would like this $\tilde{n}$ to satisfy:

$$\mathbb{P}[|\tilde{n}-n|>\varepsilon n]<\delta$$

for some $0 < \varepsilon,\delta < 1$ that are given to the algorithm up front.

Morris’s algorithm (a code sketch follows the list):

  • init(): sets $X\leftarrow 0$.
  • update(): increments $X$ with probability $2^{-X}$.
  • query(): outputs $\tilde{n}=2^X-1$.
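
For concreteness, here is a minimal Python sketch of Morris’s algorithm as stated above. The class name `MorrisCounter` and the use of `random.random()` for the biased coin flip are illustrative choices, not part of the notes.

```python
import random

class MorrisCounter:
    """Morris's approximate counter: stores only the exponent X."""

    def __init__(self):
        self.X = 0                                # init(): X <- 0

    def update(self):
        if random.random() < 2.0 ** (-self.X):    # with probability 2^{-X}
            self.X += 1                           # increment X

    def query(self):
        return 2 ** self.X - 1                    # estimate n~ = 2^X - 1

# example: feed 10000 updates and query the (noisy, unbiased) estimate
c = MorrisCounter()
for _ in range(10_000):
    c.update()
print(c.query())
```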

Analysis: Let $X_n$ denote $X$ in Morris’s algorithm after $n$ updates.

Claim: $\mathbb{E}[2^{X_n}]=n+1$.

Proof: By induction on $n$; the base case holds since $X_0=0$, so $\mathbb{E}[2^{X_0}]=1$. For the inductive step,

$$
\begin{aligned}
\mathbb{E}[2^{X_{n+1}}]&=\sum_{j=0}^{\infty}\mathbb{P}[X_n=j]\cdot\mathbb{E}[2^{X_{n+1}}\mid X_n=j]\\
&=\sum_{j=0}^{\infty}\mathbb{P}[X_n=j]\cdot\left(\frac{1}{2^j}\cdot 2^{j+1}+\left(1-\frac{1}{2^j}\right)\cdot 2^j\right)\\
&=\sum_{j=0}^\infty\mathbb{P}[X_n=j]\cdot 2^j+\sum_{j=0}^{\infty}\mathbb{P}[X_n=j]\\
&=\mathbb{E}[2^{X_n}]+1\\
&=n+2
\end{aligned}
$$

$\square$

Claim: $\mathbb{E}[2^{2X_n}]=\frac{3}{2}n^2+\frac{3}{2}n+1$.

Proof: Again by induction on $n$ (the base case $\mathbb{E}[2^{2X_0}]=1$ is immediate),

$$
\begin{aligned}
\mathbb{E}[2^{2X_{n+1}}]&=\sum_{j=0}^\infty \mathbb{P}[X_n=j]\cdot\mathbb{E}[2^{2X_{n+1}}\mid X_n=j]\\
&=\sum_{j=0}^\infty \mathbb{P}[X_n=j]\cdot\left(\frac{1}{2^j}\cdot 2^{2(j+1)}+\left(1-\frac{1}{2^j}\right)\cdot 2^{2j}\right)\\
&=\sum_{j=0}^\infty \mathbb{P}[X_n=j]\cdot\left(2^{2j}+3\cdot 2^j\right)\\
&=\mathbb{E}[2^{2X_n}]+3\,\mathbb{E}[2^{X_n}]\\
&=\left(\frac{3}{2}n^2+\frac{3}{2}n+1\right)+3n+3\\
&=\frac{3}{2}(n+1)^2+\frac{3}{2}(n+1)+1
\end{aligned}
$$

$\square$

Thus, by Chebyshev’s inequality,

$$
\begin{aligned}
\mathbb{P}[|\tilde{n}-n|>\varepsilon n]&\leq\frac{\mathbb{V}[\tilde{n}]}{\varepsilon^2n^2}\\
&=\frac{\mathbb{E}[\tilde{n}^2]-\mathbb{E}[\tilde{n}]^2}{\varepsilon^2n^2}\\
&=\frac{\mathbb{E}[(2^{X_n}-1)^2]-\mathbb{E}[2^{X_n}-1]^2}{\varepsilon^2n^2}\\
&=\frac{\mathbb{E}[2^{2X_n}]-2(n+1)+1-n^2}{\varepsilon^2n^2}\\
&=\frac{\frac{1}{2}n^2-\frac{1}{2}n}{\varepsilon^2n^2}\\
&<\frac{1}{2\varepsilon^2}
\end{aligned}
$$

which is not particularly meaningful, since the right-hand side is better than failure probability $1/2$ only when $\varepsilon\geq 1$ (in which case even an estimator that always outputs $0$ would satisfy the guarantee).

Morris+: To decrease the failure probability of Morris’s basic algorithm, we instantiate $s$ independent copies of Morris’s algorithm and average their outputs. That is, we obtain independent estimators $\tilde{n}_1,\ldots,\tilde{n}_s$ from independent instantiations of Morris’s algorithm, and our output to a query is:

$$\tilde{n}=\frac{1}{s}\sum_{i=1}^s\tilde{n}_i$$

By linearity of expectation, $\mathbb{E}[\tilde{n}]=n$, and by independence $\mathbb{V}[\tilde{n}]=\frac{1}{s}\mathbb{V}[\tilde{n}_1]<\frac{n^2}{2s}$. Thus, by Chebyshev’s inequality,

$$\mathbb{P}[|\tilde{n}-n|>\varepsilon n]<\frac{1}{2s\varepsilon^2}<\delta$$

for $s>\frac{1}{2\varepsilon^2\delta}=\varTheta\left(\frac{1}{\varepsilon^2\delta}\right)$.

It turns out there is a simple technique (which we will see often) to reduce the dependence on the failure probability $\delta$ from $1/\delta$ to $\log(1/\delta)$. The technique is as follows.

Morris++: We run $t$ instantiations of Morris+, each with failure probability $\frac{1}{3}$. We then output the median estimate from all the $t$ Morris+ instantiations.

Analysis:

If more than half of the instantiations give good estimates, then the median is also a good estimate (it is sandwiched between two good estimates).

Note that the expected number of Morris+ instantiations that succeed is $\frac{2t}{3}$ (expectation of Bernoulli trials). Let

$$
Y_i=\begin{cases} 1 & \text{if the } i\text{-th instantiation is a good estimate}\\ 0 & \text{otherwise} \end{cases}
$$

then $\mathbb{E}\left[\sum_{i=1}^tY_i\right]=\frac{2t}{3}$. By a Chernoff bound:

$$
\begin{aligned}
\mathbb{P}\left[\sum_{i=1}^tY_i\leq \frac{t}{2}\right]&\leq\mathbb{P}\left[\left|\sum_{i=1}^t Y_i-\mathbb{E}\left[\sum_{i=1}^tY_i\right]\right|\geq\frac{t}{6}\right]\\
&\leq 2e^{-ct}<\delta
\end{aligned}
$$

for $t=\varTheta\left(\ln\frac{1}{\delta}\right)$, where $c>0$ is the absolute constant from the Chernoff bound.
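
Putting the pieces together, below is a self-contained Python sketch of Morris++. The function names are illustrative, and the exact constants in the choices of $s$ and $t$ are just one reasonable instantiation of $s=\varTheta(1/\varepsilon^2)$ (so each Morris+ copy fails with probability at most $1/3$) and $t=\varTheta(\ln\frac{1}{\delta})$.

```python
import math
import random
from statistics import median

def morris_update(X):
    """One Morris update: increment X with probability 2^{-X}."""
    return X + 1 if random.random() < 2.0 ** (-X) else X

def morris_pp(num_updates, eps, delta):
    """Simulate Morris++ on a stream of `num_updates` update() calls."""
    s = math.ceil(3 / (2 * eps ** 2))         # Morris+ with failure prob <= 1/3
    t = math.ceil(3 * math.log(2 / delta))    # number of Morris+ instantiations (illustrative constant)
    counters = [[0] * s for _ in range(t)]    # t groups of s Morris counters
    for _ in range(num_updates):
        for group in counters:
            for i in range(s):
                group[i] = morris_update(group[i])
    # each Morris+ copy averages its s estimates 2^X - 1; output the median
    return median(sum(2 ** x - 1 for x in group) / s for group in counters)
```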

Space Complexity: Note that the space is also a random variable. With probability $1-\delta$, the space is at most

$$O\left(\varepsilon^{-2}\log\left(\frac{1}{\delta}\right)\cdot \log\log\left(\frac{n}{\varepsilon\delta}\right)\right)$$

bits. So for fixed constants $\varepsilon,\delta$, the space complexity is $O(\log\log n)$, which is better than the initial $O(\log n)$.

# Lecture 2. Counting Distinct Elements in a Stream

# Idealized solution

The following solutions make two idealized assumptions: that a real number can be stored in bounded memory, and that we have access to a truly random hash function. We will introduce a non-idealized solution later.

# FM

The FM algorithm (a code sketch follows the list):

  1. Pick a random hash function $h:[n]\rightarrow[0,1]$.
  2. Maintain in memory the smallest hash we’ve seen so far: $X=\min_{i\in\text{stream}}h(i)$.
  3. query(): output $1/X-1$.
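
Here is a minimal Python sketch of the idealized FM algorithm described above. Since a truly random $h:[n]\to[0,1]$ cannot actually be stored, the sketch memoizes hash values in a dictionary purely for simulation; a real implementation would store only $X$ (and, in the non-idealized setting, a compact description of $h$).

```python
import random

class FMSketch:
    """Idealized FM: track the smallest hash value seen in the stream."""

    def __init__(self):
        self._h = {}     # simulated truly random hash (for simulation only)
        self.X = 1.0     # min_{i in stream} h(i); 1.0 before any updates

    def update(self, i):
        self.X = min(self.X, self._h.setdefault(i, random.random()))

    def query(self):
        return 1.0 / self.X - 1.0   # estimate of the number of distinct elements

# example: a stream with 3 distinct elements
fm = FMSketch()
for i in [7, 7, 2, 9, 2]:
    fm.update(i)
print(fm.query())
```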

For some intuition, say that $t$ is the number of distinct elements. We are partitioning the interval $[0,1]$ into bins of size $\frac{1}{t+1}$. With this in mind, we claim the following:

Claim: $\mathbb{E}[X]=\frac{1}{t+1}$.

Proof:

$$
\begin{aligned}
\mathbb{E}[X]&=\int_0^\infty x \cdot\mathbb{P}[X=x]\,dx\\
&=\int_0^\infty\left(\int_0^x \mathbb{P}[X=x]\,dy\right)dx\\
&=\int_0^\infty\left(\int_y^\infty\mathbb{P}[X=x]\,dx\right)dy\\
&=\int_0^\infty\mathbb{P}[X>y]\,dy\\
&=\int_0^\infty\prod_{i\in\text{stream}}\mathbb{P}[h(i)>y]\,dy\\
&=\int_0^1(1-y)^t\,dy\\
&=\frac{1}{t+1}
\end{aligned}
$$

$\square$

Claim: $\mathbb{E}[X^2]=\frac{2}{(t+1)(t+2)}$.

Proof:

$$
\begin{aligned}
\mathbb{E}[X^2]&=\int_0^\infty\mathbb{P}[X^2>y]\,dy\\
&=\int_0^\infty\mathbb{P}[X>\sqrt{y}]\,dy\\
&=\int_0^1 (1-\sqrt{y})^t\,dy\\
&=2\int_0^1u^t(1-u)\,du\quad[u=1-\sqrt{y}]\\
&=\frac{2}{(t+1)(t+2)}
\end{aligned}
$$

$\square$

So $\mathbb{V}[X]=\frac{2}{(t+1)(t+2)}-\frac{1}{(t+1)^2}=\frac{t}{(t+1)^2(t+2)}<\mathbb{E}[X]^2$.

# FM+

We can upgrade our basic algorithm into FM+ by running it $q=\frac{1}{\varepsilon^2\delta}$ times in parallel to obtain $X_1,\ldots,X_q$, and query() outputs $\frac{1}{\frac{1}{q}\sum_{i=1}^qX_i}-1$.

Claim:

$$\mathbb{P}\left(\left|\frac{1}{q}\sum_{i=1}^qX_i-\frac{1}{t+1}\right|>\frac{\varepsilon}{t+1}\right)<\delta$$

Proof: By Chebyshev’s inequality,

$$
\begin{aligned}
\mathbb{P}\left(\left|\frac{1}{q}\sum_{i=1}^qX_i-\frac{1}{t+1}\right|>\frac{\varepsilon}{t+1}\right)&\leq\frac{\mathbb{V}\left[\frac{1}{q}\sum_{i}X_i\right]}{\varepsilon^2/(t+1)^2}\\
&=\frac{\mathbb{V}[X]/q}{\varepsilon^2/(t+1)^2}\\
&<\frac{\mathbb{E}[X]^2/q}{\varepsilon^2/(t+1)^2}\\
&=\frac{1}{\varepsilon^2q}=\delta
\end{aligned}
$$

$\square$

# FM++

To achieve a logarithmic dependence on the failure probability, we run $t=\varTheta\left(\log\frac{1}{\delta}\right)$ independent copies of FM+, each with failure probability $\delta=\frac{1}{3}$. Then query() outputs the median across all FM+ estimates.

The new space for FM++ is $O\left(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right)$. The analysis of the median trick is similar to Lecture 1.
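
For completeness, here is a sketch of FM+ and FM++ combined, in the same style as before. The `FMSketch` class from the previous sketch is repeated so the snippet stands alone, and the constants used for $q$ and $t$ are illustrative choices consistent with the bounds above.

```python
import math
import random
from statistics import median

class FMSketch:
    """Idealized FM sketch (see the previous snippet)."""
    def __init__(self):
        self._h, self.X = {}, 1.0
    def update(self, i):
        self.X = min(self.X, self._h.setdefault(i, random.random()))

class FMPlusPlus:
    """FM++: the median over t FM+ copies, each averaging q FM sketches."""

    def __init__(self, eps, delta):
        q = math.ceil(3 / eps ** 2)               # FM+ with failure prob <= 1/3
        t = math.ceil(3 * math.log(2 / delta))    # median trick copies (illustrative constant)
        self.groups = [[FMSketch() for _ in range(q)] for _ in range(t)]

    def update(self, i):
        for group in self.groups:
            for sk in group:
                sk.update(i)

    def query(self):
        # each FM+ copy outputs 1 / (average of its minima) - 1
        return median(1.0 / (sum(sk.X for sk in group) / len(group)) - 1.0
                      for group in self.groups)
```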

# Non-idealized solution

Definition: A family $\mathcal{H}$ of functions mapping $[a]$ into $[b]$ is $k$-wise independent if and only if for all distinct $i_1,\ldots,i_k\in [a]$ and for all $j_1,\ldots,j_k\in [b]$,

$$\mathbb{P}_{h\in\mathcal{H}}[h(i_1)=j_1\wedge\ldots\wedge h(i_k)=j_k]=\frac{1}{b^k}$$

Note that we can store $h\in\mathcal{H}$ in memory with $\log_2|\mathcal{H}|$ bits. One example of such a family $\mathcal{H}$ is the set of all functions mapping $[a]$ to $[b]$. Then $|\mathcal{H}|=b^a$, so $\log|\mathcal{H}|=a\log b$. A less trivial example is due to Carter and Wegman, where $\mathcal{H}$ is the set of all degree-$(k-1)$ polynomials over $\mathbb{F}_q$ such that $a=b=q$. Then $|\mathcal{H}|=q^k$, and so $\log|\mathcal{H}|=k\log q$.

Having seen these examples, we will just assume that we have access to some 2-wise independent hash family, which we can store using $O(\log n)$ bits.
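
As a concrete illustration of the Carter–Wegman construction for $k=2$, the sketch below draws a uniformly random degree-1 polynomial over $\mathbb{F}_p$ with $p$ prime; storing $h$ then amounts to storing the two coefficients, i.e. $2\log_2 p$ bits. The particular prime used is an arbitrary illustrative choice.

```python
import random

def draw_pairwise_hash(p=2_147_483_647):
    """Draw h(x) = (a*x + b) mod p with a, b uniform in F_p (p prime).

    For distinct x1 != x2, the pair (h(x1), h(x2)) is uniform over F_p^2,
    i.e. the family is 2-wise independent."""
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda x: (a * x + b) % p

h = draw_pairwise_hash()
print(h(17), h(42))
```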

# Common strategy: Geometric Sampling of Streams

Suppose we have a subroutine that gives us $\tilde{t}$, a 32-approximation to $t$. To get a $(1+\varepsilon)$-approximation, we will use the common strategy of geometric sampling of streams. This is important to understand because it is used fairly often in scenarios like this one.

First, let’s consider the trivial solution: remember the first $K$ distinct elements in the stream, with $K=\frac{c}{\varepsilon^2}$; if the number of distinct elements exceeds $K$, the trivial solution simply ignores the excess elements. We can then compose trivial solutions as follows (a code sketch follows the list):

  1. Assume $n$ is a power of 2, and pick $g:[n]\rightarrow[n]$ from a 2-wise independent family.
  2. init(): create $\log n+1$ trivial solutions $\text{TS}_0,\ldots,\text{TS}_{\log n}$.
  3. update(i): feed $i$ to $\text{TS}_{\text{LSB}(g(i))}$.
  4. query(): choose $j$ such that $\frac{\tilde{t}}{2^{j+1}}\approx\frac{1}{\varepsilon^2}$.
  5. Output $\text{TS}_j.\text{query}()\cdot 2^{j+1}$.
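
A minimal Python sketch of this composition, assuming $n$ is a power of two. The trivial solution, the LSB routing, and the choice of $j$ at query time follow the list above; the 2-wise independent $g$ is simulated by a degree-1 polynomial mod a prime reduced mod $n$ (a simplification), and the constant $c=4$ in $K=c/\varepsilon^2$ is an illustrative choice.

```python
import math
import random

class TrivialSolution:
    """Remember up to K distinct elements; further new elements are dropped."""
    def __init__(self, K):
        self.K, self.seen = K, set()
    def update(self, i):
        if len(self.seen) < self.K:
            self.seen.add(i)
    def query(self):
        return len(self.seen)

def lsb(x, logn):
    """Index of the least significant set bit of x (logn by convention if x == 0)."""
    return (x & -x).bit_length() - 1 if x else logn

class GeometricSampler:
    def __init__(self, n, eps, p=2_147_483_647):
        self.logn = int(math.log2(n))                      # n is a power of 2
        self.eps = eps
        K = math.ceil(4 / eps ** 2)                        # K = c / eps^2 with c = 4 (illustrative)
        self.a, self.b, self.p = random.randrange(p), random.randrange(p), p
        self.ts = [TrivialSolution(K) for _ in range(self.logn + 1)]

    def g(self, i):
        # stand-in for a 2-wise independent g: [n] -> [n]
        return ((self.a * i + self.b) % self.p) % (1 << self.logn)

    def update(self, i):
        self.ts[lsb(self.g(i), self.logn)].update(i)       # feed i to TS_{LSB(g(i))}

    def query(self, t_tilde):
        # choose j so that t_tilde / 2^{j+1} ~ 1/eps^2, then rescale TS_j's count
        j = round(math.log2(max(t_tilde * self.eps ** 2, 2))) - 1
        j = max(0, min(self.logn, j))
        return self.ts[j].query() * 2 ** (j + 1)
```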

Define a set of virtual streams corresponding to the trivial solutions: virtual stream $\text{VS}_i$ consists of the elements fed to $\text{TS}_i$, so $\text{TS}_i$ tracks the number of distinct elements in $\text{VS}_i$.

Fix some $j$ and let $Z_i=\begin{cases}1&\text{if LSB}(g(i))=j\\0&\text{otherwise}\end{cases}$. Then $Z=\sum_{i\in\text{stream}}Z_i$ is the number of distinct elements in $\text{VS}_j$.

Claim: $\mathbb{E}[Z]=\frac{t}{2^{j+1}}$.

To be continued.