Harvard CS 226/MIT 6.889

# Overview

Big data is data so large that it does not fit in the main memory of a single machine. The need to process big data by space-efficient algorithms arises in Internet search, machine learning, network traffic monitoring, scientific computing, signal processing, and other areas.

This course will cover mathematically rigorous models for developing such algorithms, as well as some provable limitations of algorithms operating in those models. Some topics covered include:

• Dimensionality reduction. General techniques and impossibility results for reducing data dimension while still preserving geometric structure.
• Numerical linear algebra. Algorithms for big matrices (e.g. a user/product rating matrix for Netflix or Amazon). Regression, low rank approximation, matrix completion, etc.
• Compressed sensing. Recovery of (approximately) sparse signals based on few linear measurements.
• Sparse Fourier Transform. Fast algorithms for (approximately) computing the Fourier Transform of signals which are (approximately) sparse in the frequency domain.

# Lecture 1.

## Topics Covered in the Course

1. Sketching.
2. Streaming.
3. Dimensionality Reduction.
4. Large-Scale Matrix Computations.
5. Compressed Sensing.
6. Sparse Fourier Transform.

## An example of sketching: Approximate Counting

Problem: Design an algorithm that monitors a sequence of events and upon request can output (an estimate of) the number of events thus far. More formally, create a data structure that maintains a single integer $n$ and supports the following operations.

• init() : sets $n\leftarrow 0$.
• update() : increments $n\leftarrow n+1$.
• query() : outputs $n$ or an estimate of $n$.

Before any operations are performed, we assume that $n=0$.

A trivial algorithm stores $n$ as a sequence of $\lceil \log n\rceil=O(\log n)$ bits (a counter). If we want query() to return the exact value of $n$, this is the best we can do. Suppose some such algorithm uses $f(n)$ bits to store the integer $n$. There are $2^{f(n)}$ configurations of these bits, and in order to distinguish all integers up to $n$, the number of configurations must be at least $n$. Hence,

$2^{f(n)}\geq n\Rightarrow f(n)\geq\log n$

The goal is to use much less space than $O(\log n)$, so we must instead answer query() with some estimate $\tilde{n}$ of $n$. We would like this $\tilde{n}$ to satisfy:

$\mathbb{P}[|\tilde{n}-n|>\varepsilon n]<\delta$

for some $0 < \varepsilon,\delta < 1$ that are given to the algorithm up front.

Morris’s algorithm:

• init() : sets $X\leftarrow 0$.
• update() : increments $X$ with probability $2^{-X}$.
• query() : outputs $\tilde{n}=2^X-1$.
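To make the three operations concrete, here is a minimal Python sketch of Morris’s counter (the class name and the use of `random.random()` to simulate the coin flips are our choices):

```python
import random

class Morris:
    """Morris's approximate counter: stores a register X of size about
    log2(n) instead of the count n itself."""

    def __init__(self):
        self.X = 0                          # init(): X <- 0

    def update(self):
        # increment X with probability 2^{-X}
        if random.random() < 2.0 ** (-self.X):
            self.X += 1

    def query(self):
        return 2 ** self.X - 1              # estimate of n
```

Since $X$ concentrates around $\log_2 n$, the register fits in $O(\log\log n)$ bits; a single copy is noisy, which is what Morris+ addresses.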

Analysis: Let $X_n$ denote $X$ in Morris’s algorithm after $n$ updates.

Claim: $\mathbb{E}[2^{X_n}]=n+1$.

Proof: By induction on $n$. The base case is $\mathbb{E}[2^{X_0}]=2^0=1=0+1$. For the inductive step, assume $\mathbb{E}[2^{X_n}]=n+1$; then

\begin{aligned} \mathbb{E}[2^{X_{n+1}}]&=\sum_{j=0}^{\infty}\mathbb{P}[X_n=j]\cdot\mathbb{E}[2^{X_{n+1}}\mid X_n=j]\\ &=\sum_{j=0}^{\infty}\mathbb{P}[X_n=j]\cdot\left (\frac{1}{2^j}\cdot 2^{j+1}+\left (1-\frac{1}{2^j}\right )\cdot 2^j\right)\\ &=\sum_{j=0}^\infty\mathbb{P}[X_n=j]\cdot 2^j+\sum_{j=0}^{\infty}\mathbb{P}[X_n=j]\\ &=\mathbb{E}[2^{X_n}]+1\\ &=n+2 \end{aligned}

$\square$.

Claim: $\mathbb{E}[2^{2X_n}]=\frac{3}{2}n^2+\frac{3}{2}n+1$.

Proof: Again by induction on $n$; the base case is $\mathbb{E}[2^{2X_0}]=1$. For the inductive step,

\begin{aligned} \mathbb{E}[2^{2X_{n+1}}]&=\sum_{j=0}^\infty \mathbb{P}[X_n=j]\cdot\mathbb{E}[2^{2X_{n+1}}\mid X_n=j]\\ &=\sum_{j=0}^\infty \mathbb{P}[X_n=j]\cdot\left (\frac{1}{2^j}\cdot 2^{2(j+1)}+\left (1-\frac{1}{2^j}\right )\cdot 2^{2j}\right )\\ &=\sum_{j=0}^\infty \mathbb{P}[X_n=j]\cdot\left (2^{2j}+3\cdot 2^j\right )\\ &=\mathbb{E}[2^{2X_n}]+3\,\mathbb{E}[2^{X_n}]\\ &=\left (\frac{3}{2}n^2+\frac{3}{2}n+1\right )+3(n+1)\\ &=\frac{3}{2}(n+1)^2+\frac{3}{2}(n+1)+1 \end{aligned}

$\square$.

Thus, by Chebyshev’s inequality,

\begin{aligned} \mathbb{P}[|\tilde{n}-n|>\varepsilon n]&<\frac{\mathbb{V}[\tilde{n}]}{\varepsilon^2n^2}\\ &=\frac{\mathbb{E}[\tilde{n}^2]-\mathbb{E}[\tilde{n}]^2}{\varepsilon^2n^2}\\ &=\frac{\mathbb{E}[(2^{X_n}-1)^2]-\mathbb{E}[2^{X_n}-1]^2}{\varepsilon^2n^2}\\ &=\frac{\mathbb{E}[2^{2X_n}]-2(n+1)+1-n^2}{\varepsilon^2n^2}\\ &=\frac{\frac{1}{2}n^2-\frac{1}{2}n}{\varepsilon^2n^2}\\ &<\frac{1}{2\varepsilon^2} \end{aligned}

which is not particularly meaningful: the right-hand side beats the trivial failure probability $1/2$ only when $\varepsilon\geq 1$, in which case even an estimator that always outputs $0$ would satisfy the guarantee.

Morris+: To decrease the failure probability of Morris’s basic algorithm, we instantiate $s$ independent copies of Morris’s algorithm and average their outputs. That is, we obtain independent estimators $\tilde{n}_1,...,\tilde{n}_s$ from independent instantiations of Morris’s algorithm, and our output to a query is:

$\tilde{n}=\frac{1}{s}\sum_{i=1}^s\tilde{n}_i$

By linearity of expectation, $\mathbb{E}[\tilde{n}]=n$, and since the copies are independent, $\mathbb{V}[\tilde{n}]=\frac{1}{s}\mathbb{V}[\tilde{n}_1]<\frac{n^2}{2s}$. Thus, by Chebyshev’s inequality,

$\mathbb{P}[|\tilde{n}-n|>\varepsilon n]<\frac{1}{2s\varepsilon^2}<\delta$

for $s>\frac{1}{2\varepsilon^2\delta}=\varTheta(\frac{1}{\varepsilon^2\delta})$.

It turns out there is a simple technique (which we will see often) to reduce the dependence on the failure probability $\delta$ from $1/\delta$ to $\log(1/\delta)$. The technique is as follows.

Morris++: We run $t$ instantiations of Morris+, each with failure probability $\frac{1}{3}$. We then output the median estimate from all the $t$ Morris+ instantiations.

Analysis:

If more than half of the instantiations produce good estimates, then the median value is also a good estimate.

Note that the expected number of Morris+ instantiations that succeed is $\frac{2t}{3}$ (a sum of Bernoulli trials, each succeeding with probability $\frac{2}{3}$). Let

$Y_i=\begin{cases} 1&\text{if i-th instantiation is a good estimation}\\ 0 & \text{otherwise} \end{cases}$

then $\mathbb{E}[\sum_{i=1}^tY_i]=\frac{2t}{3}$. By Chernoff bound:

\begin{aligned} \mathbb{P}\left [\sum_{i=1}^tY_i\leq \frac{t}{2}\right ]&\leq\mathbb{P}\left [\left | \sum_{i=1}^t Y_i-\mathbb{E}\left [\sum_{i=1}^tY_i\right ]\right |\geq\frac{t}{6}\right ]\\ &\leq 2e^{-t/18}<\delta \end{aligned}

for $t=\varTheta(\ln(\frac{1}{\delta}))$.
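The averaging and median steps can be sketched together in Python; the function name and the particular constants in $s$ and $t$ are our conservative choices, consistent with the bounds above:

```python
import math
import random
from statistics import median

def morris_update(X):
    """One Morris update: increment the register with probability 2^{-X}."""
    return X + 1 if random.random() < 2.0 ** (-X) else X

def morris_pp(n_events, eps, delta):
    """Morris++: median of t Morris+ averages, each over s basic counters."""
    s = math.ceil(3.0 / (2.0 * eps ** 2))        # Morris+ copies: failure prob <= 1/3
    t = math.ceil(18.0 * math.log(1.0 / delta))  # number of Morris+ instantiations
    estimates = []
    for _ in range(t):
        registers = [0] * s
        for _ in range(n_events):
            registers = [morris_update(x) for x in registers]
        # Morris+ estimate: average of the s basic estimators 2^X - 1
        estimates.append(sum(2 ** x - 1 for x in registers) / s)
    return median(estimates)                     # Morris++: median trick
```

All $s\cdot t$ registers are updated on every event, so the space is $s\cdot t$ registers of $O(\log\log n)$ bits each.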

Space Complexity: Note that the space is also a random variable. With probability $1-\delta$, the space is at most

$O\left(\varepsilon^{-2}\log\left (\frac{1}{\delta}\right )\cdot \log\log\left (\frac{n}{\varepsilon\delta}\right )\right )$

bits. So for fixed constants $\varepsilon,\delta$, the space complexity is $O(\log\log n)$, which is better than the initial $O(\log n)$.

# Lecture 2. Counting Distinct Elements in a Stream

## Idealized solution

The following solutions are idealized: we assume that a real number can be stored in a single memory cell and that we have access to truly random hash functions. We will remove these assumptions in the non-idealized solution later.

### FM

The FM (Flajolet–Martin) algorithm:

1. Pick a random hash function $h:[n]\rightarrow[0,1]$.
2. Maintain in memory the smallest hash we’ve seen so far: $X=\min_{i\in\text{stream}}h(i)$.
3. query() : output $1/X-1$.
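A minimal Python sketch of the idealized FM algorithm; `random.random()` stands in for the truly random hash $h$, and the memoization dict exists only to simulate a fixed hash function (it is not counted toward the sketch’s memory):

```python
import random

class FM:
    """Idealized FM sketch: stores only the smallest hash value seen."""

    def __init__(self):
        self.h = {}        # simulates a fixed random hash h : [n] -> [0,1]
        self.X = 1.0       # smallest hash value seen so far

    def update(self, i):
        if i not in self.h:
            self.h[i] = random.random()
        self.X = min(self.X, self.h[i])

    def query(self):
        return 1.0 / self.X - 1.0   # estimate of the number of distinct elements
```

A single copy has variance comparable to the square of its mean, which is why FM+ averages many independent copies before inverting.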

For some intuition, say that $t$ is the number of distinct elements. We are partitioning the interval $[0, 1]$ into bins of size $1/(t + 1)$. With this in mind, we claim the following:

Claim: $\mathbb{E}[X]=\frac{1}{t+1}$.

Proof: Writing $f_X$ for the density of $X$,

\begin{aligned} \mathbb{E}[X]&=\int_0^\infty x \cdot f_X(x)\,dx\\ &=\int_0^\infty\left (\int_0^x f_X(x)\,dy\right )dx\\ &=\int_0^\infty\left (\int_y^\infty f_X(x)\,dx\right )dy\\ &=\int_0^\infty\mathbb{P}[X>y]\,dy\\ &=\int_0^\infty\prod_{i\in\text{stream}}\mathbb{P}[h(i)>y]\,dy\\ &=\int_0^1(1-y)^t\,dy\\ &=\frac{1}{t+1} \end{aligned}

$\square$.

Claim: $\mathbb{E}[X^2]=\frac{2}{(t+1)(t+2)}$.

Proof:

\begin{aligned} \mathbb{E}[X^2]&=\int_0^\infty\mathbb{P}[X^2>y]dy\\ &=\int_0^\infty\mathbb{P}[X>\sqrt{y}]dy\\ &=\int_0^1 (1-\sqrt{y})^tdy\\ &=2\int_0^1u^t(1-u)du\quad[u=1-\sqrt{y}]\\ &=\frac{2}{(t+1)(t+2)} \end{aligned}

$\square$.

So $\mathbb{V}[X]=\frac{2}{(t+1)(t+2)}-\frac{1}{(t+1)^2}=\frac{t}{(t+1)^2(t+2)}<\mathbb{E}[X]^2$.

### FM+

We can upgrade our basic algorithm into FM+ by running $q=\frac{1}{\varepsilon^2\delta}$ independent copies in parallel to obtain $X_1,...,X_q$; query() then outputs $\frac{1}{\frac{1}{q}\sum_{i=1}^qX_i}-1$.

Claim:

$\mathbb{P}\left (\left | \frac{1}{q}\sum_{i=1}^qX_i-\frac{1}{t+1}\right |>\frac{\varepsilon}{t+1}\right )<\delta$

Proof: By Chebyshev’s inequality,

\begin{aligned} \mathbb{P}\left (\left | \frac{1}{q}\sum_{i=1}^qX_i-\frac{1}{t+1}\right |>\frac{\varepsilon}{t+1}\right )&<\frac{\mathbb{V}[\frac{1}{q}\sum_{i}X_i]}{\varepsilon^2/(t+1)^2}\\ &=\frac{\mathbb{V}[X]/q}{\varepsilon^2/(t+1)^2}\\ &<\frac{\mathbb{E}[X]^2/q}{\varepsilon^2/(t+1)^2}\\ &=\frac{1}{\varepsilon^2q}=\delta \end{aligned}

$\square$.

### FM++

To achieve a logarithmic dependence on the failure probability, we run $m=\varTheta(\log\frac{1}{\delta})$ independent copies of FM+, each with failure probability $\delta=\frac{1}{3}$. Then query() outputs the median across all FM+ estimates.

The total space for FM++ is $O(\frac{1}{\varepsilon^2}\log\frac{1}{\delta})$ idealized memory cells. The analysis of the median trick is the same as in Lecture 1.

## Non-idealized solution

Definition: A family $\mathcal{H}$ of functions mapping $[a]$ into $[b]$ is k-wise independent if and only if for all distinct $i_1,...,i_k\in [a]$ and for all $j_1,...,j_k\in [b]$,

$\mathbb{P}_{h\in\mathcal{H}}[h(i_1)=j_1\wedge...\wedge h(i_k)=j_k]=\frac{1}{b^k}$

Note that we can store $h\in\mathcal{H}$ in memory with $\log_2|\mathcal{H}|$ bits. One example of such a family $\mathcal{H}$ is the set of all functions mapping $[a]$ to $[b]$. Then $|\mathcal{H}|=b^a$ so $\log|\mathcal{H}|=a\log b$. A less trivial example is due to Carter and Wegman, where $\mathcal{H}$ is the set of all degree-($k-1$) polynomials over $\mathbb{F}_q$ such that $a = b = q$. Then $|\mathcal{H}|=q^k$ , and so $\log|\mathcal{H}|=k\log q$.
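For the Carter and Wegman construction with $k=2$, the family $\{h_{a,b}(x)=(ax+b)\bmod q\}$ is small enough to check exhaustively for a small prime. This sketch (with $q=5$, our choice) verifies 2-wise independence directly from the definition:

```python
# Carter-Wegman 2-wise independent family over F_p: h_{a,b}(x) = (a*x + b) mod p.
p = 5  # a small prime, so we can enumerate all p^2 functions in the family

def h(a, b, x):
    return (a * x + b) % p

# For any fixed distinct inputs i1 != i2 and any targets (j1, j2), exactly one
# of the p^2 choices of (a, b) maps i1 -> j1 and i2 -> j2, i.e. the probability
# over a random (a, b) is 1/p^2, as the definition requires.
i1, i2 = 0, 1
counts = {}
for a in range(p):
    for b in range(p):
        pair = (h(a, b, i1), h(a, b, i2))
        counts[pair] = counts.get(pair, 0) + 1

assert all(counts[(j1, j2)] == 1 for j1 in range(p) for j2 in range(p))
```

Storing a member of this family means storing $(a,b)$, i.e. $2\log q$ bits, matching the $k\log q$ count above.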

Having seen these examples, we will just assume that we have access to some 2-wise independent hash family, which we can store using $O(\log n)$ bits.

### Common strategy: Geometric Sampling of Streams

Suppose we have a subroutine that gives us $\tilde{t}$, a 32-approximation to $t$. To get a $(1+\varepsilon)$-approximation, we will use the common strategy of geometric sampling of streams. This strategy is important to understand because it is used fairly often in scenarios like this one.

First, consider the trivial solution: remember the first $K$ distinct elements in the stream, with $K=\frac{c}{\varepsilon^2}$ for some constant $c$; if the number of distinct elements exceeds $K$, the trivial solution simply ignores the excess. We can then compose trivial solutions as follows:

1. Assume $n$ is a power of 2, pick $g:[n]\rightarrow[n]$ from a 2-wise family.
2. init() : create $\log n+1$ trivial solutions $\text{TS}_0,...,\text{TS}_{\log n}$.
3. update(i) : feed $i$ to $\text{TS}_{\text{LSB}(g(i))}$.
4. query() : choose $j$ such that $\frac{\tilde{t}}{2^{j+1}}\approx\frac{1}{\varepsilon^2}$.
5. Output $2^{j+1}\cdot\text{TS}_j.\text{query()}$.
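The composition above can be sketched in Python as follows; the class names, the Mersenne prime used for the pairwise hash, and the cap on LSB are all our choices, and the query step (which needs the rough estimate $\tilde{t}$) is omitted:

```python
import random

def lsb(x, cap):
    """Index of the least significant set bit of x; returns cap when x == 0."""
    if x == 0:
        return cap
    j = 0
    while x & 1 == 0:
        x >>= 1
        j += 1
    return j

class TrivialSolution:
    """Remembers the first K distinct elements it sees; ignores any excess."""
    def __init__(self, K):
        self.K, self.elems = K, set()

    def update(self, i):
        if i in self.elems or len(self.elems) < self.K:
            self.elems.add(i)

    def query(self):
        return len(self.elems)   # exact if fewer than K distinct elements arrived

class GeometricSampling:
    """Feed element i to TS_{LSB(g(i))} for a pairwise independent hash g."""
    def __init__(self, log_n, K):
        self.p = (1 << 61) - 1                  # a Mersenne prime, our choice
        self.a = random.randrange(1, self.p)
        self.b = random.randrange(self.p)
        self.log_n = log_n
        self.ts = [TrivialSolution(K) for _ in range(log_n + 1)]

    def g(self, i):
        return (self.a * i + self.b) % self.p   # Carter-Wegman pairwise hash

    def update(self, i):
        level = min(lsb(self.g(i), self.log_n), self.log_n)
        self.ts[level].update(i)
```

Each distinct element lands in exactly one level, and level $j$ is hit with probability about $2^{-(j+1)}$, so $\text{TS}_j$ sees about $t/2^{j+1}$ distinct elements.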

Define a set of virtual streams wrapping the trivial solutions: $\text{VS}_j$ is the substream of elements $i$ with $\text{LSB}(g(i))=j$, so $\text{TS}_j$ receives exactly the elements of $\text{VS}_j$ and estimates its number of distinct elements.

Fix some $j$ and, for each distinct element $i$ in the stream, let $Z_i=\begin{cases}1&\text{if }\text{LSB}(g(i))=j\\0&\text{otherwise}\end{cases}$. Then $Z=\sum_{i\in\text{stream}}Z_i$ is the number of distinct elements in $\text{VS}_j$.

Claim: $\mathbb{E}[Z]=\frac{t}{2^{j+1}}$.