3. Discrete random variables

3.1 Discrete random variables

For the current purposes it is sufficient to think of a discrete random variable $X$ as a variable with a fixed but unknown integer value larger than or equal to zero. The random variables we will encounter later will mostly be secret keys, messages, and message tags. Our knowledge about the random variable is completely determined by a vector of non-negative probabilities $P_X(x) \equiv P(X\!=\!x)$, adding up to 1, each entry describing how confident we are that the variable's value is the specific integer $x$. It is often better to talk about uncertainty, or entropy, instead of knowledge: no uncertainty means full knowledge, i.e., 100% probability for one value and 0% for the rest.

An important special case is random variables for which all non-zero probabilities are equal; these are called uniform random variables. If Alice throws a normal, but perfect, six-sided die and keeps the result secret, the result is to Bob a uniform random variable with six possible values. If Bob had managed to replace Alice's die with an imperfect one, the variable would not have been completely uniform. In any case, Alice knows the value, so to her it is a random variable with zero uncertainty or entropy.

The range of $ X$ is the set of values $ X$ can have, even those with zero probability, and is denoted $ R(X)$. Even though infinite ranges are possible, we will limit ourselves to random variables with finite ranges. Without loss of generality we will only consider ranges consisting of integers $ \geq 0$.

The expectation value is denoted $ E(\cdot)$ and can be defined as

$\displaystyle E(f(X)) \equiv \sum_{x\in R(X)} P\left(X\!=\!x\right) f(x).$ (3.1)
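
As a minimal illustration (this helper is our own, not taken from the thesis's entropies.py), definition (3.1) translates directly into Python when the probability vector is stored as a list indexed by the values of $X$:

    def expectation(P_X, f=lambda x: x):
        """E(f(X)) per (3.1); P_X[x] = P(X = x) for x = 0, ..., len(P_X) - 1."""
        return sum(p * f(x) for x, p in enumerate(P_X))

    expectation([0.5, 0.5])                  # E(X) = 0.5 for a fair bit
    expectation([0.5, 0.5], lambda x: x**2)  # E(X^2) = 0.5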

Throughout the rest of this thesis the class of random variables $Q^p_n$ defined by (3.2) will serve as an illustrative example. The same definition in Python code is available as function Q in entropies.py, line 54.

$\displaystyle P(Q^p_n = i) \equiv \begin{cases} p & \text{if } i=0 \\ \frac{1-p}{n} & \text{if } 1 \leq i \leq n \\ 0 & \text{if } i > n \end{cases}$ (3.2)

$Q^0_n$ is a uniform random variable with $n$ possible values. If $p$ is large, $Q^p_n$ has one very probable value and $n$ equally improbable ones. Such random variables are rather extreme and will therefore nicely illustrate some unintuitive situations later.
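
Since entropies.py itself is not reproduced here, the following is only a sketch of what function Q might look like, returning the probability vector of $Q^p_n$ according to (3.2):

    def Q(p, n):
        """Probability vector of Q^p_n from (3.2):
        P(Q = 0) = p and P(Q = i) = (1 - p) / n for 1 <= i <= n."""
        return [p] + [(1 - p) / n] * n

For example, Q(0, 6) gives the uniform distribution of Alice's die above, and the entries always sum to 1.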

3.2 Dependent random variables

Two random variables $X$ and $Y$ can be related in ways that are not captured by their individual probability vectors. To completely specify both their respective probability vectors and their relations it is sufficient (and necessary) to be able to specify the probability vector of a larger random variable, the concatenation of the two variables, written as $XY$, with probabilities $P\left(XY\!=\!xy\right) = P(X\!=\!x$ and $Y\!=\!y)$. Observe that neither $XY$ nor $xy$ is a product. When random variables are related in this way, the probabilities in their respective smaller probability vectors are called marginal probabilities.

As an example, consider the dependent random variables defined by

\begin{subequations}\begin{align}
P\left(T_1T_2\!=\!00\right) &= P(T_1\!=\!0\text{ and }T_2\!=\!0) \equiv \nicefrac13 \\
P\left(T_1T_2\!=\!01\right) &= P(T_1\!=\!0\text{ and }T_2\!=\!1) \equiv \nicefrac13 \\
P\left(T_1T_2\!=\!10\right) &= P(T_1\!=\!1\text{ and }T_2\!=\!0) \equiv \nicefrac13 \\
P\left(T_1T_2\!=\!11\right) &= P(T_1\!=\!1\text{ and }T_2\!=\!1) \equiv 0
\end{align}\end{subequations} (3.3)

which can be seen as the two bits of the uniform random variable with values 0, 1 and 2. Their marginal distributions are
\begin{subequations}\begin{align}
P\left(T_1\!=\!0\right) &= P\left(T_2\!=\!0\right) = \nicefrac23 \\
P\left(T_1\!=\!1\right) &= P\left(T_2\!=\!1\right) = \nicefrac13.
\end{align}\end{subequations} (3.4)
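
To make this concrete in code (the representation is our own, not from entropies.py), a joint distribution can be stored with value pairs as keys; the marginal probabilities of (3.4) then follow by summing over the other variable:

    from fractions import Fraction

    # Joint distribution (3.3): the two bits of a uniform random
    # variable with values 0, 1, and 2 (binary 00, 01, 10).
    P_T1T2 = {(0, 0): Fraction(1, 3), (0, 1): Fraction(1, 3),
              (1, 0): Fraction(1, 3), (1, 1): Fraction(0)}

    def marginal(P_XY, i):
        """Marginal distribution of component i of a joint distribution."""
        out = {}
        for pair, p in P_XY.items():
            out[pair[i]] = out.get(pair[i], Fraction(0)) + p
        return out

    marginal(P_T1T2, 0)  # {0: 2/3, 1: 1/3}, matching (3.4)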

Given two random variables $X$ and $Y$, learning that the value of $Y$ is $y$ can change the probability vector of $X$. If they are dependent, $Y$ contains information about $X$, and receiving information changes the probabilities. The new random variable can be denoted $X\vert_{Y=y}$ and its probability vector is

$\displaystyle P\left(X\big\vert _{Y=y}\!=\!x\right) = \frac{P\left(XY\!=\!xy\right)}{P\left(Y\!=\!y\right)}.$ (3.5)

This relation is called Bayes' theorem and is a fundamental part of probability theory, but the notation is unorthodox.

Normally $P\left(X\vert_{Y=y}\!=\!x\right)$ is written as $P(X\!=\!x \vert Y\!=\!y)$ and is read as the probability that $X$ equals $x$ given that $Y$ equals $y$. Similar notations are used for other things, most notably for conditional entropy. We will use the unconventional notation exclusively, both to note explicitly that we are working with a new random variable and to bring the expectation value implicit in the conventional notation for conditional entropy out into the light. See Section 4.4.1 for more details.

Using Bayes' theorem on the previously defined $ T_1$ and $ T_2$ yields

\begin{subequations}\begin{align}
P\left(T_1\big\vert_{T_2=0}\!=\!0\right) &= \frac{P\left(T_1T_2\!=\!00\right)}{P\left(T_2\!=\!0\right)} = \frac{\nicefrac13}{\nicefrac23} = \frac12 \\
P\left(T_1\big\vert_{T_2=0}\!=\!1\right) &= \frac{P\left(T_1T_2\!=\!10\right)}{P\left(T_2\!=\!0\right)} = \frac{\nicefrac13}{\nicefrac23} = \frac12 \\
P\left(T_1\big\vert_{T_2=1}\!=\!0\right) &= \frac{P\left(T_1T_2\!=\!01\right)}{P\left(T_2\!=\!1\right)} = \frac{\nicefrac13}{\nicefrac13} = 1 \\
P\left(T_1\big\vert_{T_2=1}\!=\!1\right) &= \frac{P\left(T_1T_2\!=\!11\right)}{P\left(T_2\!=\!1\right)} = \frac{0}{\nicefrac13} = 0
\end{align}\end{subequations} (3.6)

and, because of the symmetry in their definition, this also holds when $ T_1$ and $ T_2$ are interchanged.
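
In the same illustrative representation, equation (3.5) becomes a short function; applied to P_T1T2 from the sketch above it reproduces (3.6):

    def condition(P_XY, y):
        """Probability vector of X|_{Y=y}, per equation (3.5)."""
        P_y = sum(p for (x, y2), p in P_XY.items() if y2 == y)
        return {x: p / P_y for (x, y2), p in P_XY.items() if y2 == y}

    condition(P_T1T2, 0)  # {0: 1/2, 1: 1/2}
    condition(P_T1T2, 1)  # {0: 1, 1: 0}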

With more than one random variable it can be necessary to specify the expectation value over just one of them. A natural definition is

$\displaystyle E_{x\in R(X)}(f(x, Y)) \equiv \sum_{x\in R(X)} P\left(X\!=\!x\right) f(x, Y).$ (3.7)
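
In code, one way to read (3.7) is that the partial expectation turns a function of two variables into a function of the remaining one (again our own sketch):

    def partial_expectation(P_X, f):
        """E over x of f(x, Y) per (3.7); the result is still a function
        of the second variable, returned here as a function of y."""
        return lambda y: sum(p * f(x, y) for x, p in enumerate(P_X))

    g = partial_expectation([0.5, 0.5], lambda x, y: x + y)
    g(3)  # 3.5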

3.3 Jensen's inequality

There are many standard inequalities that are useful in connection with random variables. We will need only one of them, Jensen's inequality.

Jensen's inequality applies to convex and concave functions. A function is called convex if it is continuous and the line segment between any two points on its graph lies on or above the graph. If, between any two distinct points, the open segment lies strictly above the graph, the function is called strictly convex. A function $f$ is concave or strictly concave if $-f$ is convex or strictly convex, respectively. In other words, for all $0<\lambda<1$ and all $x_1 \neq x_2$ the following holds:

\begin{subequations}\begin{align}
f\text{ is convex: } & \lambda f(x_1) + (1\!-\!\lambda) f(x_2) \geq f(\lambda x_1 + (1\!-\!\lambda) x_2) \\
f\text{ is strictly convex: } & \lambda f(x_1) + (1\!-\!\lambda) f(x_2) > f(\lambda x_1 + (1\!-\!\lambda) x_2) \\
f\text{ is concave: } & \lambda f(x_1) + (1\!-\!\lambda) f(x_2) \leq f(\lambda x_1 + (1\!-\!\lambda) x_2) \\
f\text{ is strictly concave: } & \lambda f(x_1) + (1\!-\!\lambda) f(x_2) < f(\lambda x_1 + (1\!-\!\lambda) x_2)
\end{align}\end{subequations} (3.8)

Jensen's inequality states that if $ f$ is a convex or concave function, then for any random variable $ X$:

\begin{subequations}\begin{align}
f\text{ is convex: } & f(E(X)) \leq E(f(X)) \\
f\text{ is concave: } & f(E(X)) \geq E(f(X))
\end{align}\end{subequations} (3.9)

Furthermore, if $ f$ is strictly convex or strictly concave, equality occurs if and only if there is only one value of $ X$ that is assigned a non-zero probability.
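
A quick numerical sanity check of this (our own example, reusing the expectation helper sketched in Section 3.1) with the strictly convex function $f(x) = x^2$:

    P = [0.5, 0.25, 0.25]  # P(X=0), P(X=1), P(X=2)
    f = lambda x: x ** 2   # strictly convex

    f(expectation(P))      # f(E(X)) = 0.75**2 = 0.5625
    expectation(P, f)      # E(f(X)) = 1.25
    # The inequality is strict, since more than one value of X
    # has non-zero probability.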

For a random variable with only two possible values, Jensen's inequality just restates the definition of convexity. Generalizing to an arbitrary number of values by induction is straightforward and is explained in many other places.

If $f$ is convex and strictly increasing, so that $f^{-1}$ exists and is also increasing, an alternative way to express Jensen's inequality is $f^{-1}(E(f(X))) \geq E(X)$.
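
With $f(x) = e^x$, which is convex and strictly increasing, this form can be checked numerically as well (our own example, reusing P and expectation from above):

    import math

    math.log(expectation(P, math.exp))  # f^{-1}(E(f(X))) = 1.108...
    expectation(P)                      # E(X) = 0.75, so the inequality holds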