3. Discrete random variables

For the current purposes it is sufficient to think of a discrete random variable as variable with a fixed but unknown integer value larger than or equal to zero. The random variables we will encounter later will be mostly secret keys, messages, and message tags. Our knowledge about the random variable is completely determined by a vector of positive probabilities , each describing how confident we are that the variable's value is the specific integer , adding up to 1. It is often better to talk about uncertainty, or entropy, instead of knowledge. No uncertainty means full knowledge, i.e., 100% probability for one value and 0% for the rest.

An important special case is the random variables for which all non-zero probabilities are equal. These random variables are called uniform random variables. If Alice throws a normal, but perfect, six-sided die and keeps the result secret, the result is to Bob a uniform random variable with six possible values. If Bob had managed to replace Alice's die with one that is not perfect, the variable would not have been completely uniform. In any case, Alice knows the value so to her it is a random variable with zero uncertainty or entropy.

The range of is the set of values can have, even those with zero probability, and is denoted . Even though infinite ranges are possible, we will limit ourselves to random variables with finite ranges. Without loss of generality we will only consider ranges consisting of integers .

The expectation value is denoted and can be defined as

Throughout the rest of this thesis the class of random
variables
defined by (3.2) will serve as an
illustrative example. The same definition in Python code is
available as function `Q` in entropies.py line 54 .

is a uniform random variable with possible values. If is large has one very probable and equally improbable values. Such random variables are rather extreme and will therefore nicely illustrate some somewhat unintuitive situations later.

Two random variables and can be related in ways that are unrelated to their internal probability vectors. To completely specify both their respective probability vectors and their relations it is sufficient (and necessary) to (be able to) specify the probability vector of a larger random variable, the concatenation of the two variables, written as , with probabilities and . Observe that neither nor are products. When the random variables are related in this way the probabilities in their respective smaller probability vectors are called marginal probabilities.

As an example, consider the dependent random variables defined by

which can be seen as the two bits of the uniform random variable with values 0, 1 and 2. Their marginal distributions are

Given two random variables and , when learning that the value of is the probability vector of can change. If they are dependent contains information about and receiving information changes the probabilities. The new random variable can be denoted and its probability vector is

This relation is called Bayes' theorem and is a fundamental part of probability theory, but the notation is unorthodox.

Normally is written as and is read as the probability that equals given that equals . Similar notations are used for other things, most notably for conditional entropy. We will use the unconventional notation exclusively, both to note explicitly that we are working on a new random variable and to bring the implicit hidden expectation value in conditional entropy written the conventional way out into the light. See Chapter 4.4.1 for more details.

Using Bayes' theorem on the previously defined and yields

and, because of the symmetry in their definition, this also holds when and are interchanged.

With more than one random variable it can be necessary to specify the expectation value over just one of them. A natural definition is

There are lots of standard inequalities that are very useful in connection with random variables. We will only need one of them, Jensen's inequality.

Jensen's inequality is applicable to convex and concave
functions. A function is called convex if it is continuous and
the whole line between every two points in its graph lies on or
above the graph. If the whole line lies above the graph it is
also called strictly convex. A function is concave^{3.1} or
strictly concave if is convex or strictly convex. In other words,
for all
,
, and
holds:

Jensen's inequality states that if is a convex or concave function, then for any random variable :

Furthermore, if is strictly convex or strictly concave, equality occurs if and only if there is only one value of that is assigned a non-zero probability.

For a random variable with only two possible values, Jensen's inequality just restates the definition of convexity. Generalizing to arbitrary number of values by induction is pretty straightforward and is explained in many other places.

If is convex and has an inverse, an alternative way to express Jensen's inequality is .