4. Entropy

Entropy is an important concept in many fields, and one field where it is used extensively is QKG. This chapter gives a general overview of entropy and presents a generalization of the chain rule of entropy that is needed in later chapters. Alternative explanations of most of this material can be found in many other places, along with many other useful bounds and relations. The introductory chapters of [6] are highly recommended.

4.1 Conventions

The function $ f(p) = p\log(p)$, where $ p$ is a probability, occurs frequently in connection with entropies. $ 0\log(0)$ is normally undefined, but since $ \lim_{p\to0}p\log(p) = 0$ is well-defined, we extend the function to $ p=0$ by continuity.

Another convention we will follow is to let $ \log(\cdot)$ mean the logarithm base 2. We can choose any base, but using base 2 consistently means that everything will be expressed in bits, and people tend to be familiar with bits.

4.2 Shannon entropy

Entropy is a measure of uncertainty regarding a discrete random variable. For many purposes, the Shannon entropy is the only measure needed. Shannon entropy is defined by

$\displaystyle H_\text{Shannon}(X) \equiv -\sum_{x=0}^{\infty} P\left(X\!=\!x\right)\log(P\left(X\!=\!x\right))$ (4.1)

and has the unit bits. A Python implementation is available as the function shannon_entropy on line 15.
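The referenced listing is not reproduced in this chapter, but a direct transcription of (4.1) might look like the sketch below. The parameter name is an assumption; values with zero probability are skipped, following the $ 0\log(0)=0$ convention of section 4.1.

```python
from math import log2

def shannon_entropy(probabilities):
    """Shannon entropy (4.1) in bits; 0*log(0) is taken as 0 by continuity."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# A fair coin carries exactly one bit of uncertainty.
print(shannon_entropy([0.5, 0.5]))  # -> 1.0
```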

The Shannon entropy is a fundamental measure in information theory. It was introduced by Claude E. Shannon, now considered the father of information theory, in [7]. Much can be said about its properties, its uniqueness, and its relation to the thermodynamical entropy in physics, but we will only scratch the surface here. One way of understanding it better is to rewrite the definition as

$\displaystyle H_\text{Shannon}(X) = E(-\log(P_X(X)))$ (4.2)

where $ P_X(X)$ is the probability ascribed to the value of $ X$ that turns out to be correct. This makes $ P_X(X)$ a discrete random variable, but not necessarily with integer values.

Now it is clear that the Shannon entropy is the expectation value of $ -\log(p)$ where $ p$ is the probability assigned to the measured value of the random variable. $ -\log(p)$ can be interpreted as the needed length, in bits, of a message communicating a measurement that had probability $ p$, which makes the Shannon entropy a measure of the expected message length needed to communicate the measured value of a random variable.

The Shannon entropy of a uniform random variable with $ n$ possible values is

$\displaystyle H_\text{Shannon}(Q^0_n) = E\left(-\log\left(\frac1n\right)\right) = \log(n)$ (4.3)

which means that we need $ \log(n)$ bits$^{4.1}$ to communicate one choice from $ n$ different equally likely states.

Without qualifiers, the word entropy and a non-subscripted $ H$ normally refer only to Shannon entropy. However, when dealing with QKG, as well as most other parts of cryptography, this measure is not sufficient. The goal of QKG is to produce a key that is known to both Alice and Bob but to Eve is a random variable with high uncertainty. $ -\log(p)$ is a measure of the uncertainty of a value assigned probability $ p$ and is therefore a measure of the security of that particular value of the key. Shannon entropy measures the expectation value of that security. The dangers of focusing on Shannon entropy alone are highlighted by this theorem:

Theorem 1   There exist finite discrete random variables with arbitrarily high Shannon entropy for which the chance of guessing the value in one try is arbitrarily close to $ 1$.
Proof. Consider guessing the value of $ Q^p_n$. The guess $ i=0$ has chance $ p$ to be correct. The Shannon entropy is
\begin{displaymath}\begin{split}H_\text{Shannon}(Q^p_n) =& \sum_{i=0}^n -P\left(Q^p_n\!=\!i\right)\log\left(P\left(Q^p_n\!=\!i\right)\right) \\ =& -p\log(p) - (1-p)\log\left(\frac{1-p}{n}\right) \\ \geq& \; (1-p)\log(n) \rightarrow \infty \text{ when } n \rightarrow \infty \quad \forall p<1 \end{split}\end{displaymath} (4.4)

which completes the proof. $ \qedsymbol$

A good security average is not good enough, and Shannon entropy alone is clearly not a sufficient measure of the quality of a key.
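The proof can also be checked numerically. The sketch below assumes the reading of $ Q^p_n$ used in the proof: the value 0 has probability $ p$ and the remaining $ n$ values share $ 1-p$ uniformly. It uses the closed form from (4.4) rather than materializing the distribution, so very large $ n$ can be tried.

```python
from math import log2

def shannon_entropy_qpn(p, n):
    """Closed form for H_Shannon(Q^p_n), assuming value 0 has probability p
    and the remaining n values share 1-p uniformly (as in the proof of
    Theorem 1): H = -p*log(p) - (1-p)*log((1-p)/n)."""
    return -p * log2(p) - (1 - p) * (log2(1 - p) - log2(n))

# The first guess (i = 0) succeeds with probability 0.99 regardless of n,
# yet the Shannon entropy grows without bound:
for exponent in (2, 20, 2000):
    print(exponent, shannon_entropy_qpn(0.99, 10**exponent))
```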

4.3 Guessing entropy

Another measure, more closely related to the difficulty of guessing the value of a random variable, was introduced by Massey in [8]. He did not name it, but in [6] it is called guessing entropy. Note, however, that while most other entropies have the unit bits, the guessing entropy is measured in number of guesses. Without loss of generality we can assume that the values of $ X$ are sorted with decreasing probability, in which case the guessing entropy of $ X$ is defined as

$\displaystyle G(X) = \sum_{x = 0}^{\max(R(X))} P\left(X\!=\!x\right) (x\!+\!1).$ (4.5)

That is, the guessing entropy is simply the average number of guesses needed to guess the value of a random variable using the optimal strategy. The definition formalized in Python code is available as the function guessing_entropy on line 36. We have yet again a measure of average security, and similarly to Theorem 1 we can write
Theorem 2   There exist finite discrete random variables with arbitrarily high guessing entropy for which the chance of guessing the value in one try is arbitrarily close to $ 1$.
Proof. Consider guessing the value of $ Q^p_n$, where $ p\geq \nicefrac{(1-p)}n$ so the values are sorted in decreasing probability. The (optimal) guess $ i=0$ has chance $ p$ to be correct. The guessing entropy is
\begin{displaymath}\begin{split}G(Q^p_n) =& \sum_{i=0}^n P\left(Q^p_n\!=\!i\right)(i\!+\!1) = p + \frac{1-p}{n}\sum_{i=1}^n (i\!+\!1) \\ =& \; p + \frac{(1-p)(n+3)}{2} \rightarrow \infty \text{ when } n \rightarrow \infty \quad \forall p<1 \end{split}\end{displaymath} (4.6)

which completes the proof. $ \qedsymbol$

Again we see that a good security average is not good enough, and guessing entropy alone is not a sufficient measure of the quality of a key.
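A sketch of the guessing entropy computation, reconstructed from (4.5); the function name matches the listing referenced above, but the body and the list-of-probabilities parameter are assumptions. Sorting makes the input order irrelevant, matching the without-loss-of-generality assumption in the definition.

```python
def guessing_entropy(probabilities):
    """Expected number of guesses (4.5): try values in order of decreasing
    probability, so the k-th most likely value costs k guesses."""
    ordered = sorted(probabilities, reverse=True)
    return sum((i + 1) * p for i, p in enumerate(ordered))

# A fair coin takes 1.5 guesses on average.
print(guessing_entropy([0.5, 0.5]))  # -> 1.5

# Q^0.99_n: the first guess succeeds 99% of the time, yet the average
# number of guesses grows linearly in n (Theorem 2).
n = 10**6
print(guessing_entropy([0.99] + [0.01 / n] * n))
```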

4.4 Rényi entropy

A useful generalization of Shannon entropy is the Rényi entropy, which assigns an entropy measure $ H_{\alpha}$, pronounced the Rényi entropy of order $ \alpha$, to every $ \alpha$ with $ 0\leq\alpha\leq\infty$. Rényi entropy is, just like Shannon entropy, measured in units of bits.

\begin{subequations}\begin{align}H_{\alpha}(X) &\equiv \frac{1}{1-\alpha}\log{\sum_{x \in R(X)}P(X\!=\!x)^{\alpha}} \qquad 0\leq\alpha<\infty,\; \alpha\neq1\\ H_{1}(X) &\equiv \lim_{\alpha\to1} H_{\alpha}(X) = H_\text{Shannon}(X)\\ H_{\infty}(X) &\equiv \lim_{\alpha\to\infty} H_{\alpha}(X) = -\log{\max_{x \in R(X)}P(X\!=\!x)} \end{align}\end{subequations} (4.7)

The equality in (4.7c) is easy to show using e.g. l'Hospital's$^{4.2}$ rule. The definition is also available as Python code as the function entropy on line 23.
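The referenced listing is elsewhere in the thesis; a sketch reconstructed directly from (4.7), with the two limit cases handled explicitly, might look as follows (the signature is an assumption):

```python
from math import inf, log2

def entropy(probabilities, alpha):
    """Renyi entropy H_alpha of (4.7), in bits."""
    probabilities = [p for p in probabilities if p > 0]
    if alpha == 1:    # (4.7b): the limit alpha -> 1 is the Shannon entropy
        return -sum(p * log2(p) for p in probabilities)
    if alpha == inf:  # (4.7c): min-entropy, depends only on the largest probability
        return -log2(max(probabilities))
    return log2(sum(p**alpha for p in probabilities)) / (1 - alpha)

# For a non-uniform distribution the orders are strictly decreasing in alpha:
dist = [0.5, 0.25, 0.125, 0.125]
for a in (0, 1, 2, inf):
    print(a, entropy(dist, a))
```

For the uniform distribution all orders coincide, illustrating the monotonicity property discussed below.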

An important property of Rényi entropy is that for $ \alpha < \alpha'$, $ H_\alpha(X) \leq H_{\alpha'}(X)$ for all $ X$, with equality if and only if $ X$ is a uniform random variable. In other words, $ H_\alpha(X)$ is a constant function of $ \alpha$ if $ X$ is uniform and strictly decreasing if not. A full proof is given in [6] and follows quite naturally by writing $ H_\alpha(X)$ in analogy with (4.2) as $ -\log\big(E[P_X(X)^{\alpha-1}]^\frac1{\alpha-1}\big)$ and using Jensen's inequality.

Figure 4.1: All Rényi entropies for 8 different random variables.

Some of these measures have quite natural interpretations. Rényi entropies with a higher $ \alpha$ parameter depend more on the probabilities of the more probable values and less on the improbable ones. $ H_0(X)$ is the logarithm of the number of values of $ X$ that have non-zero probabilities. Any two random variables with different probability distributions but the same number of values with non-zero probabilities will have the same Rényi entropy of order 0. $ H_1(\cdot)$ is the Shannon entropy, in which the actual probabilities are quite important. $ H_2(\cdot)$ is often called collision entropy, or just Rényi entropy, and is the negative logarithm of the likelihood of two independent random variables with the same probability distribution having the same value. More probable values are much more likely to collide and are therefore more visible in the collision entropy than in the Shannon entropy. $ H_\infty(\cdot)$ is called min-entropy and is a function of the highest probability only.

The shape of $ H_\alpha(X)$ as a function of $ \alpha$ for eight different random variables $ X$ is shown in figure 4.1.

4.4.1 Conditional Rényi entropy

The conditional Shannon entropy for $ X$ given $ Y$ is conventionally written as $ H_1(X\vert Y)$ and defined by
$\displaystyle H_1(X\vert Y) \equiv E_{y\in R(Y)}(H_1(X\vert _{Y=y})).$ (4.8)

It expresses the expected value of the entropy of $ X$ after $ Y$ is disclosed. The notation clearly hides an implicit expectation value, but since Shannon entropy is an expectation value to begin with, that doesn't change much. Using equations (4.2) and (3.5) we can write (4.8) more explicitly as
$\displaystyle E_{y\in R(Y)}(H_1(X\vert_{Y=y})) = E_{xy\in R(XY)}\left(-\log\left(\frac{P\left(XY\!=\!xy\right)}{P\left(Y\!=\!y\right)}\right)\right)$ (4.9)

which is a single expectation value just like equation (4.2).

Rényi entropies of order different from 1 are not expectation values, so things are not quite as simple. In fact, according to [6] there is not even agreement on a standard definition of conditional Rényi entropy. However, the dominant definition seems to be the same as (4.8) with both $ H_1$:s replaced by $ H_\alpha$. That definition will be used here, but the expectation value will always be explicitly written out. We have seen that averaging security can be dangerous, and it is nice not to hide away something potentially dangerous in the notation.
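Written out explicitly, the expectation-value definition can be sketched in Python as follows. The representation of the joint distribution as a dictionary keyed by $ (x, y)$ pairs is an assumption made for illustration; for $ \alpha=1$ the result agrees with the classic chain rule $ H_1(X\vert Y) = H_1(XY) - H_1(Y)$.

```python
from math import inf, log2

def renyi(probabilities, alpha):
    """H_alpha of (4.7); alpha = 1 gives Shannon, alpha = inf min-entropy."""
    probabilities = [p for p in probabilities if p > 0]
    if alpha == 1:
        return -sum(p * log2(p) for p in probabilities)
    if alpha == inf:
        return -log2(max(probabilities))
    return log2(sum(p**alpha for p in probabilities)) / (1 - alpha)

def conditional_entropy(joint, alpha):
    """E_y(H_alpha(X|Y=y)) for a joint distribution {(x, y): probability},
    i.e. the expectation written out explicitly as in (4.8)."""
    p_y = {}
    for (x, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p
    total = 0.0
    for y, py in p_y.items():
        conditional = [p / py for (x, y2), p in joint.items() if y2 == y]
        total += py * renyi(conditional, alpha)
    return total

# Chain rule check for alpha = 1 (an arbitrary example distribution):
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}
h_xy = renyi(list(joint.values()), 1)
h_y = renyi([0.6, 0.4], 1)
print(conditional_entropy(joint, 1), h_xy - h_y)
```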

4.4.2 Chain rule of Rényi entropy

With conditional Shannon entropy comes the chain rule of Shannon entropy, equation (4.10b) below. One way to define conditional Rényi entropy is to choose it so the same relation still holds when the Shannon entropies are replaced with Rényi entropies. However, the relation does not hold for the expectation value based definition chosen above so it is clearly a different conditional entropy. Fortunately, that doesn't stop us from generalizing the chain rule in other ways to something that is useful with general Rényi entropies:

Theorem 3   Let $ X$ and $ Y$ be arbitrary random variables and $ XY$ their concatenation.
\begin{subequations}\begin{align}E_{y\in R(Y)}(H_\alpha(X\vert_{Y=y})) &\geq H_\alpha(XY) - H_1(Y) \text{ if } \alpha>1\\ E_{y\in R(Y)}(H_1(X\vert_{Y=y})) &= H_1(XY) - H_1(Y)\\ E_{y\in R(Y)}(H_\alpha(X\vert_{Y=y})) &\leq H_\alpha(XY) - H_1(Y) \text{ if } \alpha<1 \end{align}\end{subequations} (4.10)

with equality in (4.10a) and (4.10c) if and only if $ H_\alpha(X\vert_{Y=y}) - \log(P(Y\!=\!y))$ is constant for all values $ y$ of $ Y$ that have non-zero probabilities. Note that the rightmost entropies are all Shannon entropies.
Proof. Let $ p_y$ be the probabilities for the marginal distribution of $ Y$, $ p_y = P(Y\!=\!y)$, and let $ q_{yx}$ be the probability that $ X=x$ when $ Y=y$, $ q_{yx} = \nicefrac{P(Y\!=\!y\text{ and }X\!=\!x)}{P(Y\!=\!y)} = P(X\vert _{Y=y}\!=\!x)$. It is easy to see that $ P(XY\!=\!xy) = p_yq_{yx}$. We begin with the old well-known case $ \alpha=1$ as a warm-up:
\begin{displaymath}\begin{split}E_{y\in R(Y)}(&H_1(X\vert_{Y=y})) + H_1(Y) \\ &= -\sum_{y\in R(Y)} p_y \sum_{x\in R(X)} q_{yx}\log(q_{yx}) - \sum_{y\in R(Y)} p_y\log(p_y) \\ &= -\sum_{xy\in R(XY)} p_yq_{yx}\log\left(p_yq_{yx}\right) = H_1(XY) \end{split}\end{displaymath} (4.11)

When $ \alpha>1$ we have instead:

\begin{displaymath}\begin{split}E_{y\in R(Y)}&(H_\alpha(X\vert_{Y=y})) + H_1(Y) \\ &= \sum_{y\in R(Y)} p_y \frac{1}{1-\alpha}\log\Big(\sum_{x\in R(X)} q_{yx}^\alpha\Big) - \sum_{y\in R(Y)} p_y\log(p_y) \\ &= \sum_{y\in R(Y)} p_y \frac{1}{1-\alpha}\log\Big(p_y^{\alpha-1}\sum_{x\in R(X)} q_{yx}^\alpha\Big) \\ &\geq \frac{1}{1-\alpha}\log\Big(\sum_{y\in R(Y)} p_y\,p_y^{\alpha-1}\sum_{x\in R(X)} q_{yx}^\alpha\Big) = \frac{1}{1-\alpha}\log\Big(\sum_{xy\in R(XY)} \left(p_yq_{yx}\right)^\alpha\Big) = H_\alpha(XY). \end{split}\end{displaymath} (4.12)

The function $ \frac1{1-\alpha}\log(\cdot)$ is convex when $ \alpha>1$, so Jensen's inequality gives us (4.10a). When $ \alpha<1$ the function is concave and we obtain (4.10c). Finally, equality occurs, regardless of whether $ \alpha$ is smaller or larger than 1, if and only if $ p_y^{\alpha-1}\sum_{x\in R(X)} q_{yx}^\alpha = p_y^{\alpha-1}2^{(1-\alpha)H_\alpha(X\vert_{Y=y})}$ is constant for all $ y$, which is equivalent to $ H_\alpha(X\vert_{Y=y})-\log(p_y)$ being constant. $ \qedsymbol$
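The inequalities of Theorem 3 can be spot-checked numerically. The joint distribution below is an arbitrary example chosen for illustration, not one taken from the thesis.

```python
from math import log2

def renyi(ps, alpha):
    """H_alpha of (4.7); alpha = 1 gives the Shannon entropy."""
    ps = [p for p in ps if p > 0]
    if alpha == 1:
        return -sum(p * log2(p) for p in ps)
    return log2(sum(p**alpha for p in ps)) / (1 - alpha)

# Joint distribution of XY as {(x, y): probability}, with Y marginals p_y.
joint = {(0, 0): 0.4, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.3}
p_y = {0: 0.6, 1: 0.4}

def lhs(alpha):
    """E_y(H_alpha(X|Y=y)), the left-hand side of (4.10)."""
    return sum(py * renyi([joint[x, y] / py for x in (0, 1)], alpha)
               for y, py in p_y.items())

h1_y = renyi(list(p_y.values()), 1)
for alpha in (0.5, 1, 2):  # <=, =, >= respectively, per (4.10c), (4.10b), (4.10a)
    print(alpha, lhs(alpha), renyi(list(joint.values()), alpha) - h1_y)
```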

4.4.3 Spoiling knowledge

When learning something new about a random variable, the Shannon entropy of the variable will, on average, decrease or stay the same. But it is only true on average. Consider $ X=Q_{100}^{0.99}$ and a $ Y$ that is dependent on $ X$ such that $ Y=0$ if $ X=0$ and $ Y=1$ if not. Learning that $ Y$ is 1 will increase the Shannon entropy of $ X$ from 0.15 to 6.64, but learning that $ Y$ is 0 will decrease it to exactly 0. On average, the Shannon entropy will decrease to 0.0664, and this decrease on average holds for all $ X$ and $ Y$.

However, that is not true in general for other Rényi entropies. For example, consider $ T_1$ and $ T_2$ defined in (3.3a). $ H_\infty(T_1) = \log(3/2) \approx 0.585$. If $ T_2$ turns out to be 1, the entropy of $ T_1$ reduces to exactly 0; if not, it becomes exactly 1. On average, it will increase to $ 2/3 > \log(3/2)$.

Side information that increases entropy on average like this was first mentioned in [9] and is called spoiling knowledge.
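The $ T_1$, $ T_2$ example can be verified numerically. The joint distribution is an assumption here, since (3.3a) belongs to an earlier chapter: $ T_1T_2$ is taken to be uniform over the three pairs $ (0,0)$, $ (0,1)$, $ (1,0)$, which is consistent with all values quoted above.

```python
from math import log2

def min_entropy(ps):
    """H_infinity of (4.7c): depends only on the largest probability."""
    return -log2(max(ps))

# Assumed joint distribution of T1 T2: uniform over three of the four pairs.
joint = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3}

before = min_entropy([2/3, 1/3])  # marginal of T1: H = log(3/2)
# After learning T2: if T2 = 0 (probability 2/3), T1 is uniform on {0, 1};
# if T2 = 1 (probability 1/3), T1 is certainly 0.
average_after = (2/3) * min_entropy([0.5, 0.5]) + (1/3) * min_entropy([1.0])
print(before, average_after)  # the average goes up: ~0.585 -> ~0.667
```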

4.4.4 Entropy holism

The Concise Oxford English Dictionary [10] describes holism as the theory that certain wholes are greater than the sum of their parts. In a way, random variables normally behave in a holistic way. To specify both $ X$ and $ Y$, the whole probability vector for $ XY$ is needed, and the size of $ R(XY)$ is the size of $ R(X)$ multiplied by the size of $ R(Y)$. This is one reason why Shannon chose to primarily use a logarithmic scale in [7]. With a logarithmic scale the multiplications can be treated as sums and the whole is just the sum of the parts. Quoting Shannon: "One feels, for example, that two punched cards$^{4.3}$ should have twice the capacity of one for information storage, and two identical channels twice the capacity of one for transmitting information."

It should come as no surprise that an important property of Shannon entropy is that the total entropy of a system is never greater than the sum of the parts' entropies,

$\displaystyle H_1(XY) = H_1(X) + H_1(Y) - M(X, Y)$ (4.13)

where $ M(X, Y)$ is called the mutual information of $ X$ and $ Y$. It can be defined by this relation and can be shown to be non-negative. There is no Shannon entropy holism. The whole is equal to the sum of the parts minus whatever they share.
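Relation (4.13) can be illustrated with two small joint distributions, chosen here for illustration; the mutual information vanishes for independent variables and equals the full entropy when one variable is a copy of the other.

```python
from math import log2

def shannon(ps):
    """Shannon entropy (4.1) in bits."""
    return -sum(p * log2(p) for p in ps if p > 0)

def mutual_information(joint):
    """M(X, Y) = H1(X) + H1(Y) - H1(XY), from (4.13), for {(x, y): p}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return shannon(p_x.values()) + shannon(p_y.values()) - shannon(joint.values())

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(independent))  # independent parts share nothing
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(copy))         # a perfect copy shares everything
```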

On the other hand, what might come as a surprise is that this is not true in general for other Rényi entropies. There exist $ XY$ and $ \alpha$ such that

$\displaystyle H_\alpha(XY) > H_\alpha(X) + H_\alpha(Y).$ (4.14)

For example, consider $ T_1$ and $ T_2$ defined in (3.3a) again. $ H_\infty(T_1T_2) = \log(3)$ but $ H_\infty(T_1)+H_\infty(T_2) = \log(3/2) + \log(3/2) = \log(9/4) < \log(3)$. This is a case where the whole actually is greater than the sum of the parts.
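This case can also be checked numerically, again under the assumption (consistent with the quoted values, since (3.3a) is not reproduced in this chapter) that $ T_1T_2$ is uniform over the three pairs $ (0,0)$, $ (0,1)$, $ (1,0)$:

```python
from math import log2

def min_entropy(ps):
    """H_infinity of (4.7c)."""
    return -log2(max(ps))

# Assumed joint distribution of T1 T2: uniform over three of the four pairs.
joint = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3}
t1 = [2/3, 1/3]  # marginal of T1
t2 = [2/3, 1/3]  # marginal of T2

whole = min_entropy(joint.values())        # log(3)   ~ 1.585
parts = min_entropy(t1) + min_entropy(t2)  # log(9/4) ~ 1.170
print(whole > parts)  # -> True: the whole exceeds the sum of the parts
```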