4. Entropy

Entropy is an important concept in many fields, and one field where it is extensively used is QKG. This chapter gives a general overview of entropy and presents a generalization of the chain rule of entropy as needed in future chapters. Alternate explanations to most of the contents can be found in many other places, along with lots of other useful bounds and relations. The introductory chapters of [6] are highly recommended.

The function where is a probability occurs frequently in connection with entropies. is normally undefined but since is well-defined we extend the function to zero by continuity.

Another convention we will follow is to let mean the logarithm base 2. We can choose any base, but using base 2 consequently means that everything will be expressed in bits, and people tend to be familiar with bits.

Entropy is a measure of uncertainty regarding a discrete random variable. For many purposes, the Shannon entropy is the only measure needed. Shannon entropy is defined by

has the unit bits. A Python implementation is available as function

The Shannon entropy is a fundamental measure in information theory. It was introduced by Claude E. Shannon, now considered the father of information theory, in [7]. Much can be said about its properties, its uniqueness, and its relation with the thermodynamical entropy in physics, but we will only scratch a little bit on the surface here. One way of understanding it better is to rewrite the definition as

where is the probability ascribed to the value of that turns out to be correct. This makes a discrete random variable, but not necessarily with integer values.

Now it is clear that the Shannon entropy is the expectation value of where is the probability assigned to the measured value of the random variable. can be interpreted as the needed length, in bits, of a message communicating a measurement that had probability , which makes the Shannon entropy a measure of the expected message length needed to communicate the measured value of a random variable.

The Shannon entropy of a uniform random variable with possible values is

Shannon | (4.3) |

which means that we need bits

Without qualifiers, the word entropy and a non-subscripted normally refers only to Shannon entropy. However, when dealing with QKG, as well as most other parts of cryptography, this measure is not sufficient. The goal of QKG is to produce a key that is known to both Alice and Bob but to Eve is a random variable with high uncertainty. is a measure of the uncertainty of a value assigned probability and is therefore a measure of the security of that particular value of the key. Shannon entropy measures the expectation value of that security. The dangers of focusing on Shannon entropy alone is highlighted by this theorem:

Theorem 1
*There exists finite discrete random variables with
arbitrarily high Shannon entropy which the chance of guessing
at one try is arbitrarily close to .*

(4.4) |

which completes the proof.

Good security average is not good enough, and Shannon entropy alone is obviously not a sufficient measure of the quality of a key.

Another measure more closely related to the difficulty of guessing the value of a random variable was introduced by Massey in [8]. He did not name it but in [6] it is called guessing entropy. Note, however, that while most other entropies have the unit bits the guessing entropy is measured in units of number of guesses. Without loss of generality we can assume that the values of are sorted with decreasing probability, in which case the guessing entropy of is defined as

That is, the guessing entropy is simply the average number of guesses needed to guess the value of a random variable using the optimal strategy. The definition formalized to Python code is available as function

Theorem 2 *There exists finite discrete
random variables with arbitrarily high Guessing entropy which
the chance of guessing at one try is arbitrarily close to
.*

(4.6) |

which completes the proof.

Again we see that good security average is not good enough, and guessing entropy alone is not a sufficient measure of the quality of a key.

A useful generalization of Shannon entropy is the Rényi entropy, which maps an entropy measure pronounced the Rényi entropy of order to every real number . Rényi entropy is, just like Shannon entropy, measured in units of bits.

The equality in (4.7c) is easy to show using e.g. l'Hospital's

An important property of Rényi entropy is that for , for all , with equality if and only if is a uniform random variable. In other words, is a constant function of if is uniform and strictly decreasing if not. A full proof is given in [6] and follows quite naturally by writing in analogy with (4.2) as and using Jensen's inequality.

Some of these measures have quite natural interpretations. Rényi entropies with higher parameter depend more on the probabilities of the more probable values and less on the more improbable ones. is logarithm of the number of values of that have non-zero probabilities. Any two random variables with different probability distributions but the same number of values with non-zero probabilities will have the same Rényi entropy of order 0. is the Shannon entropy, in which the actual probabilities are quite important. is often called collision entropy, or just Rényi entropy, and is the negative logarithm of the likelihood of two independent random variables with the same probability distribution to have the same value. More probable values are much more likely to collide and are therefore more visible in the collision entropy than in the Shannon entropy. is called min-entropy and is a function of the highest probability only.

The shape of as a function of for eight different random variables is shown in figure 4.1.

4.4.1 Conditional Rényi entropy

It expresses the expected value of the entropy of after is disclosed. The notation clearly hides an implicit expectation value, but since Shannon entropy is an expectation value to begin with, that doesn't change much. Using equations (4.2) and (3.5) we can write (4.8) more explicitly as

(4.9) |

which is a single expectation value just like equation (4.2).

Rényi entropies of different order than 1 are not expectation values so things are not quite as simple. In fact, according to [6] there is not even an agreement about a standard definition of conditional Rényi entropy. However, the dominant definition seems to be the same as (4.8) with both :s replaced by . That definition will be used here but the expectation value will always be explicitly written out. We have seen that averaging security can be dangerous, and it is nice to not hide away something potentially dangerous in the notation.

With conditional Shannon entropy comes the chain rule of Shannon entropy, equation (4.10b) below. One way to define conditional Rényi entropy is to choose it so the same relation still holds when the Shannon entropies are replaced with Rényi entropies. However, the relation does not hold for the expectation value based definition chosen above so it is clearly a different conditional entropy. Fortunately, that doesn't stop us from generalizing the chain rule in other ways to something that is useful with general Rényi entropies:

Theorem
3 *Let and be arbitrary random
variables and
their concatenation.*

with equality in (4.10a) and (4.10c) if and only if is constant for all values of that have non-zero probabilities. Note that the rightmost entropies all are Shannon entropies.

with equality in (4.10a) and (4.10c) if and only if is constant for all values of that have non-zero probabilities. Note that the rightmost entropies all are Shannon entropies.

(4.11) |

When we have instead:

(4.12) |

The function is convex when so Jensen's inequality gives us (4.10a). When the function is concave and we obtain (4.10c). Finally, equality occurs, regardless of whether is smaller or larger than 1, if and only if is constant for all , which is equivalent to being constant.

When learning something new about a random variable, the Shannon entropy of the variable will decrease or stay equal on average. It is only true on average. Consider and a that is dependent on such that if and if not. Learning that is 1 will increase the Shannon entropy of from 0.15 to 6.64, but learning that is 0 will decrease it to exactly 0. On average, the Shannon entropy will decrease to 0.0664. On average, the Shannon entropy will always decrease for all and .

However, that is not true in general for other Rényi entropies. For example, consider and defined in (3.3a). . If turns out to be 1 the entropy of reduces to exactly 0, if not it becomes exactly 1. On average, it will increase to .

Side information that increases entropy on average like this was first mentioned in [9] and is called spoiling knowledge.

The Concise Oxford English Dictionary [10] describes holism as
the theory that certain wholes are greater
than the sum of their parts. In a way, random variables
normally behave in a holistic way. To specify both and the whole probability vector for
is needed and
the size of is
the size of
multiplied by the size of . This is one reason why Shannon chose to
primarily use a logarithmic scale in [7]. With a logarithmic scale
the multiplications can be treated as sums and the whole is just
the sum of the parts. Quoting Shannon: One
feels, for example, that two punched cards^{4.3} should
have twice the capacity of one for information storage, and two
identical channels twice the capacity of one for transmitting
information.

It should come as no surprise that an important property of Shannon entropy is that the total entropy of a system is never greater than the sum of the parts' entropies,

where is called the mutual information of and . It can be defined by this relation and can be shown to be non-negative. There is no Shannon entropy holism. The whole is equal to the sum of the parts minus whatever they share.

On the other hand, what might come as a surprise is that this is not true in general for other Rényi entropies. There exists and such that

For example, consider and defined in (3.3a) again. but . This is a case where the whole actually is greater than the sum of the parts.