Entropy: Motivation and introduction

Burton Rosenberg
17 February 2003

A coin puzzle

There is this puzzle: you are given 9 coins, all of the same weight except one, which is heavier. You have a balance to weigh the coins. Using two weighings, find the heavy coin.

Solution: arrange the coins into three groups of three. Compare the weights of two of the three groups. Either one group is heavier or they are equal. This identifies which among the three groups contains the heavy coin. Repeat this idea on the three coins of the heavy group.

The solution can be visualized as a tree. The root has three children corresponding to the three possible results of the first weighing: left group heavy, right group heavy, groups equal. Each child in turn has three children corresponding to the three results of the second weighing: left coin heavy, right coin heavy, coins equal. The nine leaf nodes correspond exactly to the nine possible solutions.
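
The strategy can be written out directly. Below is a small sketch in Python; the function names and the simulated balance are my own choices, not part of the puzzle. The helper weigh compares the total weights of two groups of coin indices and reports which side, if either, is heavier.

    # A sketch of the two-weighing strategy for 9 coins.
    # weigh(a, b) compares the total weights of two groups of coin indices
    # and returns -1, 0 or +1; here it is simulated from a known weight list.

    def find_heavy(weights):
        def weigh(a, b):
            wa, wb = sum(weights[i] for i in a), sum(weights[i] for i in b)
            return (wa > wb) - (wa < wb)

        groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
        r = weigh(groups[0], groups[1])               # first weighing
        group = groups[0] if r > 0 else groups[1] if r < 0 else groups[2]
        r = weigh([group[0]], [group[1]])             # second weighing
        return group[0] if r > 0 else group[1] if r < 0 else group[2]

    # Every position of the heavy coin is found with exactly two weighings.
    for heavy in range(9):
        ws = [1] * 9
        ws[heavy] = 2
        assert find_heavy(ws) == heavy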

It took two experiments, each of which discriminated between three possibilities, to distinguish one of nine possible states.

      2 = log_3 9
Here we begin to see the relationship between the logarithm, the number of possible states, and the information contained, as expressed by the number of queries required to uniquely identify a state.

Note: Here is another coin puzzle for you to try. You have 12 coins, all the same weight but one. That one coin might be heavier or lighter than the others. Identify the coin using three weighings. This is a harder problem than the 9 coin problem.

Yes/No questions

The minimal number of answers to a query is two. If a query has only one possible answer, nothing can be learned by asking the query. The answer is known in advance. The trick of the above puzzle is to realize that a balance gives three possible results, not two. It's what makes that puzzle interesting.

Theoretically, however, we should reduce the possible query results to two, and the base of the log to 2.

I pick a number from 1 to 8. Since 3 = log_2 8, three yes/no queries should suffice to guess the number. The queries might be the following; a short code sketch of the strategy appears after the list.

  1. Is the number between 1 and 4?
  2. If yes, is it between 1 and 2?
  3. If yes, is it 1?
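
In code, this questioning strategy is just a binary search. The following Python sketch (the function name and the way the secret is supplied are my own choices) asks at each step whether the number lies in the lower half of the remaining range; each yes/no answer halves the set of candidates, so log_2 8 = 3 questions always suffice.

    def guess(secret):
        # Guess a secret number between 1 and 8 with yes/no questions.
        lo, hi = 1, 8
        questions = 0
        while lo < hi:
            mid = (lo + hi) // 2
            questions += 1          # ask: "Is the number between lo and mid?"
            if secret <= mid:       # the yes/no answer
                hi = mid
            else:
                lo = mid + 1
        return lo, questions

    # Every secret from 1 to 8 is identified with exactly 3 questions.
    assert all(guess(n) == (n, 3) for n in range(1, 9))
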
Consider the roll of a die. Since 2.58 = log_2 6, it should take about two and a half questions to determine the result of a roll of a die. What could half a question mean?

If the die is rolled 5 times, log_2 6^5 = 12.92, or about 13. Thirteen yes/no questions should be sufficient to identify the outcome of a roll of 5 dice. Since 2^13 = 8192 and 6^5 = 7776, this is certainly possible: place the outcomes in correspondence with the numbers 0 through 7775, express the number in binary, and then ask for the values of the 13 bits in the binary representation.
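
The correspondence can be made concrete. The Python sketch below (the function names are my own) packs five rolls into a single number between 0 and 7775 by treating them as base-6 digits; the 13 binary digits of that number are the answers to the 13 yes/no questions.

    def encode(rolls):                 # rolls: a list of five values 1..6
        n = 0
        for r in rolls:
            n = 6 * n + (r - 1)        # treat the rolls as base-6 digits
        return format(n, '013b')       # the 13 yes/no answers, as bits

    def decode(bits):
        n = int(bits, 2)
        rolls = []
        for _ in range(5):
            rolls.append(n % 6 + 1)
            n //= 6
        return rolls[::-1]

    assert decode(encode([3, 1, 4, 6, 2])) == [3, 1, 4, 6, 2]
    assert encode([6, 6, 6, 6, 6]) == format(7775, '013b')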

The effect of a priori knowledge

Suppose we have a strange 5-headed die. You have not seen such a die, but I tell you that it is also strange in that 1/2 of the time the die rolls 1, and the remaining 1/2 of the time it is equally likely to roll a 2, 3, 4 or 5. What is the average number of questions needed to identify the roll of the die? To minimize the number of questions it seems reasonable to ask first: was the result a 1? Half the time this one question suffices. The other half of the time, two additional questions are needed to discriminate among the 2, 3, 4 and 5. On average, 2 questions are required:

    (1/2) 1 + (1/2) 3 = 2
So not only is the number of possible outcomes significant to the information content of an occurrence; the relative likelihood of each outcome also makes an important contribution.
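
As a quick numerical check, the Python sketch below hard-codes this questioning strategy: one question when the roll is a 1, three questions otherwise, averaged by probability.

    probabilities = {1: 0.5, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125}
    questions     = {1: 1,   2: 3,     3: 3,     4: 3,     5: 3}
    # Face 1: "was it a 1?" and we are done.  Otherwise two further
    # questions split {2, 3, 4, 5}, for 1 + 2 = 3 questions in total.

    average = sum(probabilities[f] * questions[f] for f in probabilities)
    assert average == 2.0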

The mathematical definition of entropy

Definition: Probability Distribution
Let X be a discrete probability distribution. By that we mean a finite set of events, with a probability associated with each event. Write,

    X = { p_1, p_2, ..., p_n }
where each p_i is between 0 and 1 and the p_i's sum to 1.

Definition: Entropy
The Entropy of X is defined:

    H(X) = - SUM_i p_i log_2 p_i
Note that the sum omits any terms with p_i = 0. Entropy is measured in bits, because of the base 2 log. Other bases would measure entropy in other units, proportionally related to the bit by the usual change of base formula:
    log_a c = log_b c / log_b a
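
The definition translates directly into a few lines of Python. This is only a sketch, and the function name entropy is my own choice; terms with p_i = 0 are skipped, as the definition requires.

    from math import log2

    def entropy(ps):
        # H(X) = - sum_i p_i log_2 p_i, omitting terms with p_i = 0.
        return -sum(p * log2(p) for p in ps if p > 0)

    print(entropy([0.5, 0.5]))      # a fair coin: 1 bit
    print(entropy([1/6] * 6))       # a fair die: log_2 6, about 2.585 bits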

Returning to our 5-headed die, its distribution is,

   X = { 1/2, 1/8, 1/8, 1/8, 1/8 }
and the entropy is,
   H(X) = - SUM_i p_i log_2 p_i = - (1/2) log_2 (1/2) - 4 (1/8) log_2 (1/8)
        = 1/2 + 3/2 = 2 bits
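
The hand computation can be checked directly, either with the entropy function sketched above or term by term:

    from math import log2

    H = -(0.5 * log2(0.5) + 4 * 0.125 * log2(0.125))
    assert H == 2.0                 # matches the hand computation: 2 bits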

A technical lemma
Let X = { p_1, ..., p_n } and Y = { q_1, ..., q_n } be discrete distributions each over n events. Define:

   H*(X,Y) = - SUM_i p_i log_2 q_i
For fixed X, H*(X,Y) is minimized exactly when q_i = p_i for all i. Since H*(X,X) = H(X), this says that H*(X,Y) ≥ H(X) for every Y, with equality only when Y = X.

It is easy to see that H(X) ≥ 0. Using the technical lemma with Y the uniform distribution over the n = #X events, H(X) = H*(X,X) ≤ H*(X,Y) = - SUM_i p_i log_2 (1/n) = log_2 #X, with equality exactly when X is uniformly distributed.
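
Both facts can be illustrated numerically. The Python sketch below (the function names and the random test are my own) checks, for randomly generated distributions, that H*(X,Y) is never below H(X) and that H(X) never exceeds log_2 n, with equality at the uniform distribution.

    from math import log2
    from random import random

    def entropy_star(ps, qs):
        # H*(X,Y) = - sum_i p_i log_2 q_i, omitting terms with p_i = 0.
        return -sum(p * log2(q) for p, q in zip(ps, qs) if p > 0)

    def entropy(ps):
        return entropy_star(ps, ps)           # H(X) = H*(X,X)

    def random_distribution(n):
        ws = [random() for _ in range(n)]
        total = sum(ws)
        return [w / total for w in ws]

    n = 5
    for _ in range(1000):
        X = random_distribution(n)
        Y = random_distribution(n)
        assert entropy_star(X, Y) >= entropy(X) - 1e-12   # the lemma
        assert entropy(X) <= log2(n) + 1e-12              # the upper bound

    uniform = [1.0 / n] * n
    print(entropy(uniform), log2(n))          # equal, up to rounding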