Entropy and some properties

Suppose X is a random variable taking on a finite set of values x_i with probability p_i. Then the entropy H(X) of X is
   - sum_i p_i log p_i,
where the sum ranges only over the non-zero p_i.
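For concreteness, here is a minimal Python sketch of this definition (the function name entropy and the example distributions are only illustrative); it skips zero probabilities, as required above:

    import math

    def entropy(probs, base=2.0):
        # Shannon entropy - sum_i p_i log p_i, summing only over non-zero p_i.
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
    print(entropy([0.9, 0.1]))   # biased coin: about 0.469 bits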

This function is chosen according to a set of axioms for information and entropy first introduced by C.E. Shannon.

Several properties of entropy follow from Jensen's inequality. We give a proof for the case of finite sums:

Theorem (Jensen's inequality)
Suppose f is a continuous, strictly concave function on the interval I, the points x_i lie in I, and the weights a_i form a finite set of strictly positive numbers which sum to one. Then:

  sum_i a_i f(x_i) <= f( sum_i a_i x_i )
Equality occurs if and only if the x_i are equal.

Proof
Consider the points (x_i, f(x_i)) in R^2. Their convex combination with weights a_i is the point ( sum_i a_i x_i , sum_i a_i f(x_i) ), which lies in the convex hull of these points; since every a_i > 0, it lies in the interior of that hull whenever the x_i are not all equal. The points (x_i, f(x_i)) lie on the graph of f, and since f is strictly concave, over the interval I the graph of f lies strictly above the interior of this convex hull. The point sum_i a_i x_i lies in I, so the value f( sum_i a_i x_i ) is at least the second coordinate sum_i a_i f(x_i), which is the left hand side.

Considering the argument again, equality can hold only when the convex hull degenerates to a single point, that is, when all the x_i are identical; in that case both sides equal f(x_1), so equality does hold.
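A quick numerical check of the inequality (a sketch, taking f = log, which is strictly concave on (0, infinity), with arbitrarily chosen weights and points):

    import math

    a = [0.2, 0.3, 0.5]     # strictly positive weights summing to one
    x = [1.0, 4.0, 9.0]     # points in I = (0, infinity)

    lhs = sum(ai * math.log(xi) for ai, xi in zip(a, x))   # sum_i a_i f(x_i)
    rhs = math.log(sum(ai * xi for ai, xi in zip(a, x)))   # f( sum_i a_i x_i )
    print(lhs, rhs, lhs <= rhs)   # strict inequality here, since the x_i differ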

Theorem
Let X be a random variable with probability distribution p_1, ... , p_n, and p_i > 0. Then

   H(X) <= log n
with equality if and only if p_i = 1/n for all i.

Proof
Since the p_i sum to one and are strictly positive, we may apply Jensen's inequality with f = log, weights a_i = p_i, and points x_i = 1/p_i:

    H(X) = sum_i p_i log (1/p_i) <= log ( sum_i p_i (1/p_i) ) = log n

with equality if and only if the 1/p_i are all equal, that is, p_i = 1/n for all i.
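The bound can be checked numerically; in this sketch the entropy is recomputed inline with natural logarithms so that the bound is log n rather than log_2 n:

    import math

    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    n = 4
    uniform = [1.0 / n] * n
    skewed  = [0.7, 0.1, 0.1, 0.1]

    print(entropy(uniform), math.log(n))   # equal: both are log 4 (up to rounding)
    print(entropy(skewed) < math.log(n))   # True; the non-uniform case falls strictly below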

Theorem
H(X,Y) <= H(X) + H(Y) with equality if and only if X and Y are independent.

Proof
Suppose X is a random variable taking values x_i with probability p_i; Y is a random variable taking values y_j with probability q_j; and the joint probability of x_i and y_j is r_i,j.

Note that sum_j r_i,j = p_i and sum_i r_i,j = q_j. Again using Jensen's inequality, this time with weights r_i,j and points p_i q_j / r_i,j:

   H(X,Y) - H(X) - H(Y) = - sum_i,j r_i,j log r_i,j
                          + sum_i p_i log p_i + sum_j q_j log q_j
                        = sum_i,j r_i,j log 1/r_i,j
                          + sum_i,j r_i,j log p_i + sum_i,j r_i,j log q_j
                        = sum_i,j r_i,j log (( p_i q_j ) / r_i,j )
                        <= log ( sum_i,j p_i q_j )
                         = log 1 = 0
with equality only if
     p_i q_j / r_i,j = constant,
in which case, summing r_i,j ( p_i q_j / r_i,j ) over i and j shows the constant is 1, and therefore
     p_i q_j = r_i,j,
that is, X and Y are independent. Conversely, if X and Y are independent then r_i,j = p_i q_j and the expression above is identically zero, so equality holds.
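The following sketch checks subadditivity on a small made-up joint distribution r_i,j that is not the product of its marginals, so the inequality is strict:

    import math

    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    # Joint distribution r[i][j] of (X, Y); rows index x_i, columns index y_j.
    r = [[0.30, 0.10],
         [0.10, 0.50]]

    p = [sum(row) for row in r]                              # p_i = sum_j r_i,j
    q = [sum(r[i][j] for i in range(2)) for j in range(2)]   # q_j = sum_i r_i,j

    H_XY = entropy([rij for row in r for rij in row])
    print(H_XY <= entropy(p) + entropy(q))   # True (strict: r is not the product of its marginals)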

Conditional Entropy

We define the conditional entropy of X given Y as the average, over the values y of Y, of the entropy of the distribution of X given Y = y:
   H(X|Y) = - sum_x,y p(y) p(x|y) log p(x|y)
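Computed from a joint table, this definition reads as follows (a sketch; the helper name conditional_entropy is illustrative, and p(y) p(x|y) is just the joint probability p(x,y)):

    import math

    def conditional_entropy(r):
        # H(X|Y) = - sum_{x,y} p(y) p(x|y) log p(x|y), with p(y) p(x|y) = r[x][y].
        q = [sum(r[x][y] for x in range(len(r))) for y in range(len(r[0]))]  # p(y)
        h = 0.0
        for x in range(len(r)):
            for y in range(len(r[0])):
                if r[x][y] > 0:
                    h -= r[x][y] * math.log(r[x][y] / q[y])   # log p(x|y)
        return h

    r = [[0.30, 0.10],
         [0.10, 0.50]]
    print(conditional_entropy(r))   # about 0.495 nats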

Theorem

    H(X,Y) = H(Y) + H(X|Y)

Proof

    H(X,Y) = - sum_x,y p(x,y) log p(x,y)
           = - sum_x,y p(y)p(x|y) log ( p(y)p(x|y) )
           = - sum_x,y p(y)p(x|y) log p(x|y) - sum_x,y p(y)p(x|y) log p(y)
           = H(X|Y) + H(Y),

where the last step uses sum_x p(x|y) = 1, so the second sum reduces to - sum_y p(y) log p(y) = H(Y).
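The chain rule can be verified numerically on the same small joint table (a sketch, with the marginal p(y) written out by hand):

    import math

    r = [[0.30, 0.10],
         [0.10, 0.50]]
    q = [0.40, 0.60]    # p(y), the column sums of r

    H_XY  = -sum(rxy * math.log(rxy) for row in r for rxy in row)
    H_Y   = -sum(qy * math.log(qy) for qy in q)
    H_XgY = -sum(r[x][y] * math.log(r[x][y] / q[y])
                 for x in range(2) for y in range(2))

    print(abs(H_XY - (H_Y + H_XgY)) < 1e-12)   # True: H(X,Y) = H(Y) + H(X|Y)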

Theorem

 H(X|Y) <= H(X) 
with equality if and only if X and Y are independent.

Proof

  H(X|Y) = H(X,Y) - H(Y)
         <= H(X) + H(Y) - H(Y) = H(X)
with equality if and only if X and Y are independent, by the preceding subadditivity theorem.
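Finally, a sketch checking that conditioning does not increase entropy, again on the same joint table:

    import math

    r = [[0.30, 0.10],
         [0.10, 0.50]]
    p = [0.40, 0.60]    # p(x), the row sums of r
    q = [0.40, 0.60]    # p(y), the column sums of r

    H_X   = -sum(px * math.log(px) for px in p)
    H_XgY = -sum(r[x][y] * math.log(r[x][y] / q[y])
                 for x in range(2) for y in range(2))

    print(H_XgY <= H_X)   # True; strict here because X and Y are not independent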