- sum_i p_i log p_i,
where the sum is assumed to range over only non-zero p_i.
This function is chosen according to a set of axioms for information and entropy first introduced by C.E. Shannon.
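As an illustrative aside (not from the original text), the following short Python sketch computes this entropy for a small made-up distribution; the function name entropy, the choice of base-2 logarithms, and the example probabilities are assumptions made here for illustration.

import math

def entropy(p):
    # H(X) = - sum_i p_i log p_i, summing only over the non-zero p_i
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

print(entropy([0.5, 0.25, 0.25]))  # 1.5 (bits, since base-2 logarithms are used)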
Several properties of entropy follow from Jensen's inequality. We give a proof for the case of finite sums:
Theorem (Jensen's inequality)
Suppose f is a continuous, strictly concave function
on the interval I, and suppose we have finitely many points x_i in I
together with strictly positive weights a_i which sum to one. Then:
sum_i a_i f(x_i) <= f( sum_i a_i x_i )
Equality occurs if and only if all the x_i are equal.
Proof
Consider the points (x_i, f(x_i)) in R^2. The convex
combinations of these points form their convex hull,
and since every a_i > 0, the point ( sum_i a_i x_i , sum_i a_i f(x_i) )
lies in that hull; its second coordinate is the left-hand side of the inequality.
The points (x_i, f(x_i)) lie on the graph of f, which is strictly concave, so
over the interval I the graph of f lies above the hull and strictly above its
interior. The first coordinate, sum_i a_i x_i, is itself a convex combination
of the x_i and therefore lies in I; it is the point at which f is evaluated
on the right-hand side. Hence sum_i a_i f(x_i) <= f( sum_i a_i x_i ).
Considering the argument again, only if all the x_i are identical does the convex hull degenerate to a single point lying on the graph of f, which is exactly when equality occurs.
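As a quick sanity check (an addition, not part of the original proof), the Python sketch below evaluates both sides of Jensen's inequality for the strictly concave function log and a randomly chosen convex combination; the sample size, the interval, and the choice f = log are arbitrary assumptions made here.

import math
import random

x = [random.uniform(0.1, 10.0) for _ in range(5)]    # points x_i in the interval I = (0, infinity)
w = [random.uniform(0.01, 1.0) for _ in range(5)]
a = [w_i / sum(w) for w_i in w]                       # strictly positive weights a_i summing to one

lhs = sum(a_i * math.log(x_i) for a_i, x_i in zip(a, x))
rhs = math.log(sum(a_i * x_i for a_i, x_i in zip(a, x)))
print(lhs <= rhs + 1e-12)                             # True: sum_i a_i f(x_i) <= f( sum_i a_i x_i )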
Theorem
Let X be a random variable with probability distribution p_1, ... , p_n,
where each p_i > 0.
Then
H(X) <= log n
with equality if and only if p_i = 1/n for all i.
Proof
Since the p_i are strictly positive and sum to one, we can apply Jensen's
inequality with weights a_i = p_i, points x_i = 1/p_i, and the strictly concave function f = log:
H(X) = sum_i p_i log (1/p_i) <= log ( sum_i p_i (1/p_i) ) = log n
with equality if and only if all the 1/p_i are equal, that is, p_i = 1/n for all i.
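As an added numerical illustration (the distribution and the base-2 logarithms are assumptions made here, not part of the text), the theorem can be checked directly:

import math

def entropy(p):
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

p = [0.7, 0.1, 0.1, 0.1]
print(entropy(p) <= math.log2(len(p)))                   # True: H(X) <= log n
print(abs(entropy([0.25] * 4) - math.log2(4)) < 1e-12)   # equality for the uniform distribution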
Theorem
H(X,Y) <= H(X) + H(Y)
with equality if and only if X and Y are
independent.
Proof
Suppose X is a random variable taking values x_i with probability p_i;
Y is a random variable taking values y_j with probability q_j; and
the joint probability of x_i and y_j is r_i,j.
Noting that p_i = sum_j r_i,j and q_j = sum_i r_i,j, and again using Jensen's
inequality (with weights r_i,j and points p_i q_j / r_i,j):
H(X,Y) - H(X) - H(Y)
= - sum_i,j r_i,j log r_i,j + sum_i p_i log p_i + sum_j q_j log q_j
= sum_i,j r_i,j log (1/r_i,j) + sum_i,j r_i,j log p_i + sum_i,j r_i,j log q_j
= sum_i,j r_i,j log ( (p_i q_j) / r_i,j )
<= log ( sum_i,j r_i,j (p_i q_j / r_i,j) ) = log ( sum_i,j p_i q_j ) = log 1 = 0
with equality only if
p_i q_j / r_i,j = constant.
In that case, summing p_i q_j = constant * r_i,j over i and j shows the constant is 1, and therefore
p_i q_j = r_i,j,
that is, X and Y are independent.
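The subadditivity bound can also be checked numerically; in the Python sketch below the joint distribution r_i,j is a hypothetical example constructed here, not one taken from the text, and base-2 logarithms are again an assumption.

import math

def entropy(p):
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

# hypothetical joint distribution r[i][j] for (X, Y)
r = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
p = [sum(row) for row in r]                          # marginal p_i of X
q = [sum(row[j] for row in r) for j in range(2)]     # marginal q_j of Y

H_XY = entropy([r_ij for row in r for r_ij in row])
print(H_XY <= entropy(p) + entropy(q) + 1e-12)       # True: H(X,Y) <= H(X) + H(Y)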
The conditional entropy of X given Y is defined as
H(X|Y) = - sum_x,y p(y) p(x|y) log p(x|y)
Theorem
H(X,Y) = H(Y) + H(X|Y)
Proof
H(X,Y)
= - sum_x,y p(x,y) log p(x,y)
= - sum_x,y p(y) p(x|y) log ( p(y) p(x|y) )
= - sum_x,y p(y) p(x|y) log p(x|y) - sum_x,y p(y) p(x|y) log p(y)
= H(X|Y) + H(Y)
where the first sum is H(X|Y) by definition, and the second reduces to - sum_y p(y) log p(y) = H(Y) because sum_x p(x|y) = 1.
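The chain rule can be verified on the same kind of hypothetical joint distribution (an example constructed here, not taken from the text), computing H(X|Y) directly from its definition above via p(x|y) = r_i,j / q_j:

import math

def entropy(p):
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

# hypothetical joint distribution r[i][j] for (X, Y)
r = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
q = [sum(row[j] for row in r) for j in range(2)]     # marginal p(y)

H_X_given_Y = -sum(r[i][j] * math.log2(r[i][j] / q[j])
                   for i in range(3) for j in range(2) if r[i][j] > 0)
H_XY = entropy([r_ij for row in r for r_ij in row])
print(abs(H_XY - (entropy(q) + H_X_given_Y)) < 1e-12)   # True: H(X,Y) = H(Y) + H(X|Y)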
Theorem
H(X|Y) <= H(X)
with equality if and only if X and Y are independent.
Proof
Using the two preceding theorems,
H(X|Y) = H(X,Y) - H(Y) <= H(X) + H(Y) - H(Y) = H(X)
with equality if and only if X and Y are independent.
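Finally, as one more added check with the same hypothetical joint distribution (again an assumption made here for illustration), conditioning does not increase entropy:

import math

def entropy(p):
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

# hypothetical joint distribution r[i][j] for (X, Y), as in the earlier sketches
r = [[0.30, 0.10],
     [0.05, 0.25],
     [0.10, 0.20]]
p = [sum(row) for row in r]                          # p(x)
q = [sum(row[j] for row in r) for j in range(2)]     # p(y)

H_X_given_Y = -sum(r[i][j] * math.log2(r[i][j] / q[j])
                   for i in range(3) for j in range(2) if r[i][j] > 0)
print(H_X_given_Y <= entropy(p) + 1e-12)             # True: H(X|Y) <= H(X)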