Information Content

Type of physical science: Mathematics

Field of study: Signal processing

Information content is a quantitative measure of the information contained in a message. It is a physical quantity in information theory and communication engineering, very much like mass and energy in physics and space science, or money in economics.

Overview

Information is in many respects the most important commodity for the human race. In its broadest sense, "information" may mean different things to different people. It may be interpreted to include the contents of a book, a speech, the balance sheet of a company, the weather charts of a meteorological laboratory, a picture of a person or an object, chemical properties of a substance, messages occurring in any communication process (such as telegraphy), continuous waveforms in radio or television, signals in the form of sequences of binary digits in digital computers, impulses in a human being or an animal for discharge of its functions, commands in a servomechanism, or data in a data-processing device.

The mathematical laws of information theory were originally developed to deal with communication and the means of communication of messages. These laws, however, were found capable of handling a wide and varied class of problems where transfer of information is involved. Therefore, the principles of information theory and its central idea of information content were soon found useful in many areas of study.

In a communication, the messages originate from a source, pass over a channel, and are received at the destination. Most practical channels experience disturbances, called noise, that distort the messages. The common practice is to repeat the messages, often more than once, to overcome the effect of noise. Such repetition obviously reduces the rate or the speed of communication. One of the major questions that can be asked is whether it is possible to communicate over a noisy channel in such a way that the effect of the noise can be overcome without significant loss of the rate of communication. From the layperson's point of view, the answer to this question is no, but to information theorists the answer is yes. To achieve this state, a mathematical study of the communication system and adoption of suitable techniques are required. The main trick of the trade of reliable communication lies in proper encoding of the messages, mathematical study of various components of the communication system, and the role of a simple but seemingly magical quantity, "entropy," that measures information.

Some simple but basic aspects of the transmission of information through messages lead naturally to the idea of entropy. The contents of a message should not be known to the receiver before it is actually received. In other words, the sender and the receiver must have the possibility of communicating more than one message. Next, assuming that there are several messages, it should be possible to determine, by some experiment, the proportions (or probabilities) in which these messages are likely to occur in actual communication.

To understand the role of coding and its relation to the measurement of information, consider a simple communication system that consists of an electronic device that can send only two symbols, 0 and 1. Furthermore, let this device be used in two different situations: one on the war field by an army commander to communicate with a soldier, and the other in a hospital by a doctor to give instructions to a nurse attending a patient. If there is only one command from the army commander to the soldier, or only one instruction from the doctor to the nurse, the electronic device has no purpose. Whatever one has to say to the other is well known, and no communication needs to be established. The need to set up the communication, therefore, presupposes that more than one message (command or instruction) is possible. Let us suppose that each system handles only four messages in the form of codes whose meanings are understood by both sender and receiver. Let us also assume that the commander's messages have totally different meanings from those of the doctor, and that the proportions in which messages occur in two situations are also different.

The messages, whatever they are, must be expressed in words of two symbols, 0 and 1, because these are the only symbols that can be sent over the electronic device. This makes coding essential in such communication situations. The coding can be accomplished in several ways.

One can, for example, envision two sets of words or codes, C1 and C2, each having four words of symbols 0 and 1. Since many codes are possible, out of any two codes it should also be possible to decide which one of them, in some sense, is better than the other.

Let the four previously agreed-upon commands between the army commander and the soldier, along with the proportions or probability of these messages, and two possible different sets of code words in codes C1 and C2 be as follows:

Commands      Probability    Code C1    Code C2
Attack        1/4            11         01
Withdraw      1/4            00         1
Wait          1/4            10         001
Hide          1/4            01         0001

To develop the ideas in a simple fashion, let it be further supposed that the communication is free from disturbance. The messages sent are received unaltered. Thus, the message "hide and attack" shall be encoded as 0111, and "wait and hide" as 1001, according to code C1; and as 000101 and 0010001, respectively, according to code C2. In both cases, the soldier, knowing which code was used, is able to understand the messages.
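The encoding step can be made concrete with a short sketch in Python, added here purely as an illustration. The dictionaries below simply restate the code tables, and the assertions reproduce the encodings 0111, 1001, 000101, and 0010001 used above; because no code word in either table is a prefix of another, a greedy left-to-right match also decodes the strings unambiguously.

```python
# Illustrative transcription of the commander-soldier code tables.
C1 = {"attack": "11", "withdraw": "00", "wait": "10", "hide": "01"}
C2 = {"attack": "01", "withdraw": "1", "wait": "001", "hide": "0001"}

def encode(messages, code):
    """Concatenate the code words of a sequence of messages."""
    return "".join(code[m] for m in messages)

def decode(bits, code):
    """Greedy left-to-right decoding; unambiguous here because neither
    code contains a word that is a prefix of another word."""
    inverse = {word: message for message, word in code.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    return decoded

# The examples from the text:
assert encode(["hide", "attack"], C1) == "0111"
assert encode(["wait", "hide"], C1) == "1001"
assert encode(["hide", "attack"], C2) == "000101"
assert encode(["wait", "hide"], C2) == "0010001"
assert decode("0010001", C2) == ["wait", "hide"]
```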

In the case of the doctor, let the four messages from the doctor to the nurse with probabilities and the codes C1 and C2 (same as considered above) be as follows:

Doctor's Message           Probability    C1    C2
Let him rest               1/4            11    01
Administer the medicine    1/2            00    1
Bring for operation        1/8            10    001
Discharge                  1/8            01    0001

If code C1 is used, then 0001 sent and received by the nurse means that the doctor is instructing the nurse to discharge the patient after administering the medicine. One can similarly interpret 10001, if C2 is the code agreed upon for communication.

It is easy to see that either C1 or C2 can be used in the commander-soldier, as well as the doctor-nurse, scenario. The meanings of the code words are unimportant from the point of view of the communication engineer. The sender and the receiver understand them for their respective purposes. The question to address is whether both codes are equally good for both types of communication. If not, which one is better for which situation, and why?

Observe that code words in the two different codes have different lengths. In code C1, every word is of length 2; in code C2, the words are of lengths 2, 1, 3, and 4, respectively. Length equates with time in communication, so the use of a code with a smaller average length saves time. If one calculates the average lengths of both codes for both cases, one finds that, in the commander-soldier case, the average length of C1 is 2, whereas the average length of C2 is 2.5. In the doctor-nurse case, the average length of C1 is 2, whereas the average length of C2 is 1.875. Obviously, code C1 is better for the first case, and code C2 for the second. Different codes are thus better in different situations, and the difference arises only because the two situations have different probability assignments.
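The arithmetic behind these averages is simply the probability-weighted sum of the word lengths. A minimal Python check, using the probabilities and word lengths from the two tables above:

```python
def average_length(probabilities, word_lengths):
    """Average code length: sum of probability times word length."""
    return sum(p * l for p, l in zip(probabilities, word_lengths))

lengths_C1 = [2, 2, 2, 2]   # every word of C1 has length 2
lengths_C2 = [2, 1, 3, 4]   # lengths of the words 01, 1, 001, 0001

commander = [1/4, 1/4, 1/4, 1/4]
doctor    = [1/4, 1/2, 1/8, 1/8]

print(average_length(commander, lengths_C1))  # 2.0
print(average_length(commander, lengths_C2))  # 2.5
print(average_length(doctor, lengths_C1))     # 2.0
print(average_length(doctor, lengths_C2))     # 1.875
```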

The entropy of a probability assignment, denoted by H, will be introduced formally later. Its value for the above two cases is 2 in the commander-soldier case and 1.75 in the doctor-nurse case.

A major question now arises. Is it possible to find a code better than code C1 (which is better than C2) in the first case, and better than C2 in the second case? The answer is provided by comparing the average length, which depends on probabilities and word lengths, with entropy, which depends only on probabilities. In the first case, the average length of C1 equals H = 2, so a code better than C1 cannot be found. In the second case, the average length of C2 is 1.875, which is greater than H = 1.75, so a code better than C2 can be found. This results from the fact that the entropy H of a source of messages represents its "information content," and a code with just that much average length is the most economical in the noiseless case. It may be checked that a code with words 01, 1, 001, 000, whose average length is 1.75 (equal to the value of H), is best suited for the second case.
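These comparisons can be verified numerically. The sketch below computes H for both probability assignments, using the logarithmic formula given later in this article, and checks that the suggested code 01, 1, 001, 000 reaches the bound in the doctor-nurse case:

```python
import math

def entropy(probabilities):
    """Entropy in bits: -sum of p * log2(p) over the outcomes."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

commander = [1/4, 1/4, 1/4, 1/4]
doctor    = [1/4, 1/2, 1/8, 1/8]

print(entropy(commander))   # 2.0  -- equals the average length of C1, so C1 cannot be beaten
print(entropy(doctor))      # 1.75 -- less than C2's average length of 1.875, so C2 can be beaten

# Average length of the improved code with words 01, 1, 001, 000:
improved_lengths = [2, 1, 3, 3]
print(sum(p * l for p, l in zip(doctor, improved_lengths)))   # 1.75, equal to H
```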

If the message set of a source, along with its probability assignments, is denoted by X, it is clear from the above illustration that the efficiency problem of communication has to be formulated in terms of the entropy, H(X), of probability distribution of the messages. When the channel is not free from noise, the source entropy H(X) alone will not be the index of efficient communication. Another quantity, called "amount of information," depending on both output and input probabilities, will be required. This "amount of information," however, is determined in terms of different entropies--an idea that will be addressed later.

At this stage, one can proceed to formalize the understanding of entropy. Entropy has been defined as a measure of the information content of the source, but the measure has other interpretations: it is also called a measure of the uncertainty of a probability distribution. It is helpful to understand this interpretation in order to achieve a clear understanding of entropy in the general case. In fact, "uncertainty" and "information" in communication mean the same thing.

Before the receipt of a probable message, there is a certain amount of uncertainty. On the receipt of the message, that uncertainty is removed; the information provided removes the uncertainty that existed before. Consider the following examples. If the receiver already knows that the sun set at 5:30 p.m. on a particular day, then the sender telling him that the sun has set at 5:30 p.m. does not provide any information. It is easier--that is, less uncertain--to pick one shirt out of two than to pick one out of three when all the shirts are equally appealing. It is easier to pick one shirt out of two, if one is more appealing than the other, than it is to pick one out of two when both are equally appealing. If, in these three examples, the statements about the setting of the sun or the selection of shirts are replaced by a similar situation (such as the opening hours of a store or the selection of trousers), the information or uncertainty involved does not change.

By considering the foregoing examples as well as the preceding discussion of coding problems, one arrives at the following rather intuitive understanding for measuring uncertainty or information in the happening of probable events: First, the occurrence of a sure event has no uncertainty and therefore provides no information; that is, one receives information only when informed of the occurrence of an event whose occurrence was uncertain. Second, the measure of uncertainty has to depend on the probabilities only. Finally, the uncertainty is maximum in a given situation--that is, for a given number of choices--if every event has the same chance of happening.

Apart from meeting the aforementioned conditions, the measure of uncertainty has to satisfy a few other criteria, which arise by considering the laws of their combinations. For example, if there are uncertainties involved in two types of selections, such as choosing a shirt and choosing a pair of shoes, and if the choice of one does not influence the choice of the other (which mathematicians call "statistical independence"), the uncertainty of joint selection should be the sum of the uncertainties of the individual selections. Another important criterion regarding the measure of uncertainty connects the entropies of several outcomes, with and without grouping. These general considerations about the nature of the measure of uncertainty can be transformed into mathematical relations and conditions, and the expression for the measure of uncertainty can be uniquely determined.

The subject is, therefore, deductive in nature--that is, it is possible to start with a small number of definitions and criteria and derive everything from there. The entropy, measure of information or uncertainty, defined for a probability distribution, proves to be one involving a logarithmic function.

If there are only two outcomes x1, x2, with probabilities p and 1 - p, respectively, then its expression is given by H(p, 1 - p) = -p log p - (1 - p) log(1 - p), where the base of the logarithm is in general arbitrary, but when it is 2 the units are called bits (binary units).

What was done above for two outcomes can be done for any number of choices or outcomes. Thus, if the number of outcomes is n, they can be called x1, x2, . . . , xn, and the events can be put as the set X = {x1, x2, . . . , xn}, with the probabilities of occurrence of these outcomes as P = (p1, p2, . . . , pn), respectively. The entropy of the event set X is written H(X) or, better, because entropy depends only on the probabilities, H(p1, p2, . . . , pn) or H(P), with the expression H(p1, p2, . . . , pn) = -p1 log p1 - p2 log p2 - . . . - pn log pn, where p1 + p2 + . . . + pn = 1.

This expression of entropy can be seen as the average of the information values (-log p1, -log p2, and so on) of the individual events. In the language of mathematics, X is called a "random variable" and P = (p1, p2, . . . , pn) its probability distribution.
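A short numerical sketch shows that this formula behaves exactly as the intuitive requirements listed earlier demand: a sure event contributes nothing, a uniform distribution gives the maximum, and the uncertainties of independent selections add.

```python
import math

def entropy(probabilities):
    """H(p1, ..., pn) = -sum of p * log2(p), in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))          # 0.0   -- a sure event carries no information
print(entropy([0.5, 0.5]))     # 1.0   -- one fair binary choice is worth 1 bit
print(entropy([0.9, 0.1]))     # ~0.47 -- a biased choice is less uncertain
print(entropy([0.25] * 4))     # 2.0   -- the uniform distribution maximizes H for four outcomes

# Additivity for statistically independent selections (shirts and shoes):
shirts = [0.5, 0.5]
shoes  = [0.25, 0.25, 0.25, 0.25]
joint  = [p * q for p in shirts for q in shoes]
print(entropy(joint), entropy(shirts) + entropy(shoes))   # both equal 3.0
```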

The study of entropy is by itself an important and interesting topic, and its interpretations as a measure of information or of uncertainty are only two among several.

It has some other beautiful interpretations: as a measure of diversity, of randomness, and of bias. Because of these interpretations and some remarkable properties, it has found applications in several areas, including physics, statistics, accounting, economics, psychology, and ecology, apart from those in communication engineering. Some generalizations of this quantity have also been proposed.

A remarkable feature of Shannon's entropy is that several entropies can be defined equally simply for experiments when more than one random variable is involved in the study (such as heights and weights of people in the army). The case of two random variables is particularly important for communication engineering, because when the channel is not free from noise, the input and output of a channel are not the same. There are two random variables, one for input and the other for output, with their respective probability distributions.

Let there be two random variables, X and Y. In this case, one can define the individual entropies H(X) and H(Y), the joint entropy H(X,Y), and the conditional entropies H(X/Y) and H(Y/X). These quantities are related to each other. For example, H(X,Y) = H(X) + H(Y/X) and H(X,Y) = H(Y) + H(X/Y). In general, H(X,Y) ≤ H(X) + H(Y), with equality if and only if X and Y do not depend (statistically) on each other. Similar and further relations can be obtained among the entropies of a system involving more than two random variables. They have applications in economics and other social sciences.
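These relations can be illustrated with a small joint distribution (the numbers below are chosen arbitrarily, only for illustration). The conditional entropy H(Y/X) is computed from its definition as the average, over the values of X, of the entropy of the conditional distribution of Y:

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# An arbitrary joint distribution p(x, y) for two dependent variables.
p_xy = {("x1", "y1"): 0.4, ("x1", "y2"): 0.1,
        ("x2", "y1"): 0.1, ("x2", "y2"): 0.4}

# Marginal distributions of X and Y.
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

H_X, H_Y, H_XY = entropy(p_x.values()), entropy(p_y.values()), entropy(p_xy.values())

# H(Y/X) = sum over x of p(x) times the entropy of p(y|x).
H_Y_given_X = sum(
    px * entropy([p / px for (xx, y), p in p_xy.items() if xx == x])
    for x, px in p_x.items()
)

print(round(H_XY, 4), round(H_X + H_Y_given_X, 4))   # equal: H(X,Y) = H(X) + H(Y/X)
print(H_XY <= H_X + H_Y)                             # True; equality only for independent X and Y
```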

As pointed out earlier, the communication model has three main components: namely, a source, a channel, and a receiver. In addition, there is the noise that operates on the channel.

Efficient and reliable communication is developed by a prior statistical study of these components and noise. The mathematical model then evolves from definitions and relations of various information-related quantities.

The source has input symbols with specified probabilities of occurrence. Thus, there is a source entropy, which can be denoted by H(X). The receiver has output symbols (in general, all those of the input and some additional ones because of noise), with their probabilities. Thus, there is an output entropy denoted by H(Y). If there is no noise, these two will be the same. In general (that is, in the case of a noisy channel), it is possible to find the probability of receiving each output symbol for a specified input symbol. These probabilities are called "conditional output probabilities" for given inputs, and they characterize the channel mathematically. The above-mentioned entropies H(Y/X), H(X/Y), and H(X,Y) can be calculated from these studies. The quantity H(X/Y), the equivocation, measures the loss of input information because of noise in the channel.

A quantity called "amount of information" (or transinformation, I), which measures the information that the output symbols provide about the input symbols, can be calculated in terms of the entropies of the source and the receiver and the conditional entropies of the two, and is given by I(X,Y) = H(X) - H(X/Y). This quantity, when measured per symbol or per unit time, is interpreted as the "rate of information" of the channel. Its maximum, over all choices of the input probability distribution, determines the maximum information that can be physically passed over the channel per symbol or per unit time, and therefore measures the capacity of the channel.
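As an illustration of these definitions (the channel and its crossover probability below are assumed, not taken from the text), the following sketch computes I(X,Y) = H(X) - H(X/Y) for a binary symmetric channel and compares it with the well-known capacity of that channel, 1 - H(e, 1 - e), which is attained with equally likely inputs:

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def amount_of_information(p_x, channel):
    """I(X,Y) = H(X) - H(X/Y) for a discrete channel.

    p_x:     dictionary mapping each input symbol to its probability
    channel: dictionary mapping (input, output) to p(output | input)
    """
    p_xy = {(x, y): p_x[x] * channel[(x, y)] for (x, y) in channel}
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    # H(X/Y): average over outputs of the entropy of the posterior on inputs.
    H_X_given_Y = sum(
        py * entropy([p_xy[(x, y)] / py for x in p_x]) for y, py in p_y.items()
    )
    return entropy(p_x.values()) - H_X_given_Y

# Binary symmetric channel with an assumed crossover probability of 0.1.
e = 0.1
bsc = {(0, 0): 1 - e, (0, 1): e, (1, 0): e, (1, 1): 1 - e}

print(amount_of_information({0: 0.5, 1: 0.5}, bsc))   # ~0.531 bits per symbol
print(1 - entropy([e, 1 - e]))                        # ~0.531, the channel capacity
```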

The results of information theory, useful for reliable communication over noisy channels, are formulated in terms of channel capacity. As noted above, in the noiseless case it is possible to find a code whose average length is equal to (or very close to) the entropy of the source; similarly, for noisy channels, it is possible to develop codes whose probability of error can be as small as one likes, provided that the rate of communication does not exceed the channel capacity. This is the key result, called the fundamental theorem of information theory.

The notion of measures of information such as entropy (H) and amount of information (I), discussed above, can be extended to communication sources that produce messages written in a language, such as English, Russian, or Hindi. Languages have a certain statistical structure. In English, the letter e appears more often than the others, and the letter t is most often followed by the letter h. These features of a language can be studied and exploited for efficient communication. This type of statistical regularity is called ergodicity, and the sources are called ergodic sources. The study of such sources and the communication of the messages they produce over noisy channels also forms a part of information theory.

The field of information theory and its applications remains a very active area of research. Mathematicians, through their abstract axiomatic approach, have put the theory on very sound footing; there are several mathematical extensions and generalizations of the original formulations. For example, Shannon's theory is meant for an engineer designing a telephone system and not caring whether the system is going to be used for gossiping, for stock market quotations, or for war emergencies; in other words, no "value" is attached to the information.

Now, however, measures of information that consider a subjective value system, called "useful measures of uncertainty or amount of information," have begun to be developed.

The "additivity" of information for independent events is characteristic of mechanical devices and does not take into account the psychological or human factors, where seemingly independent events affect the receiver. There are now quite general value-free, as well as value-associated, nonadditive measures. The generalized grouping axiom has also led to new measures.

We have discussed situations in which the messages are discrete--that is, expressed in terms of a finite number of symbols. There are many sources, however--such as voice, television, and the measurements of physical characteristics--that cannot be regarded as discrete. An attempt to "discretize" them introduces an approximation, or "distortion," and leads to the problem of "data compression." For a fixed average distortion level, the rate of a source relative to that distortion is defined as the smallest number of binary digits per unit time required to encode the source output, subject to the restriction on the average distortion. This new measure plays the role of a generalized entropy, and there are results in the theory of data compression comparable to those of information theory.
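For the simplest case, a binary source under the bit-error (Hamming) measure of distortion, the rate-distortion function has the standard textbook form R(D) = H(p) - H(D); the small sketch below, an illustration under those assumptions rather than a formula stated in this article, shows how a tolerated distortion lowers the number of binary digits required per symbol.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate_distortion_binary(p, D):
    """Standard rate-distortion function of a binary source with symbol
    probability p under Hamming distortion: R(D) = H(p) - H(D) for
    0 <= D <= min(p, 1 - p), and 0 for larger D."""
    if D >= min(p, 1 - p):
        return 0.0
    return binary_entropy(p) - binary_entropy(D)

# A fair binary source needs 1 bit per symbol for exact reproduction,
# but noticeably less if, say, 5 percent of the bits may be reproduced wrongly.
print(rate_distortion_binary(0.5, 0.0))    # 1.0
print(rate_distortion_binary(0.5, 0.05))   # ~0.71
```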

Applications

The concept of measuring information content has led to ways of identifying different forms of information, as well as ways of storing and transforming them. The methods of the theory have found immense applications in telecommunications, computers, satellite communications, and communication from space probes. Modern efficient radio and television communication technologies use theoretical results of information theory.

In a radio broadcast channel, both the input and the output are continuously varying signals at the antenna terminals. Modems convert discrete sequences into continuous signals and vice versa. The modulator plays the role of encoder, converting the discrete sequences into signals suitable for the channel. The demodulator serves as decoder, making decisions on the received signal: if the signal has been changed by noise, it decides what was sent and produces a meaningful discrete sequence as its output. To avoid interference with other channels, the transmitted signal is restricted to lie within a given band of frequencies, say W hertz wide. The sampling theorem of information theory shows that 2W equally spaced samples per second represent such a band-limited continuous signal. The noise in the channel, mainly the result of thermal effects in electrical resistors, is "white Gaussian" with a specified average power, say N per hertz of frequency. The power limitation of the input signal, P, is always determinable. From these quantities the channel capacity C is determined. By properly designing the transmitter and receiver, it is possible to transmit over the channel up to C binary digits per second with as small a probability of error as desired. The theory also shows that transmitting more than C binary digits per second with an arbitrarily small probability of error is not possible.
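The capacity mentioned here is given by the Shannon-Hartley formula, C = W log2(1 + P/(NW)) bits per second, where P/(NW) is the signal-to-noise ratio over the band. A brief sketch with assumed, illustrative numbers (a 3,000-hertz channel and a signal-to-noise ratio of 1,000, roughly 30 decibels):

```python
import math

def gaussian_channel_capacity(W, P, N):
    """Shannon-Hartley capacity of a band-limited white Gaussian channel:
    C = W * log2(1 + P / (N * W)) bits per second, with bandwidth W in hertz,
    average signal power P, and noise power N per hertz of bandwidth."""
    return W * math.log2(1 + P / (N * W))

# Assumed illustrative numbers: W = 3000 Hz and a signal-to-noise ratio of 1000.
W = 3000.0
N = 1.0
P = 1000.0 * N * W   # chosen so that P / (N * W) = 1000

print(gaussian_channel_capacity(W, P, N))   # about 29,900 bits per second
```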

The foregoing procedure is applicable to satellite communication and communication from space probes. If the noise phenomenon changes, however, suitable modification in the channel study is made to determine its use for reliable communication.

The statistical concept of entropy in information theory has been used in understanding some phenomena of thermodynamics, gas diffusion, statistical physics, and quantum physics.

Another area where information theory has found very deep applications is hypothesis testing in statistics. The "maximum entropy principle" has been used to characterize discrete and continuous probability distributions. The model has found application in competitive marketing strategies.

Applications in coding and cryptography are numerous and form a vast subject matter.

Construction of efficient codes useful for space communication and in computers is a very significant area of application. Studies in artificial intelligence, sequential machines, and automata theory make heavy use of entropy and rate-distortion concepts.

Context

Information theory is a young scientific discipline. The basic concepts of the theory were developed by Claude Shannon (1948). He introduced entropy as a measure of information and studied its basic properties. Shannon also succeeded in formulating practically all the essential aspects of the communication model. He borrowed the name "entropy" from physics, at the suggestion of John von Neumann, as entropy in the sense of statistical physics is expressed by a similar formula developed by Ludwig Boltzmann in 1877. Information theory had the rare privilege of achieving the status of a mature and developed scientific theory in the first investigation devoted to it.

Information divergence, which measures relative information of two probability distributions of a random variable, was introduced by S. Kullback and R. A. Leibler in 1951.

This measure is particularly useful in hypothesis testing and other situations in statistical analysis, as well as economics. H. Theil considered measures associated with three distributions.

His information improvement, Shannon's entropy, and Kullback and Leibler's measures have been used for forecasting in economics. Major trends in characterization of the measures of information were initiated in the work of D. K. Fadiev (1956), T. W. Chaundy and J. B. McLeod (1960), A. Renyi (1965), Bhu Dev Sharma and D. P. Mittal (1975), and Sharma and I. J. Taneja (1977).

The results, which were originally directed toward the communication problems of the Bell Telephone system, were much more general and deeper in their formulations. The reason is that information is involved in many varied situations, in physical, social, environmental, and managerial sciences. The ideas found rather immediate applications in fields as diverse as linguistics, accounting, economic analysis and forecasting, psychology, computer science, and ecology.

In its abstract applications, mathematicians used the theory for classification of metric spaces. The most significant contributions of mathematics in the development of this highly applicable discipline may be seen in the theory of error-correcting codes.

Principal terms

BITS: binary units of information

CAPACITY OF A CHANNEL: the maximum information per unit time or symbol that can be transferred from the sender to the receiver over a communication channel

CHANNEL: the medium that connects or separates the sender from the receiver in space or time, such as atmosphere (in radio and television transmissions) or wires (in telephone communication)

DECODER: a device at the receiving end that handles received signals, decides what was sent from the other end, and transforms the decoded signal into the language of the message

ENCODER: a device at the source that changes source messages into words expressed in a code alphabet

ENTROPY: a mathematical measure that quantifies information content of a probabilistic experiment

NOISE: any disturbance that distorts or changes the signals sent

PROBABILITY OF AN EVENT: the likelihood that an event will occur when more than one outcome is possible; the likelihood that a tossed coin will come up "heads," for example, is one in two, or 1/2; the probability number is always between 0 and 1.

RATE OF INFORMATION: a measure of the amount of information that passes over a channel per unit time or symbol; its maximum value for a channel is the capacity of the channel

SIGNAL: the encoded form of a message that is transmitted over a channel

Bibliography

Brillouin, Leon. SCIENCE AND INFORMATION THEORY. New York: Academic Press, 1962. Examines the meanings and applications of information theory to many areas of physics, in particular to Maxwell's Demon, as well as to measurement problems and telecommunications.

Csiszar, Imre, and Janos Koerner. INFORMATION THEORY: CODING THEOREMS FOR DISCRETE MEMORYLESS SYSTEMS. New York: Academic Press, 1981. An updated account of coding theorems: those results that study bounds on probability of errors and capacity of the channels.

Gallager, Robert G. INFORMATION THEORY AND RELIABLE COMMUNICATION. New York: Wiley, 1968. A thorough and technically excellent book on the subject for communication engineers and mathematicians.

Shannon, Claude E., and Warren Weaver. A MATHEMATICAL THEORY OF COMMUNICATION. Reprint. Urbana: University of Illinois Press, 1963. A reproduction of the two technical papers of Shannon that established the theory, together with an expository and broad-based article by Weaver on the ideas, scope, and promise of information theory.

Welsh, Dominic. CODES AND CRYPTOGRAPHY. Oxford, England: Clarendon Press, 1988. A simple introductory mathematical account of basic concepts of entropy and noisy channels. Applications to error-correcting codes, computational complexity, and cryptography.

Essay by Bhu Dev Sharma