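Let $T$ denote a set of training examples, each of the form $(\mathbf{x}, y) = (x_1, x_2, \ldots, x_k, y)$, where $x_a \in \mathrm{vals}(a)$ is the value of the $a$-th attribute or feature of example $\mathbf{x}$ and $y$ is the corresponding class label. The information gain for an attribute $a$ is defined in terms of Shannon entropy $H(\cdot)$ as follows. For a value $v$ taken by attribute $a$, let $S_a(v) = \{\mathbf{x} \in T \mid x_a = v\}$ be the set of training inputs of $T$ for which attribute $a$ is equal to $v$. Then the information gain of $T$ for attribute $a$ is the difference between the a priori Shannon entropy $H(T)$ of the training set and the conditional entropy $H(T \mid a)$:

$$H(T \mid a) = \sum_{v \in \mathrm{vals}(a)} \frac{|S_a(v)|}{|T|} \, H(S_a(v)),$$

$$IG(T, a) = H(T) - H(T \mid a).$$

The information gain, which is the mutual information between the attribute and the class label, equals the total entropy $H(T)$ if each value of the attribute determines a unique classification for the result attribute. In this case, every subset entropy $H(S_a(v))$ subtracted from the total entropy is 0. In particular, the values $v \in \mathrm{vals}(a)$ define a partition of the training set $T$ into mutually exclusive and all-inclusive subsets, inducing a categorical probability distribution $P_a(v)$ on the values $v \in \mathrm{vals}(a)$ of attribute $a$. The distribution is given by $P_a(v) := |S_a(v)| / |T|$. In this representation, the information gain of $T$ given $a$ can be defined as the difference between the unconditional Shannon entropy of $T$ and the expected entropy of $T$ conditioned on $a$, where the expectation value is taken with respect to the induced distribution on the values of $a$:

$$IG(T, a) = H(T) - \sum_{v \in \mathrm{vals}(a)} P_a(v) \, H(S_a(v)) = H(T) - \mathbb{E}_{P_a}\left[H(S_a(v))\right] = H(T) - H(T \mid a).$$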
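The definition above can be illustrated with a minimal Python sketch. The function names `entropy` and `information_gain`, and the four-example toy data set, are assumptions made for illustration and are not taken from any particular library. Each training example is a tuple of attribute values, and the attribute is identified by its index in that tuple.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    # Each class with count c contributes p * log2(1 / p), where p = c / n.
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(examples, labels, a):
    """Entropy of the labels minus the expected entropy after splitting on attribute a."""
    # Partition the labels by the value that attribute a takes in each example.
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[a], []).append(y)
    n = len(labels)
    # Weight each subset's entropy by the fraction of examples it contains.
    conditional = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


# Hypothetical toy data: attribute 0 is "outlook", attribute 1 is "temperature".
examples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(examples, labels, 0)
gain_temperature = information_gain(examples, labels, 1)
```

In this toy set the first attribute separates the two classes completely, so its information gain equals the full entropy of the labels (1 bit), while the second attribute leaves both subsets evenly mixed and has an information gain of 0.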
Drawbacks
Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, suppose that one is building a decision tree for some data describing the customers of a business. Information gain is often used to decide which of the attributes are the most relevant, so that they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we have not seen before. To counter this problem, Ross Quinlan proposed to instead choose the attribute with the highest information gain ratio from among the attributes whose information gain is average or higher. This biases the decision tree against considering attributes with a large number of distinct values, while not giving an unfair advantage to attributes with very low information value, as the information value is higher than or equal to the information gain.
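The effect can be sketched in Python on a small, invented customer table. Everything here is an assumption made for illustration: the attribute names (`card_number`, `region`, `plan`), the eight rows, and the helper names. The gain ratio is taken to be the information gain divided by the entropy of the attribute's own value distribution (often called split information), which is the usual formulation but is not spelled out in the text above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def information_gain(rows, labels, a):
    """Entropy of the labels minus the expected entropy after splitting on a."""
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[a], []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())


def gain_ratio(rows, labels, a):
    """Information gain divided by the entropy of the attribute's own values."""
    split_info = entropy([row[a] for row in rows])
    if split_info == 0:  # attribute takes a single value, so it cannot split
        return 0.0
    return information_gain(rows, labels, a) / split_info


def select_attribute(rows, labels, attributes):
    """Highest gain ratio among attributes with at least average information gain."""
    gains = {a: information_gain(rows, labels, a) for a in attributes}
    average = sum(gains.values()) / len(gains)
    # Small tolerance so that rounding never empties the candidate list.
    candidates = [a for a in attributes if gains[a] >= average - 1e-12]
    return max(candidates, key=lambda a: gain_ratio(rows, labels, a))


# Invented data: every customer has a unique card number.
regions = ["north"] * 4 + ["south"] * 4
plans = ["a", "a", "b", "b", "a", "a", "b", "b"]
labels = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no"]
rows = [
    {"card_number": f"card-{i}", "region": r, "plan": p}
    for i, (r, p) in enumerate(zip(regions, plans))
]
attributes = ["card_number", "region", "plan"]

best_by_gain = max(attributes, key=lambda a: information_gain(rows, labels, a))
best_by_ratio = select_attribute(rows, labels, attributes)
```

On this table, plain information gain ranks `card_number` first, because each card number isolates a single customer. Its split information is large (3 bits for eight distinct values), so its gain ratio falls below that of `region`, and the selection rule picks `region` instead.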