Overlapping Decision Trees

Stephen A. Della Pietra and Vincent J. Della Pietra
IBM Thomas J. Watson Research Center
P.O. Box 704
Yorktown Heights, NY 10598

Abstract

This paper describes a new type of stochastic model, called an overlapping decision tree model, that combines the advantages of two modeling techniques: decision tree methods and maximum entropy methods.

An ordinary decision tree model (DTM) estimates the conditional probability p(y|x) of a future y given a history x by first assigning x to a leaf of a tree and then predicting y according to a probability distribution that depends only on that leaf. This structure accounts for both the appeal and the weakness of such models. On the one hand, assigning each history to a single leaf makes possible an efficient algorithm for automatically constructing a DTM from training data. The incremental growing step splits an existing leaf into several children by selecting, from a pre-defined set of possible splits, the one that most improves the likelihood of the data. Since the choice of split for a leaf depends only on the training samples assigned to that leaf, the algorithm is efficient. On the other hand, assigning each history to a single leaf limits the power of a DTM. In particular, it can lead to data fragmentation: the amount of training data at a leaf is not always sufficient for a robust estimate of the probability distribution at the leaf.
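
The following is a minimal sketch, in Python, of the growing step described above. It is our illustration rather than the authors' implementation: the Leaf class, the candidate question set, and the likelihood-gain computation are assumptions introduced here only to make the idea concrete.

import math
from collections import Counter

class Leaf:
    def __init__(self, samples):
        self.samples = samples                      # (x, y) training pairs assigned to this leaf
        self.counts = Counter(y for _, y in samples)

    def prob(self, y):
        # Unsmoothed maximum likelihood estimate of p(y | leaf).
        return self.counts[y] / len(self.samples)

def log_likelihood(leaves):
    # Training log likelihood of the samples under their leaf distributions.
    return sum(math.log(leaf.prob(y)) for leaf in leaves for _, y in leaf.samples)

def best_split(leaf, questions):
    # One incremental growing step: try each candidate binary question on the
    # history and keep the split that most improves the training likelihood
    # of the samples assigned to this leaf.
    best = None
    for q in questions:
        yes = [s for s in leaf.samples if q(s[0])]
        no = [s for s in leaf.samples if not q(s[0])]
        if not yes or not no:
            continue
        gain = log_likelihood([Leaf(yes), Leaf(no)]) - log_likelihood([leaf])
        if best is None or gain > best[0]:
            best = (gain, q, Leaf(yes), Leaf(no))
    return best

Because each training sample is routed to exactly one leaf, only the samples at the chosen leaf need to be examined when evaluating its candidate splits, which is the source of the efficiency noted above.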

An overlapping decision tree model (ODTM) addresses these limitations by combining several decision trees into a single model. ODTM's are motivated by a maximum entropy view of DTM's. An ordinary DTM can be viewed as a maximum entropy model in which the features are determined by the leaves of the underlying tree. Specifically, each leaf, together with each possible future y', defines a binary-valued feature function f(x,y) that is 1 if and only if x is assigned to the leaf and y equals y'. An ODTM is a maximum entropy model that incorporates the features of more than one tree. In an ODTM, each tree contributes a vote to p(y|x); the vote of a particular tree is cast by the feature corresponding to y and the leaf to which that tree assigns x. This leads to a more powerful model with a less severe data fragmentation problem than that of an ordinary DTM. In addition, since the features of an ODTM are constructed from elementary features, the incremental growing algorithm can be generalized to ODTM's. Specifically, if the votes of the other trees are held fixed while the features of a given tree are refined, an ODTM can be grown at a cost per node equal to the cost per node of growing a single DTM.
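
A small sketch of the vote-combining computation, again our own illustration rather than the authors' code. Here each tree is represented only by a function mapping a history to its leaf, and weights[t] holds that tree's parameter for each (leaf, future) feature; how those parameters would be trained (e.g. by an iterative scaling procedure) is omitted, and all names are hypothetical.

import math

def odtm_prob(x, y, leaf_fns, weights, futures):
    # leaf_fns[t](x) is the leaf of tree t for history x;
    # weights[t][(leaf, future)] is that tree's vote for the future at that leaf.
    def score(candidate):
        return sum(w.get((fn(x), candidate), 0.0)
                   for fn, w in zip(leaf_fns, weights))
    z = sum(math.exp(score(candidate)) for candidate in futures)  # normalize over futures
    return math.exp(score(y)) / z

# Toy usage: two single-question "trees" over word histories.
leaf_fns = [lambda x: x[-1] == "the", lambda x: len(x) > 2]
weights = [{(True, "cat"): 0.7, (False, "cat"): -0.1},
           {(True, "cat"): 0.2}]
p = odtm_prob(("see", "the"), "cat", leaf_fns, weights, ["cat", "dog"])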

For DTM's or ODTM's to generalize to test data, the maximum likelihood parameter estimates must be smoothed. The standard DTM smoothing techniques, such as deleted interpolation, do not generalize easily to ODTM's. To construct smoothed maximum entropy models, we generalize the maximum entropy principle to what we call penalized maximum entropy. Whereas the ordinary maximum entropy principle constrains the expected value of a feature to equal its value in the training data, the penalized maximum entropy principle imposes a probability distribution on the expected feature value. This distribution is estimated from training data. The penalized maximum entropy principle can be used to construct DTM's and ODTM's that are smoothed without the use of held-out data.
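
One way to write the contrast in symbols (our own notation; the abstract does not specify the exact form of the penalty, so the Gaussian choice below is only an illustrative assumption):

\[
\begin{aligned}
\text{ordinary:}\quad & \max_{p}\; H(p) \quad\text{subject to}\quad E_p[f_i] = \tilde{E}[f_i] \ \text{ for all } i,\\
\text{penalized:}\quad & \max_{p}\; H(p) + \sum_i \log \pi_i\!\bigl(E_p[f_i]\bigr),
\end{aligned}
\]

where \tilde{E}[f_i] is the empirical expectation of feature f_i and \pi_i is a distribution over the expected value of f_i estimated from training data, for example \pi_i = N(\tilde{E}[f_i], \sigma_i^2). With a hard constraint the solution reproduces the training expectations exactly; with a penalty it may trade some constraint satisfaction for higher entropy, which is what smooths the model.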

In this paper we describe overlapping decision tree models in detail. We discuss the incremental growing algorithm, including feature selection, feature verification, and stopping. We describe an efficient algorithm for ranking a large number of possible features. We present the penalized maximum entropy principle and describe its use in constructing and smoothing DTM's and ODTM's. We present experimental results indicating that this smoothing is as effective as other standard DTM smoothing techniques. Finally, we present experimental results in which ODTM's outperform DTM's on test data in two ways: first, the best ODTM grown from a pre-defined set of elementary features has a greater log likelihood on test data than the best DTM grown from the same elementary features; second, the best ODTM of a prescribed size outperforms the best DTM of the same size.

