IBM Thomas J. Watson Research Center
PO Box 704
Yorktown Heights, NY 10598

An ordinary decision tree model (DTM) estimates the conditional
probability **p(y|x)** of a future **y**
given a history **x** by first assigning
**x** to a leaf of a tree and then predicting
**y** according to a probability distribution depending
only on the leaf. This structure accounts for both the appeal and
weakness of such models. On the one hand, assigning each history to a
single leaf makes possible an efficient algorithm for automatically
constructing a DTM from training data. The incremental growing step
splits an existing leaf into several children by selecting, from a
pre-defined set of possible splits, the one that most improves the
likelihood of the data. Since the choice of split for a leaf
depends only on those samples in the data that are assigned to the
leaf, the algorithm is efficient. On the other hand, assigning each
history to a single leaf limits the power of a DTM. In particular, it
can lead to problems of data fragmentation since the amount of
training data at a leaf is not always sufficient for robust estimates
of the probability distribution at the leaf.
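
To make the growing step concrete, here is a minimal Python sketch of selecting the best split at a leaf. It is a sketch under assumed conventions: splits are binary, histories are tuples, and the names (`log_likelihood`, `best_split`) and data layout are hypothetical illustrations, not the implementation described in the paper:

```python
from collections import Counter
from math import log

def log_likelihood(futures):
    """Log likelihood of a leaf's futures under their empirical distribution."""
    counts, n = Counter(futures), len(futures)
    return sum(c * log(c / n) for c in counts.values())

def best_split(samples, candidate_splits):
    """One incremental growing step: pick, from a pre-defined set of
    candidate questions, the binary split of this leaf that most improves
    the training-data likelihood.  Only the samples assigned to the leaf
    are examined, which is what makes the step cheap."""
    base = log_likelihood([y for _, y in samples])
    best, best_gain = None, 0.0
    for split in candidate_splits:
        yes = [y for x, y in samples if split(x)]
        no = [y for x, y in samples if not split(x)]
        if not yes or not no:          # degenerate split: no information
            continue
        gain = log_likelihood(yes) + log_likelihood(no) - base
        if gain > best_gain:
            best, best_gain = split, gain
    return best, best_gain

# Toy usage: a history is a (previous word,) tuple; the future is a word.
samples = [(("the",), "cat"), (("the",), "dog"), (("a",), "dog"), (("a",), "dog")]
questions = [lambda x: x[0] == "the"]
print(best_split(samples, questions))
```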

An overlapping decision tree model (ODTM) addresses these limitations
by combining several decision trees into a single model. ODTM's are
motivated by a maximum entropy view of DTM's. An ordinary DTM can be
viewed as a maximum entropy model in which the features are determined
by the leaves of the underlying tree. Specifically, each leaf and each
possible future **y'** defines a binary-valued feature
function **f(x,y)** that is 1 if and only if
**x** is assigned to the leaf and **y**
equals **y'**. An ODTM is a maximum entropy model that
incorporates the features of more than one tree. In an ODTM, each tree
contributes a vote to **p(y|x)**; the vote of a
particular tree is cast by the feature corresponding to
**y** and the leaf of **x** for that
tree. This leads to a more powerful model with a less severe data
fragmentation problem than that of an ordinary DTM. In addition, since
the features of an ODTM are constructed from elementary features, the
incremental growing algorithm can be generalized to ODTM's.
Specifically, if the votes of the other trees are held fixed while the
features of a given tree are refined, an ODTM can be grown at the same
cost per node as a single DTM.
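
The voting scheme just described lends itself to a short sketch. In the toy encoding below, a tree is a function from a history to a leaf id, and the maximum entropy weights live in a dictionary keyed by (tree, leaf, future); these representations and the name `odtm_prob` are hypothetical, chosen for illustration rather than taken from the paper:

```python
from math import exp

def odtm_prob(x, y, trees, weights, futures):
    """p(y|x) under an ODTM.  Each tree casts one vote per candidate
    future: the weight of the single feature indexed by (that tree,
    the leaf it assigns x to, the candidate future)."""
    def score(y_prime):
        return sum(weights.get((t, tree(x), y_prime), 0.0)
                   for t, tree in enumerate(trees))
    z = sum(exp(score(y2)) for y2 in futures)   # normalizing constant
    return exp(score(y)) / z

# Two toy "trees": one asks about the first token of the history,
# the other about the last.
trees = [lambda x: x[0], lambda x: x[-1]]
weights = {(0, "the", "cat"): 1.2, (1, "big", "dog"): 0.7}
print(odtm_prob(("the", "big"), "cat", trees, weights, ["cat", "dog"]))
```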

For DTM's or ODTM's to generalize to test data, maximum likelihood parameter estimates must be smoothed. The standard DTM smoothing techniques, such as deleted interpolation, do not generalize easily to ODTM's. To construct smoothed maximum entropy models, we generalize the maximum entropy principle to what we call penalized maximum entropy. Whereas the ordinary maximum entropy principle constrains the expected value of a feature to equal its value in the training data, the penalized maximum entropy principle imposes a probability distribution on the expected feature value. This distribution is estimated from the training data. The penalized maximum entropy principle can be used to construct DTM's and ODTM's that are smoothed without the use of held-out data.
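
The abstract leaves the form of the imposed distribution open. A common concrete choice, assumed in the sketch below, is an independent Gaussian on each expected feature value, which turns each hard constraint into a quadratic penalty; the function name and array layout are likewise hypothetical:

```python
import numpy as np

def penalized_objective(lam, F, y_idx, e_bar, sigma2):
    """Penalized maximum entropy objective for p(y|x) = exp(f(x,y)·lam)/Z(x).
    F[i, y, j] is feature j evaluated on (history i, future y); e_bar[j] is
    the feature's training-data value and sigma2[j] the variance of the
    assumed Gaussian on its expected value, both estimated from training data."""
    scores = F @ lam                               # (n, num_futures)
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the softmax
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)              # p[i, y] = model p(y | x_i)
    n = F.shape[0]
    loglik = np.log(p[np.arange(n), y_idx]).sum()
    e_model = np.einsum('iy,iyj->j', p, F) / n     # model's expected features
    penalty = ((e_model - e_bar) ** 2 / (2.0 * sigma2)).sum()
    return loglik - penalty                        # maximize this over lam
```

Under this Gaussian assumption, shrinking the variances to zero recovers the hard constraints of ordinary maximum entropy, which is the sense in which the penalized principle generalizes it.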

In this paper we describe overlapping decision tree models in detail. We discuss the incremental growing algorithm, including feature selection, feature verification, and stopping. We describe an efficient algorithm for ranking a large number of possible features. We present the penalized maximum entropy principle and describe its use in constructing and smoothing DTM's and ODTM's. We present experimental results indicating that this smoothing is as good as other standard DTM smoothing techniques. Finally, we present experimental results in which ODTM's outperform DTM's on test data in two ways: first, the best ODTM grown from a pre-defined set of elementary features has a greater log likelihood on test data than the best DTM grown from the same elementary features; second, the best ODTM of a prescribed size outperforms the best DTM of the same size.
