Adaptive Systems Theory Section

Defence Research Agency, St Andrews Rd, Malvern, Worcestershire WR14 3PS, United Kingdom

Conventional density models use hidden variables (e.g. hyperparameters) to represent quantities that are not directly observable. This approach has been used successfully to model the probability density of data with complicated joint statistics, especially in image-processing applications. In physics applications the use of hidden variables is mandatory, because they are real physical quantities that happen not to be observed directly. However, in image-processing applications their use is usually empirical, because they serve only to augment the parameter space of what is, after all, an empirical model. There is therefore room for alternative approaches in image processing (and other) applications.

In this paper it will be assumed that the data pixels have been passed through a multi-layer network of non-linear processors (e.g. a neural network), and that the marginal PDFs of the outputs of some pairs of non-linear processors have been measured. The goal is to construct an optimal model of the joint density of the data pixels, which requires that the following two problems be addressed:

- What is the optimal way to combine the marginal PDFs to produce the required density model?
- What is the optimal choice of non-linear processors to use in the multi-layer network in the first place?

- The maximum entropy principle will be used to construct a model of the joint density of the data pixels that is consistent with the supplied marginal PDFs. In general this maximum entropy problem cannot be solved in closed form, but in the case of a tree-like network topology it yields a closed-form expression that allows the density to be computed directly (i.e. there are no hidden variables that need to be summed over).
- There is no uniquely best choice for the non-linear processors that should be used in the network, but the consequences of using an objective function that minimises the relative entropy between the real-world density and the density model (which is equivalent to maximising the data likelihood with respect to the choice of non-linear processors) are derived. This objective function has a pleasing interpretation as the sum of the mutual informations between various outputs of processors in different parts of the network. During this optimisation, control signals flow backwards between the network layers to co-ordinate their optimisation; this effect is called self-supervision.
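The closed-form tree case can be illustrated concretely. The sketch below is not from the paper: the 3-node chain and the particular marginals are made up for illustration. It shows how consistent pairwise marginal PDFs on a tree combine into a normalised joint density, with no hidden variables to sum over.

```python
# Sketch of the closed-form maximum-entropy density on a tree:
# a hypothetical 3-node chain x0 - x1 - x2 over binary variables.
# Given consistent pairwise marginals p01 and p12, the joint is
#   p(x0, x1, x2) = p01(x0, x1) * p12(x1, x2) / p1(x1)
import numpy as np

# Hypothetical pairwise marginal p(x0, x1) (rows: x0, cols: x1).
p01 = np.array([[0.3, 0.1],
                [0.2, 0.4]])
p1 = p01.sum(axis=0)                  # p(x1), the shared node of the chain

# Build p(x1, x2) consistent with p1 via a conditional p(x2 | x1).
cond = np.array([[0.7, 0.3],
                 [0.2, 0.8]])         # rows sum to 1
p12 = p1[:, None] * cond              # p(x1, x2)

def joint(x0, x1, x2):
    """Closed-form tree density: edge marginals divided by shared-node marginal."""
    return p01[x0, x1] * p12[x1, x2] / p1[x1]

# The density is properly normalised because the marginals are consistent.
total = sum(joint(a, b, c)
            for a in range(2) for b in range(2) for c in range(2))
print(round(total, 10))               # 1.0
```

Normalisation follows automatically: summing the product of edge marginals over the leaves leaves a factor p1(x1)² / p1(x1), which sums to one over the shared node.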

In conclusion, it is possible to construct density models that capture complicated joint statistics without introducing hidden variables that must subsequently be summed over. The most general model of this class can be represented as a multi-layer network of non-linear processors whose connection topology is the union of a number of overlapping tree-like subnetworks.
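The interpretation of the objective function as a sum of mutual informations can be made concrete by estimating one such term from a measured joint histogram of two processor outputs. The sketch below is illustrative only: the tanh processors, the bin count, and the simulated data are assumptions, not details from the paper.

```python
# Sketch: estimating the mutual information between two processor
# outputs from their joint histogram (all names are illustrative).
import numpy as np

rng = np.random.default_rng(0)
# Simulated outputs of two non-linear processors driven by a shared input.
x = rng.normal(size=10_000)
out_a = np.tanh(x + 0.1 * rng.normal(size=x.size))
out_b = np.tanh(x + 0.1 * rng.normal(size=x.size))

# Joint histogram -> empirical joint PDF over discrete bins.
hist, _, _ = np.histogram2d(out_a, out_b, bins=16)
p_ab = hist / hist.sum()
p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of output a
p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of output b

# I(a;b) = sum p(a,b) log[ p(a,b) / (p(a) p(b)) ], with 0 log 0 = 0.
mask = p_ab > 0
mi = np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a * p_b)[mask]))
print(f"I(a;b) estimate: {mi:.2f} nats")
```

Because the two outputs share a common input, the estimated mutual information is large; for independent outputs it would be near zero, so maximising such terms selects processors whose outputs carry shared structure from the data.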

MaxEnt 94 Abstracts / mas@mrao.cam.ac.uk