
Natural Modeling Process

by Petri Lievonen

This page is a work in progress.

All men by nature desire to know.

Studies on Mathematical Modeling of Modeling of Modeling

There may lie ahead a possibility to abstract and generalize the essence of transformer architectures and diffusion models to fundamental mathematical structures, where they would be amenable to treatment akin to analytical mechanics – but for learning, knowledge, and understanding, and also for design, automation, and control.

Minimally understanding data Y by explaining it as X, using reusable models A and contextual attentional mixing by C, resulting in residual modeling errors E to be minimized,

$$Y - AXC = E,$$

leads to an objective function of sorts as an action principle to be extremalized

$$S = \alpha\,\operatorname{sum}(E^\dagger E)/2 \;-\; \operatorname{tr}(E^\dagger E)/2 \;+\; \beta\,\operatorname{sum}(E E^\dagger)/2.$$

Dagger ($\dagger$) here means matrix transpose with complex conjugation (if one operates on a suitable algebra), and sum means the sum over all matrix elements.

Studying the gradients of the simplest version of the above (α=0=β) first,

$$S = -\operatorname{tr}(E^\dagger E)/2,$$

$$\frac{\partial S}{\partial Y} = AXC - Y = -E,$$

$$\frac{\partial S}{\partial A} = YC^\dagger X^\dagger - AXCC^\dagger X^\dagger = EC^\dagger X^\dagger,$$

$$\tau\,\dot{X} = \frac{\partial S}{\partial X} = A^\dagger YC^\dagger - A^\dagger AXCC^\dagger = A^\dagger EC^\dagger,$$

$$\frac{\partial S}{\partial C} = X^\dagger A^\dagger Y - X^\dagger A^\dagger AXC = X^\dagger A^\dagger E,$$

reveals some algebraically (and thus also statistically) stable attractors present in this minimal adaptive system.
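As a quick sanity check of the algebra, here is a small numerical sketch in Julia (my arbitrary choices: real-valued matrices, so that the dagger is a plain transpose, and a forward-difference step h) comparing the gradient formula for A against finite differences:

```julia
using LinearAlgebra

# Random instance; real matrices, so dagger = transpose. Sizes are arbitrary.
Y = randn(5, 7); A = randn(5, 3); X = randn(3, 4); C = randn(4, 7)
S(A) = -norm(Y - A*X*C)^2 / 2          # S = -tr(E'E)/2 in this real case

gradA = (Y - A*X*C) * C' * X'          # the formula above: ∂S/∂A = E C† X†
gradA_fd = similar(A)                  # finite-difference comparison
h = 1e-6
for i in eachindex(A)
    Ah = copy(A); Ah[i] += h
    gradA_fd[i] = (S(Ah) - S(A)) / h
end
println(maximum(abs.(gradA - gradA_fd)))   # small: the formulas agree
```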

The stationary points ($\nabla S = 0$) are (study also the second gradients of the above to see their convexity and thus stability):

$$Y = AXC,$$

$$A = YC^\dagger X^\dagger (XCC^\dagger X^\dagger)^{-1} = Y(XC)^+,$$

$$X = (A^\dagger A)^{-1} A^\dagger Y C^\dagger (CC^\dagger)^{-1} = A^+ Y C^+,$$

$$C = (X^\dagger A^\dagger AX)^{-1} X^\dagger A^\dagger Y = (AX)^+ Y,$$

where $M^+$ is the (left or right) pseudoinverse.

Note that the stationary points (with matrix inverses) do not need to be calculated explicitly – the system can be running, learning, and adapting all the time, as the relevant stationary point is solved implicitly by the distributed dynamics. Compare the formulas also to the Lagrange dual method.
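To illustrate that implicit solving in the simplest possible setting, here is a sketch (my choices: real matrices, A and C held fixed, arbitrary sizes) where iterating X along the gradient flow reaches the same stationary point as the explicit pseudoinverse formula:

```julia
using LinearAlgebra

# Iterate X along the gradient flow with fixed A and C; the step size is
# bounded using Frobenius norms, echoing the spectral-radius remark below.
function iterate_X(Y, A, C; steps = 10_000)
    X = zeros(size(A, 2), size(C, 1))
    eta = 1 / (norm(A' * A) * norm(C * C'))
    for _ in 1:steps
        E = Y - A*X*C
        X += eta * (A' * E * C')       # follow ∂S/∂X = A†EC†, uphill in S
    end
    return X
end

Y = randn(6, 20); A = randn(6, 3); C = randn(4, 20)
X = iterate_X(Y, A, C)
println(norm(X - pinv(A) * Y * pinv(C)))   # approaches 0: same stationary point
```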

The above covariance matrix structures, resembling quadratically optimal principal component regression (see also control theory, optimal control, linear systems, and LTI systems), result intrinsically when using α=0, but there are other special points as well. Using α=1/N, where N is the number of columns in Y, results in automatic centering (subtraction of the Y row means) of the data, and nice symmetries appear in the formula: dividing S by N yields an interpretation of the energy functional S as the difference between the direction-aware correlated errors among all the hypothetical occasions (the sum as the 1/N² average over all the matrix elements) and the actual scalar modelling (or controlling) errors observed and experienced (the 1/N mean of the trace).
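The centering claim can be checked numerically in a few lines (a sketch assuming real matrices and β=0; sizes arbitrary) – with α=1/N the functional only sees the row-centered part of the error:

```julia
using LinearAlgebra

M, N = 5, 8
E = randn(M, N)
S = sum(E' * E) / (2N) - tr(E' * E) / 2   # the α = 1/N, β = 0 functional
Ec = E .- sum(E; dims = 2) ./ N           # subtract row means: centered errors
println(S ≈ -norm(Ec)^2 / 2)              # true: only the centered part contributes
```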

Using the time constant $\tau^2 = \|A^\dagger A\|_2^2\,\|CC^\dagger\|_2^2$ (where I am not sure about the latter factor) stabilizes the system, as then, when iterating in discrete settings, the eigenvalues of the feedback matrix stay inside the unit circle (as the spectral radius of a Hermitian matrix is bounded from above by its Frobenius norm), not just negative, which is what is necessary for stability in continuous settings.

Note that evolving the attention matrix C has not been simulated here yet, and there may be complications in that kind of a multi-criteria formulation without additional constraints, but I wanted to include that matrix-multiplication-from-the-right here for aesthetic symmetry reasons (and a distant association to spinors) and for its demonstrated practical relevance in modern AI systems.

Using the exponential logits form of the attention matrix, $C := \exp.(C_0)\,./\,(J \exp.(C_0))$, where the elementwise softmax is applied to each column (utilizing the all-ones matrix $J$ for the summing), seems to automatically center the samples (columns in Y) according to their expected values (interpreting the result of the softmax as a probability distribution). This seems a nice emergent behavior of the system that usually goes unnoticed when the gradients are calculated for deep networks with autodiff or similar automatic frameworks, without inspecting their algebraic structure. We should study these kinds of linear algebraic systems using matrix calculus, almost by hand, similar to how theoretical physics has progressed.
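For concreteness, the column-wise softmax above can be written out as follows (a small sketch; the logit matrix size is an arbitrary choice):

```julia
C0 = randn(4, 7)                   # logits
J = ones(4, 4)                     # all-ones matrix: J*exp.(C0) repeats each column sum
C = exp.(C0) ./ (J * exp.(C0))     # elementwise softmax applied to each column
println(sum(C; dims = 1))          # every column sums to 1: a probability distribution
```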

It is also interesting to consider and study gradients of exponentials, logarithms, traces, and determinants: for complex-valued square matrices they are related via Jacobi's formula, $\det(\exp B) = \exp(\operatorname{tr} B)$, as is also evident from the properties of the spectrum. This could perhaps lead to a natural scoring rule to be applied to the energy functional S above. In probability theory, the characteristic function (closely related to the Fourier transform), the moment-generating function, and especially the cumulant-generating function allow calculation of the cumulants κₙ (mean, variance, and higher-order cumulants) by differentiating the transformed expectation value and evaluating it at zero:

$$\kappa_n := \frac{d^n}{dt^n} \ln \mathbb{E}\!\left[e^{tY}\right] \Big|_{t=0},$$

where Y is a random variable whose distribution is under scrutiny. As mean and variance are fundamental concepts in any statistical modeling, the above suggests that successive derivatives, expectation values, logarithms, and exponentials may find fundamental uses in modeling generally – and these kinds of relations are already present in theories of thermodynamics such as statistical mechanics (see canonical ensembles, for example).
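The determinant–trace relation mentioned above is easy to confirm numerically (a quick sketch for a random complex matrix):

```julia
using LinearAlgebra

B = randn(ComplexF64, 4, 4)
println(det(exp(B)) ≈ exp(tr(B)))   # Jacobi's formula: true up to round-off
```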
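And the cumulant formula can likewise be checked against samples by numerically differencing the empirical cumulant-generating function (a sketch; the Gaussian test distribution, sample size, and step h are my arbitrary choices):

```julia
using Statistics

y = 1.5 .+ 2.0 .* randn(10^6)            # samples with mean 1.5 and variance 4.0
K(t) = log(mean(exp.(t .* y)))           # empirical cumulant-generating function

h = 1e-3
kappa1 = (K(h) - K(-h)) / (2h)           # first derivative at 0: the mean
kappa2 = (K(h) - 2*K(0) + K(-h)) / h^2   # second derivative at 0: the variance
println((kappa1, kappa2))                # ≈ (1.5, 4.0)
```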

Even more ambitiously, studying the exponentials and their derivatives in relation to Lie groups and Lie algebras could lead to succinct formulations for S. For example, the geodesic distance between rotation matrices R and S on the 3D manifold of rotation matrices is $\|\log(R^T S)\|_2$, where the logarithm of a matrix has been used; and for complex-valued matrices, the symmetric and skew-symmetric parts can operate similarly to the real and imaginary parts of a number, representing multidimensional rotations. Already now, some matrix networks in machine learning, such as LLM transformers and layered hierarchical vision models, could implicitly be using quite advanced learned algebraic structures without being explicitly instructed to do so, as for example Clifford algebras of arbitrary dimension can be represented by numerical matrices and vice versa (my quick intro to Garret Sobczyk's diagrams, watch a minute or two).
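As a concrete sketch of that geodesic distance (rotations about a common axis, chosen so the answer is easy to verify; the angles are arbitrary):

```julia
using LinearAlgebra

rotz(t) = [cos(t) -sin(t) 0; sin(t) cos(t) 0; 0 0 1]  # rotation about the z-axis
R, S = rotz(0.3), rotz(1.1)
G = log(R' * S)                # matrix logarithm: skew-symmetric generator
println(opnorm(real(G)))       # ‖log(RᵀS)‖₂ ≈ 0.8, the relative rotation angle
```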

I have found the above structures quite interesting, as minimal examples of system-oriented, meaning-forming/semiotic, functional, and distributed dynamical processes that learn to survive (i.e. keep their sanity in their ecological niches) and even flourish by their very nature. I have experimented with some simulated visualizations of the organic or “lifelike” behaviors present – similar to these recordings of mine from about 15 years ago, where the columns of the model matrix A are displayed as a grid, and where additional constraints, such as an elementwise cutoff rectifier and positive feedback from supervised output labels, were used to learn a performant sparse code. The approach has similarities to predictive coding, but due to challenges in understanding the stability criteria when mimicking robust natural systems by iterating recursively (layer by layer, or modeling the same data at different spectral scales), the algorithm implied above has been successfully implemented for only one or a few layers so far. That is far from the managed complexity needed for contemporary state-of-the-art benchmark results in machine learning and AI, or for bringing greater understanding to actually observed complex behaviors such as the collective intelligence among cellular organisms (by Michael Levin et al., see also timelapse videos and Tom Mitchell in “3009”). Due to the nice properties of fundamental mathematical structures (for example, dimensional scaling can be achieved by simple multiplication in the frequency space; see also the more general Laplace transform and time-frequency presentations such as the advanced Wigner–Ville distribution, and compare them to the positional encoding used in large language models and to the cosine noise schedule in image diffusion models), there may be a possibility to “collapse” or “fold” the layers into evolving resonant phased-array “holograms” (exhibiting standing waves that form resonators; see also wave interference).

These kinds of structures were studied a decade or two ago by Prof. Hyötyniemi (link in Finnish), who in recent years has become convinced that the grand time should be analyzed in the frequency domain to make progress in understanding various systems relevant to “all men” that “by nature” “desire to know” – systems that may turn out to have deep cybernetic roots and reasons for their inquiries and aspirations in the first place.




Dynamic Universe–inspired art by Jani Isoranta

Motivational thoughts on modeling (62 pages, early 2023) is also available.

Email: [email protected]