{"title": "Analytical Study of the Interplay between Architecture and Predictability", "book": "Advances in Neural Information Processing Systems", "page_first": 315, "page_last": 321, "abstract": "", "full_text": "Analytical study of the interplay between \n\narchitecture and predictability \n\nAvner Priel, Ido Kanter, David A. Kessler \n\nMinerva Center and Department of Physics, Bar Ilan University, \n\nRamat-Gan 52900, Israel. \n\ne-mail: priel@mail.cc.biu.ac.il \n\n(web-page: http://faculty.biu.ac.il/ \"\"'priel) \n\nAbstract \n\nWe study model feed forward networks as time series predictors \nin the stationary limit. The focus is on complex, yet non-chaotic, \nbehavior. The main question we address is whether the asymptotic \nbehavior is governed by the architecture, regardless the details of \nthe weights . We find hierarchies among classes of architectures \nwith respect to the attract or dimension of the long term sequence \nthey are capable of generating; larger number of hidden units can \ngenerate higher dimensional attractors. In the case of a perceptron, \nwe develop the stationary solution for general weights, and show \nthat the flow is typically one dimensional. The relaxation time \nfrom an arbitrary initial condition to the stationary solution is \nfound to scale linearly with the size of the network. In multilayer \nnetworks, the number of hidden units gives bounds on the number \nand dimension of the possible attractors. We conclude that long \nterm prediction (in the non-chaotic regime) with such models is \ngoverned by attractor dynamics related to the architecture. \n\nNeural networks provide an important tool as model free estimators for the solution \nof problems when the real model is unknown, or weakly known. In the last decade \nthere has been a growing interest in the application of such tools in the area of time \nseries prediction (see Weigand and Gershenfeld, 1994). 
In this paper we analyse a typical class of architectures used in this field, i.e. a feed-forward network governed by the following dynamic rule:

S_1^{t+1} = S_out ;   S_j^{t+1} = S_{j-1}^t ,  j = 2, ..., N    (1)

where S_out is the network's output at time step t and S_j are the inputs at that time; N is the size of the delayed input vector. The rationale behind using time-delayed vectors as inputs is the theory of state-space reconstruction of a dynamic system using delay coordinates (Takens, 1981; Sauer, Yorke and Casdagli, 1991). This theory addresses the problem of reproducing a set of states associated with the dynamic system using vectors obtained from the measured time series, and is widely used for time series analysis. A similar architecture incorporating time delays is the TDNN - time-delay neural network with a recurrent loop (Waibel et al., 1989). This type of network is known to be appropriate for learning temporal sequences, e.g. speech signals. In the context of time series, it is mostly used for short-term predictions. Our analysis focuses on the various long-time properties of the sequence generated by a given architecture and the interplay between them. The aim of such an investigation is the understanding and characterization of the long-term sequences generated by such architectures, and the time scale to reach this asymptotic behavior. Such knowledge is necessary to define adequate measures for the transition between a locally dependent prediction and the long-term behavior. Though some work has been done on characterizing a dynamic system from its time series using neural networks, few analytical results that connect architecture and long-time prediction are available (see M. Mozer in Weigand and Gershenfeld, 1994). 
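The feedback rule of eq. (1) is simply a shift register whose newest entry is the network's output. A minimal sketch (the predictor argument is a hypothetical stand-in for any feed-forward map, not the paper's trained network):

```python
import numpy as np

# Sketch of the closed-loop dynamics of eq. (1).  Any feed-forward map from
# the N delayed inputs to the next output can be plugged in as `predictor`.

def iterate(predictor, s0, steps):
    # S_1^{t+1} = S_out ; S_j^{t+1} = S_{j-1}^t for j = 2, ..., N
    s = np.asarray(s0, dtype=float).copy()
    outputs = np.empty(steps)
    for t in range(steps):
        s_out = predictor(s)
        s = np.roll(s, 1)   # shift the delay line by one step
        s[0] = s_out        # feed the output back as the newest input
        outputs[t] = s_out
    return outputs
```

For instance, predictor = lambda s: np.tanh(beta * np.dot(w, s)) with a weight vector w reproduces the perceptron of eq. (2) below.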
Nevertheless, practical considerations for choosing the architecture have been investigated extensively (Weigand and Gershenfeld, 1994 and references therein). It has been shown that such networks are capable of generating chaotic-like sequences. While it is possible to reconstruct approximately the phase space of chaotic attractors (at least in low dimension), it is clear that prediction of chaotic sequences is limited by the very nature of such systems, namely the divergence of the distance between nearby trajectories. Therefore one can only speak about short-time predictions with respect to such systems. Our focus is the ability to generate complex sequences, and the relation between architecture and the dimension of such sequences.

1 Perceptron

We begin with a study of the simplest feed-forward network, the perceptron. We analyse a perceptron whose output S_out at time step t is given by:

S_out = tanh [ β Σ_{j=1}^{N} (W_j + W_0) S_j ]    (2)

where β is a gain parameter and N is the input size. The bias term, W_0, plays the same role as the common 'external field' used in the literature, while preserving the same qualitative asymptotic solution. In a previous work (Eisenstein et al., 1995) it was found that the stationary state (of a similar architecture but with a \"sign\" activation function instead of the \"tanh\", equivalently β → ∞) is influenced primarily by one of the larger Fourier components in the power spectrum of the weight vector W of the perceptron. This observation motivates the following representation of the vector W. Let us start with the case of a vector that consists of a single biased Fourier component of the form:

W_j = a cos(2πKj/N) ,  j = 1, ..., N ;   W_0 = b    (3)

where a, b are constants and K is a positive integer. This case is generalized later on; however, for clarity we treat the simple case first. 
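As a numerical illustration of eqs. (2)-(3), the perceptron can be iterated under the dynamic rule of eq. (1); with the gain above the limit-cycle onset the output settles onto a periodic sequence. The parameters below (N = 32, K = 4, a = 1, b = 0, β = 4/N) are illustrative choices, not values taken from the paper's figures:

```python
import numpy as np

# Perceptron sequence generator, eqs. (2)-(3): a single Fourier component
# W_j = a cos(2 pi K j / N), bias W_0 = b = 0 (so only the limit-cycle
# attractor is available).  Illustrative parameters, not the paper's.
N, K, a, b = 32, 4, 1.0, 0.0
beta = 4.0 / N                      # above the limit-cycle onset 2/(aN)
W = a * np.cos(2 * np.pi * K * np.arange(1, N + 1) / N)

rng = np.random.default_rng(1)
s = rng.uniform(-1, 1, N)           # arbitrary initial delay vector
seq = []
for t in range(20000):
    s_out = np.tanh(beta * np.dot(W + b, s))
    s = np.roll(s, 1)
    s[0] = s_out
    seq.append(s_out)
seq = np.array(seq)

# after the transient the output is periodic with period N/gcd(N, K) = 8,
# consistent with the stationary form tanh[A cos(2 pi K l / N)]
period = N // np.gcd(N, K)
print(np.max(np.abs(seq[-period:] - seq[-2 * period:-period])))
```

The printed mismatch between consecutive periods shrinks to roundoff level, and the converged amplitude is non-trivial (the A ≠ 0 branch discussed next).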
Note that the vector W can always be represented as a Fourier decomposition of its values. The stationary solution for the sequence (S_l) produced by the output of the perceptron, when inserting this choice of the weights into equation (2), can be shown to be of the form:

S_l = tanh [ A(β) cos(2πKl/N) + B(β) ]    (4)

There are two non-zero solutions possible for the variables (A, B):

A = (1/2) β N a Σ_{p=1}^{∞} D(p) (A/2)^{2p-1} (p!)^{-2} ,  B = 0
B = β N b Σ_{p=1}^{∞} D(p) B^{2p-1} ((2p)!)^{-1} ,  A = 0    (5)

where D(p) = 2^{2p} (2^{2p} - 1) B_{2p} and B_{2p} are the Bernoulli numbers. Analysis of equations (5) reveals the following behavior as a function of the parameter β. Each of the variables is the amplitude of an attractor. The attractor represented by (A ≠ 0, B = 0) is a limit cycle, while the attractor represented by (B ≠ 0, A = 0) is a fixed point of the dynamics. The onset of each of the attractors A (B) is at β_c1 = 2(aN)^{-1} (β_c2 = (bN)^{-1}) respectively. One can identify three regimes: (1) β < β_c1, β_c2 - the stable solution is S_l = 0. (2) min(β_c1, β_c2) < β < max(β_c1, β_c2) - the system flows for all initial conditions into the attractor whose β_c is smaller. (3) β > β_c1, β_c2 - depending on the initial condition of the input vector, the system flows into one of the attractors, namely, the stationary state is either a fixed point or a periodic flow. β_c1 is known as a Hopf bifurcation point. Naturally, the attractor whose β_c is smaller has a larger basin of attraction, hence it is more probable to attract the flow (in the third regime).

Figure 1: Embedding of a sequence generated by a perceptron whose weights follow eqs. (3) and (6). 
Periodic sequence (outer curve): N = 128, K = 17, b = 0.3, β = 1/40; quasi-periodic (inner): K = 17, φ = 0.123, β = 1/45, respectively.

Next we discuss the more general case where the weights of eq. (3) include an arbitrary phase shift of the form:

W_j = a cos(2πKj/N - πφ) ,  φ ∈ (-1, 1)    (6)

The leading term of the stationary solution in the limit N >> 1 is of the form:

S_l = tanh [ A(β) cos(2π(K - φ)l/N) + B(β) ]    (7)

where the higher harmonic corrections are of O(1/K). A note should be made here that the phase shift in the weights is manifested as a frequency shift in the solution. In addition, the attractor associated with A ≠ 0 is now a quasi-periodic flow in the generic case when φ is irrational. The onset value of the fixed point (β_c2) is the same as before; however, the onset of the quasi-periodic orbit is β_c1 = (πφ/sin(πφ)) 2(aN)^{-1}. The variables A, B follow similar equations to (5):

A = (1/2) β N a (sin(πφ)/(πφ)) Σ_{p=1}^{∞} D(p) (A/2)^{2p-1} (p!)^{-2} ,  B = 0 ;  or A = 0    (8)

The three regimes discussed above appear in this case as well. Figure 1 shows the attractor associated with (A ≠ 0, B = 0) for the two cases, where the series generated by the output is embedded as a sequence of two-dimensional vectors (S_{l+1}, S_l).

The general weights can be written as a combination of their Fourier components with different K's and φ's:

W_j = Σ_{i=1}^{m} a_i cos(2πK_i j/N - πφ_i) ,  φ_i ∈ (-1, 1)    (9)

When the different K's are not integer divisors of each other, the general solution is similar to that described above:

S_l = tanh [ Σ_{i=1}^{m} A_i(β) cos(2π(K_i - φ_i)l/N) + B(β) ]    (10)

2 Relaxation time

Figure 2: Scaling of ξ for weights composed of two Fourier components (φ_1 = 0.121; K_2 = 7, φ_2 = 0), W_0 = 0.3. The dashed line is a linear fit of 0.1/N, N = 50, ..., 400. 
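The series appearing in eqs. (5) and (8) can be checked directly: Σ_p D(p) (A/2)^{2p-1} (p!)^{-2} is the first Fourier coefficient of tanh(A cos t). A small sketch, valid for |A| < π/2 (the convergence radius of the tanh series); the value A = 0.8 and the truncation order are arbitrary choices:

```python
import numpy as np
from fractions import Fraction
from math import comb, factorial

# Verify that sum_p D(p) (A/2)^(2p-1) / (p!)^2, with
# D(p) = 2^(2p) (2^(2p) - 1) B_{2p}, equals the first Fourier coefficient
# (1/pi) * integral_0^{2pi} tanh(A cos t) cos t dt.

def bernoulli(n):
    # exact B_0 ... B_n via the recurrence sum_{k<m} C(m+1, k) B_k = -(m+1) B_m
    B = [Fraction(0)] * (n + 1)
    B[0] = Fraction(1)
    for m in range(1, n + 1):
        B[m] = -sum(Fraction(comb(m + 1, k)) * B[k] for k in range(m)) / (m + 1)
    return B

A, P = 0.8, 25                       # arbitrary amplitude, truncation order
B = bernoulli(2 * P)
series = sum(float(4**p * (4**p - 1) * B[2 * p]) * (A / 2)**(2 * p - 1)
             / factorial(p)**2 for p in range(1, P + 1))

# rectangle rule on a smooth periodic integrand is spectrally accurate
t = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
c1 = 2.0 * np.mean(np.tanh(A * np.cos(t)) * np.cos(t))
print(series, c1)
```

The two numbers agree to high precision; the leading term of the series is A itself, which recovers the onset β_c1 = 2(aN)^{-1} from eq. (5).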
\n\nApplying T T - times to an initial state vector results in a vector whose second \nlargest component is of order: \n\n(13) \n\ntherefore we can define the characteristic relaxation time in the vicinity of an at(cid:173)\ntractor to be T = ~ -1 . 1 \n\nWe have analysed eq. (12) numerically for various cases of Ci, e.g. Wi composed of \none or two Fourier components. In all the cases (3 was chosen to be the minimal f3c to \nensure that the linearized form is valid. We found that ~,....., I/N. Figure 2 depicts \none example of two Fourier components. Next, we have simulated the network and \nmeasured the average time ( T S \n) it takes to flow into an attractor starting from \nan arbitrary initial condition. The following simulations support the analytical \n,....., N ) for general (random) weights and high gain (13) value as well. \nresult ( T \nThe threshold we apply for the decision whether the flow is already close enough \nto the attractor is the ratio between the component with the largest power in the \nspectrum and the total power spectrum of the current state (S ), which should \nexceed 0.95. The results presented in Figure 3 are an average over 100 samples \nstarted from random initial condition. The weights are taken at random, however \nwe add a dominant Fourier component with no phase to control the bifurcation \npoint more easily. This component has an amplitude which is about twice the other \ncomponents to make sure that its bifurcation point is the smallest. We observe a \nclear linear relation between this time and N (T S \n,....., N ). The slope depends on the \nactual values of the weights, however the power law scaling does not change. \n\n- t \n\nOn general principles, we expect the analytically derived scaling law for ~ to be \nvalid even beyond the linear regime. Indeed the numerical simulations (Figure 3) \nsupport this conjecture. 
\n\n3 Multilayer networks \n\nFor simplicity, we restrict the present analysis to a multilayer network (MLN) with \nN inputs, H hidden units and a single linear output, however this restriction can \nbe removed, e.g. nonlinear output and more hidden layers. The units in the hidden \nlayer are the perceptrons discussed above and the output is given by: \n\n(14) \n\nINote that if one demand the L.R.S. of eq. (13) to be of O(~), then T '\" ~ -11og(~ -1). \n\n\f320 \n\n800 \n\n600 \n\nI\"'p 400 \n200 \n\n0 \n\n0 \n\n200 \n\n300 \n\n100 \n\nN \n\nA. Priel, 1. Kanter and D. A. Kessler \n\nFigure 3: Scaling of r S for ran(cid:173)\ndom weights with a dominant \ncomponent at K = 7, \u00a2 = 0, a = \n1; All other amplitudes are ran(cid:173)\ndomly taken between (0,0.5) and \nthe phases are random as well. \n{3 = 3.2/N. The dashed line is a \nlinear fit of eN, e = 2.73 \u00b1 0.03. \nN = 16, ... , 256. \n\nThe dynamic rule is defined by eq. (1). First consider the case where the weights \nof each hidden unit are of the form described by eq. (6), Le. each hidden unit has \nonly one (possibly biased) Fourier component: \n\nm=l, ... ,H. \n\n(15) \n\nFollowing a similar treatment as for the perceptron, the stationary solution is a \ncombination of the perceptron-like solution: \n\nH \n\nSl = L tanh [Am({3) cos(27r(Km - \u00a2m)l/N) + Bm({3)] \n\n(16) \n\nm=l \n\nThe variables Am, Bm are the solution of the self consistent coupled equations, how(cid:173)\never by contrast with the single perceptron, each hidden unit operates independently \nand can potentially develop an attractor of the type described in section 1. The \nnumber of attractors depends on {3 with a maximum of H attractors. The number \nof non-zero Am's defines the attractor's dimension in the generic case of irrational \n\u00a2's associated with them. 
If different units do not share Fourier components with a common divisor or harmonics of one another, it is easy to state the quantitative result; otherwise, one has to analyse the coupled equations more carefully to find the exact values of the variables. Nevertheless, each hidden unit exhibits only a single highly dominant component (A ≠ 0 or B ≠ 0).

Generalization of this result to more than a single biased Fourier component is straightforward. Each vector is of the form described in eq. (9), plus an index for the hidden unit. The solution is a combination of the general perceptron solution, eq. (10). This solution is much more involved and the coupled equations are complicated, but careful study of them reveals the same conclusion, namely that each hidden unit possesses a single dominant Fourier component (possibly with several much smaller ones due to the other components in the vector). As the gain parameter β becomes larger, more components become available and the number of possible attractors increases. For a very large value it is possible that higher harmonics from different hidden units might interfere and complicate the solution considerably. Still, one can trace the origin of this behavior by close inspection of the fields in each hidden unit.

We have also measured the relaxation time associated with MLN's in simulations. The preliminary results are similar to the perceptron, i.e. τ_S ~ N, but the constant prefactor is larger when the weights consist of more Fourier components.

4 Discussion

Neural networks have been proved to be universal approximators (e.g. Hornik, 1991), hence they are capable of approximating the prediction function of the delay-coordinate vector. The conclusion should be that prediction is indeed possible. This observation holds only for short times in general. 
As we have shown, long-time predictions are governed by the attractor dynamics described above. The results point to the conclusion that the asymptotic behavior of these networks is dictated by the architecture and not by the details of the weights. Moreover, the attractor dimension of the asymptotic sequence is typically bounded by the number of hidden units in the first layer (assuming the network does not contain internal delays). To prevent any misunderstanding we note again that this result refers to the asymptotic behavior, although the short-term sequence can approximate a very complicated attractor.

The main result can be interpreted as follows. Since the network is able to approximate the prediction function, the initial condition is followed by reasonable predictions, which are the mappings from the vicinity of the original manifold created by the network. As the trajectory evolves, it flows to one of the attractors described above and the predictions are no longer valid. In other words, the initial combination of solutions described in eq. (10), or its extension to MLN (with an arbitrary number of non-zero variables, A's or B's), serves as the approximate mapping. Evolution of this approximation is manifested in the variables of the solution, which eventually are attracted to a stable attractor (in the non-chaotic regime). The time scale for the transition is given by the relaxation time developed above.

The formal study can be applied for practical purposes in two ways. First, one can take this behavior into account by probing the generated sequence and looking for its indications; one such indication is stationarity of the power spectrum. Second, one can incorporate ideas from local linear models in the reconstructed space to restrict the inputs in such a way that they always remain in the vicinity of the original manifold (Sauer, in Weigand and Gershenfeld, 1994). 
\n\nAcknowledgments \n\nThis research has been supported by the Israel Science Foundation. \n\nReferences \n\nWeigand A. S. and Gershenfeld N. A. ; Time Series Prediction, Addison-Wesley, \nReading, MA, 1994. \nE. Eisenstein, I. Kanter, D. A. Kessler and W. Kinzel; Generation and prediction \nof time series by a neural network, Phys. Rev. Lett. 74,6 (1995). \n\nWaibel A., Hanazawa T., Hinton G., Shikano K. and Lang K.; Phoneme recognition \nusing TDNN, IEEE Trans. Acoust., Speech & Signal Proc. 37(3), (1989). \nTakens F., Detecting strange attractors in turbulence, in Lecture notes in mathe(cid:173)\nmatics vol. 898, Springer-Verlag, 1981. \nT. Sauer, J. A. Yorke and M. Casdagli; Embedology, J. Stat. Phys. 65(3), (1991) . \nRalston A. and Rabinowitz P. ; A first course in numerical analysis, McGraw-Hill, \n1978. \nK. Hornik; Approximation capabilities of multilayer feed forward networks, Neural \nNetworks 4, (1991). \n\n\f", "award": [], "sourceid": 1344, "authors": [{"given_name": "Avner", "family_name": "Priel", "institution": null}, {"given_name": "Ido", "family_name": "Kanter", "institution": null}, {"given_name": "David", "family_name": "Kessler", "institution": null}]}