Guido Bugmann, Raju Bapi, Brendan D'Cruz, Mike Denham

Kaspar Althoefer and John G. Taylor (King's College London).

From a study of the literature on reinforcement learning in the field of control, covering methods such as the ACE/ASE system of Barto, Sutton and Anderson (1983) and temporal-difference learning of Sutton (1988), it appeared that the generation of a relevant value map was central to these methods. It was also a very slow process, due in part to the aim of finding an optimal value map, which requires that updates result from following an optimal policy. Thus, only the value of the current state was updated. Further, as dynamic programming techniques are used, a state had to be visited several times before the correct value map emerged. Slowness was also due to the random exploratory component required to ensure satisfactory convergence of values and policy. The design of the exploration method was very much a matter of personal taste and affected the final map.

1. In our proposed method, the concept of an optimal map is abandoned and the emphasis is put on complete exploration. From each state, all accessible neighbours are visited and their values used to update the state's value. Several updating cycles are still required to produce a smooth interpolating map, but, as there is no need to follow a path (policy), this is an inherently parallel process, potentially very fast.

2. As an example, a neural network is proposed which implements a resistive grid. The nodes in the net corresponding to obstacles or forbidden states have their output clamped to zero. The node corresponding to the goal state has its output clamped to 1. All other nodes set their output to intermediate values, equal to the average of the outputs of their neighbours. This implements a Laplacian net (Bugmann, Taylor and Denham, 1994).
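The relaxation described above can be sketched in a few lines; this is an illustrative minimal version, not the published implementation, assuming a small 2-D grid with four-connected neighbours, obstacle cells clamped to 0 and the goal cell clamped to 1:

```python
import numpy as np

def laplacian_map(shape, obstacles, goal, iterations=500):
    """Relax a resistive-grid (Laplacian) value map on a 2-D grid.

    Obstacle cells are clamped to 0, the goal cell to 1, and every free
    cell is repeatedly replaced by the average of its four neighbours
    (cells outside the grid count as 0)."""
    value = np.zeros(shape)
    free = np.ones(shape, dtype=bool)
    for r, c in obstacles:
        free[r, c] = False            # forbidden states, clamped to 0
    value[goal] = 1.0
    for _ in range(iterations):
        padded = np.pad(value, 1)     # zero border
        avg = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
               padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        value = np.where(free, avg, 0.0)
        value[goal] = 1.0             # goal stays clamped
    return value

# toy world: a wall of obstacles across row 3, goal in the corner
grid = laplacian_map((8, 8), obstacles=[(3, c) for c in range(1, 7)],
                     goal=(0, 0))
```

Because every free node is updated simultaneously from its neighbours, the loop body maps directly onto a parallel (or neural) implementation.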

3. This form of value map has been applied mainly to problems where actions cause only transitions between nearest neighbours: route finding in 2-dimensional problems (Bugmann, Taylor and Denham, 1994; Bugmann, 1996) or planning a sequence of robot arm movements (Althöfer and Bugmann, 1995).
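In the nearest-neighbour setting, a route can be read off a settled value map by repeatedly stepping to the highest-valued neighbour (steepest ascent toward the goal). The sketch below uses a toy analytic value map for illustration; the function names are our own, not from the cited papers:

```python
import numpy as np

def follow_gradient(value, start, goal, max_steps=200):
    """Extract a route by hill-climbing on a 2-D value map."""
    rows, cols = value.shape
    path = [start]
    pos = start
    for _ in range(max_steps):
        if pos == goal:
            return path
        r, c = pos
        neighbours = [(r + dr, c + dc)
                      for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
        pos = max(neighbours, key=lambda p: value[p])  # steepest ascent
        path.append(pos)
    return path

# toy value map: decays with Manhattan distance from the goal at (0, 0)
goal = (0, 0)
r, c = np.indices((6, 6))
value = 1.0 / (1.0 + r + c)
route = follow_gradient(value, start=(5, 5), goal=goal)
```

Since a settled Laplacian map has no local maxima in free space other than the goal, steepest ascent cannot get trapped.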

4. The collaboration with Kaspar Althöfer in the field of robot arm control led to a number of interesting side results. i) It was found that the forbidden regions in configuration space due to obstacles in the workspace can be computed by superposing the elementary forbidden regions due to all elementary components of the obstacles (Althöfer et al., 1995). ii) It was found that, due to the typical shapes of the free space in the robot arm problem, a wave-type updating method allows a useful value map to be computed in less than ten iterations (Althöfer, Fraser and Bugmann, 1995). iii) A new sequence-learning network based on RBF neurons was designed to produce smooth and precise movements, despite the coarse angular coding in the resistive grid (Althöfer and Bugmann, 1995).
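The idea of a wave-type update can be illustrated by propagating values outward from the goal in a single breadth-first sweep, instead of repeated averaging; each free cell then receives a value that decays with its wavefront distance from the goal. This is a hedged sketch of the general idea, with an assumed decay factor, not the published algorithm:

```python
from collections import deque

def wavefront_map(shape, obstacles, goal, decay=0.9):
    """Assign values in one breadth-first wave from the goal outward."""
    rows, cols = shape
    value = {goal: 1.0}
    blocked = set(obstacles)
    frontier = deque([goal])
    while frontier:
        r, c = frontier.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in blocked
                    and (nr, nc) not in value):
                value[(nr, nc)] = decay * value[(r, c)]  # one step farther
                frontier.append((nr, nc))
    return value

vmap = wavefront_map((5, 5), obstacles={(1, 1), (1, 2)}, goal=(0, 0))
```

A single sweep touches each free cell once, which is why this style of update settles in very few iterations compared with relaxation.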

5. The resistive grid technique was also used in a case where actions do not necessarily cause transitions between nearest neighbours in the grid: the pole balancing problem, where there are only two possible actions, which cause the system to move within a 4-dimensional state space. In this case, the neuro-resistive grid proved a useful replacement for the ACE element and led to faster learning. The controller using a resistive grid as value map also showed a smaller dynamic range in terms of starting positions of the pole, possibly indicating that the resistive grid settled to a given value map too fast (Bapi, D'Cruz and Bugmann, 1996).

1. Relating the resistive grid method formally to other map-building techniques such as Watkins' Q-learning.

2. Exploring nets where the action linking two nodes is not represented by the relative positions of the nodes but is encoded separately.

3. Deepening our understanding of the learning dynamics in the pole balancing problem with a resistive grid.

"Route finding by neural nets"

Bugmann, G., Taylor, J.G. and Denham M. (1994)

in Taylor J.G (ed) "Neural Networks", Alfred Waller Ltd, Henley-on-Thames, p. 217-230.

"The Configuration Space Transformation for Articulated Manipulators: A Novel Approach Based on RBF-Networks"

Althöfer K., Fraser D.A., Bugmann G. and Turan J. (1995)

in Proc. of the 4th International Conference on Artificial Neural Networks (ANN'95), Cambridge,
UK, IEE Conference Publication 409, pp 245-249.

"The Configuration Space for Manipulators Computed by a Basis-Function Network"

Althöfer K., Fraser D.A. and Bugmann G. (1995)

in Proceedings of the International Conference on Engineering Applications of Neural Networks
(EANN'95), Espoo, Finland, August 21-23, pp.469-472.

"Neuro-Resistive Grid Approach to Pole Balancing Problem"

Bapi, R., D'Cruz B. and Bugmann G. (1995)

Proc. of ICANN'95, Paris, Vol. 2, 539-544.

"Planning and Learning Goal-Directed Sequences of Robot-Arm Movements"

Althöfer K. and Bugmann G. (1995)

Proc. of ICANN'95, Paris, Vol. 1, 449-454.

"Asymmetric-B-Splines for the Fast Calculation of C Space Patterns of Robot Arm"

Althöfer K., Fraser D.A., Bugmann G. and Plumbley M.D. (1995)

Proc. of ICANN'95, Paris, Vol. 2, 387-392.

"Rapid path planning for robotic manipulator using an
emulated resistive grid"

Althöfer K., Fraser D.A. and Bugmann G. (1995)

*Electronics Letters*, 31, 1960-1961.

"Neuro-resistive grid approach to trainable controllers:
A pole balancing example".

D'Cruz B., Bapi R. and Bugmann G. (1995)

To appear in *Neural Computing and Application Journal*

"Value Maps for planning and learning implemented with cellular automata "

Bugmann G. (1996)

in Parmee I.C. (ed) "Proc. of the 2nd International Conf. on Adaptive Computing in Engineering Design and Control (ACEDC'96)", Plymouth, 26-28 March 1996, ISBN 0 905227 61 1, p. 307-309.
