A Psychologically-Plausible Model for Spatial Language

EPSRC Grant GR/N38145


RESULTS OF RESEARCH

The following is a summary of the experimental and modelling research results obtained during the project. For more details on the individual experiments, models and results, please visit the publications page.

§         1. Initial Theoretical Developments and Experimental Findings

§         2. Development of Computational Model

§         3. Interplay between Experimental and Computational Work

§         References

 

1 Initial Theoretical Developments and Experimental Findings

Given the myriad findings in the literature regarding the influence of extra-geometric variables, it was important to review the findings, classify them into types of extra-geometric influences, and develop a framework in which to understand these influences. The “functional geometric framework” that emerged from this work early in the grant (see Coventry & Garrod, 2004; in press) formed the basis for the experimental work reported here, and provided theoretical and empirical constraints for the modelling work outlined below. An edited volume on spatial language from cognitive and computational perspectives also emerged partly from the early work in the grant (see Coventry & Olivier, 2002). We first give an overview of the main results from the experimental work before outlining the model.

2.1.1 Experiments 1 – 4: Object features/object function and Over/Under/Above/Below

The starting point for the experimental work was to examine the relative influence of geometric and extra-geometric factors on the comprehension of over, under, above and below. These terms were a particularly useful starting point as they have been shown to be differentially affected by geometric and extra-geometric relations (Coventry et al., 2001): the comprehension of over/under is more affected by functional relations than the comprehension of above/below, while conversely the comprehension of above/below is more influenced by geometric relations than the comprehension of over/under.

Experiments 1-4 addressed the relative extent to which the weightings for geometry and function are driven by individual prepositions and lexical entries for nouns versus information in the visual scene about how the objects are functioning. For example, a large golf umbrella affords greater protection from rain, while an umbrella full of holes diminishes the object’s usefulness as a protector from rain. In Experiments 1 and 2, scenes of the type shown on the left hand side of Figure 1 were compared with scenes involving larger objects in the same position (centre-of-mass controlled, Experiment 1), or with objects the same size but full of holes (thus compromising their protection function, Experiment 2). The methodology used for these experiments (and the other experiments unless otherwise indicated) involved the presentation of pictures together with sentences of the form “The located object is preposition the reference object”, and the task for participants was to rate the appropriateness of each sentence to describe each picture using a Likert scale (ranging from 1 = totally unacceptable to 9 = totally acceptable). For the first experiment, increasing the size of the protecting object was found to increase the size of the function effect for all four prepositions, and effectively to reduce the differences between prepositions. In other words, as the function depicted by an object in a scene becomes greater, function tends to determine acceptability for all four prepositions. In contrast, in the second experiment, the function effect was much diminished for the objects with holes in them. This was expected, given that the rain in the “functional” condition will still pass through an umbrella with holes and wet the man. Furthermore, for Experiment 2, we also asked a second group of participants to estimate the percentage of rain that was likely to make contact with the person for each scene used.
This allowed a direct examination of the relationship between judgements about what will happen in the scenes and acceptability for spatial prepositions. We found a significant correlation between judgements and acceptability for over, under, above and below overall (r = -0.52), although the correlation was higher for over/under (r = -0.78) than for above/below (r = -0.33) as expected. 
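The correlation analysis described above can be sketched in a few lines. The following is a minimal illustration of a Pearson correlation between per-scene estimates and per-scene ratings; the function name `pearson_r` and the five data points are hypothetical and invented purely for illustration (they are not the project's data).

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-scene values, invented for illustration only.
rain_reaching_person = [5, 20, 45, 70, 90]   # estimated % of rain hitting the person
rating_over = [8.5, 7.0, 5.5, 3.0, 2.0]      # mean acceptability of "over" (1-9 scale)

r = pearson_r(rain_reaching_person, rating_over)
print(f"r = {r:.2f}")  # negative: more rain on the person, lower acceptability
```

A negative coefficient is the expected direction here: the more rain is judged to reach the person, the less the umbrella fulfils its function and the lower the rating.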

Experiments 3 and 4 involved the same manipulations as Experiments 1 and 2, only this time the scenes used involved conflicts between reference frames, manipulated by rotating the man in Figure 1 rather than the protecting object (see for example the Viking/shield/spear scenes in Figure 1). Rotating the Viking away from the vertical plane (scenes in the penultimate and last columns in Figure 1), for example, produces conflicts between the absolute (gravitational) and intrinsic (object-centred) frames of reference. Thus for these scenes “the shield is above the Viking” is appropriate for the absolute frame of reference but inappropriate for the intrinsic frame of reference. Experiment 3 manipulated the size of the protecting object and Experiment 4 manipulated the completeness of the protecting object. The results for these experiments were consistent with the results from Experiments 1 and 2. Increasing the size of the object magnified the effect of the function of the object, while adding holes to the object diminished the size of the function effect. Additionally, frame of reference conflicts were more in evidence for above and below than for over and under. Finally, for Experiment 4 we also asked a second group of participants to estimate the percentage of rain, for example, that was likely to make contact with the person for each scene. Correlations between these judgements and ratings for over, under, above and below mirrored the results for Experiment 2.

Summary of Results. Changing the degree of protection function affects acceptability ratings. Furthermore, predictions of what will happen in a scene correlated with acceptability judgements (but more so for over/under than for above/below). These results indicate that processing of how objects are functioning in context in the visual scene is essential to establish the appropriateness of over, under, above and below, consistent with the work of Glenberg (1997) and Barsalou (1999). Please note that Experiments 1-4 are in submission to Journal of Memory and Language (Coventry et al., in submission).

2.1.2 Experiments 5-10: The time course of processing of geometric and extra-geometric information

Experiments 1-4 used acceptability ratings as a dependent variable. Such a measure does not give information regarding the time-course of the processing of information present in the scene to be described. For that reason, we ran a series of sentence-picture verification tasks where participants were presented with sentences of the form The located object is preposition the reference object followed by a picture. Participants had to indicate as quickly as possible whether a sentence was a correct description of the picture that followed. Experiments 5 and 6 used materials such as those displayed in Figure 1 (but modified to control for visual complexity, etc.). Experiments 7 and 8 compared small and large objects where reference frames were always aligned (Experiment 7) or where they conflicted to varying degrees (Experiment 8). Experiments 9 and 10 mirrored Experiments 7 and 8, only this time complete and incomplete objects were compared.

Summary of Results. For all experiments, the data for the mean number of true responses replicate the rating data analyses in Experiments 1-4. In relation to the time taken to make a true response, the data were informative about the speed of processing of geometry and function. Quicker responses were found for functional than for non-functional scenes, and quicker responses were found for scenes where the located and reference objects were aligned than when they were misaligned. Please note that Experiments 5-10 are in preparation for journal submission (Coventry et al., in preparation).

Figure 1. Sample scenes used in Coventry, Prat-Sala and Richards (2001).

2.1.3 Experiments 11-15: Experiments for the model

Given the constraints imposed by the visual processing modules in the computational model we outline below, we ran a number of experiments using generated images/movies that would be easy for the model to process (see Figure 2), using the same basic methodology as Experiments 1-4. Experiments 11-15 involved three different reference objects (a plate, a dish and a bowl), pre-tested in a sorting task and a rating task to have the prototypical dimensions of these objects, and a variety of other objects which were all containers (e.g., a jug). Each container was presented in each of 3 x 2 positions “higher” than the other object (representing 3 levels of distance on the x axis and 2 levels on the y axis from the other object). Crucially, the container was shown to pour liquid such that it ended up reaching the plate/dish/bowl (the functional condition), or missed the plate/dish/bowl (non-functional condition), or liquid was not present. In Experiment 11 participants saw movies of the pouring scenes (or static scenes for the no liquid condition, given that no movement was involved). The results showed effects of geometry and function together with interactions between these variables and over/under versus above/below. Experiment 12 compared the full movies with just the (single frame) end states, and this established that seeing the full movie makes no difference to acceptability ratings: it is what happens to the liquid that counts. Experiment 13 then compared end states to an earlier frame in the movie showing the liquid starting to protrude from the pouring container, in order to assess whether participants predict what will happen to the liquid in order to make judgements about the appropriateness of over/under/above/below.
Although acceptability ratings were overall lower for the predicted scenes than for the end state scenes, effects of geometry, function and interactions between these variables and over/under versus above/below were still present, indicating that participants do predict where the liquid will go in order to ascertain the appropriateness of these prepositions. Experiment 14 confirmed this by finding a correlation between judgements of how much of the liquid will make contact with the appropriate part of the plate/dish/bowl and acceptability ratings for over/under/above/below.

Figure 2. An example of frames taken for a movie sequence for a functional scene in Experiments 11-13.

Experiment 15 tested the relative importance of geometric and extra-geometric variables for in/on/over/above. The scenes used involved a solid on top of a pile of other solids in/on a plate/dish/bowl, and location control was manipulated such that the located and reference objects were shown to move together at the same rate (strong location control condition), or the located object was shown to move independently of the plate/dish/bowl while still remaining in contact with the other solids in/on the plate/dish/bowl (non-location control condition), or the scene was presented statically. The geometry of the scene was also manipulated by varying the height of the pile of objects in/on the plate/dish/bowl. Results showed that both geometry and degree of location control affect the appropriateness of in and on to describe scenes, and these variables interacted with the type of reference object (plate/dish/bowl; objects are more likely to be on a plate and in a bowl than vice versa).

Summary of Results. The results from these experiments show that actual movement or predicted movement over time influence judgements for over/under/above/below/in/on, but that this information is weighted as a function of preposition, and of the objects (plate/dish/bowl) in the scene. These data were used in the modelling work described in section 2.2.

2.1.4 Additional Experiments

In addition to the experiments described above, we ran a number of further studies in collaboration with additional researchers, which included the following:

(a) We also ran Experiments 1-4 in Spanish revealing similar effects to English (see Coventry & Guijarro-Fuentes, 2004).

(b) We found the same effects of location control and geometry found in Experiment 15 using a production methodology with children aged from 4;1 to 7;1 (see Richards et al., 2004; Richards & Coventry, in press).

(c) We also found similar influences of geometric and extra-geometric variables on the time taken to arrange objects given a set of spatial instructions (see Coventry et al., 2003), but weaker influences of these variables on “mental” versions of similar problems (see Coventry et al., 2002).

(d) We also ran 7 experiments using abstract two dimensional shapes (following the experiments of Regier and Carlson, 2001). These experiments examined the influence of the shape of the located object and of the reference object on the acceptability of over/under/above/below. The analyses (involving multi-level modelling) are still in progress.

2 Development of Computational Model

The computational model for the processing of visual scenes and the identification of the appropriate spatial preposition consists of three main modules: (1) Vision Processing, (2) Elman Network, (3) Dual-Route Network (cf. Figure 3). The first module uses a series of Ullman-type visual routines to identify the constituent objects of a visual scene (reference object, located object and liquid). The Elman network module utilises the output information from the vision module to produce a compressed neural representation of the dynamics of the scene (e.g. movement of liquid flow between the reference and located objects). This compressed representation is given as input to the dual-route (vision and language) feedforward neural network to produce a judgement regarding the appropriate spatial terms describing the visual scene. We describe each of these modules and their development in turn.

Figure 3. Architecture of the computational model. The dotted arrows indicate functional connections between the three modules.

2.2.1 Vision processing module

In our computational model for spatial language, visual object recognition, spatial location and motion information are functionally necessary for the cognitive task. Beginning with the distinction between “what” versus “where” pathways (classically assumed to be the functionally segregated ventral and dorsal streams after Ungerleider and Mishkin, 1982), we also needed to consider the integration of object, location and motion information when deriving a neurocomputational model. Our novel neurocomputational approach to object recognition for spatial cognition represents a compromise between the dynamic operation of the recurrent neurodynamical models of Deco and Lee (2002) for selective attention, and Edelman’s (1999) feedforward chorus model for object recognition, and is conceptually congruent with Ballard et al.’s (1997) model (i.e. the output of our system is a plausible deictic pointer to objects in the visual scene). Image sequences (real object images composed into moving videos) are presented to the model, which processes them at a variety of spatial scales and resolutions for object form and motion features, yielding a visual buffer (functionally analogous to processing in the striate visual cortex). In addition to the basic scale representation, texture, edge and region boundary features are extracted. Motion cells (in the magnocellular pathway) are modelled as uni-directional brightness gradient-sensitive cells whose outputs are combined. This is outlined in Figure 4.

The attentional saliency map (Figure 4, Right) is a very low resolution (retinotopic) array of neurons which receive bottom-up activation from the static and motion features in the visual buffer, but which can be strongly inhibited when the region they code for is attended to or when object recognition is strong enough to require little further processing of a region. This represents information integration that might take place involving the kinds of information processed in the posterior parietal cortex. The saliency map is used to direct attention, and once a region is selected (analogous to a kind of spotlight of attention), the higher-resolution information contained in the visual buffer is allowed to feed forward to the object recognition stream. Since attention selects only a windowed region of the whole visual buffer for processing in inferotemporal cortex (IT), our system represents a chorus of object fragments. We use Gaussian adaptive resonance models to learn the space of fragments for each object (Williamson, 1996), leading to a probabilistic implementation.
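The attend-then-inhibit cycle described above can be illustrated with a toy winner-take-all scan. This is only a sketch under stated assumptions: the function name `scan_saliency`, the multiplicative `inhibition` factor and the small example grid are hypothetical, and the sketch models only inhibition of attended regions, not the recognition-driven inhibition or the feature extraction that the actual module performs.

```python
def scan_saliency(saliency, n_fixations, inhibition=0.9):
    """Pick successive fixations from a low-resolution saliency map.

    At each step the most active cell wins, and its activation is then
    suppressed so that attention moves on to another region.
    """
    grid = [row[:] for row in saliency]  # work on a copy of the map
    cells = [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]
    fixations = []
    for _ in range(n_fixations):
        row, col = max(cells, key=lambda rc: grid[rc[0]][rc[1]])
        fixations.append((row, col))
        grid[row][col] *= 1.0 - inhibition  # inhibit the attended region
    return fixations


# Hypothetical 2 x 2 saliency map, for illustration only.
toy_map = [[0.1, 0.9],
           [0.5, 0.2]]
print(scan_saliency(toy_map, 3))
```

In the full system each selected location would gate a window of the higher-resolution visual buffer into the object recognition stream; here the fixation sequence is simply returned.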

      

Figure 4. Left: Constituents of the Vision Processing Module and their relationships with known neural substrates. Right (Top): Snapshots of the overall saliency map after 9 fixations. Right (Bottom): Multiple Fragments of Teapot Object (A) Full visual buffer (B) Edges (C) Region/Boundary and (D) Texture

 

In the ICONIP02 conference publication (Joyce et al. 2002), we elaborate on the visual processing and selective attention mechanism and its role in a novel chorus of fragments framework for object recognition.  We show how this may form part of a larger system for spatial language comprehension and speculatively for prefrontal cortex short term visual memory and object-place binding (via the perirhinal – entorhinal – hippocampal network), all of which further ground the understanding of the visuo-spatial processing in a computational framework.

2.2.2 Elman network module

This module consists of a predictive, time-delay connectionist network similar to Elman’s (1990) simple recurrent network (SRN), which we refer to hereafter as the Connectionist Perceptual Symbol System Network (CPSSN; Joyce et al., 2003). Figure 3, middle image, shows the CPSSN as an Elman SRN. As a suitable (and plausible) input representation for the CPSSN, we propose a “what+where” code (see also Edelman, 2002). That is, the input consists of an array of 9x12 activations (representing retinotopically organised and isotropic receptive fields) where each activation records some visual stimulus in that area of the visual field. This is the output information produced by the Vision module. We augment this “field” representation with a distributed object identity code. These codes were produced by an object representation system (Joyce et al., 2002; based on Edelman’s (1999) theory) using the same videos. The CPSSN is given one set of activations as input, which feed forward to the hidden units. In addition, the previous state of the hidden units is fed to the hidden units simultaneously (to provide a temporal context, cf. Elman’s (1990) SRN model). The hidden units feed forward to produce an output which is a prediction of the next sequence item. Then, using the actual next sequence item, backpropagation is used to modify the weights (see Figure 3) to account for the error. The actual next sequence item is then used as the new input to predict the subsequent item, and so on. Using the coding scheme discussed, we have a total input vector of length 116 (the 9x12 = 108 field activations plus 8 elements that code for the object identity, e.g. liquid, bowl, cup etc.). The output is similarly dimensioned, and there were 20 hidden units (and 20 corresponding time-delayed hidden state nodes) to represent movement of the liquid.
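The predict-then-correct cycle described above can be sketched as a small Elman-style network in plain Python. This is a minimal illustration, not the CPSSN implementation: the class name `SimpleRecurrentNet`, the sigmoid activations, the absence of bias units and the weight initialisation range are all assumptions made for the sketch. As in Elman's scheme, the time-delayed hidden state is treated as an extra fixed input during the weight update.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class SimpleRecurrentNet:
    """Elman-style SRN that predicts the next frame from the current one."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)       # input -> hidden
        self.w_ctx = matrix(n_hidden, n_hidden)  # time-delayed hidden state -> hidden
        self.w_out = matrix(n_in, n_hidden)      # hidden -> predicted next frame
        self.context = [0.0] * n_hidden

    def reset(self):
        """Clear the temporal context at the start of a new sequence."""
        self.context = [0.0] * len(self.context)

    def forward(self, frame):
        """Return (hidden activations, predicted next frame) without learning."""
        hidden = [
            sigmoid(sum(w * x for w, x in zip(w_i, frame))
                    + sum(w * c for w, c in zip(w_c, self.context)))
            for w_i, w_c in zip(self.w_in, self.w_ctx)
        ]
        output = [sigmoid(sum(w * h for w, h in zip(w_o, hidden))) for w_o in self.w_out]
        return hidden, output

    def train_step(self, frame, next_frame, lr):
        """One prediction plus one backpropagation update; returns the RMS error."""
        context = self.context
        hidden, output = self.forward(frame)
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(next_frame, output)]
        # Hidden deltas are computed before the output weights are changed.
        d_hid = [
            h * (1.0 - h) * sum(d * self.w_out[k][j] for k, d in enumerate(d_out))
            for j, h in enumerate(hidden)
        ]
        for k, d in enumerate(d_out):
            for j, h in enumerate(hidden):
                self.w_out[k][j] += lr * d * h
        for j, d in enumerate(d_hid):
            for i, x in enumerate(frame):
                self.w_in[j][i] += lr * d * x
            for i, c in enumerate(context):
                self.w_ctx[j][i] += lr * d * c
        self.context = hidden  # copy-back: hidden state becomes the next context
        return math.sqrt(sum((t - o) ** 2 for t, o in zip(next_frame, output)) / len(output))
```

With the dimensions given in the text, the network would be created as `SimpleRecurrentNet(116, 20)`, and each video would be presented by calling `reset()` once and then `train_step(frame, next_frame, lr)` for every pair of consecutive frames.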

The network training regime was as follows: a collection of sequences is shown to the network in random order (but of course, the inputs within a sequence are presented one after another). Each sequence contains a field and object code for the “liquid” in the videos. Multiple CPSSN networks would be required to account for the other objects in the scenes. A root-mean-square (RMS) error measure is used to monitor the network’s performance, and the ordering of sequences is changed on each pass (to prevent destructive interference between the storage of each sequence). Initially, the network is trained with a learning rate of 0.25, and after the RMS error stabilises, this is reduced to 0.05 to allow finer modifications to weights. For 6 sequences, a total of about 150 presentations are required (each sequence is therefore presented 25 times) to reduce the RMS error averaged over the whole training set from around 35 to around 0.4.
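The training regime above (reshuffled sequence order on every pass, learning rate lowered once the error stabilises) might be organised as follows. This is a sketch under assumptions: the function name `train_with_schedule`, the `train_sequence` callback and in particular the `tolerance` used to decide that the error has “stabilised” are hypothetical, since the text does not state the criterion that was used.

```python
import random


def train_with_schedule(sequences, train_sequence, n_passes,
                        high_lr=0.25, low_lr=0.05, tolerance=1e-3, seed=0):
    """Present all sequences n_passes times, reshuffling the order each pass.

    train_sequence(sequence, lr) must train on one whole sequence and return
    its RMS error. Returns a list of (learning rate, mean RMS) per pass.
    """
    rng = random.Random(seed)
    order = list(sequences)
    lr = high_lr
    previous_rms = None
    history = []
    for _ in range(n_passes):
        rng.shuffle(order)  # new ordering on every pass
        errors = [train_sequence(seq, lr) for seq in order]
        mean_rms = sum(errors) / len(errors)
        history.append((lr, mean_rms))
        # Assumed criterion: the error has stabilised when it changes by less
        # than `tolerance` between passes; then switch to the finer rate.
        if lr == high_lr and previous_rms is not None \
                and abs(previous_rms - mean_rms) < tolerance:
            lr = low_lr
        previous_rms = mean_rms
    return history
```

With 6 sequences and `n_passes=25`, this corresponds to the roughly 150 presentations mentioned in the text.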

This network is clearly hetero-associating successive steps in the sequence of fields, but in addition, the network is performing compression and redundancy reduction (in the hidden layer) as well as utilising the state information in the time-delayed state nodes. It is also coding for the changes between sequence items (e.g. the dynamics of how the object moves over time) rather than coding individual sequence items (which would be auto-association). The model embodies the idea that representation is inherently dynamic (cf. Freyd, Pantzer & Cheng, 1988). The network should, naturally, be able to make a prediction about a sequence given any item in the sequence. Intuitively, the network should be capable of this in the case where a cue is the first item of a sequence, since the time-delayed state is irrelevant (i.e., there can be no temporal context accumulated in the time-delay nodes). However, we propose that the network is a mechanism for implementing perceptual symbols, and therefore a requirement is that it can “replay” the properties of the visual episode that was learned. Given a cue, the network should produce a prediction, which can be fed back as the next input to produce a sequence of “auto-generated” predictions about a sequence (i.e. a perceptual symbol). Indeed, this network is able to predict the final outcome of the visual scenes. Prediction data were reported in the ICCM conference publication (Joyce et al., 2003). These were also used in the final part of the project, to study the predictive ability of the overall computational model.
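The “replay” mechanism described above, in which each prediction is fed back as the next input, can be sketched independently of any particular network. The function name `autogenerate` and the `predict_next` callback are hypothetical; in practice the callback would wrap a trained recurrent network that keeps its own temporal context between calls.

```python
def autogenerate(predict_next, cue_frames, n_generated):
    """Prime a predictor with cue frames, then let it run on its own output.

    predict_next(frame) returns the predicted next frame (and may update
    internal state). Returns the list of n_generated auto-generated frames.
    """
    if not cue_frames or n_generated < 1:
        raise ValueError("need at least one cue frame and one generated frame")
    prediction = None
    for frame in cue_frames:          # real frames drive the network first
        prediction = predict_next(frame)
    generated = [prediction]
    while len(generated) < n_generated:
        prediction = predict_next(prediction)  # feed the prediction back in
        generated.append(prediction)
    return generated


# Toy stand-in for a trained network: each "frame" is a one-element list
# and the next frame is simply the current value plus one.
toy_predictor = lambda frame: [frame[0] + 1]
print(autogenerate(toy_predictor, [[0], [1], [2]], 4))
```

Cueing with three frames and generating four mirrors a seven-frame sequence in which only the first three frames are shown to the network.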

2.2.3 Dual-route network

The dual-route network is a feedforward neural network (3-layer perceptron) that receives as input the grounded “visual” information (hidden activations of the Elman networks) and linguistic data (name of located object, name of reference object, name of liquid + 4 spatial prepositions over, above, below, under). As output it must reproduce (auto-associate) the same visual data, and produce the names of the objects, which are directly grounded in the input visual data. In addition, the four output units for the spatial prepositions encode the rating values given by subjects. This architecture is directly inspired by dual-route networks for the grounding of language (Plunkett et al., 1992; Cangelosi et al., 2000).

This network is trained via the error backpropagation algorithm. The training and test sets consist of 216 scenes. These are the same as those used in the experiment on the rating of over, above, under, below (Experiment 11 above). Of these stimuli, 195 are used for training and 21 for the generalisation test. The overall objective of the training is that the network must learn to produce the same average ratings for the four prepositions. We did not use the average ratings as the teaching input, because this would violate the principle of mutual exclusivity (Markman, 1987). During standard backpropagation training, the use of the ratings as teaching input assumes that the same scene must be simultaneously associated with the use of all four prepositions (each with an activation value proportional to the subjects’ average rating). Instead, during developmental learning subjects tend to choose only one preposition to describe a scene. Naturally, the probability of choosing one preposition to describe a spatial relation is correlated with its level of appropriateness (i.e. similar to ratings). Therefore, to simulate such a learning strategy better, the original ratings of each scene-preposition pair were converted into the frequency of presentation of a stimulus with an associated localist teaching input (where the output unit of the chosen preposition is 1 and the other three units are 0). To obtain such a frequency, the original average ratings were scaled and normalised within each scene and also within the whole training set. For example, individual prepositions’ ratings of 7.08 (above), 7.12 (below), 3.96 (over), 4.32 (under) respectively correspond to presentation frequencies of 28, 28, 7 and 9. The conversion of ratings into presentation frequencies resulted in an epoch of 2100 stimuli.
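The idea of replacing graded teaching inputs with frequency-weighted localist ones can be sketched as follows. This is an illustration only: the function names and the `presentations_per_scene` parameter are hypothetical, and the simple within-scene proportional scaling used here is an assumption. The scaling actually used in the project (which also normalised across the whole training set) is not specified in this summary, so this sketch is not expected to reproduce the frequencies quoted in the text.

```python
PREPOSITIONS = ("over", "under", "above", "below")


def ratings_to_frequencies(ratings, presentations_per_scene):
    """Share a scene's presentations among prepositions in proportion to rating.

    Assumed scheme: plain proportional scaling within a single scene.
    """
    total = sum(ratings.values())
    return {prep: round(presentations_per_scene * rating / total)
            for prep, rating in ratings.items()}


def localist_stimuli(scene_id, frequencies):
    """Expand frequencies into (scene, one-hot teaching input) training items."""
    stimuli = []
    for prep, count in frequencies.items():
        target = [1 if p == prep else 0 for p in PREPOSITIONS]
        stimuli.extend((scene_id, target) for _ in range(count))
    return stimuli


# Mean ratings for one scene, taken from the example in the text.
scene_ratings = {"above": 7.08, "below": 7.12, "over": 3.96, "under": 4.32}
frequencies = ratings_to_frequencies(scene_ratings, presentations_per_scene=72)
training_items = localist_stimuli("scene_001", frequencies)
```

Each training item pairs the scene with exactly one active preposition unit, so better-rated prepositions are simply taught more often rather than with a higher target value.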

Three networks were trained using different initial random weights and different random sets of 21 generalisation test stimuli. The training parameters included a learning rate of 0.01, a momentum of 0.8, and a total of 500 training epochs. The average final error (RMS) for the 30 vision units was 0.008 for both training and testing data, and 0.003 for the 6 output units of the object names. More importantly, for the 4 spatial preposition output units, the error was 0.044 with training data and 0.05 with generalisation data. The error values in the preposition units were calculated off-line by comparing the actual output of the 4 preposition units with the rating data (from Experiment 11) converted to produce the stimulus frequencies (the actual error values used for the weight correction are always higher because they use the localist teaching input). These results clearly indicate that the networks produce rating values similar to those of experimental subjects. They also indicate that the training algorithm based on presentation frequency, instead of rating teaching input, works well and provides a psychologically-plausible learning regime. Similar results have also been found for in/on/over/above using the scenes and data from Experiment 15.

3 Interplay between Experimental and Computational Work

During the research, the development of the computational model was conducted in parallel with the experimental investigations. In the first part of the project, however, it was mostly the experimental work that influenced the model design. For example, in the previous section we explained that the training/test stimuli and the rating values were taken directly from one experiment. In the final months of the research, it was the model that guided some of the directions and objectives of the experimental investigation. In particular, new simulations produced some predictions that were subsequently tested in new experiments.

Research on the design and testing of the Elman module had shown that these networks were able to predict and auto-generate the final outcome of the visual scenes, once they were given an initial cue (e.g. a few initial frames). The network would produce the next prediction frames, which were fed back as the next input. To integrate this prediction ability into the overall spatial language model, the hidden activation values of these auto-generated sequences were used as the visual input of the dual-route network. The model was then run as usual to produce the ratings of the 4 prepositions.

To establish whether the new ratings provided by the model were consistent with those produced by real subjects, a new experiment was conducted (Experiment 13, see above). The results for this experiment, together with the results of Experiment 14, strongly suggest that subjects had to mentally “play” the visual scene and auto-generate the outcome of the scene to rate the linguistic utterance. This is very similar to what the model does, when the Elman network auto-generates the visual scene and the dual-route network uses the Elman net’s activations to produce new ratings. The Elman network used the first 3 out of 7 frames. These correspond to frames 0, 10 and 20 (Elman networks only see one frame in every 10). The comparison of the subjects’ rating data and the networks’ output of the 4 prepositions resulted in an RMS error of 0.051 (Figure 5). This is a very low error level, and confirms that the model predicted the ratings very accurately. Overall, this result and those on the dual-route tests support the development of a psychologically-plausible model for spatial language.

A paper which combines the results of Experiments 11-16 with a comprehensive outline of the development and testing of the full model is in preparation for submission to Cognitive Psychology (Coventry et al., in preparation).

Figure 5. Output of the Elman network for the auto-generated prediction of the outcome of the liquid. The top 7 figures correspond to the actual output of the network. The frames in the bottom row indicate the final 4 target frames (the network receives the first 3 target frames).

References

Ballard, D. H., Hayhoe, M. M., Pook, P. K. & Rao, R. P. N. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20, 723-767.

Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22(4), 577-660.

Cangelosi A., Coventry K.R. et al. (in preparation), Grounding spatial language in perception. 9th Neural Computation and Psychology Workshop: Modelling Language, Cognition and Action

Cangelosi A., Greco A. & Harnad S. (2000). From robotic toil to symbolic theft: Grounding transfer from entry-level to higher-level categories. Connection Science, 12(2), 143-162

Cangelosi A., Martinez G.C. (2001). Neural networks for spatial language processing using virtual reality. World Congress on Neuroinformatics: Part II Proceedings, ARGESIM Verlag.

Coventry K.R., Cangelosi A. et al. (in preparation), A computational and experimental model for the geometric functions framework in spatial language. Cognitive Psychology

Coventry K.R., Richards L., Joyce D. & Cangelosi A. (in preparation), Towards a psychological plausible model of spatial language processing. Journal of Memory and Language

Coventry, K. R. & Garrod, S. C. (2001). Towards the development of a psychologically-plausible model for spatial language comprehension embodying geometric and functional relations. Proceedings of the 2nd Annual Language and Space workshop: Defining Functional and Spatial Features. University of Notre Dame, Indiana.

Coventry, K. R. & Garrod, S. C. (2004). Saying, Seeing and Acting: The Psychological Semantics of Spatial Prepositions. Essays in Cognitive Psychology Series. Psychology Press. Hove and New York.

Coventry, K. R. & Garrod, S. C. (in press). Spatial prepositions and the functional geometric framework. Towards a classification of extra-geometric influences. In L. A. Carlson & E. van der Zee (Eds.), Functional features in language and space: Insights from perception, categorization and development. Oxford University Press.

Coventry, K. R. & Guijarro-Fuentes, P. (2004). Las preposiciones en español y en inglés: la importancia relativa del espacio y función. (Spatial prepositions in Spanish and English: the relative importance of space and function). Cognitiva, 16(1), 73-93.

Coventry, K. R. & Olivier, P. (Eds.) (2002). Spatial Language. Cognitive and Computational Perspectives. Dordrecht, the Netherlands; Kluwer Academic Publishers, pp283.

Coventry, K. R. (2003). Spatial prepositions, spatial templates and “semantic” versus “pragmatic” visual representations. In E. van der Zee and J. Slack (Eds.), Representing Direction in Language and Space, pp255-267. Oxford University Press.

Coventry, K. R., Cangelosi, A., Joyce, D. & Richards, L. V. (2002). Putting geometry and function together - Towards a psychologically-plausible computational model for spatial language comprehension. In W. D. Gray & C. D. Schunn (Eds.), Proceedings of the Twenty-fourth Annual Conference of the Cognitive Science Society, p33. Lawrence Erlbaum Associates, Mahwah, NJ.

Coventry, K. R., Prat-Sala, M., & Richards, L. (2001). The interplay between geometry and function in the comprehension of ‘over’, ‘under’, ‘above’ and ‘below’. Journal of Memory and Language, 44, 376-398.

Coventry, K. R., Venn, S. & Armstead, P. (2002). Object knowledge and the construction of spatial mental models. Cahiers de Psychologie Cognitive, 21(6), 635-652.

Coventry, K. R., Venn, S. F., Smith, G. D. & Morley, A. M. (2003). Spatial problem solving and functional relations. European Journal of Cognitive Psychology, 15(1), 71-99.

Deco, G. & Lee, T. S. (2002). A unified model of spatial and object attention based on inter-cortical biased competition. Neural Computation (in press).

Edelman, S. (1999). Representation and Recognition in Vision. MIT Press.

Edelman, S. (2002). Constraining the neural representation of the visual world. Trends in Cognitive Sciences, 6, 125-131.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Freyd, J. J. & Finke, R. A. (1984). Representational momentum. Journal of Experimental Psychology: Learning, Memory and Cognition, 10, 126-132.

Gapp, K.-P. (1995). Angle, distance, shape, and their relationship to projective relations. In J. D. Moore & J. F. Lehman (Eds.), Proceedings of the 17th Annual Conference of the Cognitive Science Society (pp. 112-117). Mahwah, NJ: Cognitive Science Society.

Glenberg, A. M. (1997). What memory is for. Behavioral and Brain Sciences, 20(1), 1-55.

Herskovits, A. (1986). Language and Spatial Cognition. An interdisciplinary study of the prepositions in English. Cambridge University Press.

Joyce, D., Richards, L., Cangelosi, A. & Coventry, K. R. (2002). Object representation-by-fragments in the visual system: A neurocomputational model. In L. Wang, J. C. Rajapakse, K. Fukushima, S. Y. Lee & X. Yao (Eds.), Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'02). IEEE Press, Singapore.

Joyce, D. W., Richards, L. V., Cangelosi, A. & Coventry, K. R. (2003). On the foundations of perceptual symbol systems: Specifying embodied representations via connectionism. In F. Detje, D. Dörner & H. Schaub (Eds.), The Logic of Cognitive Systems. Proceedings of the Fifth International Conference on Cognitive Modelling, pp147-152. Universitäts-Verlag Bamberg, Germany.

Landau, B., & Jackendoff, R. (1993). 'What' and 'where' in spatial language and cognition. Behavioral and Brain Sciences, 16(2), 217-265.

Landauer, T., & Dumais, S. (1997). A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.

Logan, G. D., & Sadler, D. D. (1996). A computational analysis of the apprehension of spatial relations. In P. Bloom, M. A. Peterson, L. Nadel, & M. F. Garrett (Eds.), Language and Space (pp. 493-530). Cambridge, Mass.: MIT Press.

Martinez, G. C., Cangelosi, A. & Coventry, K. R. (2001). A hybrid neural network and virtual reality system for spatial language processing. Proceedings of the 2001 International Joint Conference on Neural Networks. IEEE Press. vol. 1, 16-21 Washington DC.

Plunkett, K., Sinha, C., Møller, M. F. & Strandsby, O. (1992). Symbol grounding or the emergence of symbols? Vocabulary growth in children and a connectionist net. Connection Science, 4(3-4), 293-312.

Regier, T. (1996). The human semantic potential: Spatial language and constrained connectionism. Cambridge Mass.: MIT Press.

Regier, T., & Carlson, L. A. (2001). Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General, 130(2), 273-298.

Richards, L. V. & Coventry, K. R. (2001). Children’s production of locative prepositions in English; the influence of geometric and extra-geometric factors. In Proceedings of the 2nd Annual Language and Space workshop: Defining Functional and Spatial Features. University of Notre Dame, Indiana.

Richards, L. V. & Coventry, K. R. (in press). Children’s production of locative prepositions in English; the influence of geometric and extra-geometric factors. In L. A. Carlson & E. van der Zee (Eds.), Functional features in language and space: Insights from perception, categorization and development. Oxford University Press.

Richards, L. V., Coventry, K. R. & Clibbens, J. (2004). Where’s the orange? Geometric and extra-geometric factors in English children’s talk of spatial locations. Journal of Child Language, 31, 153-175.

Talmy, L. (1983). How language structures space. In H. Pick, & L. Acredolo (Eds.), Spatial Orientation: Theory, research and application (pp. 225-282). New York: Plenum Press.

Williamson, J. R. (1996). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9(5), 881-897.