
 Vision, Action, and Language
Unified by Embodiment

An RCUK Cognitive Systems Foresight Project

 

 

 

Background

   Psychological Studies on Vision, Action and Language

   It is increasingly recognised that cognition should not be regarded as a set of disembodied processes, but is strongly determined by the constraints of its bodily implementation and by its being situated in the world with which it interacts. In the case of visual cognition, this embodied approach has led to an emphasis on the role of vision in exploring the world, and therefore on the integration of vision and action (see for instance O’Regan & Noë, 2001). There is accumulating human behavioural evidence that vision and action form a closely integrated and highly dynamic system (e.g. Tucker & Ellis, 1998, 2001; Craighero, Bello, Fadiga & Rizzolatti, 2002; Fischer & Hoellen, 2004).

  

   One consequence of this integration of the vision and action systems is that seeing an object, even when there is no intention to handle it, potentiates elements of the actions needed to reach and grasp it. For instance, participants who viewed photographs of common objects in order to decide whether they were manufactured or organic responded more easily if the grip needed to make the response was one that would be used to handle the viewed object (Tucker & Ellis, 2001). So, for example, signalling that a pea was organic was easier (faster and more accurate) if the response required a precision grip (using only the thumb and forefinger) rather than a power grip (between the four fingers and the palm). Similar object-to-action compatibility effects are observed for the hand of reach and for the wrist rotation required to align the hand with an object (Tucker & Ellis, 1998; Ellis & Tucker, 2000). The authors coined the term ‘micro-affordances’ to describe these potentiated elements of an action. The automatic derivation of object affordances is also supported by a body of evidence from neuroscience that includes recent imaging studies in humans (e.g. Grèzes, Tucker, Armony, Ellis & Passingham, 2003), single cell recordings in monkeys (Sakata, Taira, Mine & Murata, 1992; Fadiga, Fogassi, Gallese & Rizzolatti, 2000) and neural network models (Arbib et al. 2000, 1998; Gross, Heinze, Seiler & Stephan, 1999). Similar effects can be observed during language comprehension. Reading action verbs activates motor and premotor cortices during lexical access in a somatotopic fashion (Pulvermüller, 2005), suggesting that motor activation is inherent to language comprehension. Classifying the names of objects affects actions in much the same way as classifying the objects themselves (Tucker & Ellis, 2004).

  

   Visual attention and eye movements are fundamental components of human exploratory behaviour, and are implicated in the integration of vision, action and language. Our eyes are exquisitely sensitive to the combined demands of vision, action and language processing. We move our eyes to project objects of interest onto the foveal area of high visual resolution. When we interact with objects, our eyes move ahead of the hand to support the on-line control of grasping (e.g., Bekkering & Neggers, 2002). When we hear verbal descriptions of scenes that we are simultaneously watching, we tend to look at objects that are about to be mentioned (Tanenhaus et al., 1995). While eye position usually reflects overt attention, observers can also attend to new objects while the eyes remain fixated. Such covert attention can be measured behaviourally as faster and more accurate perception of attended compared to unattended objects at equal distances from the current eye position. Merely seeing objects activates plans for actions directed to them (e.g., Tucker & Ellis, 2001; Fischer & Dahl, 2007). Tools are known to attract attention, and action relations between objects (e.g., a bottle and a corkscrew) lead to attention being allocated to both (Riddoch et al., 2003). Moreover, whenever the eyes or hands are directed at a new target, a covert shift of attention to that object precedes the overt shift. Having to ignore an object in a search for a target object seems to require the inhibition of the actions associated with the ignored object (Ellis et al., in press).

   Computational Modelling of Language and Action

   The design of linguistic communication between autonomous agents, such as robots or simulated agents, has recently attracted the interest of researchers from different fields, such as computer science, engineering and cognitive science. Investigations into the emergence of language, both in evolutionary (Cangelosi & Parisi 2002) and developmental terms (MacWhinney 1998), have greatly benefited from the use of computational models. Amongst the various computational approaches, some are based on cognitive and developmental robotics and provide a more integrative vision of language and cognition. In these models, the agent’s linguistic abilities are strictly dependent on, and grounded in, other behaviours and skills. Numerous sensorimotor, cognitive, neural, social and evolutionary factors contribute to the emergence and establishment of communication and language. For example, in these models there exists an intrinsic link between the communication symbols (words) used by the agent and its own cognitive representations (meanings) of the perceptual and sensorimotor interaction with the external world (referents). Such a grounded and embodied approach to language design is consistent with the psychologically plausible theories of the grounding of language discussed in the previous section (Cangelosi & Riga, 2006).

  

   In such cognitive robotic models, communication results from the dynamical interaction between the robot’s physical body, its cognitive system and the external physical and social environment. Some studies stress the grounding in action and sensorimotor processes, such as the model of robotic arms by Marocco et al. (2003) and the mobile robots of Vogt (2000). Other robotic models highlight the grounding through social interaction, such as the AIBO robots of Steels & Kaplan (2000). Other studies are based on simulated adaptive agents. They model the agent and its environment in a good degree of detail, so that emergent meanings can be constructed directly upon this model. These simulation models have focused on grounding in perceptual experience and in cognitive representations and sensorimotor interactions (e.g. Cangelosi 2001).

  

   In the cognitive modelling literature, there has also been some work specifically focused on the integration of action and vision knowledge in cognitive agents and in connectionist models. For example, Arbib and colleagues have developed a neural model for action learning directly inspired by brain imaging studies on grasping in primates, and applied it to simulations of action imitation learning (Arbib et al. 2002). Haruno, Wolpert and Kawato (2001) proposed the MOSAIC architecture for simulated object manipulation tasks, demonstrating that the model can generalise action-object associations depending on the object shape. Tsiotas, Borghi and Parisi (2005) developed an artificial life model for simulating some of Tucker & Ellis’s (2001) findings. They use a simplified 2D arm model to study the evolutionary learning of object affordances. In the area of connectionist modelling, Yoon, Heinke and Humphreys (2002) have proposed a neural network model for action and name selection for objects (NAM, the Naming and Action Model) that supports the role of a direct perception-action route in action selection. This model uses abstract (localist) encoding of action, perceptual and semantic information, rather than providing a robotic implementation, but is useful because it focuses on the comparison of perceptual and semantic information in action selection.