
 Vision, Action, and Language
Unified by Embodiment

An RCUK Cognitive Systems Foresight Project

 

 

 

Objectives

   The aim of this research project is to investigate the processes and mechanisms leading to an integration of vision, action and language in natural cognitive systems (namely human participants) and to use this advancement in knowledge for the design of psychologically plausible artificial cognitive agents (simulated robots) able to communicate about the world they perceive and act upon it. This scientific and technological aim will be achieved through the following research objectives:

 

o    To explore the interface between language, action and vision through eye-tracking experiments and studies of the action component of object representation (micro-affordances)

o    To identify the time course of action, vision and language integration processes in tasks requiring selective attention, object search, and object manipulation and assembly under verbal instructions

o    To develop a cognitive robotic model capable of object manipulation and language use, based on the psychologically plausible embodied cognition principles identified through the two objectives above.

Methodology

   This collaborative research project takes an interdisciplinary approach that combines computational modelling (cognitive robotics and neural networks) with experimental methodologies (stimulus-response compatibility studies and eye-tracking experiments). The cognitive robotic platform developed during the project will serve as a tool to test the feasibility of the vision/action/language integration mechanisms identified in the experimental studies, and to demonstrate the technological potential of such an approach. Observation and analysis of the robot’s cognitive and linguistic capabilities will also yield new, testable predictions about the mechanisms integrating vision, action and language. Replicating in a robotic model the psychological phenomena observed in experimental studies will permit fine-grained analysis and understanding of the neural and behavioural processes that contribute to action-vision-language integration (Cangelosi & Parisi 2002).

  

   Empirical studies will be based on two main experimental approaches: (i) a stimulus-response compatibility paradigm and (ii) eye-tracking studies. The stimulus-response compatibility (SRC) procedure will vary the relationship between the response made to a target object and the actions associated with that target and with non-target objects. For example, participants will respond with a precision or power grip to some property (category or shape) of a three-dimensional target object. Both targets and non-targets may be compatible or incompatible with the response. Two sets of human behavioural investigations will employ this procedure. The first set (experiments 1.1–1.4 described in section 2.3) will investigate the role of attention in visuo-motor integration. Ellis, Tucker, Symes and Ellis (in press) show that, in selecting a target object, the actions associated with non-target (distractor) objects are inhibited. We will extend these findings to gather behavioural data on the time course of action inhibition and potentiation during target object selection in multi-object scenes. These data will inform the robotic model (see below), in which object selection is the outcome of competition between vision-action assemblies in a distributed system. The second set of SRC experiments (experiments 2.8–2.12 described in section 2.3) will investigate the interface between language and visual objects by introducing object names as distractors and targets.
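
   As an illustration of the kind of measure an SRC study yields, the sketch below computes a compatibility effect as the difference in mean reaction time between incompatible and compatible trials. It is a minimal sketch, not the project's analysis pipeline: the trial format, the function name compatibility_effect and the reaction times are all hypothetical and invented for the example.

```python
from statistics import mean

# Hypothetical trial records: (grip used to respond, grip afforded by the
# target object, reaction time in ms). Values are invented for illustration.
trials = [
    ("precision", "precision", 512.0),
    ("precision", "power", 548.0),
    ("power", "power", 498.0),
    ("power", "precision", 541.0),
    ("precision", "precision", 520.0),
    ("power", "precision", 535.0),
]


def compatibility_effect(trials):
    """Return mean RT on incompatible trials minus mean RT on compatible trials."""
    compatible = [rt for response, afforded, rt in trials if response == afforded]
    incompatible = [rt for response, afforded, rt in trials if response != afforded]
    if not compatible or not incompatible:
        raise ValueError("need at least one trial in each condition")
    return mean(incompatible) - mean(compatible)


effect = compatibility_effect(trials)
print(f"compatibility effect: {effect:.1f} ms")
```

   Under these assumptions, a positive value means that responses were slower when the required grip conflicted with the grip afforded by the object.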

  

   A complementary set of behavioural studies will be based on the eye-tracking methodology. This permits the identification of the time course of visuo-attentional processes in action and language processing and will provide evidence on object selection that converges with the SRC studies. Eye-tracking data will also be used to constrain the behavioural and attentional strategies used by the simulated cognitive robots during tasks involving object naming and selection. In the eye-tracking experiments we will show arrays of novel objects and study three levels of action representation. At the encoding level, we manipulate the location and onset time of a visual detection probe in the array to reveal how observers attend and prepare their actions (Fischer et al., in press). At the representational/linguistic level, we present auditory object names and register the observer’s eye movements towards the named objects (visual world paradigm, e.g. Altmann & Kamide, 2004). Linguistic manipulations, such as using phonological competitors (“candle-candy”), reveal the time course of the interplay between covert and overt attention and the relative strength of top-down (linguistic) versus bottom-up (visual) control over action prediction. Finally, at the execution level, we instruct participants to pick up the named object and record their overt manual responses (e.g., Chambers et al., 2002, 2004). Orthogonal to these three levels of embodiment, we gradually associate each novel object with a particular name and manual response, and we design object arrays with congruent and incongruent response requirements. This learning approach enables us to track embodied concept acquisition and its implications for action control separately at the encoding, linguistic/representational, and execution levels.
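
   Time-course data from a visual world study are commonly summarised as the proportion of fixations on each object within successive time bins after name onset. The sketch below shows one way to compute this. It is an assumption-laden illustration: the sample format, the 200 ms bin width, the function name fixation_proportions and the data themselves are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical gaze samples: (time in ms from spoken-name onset, object fixated).
# "competitor" stands for a phonological competitor such as candy for candle.
samples = [
    (0, "other"), (0, "competitor"), (0, "other"), (0, "target"),
    (200, "target"), (200, "competitor"), (200, "competitor"), (200, "other"),
    (400, "target"), (400, "target"), (400, "competitor"), (400, "target"),
]


def fixation_proportions(samples, bin_ms=200):
    """Map each time-bin start to {object: proportion of samples in that bin}."""
    counts = defaultdict(Counter)
    for time_ms, obj in samples:
        bin_start = (time_ms // bin_ms) * bin_ms
        counts[bin_start][obj] += 1
    return {
        bin_start: {obj: n / sum(c.values()) for obj, n in c.items()}
        for bin_start, c in sorted(counts.items())
    }


props = fixation_proportions(samples)
for bin_start, by_object in props.items():
    print(bin_start, by_object)
```

   In this made-up example the proportion of looks to the target rises across bins while looks to the competitor first rise and then fall, which is the kind of pattern the phonological-competitor manipulation is designed to expose.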

  

   These behavioural studies will motivate the development of a new robotic model that allows mutual interactions between language and visual object representation and supports analysis of the time course of vision, action and language processes. Each robotic agent developed in this project will be a simulated robot with a head, a torso, and two arms and hands. The simulator will implement a body configuration (sensors and actuators) based on the upper torso of the humanoid platform iCub. Robotic agents will be trained to interact with objects, such as artefacts and tools (grasp cup, use hammer), so that they acquire a sensorimotor and functional (micro-affordance) representation of the objects through eye and hand movements. The visual input to the robot’s neural controller will consist of pre-processed information about object properties (e.g. size, colour, location). This information will be read directly from the physics simulator. The extraction of visual features for further processing and integration (within connectionist networks) with motor representations will be based on Ullman’s visual routines approach. This hybrid vision/connectionist approach was developed by Cangelosi and collaborators in their previous EPSRC grant GR/N38145 on the perceptual grounding of spatial terms (e.g. Joyce et al. 2003). In addition to vision, the robot will receive tactile information and proprioceptive data on its own body posture. Agents will also be trained to label objects and actions. A connectionist network will be used to learn and guide the behaviour of the robot and to acquire embodied representations of objects and actions. The neural architecture will also have recurrent structures to permit information integration and the execution of actions such as grasping (e.g. Marocco et al. 2003). After training, the robot will be used to simulate vision-action-language experiments.
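
   To make the idea of a recurrent neural controller concrete, the following sketch implements a small Elman-style network that maps one observation vector to output activations and carries a hidden state from step to step. The text above specifies only a connectionist network with recurrent structures, so the Elman architecture, the layer sizes, the input encoding, the class name ElmanController and the untrained random weights are all assumptions made for illustration.

```python
import math
import random


class ElmanController:
    """Minimal Elman-style recurrent network with random, untrained weights."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)       # input -> hidden
        self.w_rec = matrix(n_hidden, n_hidden)  # previous hidden -> hidden
        self.w_out = matrix(n_out, n_hidden)     # hidden -> output
        self.hidden = [0.0] * n_hidden           # recurrent state

    def step(self, inputs):
        """Advance one time step and return output activations in (0, 1)."""
        new_hidden = []
        for w_i, w_r in zip(self.w_in, self.w_rec):
            net = sum(w * x for w, x in zip(w_i, inputs))
            net += sum(w * h for w, h in zip(w_r, self.hidden))
            new_hidden.append(math.tanh(net))
        self.hidden = new_hidden
        return [
            1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, self.hidden))))
            for row in self.w_out
        ]


# Hypothetical input encoding: object size, colour, x, y (pre-processed vision),
# one touch sensor, and two joint angles (proprioception).
controller = ElmanController(n_in=7, n_hidden=5, n_out=3)
observation = [0.3, 0.8, 0.1, -0.2, 0.0, 0.5, -0.5]
first = controller.step(observation)
second = controller.step(observation)
print(first)
print(second)
```

   Because the hidden state feeds back into the next step, presenting the same observation twice gives different outputs. This dependence on history is what lets a recurrent controller integrate information over time during an extended action such as a grasp.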

  

   The development of the robotic agents will combine epigenetic robotics methodologies with “embodied connectionist” modelling. Epigenetic (developmental) robotics uses embodied robotic systems that are situated in a physical and social environment and undergo a prolonged epigenetic developmental process through which they acquire cognitive capabilities (Weng et al. 2001). Embodied connectionism refers to the use of artificial neural networks for the learning and control of behaviour in cognitive robotic agents. Integrating robotics and connectionist methodologies transfers the principles and advantages of connectionism and parallel distributed processing systems to embodied robotic agents (Cangelosi & Riga 2006). Cangelosi and collaborators at the ABC group in Plymouth have already used this methodology to study basic action manipulation tasks (Cangelosi & Riga 2006; Massera, Nolfi & Cangelosi 2006) and have successfully extended it to a repertoire of more than 100 action combinations (Tikhanoff et al. 2007). Detailed analyses of the neural network activity that controls behaviour, and of the time course of the processes and representations activated by the robot’s neural controller, will be used to better understand the behaviour observed in human participants and to derive novel predictions about interactions between vision, action and language.