Background
Psychological Studies on Vision, Action and Language
It is increasingly recognised that cognition should
not be regarded as a set of disembodied processes, but is strongly
determined by the constraints of its bodily implementation and by its being
situated in the world with which it interacts. In the case of visual
cognition, this embodied approach has led to an emphasis on the role of
vision in exploring the world, and therefore on the integration of vision
and action (see, for instance, O’Regan & Noë, 2001). There is certainly
accumulating human behavioural evidence that vision and action form a
closely integrated and highly dynamic system (e.g. Tucker & Ellis,
1998, 2001; Craighero, Bello, Fadiga & Rizzolatti, 2002; Fischer &
Hoellen, 2004).
One consequence of this integration of the
vision and action systems is that seeing an object, even when there is no
intention to handle it, potentiates elements of the actions needed to reach
and grasp it. For instance, participants who viewed photographs of common
objects in order to decide whether they were manufactured or organic
responded more readily if the grip needed to make the response was one
that would be used to handle the viewed object (Tucker & Ellis, 2001).
So, for example, signalling that a pea was organic was easier (faster and
more accurate) if a precision grip (using only the thumb and forefinger)
was needed for the response compared to using a power grip (between the
four fingers and palm). Similar object-to-action compatibility effects are
observed for the hand of reach and the wrist rotation required to align the
hand with an object (Tucker & Ellis, 1998; Ellis & Tucker, 2000).
The authors coined the term ‘micro-affordances’ to describe these
potentiated elements of an action. The automatic derivation of object
affordances is also supported by a body of evidence from neuroscience that
includes recent imaging studies in humans (e.g. Grèzes, Tucker, Armony,
Ellis & Passingham, 2003), single-cell recordings in monkeys (Sakata,
Taira, Mine & Murata, 1992; Fadiga, Fogassi, Gallese & Rizzolatti,
2000) and neural network models (Arbib et al. 2000, 1998; Gross, Heinze,
Seiler & Stephan, 1999). Similar effects can be observed during
language comprehension. Reading action verbs activates motor and premotor
cortices during lexical access in a somatotopic fashion (Pulvermüller,
2005), suggesting that motor activation is inherent to language
comprehension. Classifying the names of objects produces effects on
actions similar to those of classifying the objects themselves (Tucker &
Ellis, 2004).
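
The compatibility effects described above are usually expressed as a difference in mean response time between incompatible and compatible trials. The following is a minimal sketch of that arithmetic, not the analysis used in any of the cited studies. The trial values, the tuple layout and the function name `compatibility_effect` are all hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical trials: (grip afforded by object, grip used to respond, RT in ms).
# The values are invented for illustration and are not data from any study.
trials = [
    ("precision", "precision", 520),
    ("precision", "power", 560),
    ("power", "power", 510),
    ("power", "precision", 550),
    ("precision", "precision", 530),
    ("power", "precision", 570),
]


def compatibility_effect(trials):
    """Mean RT on incompatible trials minus mean RT on compatible trials."""
    compatible = [rt for obj, resp, rt in trials if obj == resp]
    incompatible = [rt for obj, resp, rt in trials if obj != resp]
    return mean(incompatible) - mean(compatible)


effect = compatibility_effect(trials)
```

A positive value indicates that responses were faster when the response grip matched the grip afforded by the viewed object.
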
Visual attention and eye movements are
fundamental components of human exploratory behaviour, and are
implicated in the integration of vision, action and language. Our eyes are
exquisitely sensitive to the combined demands of vision, action and
language processing. We move our eyes to project objects of interest onto
the foveal area of high visual resolution. When we interact with objects,
our eyes move ahead of the hand to support the on-line control of grasping
(e.g., Bekkering & Neggers, 2002). When we hear verbal descriptions of
scenes that we simultaneously watch, we tend to look at objects that are
about to be mentioned (Tanenhaus et al., 1995). While eye position usually
reflects overt attention, observers can also attend to new objects while
the eyes remain fixated. Such covert attention can be measured behaviourally
as faster and more accurate perception of attended compared to unattended
objects at equal distances from the current eye position. Merely seeing
objects activates plans for actions directed to them (e.g., Tucker &
Ellis, 2001; Fischer & Dahl, 2007). Tools are known to attract
attention, and action relations between objects (e.g., bottle, corkscrew)
lead to attention being allocated to both (Riddoch et al., 2003). However,
whenever the eyes or hands are directed at a new target, a covert shift of
attention to that object precedes the overt shift. Having to ignore an
object in a search for a target object seems to require the inhibition of
the actions associated with the ignored object (Ellis et al., in press).
Computational Modelling of Language and Action
The design of linguistic
communication between autonomous agents, such as robots or simulated
agents, has recently attracted the interest of researchers from different
fields, including computer science, engineering and cognitive science.
Investigations into the emergence of language, both in evolutionary
(Cangelosi & Parisi 2002) and developmental terms (MacWhinney 1998),
have greatly benefited from the use of computational models. Amongst the
various computational approaches, some are based on cognitive and
developmental robotics and provide a more integrative vision of
language and cognition. Under this view, the agent’s linguistic abilities are strictly
dependent on, and grounded in, other behaviours and skills. Numerous sensorimotor,
cognitive, neural, social and evolutionary factors contribute to the
emergence and establishment of communication and language. For example, in
these models there exists an intrinsic link between the communication
symbols (words) used by the agent and its own cognitive representations
(meanings) of the perceptual and sensorimotor interaction with the external
world (referents). Such a grounded and embodied approach to language design
is consistent with the psychologically plausible theories of the grounding
of language discussed in the previous section (Cangelosi & Riga, 2006).
In such cognitive robotic models, communication
results from the dynamical interaction between the robot’s physical body, its
cognitive system and the external physical and social environment. Some
studies stress the grounding in action and sensorimotor processes, such as
the robotic arm model of Marocco et al. (2003) and the mobile
robots of Vogt (2000). Other robotic models highlight the grounding through social
interaction, such as the AIBO robots of Steels & Kaplan (2000). Other
studies are based on simulated adaptive agents. They model the
agent and its environment in considerable detail, providing a basis on which emergent meanings
can be directly constructed. These simulation models have focused on
grounding in perceptual experience and in cognitive representations and
sensorimotor interactions (e.g. Cangelosi 2001).
In the cognitive modelling literature, there has
also been some work specifically focused on the integration of action and
vision knowledge in cognitive agents and in connectionist models. For
example, Arbib and colleagues have developed a neural model for action
learning directly inspired by brain imaging studies on grasping in
primates, and applied to action imitation learning simulations (Arbib et
al. 2002). Haruno, Wolpert and Kawato (2001) proposed the MOSAIC
architecture for simulated object manipulation tasks, demonstrating that
the model can generalise action-object associations depending on the object
shape. Tsiotas, Borghi and Parisi (2005) developed an artificial life model
for simulating some of Tucker & Ellis’s (2001) findings. They use a
simplified 2D arm model to study the evolutionary learning of object
affordance. In the area of connectionist modelling, Yoon, Heinke and
Humphreys (2002) have proposed a neural network model for action and name
selection for objects (NAM, the Naming and Action Model) that supports the role
of a direct route from perception to action in action selection. This model uses
abstract (localist) encoding of action, perceptual and semantic
information, rather than providing a robotic implementation, but is useful
as it focuses on the comparison of perceptual vs semantic information in action
selection.
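
To make the notion of localist encoding concrete, the sketch below codes each item with a single dedicated unit and passes an object code through a weight matrix to select an action. It is an illustration only and does not reproduce NAM. The vocabularies, the weight values and the names `localist` and `select_action` are hypothetical assumptions introduced for this example.

```python
# Hypothetical vocabularies; the items are illustrative, not NAM's actual units.
OBJECTS = ["cup", "hammer", "pea"]
ACTIONS = ["drink", "pound", "pinch"]


def localist(item, vocabulary):
    """Localist code: one unit per item, 1.0 for the coded item, 0.0 elsewhere."""
    return [1.0 if entry == item else 0.0 for entry in vocabulary]


# Hypothetical weights for a direct perception-to-action route.
# Rows are object units, columns are action units.
DIRECT_WEIGHTS = [
    [0.9, 0.1, 0.0],  # cup
    [0.0, 0.9, 0.1],  # hammer
    [0.0, 0.0, 0.9],  # pea
]


def select_action(object_name):
    """Propagate a localist object code and return the most active action."""
    percept = localist(object_name, OBJECTS)
    activation = [
        sum(percept[i] * DIRECT_WEIGHTS[i][j] for i in range(len(OBJECTS)))
        for j in range(len(ACTIONS))
    ]
    return ACTIONS[activation.index(max(activation))]


chosen = select_action("hammer")
```

Because every unit stands for exactly one object or action, the activation pattern can be read off directly, which is what makes such abstract encodings convenient for comparing routes even though they say nothing about motor control in a physical robot.
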