

Current paper-based interfaces, such as PapierCraft, provide very little feedback, which limits the scope of possible interactions. So far, there has been little systematic exploration of the structure, constraints, and contingencies of feedback mechanisms in paper-based interaction systems for paper-only environments. We identify three levels of feedback: discovery feedback (e.g., to aid with menu learning), status-indication feedback (e.g., for error detection), and task feedback (e.g., to aid in a search task). Using three modalities (visual, tactile, and auditory) that can be easily implemented on a pen-sized computer, we introduce a conceptual matrix to guide systematic research on pen-top feedback for paper-based interfaces. Using this matrix, we implemented a multimodal pen prototype demonstrating the potential of our approach. We conducted an experiment that confirmed the efficacy of our design in helping users discover a new interface and identify and correct their errors.

Tracking both hands in free space with accompanying speech input can augment the user's ability to communicate with computers. This paper discusses the kinds of situations that call for two-handed rather than single-handed input, and reports a prototype in which two-handed gestures serve to input concepts, both static and dynamic, manipulate displayed items, and specify actions to be taken. Future directions include enlargement of the vocabulary of two-handed “coverbal” gestures and the modulation by gaze of gestural intent.

Multimodal interaction combines input from multiple sensors, such as pointing devices or speech recognition systems, to achieve more fluid and natural interaction. Two-handed interaction has been used recently to enrich graphical interaction. Building applications that use such combined interaction requires new software techniques and frameworks. Using additional devices means that user interface toolkits must be more flexible with regard to input devices and event types. The possibility of parallel interactions must also be taken into account, with consequences for the structure of toolkits. Finally, frameworks must be provided for combining the events and status of several devices. This paper reports on the extensions we made to the direct manipulation interface toolkit Whizz in order to experiment with two-handed interaction. These extensions range from structural adaptations of the toolkit to new techniques for specifying the time-dependent fusion of events.
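As an illustration of what time-dependent fusion of events from several devices can look like, the following is a minimal Python sketch. It is not the Whizz API: the `InputEvent` type, the device names, and the 0.2-second window are all assumptions made for the example. Two events from different devices are fused when they arrive within the window; any other event is emitted on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InputEvent:
    device: str       # hypothetical device name, e.g. "left_hand"
    kind: str         # hypothetical event type, e.g. "press"
    timestamp: float  # seconds


def fuse_events(events: List[InputEvent],
                window: float = 0.2) -> List[Tuple[InputEvent, ...]]:
    """Pair each event with the next one when it comes from a different
    device and arrives within `window` seconds; otherwise emit it alone."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    fused: List[Tuple[InputEvent, ...]] = []
    i = 0
    while i < len(ordered):
        current = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        if (nxt is not None
                and nxt.device != current.device
                and nxt.timestamp - current.timestamp <= window):
            fused.append((current, nxt))  # combined two-device event
            i += 2
        else:
            fused.append((current,))      # single-device event
            i += 1
    return fused


events = [
    InputEvent("left_hand", "press", 0.00),
    InputEvent("right_hand", "press", 0.10),
    InputEvent("left_hand", "press", 1.00),
]
fused = fuse_events(events)
```

Here the first two presses fall inside the window and are reported as one combined event, while the third is reported alone. A real toolkit would also need to handle device status and more than two devices, which this sketch leaves out.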

The XWeb architecture delivers interfaces to a wide variety of interactive platforms. XWeb's SUBSCRIBE mechanism allows multiple interactive clients to synchronize with each other. We define the concept of Join as the mechanism for acquiring access to a service's interface. Join also allows the formation of spontaneous collaborations with other people. We define the concept of Capture as the means for users to assemble suites of interactive resources to apply to a particular problem. These mechanisms allow users to access devices that they encounter in their environment rather than carrying all their devices with them. We describe two prototype implementations of Join and Capture. One uses a Java ring to carry a user's identification and to make connections. The other uses a set of cameras to watch where users are and what they touch. Lastly, we present algorithms for resolving conflicts generated when independent interactive clients manipulate the same information.

IVR (interactive voice response) menu navigation has long been recognized as a frustrating interaction experience. We propose an IM-based system that sends a coordinated visual IVR menu to the caller's computer screen. The visual menu is updated in real time in response to the caller's actions. With this automatically opened supplementary channel, callers can take advantage of different modalities over different devices and interact with the IVR system with the ease of graphical menu selection. Our approach of utilizing existing network infrastructure to pinpoint the caller's virtual location and coordinating multiple devices and multiple channels based on users' ID registration can also be more generally applied to create integrated user experiences across a group of devices.

An action inferring facility for a multimodal interface called Edward is described. Based on the actions the user performs, Edward anticipates future actions and offers to perform them automatically. The system uses inductive inference to anticipate actions. It generalizes over arguments and results, and detects patterns on the basis of a small sequence of user actions, e.g. “copy a lisp file; change extension of original file into .org; put the copy in the backup folder”. Multimodality (particularly the combination of natural language and simulated pointing gestures) and the reuse of patterns are important new features. Some possibilities and problems of action inferring interfaces in general are addressed. Action inferring interfaces are particularly useful for professional users of general-purpose applications. Such users are unable to program repetitive patterns because either the applications do not provide the facilities or the users lack the capabilities.
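The idea of detecting a pattern in a short sequence of actions while generalizing over arguments can be sketched as follows. This is not Edward's inductive inference procedure: simple suffix matching on action verbs is swapped in as a stand-in, and the `Action` tuple, the verb names, and `predict_next_verb` are hypothetical names chosen for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a user action: (verb, argument).
Action = Tuple[str, str]


def predict_next_verb(history: List[Action],
                      min_len: int = 2) -> Optional[str]:
    """Return the verb that followed the most recent earlier occurrence of
    the longest repeated suffix of the history, or None if nothing repeats.
    Arguments are ignored, which is a crude way to generalize over them."""
    verbs = [verb for verb, _ in history]
    n = len(verbs)
    # Try the longest candidate suffix first, down to `min_len` actions.
    for length in range(n - 1, min_len - 1, -1):
        suffix = verbs[n - length:]
        # Scan earlier positions, most recent first, that have a follower.
        for start in range(n - length - 1, -1, -1):
            if verbs[start:start + length] == suffix:
                return verbs[start + length]
    return None


history = [
    ("copy", "a.lisp"),
    ("rename", "a.lisp"),
    ("move", "copy of a.lisp"),
    ("copy", "b.lisp"),
    ("rename", "b.lisp"),
]
suggestion = predict_next_verb(history)
```

With this history the sketch suggests `move`, mirroring the copy, rename, move example in the abstract. Inferring which argument the suggested action should apply to is the harder part and is not attempted here.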

While graphical user interfaces have gained much popularity in recent years, there are situations in which the need to use existing applications in a nonvisual modality is clear. Examples include the use of applications on hand-held devices with limited screen space (or even no screen space, as in the case of telephones) and their use by people with visual impairments.
We have developed an architecture capable of transforming the graphical interfaces of existing applications into powerful, intuitive nonvisual interfaces. Our system, called Mercator, provides new input and output techniques for working in the nonvisual domain. Navigation is accomplished by traversing a hierarchical tree representation of the interface structure. Output is primarily auditory, although other output modalities (such as tactile) can be used as well. The mouse, an inherently visually oriented device, is replaced by keyboard and voice interaction.
Our system is currently in its third major revision. We have gained insight into both the nonvisual interfaces presented by our system and the architecture necessary to construct such interfaces. This architecture uses several novel techniques to efficiently and flexibly map graphical interfaces into new modalities.
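Navigation by traversing a hierarchical tree representation of an interface can be sketched in a few lines of Python. This is an illustrative sketch only, not Mercator's architecture: the `Node` and `Navigator` classes, the key names (`up`, `down`, `next`, `prev`), and the widget labels are assumptions. The label returned by each move stands in for what an auditory output layer would present.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


class Navigator:
    """Keeps a cursor on one node and moves it with keyboard-style commands."""

    def __init__(self, root: Node) -> None:
        self.current = root

    def move(self, key: str) -> str:
        node = self.current
        if key == "down" and node.children:
            node = node.children[0]              # first child
        elif key == "up" and node.parent is not None:
            node = node.parent                   # enclosing container
        elif key in ("next", "prev") and node.parent is not None:
            siblings = node.parent.children
            idx = siblings.index(node) + (1 if key == "next" else -1)
            if 0 <= idx < len(siblings):         # stay put at either end
                node = siblings[idx]
        self.current = node
        return node.label                        # text to present to the user


root = Node("Window")
file_menu = root.add(Node("File menu"))
root.add(Node("Edit menu"))
file_menu.add(Node("Open"))

nav = Navigator(root)
spoken = [nav.move(k) for k in ("down", "next", "next", "prev", "down", "up")]
```

Moves that have no target, such as `next` on the last sibling, leave the cursor where it is and repeat the current label, which is one simple way to keep the user oriented without visual cues.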

ENO is an audio server designed to make it easy for applications in the Unix environment to incorporate non-speech audio cues. At the physical level, ENO manages a shared resource, namely the audio hardware. At the logical level, it manages a sound space that is shared by various client applications. Instead of dealing with sound in terms of its physical description (i.e., sampled sounds), ENO allows sounds to be presented and controlled in terms of higher-level descriptions of sources, interactions, attributes, and sound space. Using this structure, ENO can facilitate the creation of consistent, rich systems of audio cues. In this paper, we discuss the justification, design, and implementation of ENO.

In this paper, we describe a multimodal interface prototype system based on the Dynamical Dialogue Model. This system not only integrates speech and gesture information, but also controls the timing of responses in order to achieve smooth interaction between the user and the computer. Our approach consists of human-human dialogue analysis and computational modeling of dialogue.