ACMCrossroads / Xrds10-1 / Developing Voice Interfaces for Legacy Web Applications

Developing Voice Interfaces for Legacy Web Applications

by Jorge Quiané and Jorge Manjarrez

Introduction

Traditionally, web applications are accessed through a single-mode interface: information is presented and captured as text. However, you can also use a voice browser to navigate the Internet and reach "hands-free" Internet applications from anywhere, without being restricted to a desktop or portable computer. VoiceXML is an XML-based language for Internet telephony applications. It can "speech-enable" an existing web application so that it is used through a conversational interface, which gives users a more natural way to interact with Internet applications.

VoiceXML applications are executed through a gateway, which contains the telephony and speech services and can be installed together with the web server; you do not need to create a new application. VoiceXML is not only for Internet applications: it can also be used for Interactive Voice Response (IVR) systems. In this article, we show how to develop VoiceXML applications using an emulator, and how to link the voice code with a JSP page to interact with a database.

VoiceXML Architecture

VoiceXML interacts with the user through voice dialogs, using speech components that provide voice recognition. The VoiceXML gateway works as a voice browser by combining Automatic Speech Recognition (ASR), Text-To-Speech (TTS), support for audio files, Dual-Tone Multi-Frequency (DTMF) tone recognition, and telephony services. The voice browser receives the user's request through ASR, makes the appropriate HTTP request, and retrieves the desired VoiceXML document, or the Active Server Page (ASP) or JavaServer Page (JSP) [5] document that produces it. The TTS engine then renders the result as speech for the user. VoiceXML dialogs can be generated on the fly; as the database example illustrates, they can be generated according to the user's attributes. Schematically, a VoiceXML application has the components shown in Figure 1.

Figure 1: Common VoiceXML architecture showing the components and their interactions.

Figure 1 shows the five elements contained in the gateway and their interactions. External to the interpreter is the web server, to which the interpreter forwards the customer's requests. A gateway needs at least one telephony device, and that hardware is costly, which puts it out of reach of many individual developers, educational institutions, and perhaps small and medium-sized companies. Some companies offer public gateways within the United States, among them BeVocal Café [2], Tellme Studio [10], and VoiceGenie Developer Workshop [11]. However, these are offered only for developing and testing applications. To get access, you register with the provider and obtain a toll-free phone number and a PIN. To test your VoiceXML application, you upload it to a web server and then call the number provided to reach the VoiceXML gateway, which retrieves the application from the web server. Once you reach the production phase, you should either buy a number that can be used for marketing purposes or buy the complete VoiceXML infrastructure, in which case you must provide the telephony services and possibly VoIP yourself. Alternatively, you can use an emulator such as Cambridge Voice Studio [3], WebSphere Voice Server SDK [12], or Motorola Mobile ADK [7] to develop and test VoiceXML applications on your desktop computer.

Development Environment

Here we use the Cambridge Voice Studio (CVS) emulator. Also needed are a TTS [6] component and a servlet container [1]. The CVS development environment shown in Figure 2 is divided into several areas. The project area includes all of the VoiceXML documents that compose the application, and the work area is used to edit them.

Figure 2: Main CVS window.

The "Views" tab switches between different styles of the project area. The "Toolbar" has buttons to interact with the document. To run or stop an application, you load it into the environment and press the "Run" or "Stop" icon indicated in Figure 2. At execution time, the emulator opens a new debug window (Figure 3), which displays the running application and lets you interact with it through the computer keyboard. One limitation of this emulator is that it does not accept voice input.

Figure 3: The debug window displayed at execution time.

Introduction to VoiceXML Applications

In this section, we focus on developing applications. We first compare a simple dialog with an HTML form to learn by comparison, and then explore an Automatic Teller Machine (ATM) example using JSP and a database. As indicated earlier, VoiceXML is based on XML. The basic structure of a VoiceXML application is shown below.


	1. <?xml version = "1.0"?>
	2. <vxml version = "2.0">
	3. 	[Body]
	4. </vxml>
	

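The first and second lines are mandatory: line 1 is the XML declaration, which states the XML version of the document, and line 2 opens the vxml element and states the VoiceXML version. The application code sits between lines 2 and 4. Line 4 is the closing tag of <vxml version = "2.0">.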


Example 1: Basic User Interaction

Most users are familiar with a basic interaction on the Web using HTML forms. A simple interaction form is shown in Figure 4. It requests a name; the user writes it in the textbox and then clicks on the "Accept" button. The application responds to this event with a new page showing a welcome message with the user name (Figure 5). If the user presses the button without supplying a name, the application repeats the request.

Figure 4: A simple HTML form to capture user input.

Figure 5: Welcome message.

This same application developed in VoiceXML is as follows:

        <?xml version = "1.0"?>
        <!-- Part 1 -->
        <vxml version = "2.0" mode = "TTS">
                <form>
                        <!-- Part 2 -->
                        <noinput>
                                <prompt>
                                The name cannot be null
                                </prompt>
                        </noinput>
                        <!-- Part 3 -->
                        <field name = "name" type = "spelling">
                                <prompt>
                                Tell me your name
                                </prompt>
                                <!-- Part 4 -->
                                <filled>
                                        <prompt>
                                        Welcome <value expr = "name"/>!
                                        </prompt>
                                        <exit/>
                                </filled>
                        </field>
                </form>
        </vxml>
        

Part 1 consists of the header explained above, here with an additional attribute, mode, indicating that the interpreter will work in Text-To-Speech mode, followed by a form tag, which is similar to the HTML form tag. VoiceXML documents are usually divided into several forms.

Part 2 shows the noinput and prompt tags. The noinput tag is an event handler that runs only when the user provides no input. The prompt tag instructs the voice browser to play a recorded audio file, or to synthesize as speech whatever text lies between the opening and closing prompt tags.

Part 3 contains the field tag. It is a voice input tag that captures the user's response and stores it in a variable named by the attribute name (here, name = "name"). The variable can be typed with the attribute type; this example uses spelling because the input will be text. Remember that the emulator cannot capture a spoken response, so you interact with it through the computer keyboard. This also shows another VoiceXML capability: a field can accept a textual response. The tag behaves like the HTML input field.

Finally, Part 4 uses the filled tag, which gets executed whenever there is a value in the field tag. So it acts like pressing the submit button on an HTML form. All this code can be downloaded from our website [8].

The interaction with the user is not visual, as in HTML, but by audio. While the application is running, you listen to a dialog and enter input in the debug window of the CVS. This dialog is generated by the code tagged as Part 3. In the transcripts that follow, System marks what is heard during execution and User marks what the user types. The transcript of the dialog produced by this code is shown in Figure 6.

System: Tell me your name
User: [Waiting...]

Figure 6: Dialog generated at the execution of the VoiceXML for the first example.

If the user provides an answer, for example, the name "Mariana Haro Graciet" in the debug window, the filled tag executes and you hear a welcome message (Figure 7).

System: Welcome Mariana Haro Graciet!

Figure 7: Welcome message after providing user input.

Finally, if the user does not provide any input, the noinput event fires and the following dialog is heard (Figure 8):

System: The name cannot be null
System: Tell me your name
User: [Waiting...]

Figure 8: Warning generated by an empty answer.
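
The control flow of this first example can be imitated outside a voice browser. The Java sketch below is not part of the original code and is not how a VoiceXML interpreter is implemented. It only mimics the prompt, noinput, and filled cycle of a single field. The class name FieldDialog is hypothetical, and a list of strings stands in for the lines typed in the debug window.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration: imitates how the field in Example 1 is revisited
// until it holds a value. It is not a VoiceXML interpreter.
class FieldDialog {
    static final String PROMPT = "Tell me your name";
    static final String NOINPUT = "The name cannot be null";

    // Returns everything the "System" would say for the given typed inputs.
    static List<String> run(List<String> typedInputs) {
        List<String> spoken = new ArrayList<>();
        Iterator<String> inputs = typedInputs.iterator();
        String name = null;                       // the field variable starts undefined
        while (name == null) {
            spoken.add(PROMPT);                   // <prompt> inside <field>
            if (!inputs.hasNext()) {
                break;                            // no more simulated input: stop waiting
            }
            String answer = inputs.next().trim();
            if (answer.isEmpty()) {
                spoken.add(NOINPUT);              // <noinput> handler, then the field is revisited
            } else {
                name = answer;                    // the field is now filled
            }
        }
        if (name != null) {
            spoken.add("Welcome " + name + "!");  // <filled> block, followed by <exit/>
        }
        return spoken;
    }

    public static void main(String[] args) {
        for (String line : run(List.of("", "Mariana Haro Graciet"))) {
            System.out.println("System: " + line);
        }
    }
}
```

Running main with an empty first answer reproduces the order of Figures 6 to 8: the prompt, the warning, the prompt again, and finally the welcome message.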

Enabling Voice in a Legacy Web Application

In this section, we will describe how to adapt an existing web application to use VoiceXML.

Model-View-Controller (MVC)

The desirable architecture for a Web application is a three-tier system that separates presentation code, data processing code, and data store code, and that follows the MVC architecture [4]. The model encapsulates the application state, provides the functionality, and answers queries about that state. The view presents the model's data to the user, forwards the user's input to the controller, and lets the controller select which view to show; traditionally it consists of HTML pages. The controller defines the behavior of the application, selects a view to respond to each request, and updates the model. On the Java platform these roles are usually filled by JavaBean components, JSP pages, and servlets. If the application is structured in this way, the only part to substitute is the view: in place of HTML pages you will have VoiceXML user dialogs, which JSP pages can generate just as they generate HTML. VoiceXML changes only the presentation of the information.
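
As a rough illustration of substituting only the view, consider the following Java sketch. It is not taken from the ATM application: the names View, HtmlView, VoiceXmlView, and BalanceController, the "balance" key, and the channel strings are all assumptions made for this example.

```java
import java.util.Map;

// Hypothetical sketch of swapping the view in an MVC application.
// All names here are illustrative and do not come from the ATM code.
interface View {
    String render(Map<String, String> model);
}

class HtmlView implements View {
    public String render(Map<String, String> model) {
        return "<html><body><p>Your balance is "
                + model.get("balance") + "</p></body></html>";
    }
}

class VoiceXmlView implements View {
    public String render(Map<String, String> model) {
        return "<?xml version=\"1.0\"?>"
                + "<vxml version=\"2.0\"><form><block><prompt>"
                + "Your balance is " + model.get("balance")
                + "</prompt></block></form></vxml>";
    }
}

class BalanceController {
    // The controller picks a view by channel; the model data is untouched.
    static String respond(String channel, Map<String, String> model) {
        View view = "voice".equals(channel) ? new VoiceXmlView() : new HtmlView();
        return view.render(model);
    }

    public static void main(String[] args) {
        Map<String, String> model = Map.of("balance", "1500");
        System.out.println(respond("web", model));
        System.out.println(respond("voice", model));
    }
}
```

The same model map feeds both views, which is the property that lets a legacy application gain a telephone interface without changes to its data or logic.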

Example 2: Voice ATM

This example application has the typical behavior of an Automatic Teller Machine. It contains customer functions (Queries, Withdrawals, and Deposits) and some manager functions (Add user, Modify user, and Create account). The application normally works through a Web browser, but now it must serve not only Web users but also anyone with a phone, so it is going to provide dual access, by Web and by telephone. To get access, a user must log in to the system. Because you need to capture user input, you use a form as in the first example. The code is in the file main.vxml and is shown below:

     <?xml version = "1.0"?>
        <vxml version = "2.0" mode = "TTS">
        <!-- Part 1 -->
                <form>
                        <block>
                                <prompt>
                                Welcome to the CIC bank! Whenever you need help navigating this application,
                                just say "help."
                                </prompt>
                        </block>
                </form>
                <!--Part 2 -->
                <menu>
                        <noinput>
                                <prompt>
                                I didn't hear you
                                </prompt>
                        </noinput>
                        <nomatch>
                                <prompt>
                                I didn't understand you
                                </prompt>
                        </nomatch>
                        <help>
                        Just follow the instructions in this menu
                        </help>
                        <prompt><break msecs = "5000"/>
                        Select one of these options<enumerate/>
                        </prompt>
                        <choice next = "#login">
                        customers
                        </choice>
                        <choice next = "#register">
                        not customers
                        </choice>
                        <choice next = "#end">
                        exit
                        </choice>
                </menu>
                <!--Part 3 -->
                <form name = "login">
                        <noinput>
                                <prompt>
                                I didn't hear you
                                </prompt>
                        </noinput>
                        <nomatch>
                                <prompt>
                                I didn't understand you
                                </prompt>
                        </nomatch>
                        <field name = "user" type = "spelling">
                                <help>
                                Just say your username
                                </help>
                                <prompt>
                                Tell me your username
                                </prompt>
                        </field>
                        <field name = "pwd" type = "spelling">
                                <help>
                                Just say your password
                                </help>
                                <prompt>
                                Tell me your password
                                </prompt>
                        </field>
                        <!-- Part 4 -->
                        <filled>
                                <if cond = "user == 'administrator'">
                                        <if cond = "pwd == 'system2003'">
                                                <goto next = "menu3.vxml"/>
                                        <else/>
                                                <submit next = "http://localhost:8080/examples/jsp/tmp/banco/login.jsp"/>
                                        </if>
                                <else/>
                                        <submit next = "http://localhost:8080/examples/jsp/tmp/banco/login.jsp"/>
                                </if>
                        </filled>
                        <!-- Part 5 -->
                        <block name = "register">
                                <goto next = "menu2.vxml"/>
                        </block>
                        <block name = "end">
                                <prompt>
                                Thank you for visiting us! Good bye!
                                </prompt>
                                <exit/>
                        </block>
                </form>
        </vxml>
        

Part 1 welcomes the user and introduces a new tag, block, which holds a set of directives that are executed in order. Here, the prompt tag inside the block produces the welcome message:

System: Welcome to the CIC bank!
Whenever you need help navigating
this application, just say "help."

Figure 9: Welcome dialog of the ATM example.

In Part 2, the main menu is generated (Figure 10). The menu tag is used to create a list of options. The nomatch tag is similar to noinput, and runs when the user selects an invalid option. The help tag provides guidance to the user: when he or she says "help" (or types help in the CVS), the system "reads" the text between the opening and closing help tags. The break tag issues a pause of msecs milliseconds. The enumerate tag reads out the menu options in order, so it always appears inside a menu block. Each choice tag defines the attribute next and, as with HTML links, a reference inside the same document is preceded by the '#' character. Part 3 of the code asks the user for the user name and password (Figure 11), and if the input is empty or invalid, the dialog is repeated.

System: Select one of these options:
    For customers, say "customers" or press one.
    For non-customers, say "non-customers" or press two.
    To exit, say "exit" or press three.
User: [Waiting...]

Figure 10: Menu "spoken" by the CVS using main.vxml. CVS added, "For customers, say customers or press one," where you just defined customers.

System: Tell me your username
User: ufo
System: Tell me your password
User: xxxxxx

Figure 11: Dialog to ask for login and password.

If everything is correct, Part 4 is executed. Here you meet new tags: if and else work as in any programming language, and goto and submit transfer control to other dialogs or documents. The link between VoiceXML and JSP is made with the tag <submit next = "http://localhost:8080/examples/jsp/tmp/banco/login.jsp"/>, which sends the captured field values to the JSP page. Remember that the application logic resides in the JSP pages, so login.jsp checks the login against a database and returns its answer as another VoiceXML document. Part 5 of the code uses the block tag to implement the Register option, which creates a Web account for a new user, and the Exit option, which terminates the application.

System: Thank you for visiting us! Good bye!

Figure 12: The last part of the code contains the goodbye message.

Database Access with VoiceXML

The database access is not really done by VoiceXML, but with the help of JSP. For this example, a database is used to simulate the bank office.

Integration of VoiceXML and JSP

VoiceXML reaches the JSP pages through the tags goto and submit; the values captured in the dialog are sent by the CVS as parameters to the JSP document. The part of the code that accesses the database to verify the login uses JDBC. The user name and password captured by the VoiceXML dialog are sent to the JSP page and used as parameters to query the database. The JSP code then stores the user name and account number as session attributes and generates a new voice dialog with the user. It does so simply by including the VoiceXML goto tag in its output, just as you would include any HTML tag in a JSP document. In this case, the goto tag is recognized by the VoiceXML gateway, which starts the new voice dialog, menu.vxml.

	<!-- finally, redirect to the next VoiceXML document -->
	<goto next = "http://server/bank/menu.vxml"/>
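
The login step described above might look roughly like the following Java sketch, written as a plain class rather than a JSP page so that it can be read on its own. It is an illustration under stated assumptions, not the authors' login.jsp: the table name customers, its columns, the class name LoginCheck, and the retry target main.vxml#login are all hypothetical. In a JSP page, the account number returned by findAccount would additionally be stored with session.setAttribute.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical sketch of the login step. The table "customers", its columns,
// and the retry target "main.vxml#login" are assumptions for illustration.
class LoginCheck {
    // Looks up the account number for a user/password pair; null if none matches.
    static String findAccount(Connection db, String user, String pwd) throws SQLException {
        String sql = "SELECT account FROM customers WHERE username = ? AND password = ?";
        try (PreparedStatement stmt = db.prepareStatement(sql)) {
            stmt.setString(1, user);   // bound parameters keep user input out of the SQL text
            stmt.setString(2, pwd);
            try (ResultSet rows = stmt.executeQuery()) {
                return rows.next() ? rows.getString("account") : null;
            }
        }
    }

    // Builds the VoiceXML document returned to the gateway after the check.
    static String response(String account) {
        String next = (account != null)
                ? "http://server/bank/menu.vxml"   // URL from the article's goto example
                : "main.vxml#login";               // assumed retry target
        return "<?xml version=\"1.0\"?>\n"
                + "<vxml version=\"2.0\">\n"
                + "  <form>\n"
                + "    <block>\n"
                + "      <goto next=\"" + next + "\"/>\n"
                + "    </block>\n"
                + "  </form>\n"
                + "</vxml>\n";
    }

    public static void main(String[] args) {
        // No database is opened here; only the two possible responses are printed.
        System.out.println(response("12345"));
        System.out.println(response(null));
    }
}
```

The sketch compares the password as plain text only to stay close to the description above; a production system would store and compare password hashes instead.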

To mix JSP code with VoiceXML, you write an ordinary VoiceXML document but save it with a .jsp extension so that the servlet container (Tomcat) executes it. Inside it you write JSP directives or scriptlets, whose variables or attributes can be passed to VoiceXML tags. The code below, for example, retrieves the user name and later passes it to the VoiceXML prompt tag as a JSP expression, which evaluates to a string. The CVS has an HTTP client that forwards the user's requests to the Tomcat server, which interprets all the JSP instructions. The remaining text is returned to the CVS as a VoiceXML document, so it can be spoken as soon as it is generated. The fragment shown is part of the file deposit.jsp, which contains the logic to make bank deposits.

     <?xml version = "1.0"?>
        <vxml version = "2.0">
                <form>
                        <block>
                        <%@ page import = "java.sql.*, java.io.*,java.text.*" %>
                        <%
                                String name = (String) session.getAttribute("Name");
                                ... [body program] ...
                        %>
                        <prompt>
                        <%=name%> your transaction was completed successfully
                        </prompt>
                        </block>
                </form>
        </vxml>
        

When the above code is executed, you hear the name of the user and the message "your transaction was completed successfully." In this way, you can generate VoiceXML documents on the fly that contain information retrieved from a database using JSP.
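
One detail worth guarding against when generating VoiceXML on the fly is that values read from a database may contain characters such as & or <, which would make the generated document ill-formed. The Java sketch below shows one way to escape such values before they are placed in a prompt. The class VoicePrompt and its methods are hypothetical helpers, not part of deposit.jsp.

```java
// Hypothetical helper: escapes text before it is placed inside VoiceXML markup.
class VoicePrompt {
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    // Produces a prompt element like the one in the deposit example for a given name.
    static String confirmation(String name) {
        return "<prompt>" + escapeXml(name)
                + " your transaction was completed successfully</prompt>";
    }

    public static void main(String[] args) {
        System.out.println(confirmation("Mariana"));
        System.out.println(confirmation("Smith & Sons <Ltd>"));
    }
}
```

In a JSP page, the expression <%=name%> could be replaced by a call to such a helper so that the document sent to the voice browser stays well-formed.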

Conclusions

This article illustrates the main characteristics of VoiceXML applications, the principal tags, and the interaction with databases. All the examples were executed with the Cambridge Voice Studio, which at this time does not allow voice interaction (others do, for example IBM's implementation). One weakness of VoiceXML is the limited vocabulary it can understand: an application uses grammars to define the words it accepts, and anything outside them is not understood. One alternative is Natural Language Understanding (NLU), which can be based on statistical analysis or on previous experience with the application domain, and so allows a richer set of user inputs than a fixed vocabulary.

We are developing a VoiceXML Gateway and a VoiceXML IDE based on the VoiceXML Specification 2.0, which will allow interaction by microphone or telephone and thus provide a more realistic development environment. Multimodal access is becoming popular and feasible to implement thanks to the development of Speech Application Language Tags (SALT) [9] and XHTML + Voice (X+V) [13]. Both are based on XML, and both allow not only keyboard, mouse, or other input devices but also voice to interact with existing Web applications, through a normal Web browser as well as through other devices such as PDAs, tablet PCs, and telephones. Moreover, VoiceXML is the base of X+V, and SALT resembles its style. So, whatever trend the market takes, VoiceXML will still be present, and it is therefore important to know it.

References

[1] Apache, Apache Jakarta Project, 2003, <http://jakarta.apache.org/tomcat/index.html> (3 December 2002).
[2] BeVocal, BeVocal Café, 10 September 2003, <http://cafe.bevocal.com/> (25 March 2003).
[3] Cambridge Voice, Cambridge Voice Studio, 2001, <http://www.cambridgevoice.com> (23 January 2002).
[4] Gamma, E., Helm, R., Johnson, R., and Vlissides, J., Design Patterns, Addison-Wesley, April 1996.
[5] Javasoft, Java Server Pages, 23 April 2003, <http://www.javasoft.com/jsp> (10 January 2003).
[6] Microsoft Corporation, Microsoft Text-to-Speech Engine, 2001, <http://www.microsoft.com/windowsxp/home/using/productdoc/en/default.asp?url=/windowsxp/home/using/productdoc/en/speech_tts_overview.asp> (20 February 2003).
[7] Motorola, Mobile ADK, 2002, <http://mix.motorola.com> (12 November 2002).
[8] Quiané, J., "Homepage for code examples," 20 March 2003, <http://jupiter.cic.ipn.mx/~ufo>.
[9] Salt Forum, Speech Application Language Tags, <http://saltforum.org> (20 April 2003).
[10] Tellme, Tellme Studio, 2002, <http://studio.tellme.com> (10 February 2003).
[11] VoiceGenie, Developer Workshop, 7 April 2003, <http://developer.voicegenie.com> (10 April 2003).
[12] WebSphere (IBM), WebSphere software platform, 2002, <http://www-3.ibm.com/software/info1/websphere> (18 December 2002).
[13] World Wide Web Consortium, XHTML + Voice, 2003, <http://www.w3.org/TR/xhtml+voice> (25 April 2003).

Biographies

Jorge Quiané (ufo@correo.cic.ipn.mx) is a final-semester MSc student in Computer Science at the Center for Computing Research at IPN Mexico. His research interests include Software Engineering, Internet applications, and Databases.

Jorge Manjarrez (jorgerms@acm.org) has an MSc in Computer Science from the Center for Computing Research at IPN Mexico, and currently does research on Software Engineering and Databases in the Software Technology Lab.

Copyright 2004, The Association for Computing Machinery, Inc.