PLAN:
INTRODUCTION
PRINCIPLES OF ASR TECHNOLOGY
PERFORMANCE AND DESIGN ISSUES IN SPEECH APPLICATIONS
CURRENT TRENDS IN VOICE-INTERACTIVE CALL
FUTURE TRENDS IN VOICE-INTERACTIVE CALL
DEFINING AND ACQUIRING LITERACY IN THE AGE OF INFORMATION
CONTENT-BASED INSTRUCTION AND LITERACY DEVELOPMENT
THEORY INTO PRACTICE
CONCLUSION
REFERENCES
INTRODUCTION
During the past two decades, the exercise of spoken language skills has received increasing attention among educators. Foreign language curricula focus on productive skills with special emphasis on communicative competence. Students' ability to engage in meaningful conversational interaction in the target language is considered an important, if not the most important, goal of second language education. This shift of emphasis has generated a growing need for instructional materials that provide an opportunity for controlled interactive speaking practice outside the classroom.
With recent advances in multimedia technology, computer-aided language learning (CALL) has emerged as a tempting alternative to traditional modes of supplementing or replacing direct student-teacher interaction, such as the language laboratory or audio-tape-based self-study. The integration of sound, voice interaction, text, video, and animation has made it possible to create self-paced interactive learning environments that promise to enhance the classroom model of language learning significantly. A growing number of textbook publishers now offer educational software of some sort, and educators can choose among a large variety of different products. Yet, the practical impact of CALL in the field of foreign language education has been rather modest. Many educators are reluctant to embrace a technology that still seeks acceptance by the language teaching community as a whole (Kenning & Kenning, 1990).
A number of reasons have been cited for the limited practical impact of computer-based language instruction. Among them are the lack of a unified theoretical framework for designing and evaluating CALL systems (Chapelle, 1997; Hubbard, 1988; Ng & Olivier, 1987); the absence of conclusive empirical evidence for the pedagogical benefits of computers in language learning (Chapelle, 1997; Dunkel, 1991; Salaberry, 1996); and finally, the current limitations of the technology itself (Holland, 1995; Warschauer, 1996). The rapid technological advances of the 1980s have raised both the expectations and the demands placed on the computer as a potential learning tool. Educators and second language acquisition (SLA) researchers alike are now demanding intelligent, user-adaptive CALL systems that offer not only sophisticated diagnostic tools, but also effective feedback mechanisms capable of focusing the learner on areas that need remedial practice. As Warschauer puts it, a computerized language teacher should be able to understand a user's spoken input and evaluate it not just for correctness but also for appropriateness. It should be able to diagnose a student's problems with pronunciation, syntax, or usage, and then intelligently decide among a range of options (e.g., repeating, paraphrasing, slowing down, correcting, or directing the student to background explanations). (Warschauer, 1996, p. 6)
Salaberry (1996) demands nothing short of a system capable of simulating the complex socio-communicative competence of a live tutor--in other words, the linguistic intelligence of a human--only to conclude that the attempt to create an "intelligent language tutoring system is a fallacy" (p. 11). The implicit reasoning is that because speech technology isn't perfect, it is of no use at all: if it "cannot account for the full complexity of human language," why even bother modeling more constrained aspects of language use (Higgins, 1988, p. vii)? This sort of all-or-nothing reasoning seems symptomatic of much of the latest pedagogical literature on CALL. The quest for a theoretical grounding of CALL system design and evaluation (Chapelle, 1997) tends to lead to exaggerated expectations as to what the technology ought to accomplish. When combined with little or no knowledge of the underlying technology, the inevitable result is disappointment.
PRINCIPLES OF ASR TECHNOLOGY
Consider the following four scenarios:
1. A court reporter listens to the opening arguments of the defense and types the words into a steno-machine attached to a word-processor.
2. A medical doctor activates a dictation device and speaks his or her patient's name, date of birth, symptoms, and diagnosis into the computer. He or she then pushes "end input" and "print" to produce a written record of the patient's diagnosis.
3. A mother tells her three-year old, "Hey Jimmy, get me my slippers, will you?" The toddler smiles, goes to the bedroom, and returns with papa's hiking boots.
4. A first-grader reads aloud a sentence displayed by an automated Reading Tutor. When he or she stumbles over a difficult word, the system highlights the word, and a voice reads the word aloud. The student repeats the sentence--this time correctly--and the system responds by displaying the next sentence.
At some level, all four scenarios involve speech recognition. An incoming speech signal elicits a response from a "listener." In the first two instances, the response consists of a written transcript of the spoken input, whereas in the latter two cases, an action is performed in response to a spoken command. In all four cases, the "success" of the voice interaction is relative to a given task as embodied in a set of expectations that accompany the input. The interaction succeeds when the response--by a machine or human "listener"--matches these expectations.
Recognizing and understanding human speech requires a considerable amount of linguistic knowledge: a command of the phonological, lexical, semantic, grammatical, and pragmatic conventions that constitute a language. The listener's command of the language must be "up" to the recognition task or else the interaction fails. Jimmy returns with the wrong items because he cannot yet verbally discriminate between different kinds of shoes. Likewise, the reading tutor would miserably fail in performing the court reporter's job or transcribing medical patient information, just as the medical dictation device would be a poor choice for diagnosing a student's reading errors. On the other hand, the human court reporter--assuming he or she is an adult native speaker--would have no problem performing any of the tasks mentioned under (1) through (4). The linguistic competence of an adult native speaker covers a broad range of recognition tasks and communicative activities. Computers, on the other hand, perform best when designed to operate in clearly circumscribed linguistic sub-domains.
Humans and machines process speech in fundamentally different ways (Bernstein & Franco, 1996). Complex cognitive processes account for the human ability to associate acoustic signals with meanings and intentions. For a computer, on the other hand, speech is essentially a series of digital values. However, despite these differences, the core problem of speech recognition is the same for both humans and machines: namely, that of finding the best match between a given speech sound and its corresponding word string. Automatic speech recognition technology attempts to simulate and optimize this process computationally.
Since the early 1970s, a number of different approaches to ASR have been proposed and implemented, including Dynamic Time Warping, template matching, knowledge-based expert systems, neural nets, and Hidden Markov Modeling (HMM) (Levinson & Liberman, 1981; Weinstein, McCandless, Mondshein, & Zue, 1975; for a review, see Bernstein & Franco, 1996). HMM-based modeling applies sophisticated statistical and probabilistic computations to the problem of pattern matching at the sub-word level. The generalized HMM-based approach to speech recognition has proven an effective, if not the most effective, method for creating high-performance speaker-independent recognition engines that can cope with large vocabularies; the vast majority of today's commercial systems deploy this technique. Therefore, we focus our technical discussion on an explanation of this technique.
An HMM-based speech recognizer consists of five basic components: (a) an acoustic signal analyzer which computes a spectral representation of the incoming speech; (b) a set of phone models (HMMs) trained on large amounts of actual speech data; (c) a lexicon for converting sub-word phone sequences into words; (d) a statistical language model or grammar network that defines the recognition task in terms of legitimate word combinations at the sentence level; and (e) a decoder, which is a search algorithm for computing the best match between a spoken utterance and its corresponding word string. Figure 1 shows a schematic representation of the components of a speech recognizer and their functional interaction.
Figure 1. Components of a speech recognition device
A. Signal Analysis
The first step in automatic speech recognition consists of analyzing the incoming speech signal. When a person speaks into an ASR device--usually through a high quality noise-canceling microphone--the computer samples the analog input into a series of 16- or 8-bit values at a particular sampling frequency (ranging from 8 to 22 kHz). These values are grouped together in predetermined overlapping temporal intervals called "frames." These numbers provide a precise description of the speech signal's amplitude. In a second step, a number of acoustically relevant parameters such as energy, spectral features, and pitch information are extracted from the speech signal (for a visual representation of some of these parameters, see Figure 2 on page 53). During training, this information is used to model that particular portion of the speech signal. During recognition, this information is matched against the pre-existing model of the signal.
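To make the framing step concrete, the following sketch shows how a one-second, 16 kHz signal might be cut into overlapping frames and reduced to simple per-frame parameters (log energy and spectral magnitudes). The frame size, hop size, and silent test signal are illustrative choices, not those of any particular recognizer.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a speech signal into overlapping frames (illustrative values)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g., 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g., 160-sample hop -> overlap
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def frame_features(frames):
    """Per-frame log energy and spectral magnitudes, two of the acoustic
    parameters a recognizer's front end might extract."""
    energy = np.log(np.sum(frames.astype(np.float64) ** 2, axis=1) + 1e-10)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return energy, spectrum

# Usage: one second of silence stands in for real microphone input.
samples = np.zeros(16000, dtype=np.int16)
frames = frame_signal(samples)
log_energy, magnitudes = frame_features(frames)
print(frames.shape, log_energy.shape, magnitudes.shape)
```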
B. Phone Models
Training a machine to recognize spoken language amounts to modeling the basic sounds of speech (phones). Automatic speech recognition strings together these models to form words. Recognizing an incoming speech signal involves matching the observed acoustic sequence with a set of HMM models. An HMM can model either phones or other sub-word units, or it can model words or even whole sentences. Phones are either modeled as individual sounds--so-called monophones--or as phone combinations that model several phones and the transitions between them (biphones or triphones). After comparing the incoming acoustic signal with the HMMs representing the sounds of the language, the system computes a hypothesis based on the sequence of models that most closely resembles the incoming signal. The HMM model for each linguistic unit (phone or word) contains a probabilistic representation of all the possible pronunciations for that unit--just as the model of the handwritten cursive b would have many different representations. Building HMMs--a process called training--requires a large amount of speech data of the type the system is expected to recognize. Large-vocabulary speaker-independent continuous dictation systems are typically trained on tens of thousands of read utterances by a cross-section of the population, including members of different dialect regions and age groups. As a general rule, an automatic speech recognizer cannot correctly process speech that differs in kind from the speech it has been trained on. This is why most commercial dictation systems, when trained on standard American English, perform poorly when encountering accented speech, whether by non-native speakers or by speakers of different dialects. We will return to this point in our discussion of voice-interactive CALL applications.
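The following toy example illustrates the kind of probabilistic computation an HMM supports: the forward algorithm sums over all state paths to give the likelihood that a small left-to-right phone model produced an observation sequence. The three states, four discrete acoustic symbols, and all probability values are invented for illustration; real recognizers model continuous spectral features, typically with Gaussian mixtures or neural networks.

```python
import numpy as np

def forward_likelihood(obs, start_p, trans_p, emit_p):
    """Forward algorithm: total probability that an HMM produced the
    observation sequence `obs` (indices into the emission table)."""
    alpha = start_p * emit_p[:, obs[0]]           # P(state, first observation)
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]  # propagate through transitions, then emit
    return alpha.sum()

# Toy 3-state left-to-right phone model with 4 discrete acoustic symbols.
# All numbers are made up for illustration only.
start = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.1, 0.7, 0.1, 0.1],
                 [0.1, 0.1, 0.7, 0.1]])

print(forward_likelihood([0, 1, 1, 2], start, trans, emit))
```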
C. Lexicon
The lexicon, or dictionary, contains the phonetic spelling for all the words that are expected to be observed by the recognizer. It serves as a reference for converting the phone sequence determined by the search algorithm into a word. It must be carefully designed to cover the entire lexical domain in which the system is expected to perform. If the recognizer encounters a word it does not "know" (i.e., a word not defined in the lexicon), it will either choose the closest match or return an out-of-vocabulary recognition error. Whether a recognition error is registered as a misrecognition or an out-of-vocabulary error depends in part on the vocabulary size. If, for example, the vocabulary is too small for an unrestricted dictation task--let's say less than 3K--the out-of-vocabulary errors are likely to be very high. If the vocabulary is too large, the chance of misrecognition errors increases because with more similar-sounding words, the confusability increases. The vocabulary size in most commercial dictation systems tends to vary between 5K and 60K.
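A minimal sketch of the lexicon's role follows: each word maps to one or more phone sequences, and a word missing from the table surfaces as an out-of-vocabulary condition. The words and ARPAbet-style phone symbols are hypothetical examples, not entries from any shipped dictation system.

```python
# Hypothetical mini-lexicon: each entry maps a word to one or more
# phone-sequence pronunciations (ARPAbet-style symbols, illustrative only).
LEXICON = {
    "bear": [["B", "EH", "R"]],
    "bare": [["B", "EH", "R"]],                   # homophone: same phones, different word
    "attacked": [["AH", "T", "AE", "K", "T"]],
    "him": [["HH", "IH", "M"], ["IH", "M"]],      # reduced pronunciation variant
}

def phones_for(word):
    """Return known pronunciations, or flag an out-of-vocabulary word."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        return "<OOV>"   # recognizer would misrecognize or reject this word

print(phones_for("bear"))
print(phones_for("wolverine"))  # not in the lexicon -> out-of-vocabulary
```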
D. The Language Model
The language model predicts the most likely continuation of an utterance on the basis of statistical information about the frequency with which word sequences occur on average in the language to be recognized. For example, the word sequence "A bare attacked him" will have a very low probability in any language model based on standard English usage, whereas the sequence "A bear attacked him" will have a higher probability of occurring. Thus the language model helps constrain the recognition hypothesis produced on the basis of the acoustic decoding, just as the context helps decipher an unintelligible word in a handwritten note. Like the HMMs, an efficient language model must be trained on large amounts of data, in this case texts collected from the target domain.
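As an illustration of how a statistical language model ranks the two hypotheses above, here is a tiny bigram model with add-one smoothing estimated from a made-up corpus; any realistic model would be trained on millions of words, but the mechanics are the same.

```python
from collections import Counter

# A bigram language model estimated from a tiny, made-up training corpus.
corpus = "a bear attacked him . the bear ran . a bear ate ."
tokens = corpus.split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word, vocab_size=len(unigrams)):
    """P(word | prev) with add-one smoothing so unseen pairs are unlikely, not impossible."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_score(sentence):
    """Multiply bigram probabilities across the sentence."""
    words = sentence.lower().split()
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(prev, word)
    return score

# "a bear attacked him" should outscore the acoustically similar "a bare attacked him".
print(sentence_score("a bear attacked him"))
print(sentence_score("a bare attacked him"))
```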
In ASR applications with a constrained lexical domain and/or simple task definition, the language model consists of a grammatical network that defines the possible word sequences to be accepted by the system without providing any statistical information. This type of design is suitable for CALL applications in which the possible word combinations and phrases are known in advance and can be easily anticipated (e.g., based on user data collected with a system pre-prototype). Because of the a priori constraining function of a grammar network, applications with clearly defined task grammars tend to perform at much higher accuracy rates than the quality of the acoustic recognition would suggest.
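By way of contrast with the statistical model above, a task grammar for a constrained CALL exercise can be sketched simply as the finite set of word sequences the system will accept. The hotel-booking phrases below are invented for illustration.

```python
# A hand-written task grammar for a constrained CALL prompt, sketched as the
# set of word sequences the recognizer will accept (no statistics needed).
# The phrases are invented examples, not from any particular system.
GRAMMAR = {
    ("i", "would", "like", "a", "single", "room"),
    ("i", "would", "like", "a", "double", "room"),
    ("do", "you", "have", "a", "single", "room"),
    ("do", "you", "have", "a", "double", "room"),
}

def accepted(utterance):
    """The decoder only considers hypotheses licensed by the grammar,
    which is why task-grammar systems reach high accuracy rates."""
    return tuple(utterance.lower().split()) in GRAMMAR

print(accepted("I would like a single room"))   # True
print(accepted("Give me a room"))               # False: outside the task grammar
```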
E. Decoder
Simply put, the decoder is an algorithm that tries to find the utterance that maximizes the probability that a given sequence of speech sounds corresponds to that utterance. This is a search problem, and especially in large vocabulary systems careful consideration must be given to questions of efficiency and optimization, for example to whether the decoder should pursue only the most likely hypothesis or a number of them in parallel (Young, 1996). An exhaustive search of all possible completions of an utterance might ultimately be more accurate but of questionable value if one has to wait two days to get a result. Trade-offs are therefore necessary to maximize the search results while at the same time minimizing the amount of CPU and recognition time.
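A toy beam-search decoder is sketched below: at each step it expands the surviving hypotheses with candidate words and their (invented) log scores, then prunes to the best few. This is the kind of efficiency trade-off described above, in contrast to an exhaustive search of every continuation.

```python
import math

# Toy decoder: at each step the acoustic model proposes words with log scores
# (all values invented). Beam search keeps only the best `beam` partial
# hypotheses instead of exploring every continuation exhaustively.
STEP_SCORES = [
    {"a": -0.2, "uh": -1.5},
    {"bear": -0.4, "bare": -0.5},
    {"attacked": -0.3, "attack": -1.0},
    {"him": -0.2, "hymn": -1.2},
]

def beam_search(step_scores, beam=2):
    hyps = [([], 0.0)]                      # (word sequence, cumulative log score)
    for scores in step_scores:
        expanded = [(seq + [w], lp + s)
                    for seq, lp in hyps
                    for w, s in scores.items()]
        expanded.sort(key=lambda h: h[1], reverse=True)
        hyps = expanded[:beam]              # prune: keep only the top `beam` hypotheses
    return hyps[0]

best_seq, best_score = beam_search(STEP_SCORES)
print(" ".join(best_seq), math.exp(best_score))
```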
PERFORMANCE AND DESIGN ISSUES IN SPEECH APPLICATIONS
For educators and developers interested in deploying ASR in CALL applications, perhaps the most important consideration is recognition performance: How good is the technology? Is it ready to be deployed in language learning? These questions cannot be answered except with reference to particular applications of the technology, and therefore touch on a key issue in ASR development: the issue of human-machine interface design.
As we recall, speech recognition performance is always domain specific--a machine can only do what it is programmed to do, and a recognizer with models trained to recognize business news dictation under laboratory conditions will be unable to handle spontaneous conversational speech transmitted over noisy telephone channels. The question that needs to be answered is therefore not simply "How good is ASR technology?" but rather, "What do we want to use it for?" and "How do we get it to perform the task?"
In the following section, we will address the issue of system performance as it relates to a number of successful commercial speech applications. By emphasizing the distinction between recognizer performance on the one hand--understood in terms of "raw" recognition accuracy--and system performance on the other, we suggest how the latter can be optimized within an overall design that takes into account not only the factors that affect recognizer performance as such, but also, and perhaps even more importantly, considerations of human-machine interface design.
Historically, basic speech recognition research has focused almost exclusively on optimizing large vocabulary speaker-independent recognition of continuous dictation. A major impetus for this research has come from US government sponsored competitions held annually by the Defense Advanced Research Projects Agency (DARPA). The main emphasis of these competitions has been on improving the "raw" recognition accuracy--calculated in terms of average omissions, insertions, and substitutions--of large-vocabulary continuous speech recognizers (LVCSRs) in the task of recognizing read sentence material from a number of standard sources (e.g., The Wall Street Journal or The New York Times). The best laboratory systems that participated in the WSJ large-vocabulary continuous dictation task have achieved word error rates as low as 5%, that is, on average, one recognition error in every twenty words (Pallet, 1994).
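Word error rate of this kind is conventionally computed by aligning the recognizer output with the reference transcript and counting substitutions, insertions, and omissions (deletions). A minimal sketch, using standard Levenshtein alignment over words:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion (omission)
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# A 5% WER means roughly one error per twenty reference words.
print(word_error_rate("a bear attacked him in the woods",
                      "a bare attacked him in the woods"))  # 1/7, about 0.14
```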
CURRENT TRENDS IN VOICE-INTERACTIVE CALL
In recent years, an increasing number of speech laboratories have begun deploying speech technology in CALL applications. Results include voice-interactive prototype systems for teaching pronunciation, reading, and limited conversational skills in semi-constrained contexts. Our review of these applications is far from exhaustive. It covers a select number of mostly experimental systems that explore paths we found promising and worth pursuing. We will discuss the range of voice-interactions these systems offer for practicing certain language skills, explain their technical implementation, and comment on the pedagogical value of these implementations. Apart from giving a brief system overview, we report experimental results if available and provide an assessment of how far away the technology is from being deployed in commercial and educational environments.
Pronunciation Training
A useful and remarkably successful application of speech recognition and processing technology has been demonstrated by a number of research and commercial laboratories in the area of pronunciation training. Voice-interactive pronunciation tutors prompt students to repeat spoken words and phrases or to read aloud sentences in the target language for the purpose of practicing both the sounds and the intonation of the language. The key to teaching pronunciation successfully is corrective feedback, more specifically, a type of feedback that does not rely on the student's own perception. A number of experimental systems have implemented automatic pronunciation scoring as a means to evaluate spoken learner productions in terms of fluency, segmental quality (phonemes), and supra-segmental features (intonation). The automatically generated proficiency score can then be used as a basis for providing other modes of corrective feedback. We discuss segmental and supra-segmental feedback in more detail below.
Segmental Feedback. Technically, designing a voice-interactive pronunciation tutor goes beyond the state of the art required by commercial dictation systems. While the grammar and vocabulary of a pronunciation tutor is comparatively simple, the underlying speech processing technology tends to be complex since it must be customized to recognize and evaluate the disfluent speech of language learners. A conventional speech recognizer is designed to generate the most charitable reading of a speaker's utterance. Acoustic models are generalized so as to accept and recognize correctly a wide range of different accents and pronunciations. A pronunciation tutor, by contrast, must be trained to both recognize and correct subtle deviations from standard native pronunciations.
A number of techniques have been suggested for automatic recognition and scoring of non-native speech (Bernstein, 1997; Franco, Neumeyer, Kim, & Ronen, 1997; Kim, Franco, & Neumeyer, 1997; Witt & Young, 1997). In general terms, the procedure consists of building native pronunciation models and then measuring the non-native responses against the native models. This requires models trained on both native and non-native speech data in the target language, supplemented by a set of algorithms for measuring acoustic variables that have proven useful in distinguishing native from non-native speech. These variables include response latency, segment duration, inter-word pauses (in phrases), spectral likelihood, and fundamental frequency (F0). Machine scores are calculated from statistics derived from comparing non-native values for these variables to the native models.
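The sketch below shows the general shape of such scoring, not the algorithm of any cited system: a learner's measured values for two of the variables above are compared with invented native-speaker means and standard deviations, and the average deviation is mapped onto a simple numeric score.

```python
# Illustrative native-speaker statistics (mean, standard deviation) for two of
# the variables mentioned above; the numbers are invented, not from any study.
NATIVE_STATS = {
    "segment_duration_ms": (90.0, 20.0),
    "inter_word_pause_ms": (120.0, 60.0),
}

def pronunciation_score(measurements):
    """Score a learner utterance by how far each measured variable falls from
    the native mean, in standard deviations; closer to native -> higher score."""
    z_values = []
    for name, value in measurements.items():
        mean, std = NATIVE_STATS[name]
        z_values.append(abs(value - mean) / std)
    avg_deviation = sum(z_values) / len(z_values)
    return max(0.0, 10.0 - 2.0 * avg_deviation)   # map onto a 0-10 scale

print(pronunciation_score({"segment_duration_ms": 95.0,
                           "inter_word_pause_ms": 140.0}))   # near-native
print(pronunciation_score({"segment_duration_ms": 180.0,
                           "inter_word_pause_ms": 400.0}))   # strongly non-native
```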
In a final step, machine-generated pronunciation scores are validated by correlating these scores with the judgment of human expert listeners. As one would expect, the accuracy of scores increases with the duration of the utterance to be evaluated. Stanford Research Institute (SRI) has demonstrated a 0.44 correlation between machine scores and human scores at the phone level. At the sentence level, the machine-human correlation was 0.58, and at the speaker level it was 0.72 for a total of 50 utterances per speaker (Franco et al., 1997; Kim et al., 1997). These results compare with 0.55, 0.65, and 0.80 for phone, utterance, and speaker level correlation between human graders. A study conducted at Entropic shows that based on about 20 to 30 utterances per speaker and on a linear combination of the above techniques, it is possible to obtain machine-human grader correlation levels as high as 0.85 (Bernstein, 1997).
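Validation of this kind typically reports a Pearson correlation between machine scores and human ratings; a minimal sketch with made-up scores for eight speakers:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, the statistic used above to validate machine
    pronunciation scores against human expert ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up machine scores and human ratings for eight speakers.
machine = [6.1, 7.3, 4.8, 8.9, 5.5, 7.8, 3.9, 6.6]
human = [6.0, 7.0, 5.5, 9.0, 5.0, 8.0, 4.5, 6.0]
print(round(pearson_r(machine, human), 2))
```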
Others have used expert knowledge about systematic pronunciation errors made by L2 adult learners in order to diagnose and correct such errors. One such system is the European Community project SPELL for automated assessment and improvement of foreign language pronunciation (Hiller, Rooney, Vaughan, Eckert, Laver, & Jack, 1994). This system uses advanced speech processing and recognition technologies to assess pronunciation errors by L2 learners of English (French or Italian speakers) and provide immediate corrective feedback. One technique for detecting consonant errors induced by inter-language transfer was to include students' L1 pronunciations into the grammar network. In addition to the English /th/ sound, for example, the grammar network also includes /t/ or /s/, that is, errors typical of non-native Italian speakers of English. This system, although quite simple in the use of ASR technology, can be very effective in diagnosing and correcting known problems of L1 interference. However, it is less effective in detecting rare and more idiosyncratic pronunciation errors. Furthermore, it assumes that the phonetic system of the target language (e.g., English) can be accurately mapped to the learners' native language (e.g., Italian). While this assumption may work well for an Italian learner of English, it certainly does not for a Chinese learner; that is, there are sounds in Chinese that do not resemble any sounds in English.
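A sketch of the underlying idea, with invented phone symbols and a word not taken from SPELL itself: the grammar lists the target pronunciation alongside known L1-transfer substitutions, so whichever variant the recognizer matches doubles as a diagnosis.

```python
# Sketch of the SPELL-style idea described above: the recognition grammar for
# a target word also lists typical L1-transfer mispronunciations, so the
# system can tell the learner which substitution it heard. Phone symbols and
# the chosen word ("think") are illustrative, not taken from the SPELL system.
PRONUNCIATION_NETWORK = {
    "think": {
        ("TH", "IH", "NG", "K"): "correct",
        ("T", "IH", "NG", "K"): "substituted /t/ for /th/",
        ("S", "IH", "NG", "K"): "substituted /s/ for /th/",
    }
}

def diagnose(word, recognized_phones):
    """Map the recognized pronunciation variant onto corrective feedback."""
    variants = PRONUNCIATION_NETWORK[word]
    return variants.get(tuple(recognized_phones), "unexpected pronunciation")

print(diagnose("think", ["TH", "IH", "NG", "K"]))   # correct
print(diagnose("think", ["S", "IH", "NG", "K"]))    # known Italian-transfer error
```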
A system for teaching the pronunciation of Japanese long vowels, the mora nasal, and mora obstruents was recently built at the University of Tokyo. This system enables students to practice phonemic differences in Japanese that are known to present special challenges to L2 learners. It prompts students to pronounce minimal pairs (e.g., long and short vowels) and returns immediate feedback on segment duration. Based on the limited data, the system seems quite effective at this particular task. Learners quickly mastered the relevant duration cues, and the time spent on learning these pronunciation skills was well within the constraints of Japanese L2 curricula (Kawai & Hirose, 1997). However, the study provides no data on long-term effects of using the system.
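A minimal sketch of duration-based feedback of this general kind, assuming a hypothetical reference duration and threshold ratio rather than the Kawai and Hirose system's actual criteria:

```python
# Minimal sketch of duration-based feedback for vowel length, in the spirit of
# the Japanese tutor described above. The reference duration and 1.5 ratio are
# invented illustrations, not the system's actual thresholds.
def vowel_length_feedback(measured_ms, short_reference_ms=80.0, ratio=1.5):
    """Classify a produced vowel as short or long from its measured duration."""
    if measured_ms >= ratio * short_reference_ms:
        return "long vowel"
    return "short vowel"

# A learner aiming for a long vowel but producing only 95 ms:
print(vowel_length_feedback(95.0))    # "short vowel" -> prompt to lengthen
print(vowel_length_feedback(160.0))   # "long vowel"  -> target achieved
```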
Supra-segmental Feedback. Correct usage of supra-segmental features such as intonation and stress has been shown to improve the syntactic and semantic intelligibility of spoken language (Crystal, 1981). In spoken conversation, intonation and stress information not only helps listeners to locate phrase boundaries and word emphasis, but also to identify the pragmatic thrust of the utterance (e.g., interrogative vs. declarative). One of the main acoustical correlates of stress and intonation is fundamental frequency (F0); other acoustical characteristics include loudness, duration, and tempo. Most commercial signal processing software has tools for tracking and visually displaying F0 contours (see Figure 2). Such displays can and have been used to provide valuable pronunciation feedback to students. Experiments have shown that a visual F0 display of supra-segmental features combined with audio feedback is more effective than audio feedback alone (de Bot, 1983; James, 1976), especially if the student's F0 contour is displayed along with a native model. The feasibility of this type of visual feedback has been demonstrated by a number of simple prototypes (Abberton & Fourcin, 1975; Anderson-Hsieh, 1994; Hiller et al., 1994; Spaai & Hermes, 1993; Stibbard, 1996). We believe that this technology has good potential for being incorporated into commercial CALL systems.
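For readers curious how an F0 contour might be obtained in the first place, here is a rough per-frame pitch estimate using the autocorrelation method on a synthetic 200 Hz frame; production pitch trackers are considerably more robust, and the parameter values are only indicative.

```python
import numpy as np

def estimate_f0(frame, sample_rate=16000, f0_min=75.0, f0_max=400.0):
    """Rough per-frame F0 estimate: pick the strongest autocorrelation peak
    within the plausible pitch range; real trackers are more elaborate."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)          # shortest period considered
    lag_max = int(sample_rate / f0_min)          # longest period considered
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag

# Synthetic 200 Hz vowel-like frame (40 ms) as a stand-in for learner speech.
t = np.arange(0, 0.04, 1 / 16000)
frame = np.sin(2 * np.pi * 200 * t) + 0.3 * np.sin(2 * np.pi * 400 * t)
print(round(estimate_f0(frame)))   # ~200: one point of an F0 contour display
```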
Other types of visual pronunciation feedback include the graphical display of a native speaker's face, the vocal tract, spectrum information, and speech waveforms (see Figure 2). Experiments have shown that a visual display of the talker improves not only word identification accuracy (Bernstein & Christian, 1996), but also speech rhythm and timing (Markham & Nagano-Madesen, 1997). A large number of commercial pronunciation tutors on the market today offer this kind of feedback. Yet others have experimented with using a real-time spectrogram or waveform display of speech to provide pronunciation feedback. Molholt (1990) and Manuel (1990) report anecdotal success in using such displays along with guidance on how to interpret the displays to improve the pronunciation of suprasegmental features in L2 learners of English. However, the authors do not provide experimental evidence for the effectiveness of this type of visual feedback. Our own experience with real-time spectrum and waveform displays suggests their potential use as pronunciation feedback provided they are presented along with other types of feedback, as well as with instructions on how to interpret the displays.
Teaching Linguistic Structures and Limited Conversation
Apart from supporting systems for teaching basic pronunciation and literacy skills, ASR technology is being deployed in automated language tutors that offer practice in a variety of higher-level linguistic skills ranging from highly constrained grammar and vocabulary drills to limited conversational skills in simulated real-life situations. Prior to implementing any such system, a choice needs to be made between two fundamentally different system design types: closed response vs. open response design. In both designs, students are prompted for speech input by a combination of written, spoken, or graphical stimuli. However, the designs differ significantly with reference to the type of verbal computer-student interaction they support. In closed response systems, students must choose one response from a limited number of possible responses presented on the screen. Students know exactly what they are allowed to say in response to any given prompt. By contrast, in systems with open response design, the network remains hidden and the student is challenged to generate the appropriate response without any cues from the system.
Closed Response Designs. One of the first implementations of a closed response design was the Voice Interactive Language Instruction System (VILIS) developed at SRI (Bernstein & Rtischev, 1991). This system elicits spoken student responses by presenting queries about graphical displays of maps and charts. Students infer the right answers to a set of multiple-choice questions and produce spoken responses.
A more recent prototype currently under development at SRI is the Voice Interactive Language Training System (VILTS), a system designed to foster speaking and listening skills for beginning through advanced L2 learners of French (Egan, 1996; Neumeyer et al., 1996; Rypa, 1996). The system incorporates authentic, unscripted conversational materials collected from French speakers into an engaging, flexible, and user-centered lesson architecture. The system deploys speech recognition to guide students through the lessons and automatic pronunciation scoring to provide feedback on the fluency of student responses. As far as we know, only the pronunciation scoring aspect of the system has been validated in experimental trials (Neumeyer et al., 1996).
In pedagogically more sophisticated systems, the query-response mode is highly contextualized and presented as part of a simulated conversation with a virtual interlocutor. To stimulate student interest, closed response queries are often presented in the form of games or goal-driven tasks. One commercial system that exploits the full potential of this design is TraciTalk (Courseware Publishing International, Inc., Cupertino, CA), a voice-driven multimedia CALL system aimed at more advanced ESL learners. In a series of loosely connected scenarios, the system engages students in solving a mystery. Prior to each scenario, students are given a task (e.g., eliciting a certain type of information), and they accomplish this task by verbally interacting with characters on the screen. Each voice interaction offers several possible responses, and each spoken response moves the conversation in a slightly different direction. There are many paths through each scenario, and not every path yields the desired information. This motivates students to return to the beginning of the scene and try out a different interrogation strategy. Moreover, TraciTalk features an agent that students can ask for assistance and accepts spoken commands for navigating the system. Apart from being more fun and interesting, games and task-oriented programs implicitly provide positive feedback by giving students the feeling of having solved a problem solely by communicating in the target language.
The speech recognition technology underlying closed response query implementations is very simple, even in the more sophisticated systems. For any given interaction, the task perplexity is low and the vocabulary size is comparatively small. As a result, these systems tend to be very robust. Recognition accuracy rates in the low to upper 90% range can be expected, depending on task definition, vocabulary size, and the degree of non-native disfluency.
FUTURE TRENDS IN VOICE-INTERACTIVE CALL
In the previous sections, we reviewed the current state of speech technology, discussed some of the factors affecting recognition performance, and introduced a number of research prototypes that illustrate the range of speech-enabled CALL applications that are currently technically and pedagogically feasible. With the exception of a few exploratory open response dialog systems, most of these systems are designed to teach and evaluate linguistic form (pronunciation, fluency, vocabulary study, or grammatical structure). This is no coincidence. Formal features can be clearly identified and integrated into a focused task design. This means that robust performance can be expected. Furthermore, mastering linguistic form remains an important component of L2 instruction, despite the emphasis on communication (Holland, 1995). Prolonged, focused practice of a large number of items is still considered an effective means of expanding and reinforcing linguistic competence (Waters, 1994). However, such practice is time consuming. CALL can automate these aspects of language training, thereby freeing up valuable class time that would otherwise be spent on drills.
While such systems are an important step in the right direction, other more complex and ambitious applications are conceivable and no doubt desirable. Imagine a student being able to access the Internet, find the language of his or her choice, and tap into a comprehensive voice-interactive multimedia language program that would provide the equivalent of an entire first year of college instruction. The computer would evaluate the student's proficiency level and design a course of study tailored to his or her needs. Or think of using the same Internet resources and a set of high-level authoring tools to put together a series of virtual encounters surrounding the task of finding an apartment in Berlin. As a minimum, one would hope that natural speech input capacity becomes a routine feature of any CALL application.
To many educators, these may still seem like distant goals, and yet we believe that they are not beyond reach. In what follows, we identify four of the most persistent issues in building speech-enabled language learning applications and suggest how they might be resolved to enable a more widespread commercial implementation of speech technology in CALL.
1. More research is necessary on modeling and predicting multi-turn dialogs.
An intelligent open response language tutor must not only correctly recognize a given speech input, but in addition understand what has been said and evaluate the meaning of the utterance for pragmatic appropriateness. Automatic speech understanding requires Natural Language Processing (NLP) capabilities, a technology for extracting grammatical, semantic, and pragmatic information from written or spoken discourse. NLP has been successfully deployed in expert systems and information retrieval. One of the first voice-interactive dialog systems using NLP was the DARPA-sponsored Air Travel Information System (Pallett, 1995), which enables the user to obtain flight information and make ticket reservations over the telephone. Similar commercial systems have been implemented for automatic retrieval of weather and restaurant information, virtual environments, and telephone auto-attendants. Many of the lessons learned in developing such systems can be valuable for designing CALL applications for practicing conversational skills.
2. More and better training data are needed to support basic research on modeling non-native conversational speech.
One of the most needed resources for developing open response conversational CALL applications is large corpora of non-native transcribed speech data, of both read and conversational speech. Since accents vary depending on the student's first language, separate databases must either be collected for each L1 subgroup, or a representative sample of speakers of different languages must be included in the database. Creating such databases is extremely labor and cost intensive--a phone level transcription of spontaneous conversational data can cost up to one dollar per phone. A number of multilingual conversational databases of telephone speech are publicly available through the Linguistic Data Consortium (LDC), including Switchboard (US English) and CALLHOME (English, Japanese, Spanish, Chinese, Arabic, German). Our own effort, in collaboration with Johns Hopkins University (Byrne, Knodt, Khudanpur, & Bernstein, 1998; Knodt, Bernstein, & Todic, 1998), has been to collect and model spontaneous English conversations between Hispanic natives. All of these efforts will improve our understanding of the disfluent speech of language learners and help model this speech type for the purpose of human-machine communication.
DEFINING AND ACQUIRING LITERACY IN THE AGE OF INFORMATION
Moll defined literacy as "a particular way of using language for a variety of purposes, as a sociocultural practice with intellectual significance" (1994, p. 201). While traditional definitions of literacy have focused on reading and writing, the definition of literacy today is more complex. The process of becoming literate today involves more than learning how to use language effectively; rather, the process amplifies and changes both the cognitive and the linguistic functioning of the individual in society. One who is literate knows how to gather, analyze, and use information resources to solve problems and make decisions, as well as how to learn both independently and cooperatively. Ultimately, literate individuals possess a range of skills that enable them to participate fully in all aspects of modern society, from the workforce to the family to the academic community. Indeed, the development of literacy is "a dynamic and ongoing process of perpetual transformation" (Neilsen, 1989, p. 5), whose evolution is influenced by a person's interests, cultures, and experiences. Researchers have viewed literacy as a multifaceted concept for a number of years (Johns, 1997). However, succeeding in a digital, information-oriented society demands multiliteracies, that is, competence in an even more diverse set of functional, academic, critical, and electronic skills.
To be considered multiliterate, students today must acquire a battery of skills that will enable them to take advantage of the diverse modes of communication made possible by new technologies and to participate in global learning communities. Although becoming multiliterate is not an easy task for any student, it is especially difficult for ESL students operating in a second language. In their attempts to become multiliterate, ESL students must acquire linguistic competence in a new language and at the same time develop the cognitive and sociocultural skills necessary to gain access into the social, academic, and workforce environments of the 21st century. They must become functionally literate, able to speak, understand, read, and write English, as well as use English to acquire, articulate, and expand their knowledge. They must also become academically literate, able to read and understand interdisciplinary texts, analyze and respond to those texts through various modes of written and oral discourse, and expand their knowledge through sustained and focused research. Further, they must become critically literate, defined here as the ability to evaluate the validity and reliability of informational sources so that they may draw appropriate conclusions from their research efforts. Finally, in our digital age of information, students must become electronically literate, able "to select and use electronic tools for communication, construction, research, and autonomous learning" (Shetzer, 1998).
Helping students develop the range of literacies they need to enter and succeed at various levels of the academic hierarchy and subsequently in the workforce requires a pedagogy that facilitates and hastens linguistic proficiency development, familiarizes students with the requirements and conventions of academic discourse, and supports the use of critical thinking and higher order cognitive processes. A large body of research conducted over the past decade (see, e.g., Benesch, 1988; Brinton, Snow, & Wesche, 1989; Crandall, 1993; Kasper, 1997a, 2000a; Pally, 2000; Snow & Brinton, 1997) has shown that content-based instruction (CBI) is highly effective in helping ESL students develop the literacies they need to be successful in academic and workforce environments.
CONTENT-BASED INSTRUCTION AND LITERACY DEVELOPMENT
CBI develops linguistic competence and functional literacy by exposing ESL learners to interdisciplinary input that consists of both "everyday" communicative and academic language (Cummins, 1981; Mohan, 1990; Spanos, 1989) and that contains a wide range of vocabulary, forms, registers, and pragmatic functions (Snow, Met, & Genesee, 1989; Zuengler & Brinton, 1997). Because content-based pedagogy encourages students to use English to gather, synthesize, evaluate, and articulate interdisciplinary information and knowledge (Pally, 1997), it also allows them to hone academic and critical literacy skills as they practice appropriate patterns of academic discourse (Kasper, 2000b) and become familiar with sociolinguistic conventions relating to audience and purpose (Soter, 1990).
The theoretical foundations supporting a content-based model of ESL instruction derive from cognitive learning theory and second language acquisition (SLA) research. Cognitive learning theory posits that in the process of acquiring literacy skills, students progress through a series of three stages: the cognitive, the associative, and the autonomous (Anderson, 1983a). Progression through these stages is facilitated by scaffolding, which involves providing extensive instructional support during the initial stages of learning and gradually removing this support as students become more proficient at the task (Chamot & O'Malley, 1994). Second language acquisition (SLA) research emphasizes that literacy development can be facilitated by providing multiple opportunities for learners to interact in communicative contexts with authentic, linguistically challenging materials that are relevant to their personal and educational goals (see, e.g., Brinton, et al., 1989; Kasper, 2000a; Krashen, 1982; Snow & Brinton, 1997; Snow, et al., 1989).
In a 1996 paper published in The Harvard Educational Review, The New London Group (NLG) advocated developing multiliteracies through a pedagogy that involves a complex interaction of four factors, which they called Situated Practice, Overt Instruction, Critical Framing, and Transformed Practice. According to the NLG, becoming multiliterate requires critical engagement in relevant tasks, interaction with diverse forms of communication made possible by electronic technologies, and participation in collaborative learning contexts. Warschauer (1999) concurred and stated that a pedagogy of critical inquiry and problem solving that provides the context for "authentic and collaborative projects and analyses" (p. 16) that support and are supported by the use of electronic technologies is necessary for ESL students to acquire the linguistic, social, and technological competencies key to literacy in a digital world.
According to a 1995 report published by the United States Department of Education, "technology is an important enabler for classes organized around complex, authentic tasks" and when "used in support of challenging projects, [technology] can contribute to students' sense… that they are using real tools for real purposes." Technology use increases students' motivation as it promotes their active engagement with language and content through authentic, challenging tasks that are interdisciplinary in nature (McGrath, 1998). Technology use also encourages students to spend more time on task. As they search for information in a hyperlinked environment, ESL students benefit from increased opportunities to process linguistic and content information. Used as a tool for learning, technology supports a level of task authenticity and complexity that fits well with the interdisciplinary work inherent in content-based instruction and that promotes the acquisition of multiliteracies.
THEORY INTO PRACTICE
These research findings suggest that in our efforts to prepare ESL students for the challenges of the academic and workforce environments of the 21st century, we should adopt a pedagogical model that incorporates information technology as an integral component and that specifically targets the development of the range of literacies deemed necessary for success in a digital, information-oriented society. This paper describes a content-based pedagogy, which I call focus discipline research (Kasper, 1998a), and presents the results of a classroom study conducted to measure the effects of focus discipline research on the development of ESL students' literacy skills.
As described here, focus discipline research puts theory into practice as it incorporates the principles of cognitive learning theory, SLA research, and the four components of the NLG's (1996) pedagogy of multiliteracies. Through pedagogical activities that provide the context for situated practice, overt instruction, critical framing, and transformed practice, focus discipline research promotes ESL students' choice of and responsibility for course content, engages them in extended practice with linguistic structures and interdisciplinary material, and encourages them to become "content experts" in a subject of their own choosing.
CONCLUSION
It can be seen that it is difficult and probably undesirable to attempt to determine the difficulty of a listening and viewing task in any absolute terms. By considering the three aspects that affect the level of difficulty, namely text, task, and context features, it is possible to identify those characteristics of tasks that can be manipulated. Having identified the variable characteristics of tasks in developing the model, it is necessary to look to the dynamic interaction among tasks, texts, and the computer-based environment.
Task design and text selection in this model also incorporate the identification and consideration of context. Teachers can make provision for their influence on learner perception of difficulty by providing texts and tasks that range across these levels, and by ensuring that learners with lower language proficiency can ease themselves gradually into the more contextually difficult tasks. This can be achieved by reducing the level of difficulty of other parameters such as text or task difficulty, or by minimizing other aspects of contextual difficulty. Thus, for example, learners of lower proficiency who are exposed for the first time to a task based on a broadcast announcement would be provided with appropriate visual support in the form of graphics or video to reduce textual difficulty. The task type would also be kept to a low level of cognitive demand (Hoven, 1991, 1997a, 1997b).
In a CELL environment, this identification of parameters of difficulty enables task designers to develop and modify tasks on the basis of clear language pedagogy that is both learner-centred and cognitively sound. Learners are provided with the necessary information on text, task, and context to make informed choices, and are given opportunities to implement their decisions. Teachers are therefore creating a CELL environment that facilitates and encourages exploration of, and experimentation with, the choices available. Within this model, learners are then able to adjust their own learning paths through the texts and tasks, and can do this at their own pace and at their individual points of readiness. In sociocultural terms, the model provides learners with a guiding framework or community of practice within which to develop through their individual Zones of Proximal Development. The model provides them with the tools to mediate meaning in the form of software incorporating information, feedback, and appropriate help systems.
By taking account of learners' needs and making provision for learner choice in this way, one of the major advantages of using computers in language learning--their capacity to allow learners to work at their own pace and in their own time--can be more fully exploited. It then becomes our task as researchers to evaluate, with learners' assistance, the effectiveness of environments such as these in improving their listening and viewing comprehension as well as their approaches to learning in these environments.
REFERENCES
1. Adair-Hauck, B., & Donato, R. (1994). Foreign language explanations within the zone of proximal development. The Canadian Modern Language Review 50(3), 532-557.
2. Anderson, A., & Lynch, T. (1988). Listening. Oxford: Oxford University Press.
3. Armstrong, D. F., Stokoe, W. C., & Wilcox, S. E. (1995). Gesture and the nature of language. Cambridge: University of Cambridge.
4. Arndt, H., & Janney, R. W. (1987). InterGrammar: Toward an integrative model of verbal, prosodic and kinesic choices in speech. Berlin: Mouton de Gruyter.
5. Asher, J. J. (1981). Comprehension training: The evidence from laboratory and classroom studies. In H. Winitz (Ed.), The Comprehension Approach to Foreign Language Instruction (pp. 187-222). Rowley, MA: Newbury House.
6. Bacon, S. M. (1992a). Authentic listening in Spanish: How learners adjust their strategies to the difficulty of input. Hispania 75, 29-43.
7. Bacon, S. M. (1992b). The relationship between gender, comprehension, processing strategies, cognitive and affective response in foreign language listening. Modern Language Journal 76(2), 160-178.
8. Batley, E. M., & Freudenstein, R. (Eds.). (1991). CALL for the Nineties: Computer Technology in Language Learning. Marburg, Germany: FIPLV/EUROCENTRES.
9. Ellis, R. (1985). Understanding second language acquisition. Oxford: Oxford University Press.
10. Faerch, C., & Kasper, G. (1986). The role of comprehension in second language learning. Applied Linguistics 7(3), 257-274.
11. Felder, R. M., & Henriques, E. R. (1995). Learning and teaching styles in foreign language education. Foreign Language Annals 28, 21-31.
12. Felix, U. (1995). Theater Interaktiv: Multimedia integration of language and literature. On-CALL 9, 12-16.
13. Fidelman, C. (1994). In the French Body/In the German Body: Project results. Demonstrated at the CALICO '94 Annual Symposium "Human Factors." Northern Arizona University, Flagstaff, AZ.
14. Fidelman, C. G. (1997). Extending the language curriculum with enabling technologies: Nonverbal communication and interactive video. In K. A. Murphy-Judy (Ed.), NEXUS: The convergence of language teaching and research using technology (pp. 28-41). Durham, NC: CALICO.
15. Fish, H. (1981). Graded activities and authentic materials for listening comprehension. In The teaching of listening comprehension. ELT Documents Special: Papers presented at the Goethe Institut Colloquium Paris 1979 (pp. 107-115). London: British Council.
16. Garrigues, M. (1991). Teaching and learning languages with interactive videodisc. In M. D. Bush, A. Slaton, M. Verano, & M. E. Slayden (Eds.), Interactive videodisc: The "Why" and the "How" (CALICO Monograph Series, Vol. 2, Spring, pp. 37-43). Provo, UT: Brigham Young Press.
17. Gassin, J. (1992). Interkinesics and interprosodics in second language acquisition. Australian Review of Applied Linguistics 15(1), 95-106.
18. Hoven, D. (1997a). Instructional design for multimedia: Towards a learner-centred CELL (Computer-Enhanced Language Learning) model. In K. A. Murphy-Judy (Ed.), NEXUS: The convergence of language teaching and research using technology (pp. 98-111). Durham, NC: CALICO.
19. Hoven, D. (1997b). Improving the management of flow of control in computer-assisted listening comprehension tasks for second and foreign language learners. Unpublished doctoral dissertation, University of Queensland, Brisbane, Australia. Retrieved July 25, 1999 from the World Wide Web: jcs120.jcs.uq.edu.au/~dlh/thesis/.
20. Richards, J. C. (1983). Listening comprehension: Approach, design, procedure. TESOL Quarterly 17(2), 219-240.