Bettye Keyes - USA

Realtime by Voice: Just what you need to know.

Intersteno Congress - Prague July 2007


Thank you, Mr. President Zaviacic, and thanks to all the people of Interinfo Czech Republic who prepared this Congress for a splendid welcome to Prague.

I would like to speak about a technology that is relatively new in Intersteno, yet has taken quite a long time to mature.  In the late 1870s Alexander Graham Bell began work on his phonautograph, a machine which was to understand human voices.  Perhaps he was inspired by Sir Isaac Pitman, whose theory of shorthand was called "Phonography, or Writing by Sound."  Clearly, these gentlemen were on to something, although technology would take well over a century to realize this dream.  The global speech recognition technology market topped $1 billion for the first time in 2006, a 100 percent increase in two years.  The embedded speech recognition market, which includes telephones and automobiles, is expected to reach $500 million by 2010.  In 2006, Nuance's sales grew 20 percent to more than $300 million.

The American companies that provide computer-aided realtime transcription software -- some of which were demonstrated at the Vienna Congress -- have either integrated speech recognition with their realtime transcription software, or have chosen to focus solely on speech recognition development for their court reporting and subtitling products.

Intersteno members and others in Australia, the Czech Republic, India, Japan, The Netherlands, Russia, Turkey and the United Kingdom are either using or evaluating speech recognition-based systems.

What is Voice Writing?

I'm informed that we have some new friends in the audience, so I would like to give a brief overview of what a person who uses speech recognition -- also known as a "voice writer" -- does.  In fact, this presentation will concentrate on the practical aspects of speech recognition technology and how it is used.

Voice writers speak into a speech silencer containing a built-in microphone.  The speech silencer can be plugged into a computer and/or an audio recording device.  The voice writer simply repeats the words spoken by participants in the courtroom, parliament chamber, classroom, or other workplace setting, as a recording of his speech is being made.  The purpose of a speech silencer is not only to isolate the reporter's voice so that a clear recording of the dictation is made, but also to prevent any other persons from hearing the reporter during dictation.

At this point, I should explain the difference between a voice writer and a realtime voice writer.

In the States, standard court reporting examination certifications reach 225 and 250 words per minute.  Realtime court reporting certifications may vary in speed from 180 to 200 words per minute.

For a basic certification test, a voice writer connects his speech silencer mask to an audio recording device to record his dictation, then uses either a standard cassette transcriber or transcription software to play back his voice recording.  He then listens to the playback through his headphones, and manually types every word of his dictation to produce the transcript.

For a realtime certification test, a voice writer connects his speech silencer mask either to an external USB sound card or directly to a computer, and his dictation is processed by speech recognition software to produce text in realtime.  The test-taker is graded on the accuracy of his live performance.

A person who utilizes the voice method of reporting to manually type the words of his recorded dictation, without the application of speech recognition software, is a "voice writer."  A voice writer who utilizes speech recognition to produce a realtime record is a "realtime voice writer."

When a voice writer uses speech recognition, audio signals of his dictation are captured by the microphone within the speech silencer mask, then sent through a cable to a sound card and converted to digital patterns.  That information is then sent to the speech recognition engine in the computer and those signals are analyzed to convert speech into text.  When computer-aided transcription software is used in conjunction with a speech engine, the text that appears on the reporter's computer screen can be streamed over the Internet, sent to a television station for subtitling, or sent to nearby computers for realtime viewing.
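
To make that flow concrete, here is a minimal sketch, in Python and purely for illustration, of the final step: pushing recognized text to a nearby viewer's computer over a network connection.  The host, port, and sample lines are my own invention and do not represent any actual CAT product's interface.

    import socket

    HOST, PORT = "0.0.0.0", 9000  # hypothetical address for the realtime feed

    def serve_realtime_feed(lines):
        """Send each recognized line of text to one connected viewer."""
        with socket.create_server((HOST, PORT)) as server:
            conn, _addr = server.accept()       # wait for a viewer to connect
            with conn:
                for line in lines:
                    conn.sendall((line + "\n").encode("utf-8"))

    if __name__ == "__main__":
        # In practice the lines would arrive from the speech engine as the
        # reporter dictates; a fixed list stands in for that stream here.
        serve_realtime_feed(["MR. NOVAK:  Good morning.",
                             "THE COURT:  Good morning, counsel."])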

Some reporters use external headset microphones, which generally provide better recognition than masks.  The re-speaking method, used by our Italian colleagues, is a good example.  Only the method used to take down words differs between practitioners of realtime reporting activities; all other pre- and post-production aspects are identical.

About Voice Writing Theory

Just as our hand and machine stenography brothers and sisters enjoy the use of special stenograms, "voice codes" (or code words) form the basis of dictation shorthand methods for users of speech recognition systems.  Voice codes, however, are deliberately designed not to sound like regular words in a language.  They are distinctive utterances that a speech engine will not interpret as pronunciations of regular words, so a voice writer can: 1.) identify speakers, 2.) resolve conflicts between similar-sounding words, 3.) force-produce words that a speech engine would not ordinarily produce, and 4.) produce long phrases and other text, with or without formatting, in the myriad scenarios that arise in meeting the realtime transcription performance standards in Intersteno's domains of activity.  Together, these voice codes and the conventions for their use comprise a special dictation language known as a "voice writing theory."

The major difference between a voice writing theory and other theories for shorthand writing methods is that a voice writing theory is designed to work within the parameters of a speech engine's capacity to make decisions based on probability.  When choosing a proper set of voice codes for a conflict-free voice writing theory, the goal is to understand the speech engine's inner workings as well as possible, so that you can better control the program's output and obtain a more predictable result.

Designing an effective theory is a matter of creativity, experimentation, and time.  Working conflict-free theories are already available in English, but they are designed to work in conjunction with advanced computer-aided realtime transcription applications known as CAT software.

The Purpose of a CAT Program

Let me address the difference between utilizing a speech engine alone and utilizing a speech engine in conjunction with CAT software.

With a speech recognition program alone, you can use your voice as a tool to open and close programs, navigate on the Internet, replace keyboard- and mouse-controlled operations, and transcribe text.  But for Intersteno's domains of activity, the best focus is on transcription.  Using a speech engine alone is not recommended because it produces inconsistent results -- shortcomings that leading CAT software developers have addressed.

Benefits of using CAT software include the ability to send a realtime text feed, make digital recording tracks of your voice and the external room, synchronize text with audio recordings, easily play back a portion of an audio file while the recording is still being made, instantly resolve language conflicts, and automatically perform formatting functions.  Unnecessary speech recognition commands for realtime voice writing are disabled so they do not interfere with your text production work.  You can also simultaneously edit while dictating, and produce words on-the-fly even when they are not contained in a speech engine's vocabulary.

Voice Writing Equipment

The basic equipment is a computer, headset microphone or speech silencer mask, USB soundcard (optional), speech recognition engine, and CAT software.

When choosing a desktop or laptop computer, the fastest processor and a minimum of 2 gigabytes of RAM should be standard.  There should be enough USB ports available (at least three) to accommodate an external sound card, an external storage device -- such as a floppy drive or memory stick -- and a CAT software activation key.

The total cost of hardware ranges from $2,300 to $3,500.

How Speech Recognition Works

In everyday conversation, we run words together and drag out our pronunciation of certain syllables to emphasize how we feel, and we usually are not accustomed to enunciating sounds within small words, such as the "d" in "and" in English.  This is probably because other humans recognize us better when we use more expressive styles of speech to convey meaning.  And it isn't necessary to distinctly pronounce the ending "d" sound in "and" for another person to understand what we said.

Humans and computers process data in different ways.  We recognize speech through top-down processing, where we distinguish words based on concepts and circumstantial knowledge.  Computers recognize speech through bottom-up processing, which involves analyzing basic sound structures.  The most basic sound unit of speech is called a "phoneme."

For example, the English word "sit" contains three phonemes: the sounds "s," "i," and "t."  The English language contains approximately 48 phonemes in all.  Computers recognize speech by listening to these sound parts and breaking them down into sub-parts when converting them to digital form.  After speech is converted to digital form, the most highly probable word choices are extracted for grammatical analysis using algorithms, and a best-match guess is made.

If a child says, "I want my blankowit," we understand the word is "blanket"; this is our interpretation of those word pronunciations based on meaning and circumstantial knowledge.  A computer may recognize the child to have said "blank or with" or "banquet"; this is the computer's interpretation based on its analysis of only the sounds it hears along with algorithms of grammar.

When we run words together, we can still understand one another, but the computer cannot, unless we speak clearly and at a "trained" speed that allows it to distinguish one word from the next, so it is important to set boundaries between words.  This does not mean, however, that we must pause discretely between each word.  Speech recognition engines recognize small words better when they are spoken in conjunction with other words, and in phrases; pausing at certain intervals lets the computer know what to process as a single word or as a group of words, and it even enhances the quality of word selections based on grammar.

Speech-to-Text Conversion

There is a three-step process to convert speech into text: 1.) speech to analog, 2.) analog to digital, and 3.) digital to text.

When we speak into a microphone, the computer uses a sound card and software algorithms to digitize our analog signals.  The process begins in the sound card's analog-to-digital converter, which converts our words into a predefined range of distinct number sets.  The patterns of digital number sets are then measured against other digital patterns representing prototypes of phonemes stored within the speech recognition engine.
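
As a toy illustration of the analog-to-digital step, the following Python sketch samples a pure tone and quantizes each sample into the predefined range of a signed 16-bit integer.  A real sound card performs the same conversion in hardware; the sample rate and tone chosen here are simply convenient examples.

    import math

    SAMPLE_RATE = 16000      # samples per second, a typical rate for speech
    MAX_AMPLITUDE = 32767    # the range of a signed 16-bit sample

    def digitize(duration_s=0.001, freq_hz=440.0):
        """Sample a pure tone and quantize each sample to a 16-bit integer."""
        samples = []
        for n in range(int(SAMPLE_RATE * duration_s)):
            analog = math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)  # -1.0 to 1.0
            samples.append(int(analog * MAX_AMPLITUDE))                 # quantize
        return samples

    print(digitize()[:8])    # the first few "distinct number sets"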

Factored into the speech-to-text conversion is the retrieval and dissemination of information through the speech recognition engine's acoustic model, grammatical model, and vocabulary components.  The acoustic model holds sound data associated with your voice, which is shared with the speech recognition engine's vocabulary and grammatical model for word-priority selections.  After possible word choices are retrieved from the vocabulary (or word bank), information from the grammatical model is used to make a best-match guess based upon complex algorithmic analyses that search for syntax of single-word and word-group relationships.  The selection of words can be affected by minor variations in speech, including enunciation, inflection, and rate of dictation.
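
The following Python sketch, with figures invented for illustration, conveys the spirit of that best-match guess: each vocabulary candidate carries an acoustic score, the grammatical model scores how plausible the candidate is after the preceding word, and the combined score decides.  It echoes the child's "blankowit" example given earlier.

    def best_match(candidates, previous_word, bigram_probs):
        """Pick the candidate whose combined acoustic and grammar score is highest."""
        def score(candidate):
            word, acoustic = candidate
            grammar = bigram_probs.get((previous_word, word), 0.001)  # rare pairing
            return acoustic * grammar
        return max(candidates, key=score)[0]

    # "I want my blank..." -- acoustically "banquet" matches well, but the
    # grammatical model strongly prefers "blanket" after "my."
    candidates = [("banquet", 0.80), ("blanket", 0.70), ("blank", 0.40)]
    bigrams = {("my", "blanket"): 0.05, ("my", "banquet"): 0.002}
    print(best_match(candidates, "my", bigrams))    # prints: blanket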

The reporter must learn the proper way to deliver speech input so that the computer produces accurate and expected output.  Consistent dictation style will minimize the risk of error.  In order to achieve the highest accuracy, focus is placed on three critical adjustment factors to tune the speech recognition engine's performance measures based on its three design elements: 1.) the acoustic model, 2.) the vocabulary, and 3.) the grammatical model.

Acoustic model adjustments are made using the correction tool.  This improves accuracy by giving the speech recognition engine additional strata of acoustic data to properly recognize how you pronounce certain words.  The vocabulary and grammatical model are customized using the vocabulary building tool.  This is where the speech recognition engine analyzes texts for the purpose of adding words to the speech engine's vocabulary, and enhances recognition within the context of your voice theory.  Additional accuracy improvement actions include adding words and phrases and using the audio setup tool.

How a CAT Program Works

As a speech recognition engine outputs text, the CAT program filters the information and either allows it to appear as-is or processes it further to produce special output as a final realtime result.  For special output requirements, CAT programs must be instructed on what output to produce and how the output should be generated.  CAT programs use combinations of characters and symbols as code -- which I call "CAT code" -- to interpret what text to produce and/or what formatting functions to perform.  When programmed with voice codes, speech recognition engines and CAT programs work together to achieve more specific results.

The most common example used to explain how a CAT program converts speech-to-text output is the generation of a question symbol in a transcript.  The string of text and commands associated with a question symbol is: 1.) insert a period and remove the previous space, unless other punctuation is present; 2.) produce a hard return; 3.) insert capital letter Q; 4.) produce a tab; and 5.) capitalize the first letter of the next word.

Say, for example, the voice code used to represent a question symbol is "Q-mac" and the CAT code associated with the required output is a less-than symbol, plus a capital letter Q, plus a greater-than symbol (<Q>).  The reporter will dictate the voice code "Q-mac" to invoke the CAT code "<Q>" which will produce the above string of text along with its formatting.
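
A minimal sketch in Python may help illustrate this.  The voice code "Q-mac," the CAT code "<Q>," and the five formatting steps come from the description above; the data structures themselves are hypothetical and do not represent any vendor's actual format.

    CAT_CODES = {"Q-mac": "<Q>"}    # voice code -> CAT code

    def expand_question(transcript):
        """Apply the five steps listed above for the question symbol."""
        stripped = transcript.rstrip()                # remove the previous space
        if stripped and stripped[-1] not in ".?!":    # unless punctuation is present
            stripped += "."                           # 1.) insert a period
        return stripped + "\n" + "Q" + "\t"           # 2.) hard return, 3.) Q, 4.) tab

    def dictate(voice_code, transcript, next_word):
        if CAT_CODES.get(voice_code) == "<Q>":
            transcript = expand_question(transcript)
            next_word = next_word.capitalize()        # 5.) capitalize the next word
        return transcript + next_word

    print(repr(dictate("Q-mac", "The witness is sworn ", "did")))
    # prints: 'The witness is sworn.\nQ\tDid'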

Dictation Techniques

The majority of realtime voice writing success is based upon mastering proper dictation style.  Voice writers must train to dictate clearly at higher-than-average speeds.  At first your words will not sound clear when spoken quickly, but over time you will find that you can increase your rate of speaking with no loss of clarity at realtime speeds.  Factors which require training include breathing, dictation pattern theory, enunciation, modulation, pace, paraphrasing, punctuation, tone and volume.

Voice Writing Theory

For normal speech recognition dictation activities, very little special education is required.  However, the use of speech recognition in Intersteno's domains of activity requires standardized methods to accommodate words, word parts, and phrases within an easy-to-understand, practical framework of voice codes.  The standardized framework known as a "theory" should include the means for accommodating brief forms, conflict resolution, control of homophone/homonym word selections, indication symbols, on-the-fly word translations, parentheticals, phrases, prefixes and suffixes, punctuation, small words, speaker identification, a spelling technique to produce letters of the alphabet, and written numbers versus digits.

Creating a Voice Model

Speech recognition engines used for Intersteno's domains of activity require that users work through a process called "general voice training," where one reads aloud stories provided by the speech recognition engine so that the computer can learn how the person speaks.  In a realtime dictation session, it is very important to always use the same equipment that was initially used to create a voice model.

As you read the first story, you must speak more slowly than you would during ordinary conversation, because the speech recognition engine does not yet have enough acoustic data associated with your human voice.  As you progress through the remainder of the first story and through each successive story, you can increase your dictation pace.  You may then perform the remaining general voice training stories at realtime speeds so that the speech recognition engine will understand the rate of speech to expect when you are in an environment to actually provide realtime services.

Optional User Settings

Speech recognition engines offer a wide range of options covering the most common realtime voice writing preferences, such as the proper formatting of numbers and words.  Explanations and descriptions of recommended speech recognition engine and CAT software options can be found in their respective users' manuals.  If you would like more information, I will be happy to provide resources at any time during the Congress.

Improving Accuracy

After creating a base voice model and programming your voice theory language, the next step in realtime preparation is to maximize the speech recognition engine's ability to accurately recognize your speech.

Although using the aforementioned dictation techniques will produce the highest accuracy results, fine-tuning work is necessary for a new voice model, because at that stage the speech recognition engine possesses only a generic representation of your speech, based on the rules of analysis and measurements that were applied during the creation process.  Complete dedication should be given first to how we dictate.  The best method of working on proper speech development is to focus on dictation by reading text aloud and applying the dictation techniques until they become natural.  When the speech recognition engine receives proper input, you will know that errors are attributable to the speech recognition engine, enhancing your ability to determine what adjustments to make during the correction process.

Vocabulary Building

Commercial speech recognition engines contain vocabulary building tools you can use to customize the grammatical model by analyzing documents.  This improves the speech recognition engine's ability to establish relationships between words and other entries contained within its vocabulary.  It also automates the task of finding and adding new words, some of which may not be in everyday use or are otherwise specific to a particular industry or occupation.  You can create separate grammatical models with vocabularies specific to, for example, parliamentary reporting, or reporting on maritime or sports information.

Speed-Building Plan

The mental coordination involved in listening and repeating can scatter the focus necessary for personal linguistic programming.  Acquiring proper dictation skills for speech recognition requires dedicated focus on personal linguistic programming, and the best way to achieve this is by reading a text display at a set, constant speed.  The goal is to dictate aloud and get comfortable at a given speed, then dictate into the speech recognition engine at that rate and improve your voice model.  When your speech recognition accuracy reaches your goal, you can increase the rate in five-word-per-minute increments, dictating and correcting at each interval until your speech recognition accuracy reaches the goal required by your organization.  Once your accuracy and speed goal has been met, you can focus on merging the skill of listening and repeating with the skill of dictating properly.

The best tool for this task is a software program with which the user can display text on the screen and scroll at various speeds.
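
A bare-bones version of such a pacing tool can be sketched in a few lines of Python: it prints the words of a practice text one at a time at a fixed words-per-minute rate, with console output standing in for the scrolling display of a real training program.  The practice sentence is merely an example.

    import sys
    import time

    def pace_text(text, wpm=150):
        """Print the words of a practice text at a fixed words-per-minute rate."""
        interval = 60.0 / wpm                  # seconds allotted to each word
        for word in text.split():
            sys.stdout.write(word + " ")
            sys.stdout.flush()
            time.sleep(interval)
        print()

    if __name__ == "__main__":
        pace_text("Ladies and gentlemen of the jury, thank you for your service.",
                  wpm=150)    # raise in five-word increments as accuracy allows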

N-Gram Phrase Extractor

For quickly determining which phrases to add to your speech recognition engine vocabulary, you can use the N-Gram Phrase Extractor, a state-of-the-art data mining tool that automates the process of finding commonly-occurring word groups contained within documents.  This program can be accessed on the Internet at the Voice Writing Method website at www.voicewritingmethod.com, via the Links button.

The program is written for English and French, but it also works well in German.  To use this tool, obtain previously prepared texts saved in ASCII file format (with a .txt extension) that represent the material you will produce in your realtime work.  Depending on the options you select, the N-Gram Phrase Extractor will identify the most common phrases in two-word to five-word combinations.
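
The extractor itself is available at the website mentioned above; the following Python sketch is only a toy reimplementation of the idea, counting every two- to five-word sequence in a plain-text file and reporting the most common ones.  The file name is hypothetical.

    from collections import Counter

    def extract_phrases(path, min_n=2, max_n=5, top=20):
        """Count every two- to five-word sequence in a plain-text file."""
        with open(path, encoding="ascii", errors="ignore") as f:
            words = f.read().lower().split()
        counts = Counter()
        for n in range(min_n, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
        return counts.most_common(top)

    if __name__ == "__main__":
        for phrase, freq in extract_phrases("sample.txt"):    # hypothetical file
            print(freq, phrase)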

Training

Reporters and subtitlers require the same academic classes, including legal and medical language, business law, and continuous language training.  Using the latest software and adequate hardware, a new student can dictate at 150 words per minute with extremely high accuracy.  A speech recognition student can become realtime-certifiable well within 24 months.  Reporting and subtitling educational programs can increase the volume of qualified graduates while decreasing the time required for students to achieve proficiency.

Although I have conveyed this information in English, IBM ViaVoice and Dragon NaturallySpeaking are presently written for about 10 language variations, and the concepts in this discussion can apply to all languages.

I gladly look forward to welcoming interested Intersteno members of all languages into the realtime speech recognition future.