Chad Theriot - AudioScribe Corporation - USA

Speech Recognition as a Rich Media Component

Intersteno Congress - Prague July 2007


Hello, my name is Chad Theriot, and I am the CEO of AudioScribe.  Our company produces software that utilizes speech recognition for the court reporting industry in America and internationally.  It is used by professional voice writers to produce realtime text.  Our technology is based on the single-user dictation model and the highly accurate results these professionals can produce with the right tools.  Throughout the world, in courtrooms, boardrooms, classrooms, churches, and the military, speech recognition professionals are creating possibilities for people everywhere to have access to the information they want, when they want it.  We are going to touch on a few examples today.

     Courtroom 21 is located at the College of William and Mary in Williamsburg, Virginia.  They have conducted two mock international trials using the latest technology in all areas, including speech recognition.  In the first, in 2005, two judges, one sitting on a bench in Williamsburg and the other in Monterrey, Mexico, simultaneously heard testimony and reviewed evidence.  A realtime speech recognition court reporter provided streaming text, synchronized with audio and video, that was then webcast around the world.  One year later, in 2006, another mock trial experimented with technology intended to assist disabled participants during the trial, including judges with failing eyesight and counsel with mobility limitations.

     Also, in Columbia, South Carolina, two hearing-impaired children at Mt. Hebron United Methodist Church can participate in church services along with the rest of the congregation, thanks to a court reporter also trained as a voice writer who is able to provide Communication Access Realtime Translation, or CART, of everything that is said.  The children view this translation during the services, which allows them to understand what is being said.

     For years, stenotype court reporters capable of realtime have used shorthand machines to provide realtime court reporting, CART, and captioning services.  Many individuals are surprised to learn, however, that realtime speech recognition technology is now being utilized by voice writers around the world to provide these same services.  For the trained voice writer, these programs incorporate special tools for displaying dictated text in any font size and against any background color the voice writer selects.  These programs also facilitate connections from the voice writing CAT software to other monitors, projectors, encoders, and servers for display of text and/or video and audio on remote monitors and mobile devices, all in real time.  For the hearing impaired, this means they can view video of the proceedings while reading the translated streaming text.  In the most basic CART setup, the provider, with a computer capable of realtime translation, is in a session with one or more hearing-impaired clients.  During the session, the CART provider dictates everything that happens, and the computer screen is visible so that the client can read the realtime text.  This differs from traditional realtime reporting in that the CART provider is not there to create a verbatim record.  Instead, the reporter helps clients understand the event, which can mean paraphrasing, interpretation, two-way communication, and even realtime translation.  In more advanced setups, the CART provider, either in the room with several hearing-impaired individuals or in another location, takes a video feed from the computer and feeds it into one or more monitors or mobile devices, either in the same room or remotely.
Captioning can be thought of as a subset of CART in that the realtime text is transmitted directly into an encoding device, which combines a video signal with the realtime text to provide closed captioning.  Captioning can be done either on site or from a remote location over telephone lines using modems.  This assistive technology can also be configured so that the realtime translation is streamed via the Internet and synchronized with audio and video, providing webcasting and podcasting capabilities directly from the voice writer's PC.  Additionally, for visually impaired clients, certain captioning techniques provide an audio track of events as well.
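To make the encoding step a little more concrete, here is a minimal sketch, purely illustrative, of how a stream of dictated words might be broken into fixed-width caption rows before being handed to an encoder.  The 32-character row width echoes the row length of CEA-608-style broadcast captions; the function name and shape are my own assumptions, not any vendor's API.

```python
def caption_rows(words, width=32):
    """Group a stream of words into display rows no longer than `width`."""
    rows, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        if len(candidate) <= width:
            current = candidate          # word still fits on this row
        else:
            if current:
                rows.append(current)     # emit the finished row
            current = word               # start a new row with this word
    if current:
        rows.append(current)
    return rows

dictated = ("For the hearing impaired this means they can view video "
            "while reading the translated streaming text").split()
for row in caption_rows(dictated):
    print(row)
```

A real encoder also handles timing, screen positioning, and control codes; this shows only the line-breaking idea.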

     With a growing international market, we see more and more clients wanting the combination of video, audio, and text in realtime.  This seems simple, but it takes a lot of technology to do it right.  The ideal situation for both the voice writer and the client is to digitally record multiple channels of the proceedings via multi-track digital recording software, including the voice of the speech recognition professional as well as multiple recordings of the room.  These can then all be streamed from the voice writer's PC over the Internet to the client participating in the event.  The integration of audio, video, and text gives the viewer the best experience possible.  We call this "Rich Media".  This technology allows viewers to feel much more involved in the event, as a participant instead of just a spectator.  From a technology standpoint, the equipment needed to make all of this possible varies depending on the setup and can range from a single laptop with CAT software and an Internet connection to several streaming media servers serving thousands of webcast clients around the world.  But the keystone technology is the engine that drives it all: highly accurate, speaker-dependent dictation programs such as Dragon NaturallySpeaking Professional, IBM ViaVoice Pro, Microsoft's speech engine, and similar programs.  These programs are reaching amazing milestones, and when used by a well-trained voice writer with the right tools, they can produce meaningful results.  As I presented this year at the SpeechTEK conference in San Francisco, voice writers are trained to handle fast talkers, who can reach speeds from 180 to 350 words per minute.  And they can speak at very low volumes so that they cannot be heard by anyone close by.  There are also many acoustical challenges that come with using a silencer mask that do not exist when using an open microphone.
These talents, along with the cutting-edge features of most CAT software, allow them to utilize speech recognition technology to provide a verbatim record of proceedings while simultaneously delivering synchronized audio, video, and text in realtime to anyone, anywhere in the world, for any event.
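The synchronization idea behind Rich Media can be sketched very simply: each piece of recognized text carries a timestamp on the shared media timeline, and a player shows whatever was spoken recently at the current playback position.  The field names and the viewing window below are assumptions made for the sketch, not any CAT product's actual format.

```python
from dataclasses import dataclass

@dataclass
class CaptionEvent:
    start_ms: int   # position on the shared audio/video timeline
    text: str       # the recognized text spoken at that moment

def events_at(events, position_ms, window_ms=4000):
    """Return the caption text a viewer should see at a given playback
    position: everything that started within the last `window_ms`."""
    return [e.text for e in events
            if position_ms - window_ms <= e.start_ms <= position_ms]

stream = [
    CaptionEvent(0,    "May it please the court,"),
    CaptionEvent(2500, "counsel for the defense"),
    CaptionEvent(6000, "calls its first witness."),
]
print(events_at(stream, 3000))   # captions visible 3 seconds into playback
```

Because text, audio, and video all reference the same timeline, a remote viewer can pause, rewind, and still see the words aligned with the picture.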

     Voice writers are breaking down barriers for people with disabilities or language limitations, who may simply tune into monitors or mobile devices to connect to the proceedings.  As we all know from the dreams of science fiction, speech recognition has endless possibilities!  And there is no dream that cannot be fulfilled with this kind of technology.  We are already helping people around the world to do things that they never thought they could do.  Today, we are utilizing text-to-speech and speech-to-text capabilities with amazing accuracy right out of the box and with virtually no user training.  By unlocking this technology, we enable the possibility that soon all people around the world will have equal access to all events and proceedings that affect their lives.  And with the younger generation finding PCs and mobile devices less intimidating, the spread of this technology, and the demand for it, will continue to increase.

     So how long will it take us to do this?  Well, let's look back at our short history of speech recognition.  The year 2007 is an important milestone: it has been ten years since large vocabulary speech recognition was first released by Dragon, in 1997.  The first few years were very slow and painful.  But the technology continued to improve at an extraordinary speed, and in less than ten years we were doing closed captioning by voice.  We are now past the point where the technology is being used by only a few.  We are starting to see its adoption by the general public.  As I said last year in New York, we have reached the tipping point where the technology will start to move from the "few" to the "many".  To illustrate this point, you only need to look at Microsoft: on most new computers in the world, in the bottom right-hand corner, there is something called the "Language Bar", which gives users access to the Microsoft speech recognition engine, among other things.  With this one move, Microsoft has made a huge impact on bringing this technology to the mainstream; speech recognition will go from a few dedicated users to the many who will not even care that they are using speech recognition.  They will only care that it works and helps them do their jobs better.  And not only in command and control functions: imagine a person using their cell phone to check their e-mail.  Text-to-speech will be used to read that e-mail to them.  Then they say "reply" and use speech-to-text to dictate a response.  This is the kind of usage that will put this technology into the hands of the masses.

     I have two other examples of how this technology will deliver integrated video, audio, and text.  First, we all know the video web site "YouTube" and the impact it has had on the world.  Think of that with realtime streaming text translated into ten different languages simultaneously.  This means that state-controlled television is a thing of the past.  People around the world are using video and audio streaming to change the face of media; add streaming text with realtime translation, and the world just got a lot smaller... again.

     That leads me to my second idea, and many academics may not like this one.  It is the idea of Instant Message, or "IM", symbols and short words being considered a language.  It is not grammatically correct.  It is not even English, but we all understand it.  Many of our voice writers use shortcuts and brief forms to represent longer forms of words.  My point here is that when problems arise in realtime multiple-language translation, you have to make a decision: to drop or to paraphrase.  IM has already dealt with this problem, with IM shorthand.  Who in the world under the age of 21 does not know what the smiley face represents?  The audience wants to understand the point of the speaker; they do not care how that happens.  Grammar is becoming less and less important in international conversation.  For example, in French, "Maison Blanche" is literally translated as "House White".  It does not matter to the audience in the realtime event that they see the words "house white", because they get the point.  In much the same way, IM shorthand as a language is going to allow for more realtime access across multiple translated languages.
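The brief-form idea mentioned above is easy to sketch: a small glossary maps short dictated forms to their long forms before the text is displayed.  The entries and the function below are illustrative examples I have made up for this sketch, not a real voice writer's dictionary or any CAT software's API.

```python
# Hypothetical brief-form glossary: short dictated form -> full form.
BRIEF_FORMS = {
    "pls": "please",
    "btw": "by the way",
    "imo": "in my opinion",
}

def expand(text):
    """Replace any known brief form with its long form, word by word,
    leaving unrecognized words untouched."""
    return " ".join(BRIEF_FORMS.get(w.lower(), w) for w in text.split())

print(expand("pls respond btw"))  # -> please respond by the way
```

A production glossary would be far larger and would handle punctuation and capitalization, but the principle is the same: a compact form in, the full form out, with no loss of meaning for the reader.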

     In conclusion, I want to point out that our children are living in the YouTube and IM age, and this technology is going to become completely pervasive.  We, as professionals, must live in that age as well.  It is time to adapt and move ahead.

     Thank you for your time.