The Voice Interaction function lets you operate the TV with your voice. It lets you switch from a broadcast screen to my Home Screen, or search the web, without complicated operations. Here, we discuss this next-generation user interface (UI) with the engineers who developed it.
To begin with, what is Voice Interaction?
Yokohagi: As its name suggests, it's a function that lets you operate the TV with your voice. The user simply says a command word or a search word into the special Touch Pad Controller or a smartphone.* It's a very simple and convenient operating style.
* "TV Remote 2" must be installed for smartphone use.
- Content Search
- Text Input
- Moving Around the Web and Verbally Reading Out Text
- Conventional Operations Are Also Possible (Direct Operation of Volume Up/Down, Channel Change, Etc.)
What got you started in its development?
Yokohagi: It began with our Voice Guidance function for the Japanese market in 2010. The function would verbally tell the viewer things like the channel that was selected and the name of the TV program that was being viewed. It was developed to make using TV more convenient for visually impaired users.
Starting with the new 2013 models, we expanded this function to cover more than 20 languages. Around this time, we had meetings to consider our next-generation user interface. The convenience of Voice Interaction was brought up at these meetings, and it was decided that, rather than simply offering Voice Guidance, we would propose an entirely new style of using TVs where ordinary users would also operate the TV by voice.
As we proceeded to make the concept more concrete, we were faced with questions like, “What can users do with Voice Interaction?” and “What exactly should the user say?” Since we were drastically changing the concepts of both the TV and the remote control by using command words to do everything that was previously done by pressing remote control buttons, we found ourselves in a new and unknown world.
Some major improvements were also made to the remote control, right?
Sekito: Yes, a variety of enhancements had already been made to the conventional remote control to make it easier to hold and operate. However, some models had up to 50 buttons, so we received comments that the remote control was difficult to operate, and some users said they simply couldn't make effective use of it.
In 2012, we developed the Touch Pad Controller, which dramatically reduced the number of buttons, and marketed it the same year.
This new “buttonless” remote control was the complete opposite of a conventional type. The user rotated a touch pad in the center with his or her thumb, providing intuitive operation much like the mouse on a PC.
In 2012, the new remote control was marketed with the purpose of making operation easier. Then, in the 2013 models, a microphone was built into it so the user could operate the TV by voice. Needless to say, conventional remote control operations still had to be easy to perform, and at the same time the remote control needed to work as a mic. This put two functions into a single unit, so we worked hard to make sure that users wouldn't be confused about whether they were using it as a mic or as a remote, and that the device would feel natural and easy for anyone to operate.
Koganei: We started by studying our competitors' existing voice control functions for changing the volume and channel by voice command. We then considered whether those functions were exactly what users wanted to do by voice.
After all, voice operation for TVs was still a new concept and users did not yet have any idea about it. If we suddenly said, “Here you go, talk to your TV,” or “Operate the TV with your voice,” the user wouldn't even know what to say. That would be confusing, and we certainly didn't want to make users confused like that.
Konuma: We then started thinking about situations where the user would naturally want to use the mic. As a result, we all agreed that the mic would be most convenient when users search on the web.
What are the features of the voice recognition engine?
Koganei: To put it simply, our engine has two main brains.
The strong point of our Voice function is that those two main brains work well together.
For example, there are fixed expressions for basic TV operation, like “Volume Up.” We call these “fixed words” and they are stored in one brain. And there are words that the user can freely use. They are convenient for cases like searching. We call them “free words” and those words are stored in another brain. The main feature of our engine is that it's able to start up both of these -- one part with a dictionary of fixed words, and the other part with a cloud service recognizing free words. In other words, it's a hybrid engine. This lets it recognize the words that are spoken by the user from a massive data bank.
Ordinarily, when the user wants to perform a TV operation like “Volume Up,” only the fixed word dictionary engine built into the TV is started up. Similarly, when the user wants to input a search word, only the free word recognition engine is started up. However, this means that a key word from a TV program being watched cannot be retrieved from the free word recognition engine in a single speaking step. That isn't very convenient.
A) “Web Search” -> [Please say keyword] -> “Panasonic”
B) “Search for Panasonic by web.”
With our engine, the fixed word dictionary engine and free word recognition engine are both driven, so it recognizes either a TV operating command or a free word searching phrase. We took this kind of fine performance into consideration throughout our development to make the function as convenient as possible for the user.
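The one-step behavior in example B can be illustrated with a minimal sketch. It is not Panasonic's engine: it works on text that has already been transcribed, and the dictionary entries, the command names, and the `interpret` function are all hypothetical. It only shows the idea of checking a fixed-word dictionary and a free-word search phrase in the same pass.

```python
import re
from typing import Optional, Tuple

# Hypothetical fixed-word dictionary: spoken phrase -> TV command name.
FIXED_WORDS = {
    "volume up": "VOLUME_UP",
    "volume down": "VOLUME_DOWN",
    "my home screen": "SHOW_HOME_SCREEN",
}

# Hypothetical pattern for a one-step search such as "Search for Panasonic by web."
SEARCH_PATTERN = re.compile(r"^search for (?P<query>.+?)(?: by web)?$")


def normalize(utterance: str) -> str:
    """Lowercase the text and drop surrounding whitespace and final punctuation."""
    return utterance.strip().rstrip(".!?").strip().lower()


def interpret(utterance: str) -> Tuple[str, Optional[str]]:
    """Try the fixed-word dictionary first, then the free-word search phrase."""
    text = normalize(utterance)
    if text in FIXED_WORDS:
        # Fixed word: a TV operation with no free-word payload.
        return (FIXED_WORDS[text], None)
    match = SEARCH_PATTERN.match(text)
    if match:
        # Command and keyword arrive together in a single utterance.
        return ("WEB_SEARCH", match.group("query"))
    # Anything else is treated as a free word, e.g. a bare search term.
    return ("FREE_WORD", text)


print(interpret("Volume Up"))
print(interpret("Search for Panasonic by web."))
print(interpret("flower"))
```

In this sketch a bare word such as “flower” falls through to the free-word branch, which is where a follow-up question like “photos or videos?” could be asked.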
Konuma: For example, when you want to look at “flower photos,” it's inconvenient to always have to say “search” first. We wanted to create a system that would more intuitively recognize what you're looking for by simply saying the word. As a result, all you have to do with our system is to say the word “flower” into the mic. After saying “flower,” the TV responds by asking additional conditions, such as, “Do you want photos or videos?”
And you provided that kind of precision for a variety of languages?
Yokohagi: Yes, we worked especially hard on language development. We also had to verify compatibility with languages for more than 20 countries. That was a gigantic task. In addition to carefully examining the “reading out” quality, we had to verify as closely as possible which fixed words and free words the user is likely to choose when speaking. For example, “my Home Screen” is a fixed word, but it is also a globally used command. Because of this, the “my Home Screen” command is set to provide a correct response in every language. In addition to this kind of common setting, we have provided a very fine and precise response to the wording in each language.
Konuma: The recognition rate tends to drop in some cases, depending on the language. When issues occurred, we worked diligently to investigate and find the reason, whether it was in the mic, or in the TV's built-in engine dictionary, or elsewhere.
In particular, what kind of things caused the recognition rate to drop?
Konuma: Well, this is just an example, but we had difficulty in the development stage with words like numbers, where a word can be read in several ways. For example, in English, we can pronounce the number “223” as “two-twenty-three” or “two hundred and twenty three.” The same kind of thing exists in every language. The problem is in how you pronounce the word, and how it's recognized. In the initial prototype stage, Taiwanese numbers couldn't be recognized for this reason. In French, the numbering system is very complicated. “70” is pronounced “60+10,” and “80” is pronounced “4x20.” It was extremely difficult to raise the precision of the dictionary for languages that use these kinds of expressions.
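The “223” example can be made concrete with a small sketch that lists several spoken forms of a three-digit English number, which is the kind of variation a recognition dictionary would need to cover. The function names and the set of readings are assumptions made for illustration; the actual dictionary contents are not described in the interview.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digit(n: int) -> str:
    """Spell out a number from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def readings(n: int) -> list:
    """Return a few plausible spoken forms of a number from 100 to 999."""
    if not 100 <= n <= 999:
        raise ValueError("this sketch only handles 100-999")
    hundreds, rest = divmod(n, 100)
    if rest == 0:
        return [f"{ONES[hundreds]} hundred"]
    forms = [
        f"{ONES[hundreds]} hundred and {two_digit(rest)}",
        f"{ONES[hundreds]} hundred {two_digit(rest)}",
    ]
    if rest >= 10:
        # Short form, as in "two twenty-three".
        forms.append(f"{ONES[hundreds]} {two_digit(rest)}")
    return forms


print(readings(223))
# ['two hundred and twenty-three', 'two hundred twenty-three', 'two twenty-three']
```

A language such as French would need its own rules, since “70” is built as 60+10 and “80” as 4x20, so a table like this cannot simply be translated word for word.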
Since you didn't have anyone who could understand the languages of 20 countries, how did you verify them?
Yokohagi: It's a little tricky, but we would display the control command for each language on a PC and have the PC read it out. This was input into a development-stage TV for verification. Finally, we asked local Panasonic staff in each country to verify the results, and received feedback from them.
Koganei: This involved a huge amount of data, and required an incredible amount of work, but it cannot be avoided if you're going to use voice recognition as a user interface. So we repeated fine adjustments over and over again to raise the precision.
How did you handle the development of the remote control since it's such an essential tool for voice recognition?
Sekito: As I mentioned earlier, we developed a buttonless Touch Pad Controller in 2012. In the present version, we added a voice control function to it. As a result, we had one team of engineers in charge of hardware to produce the remote control, and another team of engineers in charge of software to make the TV work smoothly by voice. The two teams collaborated to achieve a single goal.
The development of Voice Interaction was being done for the first time ever. Wasn't it a bit overwhelming?
Imai: Yes, it was. It was teamwork that allowed us to overcome it. Conventionally, the hardware team and software team work separately. The remote control unit would be developed by a hardware division, and the TV's GUI would be developed by a software division. This time, since the remote control was such a key part, and the aim was to increase the user's convenience and improve the overall TV experience, members were selected from each of the divisions to form a single team. This made it possible to combine technologies from both divisions while they proceeded to build a prototype.
The difficult point was target setting. Unlike something like picture quality engineering, where the goal is comparatively easy to see, Voice Interaction and Mic Integration are entirely new areas. There weren't any clear criteria for evaluating them during their development. This was extremely difficult, because it was hard to determine exactly when the development was
Wasn't it difficult to add the microphone in its originally minimal shape, and keep such a high level of quality?
Sekito: The evaluation point for the mic function was the voice recognition rate. This was evaluated using three performance parameters:
- Wireless performance (interruption-free communication)
- Mic performance (sound collecting function)
- Recognition engine performance (TV-side capability to understand what the sound means)
First, hardware design was concentrated on building solid performance for the wireless function and the mic, and software design aimed to fine-tune the recognition engine in the TV. The key to success was to raise the performance as much as possible given the limited space that we had to work with.
But raising performance wasn't the only aim. It was also important to strike a balance. For example, when you increase the sensitivity of the mic, it also picks up sounds that shouldn't be included, and this leads to operating errors. It's important to only pick up the necessary sounds, for example, inside a noisy store or when a group of people is making noise while watching an exciting movie. Conversely, if the sound-collection range is narrowed too much, the user will have to speak very loudly. So we had to aim for an ideal level with a fine balance.
Imai: While refining a few hundred parameters through trial and error, we found some of the optimal parameters for the TV side. After repeatedly verifying a number of these, we applied the parameters that had the highest recognition rate.
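The selection step described here, trying candidate parameter sets and keeping the one with the highest recognition rate, can be sketched as a simple grid search. Everything in the example is a stand-in: the toy recognizer, the single `threshold` parameter, and the sample data are hypothetical and only mimic the sensitivity trade-off mentioned above.

```python
from itertools import product


def recognition_rate(recognize, params, samples):
    """Fraction of samples for which the recognizer returns the expected result."""
    correct = sum(
        1 for audio, expected in samples if recognize(audio, **params) == expected
    )
    return correct / len(samples)


def best_parameters(recognize, grid, samples):
    """Evaluate every combination in the grid and return the best-scoring one."""
    names = list(grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]
    return max(candidates, key=lambda p: recognition_rate(recognize, p, samples))


def toy_recognize(audio, threshold):
    """Toy recognizer: accept the sound only if it is loud enough."""
    loudness, text = audio
    return text if loudness >= threshold else None


# Each sample is ((loudness, heard text), expected result); None means "ignore".
samples = [
    ((0.2, "tv noise"), None),            # background noise should be ignored
    ((0.5, "volume up"), "volume up"),    # quiet speaker
    ((0.6, "flower"), "flower"),
    ((0.9, "web search"), "web search"),  # loud speaker
]
grid = {"threshold": [0.1, 0.4, 0.8]}

print(best_parameters(toy_recognize, grid, samples))
# {'threshold': 0.4}
```

With the lowest threshold the noise is wrongly accepted, and with the highest the quieter speakers are missed, so the middle value scores best on this toy data.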
We were confident about building the remote control, but it was our first attempt to integrate the mic in the remote control, so it was an entirely new challenge. With no precedent to follow, it was difficult to decide which specifications to use as the baseline for judging verification results. We also took a variety of approaches to make the shape easier to use.
In this way, we worked on making improvements in every area, including the shape, and tested a prototype internally about 3-4 months before the product release. This was also the first time that we had guests who had come for business discussions use this new function. We received mainly favorable comments, such as, “This is great!” and “Very convenient.” Encouraged by this, we were highly confident that it would be a success.
However, while the “ability to operate by voice” was highly rated, there were still some areas where precision had to be boosted. With all members working together, we devoted our utmost efforts to completing the design right up until production began.
I understand it's convenient not only when searching, but also when viewing the web browser on the TV, right?
Nakaoka: Yes, that's right. One of the main advantages is that information on the web browser can be displayed on the large screen and shared by everybody.
We think it's very convenient as a tool for a group of people to search for something and then use the on-screen information to make decisions. For example, maps can be searched, and everyone can view them to decide on where they're going to go, or the menu for a meal can be decided while looking at a variety of recipes.
Using a conventional remote control, it was necessary to enter each key word, one at a time, which was quite awkward. The ability to enter search words by voice makes the remote control as convenient as a voice-operated smartphone. Our TV lets you search for a word by simply saying the word, then directly following the search results. Searching by voice is definitely much faster than entering text character-by-character. The secret to this speed is that the web browser is started up in the background and placed on standby. As soon as the trigger word is voiced, the web browser is displayed in the foreground. This greatly shortens the perceived time for the search result to appear. Because it allows you to smartly search by voice in front of everyone, it's also very cool. (Laughs)
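The standby technique mentioned here, starting the browser in the background so that it only has to be brought to the foreground when the trigger word is spoken, can be sketched as follows. This is an illustration under assumptions: `StandbyBrowser` and `slow_browser_startup` are hypothetical names, and a thread with a short sleep stands in for the real browser start-up.

```python
import threading
import time


class StandbyBrowser:
    """Starts a slow component ahead of time and keeps it on standby."""

    def __init__(self, loader):
        self._ready = threading.Event()
        self._engine = None
        # Begin loading immediately, before any voice command is given.
        self._thread = threading.Thread(
            target=self._load, args=(loader,), daemon=True
        )
        self._thread.start()

    def _load(self, loader):
        self._engine = loader()
        self._ready.set()

    def show(self, query):
        """Called when the trigger word is heard; waits only if still loading."""
        self._ready.wait()
        return self._engine(query)


def slow_browser_startup():
    """Stand-in for browser start-up; returns a function that 'searches'."""
    time.sleep(0.1)
    return lambda query: f"results for {query}"


browser = StandbyBrowser(slow_browser_startup)  # started in the background
# ... the user keeps watching TV, then says a search word ...
print(browser.show("flower"))
# results for flower
```

The start-up cost is still paid, but it is paid before the user speaks, which is why the perceived waiting time is shorter.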
The ability to use the verbal readout function with a web browser is another major step forward. It instantly detects the structure of various web pages, and accurately determines the procedure that will make it easiest to understand the content from the location selected by the pointer. Even when reading out sections with many sentences, such as news articles and blogs, you get stress-free listening.
How has the market response been since the launch?
Imai: The response has been excellent, with people saying that the precision of Panasonic's voice recognition is extremely good compared to the competitors. In particular, customers in Japan have complimented us, saying things like, "Voice Interaction is great. I want to try using it for more operations."
Voice Guidance and Its Evolution
Entered the present Automotive & Industrial Systems Group in 1999.
Transferred to the TV Development Section in 2006.
Developed interactive TV middleware, such as Data Broadcasts, for the EU market.
I want you to know...
There's a special function that's only on the WT600 and WT60 where decorative LED lighting flows across the bottom of the cabinet while the user is entering voice commands. It's a beautiful effect. I think it's good to see the TV providing some visual feedback to the user. It also responds like this at times other than during voice recognition, for example, when a Skype call comes in or when the timer alarm rings. Try it out.
Entered the present TV Development Section of Panasonic in 2001.
Developed TV application software, such as Data Broadcast Browsers and TV Guides, for the Japanese market from 2002 to 2011.
I want you to know...
The system can actually recognize several different words for the same voice command when you're searching. The words "Search for" are listed in the Voice Interaction Help Guide, but the system will also work with the words "Look for" or "Find." This is intended to cope with variations in expression. Native speakers of each language supported us by providing several expressions, and they were all registered.
So, even when you mistakenly use a different search command phrase, the system may understand and perform as you want. Try out a few different command words to see which ones work. (Laughs)
Entered the Panasonic Corporate R&D Division in 1993. Researched and developed Speech Recognition and Sound Processing technologies from 1993 to 2012.
Moved to the AVC Company in 2012.
I want you to know...
The TV command words are also intended to cope with variations in expression. (There are some limits though.) Naturally, the ones listed in the Help Guide all work, but the TV is designed to recognize some alternate pronunciations when people speak unclearly.
If possible, we'd like to be able to cope with variations in expressions for all of the world's languages, but this is a future goal. We'd really like to provide a full range of speaking variations. It's our dream to develop a TV that can handle languages completely and naturally.
Browser System Design / UI Design
Entered the Corporate Production Engineering Division of Panasonic in 1994.
Transferred to the Corporate R&D Division in 2000 where he researched browsers, data broadcasting and DRM for IP broadcasting.
Transferred to the TV Development Section in 2007.
I want you to know...
I think the most convenient function is being able to directly search the web for something by voice command from the home screen. There are a lot of places where people can use this. You don't need to enter text on a keyboard like you do with a PC, and the browser starts up quickly to show the search result.
You can also operate the browser itself with voice commands. For example, if the browser text is too small to read comfortably, you can simply say, "Zoom in."
Remote Control, Mic Integration
Entered the present TV Division of Panasonic in 2002. Managed CATV STB development for North America until 2007.
After that, he was in charge of developing LSIs for next-generation broadcasts until 2010.
I want you to know...
As members of the remote control team, we aimed to design a Touch Pad Controller that would allow the user to comfortably operate the TV.
Remote control units with a large number of buttons are still the mainstream, but I think that simple types like our Touch Pad Controller will increase in the future. We hope to be able to make the remote control more and more convenient.
Remote Control, Mic Integration
Entered the present TV Development Section of Panasonic in 1998.
Currently develops hardware for the Touch Pad Controller.
I want you to know...
The Touch Pad Controller was designed by thoroughly studying both hardware and software aspects. Until now, remote control units had different languages and buttons for each country and region, meaning that we produced more than 50 types each year. The Touch Pad Controller is global, so there is only one type. I hope we'll be able to continue coming up with new designs that people all over the world will enjoy.