Tuesday, October 11, 2011

Feed the Borg, start talking to your phone

I've been promised that my computer would understand me since I was a kid. The first software I really remember testing for this "speech recognition" was VoiceType Dictation from IBM. It required a Pentium 90 MHz computer. What do I remember about it? My boss tried to show it to someone in the office and said, "This is an example of voice type dictation." The computer promptly placed on the screen the text "I'm going on a boy's pike vacation." Laughter ensued and an inside joke was born. Many other attempts followed, including our CEO, who heard that Dragon NaturallySpeaking combined with a digital recorder could replace his admin assistant and insisted I order one right away. As you can imagine, soon after he threw it back into my office, claimed it was useless, and another attempt at getting my computer to understand me was lost. With every try, the software makers promised that if we only had a fast enough processor it would get better. It never did.

Enter GOOG-411. GOOG-411 was a free service launched in 2007 that offered directory look-ups from your phone, eliminating the need for a costly information call to get a number. Why did Google do this? To gather voice samples. Data was the key, not just processing power. This was the holy grail of speech recognition. By gathering samples of people's voices and then having the user select the correct listing, Google collected endless labeled data points. Google then used that data to launch its own voice products, including Android Voice Actions, and promptly discontinued GOOG-411 three years later.
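For what it's worth, here's a minimal sketch in Python (with made-up names; nothing here is Google's actual code) of that implicit-labeling loop: the caller's recording plus the listing they confirm becomes a labeled training pair, with no human transcriber involved.

```python
# Hypothetical sketch of an implicit-labeling loop: every confirmed
# directory lookup yields an (audio, transcript) training pair.
from dataclasses import dataclass


@dataclass
class LabeledSample:
    audio_id: str      # reference to the stored voice recording
    transcript: str    # the listing the caller confirmed


def collect_sample(audio_id: str, candidate_listings: list,
                   chosen_index: int) -> LabeledSample:
    """The caller hears the top candidate listings and picks one;
    that choice labels the recording automatically."""
    return LabeledSample(audio_id=audio_id,
                         transcript=candidate_listings[chosen_index])


if __name__ == "__main__":
    # One simulated call: the recognizer proposed three listings,
    # the caller picked the second, and we keep the confirmed pair.
    sample = collect_sample(
        audio_id="call-0001.wav",
        candidate_listings=["Joe's Pizza", "Joe's Pizzeria", "Joey's Pizza"],
        chosen_index=1,
    )
    print(sample)
```

The clever part is that the labels come free: every completed call doubles as a check on the recognizer's own guesses.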

Talking into your smartphone via these apps has yielded a real leap in my computer's ability to finally understand me. Now we have Siri, Apple's new personal assistant. Siri promises not only to understand what I say but also to translate it into actions. It seems almost too good to be true. What's the weather today? How do I get home? What are the markets doing today? Not only that, but Siri will turn your words into emails and text messages.


With every word spoken into your Android phone or iPhone, you give the system one more data point to understand you better the next time. Over and over, all around the world, people are feeding in data points and making these systems better. The cloud plus mobile phones have delivered in a few years what couldn't be done in the previous decades. The improvement will only accelerate as millions of people with accents, lisps, and different pronunciations feed the system, making it better for each of us.


My question is how much I will need to think about what I can ask these systems. When will I be able to say, "Siri, what are my daughter's grades?", "How much money is in my account, less the bills I have left to pay this month?", or "What were our sales this month?" Being able to send a text message or get the weather is great. However, I'm looking for the friction to reach zero. For voice recognition to become mainstream, it has to be able to answer almost anything I can ask it; if I have to stop and consider what I can and can't say to my phone, I probably won't use it except when I'm driving. If, however, I can ask it anything and the barrier is low enough, then voice systems can really make my life easier and help me get things done. A real personal assistant. I'm sure our phones will teach us how to speak to them just as they learn what we are saying. My hope is that by feeding these systems more data and adding APIs, soon I won't even have to worry about whether my phone understands me.

2 comments:

Brad Davison said...

Another example of using data to guide the algorithm.

"Microsoft Research has just published a scientific paper (PDF) and a video showing how the Kinect body tracking algorithm works — it's almost as impressive as some of the uses the Kinect has been put to. This article summarizes how Kinect does it. Quoting: '... What the team did next was to train a type of classifier called a decision forest, i.e. a collection of decision trees. Each tree was trained on a set of features on depth images that were pre-labeled with the target body parts. That is, the decision trees were modified until they gave the correct classification for a particular body part across the test set of images. Training just three trees using 1 million test images took about a day using a 1000-core cluster.'"


http://games.slashdot.org/story/11/03/26/2014234/kinects-ai-breakthrough-explained


Also, "delete that"

Unknown said...

And then there's Netflix, which used less data to create better recommendations.