At Nuance Communications’ Healthcare Partner Event in Berlin, web content editor Ian Bolland listened to Dr Nils Lenke’s presentation ‘The I in AI’, and spoke to Nuance’s senior director of innovation management about how the artificial intelligence behind its Dragon voice recognition system is built.
For there to be artificial intelligence there must first be human intelligence, which is probably one of the reasons why Dr Lenke sees AI as something that will assist humans rather than a sign that robots will take over the world.
In his presentation to delegates, Dr Lenke demonstrated the thinking behind the artificial intelligence that powers the kind of speech recognition the company pioneers.
A lot of his talk focused on ‘training data’, and on how aspects of human intelligence are rebuilt to make artificial intelligence as effective as possible, in the same way as with sound and photo recognition. Succinctly put, you can do mathematics with words.
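To make that phrase concrete: modern speech and language systems represent words as vectors of numbers, which really can be added and subtracted. The values below are hand-picked toy numbers, not real learned embeddings, but the arithmetic is the same.

```python
import numpy as np

# Toy 3-dimensional "word vectors". Real embeddings have hundreds of
# dimensions and are learned from data, but the arithmetic is identical.
king  = np.array([0.9, 0.8, 0.1])
man   = np.array([0.9, 0.1, 0.1])
woman = np.array([0.1, 0.1, 0.9])
queen = np.array([0.1, 0.8, 0.9])

result = king - man + woman          # "doing mathematics with words"
print(np.allclose(result, queen))    # True for these hand-picked vectors
```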
“We label speech data. So, we collect more than 20,000 hours of speech and it’s manually labelled and then we train the neural networks.
“We need a memory. You have input and output all in one instance, but language and speech are embedded in time. When I speak, time progresses, and we need to model that by having a kind of memory. What we do is take some of the output and feed it back into the network for the next processing step.
“Intelligence is not just doing complicated tasks, it starts with things like listening, hearing, and tuning into somebody. We need to rebuild all of those.”
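As a rough sketch of the two ideas in those quotes, supervised training on manually labelled speech and a network with a built-in memory that feeds state back in at each time step, the PyTorch snippet below trains a tiny recurrent model on stand-in data. The feature, label and layer sizes are illustrative assumptions, not Nuance’s configuration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 40 acoustic features per audio frame, 30 phone labels.
N_FEATURES, N_LABELS, HIDDEN = 40, 30, 128

class RecurrentAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.RNN feeds its hidden state back into the next time step,
        # the "kind of memory" Dr Lenke describes.
        self.rnn = nn.RNN(N_FEATURES, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, N_LABELS)

    def forward(self, frames):                # frames: (batch, time, features)
        hidden_states, _ = self.rnn(frames)   # one hidden state per frame
        return self.out(hidden_states)        # one label score per frame

model = RecurrentAcousticModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-ins for the manually labelled corpus: 8 utterances of
# 100 frames each, with one label per frame.
frames = torch.randn(8, 100, N_FEATURES)
labels = torch.randint(0, N_LABELS, (8, 100))

for step in range(5):                         # real training runs far longer
    loss = loss_fn(model(frames).reshape(-1, N_LABELS), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```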
During his presentation, Dr Lenke explained that the software uses the same sequence modelling you would find in Google Translate. To a lay person that statement can raise a few eyebrows, as machine-translated output does not always make sense or have the desired effect.
In Nuance’s case, the algorithms that shift between French and English are not there to translate, but to condense a doctor’s long dictated text into a brief summary of key facts.
Dr Lenke added: “We have an encoder that takes the English sentence and transforms it into a mathematical representation. Then you have the decoder, which takes the representation and produces from there a French sentence.
“You could use Google Translate to translate a poem by Shakespeare, a mathematical text or a report from the tax office. Google Translate tries to deal with it all, but what we do is constrained to a specific domain. We know more about what medical reports look like, so we can tune the system better to cope with that type of language.
“But then if you were to use that system and put a poem of Shakespeare in, it wouldn’t work. We train it to read medical texts, maybe even for specialities like radiology, and it can deal with those texts very well.”
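The encoder-decoder pattern he describes can be sketched in a few lines. The snippet below uses GRU layers and toy vocabulary and layer sizes as stand-ins; it shows the shape of the idea, not Nuance’s actual architecture, with the encoder compressing the input into a mathematical representation and the decoder unrolling an output sequence from it.

```python
import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN = 1000, 64, 128   # hypothetical sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HIDDEN, batch_first=True)

    def forward(self, tokens):
        # Compress the whole input sequence into one final hidden state:
        # the "mathematical representation" of the sentence or report.
        _, final_state = self.gru(self.embed(tokens))
        return final_state

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens, state):
        # Unroll the output sequence (a translation, or a summary of
        # key facts) from the encoder's representation.
        hidden, _ = self.gru(self.embed(tokens), state)
        return self.out(hidden)

encoder, decoder = Encoder(), Decoder()
report = torch.randint(0, VOCAB, (1, 200))         # a 200-token dictated report
summary_so_far = torch.randint(0, VOCAB, (1, 20))  # tokens generated so far
next_token_scores = decoder(summary_so_far, encoder(report))
```

Constraining such a system to a domain, as Dr Lenke describes, amounts to training the same architecture only on medical, or even radiology-specific, text.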
Then there’s the aspect of the software tailoring itself to, and recognising, individual voices. Admittedly, variation between speakers can be challenging for voice biometrics. So how is it addressed? Training.
“We all speak differently so it depends on your accent, your physical condition, like the length of the vocal tract, how large your lungs are, etc.
“For speech recognition that’s really a nuisance, and we’ve tried to eliminate it by training a lot of data so we can cope with all of the different variants.”
When both a doctor and a patient are present, the software separates the audio into two channels, using voice biometrics to detect who is speaking, virtually making ‘two Dragons’. With the conversation split into two streams, the recognition can be adapted to each speaker.
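A simplified sketch of that routing step is below. It assumes each audio segment has already been mapped to a fixed-length speaker embedding by some voice-biometric model, and that the doctor has an enrolled voiceprint; the embedding size, the threshold and the function names are hypothetical.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two speaker embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_streams(segments, embeddings, doctor_voiceprint, threshold=0.7):
    """Route each segment to the doctor's or the patient's stream,
    the 'two Dragons' idea: each stream can then be handled by a
    recogniser adapted to that speaker."""
    doctor_stream, patient_stream = [], []
    for segment, emb in zip(segments, embeddings):
        if cosine(emb, doctor_voiceprint) > threshold:
            doctor_stream.append(segment)
        else:
            patient_stream.append(segment)
    return doctor_stream, patient_stream

# Stand-in data: 6 segments with random 128-dimensional embeddings,
# half of them close to the doctor's voiceprint.
rng = np.random.default_rng(0)
doctor_voiceprint = rng.standard_normal(128)
embeddings = [doctor_voiceprint + 0.1 * rng.standard_normal(128) if i % 2 == 0
              else rng.standard_normal(128) for i in range(6)]
segments = [f"segment_{i}" for i in range(6)]
doctor_stream, patient_stream = split_streams(segments, embeddings, doctor_voiceprint)
```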
All of the formulae, thinking and building, including deep neural networks and deep learning, come under the umbrella of artificial intelligence. Dr Lenke says the term’s meaning has moved on in the last 10 years, and feels it will continue to evolve. He compared the evolution of AI to that of building a car.
“The method we’re using is quite different from 10 years ago.
“We understand better how the neural networks work. We used them but no-one knew what was going on inside them. Today people understand better why they work, so we can come up with better versions to make speech recognition more precise.
“They’re a model, a mathematical model, and there are variants.
“If you understand better how they work you can say: ‘For this task I only need five layers, and adding more layers will not help,’ or ‘I need to structure it this way or that way.’”
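That kind of design decision, choosing a network’s depth and structure for the task rather than guessing, is easy to picture in code. The sketch below is purely illustrative; the widths and sizes are arbitrary assumptions.

```python
import torch.nn as nn

def build_network(n_layers: int, width: int = 128,
                  n_in: int = 40, n_out: int = 30) -> nn.Sequential:
    """Stack a configurable number of layers; understanding the task
    tells you whether five suffice or more are needed."""
    layers = [nn.Linear(n_in, width), nn.ReLU()]
    for _ in range(n_layers - 2):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, n_out))
    return nn.Sequential(*layers)

# "For this task I only need five layers."
five_layer_model = build_network(n_layers=5)
```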