The phoneme thing about speech recognition

I have many Google Homes in my flat. I use them to turn on the lights, set timers, do unit conversions and play ocean sounds. I wouldn’t consider their speech recognition to be good.

My girlfriend, with her feminine voice, has an even harder time with them, often resorting to putting on a comedic deeper voice to get Google to recognise her commands.

I’ve looked into how speech recognition is done, and when I built my computer back in January, one of my goals for it was to train my own speech recognition model. I’ve written before about ideas I have for getting the computer to recognise language.

My understanding is that the most up-to-date techniques for speech recognition use a network trained with CTC (connectionist temporal classification) to map between the audio and the words. The training data is labelled speech: an MP3 file of someone saying a sentence, together with a text file or record of that sentence.

The audio file is then decomposed by frequency analysis into a spectrogram, an image of the frequency content over time, which is passed into the network. For each time interval, the network tries to predict which character of the sentence (or which part of one) was being said.
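As a rough illustration of the frequency analysis described above, here is a minimal sketch in Python using NumPy. The frame length, hop size, Hann window and synthetic test tone are all assumptions chosen for the example, not the settings of any particular recogniser.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Synthetic input (an assumption for the demo): one second of a 1000 Hz tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())  # strongest frequency bin on average
peak_hz = peak_bin * sample_rate / 256      # bin index -> frequency in Hz

print(spec.shape)  # (61, 129)
print(peak_hz)     # 1000.0
```

Each row of `spec` is what the network would see for one time interval. Real systems usually go a step further, for example by taking logarithms or regrouping the frequencies on a perceptual scale, which is left out here.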

The idea is that the model is trained to predict the characters in the sentence from the audio in the training set, and it works reasonably well. To improve it, a language model is then appended as a last stage, where a choice between words that sound similar is made based on their likelihood given the last few words predicted, a bit like autocomplete.
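To show how per-interval predictions turn into a string of characters, here is a minimal sketch of greedy (best-path) CTC decoding in Python. The alphabet, the `_` blank symbol and the probabilities are made up for illustration; a real network would output one probability row per audio frame over a much larger alphabet.

```python
from itertools import groupby

BLANK = "_"  # CTC's extra "no character" symbol (the name is an assumption)

def ctc_greedy_decode(frame_probs, alphabet):
    """Take the most likely symbol per frame, merge repeats, then drop blanks."""
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    collapsed = [symbol for symbol, _ in groupby(best)]  # "cc_aa_t" -> "c_a_t"
    return "".join(s for s in collapsed if s != BLANK)

alphabet = [BLANK, "c", "a", "t"]
# Hypothetical network output: one row of probabilities per time interval.
frame_probs = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.20, 0.70, 0.05, 0.05],  # c (same letter, still being said)
    [0.90, 0.04, 0.03, 0.03],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.05, 0.05, 0.80],  # t
]

decoded = ctc_greedy_decode(frame_probs, alphabet)
print(decoded)  # cat
```

The blank symbol is what lets the network emit a genuinely doubled letter: two identical letters separated by a blank survive the merge step, while back-to-back repeats collapse into one.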

I think that trying to get the network to predict the word (or at least the characters) from the audio is the wrong way to go about it. After all, as a dyslexic I know first-hand how hard it can be to predict the characters from the audio of a word. Yet I can understand someone speaking to me with far greater accuracy than the Google Home can.

Neural networks are good at finding complex functions between a domain and a range. If the mapping between those two sets is indeed a function, meaning that a single input has a single output, then the network will generally tend towards an understanding of that function.

I would argue that mapping audio to textual characters is not a function. There is no unique mapping between the phoneme “k” and a letter. Sure, the language model will be able to correct “kat” to “cat”: both “k” and “c” have a high probability from the audio, but only “c” has a high probability from the language model. Even so, you are training in a duplicate mapping.
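The ambiguity is easy to see with a handful of ordinary English words. The short Python sketch below tabulates which letters spell which sound; the word list is a small hand-picked sample, not a complete survey of English spelling.

```python
from collections import defaultdict

# (word, letters that spell the sound, phoneme those letters represent)
examples = [
    ("cat", "c", "k"),
    ("kit", "k", "k"),
    ("back", "ck", "k"),
    ("chorus", "ch", "k"),
    ("city", "c", "s"),
]

spellings_of = defaultdict(set)  # phoneme -> every spelling seen for it
sounds_of = defaultdict(set)     # spelling -> every phoneme seen for it
for word, letters, phoneme in examples:
    spellings_of[phoneme].add(letters)
    sounds_of[letters].add(phoneme)

print(sorted(spellings_of["k"]))  # ['c', 'ch', 'ck', 'k']
print(sorted(sounds_of["c"]))     # ['k', 's']
```

One sound has several spellings, and one spelling has several sounds, so neither direction is a single-valued mapping.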

I am sure that the network would perform better if it mapped to a set of symbols that were far less ambiguous than regular English spelling. These sound symbols could then be mapped quite simply to words. During training the network would not have to struggle with spelling; it would only have to get the phonemes correct, which I would argue has a smoother gradient profile.

I propose (and intend to create when I finally find the time) a model where the audio is mapped to a form of the International Phonetic Alphabet (IPA). This way the model extracts the sounds that humans make from the audio instead of trying to predict letters. Since the IPA is versatile enough to describe all human languages, the audio-to-symbol network would be language agnostic, which would allow it to be trained on sources in multiple languages. This could allow it to recognise languages for which we have little training data. So long as we have a record of the phonemes of the language and the words in the language, we would be able to append the phoneme-to-word step as a classical algorithm.
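A minimal sketch of what that classical phoneme-to-word step might look like is given below, in Python. The `LEXICON` table and the function name `phonemes_to_words` are hypothetical, the three entries are a toy stand-in for a real pronunciation dictionary, and the search simply returns the first segmentation that covers the whole input.

```python
from functools import lru_cache

# Hypothetical toy lexicon: tuple of IPA phonemes -> spelling.
LEXICON = {
    ("ð", "ə"): "the",
    ("k", "æ", "t"): "cat",
    ("s", "æ", "t"): "sat",
}

def phonemes_to_words(phonemes):
    """Split a phoneme sequence into lexicon words, or return None if impossible."""
    phonemes = tuple(phonemes)
    max_len = max(len(key) for key in LEXICON)

    @lru_cache(maxsize=None)
    def segment(start):
        if start == len(phonemes):
            return ()  # reached the end: nothing left to segment
        # Try every word length that fits, shortest first.
        for end in range(start + 1, min(start + max_len, len(phonemes)) + 1):
            word = LEXICON.get(phonemes[start:end])
            if word is not None:
                rest = segment(end)
                if rest is not None:
                    return (word,) + rest
        return None  # no lexicon word starts here

    result = segment(0)
    return list(result) if result is not None else None

words = phonemes_to_words(["ð", "ə", "k", "æ", "t", "s", "æ", "t"])
print(words)  # ['the', 'cat', 'sat']
```

This sketch ignores homophones, where one phoneme sequence matches several spellings. Choosing between those would still need something like the language model stage, but the acoustic network itself would no longer have to learn spelling.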

That has the potential to make speech-to-text systems available to the millions of people who are illiterate and who speak languages for which there are few resources.

