My memory is in the clouds

How I tried to get my Google Assistant to remember things for me.

I recently bought a couple of Google Home devices for my flat, and I find it really nice to be able to interact in a way that doesn’t require my eyes to be open. I use the one on my bedside table most, asking it to turn off the lights and play music before I go to sleep. This way I can control the music, volume and lighting while I’m dozing off.


The most common question I ask it, though, is for the time, as strange as that sounds when every device has the time in one corner or another. If I look for the time on my phone, I am so often distracted by other notifications that I end up spending 5 minutes on my phone, one distraction after the next. At night, the brightness of the screen is also unpleasant.


I don’t have many smart devices in the flat besides the bedside lamps, so what would make the Google Home most useful for me would be to program my own little conversations with it. One that I thought would be easy to implement was a simple key-value store: something where I could tell the Google Home, for example, “My weight is 70kg” and then be able to ask “What is my weight?”. That is what I set out to do a couple of weeks ago.
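As a rough illustration of the key-value idea, here is a minimal Python sketch that works on text which has already been transcribed. The phrase patterns and the function name handle_utterance are assumptions made up for this example; they are not part of any Google Home or IFTTT API.

```python
import re

# Hypothetical phrase patterns; a real assistant would pass in transcribed text.
STORE_PATTERN = re.compile(r"^my (?P<key>.+?) is (?P<value>.+)$", re.IGNORECASE)
RECALL_PATTERN = re.compile(r"^what is my (?P<key>.+?)\??$", re.IGNORECASE)


def handle_utterance(text, memory):
    """Store or recall a fact in `memory` and return the reply to speak back."""
    text = text.strip()

    recall = RECALL_PATTERN.match(text)
    if recall:
        key = recall.group("key").lower()
        if key in memory:
            return f"Your {key} is {memory[key]}"
        return f"I don't know your {key}"

    store = STORE_PATTERN.match(text)
    if store:
        key = store.group("key").lower()
        memory[key] = store.group("value").rstrip(".")
        return f"OK, your {key} is {memory[key]}"

    return "Sorry, I didn't understand that"


if __name__ == "__main__":
    memory = {}
    print(handle_utterance("My weight is 70kg", memory))
    print(handle_utterance("What is my weight?", memory))
```

The dictionary only lives in memory here; keeping it between sessions would need a file or database, which this sketch leaves out.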


My initial idea was that this could be done nice and simply through Google Home using IFTTT. Although the recording can be done using IFTTT, it doesn’t permit any recall, which somewhat defeats the purpose of the tool I was trying to make. It would mean I could enter as much information as I wanted, but I wouldn’t be able to check that the information was correct until I looked at a screen again.


The next thing I tried was to see whether it was possible to make a Google Home app of my own.

That was a couple of evenings gone with not much to show for the hours sunk. 

I should have given up on that pursuit earlier than I did because, as slowly became apparent, although there is a test environment in which you can run your Google Assistant apps, there is no way to deploy an app to your own Google Home without first getting approval for an alpha release, something which seems time-consuming and comes with no guarantee of success. (It turns out there is a way, if you match the accounts.)


I then found that the Tasker series of apps had an integration that might allow me to have the Google Home handle some kind of recall. It seems that at one point this was the case, but the Tasker assistant app has now been shut down by Google.


I suspect that the reason this recall task is so challenging is that Google is trying to make sure that all queries use Google as the search engine. If you could use another system, then Google would be paying the cost of translating all that audio into text only for another company to provide the results in a manner which Google doesn’t have direct control over.

It could mean that once Google permits paid results on its voice devices, you could bypass this by continuing to use a separate app.

From a customer relations point of view, it is better that they never permit such an app than remove it once they need the extra control.


As a result of what could be seen as an abuse of power and position, I took the position of: well, how hard can it be to make my own?

Voice recognition has been a challenge to the field of computer science since the original work done at the end of the 50s.


I started with a tutorial which trains a network on single one-second sound clips. It reached a high validation accuracy, but when my girlfriend and I tried it out it didn’t work very well, mixing up “off” and “up”.


I thought that the model not being able to handle variable-length input would limit its generality. If, for example, the speech was closer to the beginning of the clip than the end, the model would have to find a clever way to get around that. Looking for variable-length models, I found the leading research summarised in this lecture. Excited by the prospect of this model, I looked for an example implementation in Python which I could use.

I found this one, which, with a little bit of upgrading to the latest version of TensorFlow, I managed to get running.

I found watching the training of this model fascinating!

First, it starts to learn some simple phonemes and then builds up an understanding of the words.

My girlfriend, apart from complaining about the heat generated by the model in our windowless kitchen, commented that I was more excited about this model learning its first words than I would be when our first child does.


This second model has some things I wanted to tweak. For one, to save on computational resources it was only trained on a single speaker, speaking 15 sentences. I had the single-word data from the other model, each word spoken by hundreds of different speakers, so I thought that would help add to the generality of the system. Training the system on short single-word input first might also help it to develop quicker, a point mentioned in the aforementioned lecture.


One of the other changes I needed to work out was how to save this new model. I haven’t had much experience with TensorFlow before; most of my university coursework was done in MATLAB instead (I no longer have access to the £1500 application). Once I could save it, I would also need to work out how to apply it to new voice samples to test whether it had learnt well enough to recognise my voice.


There are a couple of other directions I would like to take this model. I want to see if I can improve the training time by including more phrases that it performs worse on in the next epoch. This might allow me to keep the amount of training data per epoch down while maintaining generality.
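One way this idea of revisiting the harder phrases might look is loss-weighted sampling. This is only a sketch under stated assumptions: it supposes a per-phrase loss is recorded after each epoch (the model I was running does not do this out of the box), and the function name and the floor parameter are invented for illustration.

```python
import random


def sample_next_epoch(phrase_losses, epoch_size, floor=0.05, rng=None):
    """Pick phrases for the next epoch, favouring those with a higher loss.

    phrase_losses maps a phrase id to its most recent non-negative loss
    (hypothetical bookkeeping). `floor` gives every phrase a small weight
    so that well-learnt phrases are still revisited occasionally.
    """
    rng = rng or random.Random()
    phrases = list(phrase_losses)
    weights = [phrase_losses[p] + floor for p in phrases]
    # Sampling with replacement: a hard phrase can appear several times.
    return rng.choices(phrases, weights=weights, k=epoch_size)


if __name__ == "__main__":
    losses = {"turn the lights off": 0.2, "volume up": 2.5, "what time is it": 0.1}
    print(sample_next_epoch(losses, epoch_size=5, rng=random.Random(0)))
```

The epoch size stays fixed, so the amount of data per epoch does not grow; only the mix of phrases shifts towards the ones the model currently gets wrong.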

The other, longer-term idea, which would permit a much larger corpus of training data to be used, is to use unlabelled audio data which is then labelled by a mix of the audio model and a language model that isn’t based on n-grams but instead on some semantic understanding of the sentence and its context.

I imagine this labelling and feedback working more like how people learn a second language: having some understanding of the words, but using an understanding of the context to improve their understanding of those words. I’ve never got to such a level in learning a second language, but I have heard a lot of anecdotal evidence that the way to really learn a language is to visit the country it’s spoken in so you are immersed and, failing that, to consume as much media in that language as possible: news programmes, films, etc.


The improvement that I imagine this labelling idea to have is that it can use all speech audio as input rather than just labelled speech. It should be able to go back and relabel its past training data as it improves.

This should help it get around some of the issues of getting access to a diverse enough set of language training data. The problem, though, is in creating a language model that can identify that a particular parsing doesn’t make any sense, the development of which would be easier if I could already speak through Google Home to my own chatbot. I would then hope to use the ambiguous grammar parser I have previously expressed a desire to build.


Where did I get to in the end? I stumbled across this article, which says Google can already do this. It turns out something like this already exists, but only since this year.

https://support.google.com/assistant/answer/7536723?hl=en

You can view your cloud memories here.


Since finding out Google already does this, I haven’t used the feature, mainly because of what happened when I tried “Remember my keys are in the pot”:

“Where are my keys?” “This is what I found on the web…”

Ffs, Google.

Turns out I should have asked “Where is my keys?”. Google is clearly trying to be more 90s.

Maybe it is also because the feature isn’t as useful as I thought, especially as I can’t write programs to manipulate it, to answer things such as “How has my weight changed over the last month?” or

“I’m putting the ice in the freezer” “How long ago did I put ice in the freezer?”
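Questions like these need a timestamp stored with every value, not just the latest value. A minimal sketch of such a store, assuming entries are added in chronological order; the class TimedMemory and its method names are hypothetical and are not anything Google exposes.

```python
from datetime import datetime, timedelta


class TimedMemory:
    """Keeps every value said for a key, along with when it was said."""

    def __init__(self):
        self._history = {}  # key -> list of (timestamp, value), oldest first

    def remember(self, key, value, when=None):
        self._history.setdefault(key, []).append((when or datetime.now(), value))

    def time_since(self, key, now=None):
        """How long ago the key was last mentioned, or None if never."""
        entries = self._history.get(key)
        if not entries:
            return None
        return (now or datetime.now()) - entries[-1][0]

    def change_over(self, key, period, now=None):
        """Newest minus oldest numeric value recorded within `period`."""
        now = now or datetime.now()
        recent = [v for t, v in self._history.get(key, []) if now - t <= period]
        if len(recent) < 2:
            return None
        return recent[-1] - recent[0]


if __name__ == "__main__":
    memory = TimedMemory()
    memory.remember("ice in the freezer", True)
    print(memory.time_since("ice in the freezer"))
    print(memory.change_over("weight", timedelta(days=30)))
```

With this, “How long ago did I put ice in the freezer?” maps to time_since, and “How has my weight changed over the last month?” maps to change_over with a 30-day period.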

