Your eyes. They turn me

Ramblings on AI Vision and how our visual data with known intent is valuable.

We need massive amounts of data to teach machines mundane things. Tasks that we humans find easy can eventually be learned by a machine, but only if the training data is produced somehow.

High-level reasoning requires very little computation, but low-level sensorimotor skills require enormous computational resources. (Moravec’s paradox)

In the very near future, data-hungry algorithms will need us to take part on a massive scale. Businesses will reorganise around collecting this data, and some will base their entire existence on gathering it and either open-sourcing it or selling it to the highest bidder.

We are seeing just the start with Google’s release of a collection of six hundred and fifty thousand attempts by a robotic arm to grasp everyday objects (Source). Tesla’s cars come with essentially all the sensors needed for self-driving (the newer models will come with some upgrades). These sensors constantly watch what is happening outside the car, and this data is combined with what the driver does (steering, braking, etc.) and where the driver wants to go (the built-in maps / GPS system). Tesla calls this fleet learning, and it will let the company collate an incredible collection of ‘tasks’ and ‘actions’ aligned with the ‘environment’ (Source).

One of my favourite quotes from Andrew Ng presents a clear picture of why this matters.

I think AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. If you have a large engine and a tiny amount of fuel, you won’t make it to orbit. If you have a tiny engine and a ton of fuel, you can’t even lift off. To build a rocket you need a huge engine and a lot of fuel.

The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.
Source

Where do we come in?

Nvidia have released a ‘Devbox’. This device can be set up with a basic camera and connected to most cars reasonably easily. It is an end-to-end learning system that does what the Tesla cars do at a much more basic (aka cheaper) level: it takes a heap of raw data about the environment and aligns it with the driver’s actions to eventually produce a computer’s visual understanding of the boundaries of the road.
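As a rough illustration of what "aligning raw data about the environment with actions" could look like, here is a minimal sketch that pairs each camera frame with the steering reading closest to it in time. Everything in it is an assumption made for illustration: the `Sample` record, the `align` function and the field names are hypothetical and are not taken from Nvidia's or Tesla's actual systems.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Sample:
    """One training example: what the camera saw and what the driver did."""
    timestamp: float       # seconds since recording started
    frame_id: str          # hypothetical reference to a stored camera frame
    steering_angle: float  # degrees, taken from the nearest steering reading


def align(frames, steering):
    """Pair each (timestamp, frame_id) with the nearest (timestamp, angle).

    `steering` must be non-empty and sorted by timestamp.
    """
    times = [t for t, _ in steering]
    samples = []
    for t, frame_id in frames:
        i = bisect_left(times, t)
        # The nearest reading is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - t))
        samples.append(Sample(t, frame_id, steering[nearest][1]))
    return samples


# Made-up readings, purely to show the shape of the data.
frames = [(0.0, "f0"), (0.1, "f1"), (0.2, "f2")]
steering = [(0.02, -1.0), (0.11, 0.5), (0.19, 2.0)]
samples = align(frames, steering)
for s in samples:
    print(s)
```

A real pipeline would also have to deal with sensor lag, dropped frames and far richer action data, but the core idea is the same: every picture of the road gets labelled with what a human did at that moment.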

Imagine a company wanted to gather a heap of information about how to drive. It could put a box like this in one thousand cars for a month and get about ninety thousand hours of driving data. That amount is probably paltry compared to what is needed, though.
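The ninety-thousand-hour figure can be reproduced with a quick back-of-the-envelope calculation. The text does not say how long each car is driven per day, so the three hours used here is an assumption, chosen because it is the value that makes the totals match.

```python
# Back-of-the-envelope estimate of the driving hours a fleet would collect.
cars = 1_000               # from the text
days = 30                  # "a month"
hours_per_car_per_day = 3  # ASSUMPTION: not stated in the text

total_hours = cars * days * hours_per_car_per_day
print(f"{total_hours:,} hours of driving data")
```

Changing the assumption to one hour a day drops the total to thirty thousand hours, which shows how sensitive the estimate is to how much the cars are actually used.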

We take our view of reality very much for granted. Teaching computers to have the same view is extremely hard, but adding data to the mix, in massive amounts, can help. The simple act of walking up to a door, knocking and handing something over is incredibly complex: it takes a great deal of hand-eye coordination and reasoning about the world around us.

Giving the machines vision

Take the driving example a step further: it would not be unreasonable for a company like Domino’s to be investigating automated pizza delivery. To do this, they need to know what every door in the world looks like, or at least a large enough share of them to train a good machine learning algorithm. One approach would be to fit each delivery driver with a camera; even a single lens (no 3D vision) would be sufficient. Recording approximately five hundred thousand deliveries a day (Source) for a few weeks should give the company a start.

Ignoring the inevitable backlash from pizza patrons and the privacy concerns, you end up with a treasure trove of data about something simple for humans but hard for machines. Task: reason about where the door is, go towards it and knock. Action: 10 million 2D recordings of movement towards a door.
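For a sense of how long "a few weeks" is, the sum below works out how many days of recording it takes to reach 10 million clips. It assumes every delivery is recorded and yields exactly one usable recording, which is an idealisation rather than anything stated in the text.

```python
import math

deliveries_per_day = 500_000    # approximate figure quoted in the text
target_recordings = 10_000_000  # figure quoted in the text

# ASSUMPTION: one usable recording per delivery, every delivery recorded.
days_needed = math.ceil(target_recordings / deliveries_per_day)
weeks_needed = days_needed / 7

print(f"{days_needed} days, or about {weeks_needed:.1f} weeks")
```

Under those assumptions the target is reached in under three weeks, which fits the "few weeks" estimate.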

What is next

This leaves us all at an interesting impasse. We want the benefits of more automation, since it will make everything cheaper, but in getting there we will eventually replace our own actions.

This is not new. Businesses already gather incredible amounts of data; Google and Bing are leading the way in speech and image recognition because they already have the data available. The shift to be aware of is when data collection becomes distributed to a whole new level, when your own eyes become valuable to our next generation of machines.

Closing Thoughts

Happy to take any comments. I am merely writing down some thoughts gathered while reading a lot about AI trends, and I have not personally written any machine learning algorithms. However, if you are curious like me, I recommend you watch talks by Andrew Ng and subscribe to Exponential View.