Lecture Notes

These are my wrap-up notes from ICVSS 2016. As usual, if I don’t write things down, I may forget them.. :p

It was my first time attending a summer school. Luckily, ICVSS was held in Sicily, an island rich with tradition, history, and of course, culinary experiences. For this year’s 10th edition of ICVSS, the theme was exciting: what will happen next?

I joined ICVSS mainly to keep myself updated. I ‘stopped’ following the trends and learning computer vision in 2011 after I completed my master’s in ‘Computer Vision and Robotics’, because I was too happy enjoying life (read: doing the family things, kiddo things). Then, when I went back to school last year, everything in computer vision had changed. I was lost, especially since you must reactivate your brain after several years of hibernating. I began studying computer vision a long time ago (in 2006, when image processing was more common), when the field was not as hyped, popular, and exciting as it is today. So, I’m very glad I got to know the latest trends in computer vision, and where the field is heading, at ICVSS 2016.

Our schedule was very tight with courses and social activities. Here are some notes:


Devi Parikh, the one-and-only woman speaker in this ICVSS edition, opened our lectures. She explained her work on VQA (Visual Question Answering), where you can play around with the demo here. Basically, VQA is like a Visual Turing Test: the computer is given an image and questions, and it should give the answers. VQA was developed to move forward from image captioning, which is more generic, passive (not as interactive as VQA), and more simplified.

AI abilities for image understanding can be built from words and pictures. The basic model for VQA is MCB (Multimodal Compact Bilinear pooling). This model extracts ‘words’ (features) from the image using a CNN and from the question using an LSTM, then pools the two so that all elements of both representations can interact.
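To make the "all elements interact" part a bit more concrete for myself, here is a minimal sketch of compact bilinear pooling in NumPy. It is my own toy illustration, not code from the lecture: each feature vector is projected with a random count sketch, and the two sketches are combined by circular convolution (done as a product in the FFT domain), which approximates the full outer product of the two vectors. The sizes, the random hashes, and the function names `count_sketch` and `mcb_pool` are all assumptions for the example.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project vector x into d dims: y[h[i]] += s[i] * x[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(v, q, h_v, s_v, h_q, s_q, d):
    """Compact bilinear pooling: circular convolution of the two count
    sketches, computed as an element-wise product in the FFT domain."""
    fv = np.fft.rfft(count_sketch(v, h_v, s_v, d))
    fq = np.fft.rfft(count_sketch(q, h_q, s_q, d))
    return np.fft.irfft(fv * fq, n=d)

rng = np.random.default_rng(0)
n_v, n_q, d = 8, 6, 16          # toy sizes, real models use thousands
h_v, h_q = rng.integers(0, d, n_v), rng.integers(0, d, n_q)
s_v, s_q = rng.choice([-1.0, 1.0], n_v), rng.choice([-1.0, 1.0], n_q)

v = rng.normal(size=n_v)        # stand-in for a CNN image feature
q = rng.normal(size=n_q)        # stand-in for an LSTM question feature
joint = mcb_pool(v, q, h_v, s_v, h_q, s_q, d)
```

The nice part is that `joint` has only `d` entries, while the explicit outer product of the two vectors would have `n_v * n_q` entries, which gets huge for real feature sizes.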

Attention can also be added to the network to achieve a more focused comprehension. Image attention focuses on where to look in an image, while question attention is about which words to listen to. Apart from the VQA work, there is also an app that answers questions from blind people about their surroundings: VizWiz.
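As a rough sketch of what image attention does (again my own simplification, assuming a plain dot-product score rather than whatever the actual VQA models use): every image region gets a score against the question vector, the scores go through a softmax, and the result is a weighted average of the region features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())     # subtract the max for numerical stability
    return e / e.sum()

def attend(region_feats, question_vec):
    """Weight image regions by how well they match the question."""
    scores = region_feats @ question_vec      # one score per region
    weights = softmax(scores)                 # non-negative, sums to 1
    attended = weights @ region_feats         # weighted average feature
    return weights, attended

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 32))   # e.g. a 7x7 grid of 32-d features
question = rng.normal(size=32)
weights, attended = attend(regions, question)
```

Reshaping `weights` back to the 7x7 grid gives the kind of "where to look" heat map shown in attention demos.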

The second lecture was delivered by Andrej Karpathy, the PhD superstar (role-model PhD in computer vision). It was not the first time I listened to one of his talks, but they are always enjoyable, just like how I always enjoy reading his blog. He talked mostly about connecting images with natural language. As we know, if we connect the modalities (visual domain and natural language), we could create a rich composition for image understanding.

Naturally, the human error rate is between 2-5% when it comes to classifying the images. Surprisingly, in ILSVRC 2015, the machine reached this point. To generate captions for images, they applied a CNN for image classification and an RNN for modelling the sentences (for more detail see this). While this only works for one image and one sentence, a more recent work tried to generate multiple localized captions using DenseCap. With region proposals in their CNN and RNN chain, they can predict a caption for each region efficiently. In the end, each image has a set of bounding boxes, each labelled with a caption.
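A side note on those percentages: ILSVRC classification results are usually reported as top-5 error, meaning a prediction counts as correct if the true label is among the five most confident guesses. A tiny sketch with made-up predictions (the labels and the helper name `top5_error` are mine, just for illustration):

```python
def top5_error(ranked_predictions, true_labels):
    """Fraction of images whose true label is not among the 5 best guesses."""
    misses = sum(
        1 for preds, truth in zip(ranked_predictions, true_labels)
        if truth not in preds[:5]
    )
    return misses / len(true_labels)

# Made-up predictions for 4 images, most confident guess first.
preds = [
    ["cat", "lynx", "dog", "fox", "wolf", "cow"],
    ["car", "bus", "van", "truck", "tram", "bike"],
    ["tulip", "rose", "daisy", "lily", "poppy", "iris"],
    ["pizza", "pie", "cake", "bread", "tart", "bagel"],
]
truth = ["dog", "bike", "tulip", "cake"]
error = top5_error(preds, truth)   # only "bike" falls outside the top 5
```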

The 3rd lecture on the first day was by Bernt Schiele, who covered people detection and human pose estimation. Research in people detection took off around 2004, marked by the Viola-Jones face detection algorithm that was then applied in many cameras. Not only faces: people detection was also developed, such as for pedestrian tracking. They have used DPM (Deformable Part Models), DF (Decision Forests), and most recently the deep network family to detect body parts.

Human pose estimation was another topic that he addressed. The question is: how can we detect a person in different poses? While part-based models can represent the human body efficiently through its parts/structure, they are too generic and only model adjacent body parts. Therefore, a CNN-based method was used in the pose machine to enable video tracking of joints/body parts. Using DeeperCut, they are also able to estimate poses in scenes with multiple people. Here, they perform person detection and pose estimation jointly, which solves the subset partitioning and labelling problem.

Then, we had a reading group. Our group mentor was supposed to be Koray Kavukcuoglu (Google DeepMind), but he was not there yet, so the committee changed our mentor to Yann LeCun, the director of FAIR (Facebook AI Research). Well, they both work on a similar topic, the ‘deep’ thing, and Yann was Koray’s PhD advisor, so we got the idea.

In our reading group, Yann told the story of how he came up with CNNs, from the history of traditional AI (symbolic machine learning), information theory, and pattern recognition, to cybernetics. It was a very nice and quite informal discussion, with him joining the group by the poolside. Back in the 90s, he proposed his CNN (initially for text recognition), which could be trained end-to-end without having to think about hand-crafted features. At that time, he knew that this approach could be a breakthrough in computer vision, but he did not expect the community to take so long to realize it (it took more than 20 years). On the other hand, once they realized it, they were very quick to make CNNs so hyped (a huge revolution in computer vision, of course).

An intriguing question was: now that we have stopped hand-crafting the features, aren’t we hand-crafting the architectures instead (tweaking the structure hierarchy, tuning the parameters)? He answered: this is engineering. Since the 90s, people have explored neural network architectures (tuning the hyper-parameters, for example) and yes, an architecture can always be optimized.

At night, after the student poster session, we had the industry exhibition. It was very interesting for me to see (and interact with) their computer vision showcases, from Microsoft Research, Xerox, Osram, Facebook (billions of photos every day, so many things they can do), Rakuten (they have their virtual fitting room app), and Qualcomm.


The fourth lecture, by Ashutosh Saxena (CEO of Brain of Things), was about deep learning for robotics. They showed a case study of their smart house, which to me seems like a cute idea come to life, with automatic lighting, pet feeding, breakfast making, home monitoring, etc. (everything is smart, I’m thinking about having one :p).

Inside the smart home, there is joint learning between vision, language, and action. Spatio-temporal problems, like human-object interaction, modeling human motion, and tracking, can be solved by Structural RNN. Their spatio-temporal graph captures interactions through nodes (objects) and edges (temporal). Then the Structural RNN relates their correspondences.

Then, Sergey Levine gave the 5th lecture, about deep learning for robotics (again). I’m not really into robotics, so there is not much I can write about this. All I can recall is that they use imitation learning (instead of learning from the real human) to obtain optimal control decisions and trajectories. Since deep reinforcement learning is very data hungry, they use a model-based method, which is more efficient.

After lunch, we had Yann LeCun on stage to give his talk. The preparation took a bit longer because he needed to set up his phone to record for FB Live (yes, you could also stream it via his FB). He mentioned three obstacles to progress in AI: reasoning, memory/attention, and learning by observing the world. To address reasoning, memory, and attention, some works augment neural nets with a memory module (recurrent networks are bad at remembering, so they add separate memory modules: LSTM, NTM from DeepMind, etc.).

How can a machine acquire common sense, the way humans normally do? He argued that it cannot come from human rules, supervised learning, or reinforcement learning. If you imagine a cake: reinforcement learning is the cherry, supervised learning is the icing, and unsupervised learning is the cake itself. Using this analogy, Yann said that unsupervised learning is the ‘dark matter’ of AI. Why? Because most humans and animals learn about the world by observing. If we build a model of the world through predictive unsupervised learning, then it can give the machine ‘common sense’. But it is hard because we don’t know how to do unsupervised learning (or even how to formulate it). LeCun also mentioned the possibility of integrating supervised and unsupervised learning in one learning rule, with what-where auto-encoders.

Recently, instead of only classifying or categorizing images, the computer also does image generation. Here, they do unsupervised representation learning to generate images using DCGAN (Deep Convolutional Generative Adversarial Networks). It seems a bit scary to me to see how good the machine is at generating e.g. bedroom images after training on the database. Interestingly, none of the generated bedroom images is the same as those in the training database. DCGAN can also do face algebra: (man with glasses) - (man without glasses) + (woman without glasses) = woman with glasses. The resulting images seem very natural, though. There is also a creative AI work using DCGAN to generate floor plans.

DCGAN – Face Algebra
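The face algebra happens on the latent vectors that are fed to the generator, not on the pixels. A minimal sketch of the arithmetic, where the latent codes are just random stand-ins (in a real setup they would be codes whose generated images show each concept) and the vector size and number of samples per concept are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim = 100   # size of the generator's input vector, an assumption here

# Hypothetical latent codes, 3 per concept, standing in for codes whose
# generated images show a man with glasses, a man without, and so on.
z_man_glasses = rng.uniform(-1, 1, size=(3, z_dim))
z_man_plain = rng.uniform(-1, 1, size=(3, z_dim))
z_woman_plain = rng.uniform(-1, 1, size=(3, z_dim))

# Average a few samples per concept to get a more stable concept vector,
# then do: (man with glasses) - (man without) + (woman without).
z_result = (z_man_glasses.mean(axis=0)
            - z_man_plain.mean(axis=0)
            + z_woman_plain.mean(axis=0))

# z_result would then go through the trained generator (not shown here),
# which should produce a woman with glasses.
```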


On the third day, we learned more about body models from Michael Black: how to construct a generative human body model (3D mesh) based on pose, shape, dynamics, and texture. In another model (a factored model), they also represent the human body taking into account soft-tissue motion, breathing, shape, and pose. One learned model is SMPL, which can accurately represent a wide variety of body shapes and poses. They even take it further to DMPL, a Dynamic SMPL that captures motion (e.g. soft-tissue motion, which heavier people tend to have more of). I think it might be very cool to use in film production, as the human body can be easily modelled and rendered.

Another interesting work models the 3D human pose and shape using only a single 2D image. They also have something related to medicine, which is reconstructing the 3D shape of fat tissue from whole-body MRI. It gave me an idea of how the whole-body MRIs in a dataset could be turned into 3D human bodies of the population.

Then, Jamie Shotton from Microsoft Research gave a talk on tracking and modeling the human hand for interaction in VR/AR. It is interesting how intensely he worked on Kinect and resolved the challenges of real-world application. We know that the foundation inside Kinect is the decision forest, but from his talk I found out that there is another variant of the decision forest: the decision jungle.

So a forest is not the same as a jungle? Ok, when I looked for the answer: every jungle is a forest, but not every forest is a jungle. A jungle is a dense forest. Likewise, a decision jungle merges similar/redundant nodes to save space. Instead of trees as base learners, it uses DAGs (directed acyclic graphs). In a DF, deeper trees mean higher accuracy, but memory limits the depth. Hence, decision jungles address this memory consumption.
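To see why merging nodes matters for memory, here is a back-of-the-envelope count (my own toy numbers, not from the talk). A full binary tree doubles its number of nodes at every level, while a DAG with a capped width per level stops growing once it hits the cap:

```python
def tree_nodes_per_level(depth):
    """A full binary tree doubles its node count at every level."""
    return [2 ** level for level in range(depth)]

def jungle_nodes_per_level(depth, max_width):
    """A width-limited DAG merges nodes so no level exceeds max_width."""
    return [min(2 ** level, max_width) for level in range(depth)]

depth, max_width = 12, 64    # toy numbers chosen only for illustration
tree_total = sum(tree_nodes_per_level(depth))                 # 4095 nodes
jungle_total = sum(jungle_nodes_per_level(depth, max_width))  # 447 nodes
```

So at the same depth the DAG needs roughly a tenth of the nodes in this example, and the gap keeps growing with depth. Which nodes get merged is of course the hard part that the training has to figure out, and this sketch does not cover that.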

Additionally, he talked about modelling the human hand. Some commercial vision-based systems like Leap Motion, RealSense, Nimble VR, and SoftKinetic are doing this kind of task. As we know, Kinect can detect body parts, but now they aim for more robust hand tracking. The streams of work in hand estimation are model fitting (hand pose estimation = hand tracking), pose regression, and hand-shape personalisation.


Pietro Perona explained his project, Visipedia, in the 9th lecture. The initiative came from a thought he had after a friend emailed him a photo asking whether he could eat that mushroom. Perhaps the answer is there in Wikipedia, but it is difficult to match the object we see with the figures that appear in a webpage/encyclopedia. Naturally, we use visual queries every day; we always ask questions about what we see and we need the answer immediately. For example: what kind of bird is it, what is the meaning of those kanji symbols, is that pimple dangerous, or can I eat this leaf? Recently, Visipedia launched a bird recognition app (you can try the demo here).

In training the ‘curious’ machine, one difficulty is that sometimes even the experts disagree. Among the AMT (Amazon Mechanical Turk) workers hired, the annotators may have different levels of competence. The computer must learn from the mistakes that people make, since the annotators’ beliefs should be useful for the machine’s confidence. Another work in Visipedia is RegisTree, which detects all the street trees within Pasadena, to count them and also classify their species.
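One simple way to picture "different competence" is a weighted vote, where reliable annotators count more than unreliable ones. This is only my own sketch under strong assumptions (binary labels, independent annotators, accuracies already known), not the actual Visipedia model, which has to estimate competence from the data:

```python
import math

def weighted_vote(votes, accuracies):
    """Combine binary votes (True/False), trusting reliable annotators more.

    Each annotator is weighted by the log-odds of being right, assuming
    annotators err independently and their accuracies are known.
    """
    score = 0.0
    for vote, acc in zip(votes, accuracies):
        weight = math.log(acc / (1.0 - acc))
        score += weight if vote else -weight
    return score > 0, score

# Hypothetical example: one expert disagrees with two weak annotators.
votes = [True, False, False]
accuracies = [0.95, 0.60, 0.60]
label, score = weighted_vote(votes, accuracies)
```

Here the plain majority says False, but the weighted vote follows the expert and says True, and the size of `score` can serve as a confidence.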

The last lecture on the 4th day was by Antonio Torralba. In the first section, he explained more about visualizing the internal learned representations. Using a Deconvnet, we can interpret what happens inside a convnet and visualize its features. From this mapping, we can pick a unit from a pooling layer, see the strongest activations in the layers, and reconstruct the input.

In 2013, they also launched the ‘Places’ dataset, which contains millions of images for scene recognition. Although they used AlexNet as the architecture for Places-CNN, the internal representations for Places are different: they lean more towards scene parts, objects, or textures. To test how well this works, you can play around with their demo here.

To estimate the receptive fields, we can check the distribution of semantic types at each layer. Commonly, in the first layer, the CNN extracts simple elements and colors (such as vertical lines, curved lines, blue, etc). In the subsequent layers, the CNN extracts textures (striped, wooden, sandy, etc), regions/surfaces (sky, sea, grass), object parts (leg, head, wheel, dominantly at the 5th layer), objects, and scenes. Therefore, since the first two layers of a CNN are task independent, a strategy for training on a new task is to freeze all the parameters trained in those layers and only train the upper layers to get a better representation.
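The freezing strategy is easy to show on a toy network. In this sketch (my own, with random data and a two-layer net in plain NumPy instead of a real CNN), `W1` plays the role of the pretrained lower layers and never gets updated, while only the upper layer `W2` is trained on the new task:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # "pretrained" lower layer, kept frozen
W2 = rng.normal(size=(8, 1))   # upper layer, retrained for the new task
W1_before = W1.copy()

X = rng.normal(size=(32, 4))   # toy inputs for the new task
y = rng.normal(size=(32, 1))   # toy targets

def loss(W2):
    """Mean squared error of the two-layer network."""
    return float(np.mean((np.maximum(X @ W1, 0) @ W2 - y) ** 2))

loss_start = loss(W2)
for _ in range(200):
    hidden = np.maximum(X @ W1, 0)            # frozen features (ReLU)
    grad_W2 = 2 * hidden.T @ (hidden @ W2 - y) / len(X)
    W2 -= 0.01 * grad_W2                      # only the upper layer moves
loss_end = loss(W2)
```

In a deep learning framework the same idea is usually expressed by switching off gradients for the lower layers, so that the optimizer only sees the upper ones.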

Well, an interesting showcase by Torralba was probably the movie-book alignment: how to pair the rich descriptions in a book with the visual content of the movie through cross-modal learning. They use sentence similarity and a context-aware CNN to match the text with the visual content. It turns out to be quite fun to see how well the movie follows the book’s plot.



On the last day, Koray Kavukcuoglu from Google DeepMind opened the lectures with the topic ‘Deep Learning for Agents’. The world has enjoyed supervised learning, which can learn from large labeled datasets with deep neural networks, an optimized end loss, and non-engineered inputs. But unsupervised learning/generative models still pose some unanswered questions. Although several unsupervised learning methods exist, such as RBMs, auto-encoders, sparse coding, etc., we are still asking how to rank different algorithms and whether to trust the input domain (visual quality).

Despite the importance of supervised and unsupervised learning, real AI requires ‘agents’ that interpret, act in, and control their environment. Such agents are expected to work within reinforcement learning (read their ‘Nature’ paper). Then, he explained more about the application of deep reinforcement learning with which DeepMind’s AlphaGo beat a human at the game of Go.

It was interesting to see how Pixel RNN and Pixel CNN can ‘DRAW’ a prediction of the full image from an occluded image. It just makes me think how good machines are now at image completion (video here). Then, a newer version of Pixel CNN was shown to be powerful as the image decoder in an autoencoder.


Image Completion – DRAW
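The trick that lets Pixel CNN complete an image is that every pixel is predicted only from the pixels before it in raster-scan order (above, and to the left in the same row). In the convolutional version this is enforced by masking the kernels. A minimal sketch of such a mask, simplified by me to a single channel (the real model also orders the colour channels):

```python
import numpy as np

def pixelcnn_mask(kernel_size, include_center):
    """Mask for a square conv kernel so a pixel only sees pixels above it
    and to its left (raster-scan order)."""
    mask = np.zeros((kernel_size, kernel_size))
    c = kernel_size // 2
    mask[:c, :] = 1          # all rows above the centre
    mask[c, :c] = 1          # same row, strictly to the left
    if include_center:       # allowed after the first layer
        mask[c, c] = 1
    return mask

mask_a = pixelcnn_mask(3, include_center=False)   # first layer
mask_b = pixelcnn_mask(5, include_center=True)    # later layers
```

Multiplying the kernel weights by the mask before each convolution keeps the network from peeking at the pixel it is supposed to predict, or at anything after it.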

After the DeepMind lecture, we enjoyed the lecture from William Freeman. His works seem somehow out-of-the-box to me. First, he showed us the visual vibrometry project, on how to estimate material properties from video. Then, a more recent work is on sound supervision for visual learning, like a “Turing Test for Sound”. Effectively, the machine has also learned how to predict sound from a video clip, which sometimes fools humans (watch the video here).

Another project he presented was on micro-motion magnification in video. Using Eulerian video magnification, they can exaggerate micro-motions. It’s amazing how it can amplify the pulse motion of people’s heads, the heartbeat of infants (very useful since no devices need to be attached to the babies, only a camera from afar), and even the baby’s motion inside the mother’s womb. Cool..
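The core of the Eulerian idea, as I understand it, is surprisingly simple: look at each pixel as a signal over time, keep only the frequency band you care about (e.g. around the heart rate), amplify it, and add it back. Here is a stripped-down sketch on a synthetic video. It skips the spatial pyramid that the real method uses, and the band limits, the amplification factor, and the function name `magnify` are my own choices:

```python
import numpy as np

def magnify(frames, fps, f_lo, f_hi, alpha):
    """Amplify temporal variations between f_lo and f_hi (Hz) by alpha.

    frames: array of shape (time, height, width), grayscale.
    """
    spectrum = np.fft.rfft(frames, axis=0)
    freqs = np.fft.rfftfreq(frames.shape[0], d=1.0 / fps)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[~band] = 0                          # ideal temporal band-pass
    filtered = np.fft.irfft(spectrum, n=frames.shape[0], axis=0)
    return frames + alpha * filtered             # add the boosted band back

# Synthetic "video": constant brightness plus a tiny 1 Hz pulse.
fps, seconds = 30, 4
t = np.arange(fps * seconds) / fps
pulse = 0.01 * np.sin(2 * np.pi * 1.0 * t)
frames = 0.5 + pulse[:, None, None] * np.ones((1, 4, 4))
out = magnify(frames, fps, f_lo=0.8, f_hi=1.2, alpha=20)
```

In `out` the invisible 1% flicker becomes a clearly visible one, while the constant brightness is left alone.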

As for the closing speaker, since Shahram Izadi could not come to give his talk, he appeared in a pre-recorded video talking about his newest project, Holoportation. Surely, if this thing from Microsoft becomes real, then we can enjoy a new way of tele-conferencing with real-time hologram chat, and even re-playing ‘visual memories’, or watching TV/shows in hologram reconstruction. Sounds very exciting! *o*

Ok, I think it has come to an end. That’s all I can share… I hope this conveys the excitement of the recent work/research/trends in the computer vision world. Hope we can also join the stage..🙂 ;p