Interpreting visual information plays a key role in autonomous systems that act purposefully in their environment. We develop learning-based approaches to visual scene understanding for intelligent systems. Specifically, we are interested in computer vision methods for understanding dynamic scenes and reconstructing them in 3D. To this end, we investigate learned perceptual models that give robots a sense of intuitive physics and of how their environment works.