Fork me on GitHub

Trending arXiv

Note: this version is tailored to @Smerity - though you can run your own! Trending arXiv may eventually be extended to multiple users ...


1 2 3 4 5 6 7 8 9 24 25

EnhanceNet: Single Image Super-Resolution through Automated Texture Synthesis

Mehdi S. M. Sajjadi, Bernhard Schölkopf, Michael Hirsch

Single image super-resolution is the task of inferring a high-resolution image from a single low-resolution input. Traditionally, the performance of algorithms for this task is measured using pixel-wise reconstruction measures such as peak signal-to-noise ratio (PSNR) which have been shown to correlate poorly with the human perception of image quality. As a result, algorithms minimizing these metrics tend to produce oversmoothed images that lack high-frequency textures and do not look natural despite yielding high PSNR values. We propose a novel combination of automated texture synthesis with a perceptual loss focusing on creating realistic textures rather than optimizing for a pixel-accurate reproduction of ground truth images during training. By using feed-forward fully convolutional neural networks in an adversarial training setting, we achieve a significant boost in image quality at high magnification ratios. Extensive experiments on a number of datasets show the effectiveness of our approach, yielding state-of-the-art results in both quantitative and qualitative benchmarks.

Captured tweets and retweets: 20

DARN: a Deep Adversial Residual Network for Intrinsic Image Decomposition

Louis Lettry, Kenneth Vanhoey, Luc Van Gool

We present a new deep supervised learning method for intrinsic decomposition of a single image into its albedo and shading components. Our contributions are based on a new fully convolutional neural network that estimates absolute albedo and shading jointly. As opposed to classical intrinsic image decomposition work, it is fully data-driven, hence does not require any physical priors like shading smoothness or albedo sparsity, nor does it rely on geometric information such as depth. Compared to recent deep learning techniques, we simplify the architecture, making it easier to build and train. It relies on a single end-to-end deep sequence of residual blocks and a perceptually-motivated metric formed by two discriminator networks. We train and demonstrate our architecture on the publicly available MPI Sintel dataset and its intrinsic image decomposition augmentation. We additionally discuss and augment the set of quantitative metrics so as to account for the more challenging recovery of non scale-invariant quantities. Results show that our work outperforms the state of the art algorithms both on the qualitative and quantitative aspect, while training convergence time is reduced.

Captured tweets and retweets: 54

Online Semantic Activity Forecasting with DARKO

Nicholas Rhinehart, Kris M. Kitani

We address the problem of continuously observing and forecasting long-term semantic activities of a first-person camera wearer: what the person will do, where they will go, and what goal they are seeking. In contrast to prior work in trajectory forecasting and short-term activity forecasting, our algorithm, DARKO, reasons about the future position, future semantic state, and future high-level goals of the camera-wearer at arbitrary spatial and temporal horizons defined only by the wearer's behaviors. DARKO learns and forecasts online from first-person observations of the user's daily behaviors. We derive novel mathematical results that enable efficient forecasting of different semantic quantities of interest. We apply our method to a dataset of 5 large-scale environments with 3 different environment types, collected from 3 different users, and experimentally validate DARKO on forecasting tasks.

Captured tweets and retweets: 2

Highway and Residual Networks learn Unrolled Iterative Estimation

Klaus Greff, Rupesh K. Srivastava, Jürgen Schmidhuber

The past year saw the introduction of new architectures such as Highway networks and Residual networks which, for the first time, enabled the training of feedforward networks with dozens to hundreds of layers using simple gradient descent. While depth of representation has been posited as a primary reason for their success, there are indications that these architectures defy a popular view of deep learning as a hierarchical computation of increasingly abstract features at each layer. In this report, we argue that this view is incomplete and does not adequately explain several recent findings. We propose an alternative viewpoint based on unrolled iterative estimation---a group of successive layers iteratively refine their estimates of the same features instead of computing an entirely new representation. We demonstrate that this viewpoint directly leads to the construction of Highway and Residual networks. Finally we provide preliminary experiments to discuss the similarities and differences between the two architectures.

Captured tweets and retweets: 2

A Context-aware Attention Network for Interactive Question Answering

Huayu Li, Martin Renqiang Min, Yong Ge, Asim Kadav

We develop a new model for Interactive Question Answering (IQA), using Gated-Recurrent-Unit recurrent networks (GRUs) as encoders for statements and questions, and another GRU as a decoder for outputs. Distinct from previous work, our approach employs context-dependent word-level attention for more accurate statement representations and question-guided sentence-level attention for better context modeling. Employing these mechanisms, our model accurately understands when it can output an answer or when it requires generating a supplementary question for additional input. When available, user's feedback is encoded and directly applied to update sentence-level attention to infer the answer. Extensive experiments on QA and IQA datasets demonstrate quantitatively the effectiveness of our model with significant improvement over conventional QA models.

Captured tweets and retweets: 2

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.

Captured tweets and retweets: 1

Detecting Unexpected Obstacles for Self-Driving Cars: Fusing Deep Learning and Geometric Modeling

Sebastian Ramos, Stefan Gehrig, Peter Pinggera, Uwe Franke, Carsten Rother

The detection of small road hazards, such as lost cargo, is a vital capability for self-driving cars. We tackle this challenging and rarely addressed problem with a vision system that leverages appearance, contextual as well as geometric cues. To utilize the appearance and contextual cues, we propose a new deep learning-based obstacle detection framework. Here a variant of a fully convolutional network is used to predict a pixel-wise semantic labeling of (i) free-space, (ii) on-road unexpected obstacles, and (iii) background. The geometric cues are exploited using a state-of-the-art detection approach that predicts obstacles from stereo input images via model-based statistical hypothesis tests. We present a principled Bayesian framework to fuse the semantic and stereo-based detection results. The mid-level Stixel representation is used to describe obstacles in a flexible, compact and robust manner. We evaluate our new obstacle detection system on the Lost and Found dataset, which includes very challenging scenes with obstacles of only 5 cm height. Overall, we report a major improvement over the state-of-the-art, with relative performance gains of up to 50%. In particular, we achieve a detection rate of over 90% for distances of up to 50 m. Our system operates at 22 Hz on our self-driving platform.

Captured tweets and retweets: 2

Automatic Generation of Grounded Visual Questions

Shijie Zhang, Lizhen Qu, Shaodi You, Zhenglu Yang, Jiawan Zhang

In this paper, we propose a new task and solution for vision and language: generation of grounded visual questions. Visual question answering (VQA) is an emerging topic which links textual questions with visual input. To the best of our knowledge, it lacks automatic method to generate reasonable and versatile questions. So far, almost all the textual questions are generated manually, as well as the corresponding answers. To this end, we propose a system that automatically generates visually grounded questions . First, visual input is analyzed with deep caption model. Second, the captions along with VGG-16 features are used as input for our proposed question generator to generate visually grounded questions. Finally, to enable generating of versatile questions, a question type selection module is provided which selects reasonable question types and provide them as parameters for question generation. This is done using a hybrid LSTM with both visual and answer input. Our system is trained using VQA and Visual7W dataset and shows reasonable results on automatically generating of new visual questions. We also propose a quantitative metric for automatic evaluation of the question quality.

Captured tweets and retweets: 2

Learning Features by Watching Objects Move

Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

Captured tweets and retweets: 2

On Random Weights for Texture Generation in One Layer Neural Networks

Mihir Mongia, Kundan Kumar, Akram Erraqabi, Yoshua Bengio

Recent work in the literature has shown experimentally that one can use the lower layers of a trained convolutional neural network (CNN) to model natural textures. More interestingly, it has also been experimentally shown that only one layer with random filters can also model textures although with less variability. In this paper we ask the question as to why one layer CNNs with random filters are so effective in generating textures? We theoretically show that one layer convolutional architectures (without a non-linearity) paired with the an energy function used in previous literature, can in fact preserve and modulate frequency coefficients in a manner so that random weights and pretrained weights will generate the same type of images. Based on the results of this analysis we question whether similar properties hold in the case where one uses one convolution layer with a non-linearity. We show that in the case of ReLu non-linearity there are situations where only one input will give the minimum possible energy whereas in the case of no nonlinearity, there are always infinite solutions that will give the minimum possible energy. Thus we can show that in certain situations adding a ReLu non-linearity generates less variable images.

Captured tweets and retweets: 9

Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks

Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip Krishnan

Collecting well-annotated image datasets to train modern machine learning algorithms is prohibitively expensive for many tasks. One appealing alternative is rendering synthetic data where ground-truth annotations are generated automatically. Unfortunately, models trained purely on rendered images often fail to generalize to real images. To address this shortcoming, prior work introduced unsupervised domain adaptation algorithms that attempt to map representations between the two domains or learn to extract features that are domain-invariant. In this work, we present a new approach that learns, in an unsupervised manner, a transformation in the pixel space from one domain to the other. Our generative adversarial network (GAN)-based method adapts source-domain images to appear as if drawn from the target domain. Our approach not only produces plausible samples, but also outperforms the state-of-the-art on a number of unsupervised domain adaptation scenarios by large margins. Finally, we demonstrate that the adaptation process generalizes to object classes unseen during training.

Captured tweets and retweets: 2

Learning Residual Images for Face Attribute Manipulation

Wei Shen, Rujie Liu

Face attributes are interesting due to their detailed description of human faces. Unlike prior research working on attribute prediction, we address an inverse and more challenging problem called face attribute manipulation which aims at modifying a face image according to a given attribute value. In order to obtain an efficient representation for the manipulation, we propose to learn the corresponding residual image which is defined as the difference between images after and before the manipulation. Using the residual image, the manipulation can be operated efficiently with modest pixel modification. The framework of our approach is based on the Generative Adversarial Network. It consists of two image transformation networks imitating the attribute manipulation and its dual operation and a shared discriminative network distinguishing the generated images from different reference images. We also apply dual learning to allow the two transformation networks to learn from each other. Experiments show that the learned residual images successfully simulate the manipulations and the generated images retain most of the details in attribute-irrelevant areas.

Captured tweets and retweets: 7

Medical Image Synthesis with Context-Aware Generative Adversarial Networks

Dong Nie, Roger Trullo, Caroline Petitjean, Su Ruan, Dinggang Shen

Computed tomography (CT) is critical for various clinical applications, e.g., radiotherapy treatment planning and also PET attenuation correction. However, CT exposes radiation during acquisition, which may cause side effects to patients. Compared to CT, magnetic resonance imaging (MRI) is much safer and does not involve any radiations. Therefore, recently, researchers are greatly motivated to estimate CT image from its corresponding MR image of the same subject for the case of radiotherapy planning. In this paper, we propose a data-driven approach to address this challenging problem. Specifically, we train a fully convolutional network to generate CT given an MR image. To better model the nonlinear relationship from MRI to CT and to produce more realistic images, we propose to use the adversarial training strategy and an image gradient difference loss function. We further apply AutoContext Model to implement a context-aware generative adversarial network. Experimental results show that our method is accurate and robust for predicting CT images from MRI images, and also outperforms three state-of-the-art methods under comparison.

Captured tweets and retweets: 3

Bayesian Optimization for Machine Learning : A Practical Guidebook

Ian Dewancker, Michael McCourt, Scott Clark

The engineering of machine learning systems is still a nascent field; relying on a seemingly daunting collection of quickly evolving tools and best practices. It is our hope that this guidebook will serve as a useful resource for machine learning practitioners looking to take advantage of Bayesian optimization techniques. We outline four example machine learning problems that can be solved using open source machine learning libraries, and highlight the benefits of using Bayesian optimization in the context of these common machine learning applications.

Captured tweets and retweets: 2

Attentive Explanations: Justifying Decisions and Pointing to the Evidence

Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

Deep models are the defacto standard in visual decision models due to their impressive performance on a wide array of visual tasks. However, they are frequently seen as opaque and are unable to explain their decisions. In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions. We postulate that deep models can do this as well and propose our Pointing and Justification (PJ-X) model which can justify its decision with a sentence and point to the evidence by introspecting its decision and explanation process using an attention mechanism. Unfortunately there is no dataset available with reference explanations for visual decision making. We thus collect two datasets in two domains where it is interesting and challenging to explain decisions. First, we extend the visual question answering task to not only provide an answer but also a natural language explanation for the answer. Second, we focus on explaining human activities which is traditionally more challenging than object classification. We extensively evaluate our PJ-X model, both on the justification and pointing tasks, by comparing it to prior models and ablations using both automatic and human evaluations.

Captured tweets and retweets: 2

Harmonic Networks: Deep Translation and Rotation Equivariance

Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, Gabriel J. Brostow

Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360-rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient and low computational complexity representation, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.

Captured tweets and retweets: 2

Finding Tiny Faces

Peiyun Hu, Deva Ramanan

Though tremendous strides have been made in object recognition, one of the remaining open challenges is detecting small objects. We explore three aspects of the problem in the context of finding small faces: the role of scale invariance, image resolution, and contextual reasoning. While most recognition approaches aim to be scale-invariant, the cues for recognizing a 3px tall face are fundamentally different than those for recognizing a 300px tall face. We take a different approach and train separate detectors for different scales. To maintain efficiency, detectors are trained in a multi-task fashion: they make use of features extracted from multiple layers of single (deep) feature hierarchy. While training detectors for large objects is straightforward, the crucial challenge remains training detectors for small objects. We show that context is crucial, and define templates that make use of massively-large receptive fields (where 99% of the template extends beyond the object of interest). Finally, we explore the role of scale in pre-trained deep networks, providing ways to extrapolate networks tuned for limited scales to rather extreme ranges. We demonstrate state-of-the-art results on massively-benchmarked face datasets (FDDB and WIDER FACE). In particular, when compared to prior art on WIDER FACE, our results reduce error by a factor of 2 (our models produce an AP of 81% while prior art ranges from 29-64%).

Captured tweets and retweets: 2

Automated Inference on Sociopsychological Impressions of Attractive Female Faces

Xiaolin Wu, Xi Zhang, Chang Liu

This article is a sequel to our earlier paper [24]. Our main objective is to explore the potential of supervised machine learning in face-induced social computing and cognition, riding on the momentum of much heralded successes of face processing, analysis and recognition on the tasks of biometric-based identification. We present a case study of automated statistical inference on sociopsychological perceptions of female faces controlled for race, attractiveness, age and nationality. Like in [24], our empirical evidences point to the possibility of teaching computer vision and machine learning algorithms, using example face images, to predict personality traits and behavioral predisposition.

Captured tweets and retweets: 1

Learning representations through stochastic gradient descent in cross-validation error

Richard S. Sutton, Vivek Veeriah

Representations are fundamental to Artificial Intelligence. Typically, the performance of a learning system depends on its data representation. These data representations are usually hand-engineered based on some prior domain knowledge regarding the task. More recently, the trend is to learn these representations through deep neural networks as these can produce dramatical performance improvements over hand-engineered data representations. In this paper, we present a new incremental learning algorithm, called crossprop, for learning representations based on prior learning experiences. Unlike backpropagation, crossprop minimizes the cross-validation error. Specifically, our algorithm considers the influences of all the past weights on the current squared error, and uses this gradient for incrementally learning the weights in a neural network. This idea is similar to that of tuning the learning system through an offline cross-validation procedure. Crossprop is applicable to incremental learning tasks, where a sequence of examples are encountered by the learning system and they need to be processed one by one and then discarded. The learning system can use each example only once and can spend only a limited amount of computation for an example. From our preliminary experiments, we concluce that crossprop is a promising alternative for backprop for representation learning.

Captured tweets and retweets: 1

Quantum autoencoders for efficient compression of quantum data

Jonathan Romero, Jonathan Olson, Alan Aspuru-Guzik

Classical autoencoders are neural networks that can learn efficient codings of large datasets. The task of an autoencoder is, given an input $x$, to simply reproduce $x$ at the output with the smallest possible error. For one class of autoencoders, the structure of the underlying network forces the autoencoder to represent the data on a smaller number of bits than the input length, effectively compressing the input. Inspired by this idea, we introduce the model of a quantum autoencoder to perform similar tasks on quantum data. The quantum autoencoder is trained to compress a particular dataset, where an otherwise efficient compression algorithm cannot be employed. The training of the quantum autoencoder is done using classical optimization algorithms. We show that efficient quantum autoencoders can be obtained using simple circuit heuristics. We apply our model to compress molecular wavefunctions with applications in quantum simulation of chemistry.

Captured tweets and retweets: 1

1 2 3 4 5 6 7 8 9 24 25