
01 April, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 2)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

Part 1 here!

- Curse of dimensionality

Classical machine learning works on datasets of small or moderate number of dimensions. If you ever tried to do any function approximation or to work with optimization in general, you should have noticed that all these problems suffer from the "curse of dimensionality".

Say for example that we have a set of outdoor images of a fixed size: 256x128. And let's say that this set is labeled with a value that tells at what time of the day the image was taken. We can see this dataset as samples of an underlying function that takes as input 98304 (WxHxRGB) values and outputs one.
In theory, we could use standard function fitting to find an expression that in general can tell from any image the time it was taken, but in practice, this approach goes nowhere: it's very hard to find functional expressions in so many dimensions!

More dimensions, more directions, more saddles!
The classic machine learning approach to these problems is to do some manual feature selection. We can observe our data set and come up with some statistics that describe our images compactly, and that we know are useful for the problem at hand.
We could, for example, compute the average color, thinking that different times of day strongly change the overall tinting of the photos, or we could compute how much contrast we have, how many isolated bright pixels, all sorts of measures, then use our machine learning models on these.
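To make the idea concrete, here is a minimal sketch of such hand-picked features. The function name `image_features`, the specific statistics and the 0.9 brightness threshold are my own illustrative choices, and the image is random noise standing in for a real photo.

```python
import numpy as np

def image_features(img):
    """Summarize an (H, W, 3) float image in [0, 1] with a few statistics."""
    luma = img @ np.array([0.2126, 0.7152, 0.0722])  # Rec. 709 luma weights
    mean_rgb = img.reshape(-1, 3).mean(axis=0)       # average color (overall tint)
    contrast = luma.std()                            # a crude global contrast measure
    bright_fraction = (luma > 0.9).mean()            # share of very bright pixels
    return np.concatenate([mean_rgb, [contrast, bright_fraction]])

rng = np.random.default_rng(0)
img = rng.random((128, 256, 3))    # stand-in for one 256x128 RGB photo
features = image_features(img)
print(img.size, "->", features.size)
```

The 98304 raw values collapse to five numbers, and a classical model can then be fit on those instead of on the pixels.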

A scatterplot matrix can show which dimensions of a dataset have interesting correlations.
The idea here is that there is a feedback loop between the data, the practitioner, and the algorithms. We look at the data set, make hypotheses on what could help reduce dimensionality, project, learn; if it didn't work well we rinse and repeat. 
This process is extremely useful, and it can help us understand the data better, discover relationships, invariants.

Multidimensional scaling, reducing a 3d data set to 1D
The learning process can sometimes instruct the exploration as well: we can train on a given set of features we picked, then notice that the training didn't really use much some of them (e.g. they have very small weights in our functional expression), and in that case we know these features are not very useful for the task at hand.
Or we could use a variety of automated dimensionality reduction algorithms to create the features we want. Or we could use interactive data visualization to try to understand which "axes" have more interesting information...

- Deep learning

Deep learning goes a step further and tries to eliminate the manual feature engineering part of machine learning. In general, it refers to any machine learning technique that learns hierarchical features.

In the previous example we said it's very hard to learn directly from image pixels a given high-level feature, we have to engineer features first, do dimensionality reduction. 
Deep learning comes and says, ok, we can't easily go from 98304 dimensions to time-of-day, but could we automatically go from 98304 dimensions to maybe 10000, which represent the data set well? It's a form of compression, can we do a bit of dimensionality reduction, so that the reduced data still retains most of the information of the original?

Well, of course, sure we can! Indeed we know how to do compression, we know how to do some dimensionality reduction automatically, no problem. But if we can do a bit of that, can we then keep going? Go from 10000 to 1000, from 1000 to 100, and from 100 to 10? Always nudging each layer so it keeps in mind that we want features that are good for a specific final objective? In other words, can we learn good features, recursively, thus eliminating the laborious process of data exploration and manual projection?

Turns out that with some trickery, we can.

DNN for face recognition, each layer learns higher-order features
- Deep learning is hard

Why does learning huge models fail? Try to picture the process: you have a very large vector of parameters to optimize, a point in a space of thousands, tens of thousands of dimensions. At each optimization step you want to choose a direction in which to move, you want to explore this space towards a minimum of the error function.
There are just so many possible choices, if you were to explore randomly, just try different directions, it would take forever. If before doing a step we were to try any possible direction, we would need to evaluate the error function at least once per dimension, thousands of times.
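A small sketch of that cost, assuming a toy quadratic error function of my own invention: probing every axis with a finite difference needs one extra evaluation per dimension.

```python
def numerical_gradient(f, x, eps=1e-4):
    """Forward differences: one extra evaluation of f per dimension."""
    base = f(x)
    grad = []
    for i in range(len(x)):
        stepped = list(x)
        stepped[i] += eps
        grad.append((f(stepped) - base) / eps)
    return grad

stats = {"calls": 0}

def error(w):
    """Toy error function (hypothetical): squared distance from the point (3, 3, ...)."""
    stats["calls"] += 1
    return sum((wi - 3.0) ** 2 for wi in w)

w = [0.0] * 1000
g = numerical_gradient(error, w)
print(stats["calls"])  # 1001: one baseline plus one per dimension
```

And that is the price of a single step; an optimizer needs many of them.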

Your most reliable guide through these choices is the gradient of the error function, a big arrow telling in which direction the error is going to be smaller if a small step is taken. But computing the gradient of a big model is itself really hard: numerical errors can lead to the so-called gradient diffusion (also known as the vanishing gradient problem).

Think of a neural network with many, many layers. A choice (change, gradient) of a weight in the first layer will change the output in a very indirect way, it will alter a bit the output of the layer but that output will be fed into many others before reaching the destination, the final value of the neural network.
The relationship between the layers near the output and the output itself is more clear, we can observe the layer inputs and we know what the weights will do, but layers very far from the output contribute in such an indirect way!
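Here is a deliberately tiny sketch of how indirect that contribution gets. It assumes a chain of one-neuron sigmoid layers with all weights set to 1 (my simplification, not a realistic network) and multiplies the per-layer derivatives as the chain rule would.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, w=1.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` one-neuron sigmoid layers."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)   # chain rule: sigmoid'(z) = a * (1 - a) <= 0.25
    return grad

for depth in (1, 5, 20):
    print(depth, input_gradient(depth))
```

Each layer multiplies the signal by at most 0.25 here, so after twenty layers almost nothing of the gradient is left for the early weights.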

Imagine wanting to change the output of a pachinko machine.
Altering the pegs at the bottom has a direct result on how the balls will fall
into the baskets, but changing the top peg will have less predictable results.
Another problem is that the more complex the model is, the more data we need to train it. If we have a model with tens of thousands of parameters, but we have only a few data points, we can easily "overfit": learn an expression that perfectly approximates the data points we have, but that is not related to the underlying process that generated these data points, the function that we really wanted to learn and that we know only by examples!

Overfitting example.
In general, the more powerful we want our model to be, the more data we need to have. But obtaining the data, especially if we need labeled data, is often non-trivial!
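A quick sketch of overfitting, under assumptions of my own: the hidden process is a plain line, we see eight noisy samples of it, and we compare a two-parameter fit with one that has as many parameters as data points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return 2.0 * x + 1.0                      # the hidden process (assumed linear here)

x_train = np.linspace(0.0, 1.0, 8)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(0.05, 0.95, 50)          # points never seen while fitting
y_test = true_fn(x_test)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

line = np.polyfit(x_train, y_train, 1)        # 2 parameters
wiggly = np.polyfit(x_train, y_train, 7)      # 8 parameters, one per data point

print("train:", mse(line, x_train, y_train), mse(wiggly, x_train, y_train))
print("test: ", mse(line, x_test, y_test), mse(wiggly, x_test, y_test))
```

The degree-7 polynomial passes through every training point, so its training error is essentially zero; the held-out error is the number that tells whether it learned the process or just the noise.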

- The deep autoencoder

One idea that worked in the context of neural networks is to try to learn stuff layer-by-layer, instead of trying to train the whole network at once: enter the autoencoder.

An autoencoder is a neural network that, instead of approximating a function that connects given inputs to some outputs, connects inputs to themselves. In other words, the output of the neural network has to be the same as the inputs. We can't just use an identity neural network though: the kink in this is that the autoencoder has to have a hidden layer with fewer neurons than the number of inputs! (Alternatively, instead of a narrow hidden layer, we can use a sparsity constraint on the weights.)

Stacked autoencoder.
In other words, we are training the NN to do dimensionality reduction internally, and then expand the reduced representation back to the original number of dimensions, an autoencoder trains a coding layer and a decoding layer, all at once. 
This is perhaps surprisingly not hard, if the hidden layer bottleneck is not too small, we can just use backpropagation, follow the gradient, find all the weights we need.
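As a rough sketch of a single autoencoder layer, here is a linear one (no activation function, which is my simplification) trained by plain gradient descent on toy data. The data, layer sizes, learning rate and step count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 4-D points that really live on a 2-D plane.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, 1.0]])
X = latent @ mixing

n_in, n_hidden = 4, 2                               # bottleneck: 2 < 4
W_enc = 0.1 * rng.normal(size=(n_in, n_hidden))     # coding layer
W_dec = 0.1 * rng.normal(size=(n_hidden, n_in))     # decoding layer

lr, losses = 0.05, []
for step in range(1000):
    H = X @ W_enc                  # reduced representation
    R = H @ W_dec - X              # reconstruction error: output minus input
    losses.append(float(np.sum(R ** 2) / len(X)))
    # Backpropagation for this two-layer linear network.
    grad_dec = (2.0 / len(X)) * (H.T @ R)
    grad_enc = (2.0 / len(X)) * (X.T @ (R @ W_dec.T))
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(losses[0], losses[-1])
```

After training, `X @ W_enc` is the two-number code for each four-number input; stacking would mean training a second, smaller autoencoder on those codes.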

The idea of the stacked autoencoder is to not stop at just one layer: once we train a first dimensionality-reduction NN we keep going by stripping the decoding layer (output) and connecting a second autoencoder to the hidden layer of the first one. And we can keep going from there, by the end, we'll have a deep network with many layers, each smaller than the preceding, each that has learned a bit of compression. Features!

The stacked autoencoder is unsupervised learning, we trained it without ever looking at our labels, but once trained nothing prevents us from doing a final training pass in a supervised way. We might have gone from our thousands of dimensions to just a few, and there at the end, we can attach a regular neural network and train the entire thing in a supervised way.

As the "difficult" layers, the ones very far from the outputs, have already learned some good features, the training will be much easier: the weights of these layers can still be affected by the optimization process, specializing to the dataset we have, but we know they already start in a good position, they have learned already features that in general are very expressive.

- Contemporary deep learning

More commonly today deep learning is not done via an unsupervised pre-training of the network, instead, we often are able to directly optimize bigger models. This has been possible via a better understanding of several components:

- The role of initialization: how to set the initial, randomized, set of weights.
- The role of the activation function shapes.
- Learning algorithms.
- Regularization.

We still use gradient descent type of algorithms, first-order local optimizers that use derivatives, but typically the optimizer uses approximate derivatives (stochastic gradient descent: the error is computed only with random subsets of the data) and tries to be smart (adagrad, adam, adadelta, rmsprop...) about how fast it should descend (the step size, also called learning rate).
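A bare-bones sketch of stochastic gradient descent, fitting a line to noiseless toy data. The fixed learning rate and the batch size are arbitrary choices of mine; the "smart" optimizers listed above would adapt the step size instead.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 1000)
y = 3.0 * x - 1.0                        # data from a known line, no noise

a, b = 0.0, 0.0                          # parameters to learn
learning_rate, batch_size = 0.1, 16
for step in range(500):
    idx = rng.integers(0, len(x), batch_size)    # random subset of the data
    err = a * x[idx] + b - y[idx]
    # Gradient of the mean squared error, estimated on the mini-batch only.
    grad_a = 2.0 * np.mean(err * x[idx])
    grad_b = 2.0 * np.mean(err)
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(a, b)   # should end up close to 3 and -1
```

Each step looks at 16 samples out of 1000, so the gradient is only an estimate, but it is cheap and on average points the right way.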

In DNNs we don't really care about reaching a global optimum, it's expected for the model to have myriads of local minima (because of symmetries, of how weights can be permuted in ways that yield the same final solution), but reaching any of them can be hard. Saddle points are more common than local minima, and ill conditioning can make gradient descent not converge.

Regularization: how to reduce the generalization error.

By far the most important principle though is regularization. The idea is to try to learn general models, that do not just perform well on the data we have, but that truly embody the underlying hidden function that generated it. 

A basic regularization method that is almost always used with NNs is early stopping. We split the data into a training set (used to compute the error and the gradients) and a validation set (used to check that the solution is able to generalize). After some training iterations we might notice that the error on the training set keeps going down, but the one on the validation set starts rising: that's when overfitting is beginning to take place (and we should stop the training "early").
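The stopping rule itself can be sketched in a few lines. The `patience` parameter (how many non-improving epochs we tolerate) is a common convention that I'm assuming here, and the validation errors are made-up numbers.

```python
def early_stopping(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation errors."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch       # no improvement for `patience` epochs
    return best_epoch, len(validation_errors) - 1

val_errors = [1.0, 0.8, 0.6, 0.55, 0.57, 0.6, 0.7]
print(early_stopping(val_errors))  # (3, 5)
```

In practice we would keep a copy of the weights from the best epoch and go back to it once training stops.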

In general, regularization can be done by imposing constraints (which can often be added to the error function as extra penalties) and biasing towards simpler models (think Occam's razor) that explain the data; we accept performing a bit worse in terms of error on the training data set if in exchange we get a simpler, more general solution.
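One classic instance of "a penalty added to the error function" is a squared-weight (L2) penalty on a linear model, which even has a closed-form solution. This is a sketch on toy data; the penalty strength of 5 is arbitrary.

```python
import numpy as np

def fit_penalized(X, y, penalty):
    """Minimize ||X w - y||^2 + penalty * ||w||^2 (closed form)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + 0.1 * rng.normal(size=20)    # only the first feature really matters

w_plain = fit_penalized(X, y, 0.0)
w_simple = fit_penalized(X, y, 5.0)

def train_error(w):
    return float(np.sum((X @ w - y) ** 2))

print(np.linalg.norm(w_plain), train_error(w_plain))
print(np.linalg.norm(w_simple), train_error(w_simple))
```

The penalized fit has smaller weights and a slightly worse training error, which is exactly the trade described above.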

This is really the key to deep neural networks: we don't use small networks, we construct huge ones with a large number of parameters because we know that the underlying problem is complex, it's probably exceeding what a computer can solve exactly.
But at the same time, we steer our training so that our parameters try to be as sparse as possible; it has to have a cost to use a weight, to activate a circuit, this, in turn, ensures that when the network learns something, it's something important. 
We know that we don't have lots of data compared to how big the problem is, we have only a few examples of a very general domain.

Other ways to make sure that the weights are "robust" is to inject noise, or to not always train with all of them but try cutting different parts of the network out as we train (dropout).
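Dropout in its simplest form is just a random mask over a layer's outputs. The sketch below uses the "inverted" variant, which rescales the surviving units so the expected activation stays the same; that rescaling is a common convention I'm assuming, not something the text above prescribes.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Randomly zero units during training, rescaling the survivors."""
    keep = rng.random(activations.shape) >= drop_prob
    return activations * keep / (1.0 - drop_prob)

rng = np.random.default_rng(0)
layer_output = np.ones(10000)
dropped = dropout(layer_output, 0.5, rng)
print(dropped[:8], dropped.mean())
```

At inference time the mask is simply not applied, and all the weights are used.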

Google's famous "cat" DNN
Think for example of a deep neural network that has to learn how to recognize cats. We might have thousands of photos of cats, but still, there is really an infinite variety, we can't even enumerate all the possible cats in all possible poses, environments and so on, these are all infinite. And we know that in general we need a large model to learn about cats, recognizing complex shapes is something that requires quite some brain-power.
What we want though is to avoid that our model just learns to recognize exactly the handful of cats we showed it, we want it to extract some higher-level knowledge of what cats look like.

Lastly, data augmentation can be used as well: we can always try to generate more data from a smaller set of examples. We can add noise and other "distractions" to make sure that we're not learning too much the specific examples provided. 
Maybe we know that certain transforms still yield valid examples of the same data: for example, a rotated cat is still a cat. A cat behind a pillar, or on a different background, is still a cat. Or maybe we can't generate more data of a given kind, but we can generate "adversarial" data: examples of things that are not what we are looking for.
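A minimal augmentation sketch, with transforms and noise level chosen arbitrarily by me: mirror, rotate in 90 degree steps, and add a bit of noise. The input is a random square array standing in for a real photo.

```python
import numpy as np

def augment(img, rng, noise_std=0.02):
    """Return a randomly perturbed copy of a square (N, N, 3) float image in [0, 1]."""
    out = img[:, ::-1] if rng.random() < 0.5 else img    # mirror left-right
    out = np.rot90(out, k=int(rng.integers(0, 4)))       # rotate by 0/90/180/270 degrees
    out = out + rng.normal(0.0, noise_std, out.shape)    # mild noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
cat = rng.random((32, 32, 3))                 # stand-in for a labeled photo
batch = [augment(cat, rng) for _ in range(8)] # eight "new" examples, same label
print(len(batch), batch[0].shape)
```

Every element of `batch` keeps the label of the original, so one labeled photo becomes several training examples.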

- Do you need all this, in your daily job?

Today there is a ton of hype around deep learning and deep neural networks, with huge investments on all fronts. Deep learning is even guiding the design of new GPUs! But, for all the fantastic accomplishments and great results we see, I'd say that most of the times we don't need it.

One key compromise we have with deep learning is that it replaces feature engineering with architecture engineering. True, we don't need to hand-craft features anymore, but this doesn't mean that things just work! Finding the right architecture for a deep learning problem is hard, and it's still mostly an art done with experience and trial and error. Not very different from feature engineering!

This might very well be a bad tradeoff. When we explore data and try to find good features we effectively learn (ourselves) some properties of the data. We make hypotheses about what might be significant and test them. 
In contrast, deep neural networks are much more opaque and impenetrable (even if some progress has been made). And this is important, because it turns out that DNN can be easily fooled (even if this is being "solved" via adversarial learning).
Architecture engineering also has slower iteration times, we have each time to train our architecture to see how it works, we need to tune the training algorithms themselves...

Decision trees, forests, gradient boosting are much
more explainable classifiers than DNNs
Deep models are in general much more complex and expensive than hand-crafted feature-based ones when it's possible to find good features for a given problem. In fact nowadays, together with solid and groundbreaking new research, we also see lots of publications of little value, that simply take a problem and apply a questionable DNN spin to it, with results that are not really better than the state of the art solution made with traditional handcrafted algorithms...

And lastly, deep models will always require more data to train. The reason is simple: when we create statistical features ourselves, we're effectively giving the machine learning process some a-priori model. 
We know that some things make sense, correlate with our problem and that some others do not. This a-priori information acts like a constraint: we restrict what we are looking for, but in exchange, we get fewer degrees of freedom and thus fewer data points are needed to fit our model.
In deep learning, we want to discover features by using very powerful models. We want to extract very general knowledge from the data, and in order to do so, we need to show the machine learning algorithm lots of examples...

In the end, the real crux of the issue is that most likely, especially if you're not already working with data and machine learning on a problem, you don't need to use the most complex, state of the art weapon in the machine learning arsenal in order to have great results!

- Conclusions

The way I see all this is that we have a continuum of choices. On one end we have expert-driven solutions: we have a problem, we apply our intellect and we come with a formula or an algorithm that solves it. This is the conventional approach to (computer) science and when it works, it works great.
Sometimes the problems we need to solve are intractable: we might not have enough information, we might not have an underlying theoretical framework to work with, or we might simply not have in practice enough computational resources.

In these cases, we can find approximations: we make assumptions, effectively chipping away at the problem, constraining it into a simpler one that we know how to solve directly. Often it's easy to make somewhat reasonable assumptions leading to good solutions. It's very hard though to know that the assumptions we made are the best possible.

On the other end, we have "ignorant", black-box solutions: we use computers to automatically discover, learn, how to deal with a problem, and our intelligence is applied catering to the learning process, not the underlying problem we're trying to solve. 
If there are assumptions to be made, we hope that the black-box learning process will discover them automatically from the data, we don't provide any of our own reasoning.

This methodology can be very powerful and yield interesting results, as we didn't pose any limits, it might discover solutions we could have never imagined. On the other hand, it's also a huge waste: we are smart! Using our brain to chip away at a problem can definitely be better than hoping that an algorithm somehow will do something reasonable!

In between, we have an ocean of shades of data-driven solutions... It's like the old adage that one should not try to optimize a program without profiling, in general, we should say that we should not try to solve a problem without having observed it a lot, without having generated data and explored the data.

We can avoid making early assumptions: we just observe the data and try to generate as much data as possible, capturing everything we can. Then, from the data, we can find solutions. Maybe sometimes it will be obvious that a given conventional algorithm will work, we can discover through the data new facts about our problem, build a theoretical framework. Or maybe, other times, we will just be able to observe that for no clear reason our data has a given shape, and we can just approximate that and get a solution, even if we don't know exactly why...

Deep learning is truly great, but I think an even bigger benefit of the current DNN "hype", other than the novel solutions it's bringing, is that more generalist programmers are exposed to the idea of not making assumptions and writing algorithms out of thin air, but instead trying to generate data-sets and observe them.
That, to me, is the real lesson for everybody: we have to look at data more, now that it's increasingly easy to generate and explore it.

Deep learning then is just one tool, that sometimes is exactly what we need but most of the times is not. Most of the times we do know where to look. We do not have huge problems in many dimensions. And in these cases very simple techniques can work wonders!  Chances are that we don't even need neural networks, they are not in general that special.

Maybe we needed just a couple of parameters and a linear regressor. Maybe a polynomial will work, or a piecewise curve, or a small set of Gaussians. Maybe we need k-means or PCA.

Or we can use data just to prove certain relationships exist to then exploit them with entirely hand-crafted algorithms, using machine learning just to validate an assumption that a given problem is solvable from a small number of inputs... Who knows! Explore!

Links for further reading.

- This is a good DNN tutorial for beginners.
- NN playground is fun.
- Keras is a great python DNN framework.
- Alan Wolfe at Blizzard made some very nice blog posts about NNs.
- You should in general know about data visualization, dimensionality reduction, machine learning, optimization & fitting, symbolic regression...
- A good tutorial on deep reinforcement learning, and one on generative adversarial networks.
- History of DL
- DL reading list. Most cited DNN papers. Another good one.
- Differentiable programming, applies gradient descent to general programming.

Thanks to Dave Neubelt, Fabio Zinno and Bart Wronski 
for providing early feedback on this article series!

26 March, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 1)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

This won't be short...

- Machine Learning

Machine learning is a huge field nowadays, with lots of techniques and sub-disciplines. It would be very hard for me to provide an overview in a single article, and I certainly don't claim to know all about it.
The goal of this article is to introduce you to the basic concepts, just enough so we can orient ourselves and understand what we might need in our daily job as programmers.

I'll try to do so using terminology that is as close as possible to what a programmer might expect, instead of the grammar of machine learning, which annoyingly often likes to call the same things in different ways based on the specific subdomain.
This is particularly a shame because as we'll soon see, lots of different fields, even disciplines that are not even usually considered to be "machine learning", are really intertwined and closely related.

- Supervised and unsupervised learning

The first thing we have to know is that there are two main kinds of machine learning: supervised and unsupervised learning. 
Both deal with data, or if you wish, functions that we don't have direct access to but that we know through a number of samples of their outputs.

In the case of supervised learning, our data comes in the form of input->output pairs; each point is a vector of the unknown function inputs and it's labeled with the return value.
Our job is to learn a functional form that approximates the data; in other words, through data, we are learning a function that approximates a second unknown one. 
Clearly supervised learning is closely related to function approximation.

Another name for this is regression analysis or function fitting: we want to estimate the relationship between the input and output variables. 
Also related is (scattered) data interpolation and Kriging: in all cases we have some data points and we want to find a general function that underlies them.

Most of the times the actual methods that we use to fit functions to data come from numerical optimization: our model functions have a given number of degrees of freedom, flexibility to take different shapes, optimization is used to find the parameters that make the model as close as possible (minimize the error) to the data.

Function fitting: 1D->1D
If the function's outputs are from a discrete set instead of being real numbers, supervised learning is also called classification: our function takes an input and emits a class label (1, 2, 3,... or cats, dogs, squirrels,...). Our job, having seen some examples of this classification at work, is to learn a way to do the same job on inputs that are outside the data set provided.

Binary classifier: 2D->label
For unsupervised learning, on the other hand, the data is just made of points in space, we have no labels, no outputs, just a distribution of samples. 
As we don't have outputs, fitting a function sounds harder, functions are relations of inputs to their outputs. What we could do though is to organize these points to discover relationships among themselves: maybe they form clusters, or maybe they span a given surface (manifold) in their n-dimensional space.

We can see clustering as a way of classifying data without knowing what the classes are, a-priori. We just notice that certain inputs are similar to each other, and we group these in a cluster. 
Maybe later we can observe the points in the cluster and decide that it's made of cats, assign a label a-posteriori.
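A sketch of the simplest clustering method I know, k-means (Lloyd's algorithm), on two made-up blobs of 2D points. Seeding the centers with one point from each end of the array is a shortcut I take for this toy data; real code would want a more careful initialization.

```python
import numpy as np

def kmeans(points, initial_centers, iterations=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centers = np.array(initial_centers, dtype=float)     # work on a copy
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                    # nearest center per point
        for j in range(len(centers)):
            if np.any(labels == j):                      # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, size=(50, 2))
blob_b = rng.normal(10.0, 0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
labels, centers = kmeans(points, initial_centers=points[[0, -1]])
print(centers)
```

The algorithm never sees which blob a point came from; the labels it outputs are just "group 0" and "group 1", and giving them a meaning is up to us.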

2D Clustering
Closely related to clustering is dimensionality reduction (and dictionary learning/compressed sensing): if we have points in an n-dimensional space, and we can cluster them in k groups, where k is less than n, then probably we can express each point by saying how close to each group it is (projection), thus using k dimensions instead of n.

2D->1D Projection
Dimensionality reduction is, in turn, closely related to finding manifolds: let's imagine that our data are points in three dimensions, but we observe that they all lie always on the unit sphere.
Without losing any information, we can express them as coordinates on the sphere surface (longitude and latitude), thus having saved one dimension by having noticed that our data lay on a parametric surface.
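The sphere example is easy to write down. In this sketch the function names and the choice of z as the "up" axis are my own conventions.

```python
import math

def to_lat_lon(x, y, z):
    """Map a point on the unit sphere to (latitude, longitude) in radians."""
    return math.asin(max(-1.0, min(1.0, z))), math.atan2(y, x)

def from_lat_lon(lat, lon):
    """Inverse map: two coordinates back to a 3D point on the unit sphere."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

point = (0.48, 0.64, 0.6)              # lies on the unit sphere
lat, lon = to_lat_lon(*point)
print(from_lat_lon(lat, lon))          # the same point again, up to rounding
```

Three numbers go in, two come out, and the round trip loses nothing as long as the input really is on the sphere.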

And (loosely speaking) all the times we can project points to a lower dimension we have in turn found a surface: if we take all the possible coordinates in the lower-dimensionality space they will map to some points of the higher-dimensionality one, generating a manifold. 

Interestingly though unsupervised learning is also related to supervised learning in a way: if we think of our hidden, unknown function as a probability density one, and our data points as samples extracted according to said probability, then unsupervised learning really just wants to find an expression of that generating function. This is also the very definition of density estimation!

Finally, we could say that the two are also related to each other through the lens of dimensionality reduction, which can be seen as nothing else than a way to learn an identity function (inputs map to outputs) where we have the constraint that the function, internally, has to lose some information, has to have a bottleneck that ensures the input data is mapped to a small number of parameters.

- Function fitting

Confused yet? Head spinning? Don't worry. Now that we have seen that most of these fields are somewhat related, we can choose just one and look at some examples. 

The idea that most programmers will be most familiar with is function fitting. We have some data, inputs and outputs, and we want to fit a function to it so that for any given input our function has the smallest possible error when compared with the outputs given.

This is commonly the realm of numerical optimization. 

Let's say we suppose our data can be modeled as a line. A line has only two parameters: y=a*x+b, we want to find the values of a and b so that for each data point (x1,y1),(x2,y2)...(xN,yN), our error is minimized, for example, the L2 distance.
This is a very well studied problem, it's called linear regression, and in the way it's posed it's solvable using linear least squares.
Note: if instead of wanting to minimize the distance between the data output and the function output, we want to minimize the distance between the data points and the line itself, we end up with principal component analysis/singular value decomposition, a very important method for dimensionality reduction - again, all these fields are intertwined!
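For reference, the line fit is a couple of lines with NumPy's least-squares solver. The five data points are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

A = np.column_stack([x, np.ones_like(x)])        # columns: x and the constant 1
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)   # minimizes sum((a*x + b - y)^2)
print(a, b)  # approximately 1.96 and 1.1
```

No iteration and no learning rate: for this model and this error measure the optimum can be solved for directly.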

Now, you can imagine that if our data is very complicated, approximating it with a line won't really do much, we need more powerful models. Roughly speaking we can construct more powerful models in two ways: we either use more pieces of something simple, or we start using more complicated pieces.

So, on one extreme we can think of just using linear segments, but using many of them (fitting a piecewise linear curve), on the other hand, we can think instead of fitting higher-order polynomials, or rational functions, or even of finding an arbitrary function made of any combination of any number of operators (symbolic regression, often done via genetic programming).

Polynomial versus piecewise linear.
The rule of thumb is that simpler models usually have easier ways to fit (train), but might be wasteful and grow rather large (in terms of the number of parameters). More powerful models might be much harder to fit (global nonlinear optimization), but be more succinct.
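A small sketch of the two routes on a toy target (a half period of a sine, my choice): nine knots joined by segments versus a single degree-5 polynomial. On a target this smooth I'd expect the polynomial to do better with fewer parameters, but that is a property of this example, not a general rule.

```python
import numpy as np

xs = np.linspace(0.0, np.pi, 1000)          # dense grid to measure the error on
target = np.sin(xs)                         # stand-in for "complicated" data

# Many simple pieces: a piecewise linear curve through 9 knots (9 parameters).
knots_x = np.linspace(0.0, np.pi, 9)
knots_y = np.sin(knots_x)
piecewise = np.interp(xs, knots_x, knots_y)

# One more complicated piece: a degree-5 polynomial (6 parameters).
coeffs = np.polyfit(xs, target, 5)
polynomial = np.polyval(coeffs, xs)

piecewise_error = float(np.max(np.abs(piecewise - target)))
polynomial_error = float(np.max(np.abs(polynomial - target)))
print(piecewise_error, polynomial_error)
```

Note how "fitting" the piecewise curve was trivial here (just sample the knots), while the polynomial needed a least-squares solve.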

- Neural Networks

For all the mystique there is around Neural Networks and their biological inspiration, the crux of the matter is that they are nothing more than a way to approximate functions, rather like many others, but made from a specific building block: the artificial neuron.

This neuron is conceptually very simple. At heart is a linear function: it takes a number of inputs, it multiplies them with a weight vector, it adds them together into a single number (a dot product!) and then it adds a bias value (optionally).
The only "twist" there is that after the linear part is done, a non-linear function (the activation function) is applied to the results.

If the activation function is a step (outputting one if the result was positive, zero otherwise), we have the simplest kind of neuron and the simplest neural classifier (a binary one, only two classes): the perceptron.
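In code, a neuron really is just this. The weights below are hand-picked (not learned) so that the perceptron behaves like a logical AND, purely as an illustration.

```python
def neuron(inputs, weights, bias, activation):
    """Dot product of inputs and weights, plus a bias, through an activation."""
    return activation(sum(x * w for x, w in zip(inputs, weights)) + bias)

def step(z):
    return 1 if z > 0 else 0

# A perceptron with hand-picked weights that acts as a logical AND.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, neuron([a, b], [1.0, 1.0], -1.5, step))
```

The line x + y = 1.5 is the hyperplane; the step activation only reports on which side of it the input falls.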

In general, we can use many nonlinear functions as activations, depending on the task at hand.
Regardless of this choice though it should be clear that with a single neuron we can't do much, in fact, all we can ever do is express a distance from a hyperplane (again, we're doing a dot product), somewhat modified by the activation. The real power in neural networks comes from the "network" part.

The idea is again simple: if we have N inputs, we can connect to them M neurons. These neurons will each give one output, so we end up with M outputs, and we can call this structure a neural "layer".
We can then rinse and repeat, the M outputs can be considered as inputs of a second layer of neurons and so on, till we decide enough is enough and at the final layer we use a number of outputs equal to the ones of the function we are seeking to approximate (often just one, but nothing prevents us from learning vector-valued functions).

The first layer, connected to our input data, is unimaginatively called the input layer, the last one is called the output layer, and any layer in between is considered a "hidden" layer. Non-deep neural networks often employ a single hidden layer.

We could write down the entire neural network as a single formula, it would end up nothing more than a nested sequence of matrix multiplies and function applications. In this formula we'll have lots of unknowns, the weights we use in the matrix multiplies. The learning process is nothing else than optimization, we find the best weights that minimize the error of our neural network to the data given.
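Here is that "single formula" as a sketch: a loop of matrix multiplies and activations. The layer sizes, the tanh activation and the random weights are arbitrary assumptions, and for brevity the activation is applied to the output layer too.

```python
import numpy as np

def forward(x, layers, activation=np.tanh):
    """Evaluate act(W_n ... act(W_1 x + b_1) ... + b_n)."""
    for W, b in layers:
        x = activation(W @ x + b)
    return x

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 1]                     # 3 inputs -> two hidden layers -> 1 output
layers = [(rng.normal(size=(m, n)), np.zeros(m)) for n, m in zip(sizes, sizes[1:])]

y = forward(np.array([0.2, -0.1, 0.7]), layers)
n_unknowns = sum(W.size + b.size for W, b in layers)
print(y, n_unknowns)
```

Even this toy network has 49 unknowns; training means finding values for all of them.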

Because we typically have lots of weights, this is a rather large optimization problem, so typically fast, local, gradient-descent based optimizers are used. The idea is to start with an arbitrary set of weights and then update them by following the function partial derivatives towards a local minimum of the error.

Source. See also this.
We need the partial derivatives for this process to work. It's impractical to compute them symbolically, so automatic differentiation is used, typically via a process called "backpropagation", but other methods could be used as well, or we can even have a mix of methods, using hand-written symbolic derivatives for certain parts where we know how to compute them, and automatic differentiation for others.
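To show what backpropagation boils down to, here is a hand-written sketch for a tiny network (one tanh hidden layer, one linear output, squared error), checked against a finite difference. The network shape and all the numbers are illustrative assumptions.

```python
import numpy as np

def loss(W1, w2, x, target):
    hidden = np.tanh(W1 @ x)
    return 0.5 * (w2 @ hidden - target) ** 2

def backprop(W1, w2, x, target):
    """Gradients of `loss` with respect to W1 and w2 via the chain rule."""
    hidden = np.tanh(W1 @ x)                       # forward pass, keeping intermediates
    err = w2 @ hidden - target                     # d(loss)/d(output)
    grad_w2 = err * hidden
    grad_hidden = err * w2                         # propagate the error backwards
    grad_pre = grad_hidden * (1.0 - hidden ** 2)   # tanh'(z) = 1 - tanh(z)^2
    grad_W1 = np.outer(grad_pre, x)
    return grad_W1, grad_w2

rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)
x, target = np.array([0.5, -1.0, 2.0]), 1.0
grad_W1, grad_w2 = backprop(W1, w2, x, target)

# Cross-check one entry against a central finite difference.
eps = 1e-6
up, down = W1.copy(), W1.copy()
up[0, 0] += eps
down[0, 0] -= eps
numeric = (loss(up, w2, x, target) - loss(down, w2, x, target)) / (2 * eps)
print(grad_W1[0, 0], numeric)
```

One backward pass yields all 16 derivatives at roughly the cost of one forward pass, whereas the finite-difference check needs two evaluations for every single weight.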

Under certain assumptions, it can be shown that a neural network with a single hidden layer is a universal approximator: it could (we might not be able to train it well, though...), with a finite (but potentially large) number of neurons, approximate any continuous function on compact subsets of n-dimensional real spaces.

Part 2...

23 February, 2017

Tonemapping on HDR displays. ACES to rule ‘em all?

HDR displays are upon us, and I’m sure rendering engineers worldwide are trying to figure out how to best use them. What to do with post-effects? How to do antialiasing? How to deal with particles and UI? What framebuffer formats to use, and so forth.

Well, it appears that in this ocean of new research, some standards are emerging, and one solution that seems to be popular is to use the ACES tone-mapping curve (RRT: Reference Rendering Transform) with an appropriate HDR display curve (ODT: Output Display Transform).

To my dismay though I have to say I’m a bit baffled, and perhaps someone will persuade me otherwise in the future, but I don’t see why ACES would be a solid choice. 

First of all, let’s all be persuaded we indeed need to tone-map our HDR data. Why can’t we just apply exposure and send linear HDR to a TV?
At first, it could seem that this should be a reasonable choice: the PQ encoding curve we use to send the signal to TVs peaks at 10,000 nits, which is not too bad. It could allow us to encode a scene-referred signal and let the TV do the rest (tone-map according to its characteristics).
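For reference, here is a small sketch of the PQ curve mentioned above, using the constants published in SMPTE ST 2084. The function names are my own, and the input is assumed to be absolute luminance in nits.

```python
# SMPTE ST 2084 (PQ) constants.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32
PQ_PEAK_NITS = 10000.0

def pq_encode(nits):
    """Luminance in nits -> non-linear PQ signal in [0, 1]."""
    y = min(max(nits / PQ_PEAK_NITS, 0.0), 1.0)
    p = y ** M1
    return ((C1 + C2 * p) / (1.0 + C3 * p)) ** M2

def pq_decode(signal):
    """Non-linear PQ signal in [0, 1] -> luminance in nits."""
    p = signal ** (1.0 / M2)
    y = (max(p - C1, 0.0) / (C2 - C3 * p)) ** (1.0 / M1)
    return y * PQ_PEAK_NITS
```

The full signal range is only reached at 10,000 nits, so a typical SDR white level of around 100 nits already sits near the middle of the encoded range. That is why the curve leaves plenty of room for a scene-referred signal, at least in principle.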

This is not what TVs do, though. Leaving the transform from scene values to the display would allow for lots of flexibility, but would also give the display too much responsibility over the final look of the image.
So, the way it works instead is that TVs do have some tone-mapping functionality, but they are quite linear till they reach their peak intensity, where they seem to just have a sharp shoulder.

How sharp that shoulder is can depend, as content can also send along meta-data telling what’s the maximum nits it was authored at: for content that matches the TV, in theory no rolloff is needed at all, as the TV will know the signal will never exceed its abilities (in practice though, said abilities change based on lots of factors due to energy limits).

Some TVs will also expose silly controls, like gamma in HDR: it seems that these alter their response curve in the “SDR” range of their output. For now, let's ignore all that.
Regardless of these specifics, it's clear that you’re supposed to bring your values from scene-referred to display-referred, and to decide where you want your mid-gray to be, and how to roll highlights from there. You need tone mapping in HDR.

Ok, so let’s backtrack a second. What’s the goal of a tone-mapping curve? I think it depends, but you might have one or more of the following goals:
  1. To compress dynamic range in order to best use the available bits. A form of mu-law encoding.
  2. To provide a baseline for authoring. Arguably that should be a “naturalistic”, perceptual rendition of an HDR scene, but it might even be something closer to the final image.
  3. To achieve a given final look, on a given display.
HDR screens add a fourth possible objective: creating a curve that makes it possible for artists on SDR monitors to easily validate HDR values. 
I'd argue that this is a fool's errand though, so we won't investigate it. A better and simpler way to author HDR values on SDR monitors is to show out-of-range warnings and make it easy to see the scene at various exposures, to check that shadows/ambient-diffuse-highlight-emissive are all in the ranges they should be.
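As an illustration of the first goal in the list above (compressing dynamic range to make good use of the available bits), here is a minimal sketch of the classic mu-law curve. The input is assumed to be already normalized to [0, 1], and the default `mu` of 255 is simply the value commonly used in audio, not a recommendation for images.

```python
import math

def mu_law_encode(x, mu=255.0):
    """Compress x in [0, 1], giving more of the output range to small values."""
    return math.log1p(mu * x) / math.log1p(mu)

def mu_law_decode(y, mu=255.0):
    """Exact inverse of mu_law_encode."""
    return math.expm1(y * math.log1p(mu)) / mu
```

A curve like this is trivially invertible, which is what you want when the only goal is to squeeze values into fewer bits and recover them later.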

How does ACES fit in all this? 

It surely was not designed with compression in mind (unlike, for example, the PQ curve), although it might somewhat work. The RRT is meant to be quite “wide” (both in dynamic range and gamut), because it’s supposed to then be further compressed by the ODT. 
Compression really depends on what you care about and how many bits you have, so a one-size-fits-all curve is probably not going to cut it. 
Moreover, the RRT is not meant to be easily invertible. Much simpler compression curves can be applied if the goal is to save bits (e.g. to squish the scene into a range that can then be manipulated with the usual color-grading 3d LUTs we are accustomed to).

It wasn’t designed to be particularly perceptually linear either, nor to preserve colors or brightness: the RRT is modelled after film stock.

So we’re left with the third option, a curve that we can use for final output on a display. Well, that’s arguably one of the most important goals, so if ACES does well there, it would be plenty.

At a glance, that should really be its strength, thanks to the fact that it couples a first compression curve with a second one, specific to a given display (or rather, a display standard and its EOTF: electro-optical transfer function). But here’s the problem. Is it reasonable to tone-map to given output levels, in this situation?

With the old SDR standards (rec.709 / BT.1886) one could think that there was a standard display and viewing ambient we targeted, and that users and TVs would compensate for specific environments. It would have been a lie, but one could have hand-waved things like that (in practice, I think we never really considered the ambient seriously).

In HDR though this is definitely not true: we know different displays will have different peak nits, and we know that the actual amount of dynamic range will vary from something not too different from SDR, in the worst ambients, to something wide enough that it could even cause discomfort if overly bright areas are displayed for too long. 
ACES itself has different ODTs based on the intended peak nits of the display (and this also couples with the metadata you can specify together with your content).

All this might work. In practice, today we don’t have displays that exceed 1000 nits, so we could use ACES, do an ODT to 1000 nits and, if we can, even send the appropriate meta-data, leaving any further adjustments to the TV and its user-facing settings. Should we though?
If we know that the dynamic range varies so much, why would we constrain ourselves to a somewhat complex system that was never made with our specific needs in mind? To me it seems quite a cop-out.

Note: for ACES, targeting a fixed range makes a lot of sense, because once a film is mastered (e.g. onto a Blu-ray) the TM can't change, so all you want to do is make sure that what the director saw on the reference screen (which had a given peak nits) matches the output, and that's all left to the metadata+output devices. In games though, we can change TM based on the specific device/ambient... I'm not questioning the virtues of ACES for movies; the RRT was even clearly devised as something that resembles film, so that the baseline TM would look like something that movie people are accustomed to.

Tone-mapping as display calibration.

I already wasn't a fan of blindly following film-based curves and looks in SDR, I don’t see why this would be the best for the future as well.
Sure, film stocks evolved over many years to look nice, but they are constrained to what is achievable with chemicals on a film…

It is true that these film stocks did define a given visual language we are very accustomed to, but we have much more freedom in the digital world today to exploit.
We can preserve colors much better, we control how much glare we want to add, we can do localized tone-mapping and so on. Not to mention that we got so much latitude with color grading that even if a filmic look is desired, it's probably not worth delegating the responsibility of achieving it to the TM curve!

To me it seems that with HDR displays the main function of the final tone-mapping curve should be to adapt to the variability of end displays and viewing environments, while specific “looks” should be achieved via grading, e.g. with the help of compression curves (like s-log) and grading 3d LUTs.

Wouldn’t it be better, for the final display, to have a curve where it’s easy for the end user to tweak the level at which mid-grays will sit, while independently controlling how much to roll the highlights based on the capabilities of the TV? Maybe even having two different "toes" for OLED vs LCDs...

I think it would be easier and safer to just modify our current tone-mapping curves to give them a control over highlight clipping, while preserving the same look we have in SDR for most of the range.
That might avoid headaches with how much we have to adjust our grading between SDR and HDR targets, while still giving more flexibility when it comes to display/ambient calibration.
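To give a feel for what such a "calibration" curve could look like, here is a deliberately simple sketch. It is not a proposal for a production curve and it is not taken from ACES or any other standard: the function name, the 0.18 scene mid-gray convention, the `knee` parameter and the particular rolloff formula are all assumptions made for illustration.

```python
SCENE_MID_GRAY = 0.18  # conventional scene-referred mid-gray (assumption)

def tonemap_nits(scene_value, mid_gray_nits, peak_nits, knee=0.5):
    """Map a scene-referred value to display luminance in nits.

    mid_gray_nits: where the user wants mid-gray to sit on their display.
    peak_nits:     what the display can actually reach.
    knee:          fraction of peak_nits (below 1) where the rolloff starts.
    """
    # Linear part: a plain exposure scale that pins mid-gray where requested.
    linear = max(scene_value, 0.0) * (mid_gray_nits / SCENE_MID_GRAY)
    start = knee * peak_nits
    if linear <= start:
        return linear
    # Shoulder: continues with slope 1 at the knee, then approaches
    # peak_nits asymptotically without ever exceeding it.
    headroom = peak_nits - start
    t = (linear - start) / headroom
    return start + headroom * t / (1.0 + t)
```

With these hypothetical controls, `mid_gray_nits` could be a user-facing brightness slider and `peak_nits` could come from the display, while everything below the knee is left untouched so the graded look is preserved for most of the range. A separate toe control, as suggested above for OLED vs LCD, is left out of this sketch.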

HDR brings some new interesting problems, but so far I don't see ACES solving any of them. To me, the first problem, pragmatically, today, is calibration.
A more interesting but less immediate one is how much HDR changes perception, how to use it not just as a special effect for brighter highlights, but to really be able to create displays that look more "transparent" (as in: looking through a window).

Is there a solid reason behind ACES, or are we adopting it just because it’s popular, it was made by very knowledgeable people, and we follow?

Because following blindly might not be the worst thing, but in the past it has so often led to huge mistakes that we all committed because we didn’t question what we were doing… Speaking of colors and displays, a few years ago, we were all rendering using non-linear (gamma transformed) values, weren’t we?

18 February, 2017

OT: Ten pragmatic tips regarding pens & notebooks

A follow-up to my previous guide to fountain pens. Game Developers Conference is next week and you might want to take notes. Some practical advice on what I think is the best equipment to take notes on-the-go...

1) Get a spiral-bound, A5 notebook. 

It's the only kind that not only stays open flat easily, but also folds completely in half and has a thick cardboard back, making it easy to hold one-handed when you don't have a table.

Muji sells relatively inexpensive ones that are of a good quality. Midori is another brand I really like for spiral-bound ones.

Midori "polar bear" spiral notebook.
Muji fountain pen. Lamy safari. Kaweco sport. TWSBI Mini.
Rhodia is a favorite of many, but their spiral-bound notebooks (e.g. the very popular DotPad) have side perforations that allow pages to be removed. Unfortunately, these are very weak and the pages will detach. Not good to carry around, only for temporary notes/scratch.

Stitched (threaded binding) and taped notebooks are the second best: they easily lie flat because only a few pages are stitched together, then these groups are bound together with tape. 
Notebooks held with only staples in the middle are the least flexible. 

2) You might want to prefer more absorbent paper for quick notes on the go.

Usually, fountain pens are used with smooth, non-absorbent paper that helps to avoid bleed-through, feathering, and allows the ink to dry over time, bringing out the eventual shading or sheen.

Unfortunately, this might not be the best for quick note taking (even if I don't mind it at all, the dry times are still quite fast with fine nibs). There are absorbent papers out there that work great with fountain pens; the Midori Cotton notebook is an example.

I also usually buy only notebooks with blank pages, not lined or gridded. That's because I tend to draw lots of diagrams and write smaller than the lines.

A Midori Cotton notebook, threaded binding.
Lamy studio and a Faber Castell Loom.
J.Herbin Perle Noire.
3) The best fountain pen for daily notes is a Vanishing Point (fine nib), bar none.

I have a fair collection of fountain pens, but nothing touches the Namiki/Pilot Vanishing Points (a.k.a. Capless). They are incredible writers, especially in the smaller point sizes (from medium to extra-fine, which are the ones you'll want for notes).

They are fast and clean, due to the retractable nib, and they don't spill in airplanes either (in my experience; you might still want to travel with pens mostly full/without air bubbles and keep them upright during the trip).

Pilot Capless Decimo
The Capless Decimo and its bigger brother, the Vanishing Point.

4) The best cheap fountain pen is the Muji aluminum pen.

This might actually be hard to find: I got one from a store a year ago but never found another one in my subsequent visits.

I have the longer model, not the foldable one (which is somewhat worse). It's very cheap, it writes very well, it's not bulky but it's very solid, it works well in planes and it can easily hold a spare cartridge in the body. 
I also like that it uses a screw-on cap, which is a bit slower to open but will ensure that ink doesn't get suddenly suctioned out of the nib (as some tight push-on caps do, by creating a vacuum).
The only downside is that it's fairly skinny, which might not be too comfortable for some.

Alternatively, a starter pen that is as good (or maybe even better, my only Muji might have been an outlier...) is the Lamy Safari, a solid, no-nonsense German performer. It's a little bit more expensive than the Muji one, but it won't disappoint.
I hear great things about the Pilot Metropolitan as well, but I personally don't own one. Namiki/Pilot is probably, though, my favorite brand.

5) If you write a lot, avoid very thin and short pens.

Once upon a time, I used to love very compact pens, and still today I won't ever buy too bulky ones. But I did notice that very thin pens stress my hand more. Prefer pens with a decent diameter.

Some compact pens: a Visconti Viscontina,
A brass Kaweco Lilliput and a Spalding and Sons Mini.

6) You might want to prefer a waterproof ink.

Inks are most of the fun in playing with fountain pens! Inks that shade, inks with sheen, inks with flakes, pigmented inks, iron-gall inks... all kinds of special effects. 

It's all fun and games until a drop of water hits the page and your precious notes are completely erased... Extremely saturated inks might even smear just with the humidity from the hand!

So, especially if you're on the go, the best ink is a waterproof or at least water-resistant one, and one that flows well while drying fast. 

Often, the more "boring" inks are also the best behaved, like the Montblanc Midnight Blue or any of their permanent inks (black, blue, gray). 

My personal favorite, if I had to pick one, would be the Platinum Carbon Black, it's the best black ink I own, it flows perfectly, it's permanent and looks great.
Unfortunately, it's a bit harder to clean, being a pigmented ink, so I use it only in cheaper pens that I have no problems dismantling (it's a perfect match for the Muji pen).

I tend to prefer cartridge converters in my pens and I usually fill them from bottled ink with a syringe, it's less messy.

8) You won't look back at most of your notes.

Taking notes for me is just part of my thinking and learning process. I like it, and as I don't have a great memory, they work as some sort of insurance.

I have a notebook with me at all times, and every time I finish one, I quickly take photos of all pages with my phone and store them in my Dropbox account. Easy.
I still find it much easier to work on paper than with anything else when it comes to notes and diagrams, so much so that I will often just draw things on paper, take a photo with an iPhone and send it to my computer during discussions with co-workers, to illustrate a point.

That said, unless the notes are actual, actionable things I need to follow-up on (e.g. meeting notes, to-do lists, ideas to try, sketches for blog posts), I mostly don't look back at them, and I gather that this is common for most people. So, be aware of that!

9) Stick a few small post-its at the end of your notebook.

I always have a small number of different shaped post-its at the end of my notebook, so if I need to put a temporary note on a page, I can. I also use small removable stickers as bookmarks, often.

Another thing that I sometimes do, is to use both ends of a notebook. One side, e.g. from the front, I use for "permanent" notes, things that I am fairly certain about. Meetings, summaries of things I learn, blog posts and so on. 
The other side, e.g. going from the back, can be used as a scratchpad for random quick drawing, computations and so on, if you don't have a separate notebook for these...

10) Try not to go too crazy.

Fountain pens can instigate a compulsion to start a collection, easily. That might be good or not, depending on your point of view. But never fool yourself into thinking that there is a rational need for expensive pens.

Truth is, fountain pens are already a useless luxury, and lots of expensive ones are not really great writers, certainly not better than some good cheap ones. A pen is a pen.